Reservoir sampling
Reservoir sampling is a family of randomized algorithms for randomly choosing a sample of k items from a list S containing n items, where n is either a very large or unknown number. Typically n is large enough that the list doesn’t fit into main memory.
——Wiki
存储采样是指从一个包含n个对象的列表S中随机抽取出k个对象作为样本,n要么很大要么未知。典型的n通常大到无法将整个列表存入主内存。
这里,我们主要考虑如何通过这个思路产生我们想要的随机样本。通常,在考虑选取数据的时候,会按照等概选取。我们接下来就针对等概选取来说明。
假设有n个对象,我们要从中等概选取k个对象,步骤如下:
- 先将第一个对象放入内存,即选中该对象
- 对每一个后面的对象i
- 有1/i的概率会用新值覆盖旧的值
- 有1-1/i的概率会丢弃新的值
- 根据上面的情况,如果总共有k个对象,那么
- 对象1会以概率被选中
- 对象2会以概率被选中
- 对象3会以概率被选中
综上,可以看到,对于序列中的n个对象,均会以等概方式被选到。
现在通过两个leetcode的例子来看下。
1.Random Pick Index
Given an array of integers with possible duplicates, randomly output the index of a given target number. You can assume that the given target number must exist in the array.
Note: The array size can be very large. Solution that uses too much extra space will not pass the judge.
Example:
int[] nums = new int[] {1,2,3,3,3};
Solution solution = new Solution(nums);
// pick(3) should return either index 2, 3, or 4 randomly. Each index should have equal probability of returning.
solution.pick(3);
// pick(1) should return 0. Since in the array only nums[0] is equal to 1.
solution.pick(1);
我的C++代码如下:
class Solution {
public:
vector<int> nums;
Solution(vector<int> nums) {
this->nums = nums;
srand(NULL);
}
int pick(int target) {
int cnt = 0;
int ind = -1;
for(int i=0;i<nums.size();i++){
if(nums[i]==target){
cnt++;
if(ind==-1)
ind = i;
else if(rand()%cnt==0)
ind = i;
}
}
return ind;
}
};
/**
\* Your Solution object will be instantiated and called as such:
\* Solution obj = new Solution(nums);
\* int param_1 = obj.pick(target);
\*/
上面这题其实就是考虑如何从n中随机取出某个值为target的下标,可以按照Reservoir Sampling的思路计算。
注:此时我们假设C++中的rand()%n函数是完全随机的,实则不是。因为rand()是一个有上限的数,它会令小一些的数更高概率出现。
2.Linked List Random Node
Given a singly linked list, return a random node’s value from the linked list. Each node must have the same probability of being chosen.
Follow up: What if the linked list is extremely large and its length is unknown to you? Could you solve this efficiently without using extra space?
Example:
// Init a singly linked list [1,2,3].
ListNode head = new ListNode(1);
head.next = new ListNode(2);
head.next.next = new ListNode(3);
Solution solution = new Solution(head);
// getRandom() should return either 1, 2, or 3 randomly. Each element should have equal probability of returning.
solution.getRandom();
我的C++代码:
/**
* Definition for singly-linked list.
* struct ListNode {
* int val;
* ListNode *next;
* ListNode(int x) : val(x), next(NULL) {}
* };
*/
class Solution {
public:
/** @param head The linked list's head. Note that the head is guanranteed to be not null, so it contains at least one node. */
Solution(ListNode* head) {
h = head;
}
/** Returns a random node's value. */
int getRandom() {
int rd = 1;
ListNode* tmp = h;
ListNode* res = NULL;
while(tmp){
if(rand()%rd==0)
res = tmp;
tmp = tmp->next;
rd++;
}
return res->val;
}
private:
ListNode* h;
};
/**
\* Your Solution object will be instantiated and called as such:
\* Solution obj = new Solution(head);
\* int param_1 = obj.getRandom();
\*/
思路类似,不做过多赘述。