url去重 --布隆过滤器 bloom filter原理及python实现

##常见URL过滤方法
###1 直接查询比较
即假设要存储url A，在入库前首先查询url库中是否存在 A,如果存在，则url A 不入库，否则存入url库。这种方法准确性高，但是一旦数据量变大，占用的存储空间也变大，同时，由于要查库，数据一多，查询时间变长，存储效率下降。

###2 基于hash的存储
对于给定的url，通过建立的hash函数，来获得对应的hash值，并将该值存入库中。当在检查url是否存在库中时，只要将要检查的url，通过hash函数获取其hash值，然后查看库中是否存在该hash值，存在则丢弃，否则入库。这种方法在数据量变大时，占用的存储空间也会增大，查询时间也会加长。但是它可以将url进行压缩。对于很长的url，hash值可以相对很短。

以上方法中，为加快查询速度，一般可以选择 Redis作为查询库。

##布隆过滤器 Bloom Filter

基本思路（网上一大堆）：
1，设数据集合 $A={a_1,a_2,....,a_n}$

2， Bloom Filter用一个长度为 $m$

3， $k$

4, 对于要加入的元素，首经过k个散列函数产生k个随机数 $h_1,h_2,......h_k$

5，对于新要加入的元素的检查，首先将该元素经过上步中类似操作，获得k个随机数 $h_1,h_2,......h_k$

算法特点：
对于已经在集合中的元素，通过5中的查找方法，一定可以判定该元素在集合中。
对于不在集合中的元素，可能会被误判在集合中。

其整个流程可以参照以下伪代码：

classBloom:
     """ Bloom Filter """
    def __init__(self,m,k,hash_fun):
    """m, size of the vector
    k, number of hash fnctions to compute
    hash_fun, hash function to use
    """ 
    self.m = m
    self.vector =[0]*m # initialize the vectorself.k = k
    self.hash_fun = hash_fun
    self.data ={}
    # data structure to store the data
    self.false_positive =0

    def insert(self,key,value):
    """ insert the pair (key,value) in the database """
        self.data[key]= value
        for i in range(self.k):
	        self.vector[self.hash_fun(key+str(i))%self.m]=1

    def contains(self,key):
    """ check if key is cointained in the database
        using the filter mechanism """
        for i in range(self.k):
    	    if self.vector[self.hash_fun(key+str(i))%self.m]==0:
                return False
                # the key doesn't existreturnTrue# the key can be in the data set
    def get(self,key):
    """ return the value associated with key """
    if self.contains(key):
        try:returnself.data[key]# actual lookup
        except KeyError: self.false_positive +=1

参数计算：
在Bloom Filter表示方法中，对于一元素，某一位被置1的概率为 $frac{1}{m}$

由于

$lim_{xto infty} (1-frac{1}{x})^{-x} approx e$

因此误判概率为：
$P =(1-P_0)^k = (1-e^frac{-kn}{m})^k$

由以上式可知，m增大，P减小；对于给定的n和m，求最小的P值。
将上式取对数有：
$ln P = kln (1-e^{frac{-kn}{m}})$

版权声明：本文来源CSDN，感谢博主原创文章，遵循 CC 4.0 by-sa 版权协议，转载请附上原文出处链接和本声明。
原文链接：https://blog.csdn.net/a1368783069/article/details/52137417
站方申明：本站部分内容来自社区用户分享，若涉及侵权，请联系站方删除。

发表于 2020-03-01 18:29:04
阅读 ( 1057 )
分类：

url去重 --布隆过滤器 bloom filter原理及python实现

你可能感兴趣的文章

精选的优质文章

0 条评论

官方社群

GO教程

推荐文章

猜你喜欢

随便看看