bloom filter这种数据结构用于判断一个元素是否在集合内,当然,这种功能也可以由HashMap来实现。bloom filter与HashMap的区别在于,HashMap会储存代表这个元素的key自身(如key为"IKnow7",那么HashMap将存储"IKnow7"这12个字节(java),其实还需要包括引用大小,但java中相同string只存一份),而bloom
filter在底层只会使用几个bit来代表这个元素。在速度上,bloom filter对比与HashMap相差不大,底层同样是hash+随机访问。由于bloom filter对空间节省的特性,bloom filter适合判断一个元素是否在海量数据集合中。
bloom filter的一些概念
bloom filter并非十全十美。bloom filter在添加元素时,会将对象hash到底层位图数组的k个位上,对这些位,bloom filter会将其值设为1。由于hash函数特性以及位图数组长度有限,不同的对象可能在某些位上有重叠。bloom filter在检查元素是否存在时,会检查该对象所对应的k个位是否为1,如果全部都为1表示存在,这里就出现问题了,这些位上的1未必是该元素之前设置的,有可能是别的元素所设置的,所以会造成一些误判,即原本不在bloom
filter中的一些元素也被判别在bloom filter中。bloom filter的这种误判被称为"积极的误判",即存在的元素的一定会通过,不存在的元素也有可能通过,而不会造成对存在的元素结果为否的判定。
可以简单猜测,误判的概率与hash的选择、位图数组的大小、当前元素的数量以及K(映射位的个数)有关。一般来说,hash值越平均、位图数组越大、元素数量越少那么误判的概率就越低。
bloom filter在web上的应用
在web应用中我们经常需要使用白名单来过滤一些请求,用以避免一些无效的数据库访问或者恶意攻击。对于允许一些误判率且存在海量数据的白名单来说,使用bloom filter是不二的选择。
使用bloom filter实现一个支持增量请求的白名单
白名单通常是需要更新的,更新的方式一般有全量和增量更新。全量不必说,重新定义个bloom filter将当前所有数据放入其中即可。增量更新的话,一般会提供一段时间内新增和删除的数据,所以需要在白名单中将数据进行合并,该添加的添加,该删除的删除。
可是...... 原生的bloom filter并不支持元素的删除操作,因为某一位可能为多个元素所用。一种不切实际的想法是为bloom filter的每一位设置一个引用计数,每删除一个元素减1。
一种可行的做法是,另外使用一个map来保存已删除的元素,在判断元素是否存在时先判断在该deletemap中是否存在,如果存在,直接false。如果不存在,再通过bloom filter进行判断。在新添加元素时,如果deletemap中存在,删除该deletemap中的该元素,再添加到bloom filter中。在实际应用中,使用白名单的场景需要删除的元素一般是较少的,所以这种方式从效率是可行的。这种方式存在一个问题,当deletemap中元素过多时,势必会造成bloom
filter的误判率上升,因为某些原本被删除元素设置为1的位并没有被归0。该问题的解决措施是,当deletemap的容量到达的一个界线时,使用全量同步更新该bloom filter。
白名单bloom filter的实现
这类构件复用性很强,可以轻松的集成到现有的代码之上。下面直接贴出来:
public class BloomFilter<E> implements Serializable {
private static final long serialVersionUID = 3507830443935243576L;
private long timestamp;//用于时间戳更新机制
private HashMap<E, Boolean> deleteMap ; //储存已删除元素
private BitSet bitset;//位图存储
private int bitSetSize;
// expected (maximum) number of elements to be added
private int expectedNumberOfFilterElements;
// number of elements actually added to the Bloom filter
private int numberOfAddedElements;
private int k; //每一个元素对应k个位
// encoding used for storing hash values as strings
static Charset charset = Charset.forName("UTF-8");
// MD5 gives good enough accuracy in most circumstances.
// Change to SHA1 if it's needed
static String hashName = "MD5";
static final MessageDigest digestFunction;
static { // The digest method is reused between instances to provide higher entropy.
MessageDigest tmp;
try {
tmp = java.security.MessageDigest.getInstance(hashName);
} catch (NoSuchAlgorithmException e) {
tmp = null;
}
digestFunction = tmp;
}
/**
* Constructs an empty Bloom filter.
*
* @param bitSetSize defines how many bits should be used for the filter.
* @param expectedNumberOfFilterElements defines the maximum
* number of elements the filter is expected to contain.
*/
public BloomFilter(int bitSetSize, int expectedNumberOfFilterElements) {
this.expectedNumberOfFilterElements = expectedNumberOfFilterElements;
this.k = (int) Math.round(
(bitSetSize / expectedNumberOfFilterElements) * Math.log(2.0));
bitset = new BitSet(bitSetSize);
deleteMap = new HashMap<E, Boolean>();
this.bitSetSize = bitSetSize;
numberOfAddedElements = 0;
}
/**
* Generates a digest based on the contents of a String.
*
* @param val specifies the input data.
* @param charset specifies the encoding of the input data.
* @return digest as long.
*/
public static long createHash(String val, Charset charset) {
try {
return createHash(val.getBytes(charset.name()));
}
catch (UnsupportedEncodingException e) {
e.printStackTrace();
// Ingore
}
return -1;
}
/**
* Generates a digest based on the contents of a String.
*
* @param val specifies the input data. The encoding is expected to be UTF-8.
* @return digest as long.
*/
public static long createHash(String val) {
return createHash(val, charset);
}
/**
* Generates a digest based on the contents of an array of bytes.
*
* @param data specifies input data.
* @return digest as long.
*/
public static long createHash(byte[] data) {
long h = 0;
byte[] res;
synchronized (digestFunction) {
res = digestFunction.digest(data);
}
for (int i = 0; i < 4; i++) {
h <<= 8;
h |= ((int) res[i]) & 0xFF;
}
return h;
}
/**
* Compares the contents of two instances to see if they are equal.
*
* @param obj is the object to compare to.
* @return True if the contents of the objects are equal.
*/
@SuppressWarnings("unchecked")
@Override
public boolean equals(Object obj) {
if (obj == null) {
return false;
}
if (getClass() != obj.getClass()) {
return false;
}
final BloomFilter<E> other = (BloomFilter<E>) obj;
if (this.expectedNumberOfFilterElements !=
other.expectedNumberOfFilterElements) {
return false;
}
if (this.k != other.k) {
return false;
}
if (this.bitSetSize != other.bitSetSize) {
return false;
}
if (this.bitset != other.bitset &&
(this.bitset == null || !this.bitset.equals(other.bitset))) {
return false;
}
return true;
}
/**
* Calculates a hash code for this class.
* @return hash code representing the contents of an instance of this class.
*/
@Override
public int hashCode() {
int hash = 7;
hash = 61 * hash + (this.bitset != null ? this.bitset.hashCode() : 0);
hash = 61 * hash + this.expectedNumberOfFilterElements;
hash = 61 * hash + this.bitSetSize;
hash = 61 * hash + this.k;
return hash;
}
/**
* Calculates the expected probability of false positives based on
* the number of expected filter elements and the size of the Bloom filter.
* <br /><br />
* The value returned by this method is the <i>expected</i> rate of false
* positives, assuming the number of inserted elements equals the number of
* expected elements. If the number of elements in the Bloom filter is less
* than the expected value, the true probability of false positives will be lower.
*
* @return expected probability of false positives.
*/
public double expectedFalsePositiveProbability() {
return getFalsePositiveProbability(expectedNumberOfFilterElements);
}
/**
* Calculate the probability of a false positive given the specified
* number of inserted elements.
*
* @param numberOfElements number of inserted elements.
* @return probability of a false positive.
*/
public double getFalsePositiveProbability(double numberOfElements) {
// (1 - e^(-k * n / m)) ^ k
return Math.pow((1 - Math.exp(-k * (double) numberOfElements
/ (double) bitSetSize)), k);
}
/**
* Get the current probability of a false positive. The probability is calculated from
* the size of the Bloom filter and the current number of elements added to it.
*
* @return probability of false positives.
*/
public double getFalsePositiveProbability() {
return getFalsePositiveProbability(numberOfAddedElements);
}
/**
* Returns the value chosen for K.<br />
* <br />
* K is the optimal number of hash functions based on the size
* of the Bloom filter and the expected number of inserted elements.
*
* @return optimal k.
*/
public int getK() {
return k;
}
/**
* Sets all bits to false in the Bloom filter.
*/
public void clear() {
bitset.clear();
numberOfAddedElements = 0;
}
/**
* Adds an object to the Bloom filter. The output from the object's
* toString() method is used as input to the hash functions.
*
* @param element is an element to register in the Bloom filter.
*/
public void add(E element) {
deleteMap.remove(element);
long hash;
String valString = element.toString();
for (int x = 0; x < k; x++) {
hash = createHash(valString + Integer.toString(x));
hash = hash % (long)bitSetSize;
bitset.set(Math.abs((int)hash), true);
}
numberOfAddedElements ++;
}
/**
* Remove all elements from a Collection to the Bloom filter.
* @param c Collection of elements.
*/
public void removeAll(Collection<? extends E> c) {
for (E element : c)
remove(element);
}
public void remove(E element) {
deleteMap.put(element, Boolean.TRUE);
}
public int getDeleteMapSize(){
return deleteMap.size();
}
/**
* Adds all elements from a Collection to the Bloom filter.
* @param c Collection of elements.
*/
public void addAll(Collection<? extends E> c) {
for (E element : c) {
if (element != null)
add(element);
}
}
/**
* Returns true if the element could have been inserted into the Bloom filter.
* Use getFalsePositiveProbability() to calculate the probability of this
* being correct.
*
* @param element element to check.
* @return true if the element could have been inserted into the Bloom filter.
*/
public boolean contains(E element) {
Boolean contains = deleteMap.get(element);
if (contains != null && contains)
return false;
long hash;
String valString = element.toString();
for (int x = 0; x < k; x++) {
hash = createHash(valString + Integer.toString(x));
hash = hash % (long) bitSetSize;
if (!bitset.get(Math.abs((int) hash)))
return false;
}
return true;
}
/**
* Returns true if all the elements of a Collection could have been inserted
* into the Bloom filter. Use getFalsePositiveProbability() to calculate the
* probability of this being correct.
* @param c elements to check.
* @return true if all the elements in c could have been inserted into the Bloom filter.
*/
public boolean containsAll(Collection<? extends E> c) {
for (E element : c)
if (!contains(element))
return false;
return true;
}
/**
* Read a single bit from the Bloom filter.
* @param bit the bit to read.
* @return true if the bit is set, false if it is not.
*/
public boolean getBit(int bit) {
return bitset.get(bit);
}
/**
* Set a single bit in the Bloom filter.
* @param bit is the bit to set.
* @param value If true, the bit is set. If false, the bit is cleared.
*/
public void setBit(int bit, boolean value) {
bitset.set(bit, value);
}
/**
* Return the bit set used to store the Bloom filter.
* @return bit set representing the Bloom filter.
*/
public BitSet getBitSet() {
return bitset;
}
/**
* Returns the number of bits in the Bloom filter. Use count() to retrieve
* the number of inserted elements.
*
* @return the size of the bitset used by the Bloom filter.
*/
public int size() {
return this.bitSetSize;
}
/**
* Returns the number of elements added to the Bloom filter after it
* was constructed or after clear() was called.
*
* @return number of elements added to the Bloom filter.
*/
public int count() {
return this.numberOfAddedElements;
}
/**
* Returns the expected number of elements to be inserted into the filter.
* This value is the same value as the one passed to the constructor.
*
* @return expected number of elements.
*/
public int getExpectedNumberOfElements() {
return expectedNumberOfFilterElements;
}
/**
* 返回更新的时间戳机制
* @return
*/
public long getTimestamp() {
return timestamp;
}
/**
* 设置跟新的时间戳
* @param timestamp
*/
public void setTimestamp(long timestamp) {
this.timestamp = timestamp;
}
@Override
public String toString() {
return "BloomFilter [timestamp=" + timestamp + ", bitSetSize=" + bitSetSize
+ ", expectedNumberOfFilterElements="
+ expectedNumberOfFilterElements + ", numberOfAddedElements="
+ numberOfAddedElements + ", k="
+ k +",deleteMapSize=" +getDeleteMapSize()+"]";
}
}
作者:Troy__ 发表于2014-11-26 18:35:42
原文链接