引言

本次给大家带来一个leveldb中布隆过滤器的实现，比较简单，代码也比较短。

缓存穿透

了解布隆过滤器，我们可以先从了解缓存穿透说起。

我们知道KV数据库访问数据都是先查看缓存，如果缓存中没有我们想要的数据，我们才会到硬盘上读取。读取之后该数据就被放到缓存中，这样下次用户访问时不需要重新访问磁盘，这不仅仅是KV数据库，也是任何缓存最直接的好处。

那么试想有这么一个场景：用户不断访问一个缓存和磁盘中都没有的数据，第一次访问，KV数据库发现缓存中没有，于是我们到磁盘读取，结果是磁盘也没有，而后续的访问也都会重复上述过程，也就是说，这种情况让缓存变得毫无意义，此时我们还不如直接到磁盘读取数据，这种情况就叫做缓存穿透（有别于缓存击穿和缓存雪崩，不过击穿和雪崩都是访问磁盘中存在的数据的情况，感兴趣的读者可以去了解一下）。

如果让我们来解决这个问题，我们首先肯定会想到在缓存中直接放一个nullptr，加上一个标志位来判断究竟是还没缓存进来，还是磁盘上没有数据不就得了，但是其实这么做开销是很大的。

让我们接着往下看。

布隆过滤器

为了解决这种情况，布隆提出了布隆过滤器这种数据结构。实际上，布隆过滤器就是一连串的位加上几个哈希函数实现。

插入时，我们用多个哈希函数计算同一个Key的哈希值，并设置位数组上对应的位为1。
查找时，我们查看对应位，如果全部为1（全部划重点），说明可能存在，反之一定不存在。

因此，布隆过滤器也有一定的误判率，如果Key对应的位被别的Key占用了，查询时会得到错误的结果，当然由于哈希函数比较多，所以Key的位也比较多，想把Key的每个位全部占用概率是比较小的。

比较绕口的解释：布隆过滤器中存在的数据不一定存在，不存在的数据一定不存在。
对应的，可以再绕点：磁盘中不存在的数据布隆过滤器中可能存在，磁盘中存在的数据布隆过滤器中一定存在。

看懂这两句话，说明你已经了解布隆过滤器的思想了，维基百科上偷个图，然后我们就可以看一下leveldb是如何实现的了。
在这里插入图片描述

FilePolicy

// 过滤策略
class LEVELDB_EXPORT FilterPolicy {
   
 public:
  virtual ~FilterPolicy();

  // Return the name of this policy.  Note that if the filter encoding
  // changes in an incompatible way, the name returned by this method
  // must be changed.  Otherwise, old incompatible filters may be
  // passed to methods of this type.
  virtual const char* Name() const = 0;

  // keys[0,n-1] contains a list of keys (potentially with duplicates)
  // that are ordered according to the user supplied comparator.
  // Append a filter that summarizes keys[0,n-1] to *dst.
  //
  // Warning: do not change the initial contents of *dst.  Instead,
  // append the newly constructed filter to *dst.
  virtual void CreateFilter(const Slice* keys, int n,
                            std::string* dst) const = 0;

  // "filter" contains the data appended by a preceding call to
  // CreateFilter() on this class.  This method must return true if
  // the key was in the list of keys passed to CreateFilter().
  // This method may return true or false if the key was not on the
  // list, but it should aim to return false with a high probability.
  virtual bool KeyMayMatch(const Slice& key, const Slice& filter) const = 0;
};

BloomFilterPolicy实际上继承了FilePolicy，CreateFilter作用为生成前面提到的位数组到dst，KeyMayMatch则是用于判断过滤器中是否存在数据。

BloomFilterPolicy

成员变量

 private:
  size_t bits_per_key_;
  // 模拟哈希函数个数
  size_t k_;

上文提到，一个Key对应的多个位是由多个哈希函数计算得到的，leveldb中不是这样的，它只有一个哈希函数，是对另一个util Hash的包装。

static uint32_t BloomHash(const Slice& key) {
   
  return Hash(key.data(), key.size(), 0xbc9f1d34);
}

出于效率考虑，每一个计算出来的Hash其实都由这第一个Hash再经过简单的运算（其实就是加法，加上一个delta，后文会看到）获得的。

所以，k_其实代表了哈希的次数，也可以理解为哈希函数的个数。

bits_per_key_是每个Key有多少字节，用来确定位数组应该有多大。

BloomFilterPolicy

  explicit BloomFilterPolicy(int bits_per_key) : bits_per_key_(bits_per_key) {
   
    // We intentionally round down to reduce probing cost a little bit
    // 故意四舍五入降低开销
    k_ = static_cast<size_t>(bits_per_key * 0.69);  // 0.69 =~ ln(2)
    if (k_ < 1) k_ = 1;
    if (k_ > 30) k_ = 30;
  }

这个0.69相当于对ln(2)做了一个提前计算，至于ln(2)，它是一个经验值，表示应该有bits_per_key * ln(2)个k_。

如果k_太小或者太大都不好，所以leveldb对k_加以界定。

CreateFilter

  void CreateFilter(const Slice* keys, int n, std::string* dst) const override {
   
    // Compute bloom filter size (in both bits and bytes)
    size_t bits = n * bits_per_key_;

    // For small n, we can see a very high false positive rate.  Fix it
    // by enforcing a minimum bloom filter length.
    if (bits < 64) bits = 64;

    size_t bytes = (bits + 7) / 8;
    bits = bytes * 8;

    const size_t init_size = dst->size();
    dst->resize(init_size + bytes, 0);
    // k_被放在dst最后一字节
    dst->push_back(static_cast<char>(k_));  // Remember # of probes in filter
    char* array = &(*dst)[init_size];
    for (int i = 0; i < n; i++) {
   
      // Use double-hashing to generate a sequence of hash values.
      // See analysis in [Kirsch,Mitzenmacher 2006].
      uint32_t h = BloomHash(keys[i]);
      const uint32_t delta = (h >> 17) | (h << 15);  // Rotate right 17 bits
      for (size_t j = 0; j < k_; j++) {
   
        const uint32_t bitpos = h % bits;
        array[bitpos / 8] |= (1 << (bitpos % 8));
        h += delta;
      }
    }
  }

可以看到，dst实际不止是存放那个位数组，它的最后一个字节还放着k_，供KeyMayMatch使用。

dst->push_back(static_cast<char>(k_));  // Remember # of probes in filter

形参keys相当于一个初始的Key数组，我们遍历这个数组，为每个Key生成对应的位，所以这里有两重for循环，另一重的k_次哈希的。

上文提到，leveldb的布隆过滤器没有k_个哈希函数，不过是对第一个真正计算出来的哈希值做一个加法，在这里我们就能看到：

      const uint32_t delta = (h >> 17) | (h << 15);  // Rotate right 17 bits
      for (size_t j = 0; j < k_; j++) {
   
        const uint32_t bitpos = h % bits;
        array[bitpos / 8] |= (1 << (bitpos % 8));
        h += delta;
      }

h就是那个真正的哈希值，而delta不过就是Rotate得到的一个值。

KeyMayMatch

  // 布隆过滤器返回true表示可能存在，返回false一定不存在
  bool KeyMayMatch(const Slice& key, const Slice& bloom_filter) const override {
   
    const size_t len = bloom_filter.size();
    if (len < 2) return false;

    const char* array = bloom_filter.data();
    const size_t bits = (len - 1) * 8;

    // Use the encoded k so that we can read filters generated by
    // bloom filters created using different parameters.
    const size_t k = array[len - 1];
    if (k > 30) {
   
      // Reserved for potentially new encodings for short bloom filters.
      // Consider it a match.
      return true;
    }

    uint32_t h = BloomHash(key);
    // 做一个简单的移位，或者说Rotate，得到一个随机的增量
    const uint32_t delta = (h >> 17) | (h << 15);  // Rotate right 17 bits
    for (size_t j = 0; j < k; j++) {
   
      const uint32_t bitpos = h % bits;
      if ((array[bitpos / 8] & (1 << (bitpos % 8))) == 0) return false;
      // 其实没有多个哈希函数
      h += delta;
    }
    return true;
  }

KeyMayMatch就是用来判断数据是否存在于布隆过滤器的函数。

我们来看一下获得了bitpos之后如何判断位是否存在：

(array[bitpos / 8] & (1 << (bitpos % 8))

bitpos / 8只取整数部分，取出对应的字节，再和bitpos % 8的余数部分做一个与运算。

写到这里感觉有点太简单了，不值得一写。。。

Leveldb：BloomFilter浅析

目录

引言

缓存穿透

布隆过滤器

FilePolicy

BloomFilterPolicy

成员变量

BloomFilterPolicy

CreateFilter

KeyMayMatch

相关链接