GroupBlog

最近在写自己的学习项目－一个简单的存储引擎(c++)，其中在对于key的压缩算法犹豫了很久,在我查找了很多资料之后找到的适合短字符串压缩的算法,决心使用smaz库的压缩算法，然而对于这个库的介绍资料少之又少,想要整理一部分资料出来

一.smaz库的介绍

基于小对象的压缩算法。由redis作者编写
https://github.com/antirez/smaz 目前应用在redis数据库上面。与传统的压缩算法不同的是，Smaz更适合小对象的压缩，比如几个字节到几十个字节不等。除开字典的硬编码部分，压缩过程和解压过程加起来120行代码，非常的短小精罕，却有不俗的表现。平均测试能够达到40%～50%左右的压缩效果．

smaz的核心只有两部分:
字典的设计
hash函数的设计

 const char * Smaz_rcb[254] = {
        " ", "the", "e", "t", "a", "of", "o", "and", "i", "n", "s", "e ", "r", " th",
        " t", "in", "he", "th", "h", "he ", "to", "\r\n", "l", "s ", "d", " a", "an",
        "er", "c", " o", "d ", "on", " of", "re", "of ", "t ", ", ", "is", "u", "at",
        "   ", "n ", "or", "which", "f", "m", "as", "it", "that", "\n", "was", "en",
        "  ", " w", "es", " an", " i", "\r", "f ", "g", "p", "nd", " s", "nd ", "ed ",
        "w", "ed", "http://", "for", "te", "ing", "y ", "The", " c", "ti", "r ", "his",
        "st", " in", "ar", "nt", ",", " to", "y", "ng", " h", "with", "le", "al", "to ",
        "b", "ou", "be", "were", " b", "se", "o ", "ent", "ha", "ng ", "their", "\"",
        "hi", "from", " f", "in ", "de", "ion", "me", "v", ".", "ve", "all", "re ",
        "ri", "ro", "is ", "co", "f t", "are", "ea", ". ", "her", " m", "er ", " p",
        "es ", "by", "they", "di", "ra", "ic", "not", "s, ", "d t", "at ", "ce", "la",
        "h ", "ne", "as ", "tio", "on ", "n t", "io", "we", " a ", "om", ", a", "s o",
        "ur", "li", "ll", "ch", "had", "this", "e t", "g ", "e\r\n", " wh", "ere",
        " co", "e o", "a ", "us", " d", "ss", "\n\r\n", "\r\n\r", "=\"", " be", " e",
        "s a", "ma", "one", "t t", "or ", "but", "el", "so", "l ", "e s", "s,", "no",
        "ter", " wa", "iv", "ho", "e a", " r", "hat", "s t", "ns", "ch ", "wh", "tr",
        "ut", "/", "have", "ly ", "ta", " ha", " on", "tha", "-", " l", "ati", "en ",
        "pe", " re", "there", "ass", "si", " fo", "wa", "ec", "our", "who", "its", "z",
        "fo", "rs", ">", "ot", "un", "<", "im", "th ", "nc", "ate", "><", "ver", "ad",
        " we", "ly", "ee", " n", "id", " cl", "ac", "il", "</", "rt", " wi", "div",
        "e, ", " it", "whi", " ma", "ge", "x", "e c", "men", ".com"
};

从这里可以看出，redis作者应该是对于常见的语句单词进行了统计分析，得出来的常用组合

将字符串尽可能的分散，提高查询速率，在压缩的过程中使用了hash算法

         h1 = h2 = in[0] << 3;
        if (inlen > 1) h2 += in[1];
        if (inlen > 2) h3 = h2 ^ in[2];

,其中这个算法的原理并不好弄懂，总之就是计算hash，这个设计应该是与大量的结果有关的，并且能够有一定的独立性.

我在自己项目中的应用，应为我无法推断出输出的缓冲区有多大，所以我将其中的接口进行了改变，使用string 来进行．如果有相关需要可以来使用我的代码
c++的smaz库实现