哈希表的C實(shí)現(xiàn)（三）

lchjczw 2013-01-08

展開全文

關(guān)于哈希表C實(shí)現(xiàn)，寫了兩篇學(xué)習(xí)筆記，不過似乎網(wǎng)上流傳最具傳奇色彩的莫過于暴雪公司的魔獸文件打包管理器里的hashTable的實(shí)現(xiàn)了；在沖突方面的處理方面，采用線性探測再散列。在添加和查找過程中進(jìn)行了三次哈希，第一個(gè)哈希值用來查找，后兩個(gè)哈希值用來校驗(yàn)，這樣可以大大減少沖突的幾率。

在網(wǎng)上找了相關(guān)代碼，但不知道其來源是否地道：

StringHash.h

[cpp] view plain copy ?

#include <StdAfx.h>
#include <string>
using namespace std;
#pragma once
#define MAXTABLELEN 1024 // 默認(rèn)哈希索引表大小
//////////////////////////////////////////////////////////////////////////
// 哈希索引表定義
typedef struct _HASHTABLE
{
long nHashA;
long nHashB;
bool bExists;
}HASHTABLE, *PHASHTABLE ;
class StringHash
{
public:
StringHash(const long nTableLength = MAXTABLELEN);
~StringHash(void);
private:
unsigned long cryptTable[0x500];
unsigned long m_tablelength; // 哈希索引表長度
HASHTABLE *m_HashIndexTable;
private:
void InitCryptTable(); // 對哈希索引表預(yù)處理
unsigned long HashString(const string &lpszString, unsigned long dwHashType); // 求取哈希值
public:
bool Hash(string url);
unsigned long Hashed(string url); // 檢測url是否被hash過
};

StringHash.cpp

[cpp] view plain copy ?

#include "StdAfx.h"
#include "StringHash.h"
StringHash::StringHash(const long nTableLength /*= MAXTABLELEN*/)
{
InitCryptTable();
m_tablelength = nTableLength;
//初始化hash表
m_HashIndexTable = new HASHTABLE[nTableLength];
for ( int i = 0; i < nTableLength; i++ )
{
m_HashIndexTable[i].nHashA = -1;
m_HashIndexTable[i].nHashB = -1;
m_HashIndexTable[i].bExists = false;
}
}
StringHash::~StringHash(void)
{
//清理內(nèi)存
if ( NULL != m_HashIndexTable )
{
delete []m_HashIndexTable;
m_HashIndexTable = NULL;
m_tablelength = 0;
}
}
/************************************************************************/
/*函數(shù)名：InitCryptTable
/*功能：對哈希索引表預(yù)處理
/*返回值：無
/************************************************************************/
void StringHash::InitCryptTable()
{
unsigned long seed = 0x00100001, index1 = 0, index2 = 0, i;
for( index1 = 0; index1 < 0x100; index1++ )
{
for( index2 = index1, i = 0; i < 5; i++, index2 += 0x100 )
{
unsigned long temp1, temp2;
seed = (seed * 125 + 3) % 0x2AAAAB;
temp1 = (seed & 0xFFFF) << 0x10;
seed = (seed * 125 + 3) % 0x2AAAAB;
temp2 = (seed & 0xFFFF);
cryptTable[index2] = ( temp1 | temp2 );
}
}
}
/************************************************************************/
/*函數(shù)名：HashString
/*功能：求取哈希值
/*返回值：返回hash值
/************************************************************************/
unsigned long StringHash::HashString(const string& lpszString, unsigned long dwHashType)
{
unsigned char *key = (unsigned char *)(const_cast<char*>(lpszString.c_str()));
unsigned long seed1 = 0x7FED7FED, seed2 = 0xEEEEEEEE;
int ch;
while(*key != 0)
{
ch = toupper(*key++);
seed1 = cryptTable[(dwHashType << 8) + ch] ^ (seed1 + seed2);
seed2 = ch + seed1 + seed2 + (seed2 << 5) + 3;
}
return seed1;
}
/************************************************************************/
/*函數(shù)名：Hashed
/*功能：檢測一個(gè)字符串是否被hash過
/*返回值：如果存在，返回位置；否則，返回-1
/************************************************************************/
unsigned long StringHash::Hashed(string lpszString)
{
const unsigned long HASH_OFFSET = 0, HASH_A = 1, HASH_B = 2;
//不同的字符串三次hash還會碰撞的幾率無限接近于不可能
unsigned long nHash = HashString(lpszString, HASH_OFFSET);
unsigned long nHashA = HashString(lpszString, HASH_A);
unsigned long nHashB = HashString(lpszString, HASH_B);
unsigned long nHashStart = nHash % m_tablelength,
nHashPos = nHashStart;
while ( m_HashIndexTable[nHashPos].bExists)
{
if (m_HashIndexTable[nHashPos].nHashA == nHashA && m_HashIndexTable[nHashPos].nHashB == nHashB)
return nHashPos;
else
nHashPos = (nHashPos + 1) % m_tablelength;
if (nHashPos == nHashStart)
break;
}
return -1; //沒有找到
}
/************************************************************************/
/*函數(shù)名：Hash
/*功能：hash一個(gè)字符串
/*返回值：成功，返回true；失敗，返回false
/************************************************************************/
bool StringHash::Hash(string lpszString)
{
const unsigned long HASH_OFFSET = 0, HASH_A = 1, HASH_B = 2;
unsigned long nHash = HashString(lpszString, HASH_OFFSET);
unsigned long nHashA = HashString(lpszString, HASH_A);
unsigned long nHashB = HashString(lpszString, HASH_B);
unsigned long nHashStart = nHash % m_tablelength,
nHashPos = nHashStart;
while ( m_HashIndexTable[nHashPos].bExists)
{
nHashPos = (nHashPos + 1) % m_tablelength;
if (nHashPos == nHashStart) //一個(gè)輪回
{
//hash表中沒有空余的位置了,無法完成hash
return false;
}
}
m_HashIndexTable[nHashPos].bExists = true;
m_HashIndexTable[nHashPos].nHashA = nHashA;
m_HashIndexTable[nHashPos].nHashB = nHashB;
return true;
}

關(guān)于其中的實(shí)現(xiàn)原理，我覺得沒有比 inside MPQ說得清楚的了，于是用我蹩腳的E文，將該文的第二節(jié)翻譯了一遍（將原文和譯文都貼出來，請高手指正）：

原理

Most of the advancements throughout the history of computers have been because of particular problems which required solving. In this chapter, we'll take a look at some of these problems and their solutions as they pertain to the MPQ format.

貫穿計(jì)算機(jī)發(fā)展歷史，大多數(shù)進(jìn)步都是源于某些問題的解決，在這一節(jié)中，我們來看一看與MPQ 格式相關(guān)問題及解決方案；

Hashes

哈希表

Problem: You have a very large array of strings. You have another string and need to know if it is already in the list. You would probably begin by comparing each string in the list with the string other, but when put into application, you would find that this method is far too slow for practical use. Something else must be done. But how can you know if the string exists without comparing it to all the other strings?

問題：你有一個(gè)很大的字符串?dāng)?shù)組，同時(shí)，你另外還有一個(gè)字符串，需要知道這個(gè)字符串是否已經(jīng)存在于字符串?dāng)?shù)組中。你可能會對數(shù)組中的每一個(gè)字符串進(jìn)行比較，但是在實(shí)際項(xiàng)目中，你會發(fā)現(xiàn)這種做法對某些特殊應(yīng)用來說太慢了。必須尋求其他途徑。那么如何才能在不作遍歷比較的情況下知道這個(gè)字符串是否存在于數(shù)組中呢？

Solution: Hashes. Hashes are smaller data types (i.e. numbers) that represent other, larger, data types (usually strings). In this scenario, you could store hashes in the array with the strings. Then you could compute the hash of the other string and compare it to the stored hashes. If a hash in the array matches the new hash, the strings can be compared to verify the match. This method, called indexing, could speed things up by about 100 times, depending on the size of the array and the average length of the strings.

解決方案：哈希表。哈希表是通過更小的數(shù)據(jù)類型表示其他更大的數(shù)據(jù)類型。在這種情況下，你可以把哈希表存儲在字符串?dāng)?shù)組中，然后你可以計(jì)算字符串的哈希值，然后與已經(jīng)存儲的字符串的哈希值進(jìn)行比較。如果有匹配的哈希值，就可以通過字符串比較進(jìn)行匹配驗(yàn)證。這種方法叫索引，根據(jù)數(shù)組的大小以及字符串的平均長度可以約100倍。

unsigned long HashString(char *lpszString)
{   
    unsigned long ulHash = 0xf1e2d3c4;        
    while (*lpszString != 0)    
    {        
        ulHash <<= 1;       
        ulHash += *lpszString++;      
    }   
    return ulHash;
}

The previous code function demonstrates a very simple hashing algorithm. The function sums the characters in the string, shifting the hash value left one bit before each character is added in. Using this algorithm, the string "arr\units.dat" would hash to 0x5A858026, and "unit\neutral\acritter.grp" would hash to 0x694CD020. Now, this is, admittedly, a very simple algorithm, and it isn't very useful, because it would generate a relatively predictable output, and a lot of collisions in the lower range of numbers. Collisions are what happen when more than one string hash to the same value.

上面代碼中的函數(shù)演示了一種非常簡單的散列算法。這個(gè)函數(shù)在遍歷字符串過程中，將哈希值左移一位，然后加上字符值；通過這個(gè)算法，字符串"arr\units.dat" 的哈希值是0x5A858026，字符串"unit\neutral\acritter.grp" 的哈希值是0x694CD020；現(xiàn)在，眾所周知的，這是一個(gè)基本沒有什么實(shí)用價(jià)值的簡單算法，因?yàn)樗鼤谳^低的數(shù)據(jù)范圍內(nèi)產(chǎn)生相對可預(yù)測的輸出，從而可能會產(chǎn)生大量沖突（不同的字符串產(chǎn)生相同的哈希值）。

The MPQ format, on the other hand, uses a very complicated hash algorithm (shown below) to generate totally unpredictable hash values. In fact, the hashing algorithm is so effective that it is called a one-way hash. A one-way hash is a an algorithm that is constructed in such a way that deriving the original string (set of strings, actually) is virtually impossible. Using this particular algorithm, the filename "arr\units.dat" would hash to 0xF4E6C69D, and "unit\neutral\acritter.grp" would hash to 0xA26067F3.

MPQ格式，使用了一種非常復(fù)雜的散列算法（如下所示），產(chǎn)生完全不可預(yù)測的哈希值，這個(gè)算法十分有效，這就是所謂的單向散列算法。通過單向散列算法幾乎不可能通過哈希值來唯一的確定輸入值。使用這種算法，文件名 "arr\units.dat" 的哈希值是0xF4E6C69D，"unit\neutral\acritter.grp" 的哈希值是 0xA26067F3。

unsigned long HashString(char *lpszFileName, unsigned long dwHashType)
{   
    unsigned char *key = (unsigned char *)lpszFileName;   
    unsigned long seed1 = 0x7FED7FED, seed2 = 0xEEEEEEEE;   
    int ch;

    while(*key != 0)       
    {      
        ch = toupper(*key++);   
        seed1 = cryptTable[(dwHashType << 8) + ch] ^ (seed1 + seed2);       
        seed2 = ch + seed1 + seed2 + (seed2 << 5) + 3;       
    }   
    return seed1;  
}

Hash Tables

哈希表

Problem: You tried using an index like in the previous sample, but your program absolutely demands break-neck speeds, and indexing just isn't fast enough. About the only thing you could do to make it faster is to not check all of the hashes in the array. Or, even better, if you could only make one comparison in order to be sure the string doesn't exist anywhere in the array. Sound too good to be true? It's not.

問題：您嘗試在前面的示例中使用相同索引，您的程序一定會有中斷現(xiàn)象發(fā)生，而且不夠快。如果想讓它更快，您能做的只有讓程序不去查詢數(shù)組中的所有散列值。或者您可以只做一次對比就可以得出在列表中是否存在字符串。聽起來不錯，真的么？不可能的啦

Solution: A hash table. A hash table is a special type of array in which the offset of the desired string is the hash of that string. What I mean is this. Say that you make that string array use a separate array of fixed size (let's say 1024 entries, to make it an even power of 2) for the hash table. You want to see if the new string is in that table. To get the string's place in the hash table, you compute the hash of that string, then modulo (division remainder) that hash value by the size of that table. Thus, if you used the simple hash algorithm in the previous section, "arr\units.dat" would hash to 0x5A858026, making its offset 0x26 (0x5A858026 divided by 0x400 is 0x16A160, with a remainder of 0x26). The string at this location (if there was one) would then be compared to the string to add. If the string at 0x26 doesn't match or just plain doesn't exist, then the string to add doesn't exist in the array. The following code illustrates this:

解決：一個(gè)哈希表就是以字符串的哈希值作為下標(biāo)的一類數(shù)組。我的意思是，哈希表使用一個(gè)固定長度的字符串?dāng)?shù)組（比如1024，2的偶次冪）進(jìn)行存儲；當(dāng)你要看看這個(gè)字符串是否存在于哈希表中，為了獲取這個(gè)字符串在哈希表中的位置，你首先計(jì)算字符串的哈希值，然后哈希表的長度取模。這樣如果你像上一節(jié)那樣使用簡單的哈希算法，字符串"arr\units.dat" 的哈希值是0x5A858026,偏移量0x26（0x5A858026 除于0x400等于0x16A160，模0x400等于0x26）。因此，這個(gè)位置的字符串將與新加入的字符串進(jìn)行比較。如果0X26處的字符串不匹配或不存在，那么表示新增的字符串在數(shù)組中不存在。下面是示意的代碼：

int GetHashTablePos(char *lpszString, SOMESTRUCTURE *lpTable, int nTableSize)
{   
    int nHash = HashString(lpszString), nHashPos = nHash % nTableSize;       
    if (lpTable[nHashPos].bExists && !strcmp(lpTable[nHashPos].pString, lpszString))       
        return nHashPos;   
    else        
        return -1; //Error value   
}

Now, there is one glaring flaw in that explanation. What do you think happens when a collision occurs (two different strings hash to the same value)? Obviously, they can't occupy the same entry in the hash table. Normally, this is solved by each entry in the hash table being a pointer to a linked list, and the linked list would hold all the entries that hash to that same value.

上面的說明中存在一個(gè)刺眼的缺陷。當(dāng)有沖突（兩個(gè)不同的字符串有相同的哈希值）發(fā)生的時(shí)候怎么辦？顯而易見的，它們不能占據(jù)哈希表中的同一個(gè)位置。通常的解決辦法是為每一個(gè)哈希值指向一個(gè)鏈表，用于存放所有哈希沖突的值；

MPQs use a hash table of filenames to keep track of the files inside, but the format of this table is somewhat different from the way hash tables are normally done. First of all, instead of using a hash as an offset, and storing the actually filename for verification, MPQs do not store the filename at all, but rather use three different hashes: one for the hash table offset, two for verification. These two verification hashes are used in place of the actual filename. Of course, this leaves the possibility that two different filenames would hash to the same three hashes, but the chances of this happening are, on average, 1:18889465931478580854784, which should be safe enough for just about anyone.

MPQs使用一個(gè)存放文件名的哈希表來跟蹤文件內(nèi)部，但是表的格式與通常方法有點(diǎn)不同，首先不像通常的做法使用哈希值作為偏移量，存儲實(shí)際的文件名。MPQs 根本不存儲文件名，而是使用了三個(gè)不同的哈希值：一個(gè)用做哈希表偏移量，兩個(gè)用作核對。這兩個(gè)核對的哈希值用于替代文件名。當(dāng)然從理論上說存在兩個(gè)不同的文件名得到相同的三個(gè)哈希值，但是這種情況發(fā)送的幾率是：1:18889465931478580854784,這應(yīng)該足夠安全了。

The other way that an MPQ's hash table differs from the conventional implementation is that instead of using a linked list for each entry, when a collision occurs, the entry will be shifted to the next slot, and the process repeated until a free space is found. Take a look at the following illustrational code, which is basically the way a file is located for reading in an MPQ:

MPQ's的哈希表的實(shí)現(xiàn)與傳統(tǒng)實(shí)現(xiàn)的另一個(gè)不同的地方是，相對與傳統(tǒng)做法（為每個(gè)節(jié)點(diǎn)使用一個(gè)鏈表，當(dāng)沖突發(fā)生的時(shí)候，遍歷鏈表進(jìn)行比較），看一下下面的示范代碼，在MPQ中定位一個(gè)文件進(jìn)行讀操作：

int GetHashTablePos(char *lpszString, MPQHASHTABLE *lpTable, int nTableSize)
{   
    const int HASH_OFFSET = 0, HASH_A = 1, HASH_B = 2;    
    int nHash = HashString(lpszString, HASH_OFFSET),nHashA = HashString(lpszString, HASH_A),nHashB = HashString(lpszString, HASH_B), nHashStart = nHash % nTableSize,nHashPos = nHashStart;
    while (lpTable[nHashPos].bExists)
    {
        if (lpTable[nHashPos].nHashA == nHashA && lpTable[nHashPos].nHashB == nHashB)
            return nHashPos;

        else
            nHashPos = (nHashPos + 1) % nTableSize;
            if (nHashPos == nHashStart)
                break;
    }
    return -1; //Error value

}

However convoluted that code may look, the theory behind it isn't difficult. It basically follows this process when looking to read a file:

Compute the three hashes (offset hash and two check hashes) and store them in variables.

Move to the entry of the offset hash

Is the entry unused? If so, stop the search and return 'file not found'.

Do the two check hashes match the check hashes of the file we're looking for? If so, stop the search and return the current entry.

Move to the next entry in the list, wrapping around to the beginning if we were on the last entry.

Is the entry we just moved to the same as the offset hash (did we look through the whole hash table?)? If so, stop the search and return 'file not found'.

Go back to step 3.

無論代碼看上去有多么復(fù)雜，其背后的理論并不難。讀一個(gè)文件的時(shí)候基本遵循下面這樣一個(gè)過程：

1、計(jì)算三個(gè)哈希值（一個(gè)哈希偏移量和兩個(gè)驗(yàn)證值）并保存到變量中；

2、移動到哈希偏移量對應(yīng)的值；

3、對應(yīng)的位置是否尚未使用？如果是，則停止搜尋，并返回“文件不存在”；

4、這兩個(gè)驗(yàn)證值是否與我們要找的字符串驗(yàn)證值匹配，如果是，停止搜尋，并返回當(dāng)前的節(jié)點(diǎn)；

5、移動到下一個(gè)節(jié)點(diǎn)，如果到了最后一個(gè)節(jié)點(diǎn)則返回開始；

6、Is the entry we just moved to the same as the offset hash (did we look through the whole hash table?)? If so, stop the search and return 'file not found'.

7、回到第3步；

If you were paying attention, you might have noticed from my explanation and sample code is that the MPQ's hash table has to hold all the file entries in the MPQ. But what do you think happens when every hash-table entry gets filled? The answer might surprise you with its obviousness: you can't add any more files. Several people have asked me why there is a limit (called the file limit) on the number of files that can be put in an MPQ, and if there is any way around this limit. Well, you already have the answer to the first question. As for the second; no, you cannot get around the file limit. For that matter, hash tables cannot even be resized without remaking the entire MPQ from scratch. This is because the location of each entry in the hash table may well change due to the resizing, and we would not be able to derive the new position because the position is the hash of the file name, and we may not know the file name.

如果您注意的話，您可能已經(jīng)從我們的解釋和示例代碼注意到，MPQ的哈希表已經(jīng)將所有的文件入口放入MPQ中；那么當(dāng)哈希表的每個(gè)項(xiàng)都被填充的時(shí)候，會發(fā)生什么呢？答案可能會讓你驚訝：你不能添加任何文件。有些人可能會問我為什么文件數(shù)量上有這樣的限制（文件限制），是否有辦法繞過這個(gè)限制？就此而言，如果不重新創(chuàng)建MPQ 的項(xiàng)，甚至無法調(diào)整哈希表的大小。這是因?yàn)槊總€(gè)項(xiàng)在哈希表中的位置會因?yàn)樘l尺寸而改變，而我們無法得到新的位置，因?yàn)檫@些位置值是文件名的哈希值，而我們根本不知道文件名是什么。

一連總結(jié)了3篇關(guān)于哈希表的C實(shí)現(xiàn)，都是來源于網(wǎng)絡(luò)，整理學(xué)習(xí)，以備忘之；不能說都搞得很清楚，大致知道了哈希表是怎么實(shí)現(xiàn)的；當(dāng)然還有很多開源項(xiàng)目都有自己的實(shí)現(xiàn)，如LUA、Redis、Apache等，精力有限，先挖個(gè)坑，日后有時(shí)間再補(bǔ)充吧。不管怎么說，有點(diǎn)孔乙己的嫌疑，呵呵！

本站是提供個(gè)人知識管理的網(wǎng)絡(luò)存儲空間，所有內(nèi)容均由用戶發(fā)布，不代表本站觀點(diǎn)。請注意甄別內(nèi)容中的聯(lián)系方式、誘導(dǎo)購買等信息，謹(jǐn)防詐騙。如發(fā)現(xiàn)有害或侵權(quán)內(nèi)容，請點(diǎn)擊一鍵舉報(bào)。

轉(zhuǎn)藏 分享

QQ空間 QQ好友新浪微博微信

獻(xiàn)花（0） +1

來自： lchjczw > 《哈希表》

舉報(bào)/認(rèn)領(lǐng)