Friday, March 27, 2015

Cuckoo hash table

As we all know search in a list is O(n), search in a binary search tree is O(logN), and hash is O(1).

There are two critical aspects for designing hash, one is the hash function; the other is the hash collision, which in worse case will turn a hash table into linked list, if the chaining approach is used in constructing hash.

There are several ways to build hash table without any collisions or pre-defined bucket size. Cuckoo hash table is one of them. Cuckoo is one kind of birds who kicks off the host's egg and put its own in. See this link http://www.nwf.org/news-and-magazines/national-wildlife/birds/archives/1997/bullies-of-the-bird-world.aspx

Cuckoo hash do exactly the same, there is one interesting paper written by a group of computer scientists from CMU, Cuckoo Filter: Practically Better Than Bloom, they developed a cuckoo algorithm which performs better than bloom function.

The below is my untested version of cuckoo hash.

import java.util.Collections;
import java.util.Random;


public class CuckooHash {
//the buckets
private Entry[] cache;
//the size of the buckets
private int capacity;
//the number of filled buckets
private int size;
//the maximum ratio of size over capacity until resizing
private final float loadFactor;
private final transient IntHashFuncI[] hashFunctions;
public CuckooHash(int iniCapacity, float loadfactor){
this(iniCapacity, loadfactor, new IntHashFuncI[]{new HashFunc(1) , new HashFunc(2)});
}
CuckooHash() {
this(16, 0.75f, new IntHashFuncI[]{new HashFunc(1) , new HashFunc(2)});
}
private CuckooHash(int iniCapacity, float loadfactor, IntHashFuncI[] hashFuncs){
cache = new Entry[iniCapacity];
loadFactor = loadfactor;
capacity = iniCapacity;
hashFunctions = hashFuncs;

}
public Object get(int id){

for(int i=0; i<2; i++){
int index = hashFunctions[i].hash(id, capacity);
if(cache[index]!=null && cache[index].id == id)
return cache[index].t;
}
return null;
}
boolean insert(int id, Object tags){
for(int i=0; i<2; i++){
int index = hashFunctions[i].hash(id, capacity);
if(cache[index]==null){
cache[index] = new Entry(id, t);
this.size++;
return true;
}
}
return false;
}
private void ensureCapacity (int id) {
if(this.size>=this.loadFactor*this.capacity){
System.out.format("ensureCapacity %d, %d %d %f", id, this.size, this.capacitythis.loadFactor);
rehash(this.capacity<<1);
}
}
 
void put(int id, Object t){
ensureCapacity(id);
if(this.insert(id, t))
return;
//start the cuckoo bullying process
Entry insert = new Entry(id, t);
Entry current = insert;
int counter = 0;
int index = hashFunctions[0].hash(id, capacity);
while(counter++<this.capacity || current!=insert ){
if(cache[index]==null){
cache[index] = current;
size++;
return;
}
Entry temp = cache[index];
cache[index] = current;
current = temp;
if(index == hashFunctions[0].hash(current.id, capacity))
index = hashFunctions[1].hash(current.id, capacity);
else
index = hashFunctions[0].hash(current.id, capacity);
}
rehash(this.capacity<<1);
put(id, t);
}
//double hash table size
private void rehash(int newSize) {
System.out.println("rehash to " + newSize);
int temp = this.size;
this.capacity = newSize;
Entry[] oldCache = cache;
cache = new Entry[newSize];
for(Entry e : oldCache){
if(e!=null)
this.put(e.id, e.tags);
}
this.size = temp;
System.out.println("rehash and size is " + this.size);
}

//int primitive HashMap entry
private class Entry {
private final int id;
private final Object t;
public Entry(int id, Object t) {
super();
this.id = id;
this.t = t;
}
}
static class HashFunc implements IntHashFuncI {
private static final Random GENERATOR = new Random();
private int round;
HashFunc(int loop){
round = loop;
}
public int hash(int key, int range){
GENERATOR.setSeed(key);
int hash = GENERATOR.nextInt(range);
for(int i=0; i<this.round; i++)
hash = GENERATOR.nextInt(range);
return hash;
}
}
private static interface IntHashFuncI {
public int hash(int key, int range);
}

}

9 comments:

  1. It looks pretty cool. As I read, cuckoo hash has a amortized time. Query would be fast. It would be cooler if you use generic type for this class.
    I'm fighting with a Spring Batch framework. It has nothing to do with algorithm…oops...

    ReplyDelete
    Replies
    1. I am working on a generic type version and a thread safe versIon as well

      Delete
  2. I mean amortized time for insertion

    ReplyDelete
  3. When you resize, for example, from 10 capacity to 20 capacity. Will this lead a O(n) operation?

    ReplyDelete
    Replies
    1. I did a quick reading, it is said add() takes amortized O(1) time even if it starts with 1 capacity.
      http://www.cs.cornell.edu/courses/cs312/2008sp/lectures/lec20.html

      Delete
  4. Another question, what if the we want to save a Long type data to hash table. But the hashcode you or java wrote is int. How can we store the Long type data with a small hashcode.

    ReplyDelete
  5. I have another question. Since put() and rehash() call each other. Is there a chance that they would call each other until forever?

    ReplyDelete
    Replies
    1. it doesnt happen. since the rehash will double size and put keys in different slots. also another improvement is to add a fixed size of list called stash, if the keys can not be inserted, then put it in stash

      Delete
    2. also further improvement is to use 3 hash functions, and fixed size of bucket to allow 2 keys per bucket.

      Delete