Saturday, January 10, 2015

Jaccard distance

When you work on any information retrieval system like SOLR, or text mining projects, there are couple distance measurements for strings or documents. Jaccard distance is the most simple one. But it doesn't take the term frequency count and character positioning. E.g. china vs anihc, china vs chine.


Recently I read book "Taming Text', its jaccard distance implementation is not optimized, in my humble opinion. The book itself is great for anyone works in text mining fields.

The followings are my implementations. The first implementation is better than second, I think. 



class JaccardDistance {
final static short ALPHABET_SIZE = 256;

public float jaccard(String s1, String s2){
boolean[] has1 = new boolean[ALPHABET_SIZE];
for(char c : s1.toCharArray())
has1[c] = true;

boolean[] has2 = new boolean[ALPHABET_SIZE];
for(char c : s2.toCharArray())
has2[c] = true;  
int intersect = 0, union = 0;
for(int i=0; i<ALPHABET_SIZE; i++){
if(has1[i]&&has2[i])
intersect++;
if(has1[i]||has2[i])
union++;
}

return (float)intersect/(float)union;
}
public float jaccard2(String s1, String s2){
Set<Character> union = new HashSet<Character>();
Set<Character> intersect = new HashSet<Character>();
for(char c: s1.toCharArray()){
union.add(c);
intersect.add(c);
}
Set<Character> set2 = new HashSet<Character>();
for(char c: s2.toCharArray())
set2.add(c);
union.addAll(set2);
intersect.retainAll(set2);
return (float)intersect.size()/(float)union.size();
}
}

1 comment: