Recently I read book "Taming Text', its jaccard distance implementation is not optimized, in my humble opinion. The book itself is great for anyone works in text mining fields.
The followings are my implementations. The first implementation is better than second, I think.
class JaccardDistance {
final static short ALPHABET_SIZE = 256;
public float jaccard(String s1, String s2){
boolean[] has1 = new boolean[ALPHABET_SIZE];
for(char c : s1.toCharArray())
has1[c] = true;
boolean[] has2 = new boolean[ALPHABET_SIZE];
for(char c : s2.toCharArray())
has2[c] = true;
int intersect = 0, union = 0;
for(int i=0; i<ALPHABET_SIZE; i++){
if(has1[i]&&has2[i])
intersect++;
if(has1[i]||has2[i])
union++;
}
return (float)intersect/(float)union;
}
public float jaccard2(String s1, String s2){
Set<Character> union = new HashSet<Character>();
Set<Character> intersect = new HashSet<Character>();
for(char c: s1.toCharArray()){
union.add(c);
intersect.add(c);
}
Set<Character> set2 = new HashSet<Character>();
for(char c: s2.toCharArray())
set2.add(c);
union.addAll(set2);
intersect.retainAll(set2);
return (float)intersect.size()/(float)union.size();
}
}
good addAll(), retainAll() functions!!
ReplyDelete