Saturday, August 22, 2015

String tokenizer

Given a function which will tell if a string is word or not, and given a string, break it into multiple words. For example: "usaway" can be broken into "us away" or "usa way".

If we think each index of characters in the string is a node in a direct graph, then there is edge from node i to j if substring from index i to j is a word. Start from node at index 0, if we can find a path to node at index of last character, then we find a way to tokenize the string.  To do this we can use depth first search. 

The following is the code. Note if there are multiple ways to break a string into words, then the method will only return one way. 

public static String tokenize(String s){
String[] result = new String[1];
result[0] = "";
dfs(s, 0, result);
return result[0];
}

private static boolean dfs(String s, int index, String[] result){
if(index == s.length())
return true;
for(int i=index+1; i<=s.length(); i++){
String substring = s.substring(index, i);
if(isWord(substring) && dfs(s, i, result)){
result[0] =  substring + " " + result[0];
return true;
}
}
return false;
}


The second tokenize method will return a list. Each item in the return list is a list of indices which form a path from index at 0 to the last index of the string. 

public List<List<Integer>> tokenize2(String input){
/*
* map character index position to the next character index position if they are part of the solution
*/
Map<Integer, List<Integer>> store = new TreeMap<Integer, List<Integer>>();
dfs(input, store, 0);
List<List<Integer>> result = new ArrayList<List<Integer>>();
List<Integer> list = new ArrayList<Integer>();
list.add(0);
result.add(list);
boolean flag = true;
while(flag){
flag = false;
List<List<Integer>> copy = new ArrayList<List<Integer>>();
for(List<Integer> li : result){
int k = li.get(li.size()-1);
if(store.containsKey(k)){
flag = true;
for(int n : store.get(k)){
List<Integer> temp = new ArrayList<Integer>(li);
temp.add(n);
copy.add(temp);
}
}else
copy.add(li);
}
result = copy;
}
return result;
}
private Integer dfs(String s, Map<Integer, List<Integer>> store, int from){
int len = s.length();
if(from == len-1)//we find the leaf node
return from;
for(int i=from+1; i<len-1; i++){
if(isWord(s, from, i)){
Integer ret = dfs(s, store, i);
if(!store.containsKey(from)){
List<Integer> list = new ArrayList<Integer>();
store.put(from, list);
}
//the edge from-ret form the path which is the solution
store.get(from).add(ret);
return ret==null?null:from;
}
}
return null;
}

No comments:

Post a Comment