Blue Ocean: string

Showing posts with label string. Show all posts

Wednesday, November 25, 2015

Longest Continuous Ones of a String contains 0 or 1 - inch worm algorithm

Given a string consisted of only 0 or 1, and a number N.

We have N times to flip 0 to 1 in array. Find the longest continuous 1 after we flipped the N zero elements. Rotation is allowed.

For example, 100100111000, n=2, the best flip is 100111111000, return 6

10001101101, n=2, the best flip is 10001111111, return 8(rotation is allowed)

The following algorithm is more like a inch worm walking on a line, it tries to go forward as it can, then shrink from the tail, then move forward again.

/**
    * we first concatenate string to itself
    *
    * then use a sliding window, keep track of number of 0s in the window
    * grow the right side of window if not exceed max 0s allowed
    * otherwise shrink the left side to reduce the 0s
    *
    * the algorithm stops when left side reach the length of original string
    *
    */
    public static int maxContinuous1sWithRotation(String s, int maxFlips){
        int p1=0, p2=0, len = s.length(), num0=0, max=0;
        while(p1<len && p2-p1<len){
            if(s.charAt(p2%len) == '0'){
                num0++;
            }

            /*
            * if num0 reaches maxFlips+1
            * we need to shrink the sliding window from left
            * until we reaches a 0 and reduce the counter
            */
            if(num0 > maxFlips){
                while(p1<len && s.charAt(p1)=='1')
                    p1++;
                p1++; num0--;
            }

            /*
            * we keep on growing the sliding window
            * to the right and update the max
            */
            p2++;
            max = Math.max(max, p2-p1);
        }
        return max;
    }

Friday, September 18, 2015

a simple regular expression implementation

Implement a simple regular expression based string match.

. - any character
* - match 0 or more times of previous character

/**
    *
    * @param target
    * @param regex
    * @return
    */
    public static boolean isMatch(String target, String regex){
        assert !target.isEmpty() && !regex.isEmpty();
        int p1 = 0;
        int p2 = 0;
        int len1 = target.length();
        int len2 = regex.length();
        char pre = '\0', c1, c2 = '\0';

        while(p1<len1 && p2<len2){
            c1 = target.charAt(p1);
            c2 = regex.charAt(p2);

            if(c2 == '.'){
                p1++;p2++;
                pre = '.';
            }else if(c2 == '*'){
                if(pre == '\0'){
                    return false;
                }
                else if(pre == '.' || pre == c1){
                    p1++;
                }else{
                    p2++;
                    pre = '\0';
                }
            }else{
                if(c1 == c2){
                    pre = c1;
                    p1++; p2++;
                }else{
                    if(p2+1<len2 && regex.charAt(p2+1) == '*'){
                        p2++;p2++;
                    }else
                        return false;
                }

            }
        }

        if(p1 == len1)
            return true;
        else
            return false;
    }

Saturday, August 22, 2015

String tokenizer

Given a function which will tell if a string is word or not, and given a string, break it into multiple words. For example: "usaway" can be broken into "us away" or "usa way".

If we think each index of characters in the string is a node in a direct graph, then there is edge from node i to j if substring from index i to j is a word. Start from node at index 0, if we can find a path to node at index of last character, then we find a way to tokenize the string. To do this we can use depth first search.

The following is the code. Note if there are multiple ways to break a string into words, then the method will only return one way.

public static String tokenize(String s){

  String[] result = new String[1];

  result[0] = "";

  dfs(s, 0, result);

  return result[0];

}

private static boolean dfs(String s, int index, String[] result){

  if(index == s.length())

   return true; 

  for(int i=index+1; i<=s.length(); i++){

   String substring = s.substring(index, i);

   if(isWord(substring) && dfs(s, i, result)){

    result[0] =  substring + " " + result[0];

    return true;

}

}

  return false;

}

The second tokenize method will return a list. Each item in the return list is a list of indices which form a path from index at 0 to the last index of the string.

public List<List<Integer>> tokenize2(String input){

/*

   * map character index position to the next character index position if they are part of the solution

*/

  Map<Integer, List<Integer>> store = new TreeMap<Integer, List<Integer>>();

  dfs(input, store, 0);

  List<List<Integer>> result = new ArrayList<List<Integer>>();

  List<Integer> list = new ArrayList<Integer>();

  list.add(0);

  result.add(list);

  boolean flag = true;

  while(flag){

   flag = false;

   List<List<Integer>> copy = new ArrayList<List<Integer>>();

   for(List<Integer> li : result){

    int k = li.get(li.size()-1);

    if(store.containsKey(k)){

     flag = true;

     for(int n : store.get(k)){

      List<Integer> temp = new ArrayList<Integer>(li);

      temp.add(n);

      copy.add(temp);

}

    }else                  

     copy.add(li);

}

   result = copy;

}

  return result;

}

private Integer dfs(String s, Map<Integer, List<Integer>> store, int from){

  int len = s.length();

  if(from == len-1)//we find the leaf node

   return from;

  for(int i=from+1; i<len-1; i++){

   if(isWord(s, from, i)){

    Integer ret = dfs(s, store, i);

    if(!store.containsKey(from)){

     List<Integer> list = new ArrayList<Integer>();

     store.put(from, list);

}

    //the edge from-ret form the path which is the solution

    store.get(from).add(ret);

    return ret==null?null:from;

}

}

  return null; 

}

Tuesday, August 4, 2015

check if one string has anagram substring of the other string

Most of the anagram related question can be solved by using int array, each element in the array representing the number of appearances of each char in the string.

    /*
    * Given 2 strings, check if any one of them
    * has any anagram of the other string, as a substring of it.
    *
    * here without losing generality s1.length() > s2.length()
    *
    * time complexity: O(L2) + O(L1)*O(1) = O(L1)
    */
    public static boolean hasAnagramSubstring(String s1, String s2){
        assert s1.length() > s2.length();
        final int ALPHA_SIZE = 256;
        int[] store2 = new int[ALPHA_SIZE];
        for(char c : s2.toCharArray())
            store2[c] ++;

        char[] chars = s1.toCharArray();
        int[] store1 = new int[ALPHA_SIZE];
        for(int i=0; i<s2.length(); i++){
            store1[chars[i]]++;
        }

        //count the number of different characters in each string
        int numOfNonZero = 0;
        for(int i=0; i<ALPHA_SIZE; i++){
           //store the difference between two int array in store1
            store1[i] = store1[i] - store2[i];
            if(store1[i]!=0)
                numOfNonZero++;
        }

        if(numOfNonZero==0)
            return true;

        for(int i = s2.length(); i<s1.length(); i++){
            store1[chars[i]]++;
            if(store1[chars[i]]==0)
                numOfNonZero--;
            store1[chars[i-1]]--;
            if(store1[chars[i-1]]==0)
                numOfNonZero--;
            if(numOfNonZero==0)
                return true;
        }
        return false;

    }

Thursday, February 19, 2015

Text Justification Algorithm - Dynamic programming and greedy algorithm

Problem: given a list of words and a window size of w, how to split them into multiple lines so that the fewest lines are used if possible.

Example: w = 10, input: "given a list of words and a window size of w"

potential optimal alignment:

given a
list of
words and
a window
size of w

not a good alignment:

given
a list
of words
and a
window
size of
w

This is a common problem faced by many editor application such as MS Words, Open Office.

First of all, given n words, there are 2^n-1 possible ways to alignment them in up to n lines.

Second, we need a way to measure the goodness or badness of each options. One way to measure goodness that is use the total length of words on each line divided by window size. The value is between 0 and 1, 1 means a perfect alignment, 0 means the worst one. Note value will always larger than 0. Another way to measure badness is to compute (w - length_of_words_on_each_line)^3, the higher the value is, the bad alignment is.

Let's assume we have a measurement of how good or bad each option is. A brute force solution is to exam all possible 2^n-1 options and find the best one based on the values mentioned above. Simple, right? (Then we got fired by boss next day!)

Many of the string related problem can be solved by dynamic programming, e.g. longest substring problem, edit distance between two strings and etc. So let's try DP here.

If there is one word, measurement is easy, its badness is:
if word.length > w, badness = INFINITY
otherwise, badness = 0

If there are two words, w1, w2, how to calculate DP(w1) and DP(w2)?
DP(w2) is easy, see one word case above
DP(w1) = Min(badness(w1) + DP(w2), badness(w1, w2))

three words, w1, w2, w3,

DP(w3) is easy, see one word case above
DP(w2) see two words case above
DP(w1) = Min(badness(w1) + DP(w2), badness(w1, w2)+DP(w3), badness(w1, w2, w3))

Ok, I think we find the recursion formula! Let's code it up. Note in the following code, I use fullness or goodness measure. For badness measurement the logic will be the same except we will the maximize each DP value. And I provide sample badness measurement at the end too.

public static void textJustification(String[] words, int width){

 int len = words.length;

 double[] table = new double[len];

 int[] indices = new int[len];

 table[len-1] = fullness(words, len-1, len-1, width);

 indices[len-1] = -1;

 for(int k = len - 2; k<=0; k--){

  table[k] = Integer.MIN_VALUE;

  for(int i=0; k+i<len; i++){

   double fullness = fullness(words, k, k+i, width);

   if(fullness==0d)

    break;

   double dp = k+i+1<len?table[k+i+1]:0;

   if(fullness + dp > table[k]){

    table[k] = fullness + dp;

    indices[k] = k+i+1;

}

}

}

 for(int i = 0; i<len; i = indices[i]){

  System.out.println(i);

}

}

public static double fullness(String[] words, int i, int j, int w){

 int length = words[i].length();

 for(int m = i+1; m<=j; m++)

  length += words[i].length()+1;

 return length<=w? (double)length/w :0;

}

 
//sample code for badness measurement 

public static double badness(String[] words, int i, int j, int w){

 int length = words[i].length();

 for(int m = i+1; m<=j; m++)

  length += words[i].length()+1;

 return length<=w? Math.pow(w - length, 3) : Integer.MAX_VALUE;

}

Saturday, January 10, 2015

Jaccard distance

When you work on any information retrieval system like SOLR, or text mining projects, there are couple distance measurements for strings or documents. Jaccard distance is the most simple one. But it doesn't take the term frequency count and character positioning. E.g. china vs anihc, china vs chine.

Recently I read book "Taming Text', its jaccard distance implementation is not optimized, in my humble opinion. The book itself is great for anyone works in text mining fields.

The followings are my implementations. The first implementation is better than second, I think.

class JaccardDistance {

  final static short ALPHABET_SIZE = 256;

  public float jaccard(String s1, String s2){  

   boolean[] has1 = new boolean[ALPHABET_SIZE];

   for(char c : s1.toCharArray())

    has1[c] = true;  

   boolean[] has2 = new boolean[ALPHABET_SIZE];

   for(char c : s2.toCharArray())

    has2[c] = true;     

 

   int intersect = 0, union = 0;

   for(int i=0; i<ALPHABET_SIZE; i++){

    if(has1[i]&&has2[i])

     intersect++;

    if(has1[i]||has2[i])

     union++;

}

   return (float)intersect/(float)union;

}

  public float jaccard2(String s1, String s2){  

   Set<Character> union = new HashSet<Character>();

   Set<Character> intersect = new HashSet<Character>();

   for(char c: s1.toCharArray()){

    union.add(c);

    intersect.add(c);

}

   Set<Character> set2 = new HashSet<Character>();

   for(char c: s2.toCharArray())

    set2.add(c);

   union.addAll(set2);

   intersect.retainAll(set2);

   return (float)intersect.size()/(float)union.size();

}

}