Thursday, February 19, 2015

Text Justification Algorithm - Dynamic programming and greedy algorithm

Problem: given a list of words and a window size of w, how to split them into multiple lines so that the fewest lines are used if possible.

Example: w = 10, input:  "given a list of words and a window size of w"

potential optimal alignment:

given      a
list         of
words and
a  window
size of    w

not a good alignment:

given
a  list
of words
and a
window
size of
w

This is a common problem faced by many editor application such as MS Words, Open Office.

First of all, given n words, there are 2^n-1 possible ways to alignment them in up to n lines.

Second, we need a way to measure the goodness or badness of each options. One way to measure goodness that is use the total length of words on each line divided by window size. The value is between 0 and 1, 1 means a perfect alignment, 0 means the worst one. Note value will always larger than 0. Another way to measure badness is to compute (w - length_of_words_on_each_line)^3, the higher the value is, the bad alignment is.

Let's assume we have a measurement of how good or bad each option is. A brute force solution is to exam all possible 2^n-1 options and find the best one based on the values mentioned above.  Simple, right? (Then we got fired by boss next day!)

Many of the string related problem can be solved by dynamic programming, e.g. longest substring problem, edit distance between two strings and etc. So let's try DP here.

If there is one word, measurement is easy, its badness is:
 if word.length > w, badness = INFINITY
 otherwise, badness = 0

If there are two words, w1, w2, how to calculate DP(w1) and DP(w2)?
DP(w2) is easy, see one word case above
DP(w1) = Min(badness(w1) + DP(w2), badness(w1, w2))

three words, w1, w2, w3,

DP(w3) is easy, see one word case above
DP(w2) see two words case above
DP(w1) = Min(badness(w1) + DP(w2), badness(w1, w2)+DP(w3), badness(w1, w2, w3))

Ok, I think we find the recursion formula! Let's code it up. Note in the following code, I use fullness or goodness measure. For badness measurement the logic will be the same except we will the maximize each DP value. And I provide sample badness measurement at the end too.


public static void textJustification(String[] words, int width){
int len = words.length;
double[] table = new double[len];
int[] indices = new int[len];
table[len-1] = fullness(words, len-1, len-1, width);
indices[len-1] = -1;
for(int k = len - 2; k<=0; k--){
table[k] = Integer.MIN_VALUE;
for(int i=0; k+i<len; i++){
double fullness = fullness(words, k, k+i, width);
if(fullness==0d)
break;
double dp = k+i+1<len?table[k+i+1]:0;
if(fullness + dp > table[k]){
table[k] = fullness + dp;
indices[k] = k+i+1;
}
}
}
for(int i = 0; i<len; i = indices[i]){
System.out.println(i);
}
}
public static double fullness(String[] words, int i, int j, int w){
int length = words[i].length();
for(int m = i+1; m<=j; m++)
length += words[i].length()+1;
return length<=w? (double)length/w :0;
}

//sample code for badness measurement

public static double badness(String[] words, int i, int j, int w){
 int length = words[i].length();
for(int m = i+1; m<=j; m++)
length += words[i].length()+1;
return length<=w? Math.pow(w - length, 3) : Integer.MAX_VALUE;
}









3 comments:

  1. I think your problem is the same as this one, right? http://www.geeksforgeeks.org/dynamic-programming-set-18-word-wrap/

    I think it is super smart to define the badness function. Because it can both measure fewest line, as well as how well a line balances.
    Do you know why badness is (w - length_of_words_on_each_line)^3, but not (w - length_of_words_on_each_line)^2?

    ReplyDelete
    Replies
    1. right. the badness measurement is to map the scores to big space, we can think it as hash function, the bigger the space is, the less likely the collision happens.

      Delete
  2. My try: http://allenlipeng47.com/PersonalPage/index/view/136/nkey

    ReplyDelete