Thursday, February 19, 2015

Text Justification Algorithm - Dynamic programming and greedy algorithm

Problem: given a list of words and a window size of w, how to split them into multiple lines so that the fewest lines are used if possible.

Example: w = 10, input:  "given a list of words and a window size of w"

potential optimal alignment:

given      a
list         of
words and
a  window
size of    w

not a good alignment:

given
a  list
of words
and a
window
size of
w

This is a common problem faced by many editor application such as MS Words, Open Office.

First of all, given n words, there are 2^n-1 possible ways to alignment them in up to n lines.

Second, we need a way to measure the goodness or badness of each options. One way to measure goodness that is use the total length of words on each line divided by window size. The value is between 0 and 1, 1 means a perfect alignment, 0 means the worst one. Note value will always larger than 0. Another way to measure badness is to compute (w - length_of_words_on_each_line)^3, the higher the value is, the bad alignment is.

Let's assume we have a measurement of how good or bad each option is. A brute force solution is to exam all possible 2^n-1 options and find the best one based on the values mentioned above.  Simple, right? (Then we got fired by boss next day!)

Many of the string related problem can be solved by dynamic programming, e.g. longest substring problem, edit distance between two strings and etc. So let's try DP here.

If there is one word, measurement is easy, its badness is:
 if word.length > w, badness = INFINITY
 otherwise, badness = 0

If there are two words, w1, w2, how to calculate DP(w1) and DP(w2)?
DP(w2) is easy, see one word case above
DP(w1) = Min(badness(w1) + DP(w2), badness(w1, w2))

three words, w1, w2, w3,

DP(w3) is easy, see one word case above
DP(w2) see two words case above
DP(w1) = Min(badness(w1) + DP(w2), badness(w1, w2)+DP(w3), badness(w1, w2, w3))

Ok, I think we find the recursion formula! Let's code it up. Note in the following code, I use fullness or goodness measure. For badness measurement the logic will be the same except we will the maximize each DP value. And I provide sample badness measurement at the end too.


public static void textJustification(String[] words, int width){
int len = words.length;
double[] table = new double[len];
int[] indices = new int[len];
table[len-1] = fullness(words, len-1, len-1, width);
indices[len-1] = -1;
for(int k = len - 2; k<=0; k--){
table[k] = Integer.MIN_VALUE;
for(int i=0; k+i<len; i++){
double fullness = fullness(words, k, k+i, width);
if(fullness==0d)
break;
double dp = k+i+1<len?table[k+i+1]:0;
if(fullness + dp > table[k]){
table[k] = fullness + dp;
indices[k] = k+i+1;
}
}
}
for(int i = 0; i<len; i = indices[i]){
System.out.println(i);
}
}
public static double fullness(String[] words, int i, int j, int w){
int length = words[i].length();
for(int m = i+1; m<=j; m++)
length += words[i].length()+1;
return length<=w? (double)length/w :0;
}

//sample code for badness measurement

public static double badness(String[] words, int i, int j, int w){
 int length = words[i].length();
for(int m = i+1; m<=j; m++)
length += words[i].length()+1;
return length<=w? Math.pow(w - length, 3) : Integer.MAX_VALUE;
}









Tuesday, February 17, 2015

Given a number A and a set of given digits D, find the smallest number which is larger than A and consists of the digits only from the set D.



Problem statement:
Given digit[] D, and a number A. Find the smallest number which is larger than A, 
* and is consisted by the digit in D.

For example:
D = [0, 1, 8, 3], A = 8821. Result: 8830
D = [0, 1, 8, 3], A = 8310. Result: 8311
D = [0, 1, 8, 3], A = 12345. Result: 13000


Two steps algorithm:
case 1: some of the digits in the number A are not found in D
1. scan from left to right of the digits in the given number, find the digit is not in the set D, try to replace it with bigger digit in the set D, then replace all rest of the digits in the number with the smallest digit in the set D.

1a. if we can't find bigger digit for the digit not found in D, starting from left neighbor of current digit, scan from right to left, try to replace the digit with bigger digit in set D,

if 1a fail, then generate the smallest number which has more digits than A. 

case 2: all of the digits in the number A are found in D
2. If all digits in the number are found in the set, then go from right to left to scan all digits in the number, find the digit in the set which is larger than it, replace current digit with it and replace all rest of the digits on the right with smallest digit in the set D.

2a. if we can't find bigger digits in D for any digits in A, then generate the smallest number which has more digits than A. 


/*
 * Given digit[] D, and a number A. Find the smallest number which is larger than A, 
 * and is consisted by the digit in D.

For example:
D = [0, 1, 8, 3], A = 8821. Result: 8830
D = [0, 1, 8, 3], A = 8310. Result: 8311
D = [0, 1, 8, 3], A = 12345. Result: 13000
*/



public static int[] findSmallestNumberUsingDigits(int[] D, int source){
int len = 0, copy = source;
while(copy>0){
copy = copy/10; len++;
}
int[] digits = new int[len];
copy = source;
while(copy>0){
digits[--len] = copy%10;
copy = copy/10;
}
//java uses quick-sort if size is small then actually insertion sort
Arrays.sort(D);
//step 1
//scan from left to right find the one digit in the number not in D and try to replace it //with larger digit in D
//if succeed set flag to true, then replace the rest of digits on the right with smallest //digits in D
//if failed, then call getNextGreater()
boolean flag = false;
for(int i=0; i<digits.length; i++){
        if(flag){
        digits[i] = D[0];
continue;
}

int target = Arrays.binarySearch(D, digits[i]);
if(target<0){
int insertion = (target+1)*(-1);
if(insertion == D.length){
//throw new IllegalArgumentException("Not possible");
//now we scan to the left to increase the digit
int k = i-1;
for(; k>=0; k--){
target = Arrays.binarySearch(D, digits[k]);
if(target+1<D.length){
digits[k] = D[target+1];
i = k;
break;
}
}
if(k<0){
return getNextGreater(digits, D);
}

}else{
digits[i] = D[insertion];
}
flag = true;
}
}

if(flag)
return digits;

//step 2
//if all digits are in D
//then starting from right, find the larger digit in D than current digit,
//if found, replace current digit with the found digit and replace the rest of the digits on //the right with smallest digit in D
//otherwise call getNextGreater()
int i = digits.length-1;
for(; i>=0; i--){
int target = Arrays.binarySearch(D, digits[i]);
if(target+1<D.length){
digits[i] = D[target+1];
flag = true;
break;
}
}

if(!flag)
return getNextGreater(digits, D);

for(int j = i+1; j<digits.length; j++)
digits[j] = D[0];

return digits;

}

public static int[] getNextGreater(int[] digits, int[] D){
int[] ds = new int[digits.length+1];
ds[0] = D[0]==0?D[1]:D[0];
for(int id=1; id<ds.length; id++)
ds[id] = D[0];
return ds;
}


Monday, February 16, 2015

Find single-sourced shortest path - bellman-ford algorithm

Continue the journey from the last blog (http://blueocean-penn.blogspot.com/2015/02/dijkstra-algorithm-implementation.html) about find single-sourced shortest path in a graph, last time I implemented Dijkstra algorithm which can deal with graph with cycle with the constrain of all weights have to be non-negative.

The negative cycle issue is the nemesis for any shortest path graph algorithm, in this post I implemented Bellman-Ford algorithm which can find single-sourced shortest path and report negative cycle if there is one.

The algorithm itself is incredibly simple:

for each vertex in the graph except the source vertex
loop through each edge in the graph
do relaxation on (u, v, w), u, v are the vertex, w is the weight of the edge (u, v)

the following implementation assumes graph represented as adjacent matrix w[][]


/*
* O(VE) algorithm can report negative cycle
*/
int[] bellmanford(int source){
int[] distance = new int[this.size];
int[] parent = new int[this.size];
//initialize
for(int i=0;i<this.size; i++){
distance[i] = Integer.MAX_VALUE;
parent[i] = -1;
}
distance[source] = 0;
parent[source] = -1;
for(int i = 1; i< this.size; i++){
for(int u = 0; u<this.size; u++){
for(int v=0; v<this.size; v++){
if(w[u][v]!=0){//there is edge from u->v
//relaxation
if(distance[v] > distance[u] + w[u][v]){
distance[v] = distance[u] + w[u][v];
parent[v] = u;
}
}
}
}
}
for(int u = 0; u<this.size; u++){
for(int v=0; v<this.size; v++){
if(w[u][v]!=0){//there is edge from u->v
if(distance[v] > distance[u] + w[u][v]){
throw new IllegalStateException("Graph has negative cycle!");
}
}
}
}
return distance;
}

Find repeating sub-sequences in a string

This is a problem given by Peng Li (http://allenlipeng47.com/PersonalPage/index/list/1/nkey)
Given a string , find if there is any sub-sequence that repeats itself. Here, sub-sequence can be a non-contiguous pattern, with the same relative order. 


1. abab <------yes, ab is repeated
2. abba <---- No, a and b follow different order
3. acbdaghfb <-------- yes there is a followed by b at two places
4. abcdacb <----- yes a followed by b twice
5. aacc< ----yes, aacc, the underline ‘ac’ and red ‘ac’ are the repeating sub-sequences.

We can reduce this problem to the problem of finding common sub-sequence amount two strings, in this case, the two strings are identical. Also note the diagonal values of the DP table will be the string itself or identical sub-sequences which we should not consider.

For example:

negative example: abba

before and after we set diagonal values to 0


len(sub-seq)
a
b
b
a
a
1
1
1
2
b
1
2
2
2
b
1
2
3
3
a
1
2
3
4

len(sub-seq)
a
b
b
a
a
0
0
0
1
b
0
0
1
1
b
0
1
0
1
a
1
1
1
0

positive example: abab



len(sub-seq)
a
b
a
b
a
0
0
1
1
b
0
0
1
2
a
1
1
0
2
b
1
2
2
0





public static boolean findLongestSubsequence(String s1, String s2){
 int len1 = s1.length(), len2 = s2.length();
//part1 build DP table bottom-up
 int[][] table = new int[len1+1][len2+1];
     
 for(int i = 1; i<len1+1; i++){
     for(int j=1; j<len2+1; j++){
          if(i!=j){ <=== can not consider the subsequence sharing the same characters
          if(s1.charAt(i-1) == s2.charAt(j-1))
            table[i][j] = table[i-1][j-1]+1;
          else
            table[i][j] = Math.max(table[i-1][j], table[i][j-1]);        
          }
     }
 } 

  for(int i = 0; i<len2+1; i++){
     if(table[len1][i]>1)
        return true;
  }  

return false;

}


Sunday, February 15, 2015

Find single-source shortest paths: Dijkstra algorithm implementation

A generic shortest path algorithm to find shortest path from single source vertex for graph is as following:

Given Graph(V, E, W), V is the set of vertices, E is the set of edges, W is the set of weights for each edge, and source vertex s.

distance[] array contains the distance from any vertex to source vertex s
parent[] array contains the previous vertex to any vertex on a given shortest path

1. initialize
 for any v:
  d[v] = infinity, parent[v] = null
d[s] = 0

2. relaxation
flag = true
while(flag)
 flag = false
 for any edge u, v
   if(d[v] > d[u] + w(u, v)){
    d[v] = d[u] + w(u, v)
    parent[v] = u
    flag = true
   }


Couple optimized shortest path algorithm exists to implement different ways of relaxation and make d[] converge fast. Dijkstra algorithm is one of them. It can handle any graph even with cycles, but all of the weights have to be non-negative.

public class ShortestPathQ {
/*
* Dijkstra algorithm for any graph with positive weights
* the algorithm essentially is BFS using a priority queue
*/
/*
* here i assume graph is represented as adjacent matrix, D
* vertex is represented as number of 0, 1, ..., n-1
* the value at each cell D[i][j] represent weight of the edge
* from vertex i to j
*/
//a custom PQ to hold the list of vertex ordered by their distances from source
class PQ {
//the binary heap represented as array
int[] heap;
//keep track of vertex position in the heap
int[] indices;
int[] distance;
int size;
//time complexity O(size)
PQ(int s, int[] d){
size = s;
heap = new int[s];
distance = d;
heapify();
}
//time complexity O(size)
void heapify(){
for(int i = 0; i<size; i++){
heap[i] = i;
indices[i] = i;
}
for(int i=size/2; i>=0; i--)
bubbleDown(i);
}
//time complexity O(lg(size))
void bubbleDown(int index){
int target = index;
int left = index*2+1;
if(left<size && distance[left] < distance[target])
target = left;
int right = index*2+1;
if(right<size && distance[right] < distance[target])
target = right;
if(target!=index){
swap(target, index);
bubbleDown(target);
}
}
void swap(int i, int j){
int tem = heap[i];
heap[i] = heap[j];
heap[j] = tem;
tem = indices[i];
indices[i] = indices[j];
indices[j] = tem;
}
//time complexity O(lg(size))
void bubbleUp(int index){
int target = index;
int parent = (index-1)/2;
if(parent>=0 && distance[parent] > distance[target])
target = parent;
if(target!=index){
swap(target, index);
bubbleUp(target);
}
}
int min(){
return heap[0];
}
//time complexity O(lg(size))
int removeMin(){
int i = heap[0];
heap[0] = heap[--size];
bubbleDown(0);
return i;
}
int indexOf(int vertex){
return indices[vertex];
}
boolean isEmpty(){
return size == 0;
}
}
class Graph {
int[][] w;
int size;
Graph(int size){
this.size = size;
w = new int[size][size]; 
}
/*
* undirected graph representation
* weight zero means no edge between two vertex
*/
void addEdge(int i, int j, int we){
w[i][j] = we;
w[j][i] = we;
}
/* given vertex, dijkstra algorithm will calculate 
* shortest path from it to any vertex in the graph 
*/
int[] dijkstra(int source){

  int[] distance;
int[] parent;
//initialize
distance = new int[this.size];
parent = new int[this.size];
for(int i=0;i<this.size; i++){
distance[i] = Integer.MAX_VALUE;
parent[i] = -1;
}
distance[source] = 0;
parent[source] = -1;
PQ queue = new PQ(this.size, distance);
while(!queue.isEmpty()){
int u = queue.removeMin();
for(int v = 0; v < size; v++){
if(w[u][v] != 0){//there is edge from u to v
//relaxation O(lgsize)
if(distance[v] > distance[u] + w[u][v]){
distance[v] = distance[u] + w[u][v];
parent[v] = u;
queue.bubbleUp(queue.indexOf(v));
}
}
}
}  

return distance;
}



//PQ implemented in java PriorityQueue but it doesn't have optimized update() or remove()
// method, time complexity is O(size) at relaxation step
int[] dijkstraUsingJavaPQ(int source){
final int[] distance;
int[] parent;
//initialize
distance = new int[this.size];
parent = new int[this.size];
for(int i=0;i<this.size; i++){
distance[i] = Integer.MAX_VALUE;
parent[i] = -1;
}
distance[source] = 0;
parent[source] = -1;
PriorityQueue<Integer> queue = new PriorityQueue<Integer>(this.size, new Comparator<Integer>(){
@Override
public int compare(Integer i, Integer j){
if(distance[i] == distance[j])
return 0;
else if(distance[i] > distance[j])
return 1;
else
return -1;
}
});
for(int i = 0; i<size; i++)
queue.add(i);
while(!queue.isEmpty()){
int u = queue.remove();
for(int v = 0; v < size; v++){
if(w[u][v] != 0){//there is edge from u to v
//relaxation O(size)
if(distance[v] > distance[u] + w[u][v]){
distance[v] = distance[u] + w[u][v];
parent[v] = u;
queue.remove(v);//remove by equals()
queue.add(v);
}
}
}
}
return distance;
}




}

}