Friday, February 13, 2015

Find the longest common subsequence in two strings

This is classic dynamic programming problem.
Given two strings, find the longest common subsequence in the two strings. For example: input: ABCD and BDACE, output: AC or BC. Note it could have multiple answers, but we only need to output one of them.

In the MIT 6.006 course, the dynamic programming were introduced very nicely in the following steps:
1. define subproblem
2. guess
3. related subproblem solutions
4. recurse + memoize OR build DP table bottom-up
5. solve original problem

So let's use the above technique to solve today's problem. First of all, we can reduce the problem to find the length of the longest common subsequence in two strings. Because once we can do that, we can devise our solution from it. Here are the steps:

1. Subproblem will be DP(i, j) where i, j will be the indices of two strings
2. Guess the choice, the choices from DP(i-1, j-1) or DP(i-1, j) or DP(i, j-1).
3. Recurrence, there are two cases:
    1) char of string1 at index i is the same of char of string2 at index j
        DP(i, j) = DP(i-1, j-1) + 1
    2) not the same
        DP(i, j) = Max(DP(i-1, j), DP(i, j-1))
4. Topological order, thinking about DP as 2D matrix, the order will be from top-left corner (0,0) either going right or going left to the bottom-right corner (lenString1, lenString2).
5. Original problem's solution will be DP (lenString1, lenString2).

And once we compute the DP table, then we can do back-tracing from bottom-right corner to find the longest common subsequence.

Here is the code:


public static String findLongestSubsequence(String s1, String s2){
int len1 = s1.length(), len2 = s2.length();
//part1 build DP table bottom-up
int[][] table = new int[len1+1][len2+1];
for(int i = 0; i<len1+1; i++)
    table[i][0] = 0;
     
for(int i = 0; i<len2+1; i++)
    table[0][i] = 0;
     
for(int i = 1; i<len1+1; i++){
    for(int j=1; j<len2+1; j++){
        if(s1.charAt(i-1) == s2.charAt(j-1))
            table[i][j] = table[i-1][j-1]+1;
        else
            table[i][j] = Math.max(table[i-1][j], table[i][j-1]);         
    }
}    
//part 2 backtracing   
StringBuilder sb = new StringBuilder(); 
int i = len1, j = len2;
while(i>0 && j>0){
    if(s1.charAt(i-1) == s2.charAt(j-1)){
        sb.append(s1.charAt(i-1));
        i--; j--;   
    }else if(table[i][j] == table[i-1][j])
        i--;
    else
        j--;    
}

return sb.reverse().toString();

}



3 comments:

  1. To me, DP is very difficult. My experience is that how to define the subproblem, formula is most important and difficult part. I only only had one time, that I really came up with the solution by myself. It is the one DP problem I shared with you.

    ReplyDelete
    Replies
    1. Yes, the hardest part of DP is to define the subproblem(s) and recursion part. In some situation it is easy, like string-related DP, most of time DP is a 1D array or 2D table; in tricky situation, like the segment tree problem we discussed before Maximum sum range query using segment tree and Minimum cost of sorting an array of positive numbers.

      Delete
  2. A similar problem to "Find the longest common subsequence in two strings" is "find if there is any sub-sequence that repeats itself". I would like to share with you:

    Given a string (1-d array) , find if there is any sub-sequence that repeats itself. Here, sub-sequence can be a non-contiguous pattern, with the same relative order. 

    1. abab <------yes, ab is repeated 
    2. abba <---- No, a and b follow different order 
    3. acbdaghfb <-------- yes there is a followed by b at two places 
    4. abcdacb <----- yes a followed by b twice 
    5. aacc< ----yes, aacc, the underline ‘ac’ and red ‘ac’ are the repeating sub-sequences.

    ReplyDelete