General Design Issues

==================

 

1. Should matching use the longest common subsequence (non-contiguous) or the longest common substring (contiguous)?

 

2. Should the overlap be at the edges of both subsequences? If so, ignore all subsequences whose overlap falls in the middle. If not, how do we merge them?

 

3. When selecting the subsequences to merge, should we choose the pairs with the longer overlap or the pairs that produce the longer merged sequence?

 

4. What is the acceptable minimum size of overlap?

 

5. Should the new merged sequences be used in the assembly process?

 

6. Will the direction in which assembly is done matter?

 

7. Is a greedy search for exact matches (also known as pairwise matching/alignment) a good approach, or rather, is it even feasible? That is, comparing each subsequence with every other one and finding their merged sequences.

 

8. I am confused by the amount of information and the many different approaches, and I am still trying to understand the problem, so this list might grow as I learn more :)

 

 

My Implementation

===============

 

I am trying to follow the explanation of assembly in [insert the link]. One sequence is compared with all the others in the database. An overlap is defined as a suffix of one subsequence that is also a prefix of another. The two are then combined into a single sequence. (This process has limitations that I am ignoring for now.)
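The suffix/prefix overlap and merge described above can be sketched roughly as follows. This is only a minimal illustration, not the code from sequencelcsv1.cpp: the function names suffixPrefixOverlap and mergeOnOverlap and the minOverlap parameter are hypothetical, and the brute-force scan over candidate lengths is an assumption made for clarity rather than speed.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Length of the longest suffix of `a` that is also a prefix of `b`.
// (Hypothetical helper; tries the longest candidate first.)
std::size_t suffixPrefixOverlap(const std::string& a, const std::string& b) {
    const std::size_t maxLen = std::min(a.size(), b.size());
    for (std::size_t len = maxLen; len > 0; --len) {
        // compare(...) == 0 means the last `len` chars of a equal the first `len` chars of b.
        if (a.compare(a.size() - len, len, b, 0, len) == 0) {
            return len;
        }
    }
    return 0;
}

// Merge `a` and `b` when they overlap by at least `minOverlap` characters.
// Returns an empty string when the overlap is too short (an assumed convention).
std::string mergeOnOverlap(const std::string& a, const std::string& b,
                           std::size_t minOverlap) {
    const std::size_t len = suffixPrefixOverlap(a, b);
    if (len < minOverlap) {
        return std::string();
    }
    // Keep all of `a`, then append the part of `b` that follows the overlap.
    return a + b.substr(len);
}
```

With this sketch, merging "hello world" and "world is good" with a minimum overlap of 3 would give "hello world is good". The minOverlap threshold is one possible answer to the question about the acceptable minimum size of overlap.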

 

What can the code do?

  1. Generate subsequences.
  2. Find the LCS (non-contiguous).
  3. Find the LCS (contiguous). Exception: this has a problem that I noticed later and have not fixed yet (see Known Limitations).
  4. Merge two matching sequences. This has to be extended to merge more than two sequences.
  5. Local exact assembly/matching of strings using the LCS.
  6. Source code: sequencelcsv1.cpp
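For reference, the non-contiguous LCS step in the list above is usually done with the textbook dynamic-programming table followed by a traceback. The sketch below assumes that standard approach; the function name longestCommonSubsequence is hypothetical and the code is not taken from sequencelcsv1.cpp.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Longest common subsequence of `a` and `b` (characters need not be adjacent).
std::string longestCommonSubsequence(const std::string& a, const std::string& b) {
    const std::size_t n = a.size();
    const std::size_t m = b.size();
    // len[i][j] = LCS length of the first i chars of a and the first j chars of b.
    std::vector<std::vector<std::size_t>> len(n + 1, std::vector<std::size_t>(m + 1, 0));
    for (std::size_t i = 1; i <= n; ++i) {
        for (std::size_t j = 1; j <= m; ++j) {
            if (a[i - 1] == b[j - 1]) {
                len[i][j] = len[i - 1][j - 1] + 1;
            } else {
                len[i][j] = std::max(len[i - 1][j], len[i][j - 1]);
            }
        }
    }
    // Walk back from the bottom-right corner to recover one LCS.
    std::string result;
    std::size_t i = n;
    std::size_t j = m;
    while (i > 0 && j > 0) {
        if (a[i - 1] == b[j - 1]) {
            result.push_back(a[i - 1]);  // matched character, collected in reverse order
            --i;
            --j;
        } else if (len[i - 1][j] >= len[i][j - 1]) {
            --i;
        } else {
            --j;
        }
    }
    std::reverse(result.begin(), result.end());
    return result;
}
```

Note that the traceback is free to pick matching characters that are far apart in either string, which is fine for a subsequence but is not enough on its own to recover a contiguous match.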

 

 

 

 

 

Known Limitations:

 

1. Need a better way of evaluating the results than inspecting the strings by eye. (Add code to do it.)

2. There is a problem with my implementation of LCS for exact matches:

 

Example:

S1='hello world'

S2='world is good'

 

The problem is that the last 'd' of 'world' is missed when we do an exact match: the traceback moves up and right one level because of the match with the other 'd', the one from 'good'. This works for the LCS (non-contiguous; it simply uses the 'd' from 'good'), but it fails for my extension that finds the exact match, which I plan to correct ASAP. I should look for better methods of finding the common substring from the LCS.

In simple words, 'worl' is reported as the common substring instead of 'world'.
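One common alternative to deriving the substring from the LCS traceback is to compute the longest common substring directly with its own table, where each cell holds the length of the matching run that ends at that pair of characters and is reset to zero on a mismatch. The sketch below assumes that technique; the function name longestCommonSubstring is hypothetical and this is not the fix that will go into sequencelcsv1.cpp.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Longest common substring (contiguous run) of `a` and `b`.
std::string longestCommonSubstring(const std::string& a, const std::string& b) {
    const std::size_t n = a.size();
    const std::size_t m = b.size();
    // Two rolling rows: prev[j] is the run length ending at the previous char of a and b[j-1].
    std::vector<std::size_t> prev(m + 1, 0);
    std::vector<std::size_t> curr(m + 1, 0);
    std::size_t bestLen = 0;
    std::size_t bestEnd = 0;  // one past the last index of the best run in `a`
    for (std::size_t i = 1; i <= n; ++i) {
        for (std::size_t j = 1; j <= m; ++j) {
            if (a[i - 1] == b[j - 1]) {
                curr[j] = prev[j - 1] + 1;  // extend the run that ended diagonally before
                if (curr[j] > bestLen) {
                    bestLen = curr[j];
                    bestEnd = i;
                }
            } else {
                curr[j] = 0;  // a mismatch breaks the contiguous run
            }
        }
        std::swap(prev, curr);
    }
    return a.substr(bestEnd - bestLen, bestLen);
}
```

For S1 and S2 above, this sketch returns 'world': the 'd' of 'good' only starts a run of length 1, so it cannot displace the run of length 5 that ends at the 'd' of 'world'.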