Issues

Report 2

General Issues and my implementation

==========================

1. Longest common subsequence or substring? (Using substring.)

2. Should the overlap be at the edges of both subsequences? If so, ignore all subsequences whose overlap falls in the middle. If not, how do we merge them? (Currently: overlap at the edges.)

3. When selecting the subsequences to merge, should we choose the pair with the longer overlap or the pair that produces the longer merged sequence? (The pair with the longest overlap is selected.)

4. What is the acceptable minimum size of overlap? (Right now even an overlap of length 1 is accepted.)

5. Should the old, unused sequences be kept in the assembly process? If exact matching is used, the sequences that were merged are already present in the merged result, but the sequences that were not used may not be useful, since we do an exhaustive search. (Keeping them might be useful for later methods.)

6. Will the direction in which assembly is done matter? (Not if we use exact matching, at least.)

7. Is greedy search for exact matches (also known as pairwise matching/alignment) good, or rather, is it feasible? That is, comparing each subsequence with every other one and finding their merged sequences.
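To make questions 3, 4 and 7 concrete, here is a minimal sketch of a greedy all-pairs assembly loop with exact matching. It is not taken from sequencelcsv3.cpp: the names `suffixPrefixOverlap`, `greedyAssemble` and `minOverlap` are hypothetical. It assumes the pair with the longest overlap is merged first, and it drops the two merged inputs instead of preserving them.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: length of the longest suffix of `a` that is also
// a prefix of `b`, using exact character matching.
static std::size_t suffixPrefixOverlap(const std::string& a, const std::string& b) {
    for (std::size_t k = std::min(a.size(), b.size()); k > 0; --k) {
        if (a.compare(a.size() - k, k, b, 0, k) == 0) return k;
    }
    return 0;
}

// Greedy assembly sketch: repeatedly merge the ordered pair with the longest
// overlap until no pair overlaps by at least `minOverlap` characters.
// Every pass compares each sequence with every other one, so a single pass
// already costs a quadratic number of comparisons.
std::vector<std::string> greedyAssemble(std::vector<std::string> reads,
                                        std::size_t minOverlap) {
    assert(minOverlap >= 1);  // an overlap of 0 would just concatenate
    while (reads.size() > 1) {
        std::size_t best = 0, bi = 0, bj = 0;
        for (std::size_t i = 0; i < reads.size(); ++i) {
            for (std::size_t j = 0; j < reads.size(); ++j) {
                if (i == j) continue;
                const std::size_t k = suffixPrefixOverlap(reads[i], reads[j]);
                if (k > best) { best = k; bi = i; bj = j; }
            }
        }
        if (best < minOverlap) break;  // nothing left worth merging
        // Write the shared characters only once.
        std::string merged = reads[bi] + reads[bj].substr(best);
        // Erase the larger index first so the smaller one stays valid.
        reads.erase(reads.begin() + static_cast<std::ptrdiff_t>(std::max(bi, bj)));
        reads.erase(reads.begin() + static_cast<std::ptrdiff_t>(std::min(bi, bj)));
        reads.push_back(merged);
    }
    return reads;
}
```

Calling `greedyAssemble({"ACGTAC", "TACGGA", "GGATTT"}, 3)` leaves a single merged sequence. Raising `minOverlap` is one way to answer question 4: short, possibly accidental overlaps are then left unmerged.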

My Implementation

===============

One sequence is compared with the rest in the database. An overlap is defined as a suffix of one subsequence that is also a prefix of another. The two are then combined at the overlap. (There are limitations of this process that I am ignoring right now.)
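The suffix/prefix definition above can be sketched for a single pair of strings as follows. This is only an illustration under the exact-matching assumption. The function names `overlapLength` and `mergeAtOverlap` are made up for this example and are not the ones used in my source files.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Length of the longest suffix of `a` that is also a prefix of `b`.
// Returns 0 when the two strings do not overlap at the edges.
std::size_t overlapLength(const std::string& a, const std::string& b) {
    const std::size_t maxLen = std::min(a.size(), b.size());
    for (std::size_t k = maxLen; k > 0; --k) {
        // Compare the last k characters of a with the first k characters of b.
        if (a.compare(a.size() - k, k, b, 0, k) == 0) {
            return k;
        }
    }
    return 0;
}

// Combine a and b, writing the k shared characters only once.
std::string mergeAtOverlap(const std::string& a, const std::string& b, std::size_t k) {
    assert(k <= b.size());  // k must come from overlapLength(a, b)
    return a + b.substr(k);
}
```

For example, "ACGTAC" and "TACGGA" share "TAC" at the edges, so the overlap length is 3 and the merged string is "ACGTACGGA". Overlaps in the middle of a string are not detected by this check.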

What can the code do?

Generate subsequences.
Find LCS (non-contiguous).
Find LCS (contiguous).
Merge two matching sequences at a time. We preserve the original subsequences if they are not merged with any other subsequence, so that they can be used later.
Local exact assembly/matching.
Source code: sequencelcsv1.cpp (this only merged two strings and had some limitations)
Improved code: sequencelcsv3.cpp (merges all possible substrings)
Currently the program can be executed on the command prompt "
Example of output genome.html
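The difference between the two LCS variants listed above can be sketched with the standard dynamic-programming formulations. This is a generic textbook version, not the code in sequencelcsv3.cpp, and the function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Non-contiguous LCS: length of the longest common subsequence.
// Matching characters must appear in the same order but need not be adjacent.
std::size_t lcsSubsequenceLength(const std::string& a, const std::string& b) {
    std::vector<std::vector<std::size_t>> dp(
        a.size() + 1, std::vector<std::size_t>(b.size() + 1, 0));
    for (std::size_t i = 1; i <= a.size(); ++i) {
        for (std::size_t j = 1; j <= b.size(); ++j) {
            dp[i][j] = (a[i - 1] == b[j - 1])
                           ? dp[i - 1][j - 1] + 1
                           : std::max(dp[i - 1][j], dp[i][j - 1]);
        }
    }
    return dp[a.size()][b.size()];
}

// Contiguous LCS: the longest common substring.
// cur[j] holds the length of the common suffix of a[0..i) and b[0..j).
std::string longestCommonSubstring(const std::string& a, const std::string& b) {
    std::vector<std::size_t> prev(b.size() + 1, 0), cur(b.size() + 1, 0);
    std::size_t bestLen = 0, bestEnd = 0;  // bestEnd is an end position in a
    for (std::size_t i = 1; i <= a.size(); ++i) {
        for (std::size_t j = 1; j <= b.size(); ++j) {
            cur[j] = (a[i - 1] == b[j - 1]) ? prev[j - 1] + 1 : 0;
            if (cur[j] > bestLen) { bestLen = cur[j]; bestEnd = i; }
        }
        std::swap(prev, cur);
    }
    assert(bestEnd >= bestLen);
    return a.substr(bestEnd - bestLen, bestLen);
}
```

For "ACGT" and "AGCT" the non-contiguous LCS has length 3 (for instance "ACT"), while the longest common substring has length 1. That gap is why the choice in question 1 matters for merging.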

Implementation Observations/ Issues:

1. Need a better way of evaluating the results than looking at the strings. Right now string comparison is used to evaluate the matched strings.

2. Solved the problem with LCS.