General Issues and My Implementation
====================================
1. Longest common subsequence or longest common substring? (Currently using substring.)
2. Should the overlap be at the edges of both subsequences? If so, ignore all subsequences whose overlap falls in the middle of a subsequence. If not, how do we merge them? (Currently: overlap at the edges.)
3. When selecting the subsequences to merge, should we choose the pair with the longer overlap or the pair that produces the longer merged subsequence? (Currently the pair with the longest overlap is selected.)
4. What is the acceptable minimum overlap size? (Right now even an overlap of 1 is accepted.)
5. Should the old, unused sequences be kept for the assembly process? If exact matching is used, the sequences that were merged are already present in the merged result, and the sequences that were not used may not be useful, since we do an exhaustive search. (Keeping them might be useful for later methods.)
6. Will the direction in which assembly is done matter? (At least not if we use exact matching.)
7. Is a greedy search with exact matching (also known as pairwise matching/alignment) good, or rather, is it even feasible? That is, comparing each subsequence with every other one and finding their merged sequences?
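
To make questions 3, 4 and 7 concrete, here is a minimal Python sketch of a greedy all-pairs assembly that always merges the pair with the longest exact overlap. It is an illustration only, not the code in this repository: the names ``overlap_len``, ``greedy_assemble`` and the ``min_overlap`` parameter are hypothetical, and reads fully contained inside other reads are not handled.

```python
from itertools import permutations


def overlap_len(left, right, min_overlap=1):
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 if no overlap of at least `min_overlap` characters exists.
    """
    shortest = min(len(left), len(right))
    # Try the longest candidate first so the first hit is the best one.
    for length in range(shortest, max(min_overlap, 1) - 1, -1):
        if left.endswith(right[:length]):
            return length
    return 0


def greedy_assemble(reads, min_overlap=1):
    """Repeatedly merge the ordered pair with the longest exact overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_len, best_pair = 0, None
        # Exhaustive comparison of every ordered pair: quadratic per round.
        for i, j in permutations(range(len(reads)), 2):
            length = overlap_len(reads[i], reads[j], min_overlap)
            if length > best_len:
                best_len, best_pair = length, (i, j)
        if best_pair is None:
            break  # nothing left that overlaps enough
        i, j = best_pair
        merged = reads[i] + reads[j][best_len:]
        # The two used reads are discarded here; question 5 asks whether
        # they should be kept instead.
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads


print(greedy_assemble(["ACGTAC", "TACGGA", "GGATTT"]))
```

Each round compares every ordered pair, so the cost grows quadratically with the number of sequences per round, which is the feasibility concern raised in question 7. Raising ``min_overlap`` is one way to experiment with question 4.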
My Implementation
=================
One sequence is compared with the rest in the database. An overlap is defined as a suffix of one subsequence that matches a prefix of the other. The two are then combined. (This process has limitations that I am ignoring for now.)
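
The suffix/prefix definition above can be sketched in a few lines of Python. This is a hedged illustration under the assumption of exact string matching; the function names ``suffix_prefix_overlap`` and ``merge`` are made up for this example and may not match the actual code.

```python
def suffix_prefix_overlap(left, right):
    """Return the longest suffix of `left` that is also a prefix of `right`."""
    for length in range(min(len(left), len(right)), 0, -1):
        if left[-length:] == right[:length]:
            return left[-length:]
    return ""  # no overlap at the edges


def merge(left, right):
    """Combine two sequences on their edge overlap, or return None."""
    overlap = suffix_prefix_overlap(left, right)
    if not overlap:
        return None
    # Keep `left` whole and append the part of `right` after the overlap.
    return left + right[len(overlap):]


print(merge("ACGTAC", "TACGGA"))
```

Note that the check is directional: ``merge(a, b)`` and ``merge(b, a)`` can give different results, so comparing one sequence against the rest means trying both orders.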
What can the code do?
1. Need a better way of evaluating results than looking at the strings. Currently, string comparison is used to evaluate the matched strings.
2. Solved the problem with LCS.
3.