Utilizing Fuzzy Logic for Gene Sequence Construction from
Sub Sequences and Characteristic Genome Derivation and Assembly
Rapid developments in molecular biology technologies over the past
few years have changed our understanding of the evolution of bacteria.
Despite the amount of research done in these areas, there is a lack of
information about the properties of globally distributed microbes.
Modern genomic methods have been used only to a small extent in genome
sequencing. While these approaches are extremely informative when
examining a single member of a community, they are not practical for
community-level analysis. These preliminary studies provide a major
drive for research that can only be accomplished by high-throughput
genomic approaches.
Bacteria can often have minor variations in their DNA that result in
different metabolic characteristics. These differences can make it
difficult to classify bacteria taxonomically and thus to create clusters
of related organisms with similar metabolic characteristics. What has
been needed is a method of creating a characteristic representation (a
characteristic genome) from the subsequences of DNA found in several
subvariants of the same bacterial species. Such a genome could be used
for more efficient classification at the molecular level through a
process of controlled generalization. The question has been how to
achieve this in an intuitive fashion that does not make black-and-white
decisions about how to assemble a characteristic genome from n
subsequences derived from m individuals.
Several methods have been studied for sequence assembly. Techniques such
as neural networks, hidden Markov models, and Bayesian networks are
computationally expensive and require high-performance computing and
extensive training. Currently, most sequence applications do not tolerate
any inexactness or error in subsequence matching. String matching in
nucleotide sequences is challenged by variation because matching offers
few graded concepts such as LIKE, NOT LIKE, or SIMILAR.
Matching on symbolic sequential data can be either (1) exact or (2)
approximate (the most similar match). Quite often in real-world data
mining applications, especially in molecular biology, exact patterns do
not exist, and an approximate matching algorithm is therefore required.
Hence an algorithm that matches to a certain degree is desired.
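As an illustration of matching "to a certain degree", the following
minimal sketch scores two nucleotide strings on a scale from 0 to 1. It
assumes Levenshtein edit distance as the underlying measure and
normalizes by the longer string's length; both choices, and the function
names edit_distance and match_degree, are assumptions made for this
example rather than the method of this project.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def match_degree(a: str, b: str) -> float:
    """Degree of match in [0, 1]; 1.0 means the strings are identical.

    Normalizing by the longer length is an assumption of this sketch.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


if __name__ == "__main__":
    # One substitution in eight bases gives a degree of 0.875.
    print(match_degree("ACGTACGT", "ACGTTCGT"))
```

A caller can then accept or rank candidate matches by this degree rather
than keeping only exact matches.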
Fuzzy logic has been used extensively in approximate string matching,
for example with distance measures. However, very little work has been
done on applying it to building genomes from subsequences of
nucleotides. Moreover, this process becomes computationally expensive
because multiple comparisons have to be performed for each possible
string pair. The accuracy of any fuzzy matching system is partially
determined by the error model used. An accurate system reflects the
mechanism responsible for the variations in the match. Hence a flexible
error metric that is generic to any fuzzy matching is desired.
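One way to picture a flexible error model is a weighted edit distance
whose costs are parameters rather than constants. The sketch below is an
assumption-laden illustration, not the metric proposed here: the
ErrorModel class, its default costs, and the choice to make transitions
(A<->G, C<->T) cheaper than transversions are all hypothetical.

```python
from dataclasses import dataclass

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


@dataclass(frozen=True)
class ErrorModel:
    """Hypothetical, tunable costs for each kind of mismatch."""
    transition: float = 0.5    # A<->G or C<->T
    transversion: float = 1.0  # purine <-> pyrimidine
    indel: float = 1.0         # insertion or deletion

    def substitution(self, x: str, y: str) -> float:
        if x == y:
            return 0.0
        same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
        return self.transition if same_class else self.transversion


def weighted_distance(a: str, b: str, model: ErrorModel) -> float:
    """Edit distance where each operation is priced by the error model."""
    prev = [j * model.indel for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * model.indel]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + model.indel,
                            curr[j - 1] + model.indel,
                            prev[j - 1] + model.substitution(ca, cb)))
        prev = curr
    return prev[-1]


def membership(a: str, b: str, model: ErrorModel) -> float:
    """Map the weighted distance to a fuzzy membership value in [0, 1]."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    worst = longest * max(model.transition, model.transversion, model.indel)
    return 1.0 - weighted_distance(a, b, model) / worst


if __name__ == "__main__":
    model = ErrorModel()
    print(membership("ACGT", "ACAT", model))  # G->A is a transition
    print(membership("ACGT", "ACCT", model))  # G->C is a transversion
```

Because the costs live in one object, the same matching code can be
reused with a different error model by passing different parameters.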
Current sequencing methods tend to reject sequences that do not match
with a high degree of similarity. As a result, algorithms can reject
large amounts of data that may otherwise be important in deriving a
genomic sequence and the metabolic characteristics of such a sequence.
Research Goals
i Given a collection of nucleotide sequences, such as those from
multiple organisms, develop techniques based on fuzzy set theory and
other methods for assembling the sequences into the original full genome
for each organism.
ii Using the techniques developed in goal i, develop a generalized approach
for creating a characteristic genome that represents a generalization
of the original organisms that donated sequence data.
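To make the idea of a characteristic genome concrete, here is a minimal
sketch of a graded, position-wise consensus. It is only an illustration
under strong assumptions: the input sequences are taken to be already
aligned to equal length, the membership of a base at a position is simply
its frequency among the individuals, and the 0.6 threshold and the
function names are hypothetical. It is not the approach the goals above
set out to develop.

```python
from collections import Counter


def column_memberships(aligned: list[str]) -> list[dict[str, float]]:
    """For each position, the fraction of individuals carrying each base."""
    if not aligned:
        raise ValueError("at least one sequence is required")
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("sequences must be pre-aligned to equal length")
    profiles = []
    for column in zip(*aligned):
        counts = Counter(column)
        profiles.append({base: n / len(aligned) for base, n in counts.items()})
    return profiles


def characteristic_sequence(aligned: list[str], threshold: float = 0.6) -> str:
    """Keep the best-supported base per position, or 'N' if support is weak."""
    out = []
    for profile in column_memberships(aligned):
        base, degree = max(profile.items(), key=lambda kv: kv[1])
        out.append(base if degree >= threshold else "N")
    return "".join(out)


if __name__ == "__main__":
    individuals = ["ACGT", "ACGA", "ACCT"]
    print(characteristic_sequence(individuals))       # ACGT
    print(characteristic_sequence(individuals, 0.7))  # ACNN
```

Raising the threshold makes the generalization stricter, and keeping the
per-position memberships avoids discarding minority variants outright.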