Bioinformatics Research at

Department of Computer Science

University of Nevada, Reno

Utilizing Fuzzy Logic for Gene Sequence Construction from Sub Sequences and Characteristic Genome Derivation and Assembly

The rapid development of molecular biological technologies over the past few years has changed our understanding of bacterial evolution. Despite the amount of research in these areas, there is still a lack of information about the properties of globally distributed microbes. Modern genomic methods have so far been applied to genome sequencing only to a small extent. While these approaches are extremely informative when examining a single member of a community, they are not practical for community-level analysis. These preliminary studies provide a major drive for research that can only be accomplished with high-throughput genomic approaches.


Bacteria often have minor variations in their DNA that result in different metabolic characteristics. These differences make it difficult to classify bacteria taxonomically and thus to create clusters of related organisms with similar metabolic characteristics. What has been needed is a method of creating a characteristic representation (a characteristic genome) from the subsequences of DNA found in several subvariants of a bacterium of the same species. Such a genome could be used for more efficient classification at the molecular level through a process of controlled generalization. The question has been how to achieve this in an intuitive fashion that does not make black-and-white decisions about how to assemble a characteristic genome from n subsequences derived from m individuals.


Several methods have been studied for sequence assembly. Techniques such as neural networks, hidden Markov models, and Bayesian networks are computationally expensive and require high-performance computing and extensive training. Currently, most sequence applications do not tolerate any inexactness or error in subsequence matching. String matching in nucleotide sequences is challenged by variation because matching offers few concepts such as LIKE, NOT LIKE, or SIMILAR.


Matching on symbolic sequential data can be either (1) exact or (2) approximate (the most similar match). Quite often in real-world data mining applications, especially in molecular biology, exact patterns do not exist, and an approximate matching algorithm is therefore required. Hence, an algorithm that matches to a certain degree is desired.
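As an illustration of matching "to a certain degree", the sketch below scores two nucleotide strings on a scale from 0 to 1 instead of returning a yes/no answer. It assumes Levenshtein edit distance as the distance measure and normalizes by the longer string's length; both choices, and the function names edit_distance and match_degree, are assumptions made for this example and are not taken from the project.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def match_degree(a: str, b: str) -> float:
    """Degree of match in [0, 1]: 1.0 for identical strings,
    0.0 when every position would have to be edited."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings match exactly
    return 1.0 - edit_distance(a, b) / longest


if __name__ == "__main__":
    print(match_degree("ACGT", "ACGT"))  # exact match
    print(match_degree("ACGT", "AGGT"))  # one substitution in four bases
```

Under these assumptions an exact match is simply the special case where the degree equals 1.0, so the same function covers both kinds of matching described above.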

Fuzzy logic has been used extensively in approximate string matching, for example with distance measures. However, very little work has been done on applying it to building genomes from subsequences of nucleotides. Moreover, this process becomes computationally expensive because multiple comparisons have to be performed for each possible string pair. The accuracy of any fuzzy matching system is partially determined by the error model used. An accurate system reflects the mechanism responsible for the variations in the match. Hence, a flexible error metric that is generic to any fuzzy matching is desired.

Current sequencing methods tend to reject sequences that do not match with a high degree of similarity. As a result, algorithms can discard large amounts of data that may otherwise be important in deriving a genomic sequence and the metabolic characteristics of such a sequence.
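To make the contrast with hard rejection concrete, the following sketch grades how well the end of one subsequence overlaps the start of another, returning a degree of overlap instead of discarding imperfect pairs. It assumes a substitution-only error model (the fraction of agreeing positions) and a hypothetical minimum overlap length; the names overlap_degree and best_overlap are illustrative and do not describe the project's actual method.

```python
def overlap_degree(left: str, right: str, length: int) -> float:
    """Fraction of agreeing positions when the last `length` bases of
    `left` are laid over the first `length` bases of `right`.
    Assumes 1 <= length <= min(len(left), len(right))."""
    suffix, prefix = left[-length:], right[:length]
    matches = sum(1 for x, y in zip(suffix, prefix) if x == y)
    return matches / length


def best_overlap(left: str, right: str, min_length: int = 3):
    """Return (length, degree) of the best-scoring suffix/prefix overlap.
    Nothing is rejected outright; a weak overlap just gets a low degree.
    Returns (0, 0.0) if the strings are shorter than min_length."""
    best = (0, 0.0)
    for length in range(min_length, min(len(left), len(right)) + 1):
        degree = overlap_degree(left, right, length)
        # Prefer the higher degree; break ties toward the longer overlap.
        if (degree, length) > (best[1], best[0]):
            best = (length, degree)
    return best


if __name__ == "__main__":
    print(best_overlap("ACGTAC", "TACGGA"))  # perfect 3-base overlap
    print(best_overlap("AACGTT", "CGTAGG"))  # 4-base overlap, one mismatch
```

The loop compares every admissible overlap length for a single pair, which hints at why the cost grows quickly once every possible string pair has to be examined. A different error model (for example one that also allows insertions and deletions) could be swapped into overlap_degree without changing best_overlap.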

Research Goals

i Given a collection of nucleotide sequences, for example from multiple organisms, develop techniques based on fuzzy set theory and other methods for assembling the sequences into the original full genome of each organism.
ii Using the techniques developed in goal i, develop a generalized approach for creating a characteristic genome that represents a generalization of the original organisms that donated sequence data.
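
One simple way to picture the characteristic genome of the second goal is as a per-position membership profile built from several individuals. The sketch below is only an illustration under strong assumptions: the input sequences are already assembled, aligned, and of equal length, and membership is taken to be the plain frequency of each base at a position. The names position_memberships and characteristic_sequence, and the threshold parameter, are hypothetical.

```python
from collections import Counter


def position_memberships(aligned):
    """For each column of equal-length aligned sequences, return a dict
    mapping each observed base to its membership degree (its frequency)."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must be aligned to the same length")
    profile = []
    for column in zip(*aligned):
        counts = Counter(column)
        profile.append({base: n / len(aligned) for base, n in counts.items()})
    return profile


def characteristic_sequence(profile, threshold=0.5):
    """Collapse a profile to one string, writing 'N' (the IUPAC code for
    any base) where no base reaches the membership threshold."""
    out = []
    for memberships in profile:
        # Highest degree wins; ties are broken alphabetically so the
        # result is deterministic.
        base, degree = max(memberships.items(), key=lambda kv: (kv[1], kv[0]))
        out.append(base if degree >= threshold else "N")
    return "".join(out)


if __name__ == "__main__":
    individuals = ["ACGT", "ACGA", "ATGT"]
    profile = position_memberships(individuals)
    print(profile)
    print(characteristic_sequence(profile))
    print(characteristic_sequence(profile, threshold=0.7))
```

Keeping the full profile, and not only the collapsed string, preserves the graded information about minority variants, which is the point of avoiding black-and-white decisions during generalization.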
