Group of Mathematical Study of Amino Acid and DNA Sequences

Eugene V. Korotkov
Professor, Dr.Sci. (Biology)
Head of the Group
INB, room 303
Phone +7 (499) 135-21-61
E-Mail bioinf@yandex.ru

RESEARCH DIRECTIONS
  • Development of new mathematical methods for finding multiple alignments of amino acid and nucleotide sequences;
  • Development of new mathematical methods for finding pairwise alignments of amino acid and nucleotide sequences without the use of weight matrices;
  • Development of mathematical methods for finding latent periodicity in DNA or amino acid sequences with deletions and insertions;
  • Study of dispersed repeats in different genomes and amino acid sequences;
  • Study of latent periodicity in different nucleotide and amino acid sequences;
  • Development of mathematical methods for finding change points in genes and proteins;
  • Development of mathematical methods for detecting phase shifts of periodicity in genes from different organisms;
  • Development of mathematical methods for the biological annotation of bacterial genes;
  • Creation of original computer databases;
  • Development of new mathematical methods for time series analysis.

 

KEY ACHIEVEMENTS

The research team develops new mathematical methods for studying amino acid and nucleotide sequences. Calculations are carried out on a computer cluster that the group assembled itself. The group's results have been reported at more than 50 scientific conferences in the Russian Federation and abroad.

Currently, the group consists of 4 researchers and 4 students. Alongside his research, the group leader teaches at the Department of Applied Mathematics of the National Research Nuclear University (MEPhI). The lecture courses “Methods of analyzing character sequences” and “Information theory” are given to mathematics students, and laboratory work is conducted for the course “Methods of analyzing character sequences”. More than 30 graduation projects have been prepared under the guidance of the group's staff.

The group has participated and continues to participate in grants from the Russian Foundation for Basic Research, the Presidium of the Russian Academy of Sciences, Rosnauka, the Ministry of Education and Science, and ISTC.

As a result of the group’s work, MIR repeats were discovered in the genomes of humans, other mammals, and other vertebrates. Prof. E.V. Korotkov is the author of the discovery of MIR repeats in the human genome and in other species. More recently, we have found change points in genes that indicate reading frame shift mutations, points where genes were joined together, and points where DNA fragments were inserted into genes.

A method has been developed for finding multiple alignments of amino acid and nucleotide sequences (an NP-complete problem) without pairwise comparison of sequences and without relying on identical k-words. A multiple alignment is sought as a point in a 4n-dimensional (nucleotide) or 20n-dimensional (amino acid) space, where n is the length of the sequences being aligned. This approach detects multiple alignments that are missed by all previously developed approaches, for example alignments of sequences that have accumulated from 2.5 to 4.4 substitutions per amino acid or nucleotide. The search uses random position weight matrices, special optimization procedures, and two-dimensional dynamic programming.
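To make the picture of an alignment as a point in 4n-dimensional space concrete, here is a minimal Python sketch. It is not the group's algorithm: it only shows how a position weight matrix built from n aligned nucleotide columns has 4 × n entries, and how such a matrix scores a sequence. The function names, the pseudocount, and the uniform background frequencies are assumptions made for illustration.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def position_weight_matrix(aligned, pseudocount=1.0):
    """Build a log-odds matrix (one column per alignment position)
    from equal-length, gap-free nucleotide sequences."""
    n = len(aligned[0])
    if any(len(s) != n for s in aligned):
        raise ValueError("all sequences must have the same length")
    background = 1.0 / len(ALPHABET)  # assumption: uniform background
    total = len(aligned) + pseudocount * len(ALPHABET)
    matrix = []
    for j in range(n):
        counts = Counter(s[j] for s in aligned)
        column = {
            a: math.log(((counts[a] + pseudocount) / total) / background)
            for a in ALPHABET
        }
        matrix.append(column)
    return matrix


def flatten(matrix):
    """View the matrix as a single vector with 4 * n coordinates."""
    return [column[a] for column in matrix for a in ALPHABET]


def score(matrix, seq):
    """Sum the matrix entries selected by the symbols of seq."""
    return sum(column[ch] for column, ch in zip(matrix, seq))


pwm = position_weight_matrix(["ACGT", "ACGA", "ACGT"])
print(len(flatten(pwm)))          # 4 columns x 4 symbols = 16 coordinates
print(score(pwm, "ACGT") > score(pwm, "TTTT"))
```

For amino acid sequences the same construction with a 20-letter alphabet gives 20 × n coordinates. Optimizing over such matrices, as the text describes, is a separate step that this sketch does not attempt.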

The method was applied primarily to the search for hidden periodicity in symbolic sequences. We found periodic structure in a very large number of amino acid sequences and DNA sequences from different genomes. We also constructed a multiple alignment of promoter sequences from the A. thaliana genome. Our method for computing multiple alignments is more powerful than all spectral approaches and approaches based on dynamic programming. Building on these methods, mathematical approaches are being developed for finding pairwise alignments of nucleotide or amino acid sequences without predetermined weight matrices.
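As a simple illustration of what hidden periodicity in a symbolic sequence means, the sketch below scans candidate period lengths and measures the mutual information between a symbol and its position modulo the period. This is a textbook-style baseline written for this page, not the group's method: it does not handle insertions or deletions and applies no statistical significance correction. The function names are hypothetical.

```python
import math
from collections import Counter


def periodicity_information(seq, period):
    """Mutual information (in nats) between the symbol and
    its position modulo `period`."""
    n = len(seq)
    joint = Counter((i % period, ch) for i, ch in enumerate(seq))
    positions = Counter(i % period for i in range(n))
    symbols = Counter(seq)
    info = 0.0
    for (pos, ch), count in joint.items():
        info += (count / n) * math.log(count * n / (positions[pos] * symbols[ch]))
    return info


def scan_periods(seq, max_period):
    """Return {period length: mutual information} for periods 2..max_period."""
    return {p: periodicity_information(seq, p) for p in range(2, max_period + 1)}


scan = scan_periods("ACG" * 20, 5)
print(max(scan, key=scan.get))  # period with the strongest signal
```

In practice the raw value grows with the period length for short sequences, and multiples of the true period also score highly, so a real analysis would compare each value against shuffled sequences before calling a period significant.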

The main scientific achievements of the group:

  • A mathematical algorithm and software have been developed for finding phase shifts of triplet periodicity in genes. A phase shift of the triplet periodicity makes it possible to identify reading frame shift mutations in genes. It has been shown that approximately 10% of all known genes in the database (millions of genes) contain triplet periodicity phase shifts. This is about 20 times more than can be registered experimentally or with any previously developed mathematical method. We conclude that reading frame shift mutations have been widely used by evolution and are far from always fatal for genes. The software is available at: http://victoria.biengi.ac.ru/fsfinder/
  • A mathematical algorithm and software have been developed for finding change points of triplet periodicity in DNA. Change points make it possible to find potential sites where genes were joined together or where a fragment of one gene was inserted into another gene. An analysis of the genes in the KEGG database showed that approximately 20% of the genes contain change points. This indicates that gene fusion and insertion occur about 10 times more often than experimental data or previously developed mathematical methods suggest. This result can be used to create new genes, since change points are natural cut sites in genes that are used by natural processes.
  • A new mathematical method, software, and a Web server have been developed for annotating genes from bacterial genomes. Using 104 bacterial genomes as an example, it was shown that, at the same level of statistical significance, this method annotates about 19% more genes than all previously developed approaches. This result is new and opens the way to using in biotechnology millions of genes that are currently known only as sequences with undefined function. The site http://genefunction.ru has been created, where any user can annotate gene sequences.
  • A mathematical approach has been developed for finding hidden periodicity in symbolic sequences. A large number of DNA sequences from various genomes that have latent periodicity with insertions and deletions of symbols have been identified. These are original results that are not available elsewhere. The software is located on the server: http://victoria.biengi.ac.ru/splinter/login.php
  • A new mathematical approach has been developed for finding multiple alignments of nucleotide sequences. The method finds multiple alignments of sequences with a degree of evolutionary divergence x < 4.4 mutations per nucleotide, whereas all known programs and algorithms can do this only up to x < 2.5. This approach made it possible to obtain multiple alignments of promoter sequences from the genomes of A. thaliana, D. melanogaster, and H. sapiens.
  • A database of sequences from diverse genomes with different kinds of periodicity has been created. The database is located at: http://victoria.biengi.ac.ru/cgi-bin/indelper/index.cgi
  • A database of CDSs with potential reading frame shift mutations has been developed: http://victoria.biengi.ac.ru/cgi-bin/frameshift/index.cgi
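The idea behind a phase shift of triplet periodicity, mentioned in the first item of the list above, can be illustrated with a small Python sketch. It is a simplified toy version written for this page, not the software behind the fsfinder server: it builds a 3 × 4 matrix of base frequencies at the three codon positions and asks which cyclic offset of a second window best matches a reference matrix. The function names and the dot-product similarity are assumptions.

```python
from collections import Counter

ALPHABET = "ACGT"


def triplet_matrix(seq, offset=0):
    """Frequencies of each base at codon positions 0, 1, 2 (a 3 x 4 matrix).
    `offset` shifts the position assigned to the first symbol of seq."""
    counts = [Counter() for _ in range(3)]
    for i, ch in enumerate(seq):
        counts[(i + offset) % 3][ch] += 1
    return [
        [c[a] / max(1, sum(c.values())) for a in ALPHABET]
        for c in counts
    ]


def similarity(m1, m2):
    """Dot product of two triplet matrices (higher means more similar)."""
    return sum(x * y for r1, r2 in zip(m1, m2) for x, y in zip(r1, r2))


def best_phase(reference, window):
    """Offset (0, 1 or 2) at which the window's triplet periodicity
    best matches the reference matrix."""
    return max(
        range(3),
        key=lambda k: similarity(reference, triplet_matrix(window, k)),
    )


upstream = "ATG" * 30                # reference region, in frame
downstream = ("ATG" * 30)[1:]        # same pattern after losing one base
reference = triplet_matrix(upstream)
print(best_phase(reference, upstream))    # 0: no shift
print(best_phase(reference, downstream))  # 1: periodicity is out of phase
```

A non-zero result for a downstream window suggests that the periodicity there is out of phase with the upstream region, which is the signature of a possible reading frame shift. The example assumes the downstream window starts at a position that is a multiple of three relative to the reference; real sequences are far noisier than this toy repeat and need a statistical test rather than a simple maximum.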

INNOVATIVE DEVELOPMENTS

No. | Status | Development | Date | Location | Brief description
1 | Implemented | Web site for bacterial gene annotation | 2014 | http://genefunction.ru | The site gives a list of the most likely biological functions of the nucleotide sequence under study. The effectiveness of annotation is about 19% higher than that of all existing methods at the same number of false positives.
2 | Implemented | Information database «Database of Periodic DNA Regions in Major Genomes» | 2017 | http://victoria.biengi.ac.ru/cgi-bin/indelper/index.cgi | The database contains information on regions with different types of periodicity in various genomes. In eukaryotic genomes these regions occupy on average ~8% of the genome.
3 | Implemented | Web site for finding hidden periodicity in DNA and amino acid sequences with deletions and insertions | 2017 | http://victoria.biengi.ac.ru/splinter/login.php | The site finds hidden periodicity with insertions and deletions in both amino acid and nucleotide sequences.
4 | Implemented | Database of potential reading frame shift mutations in CDSs | 2018 | http://victoria.biengi.ac.ru/cgi-bin/frameshift/index.cgi | The database contains information about potential reading frame shift mutations in a variety of CDSs from eukaryotic genomes. On average, about 23% of CDSs contain such mutations.
5 | Implemented | Web site for finding potential reading frame shift mutations in CDSs | 2018 | http://victoria.biengi.ac.ru/fsfinder/ | The server finds potential reading frame shift mutations in any CDS.

 

RESULTS OF INTELLECTUAL ACTIVITY (patents, utility models, databases, know-how, etc.)
No. | Type | Title | Authors | Date
1 | Database | Database of potential micro- and minisatellite sequences, http://victoria.biengi.ac.ru/mmsat/ | Korotkov E.V., Shelenkov M.A. | 2008
2 | Database | Database of DNA sequences with hidden periodicity, http://victoria.biengi.ac.ru/lp/ | Korotkov E.V., Chaley M.B., Frenkel F.E. | 2006
3 | Database | Database of sequences similar to the sequence of the hepatitis C virus, http://victoria.biengi.ac.ru/hcv/ | Frenkel F.E., Korotkov E.V. | 2005
4 | Database | Database of tRNA-like sequences from different genomes, http://victoria.biengi.ac.ru/trnalikes/ | Frenkel F.E., Korotkov E.V. | 2004
5 | Web server | Server for finding regions with hidden periodicity in sequences of DNA bases, http://victoria.biengi.ac.ru/lepscan | Shelenkov A.A., Korotkov E.V. | 2008
6 | Database | Classes of triplet periodicity in the DNA sequences of known genes from the KEGG database, http://victoria.biengi.ac.ru/ancorfs | Frenkel F.E., Korotkov E.V. | 2009
7 | Database | Web site for annotation of bacterial genes, http://genefunction.ru | Golyshev M.A., Korotkov E.V. | 2014
8 | Database | Information database «Database of Periodic DNA Regions in Major Genomes», http://victoria.biengi.ac.ru/cgi-bin/indelper/index.cgi | Frenkel F.E., Korotkov E.V. | 2017
9 | Web server | Web site for finding hidden periodicity in character sequences with deletions and insertions of characters, http://victoria.biengi.ac.ru/splinter/login.php | Frenkel F.E., Korotkov E.V. | 2017
10 | Database | Database of potential reading frame shift mutations in CDSs, http://victoria.biengi.ac.ru/cgi-bin/frameshift/index.cgi | Frenkel F.E., Korotkov E.V., Pugacheva V.M., Suvorova Yu.M. | 2018
11 | Web server | Web site for finding potential reading frame shift mutations in CDSs, http://victoria.biengi.ac.ru/fsfinder/ | Frenkel F.E., Korotkov E.V. | 2018

SELECTED PUBLICATIONS

  1. Frenkel, F.E., Korotkova, M.A., Korotkov, E.V. Database of Periodic DNA Regions in Major Genomes. BioMed Research International, 2017. https://doi.org/10.1155/2017/7949287
  2. Korotkov, E.V., Korotkova, M.A. Study of the periodicity in Euro-US Dollar exchange rates using local alignment and random matrices. Algorithmic Finance, v. 6, 23–33, 2017. DOI: 10.3233/AF-170182
  3. Korotkov, E.V., Korotkova, M.A. Search for regions with periodicity using the random position weight matrices in the C. elegans genome. Int. J. Data Mining and Bioinformatics, 18(4):331, 2017. DOI: 10.1504/IJDMB.2017.10009360
  4. Nor, A., Korotkov, E. Search of Fuzzy Periods in the Works of Poetry of Different Authors. Advances in Fuzzy Systems, Volume 2018, Article ID 4028417, 10 pages. https://doi.org/10.1155/2018/4028417
  5. Suvorova, Y.M., Korotkova, M.A., Skryabin, K.G., Korotkov, E.V. Search for potential reading frameshifts in cds from Arabidopsis thaliana and other genomes. DNA Research, v. 26, 157-170, 2019. DOI: 10.1093/dnares/dsy046
  6. Suvorova, Y.M., Pugacheva, V.M., Korotkov, E.V. A Database of Potential Reading Frame Shifts in Coding Sequences from Different Eukaryotic Genomes. Biophysics (Russian Federation), v. 64, 339-349, 2019. DOI: 10.1134/S0006350919030217, IF 0.2
  7. Suvorova, Yu.M., Korotkov, E.V. New Method for Potential Fusions Detection in Protein-Coding Sequences. Journal of Computational Biology, 26(11):1253-1261, 2019. DOI: 10.1089/cmb.2019.0122
  8. Korotkov, E.V., Kamionskaya, A.M., Korotkova, M.A. Multiple alignment of promoter sequences from the human genome. Biotekhnologiya, v. 36, n. 4, 7-14, 2020.
  9. Korotkov, E.V., Suvorova, Y.M., Kostenko, D.O., Korotkova, M.A. Multiple Alignment of Promoter Sequences from the Arabidopsis thaliana L. Genome. Genes, 12(2):135, 2021. DOI: 10.3390/genes12020135
  10. Suvorova, Y.M., Kamionskaya, A.M. & Korotkov, E.V. Search for SINE repeats in the rice genome using correlation-based position weight matrices. BMC Bioinformatics 22, 42 (2021). https://doi.org/10.1186/s12859-021-03977-0
  11. Korotkov, Eugene V., Anastasiya M. Kamionskya, and Maria A. Korotkova. 2021. “Detection of Highly Divergent Tandem Repeats in the Rice Genome” Genes 12, no. 4: 473. https://doi.org/10.3390/genes12040473
  12. Korotkov, Eugene V., Yulia M. Suvorova, Anna V. Nezhdanova, Sofia E. Gaidukova, Irina V. Yakovleva, Anastasia M. Kamionskaya, and Maria A. Korotkova. 2021. “Mathematical Algorithm for Identification of Eukaryotic Promoter Sequences”. Symmetry 13, no. 6: 917. https://doi.org/10.3390/sym13060917
  13. Rudenko, Valentina, and Eugene Korotkov. 2021. “Search for Highly Divergent Tandem Repeats in Amino Acid Sequences”. Int. J. Mol. Sci. 22, no. 13: 7096. https://doi.org/10.3390/ijms22137096
  14. Kostenko, D.O.; Korotkov, E.V. Application of the MAHDS Method for Multiple Alignment of Highly Diverged Amino Acid Sequences. Int. J. Mol. Sci. 2022, 23, 3764. https://doi.org/10.3390/ijms23073764

CONTRACT SERVICES (which the laboratory is ready to provide on a contractual basis)

  1. Annotation (prediction of biological function) of bacterial genes
  2. Search for potential mini- and microsatellites in DNA sequences
  3. Search for potential reading frame shift mutations in CDSs from different genomes
  4. Calculation of multiple alignment for highly diverged amino acid or nucleotide sequences (more than 2.5 mutations per amino acid or nucleotide)
  5. Search for potential promoters and TSS in eukaryotic genomes.