Gene-finding programs

Two key terms

Sensitivity: the proportion of true instances of the target that are actually found.
Specificity: how many false instances are reported (false positives); the fewer there are, the more specific the method.
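
The two terms can be made concrete with a small calculation. The sketch below is illustrative only: it assumes the convention commonly used when evaluating gene predictions, where sensitivity is TP / (TP + FN) and specificity is TP / (TP + FP). The function names and the exon coordinates are hypothetical.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of real instances that were found: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def specificity(tp: int, fp: int) -> float:
    """Fraction of predictions that are real: TP / (TP + FP).

    Assumption: this is the gene-prediction convention, not TN / (TN + FP).
    """
    return tp / (tp + fp) if (tp + fp) else 0.0


def evaluate(predicted: set, real: set) -> tuple:
    """Compare predicted and real features (here: exact exon coordinates)."""
    tp = len(predicted & real)   # real features that were predicted
    fp = len(predicted - real)   # predictions with no real counterpart
    fn = len(real - predicted)   # real features that were missed
    return sensitivity(tp, fn), specificity(tp, fp)


# Hypothetical exon coordinates, for illustration only.
real_exons = {(100, 250), (400, 520), (900, 1010), (1500, 1620)}
predicted_exons = {(100, 250), (400, 520), (700, 760)}
sn, sp = evaluate(predicted_exons, real_exons)
print(f"sensitivity = {sn:.2f}, specificity = {sp:.2f}")
```

With these toy sets, two of the four real exons are found and one of the three predictions is false.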

Signal analysis

Most authors have tried to exploit both the presence of a reading frame and other sequence signals: promoter (TATA box, CpG islands), intron-exon junctions (donor, acceptor), polyadenylation signal, etc.
Even if these signals were perfectly conserved, they would not be very specific. For example, a 150-nt ORF (the typical ORF size) turns up on average once per kilobase, whereas in vertebrate genomes there is actually only about one real one every 10 kb!
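
To get a feel for how often open frames arise by chance, one can scan a random sequence. This is a minimal sketch under stated assumptions: an "ORF" is taken to be any stop-free stretch of at least 150 nt (no start codon is required), both strands are scanned, and the random sequence has uniform base composition. The function names are hypothetical, and the count obtained depends on these choices.

```python
import random

STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def open_frames(seq, min_len=150):
    """Return (frame, start, end) for stop-free stretches >= min_len nt.

    Only the given strand is scanned; 'end' is exclusive and excludes the stop.
    """
    hits = []
    for frame in range(3):
        run_start = frame
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in STOP_CODONS:
                if i - run_start >= min_len:
                    hits.append((frame, run_start, i))
                run_start = i + 3
        # Stretch still open when the sequence ends.
        end = frame + ((len(seq) - frame) // 3) * 3
        if end - run_start >= min_len:
            hits.append((frame, run_start, end))
    return hits


def open_frames_per_kb(seq, min_len=150):
    """Count open stretches on both strands, normalised per kilobase."""
    total = len(open_frames(seq, min_len))
    total += len(open_frames(reverse_complement(seq), min_len))
    return 1000 * total / len(seq)


if __name__ == "__main__":
    rng = random.Random(0)
    noise = "".join(rng.choice("ACGT") for _ in range(100_000))
    print(f"{open_frames_per_kb(noise):.2f} open stretches >= 150 nt per kb")
```

Real genomic sequence is not uniform noise, so the figure printed here is only an order-of-magnitude guide.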
TATA boxes and other promoter elements, as well as splice signals, are equally unselective: the donor motif AxGT(A/G)xG occurs 559 times in a 67 kb human contig that contains only 7 exons.
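
Counting such a motif is straightforward with a regular expression. The sketch below assumes that "x" means any base and that a uniform, independent base composition is a fair null model; both are simplifications, and the function names are hypothetical.

```python
import re

# AxGT(A/G)xG written as a regex; the lookahead lets overlapping matches count.
DONOR = re.compile(r"(?=A.GT[AG].G)")


def count_donor_motifs(seq):
    """Number of positions on the given strand where the donor motif starts."""
    return sum(1 for _ in DONOR.finditer(seq.upper()))


def expected_random_count(length, motif_len=7):
    """Expected chance matches on one strand of a uniform random sequence.

    Four positions are fixed (A, G, T, G: 1/4 each), one is two-fold
    degenerate (A/G: 1/2) and two are free, so p = (1/4)**4 * (1/2) = 1/512.
    """
    p = (1 / 4) ** 4 * (1 / 2)
    return (length - motif_len + 1) * p


print(count_donor_motifs("CCAAGTAAGCC"))            # one occurrence
print(round(expected_random_count(67_000)))         # chance matches per strand
```

Under this null model a 67 kb sequence is expected to contain roughly 130 chance matches per strand, so hundreds of occurrences in a contig with only 7 exons are not surprising.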
Worse still, all these signals are degenerate (poorly conserved). To avoid missing many occurrences, one must therefore tolerate large deviations from the ideal motifs, and specificity suffers as a result.
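
The cost of tolerating deviations can be illustrated by scanning for a consensus while allowing a growing number of mismatches. This is a simplified sketch: real programs typically use weight matrices rather than a mismatch count, and the consensus TATAAA, the function names and the random test sequence are assumptions made for the example.

```python
import random
from math import comb


def hamming_hits(seq, motif, max_mismatches):
    """Start positions where seq matches motif with at most max_mismatches."""
    m = len(motif)
    hits = []
    for i in range(len(seq) - m + 1):
        mismatches = sum(1 for a, b in zip(seq[i:i + m], motif) if a != b)
        if mismatches <= max_mismatches:
            hits.append(i)
    return hits


def accepted_kmers(motif_len, max_mismatches):
    """How many distinct DNA words of that length pass the mismatch filter."""
    return sum(comb(motif_len, j) * 3 ** j for j in range(max_mismatches + 1))


if __name__ == "__main__":
    rng = random.Random(1)
    noise = "".join(rng.choice("ACGT") for _ in range(10_000))
    for k in range(3):
        n_hits = len(hamming_hits(noise, "TATAAA", k))
        print(f"{k} mismatches allowed: {accepted_kmers(6, k)} of 4096 "
              f"hexamers accepted, {n_hits} hits in 10 kb of random sequence")
```

For a 6-base motif, allowing 0, 1 or 2 mismatches accepts 1, 19 or 154 of the 4096 possible hexamers, so the number of chance hits grows quickly with each extra mismatch tolerated.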
Vertebrate genes have large 5' and 3' untranslated regions (several hundred bases at the 3' end) that lack any known signal.
Reconstructing the complete gene adds a further source of error: exons may be missed, or exons from two different genes may be merged.
Parameters such as chromatin structure are largely beyond the reach of bioinformatics. Yet euchromatin, which is decondensed, is actively expressed, whereas heterochromatin, which is condensed, is inaccessible to transcription.

Content analysis

Content analysis relies on biases in gene sequences caused by:
- the nature of the genetic code
- amino acid abundance
- the available tRNAs
- possible evolutionary constraints
These constraints show up as:
- uneven usage of synonymous codons
- a preference for purine-N-pyrimidine sequences at codons
- an asymmetry between the three codon positions
So far, the statistic on 6 letters (2 codons, i.e. hexamers) is the one that best discriminates coding from non-coding sequences. The reasons are still unclear.
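
A hexamer statistic can be sketched as a log-likelihood ratio between hexamer frequencies in coding and non-coding training sequences. This is one plausible way to build such a measure, not the method of any particular program. The function names, the pseudocount and the tiny training sequences are hypothetical; real training sets contain many thousands of bases.

```python
import math
from collections import Counter


def hexamer_counts(seqs, step):
    """Count 6-letter words; step=3 keeps only in-frame dicodons."""
    counts = Counter()
    for s in seqs:
        for i in range(0, len(s) - 5, step):
            counts[s[i:i + 6]] += 1
    return counts


def train(coding, noncoding, pseudo=1.0):
    """Return a function giving log(P(hexamer|coding) / P(hexamer|noncoding))."""
    c = hexamer_counts(coding, 3)       # coding sequences assumed in frame 0
    n = hexamer_counts(noncoding, 1)    # every hexamer of the non-coding set
    c_total = sum(c.values()) + pseudo * 4096   # 4096 = 4**6 possible hexamers
    n_total = sum(n.values()) + pseudo * 4096

    def log_ratio(hexamer):
        return (math.log((c[hexamer] + pseudo) / c_total)
                - math.log((n[hexamer] + pseudo) / n_total))

    return log_ratio


def coding_score(seq, log_ratio, frame=0):
    """Mean log-ratio over the in-frame hexamers; > 0 suggests coding."""
    hexamers = [seq[i:i + 6] for i in range(frame, len(seq) - 5, 3)]
    if not hexamers:
        return 0.0
    return sum(log_ratio(h) for h in hexamers) / len(hexamers)


# Toy training data, for illustration only.
log_ratio = train(coding=["ATGGCTGCTGCTGCT" * 4],
                  noncoding=["TTTTAAAATTTTAAAA" * 4])
print(coding_score("GCTGCTGCTGCT", log_ratio) > 0)
print(coding_score("TTTTAAAATTTT", log_ratio) < 0)
```

Scoring the three frames of a candidate region separately and keeping the best one is a common way to use such a measure when the frame is unknown.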

Algorithms

The best methods combine signal detection with content analysis. Approaches used include:
- linear discriminant analysis (LDA)
- hidden Markov models (HMM)
- neural networks
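
As an illustration of how a signal score and a content score can be combined, here is a minimal two-feature Fisher linear discriminant. It is a sketch under stated assumptions: each candidate exon is summarised by a hypothetical pair (signal score, content score), the two classes are given equal weight, and the training values are invented for the example. It does not reproduce any published gene finder.

```python
def mean(vectors):
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(2)]


def scatter(vectors, mu):
    """2x2 within-class scatter matrix around the class mean."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for v in vectors:
        d = [v[0] - mu[0], v[1] - mu[1]]
        for a in range(2):
            for b in range(2):
                s[a][b] += d[a] * d[b]
    return s


def fit_lda(pos, neg):
    """Fisher discriminant: w = Sw^-1 (mean_pos - mean_neg)."""
    mp, mn = mean(pos), mean(neg)
    sp, sn = scatter(pos, mp), scatter(neg, mn)
    sw = [[sp[a][b] + sn[a][b] for b in range(2)] for a in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [mp[0] - mn[0], mp[1] - mn[1]]
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    # Decision threshold at the midpoint of the class means (equal priors).
    mid = [(mp[0] + mn[0]) / 2, (mp[1] + mn[1]) / 2]
    bias = -(w[0] * mid[0] + w[1] * mid[1])
    return w, bias


def predict(w, bias, x):
    """True if the candidate falls on the 'exon' side of the boundary."""
    return w[0] * x[0] + w[1] * x[1] + bias > 0


# Invented (signal score, content score) pairs, for illustration only.
true_exons = [(0.9, 0.8), (0.7, 0.9), (0.8, 0.6), (0.6, 0.7)]
false_exons = [(0.3, 0.2), (0.5, 0.1), (0.2, 0.4), (0.4, 0.3)]
w, bias = fit_lda(true_exons, false_exons)
print(predict(w, bias, (0.75, 0.7)), predict(w, bias, (0.3, 0.3)))
```

The point of the combination is that neither score has to separate the classes on its own; the discriminant weighs both according to how the training examples are spread.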

These methods all rely on training sets. This is a problem for detecting "non-canonical" exons: ORF composition can vary considerably in some gene families, so the statistics learned from the training set no longer apply. Promoter and splice-site signals, by contrast, are fairly general.

Document

Predicting Pol II Promoter Sequences Using Transcription Factor Binding Sites (excerpt), Dan S. Prestridge
The development of computer algorithms to recognize Pol II promoter sequences in primary sequence data, however, is an extremely complex and difficult problem. While there have been many successful attempts to build algorithms to recognize prokaryotic promoter sequences using a variety of approaches, there have been few attempts at eukaryotic promoter recognition. The difficulty in developing algorithms to recognize eukaryotic Pol II promoters, in comparison to algorithms designed to recognize prokaryotic promoters, lies in the greater complexity of eukaryotic promoter structure.
Prokaryotic promoter sequences contain three sequence elements that can be used as a basis of recognition: the -10 and -35 elements (relative to the transcription start site) and a spacer that separates these two sequence elements which is of a fairly consistent length. Eukaryotic promoters have similar elements that are common to many Pol II promoter sequences; however, these elements are much less consistent and less discriminating.
Many Pol II promoters contain a TATA box between -25 and -40; however the TATA box consensus is not a strong one (Bucher, 1990) and the element is missing altogether in many promoters. A second element often found in Pol II promoters is a CA dinucleotide straddling, or very near, the transcription start site (TSS). However, as in the case of the TATA box, many promoter sequences are missing this element, or it is displaced from the TSS. The CA dinucleotide, by nature of its small size, is also not very diagnostic (one in every 16 dinucleotides can be expected to be the CA nucleotide pair by random selection). And finally, the spacer between the TATA box and CA dinucleotide is highly variable in length, even if both sequence elements are present.
In examining Pol II promoter sequences, very few can be found to contain both the TATA box and CA dinucleotide separated by a consistent spacer. Additionally, besides the TATA and CA elements, eukaryotic Pol II promoter sequences each contain several binding sites for a variety of transcription factors. The content and arrangement of these binding sites in a promoter sequence determine when, in what tissues, and under what conditions the promoter is functional. Each promoter has a unique selection and arrangement of these elements, giving each promoter a unique program of expression. At the present time, there are more than 2000 individual binding sites that have been described, representing the binding sites of hundreds of transcription factors. It is the structural complexity of Pol II promoters and the heterogeneity of transcriptional elements in promoters that make eukaryotic Pol II promoter recognition by computers difficult to approach.
There have been few previous attempts to build Pol II promoter recognition algorithms. Most notable are the past attempts of Claverie and Sauvaget, Bucher, and the recent efforts of Matis et al. and Knudsen. Claverie and Sauvaget developed an algorithm capable of recognizing a pattern of two or three sequence elements separated by defined length spacers. The algorithm, while shown to be capable of recognizing promoter element patterns with a fairly high degree of fidelity, was however, capable of recognizing only specific manually predefined promoter patterns. Tests were limited to glucocorticoid and heat shock promoters.
Another approach, potentially capable of recognizing a broader range of TATA-containing promoters, was Bucher's efforts to refine and utilize an improved TATA box weighted matrix and using it to recognize Pol II promoter sequences. However, while results were significant on a limited sequence test set, when more generally applied to non-promoter sequences, results indicate that the rate of false positives is much too high when using this matrix alone to be a useful promoter discriminator (the FP rate is on the order of 1/100 bases in non-promoter sequence; author, data not shown).
In addition to these past attempts, recently there have been two as yet unpublished attempts at Pol II promoter recognition, those of Matis et al. and Knudsen. The Matis program is associated with the Grail gene recognition program, while the Knudsen program, Promoter1, can be found as an independent program or in association with the GeneID gene recognition program. The Matis program uses an artificial intelligence method to recognize Pol II promoters, but is limited to recognizing only TATA-containing promoters. The Knudsen program, Promoter1, takes a more generalized approach at promoter recognition, being trained on vertebrate promoters (including TATA-containing promoters), with correspondingly better results at overall primate promoter recognition.