Gene-finding programs
Two key terms
- Sensitivity: how many of the true instances
of the target object are actually found.
- Specificity: how many
false instances are reported (false positives); the fewer there are, the higher the specificity.
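The two terms can be made concrete with a small sketch. It assumes the convention commonly used when evaluating gene predictions, where specificity is computed as TP / (TP + FP) because true negatives are hard to define; the notes themselves give no formula. The function names and the set-based representation of predictions are illustrative choices.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of true instances that were found: TP / (TP + FN)."""
    return tp / (tp + fn) if tp + fn else 0.0


def specificity(tp: int, fp: int) -> float:
    """Fraction of reported instances that are real: TP / (TP + FP).

    Assumption: this is the gene-prediction convention, not the
    TN / (TN + FP) definition used in general statistics.
    """
    return tp / (tp + fp) if tp + fp else 0.0


def evaluate(predicted: set, actual: set) -> tuple:
    """Compare predicted items (e.g. exon coordinates) with the real ones."""
    tp = len(predicted & actual)   # found and real
    fp = len(predicted - actual)   # found but not real
    fn = len(actual - predicted)   # real but missed
    return sensitivity(tp, fn), specificity(tp, fp)
```

For example, `evaluate({1, 2, 3, 4}, {2, 3, 4, 5, 6})` counts 3 true positives, 1 false positive and 2 misses.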
Signal analysis
- Most authors have tried to exploit both the presence of an open reading frame
and other sequence signals: promoter (TATA box, CpG islands), intron-exon
junctions (donor, acceptor), polyadenylation signal, etc.
- Even if these signals were perfectly conserved, they would not be very specific.
For example, an ORF of 150 nt (the typical size of an ORF) occurs on average
once per kilobase, whereas there is in fact only one real one every 10 kb in vertebrate
genomes!
- TATA boxes and other promoter elements, as well as splice
signals, are equally unselective: the donor motif AxGT(A/G)xG occurs 559 times
in a 67 kb human contig that contains only 7 exons.
- Worse still, all these signals are degenerate (poorly conserved), so to
avoid missing many occurrences
one must allow large deviations
from the ideal motifs, which costs specificity.
- Vertebrate genes have large 5' and 3' untranslated regions
(several hundred bases in 3') that lack any known signal.
- Reconstructing the complete gene adds a further source of error: the risk of missing
exons or of mixing exons that belong to two different genes.
- Parameters such as chromatin structure are
largely beyond the reach of bioinformatics. Yet
euchromatin, which is unwound, is actively expressed, whereas heterochromatin,
which is condensed, is inaccessible to transcription.
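The low selectivity of a short degenerate motif can be checked with a quick sketch. It reads the donor motif AxGT(A/G)xG as a regular expression (x taken to mean any base) and estimates how many matches pure chance would produce. The uniform base composition and the function names are assumptions made for illustration only.

```python
import re

# AxGT(A/G)xG, with x read as "any base". The lookahead makes
# overlapping matches count as well.
DONOR = re.compile(r"(?=A[ACGT]GT[AG][ACGT]G)")


def count_donor_sites(seq: str) -> int:
    """Count matches of the donor motif on the given strand."""
    return sum(1 for _ in DONOR.finditer(seq.upper()))


def expected_hits(length: int, motif_len: int = 7) -> float:
    """Expected matches in a random sequence with equal base frequencies.

    Four positions are fixed (probability 1/4 each) and one accepts two
    bases (1/2), so a match starts at a given position with
    probability (1/4)**4 * (1/2) = 1/512.
    """
    p = (1 / 4) ** 4 * (1 / 2)
    return max(length - motif_len + 1, 0) * p
```

Under this simplified model, `expected_hits(67000)` gives about 131 chance matches per strand, far more than the handful of real exons in a contig of that size.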
Content analysis
- Relies on biases in gene sequences caused by:
  - the nature of the genetic code
  - the abundance of the amino acids
  - the available tRNAs
  - possible evolutionary constraints
- These constraints result in:
  - Uneven usage of synonymous codons
  - A preference for purine-N-pyrimidine codons
  - An asymmetry between the 3 codon positions.
- So far, the statistic on 6 letters (2 codons)
is the one that best discriminates coding from non-coding
sequences. The reasons for this are still unclear.
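A 6-letter (dicodon) statistic can be sketched as a log-odds score. The version below is a minimal illustration, not the method of any particular program: it estimates hexamer frequencies from a set of coding sequences (read in frame) and a set of non-coding sequences (read at every position), then sums the log ratios over a window. The pseudocount of 1 and all names are assumptions.

```python
import math
from collections import Counter
from itertools import product

HEXAMERS = ["".join(p) for p in product("ACGT", repeat=6)]


def hexamer_freqs(seqs, step):
    """Relative hexamer frequencies with a pseudocount of 1.

    step=3 reads in-frame dicodons; step=1 reads every position.
    Hexamers containing letters other than A, C, G, T are ignored.
    """
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(0, len(s) - 5, step):
            counts[s[i:i + 6]] += 1
    total = sum(counts[h] for h in HEXAMERS) + len(HEXAMERS)
    return {h: (counts[h] + 1) / total for h in HEXAMERS}


def coding_score(window, coding, noncoding):
    """Sum of log-odds over in-frame hexamers; above 0 favours coding."""
    window = window.upper()
    score = 0.0
    for i in range(0, len(window) - 5, 3):
        h = window[i:i + 6]
        if h in coding:  # skip hexamers with ambiguous bases
            score += math.log(coding[h] / noncoding[h])
    return score
```

In practice the window would be scored in each of the three frames on both strands, and the training sets would need to be far larger than the 4096 possible hexamers for the frequencies to be meaningful.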
Algorithms
- The best methods are those that combine signal
detection and content analysis.
- Linear Discriminant Analysis (LDA)
- HMMs, hidden Markov models
- Neural networks
These methods all rely on training sets. This is a problem
for detecting "non-canonical" exons. The composition
of ORFs can vary considerably in some gene families, in which case the statistics
learned from the training set no longer apply.
Promoter and splice-site signals, on the other hand, are fairly general.
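As an illustration of the HMM approach, here is a minimal two-state sketch decoded with the Viterbi algorithm. The states, transition probabilities and emission probabilities are invented for the example (a real gene finder would learn them from a training set and use many more states for exons, introns and signals), which is also why such a model inherits the biases of its training data.

```python
import math

# Hypothetical model: N = non-coding (AT-rich), C = coding (GC-rich).
STATES = ("N", "C")
START = {"N": 0.5, "C": 0.5}
TRANS = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
EMIT = {
    "N": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    "C": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}


def viterbi(seq: str) -> str:
    """Return the most probable state path for a sequence of A/C/G/T."""
    seq = seq.upper()
    if not seq:
        return ""
    # Log-probability of the best path ending in each state.
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # one dict of back-pointers per position after the first
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][base]))
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))
```

With these toy parameters, an AT-rich stretch followed by a GC-rich stretch is labelled as N followed by C, with the switch placed at the composition boundary.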
Document
Predicting Pol II Promoter Sequences
Using Transcription Factor Binding Sites (excerpt)
Dan S. Prestridge
The development of computer algorithms to recognize Pol II promoter sequences in primary sequence data,
however, is an extremely complex and difficult problem. While there have been many successful attempts to build
algorithms to recognize prokaryotic promoter sequences using a variety of approaches, there have been few
attempts at eukaryotic promoter recognition.
The difficulty in developing algorithms to recognize eukaryotic Pol II promoters, in comparison to algorithms
designed to recognize prokaryotic promoters, lies in the greater complexity of eukaryotic promoter structure.
Prokaryotic promoter sequences contain three sequence elements that can be used as a basis of recognition: the -10
and -35 elements (relative to the transcription start site) and a spacer that separates these two sequence elements
which is of a fairly consistent length. Eukaryotic promoters have similar elements that
are common to many Pol II promoter sequences; however, these elements are much less consistent and less
discriminating. Many Pol II promoters contain a TATA box between -25 and -40; however the TATA box consensus is not a strong one (Bucher, 1990) and the element
is missing altogether in many promoters. A second element often found in Pol II promoters is a CA dinucleotide
straddling, or very near, the transcription start site (TSS). However, as in the case of the TATA box,
many promoter sequences are missing this element, or it is displaced from the TSS. The CA dinucleotide, by nature
of its small size, is also not very diagnostic (one in every 16 dinucleotides can be expected to be the CA nucleotide
pair by random selection). And finally, the spacer between the TATA box and CA dinucleotide is highly variable in
length, even if both sequence elements are present. In examining Pol II promoter sequences, very few can be found
to contain both the TATA box and CA dinucleotide separated by a consistent spacer.
Additionally, besides the TATA and CA elements, eukaryotic Pol II promoter sequences each contain
several binding sites for a variety of transcription factors. The content and arrangement of these
binding sites in a promoter sequence determine when, in what tissues, and under what conditions the
promoter is functional. Each promoter has a unique selection and arrangement of these elements,
giving each promoter a unique program of expression. At the present time, there are more than 2000
individual binding sites that have been described, representing the binding sites of hundreds of
transcription factors. It is the structural complexity of Pol II promoters and the heterogeneity of
transcriptional elements in promoters that make eukaryotic Pol II promoter recognition by computers
difficult to approach.
There have been few previous attempts to build Pol II promoter recognition algorithms.
Most notable are the past attempts of Claverie and Sauvaget, Bucher, and the recent efforts of
Matis et al. and Knudsen. Claverie and Sauvaget developed an algorithm capable of recognizing a
pattern of two or three sequence elements separated by defined length spacers. The algorithm,
while shown to be capable of recognizing promoter element patterns with a fairly high degree of
fidelity, was, however, capable of recognizing only specific, manually predefined promoter patterns.
Tests were limited to glucocorticoid and heat shock promoters. Another approach, potentially capable
of recognizing a broader range of TATA-containing promoters, was Bucher's effort to refine the
TATA box weight matrix and use the improved matrix to recognize Pol II promoter sequences.
However, while results were significant on a limited sequence test set, when the matrix is applied
more generally to non-promoter sequences, results indicate that the rate of false positives is much
too high for this matrix alone to be a useful promoter discriminator (the FP rate is on the order of 1/100
bases in non-promoter sequence; author, data not shown). In addition to these past attempts,
recently there have been two as yet unpublished attempts at Pol II promoter recognition, those of
Matis et al. and Knudsen. The Matis program is associated with the Grail gene recognition program,
while the Knudsen program, Promoter1, can be found as an independent program or in association
with the GeneID gene recognition program. The Matis program uses an artificial intelligence method
to recognize Pol II promoters, but is limited to recognizing only TATA-containing promoters.
The Knudsen program, Promoter1, takes a more generalized approach to promoter recognition,
being trained on vertebrate promoters (including TATA-containing promoters), with correspondingly
better results at overall primate promoter recognition.
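
Note on the excerpt: the weight-matrix approach it describes can be sketched as a sliding-window log-odds scan. The matrix below is hypothetical and purely illustrative (it is not Bucher's published matrix), and the uniform background of 0.25 per base and the function names are assumptions. Lowering the threshold recovers weaker sites at the price of more false positives, which is the trade-off the excerpt reports.

```python
import math

# Hypothetical position probabilities for a 6-bp TATA-like motif.
# Illustrative values only; NOT Bucher's published matrix.
TATA_PROBS = [
    {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},
    {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
    {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},
    {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
    {"A": 0.60, "C": 0.05, "G": 0.05, "T": 0.30},
    {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},
]
BACKGROUND = 0.25  # assumed uniform base composition


def window_score(window: str) -> float:
    """Log2-odds of the window under the motif versus the background."""
    return sum(math.log2(col[b] / BACKGROUND)
               for col, b in zip(TATA_PROBS, window))


def scan(seq: str, threshold: float) -> list:
    """Return (position, score) for every window scoring >= threshold."""
    seq = seq.upper()
    width = len(TATA_PROBS)
    hits = []
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        if set(window) <= set("ACGT"):  # skip windows with ambiguous bases
            s = window_score(window)
            if s >= threshold:
                hits.append((i, s))
    return hits
```

For instance, `scan("GGGGTATAAAGGGG", 8.0)` reports a single hit at position 4, where the sequence reads TATAAA.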