Gene-finding programs

2 Key terms

Signal analysis

Content analysis

Algorithms

Document

Predicting Pol II Promoter Sequences Using Transcription Factor Binding Sites (excerpt)
Dan S. Prestridge

The development of computer algorithms to recognize Pol II promoter sequences in primary sequence data is an extremely complex and difficult problem. While there have been many successful attempts to build algorithms that recognize prokaryotic promoter sequences using a variety of approaches, there have been few attempts at eukaryotic promoter recognition.
The difficulty in developing algorithms to recognize eukaryotic Pol II promoters, compared with algorithms designed to recognize prokaryotic promoters, lies in the greater complexity of eukaryotic promoter structure. Prokaryotic promoter sequences contain three sequence elements that can be used as a basis of recognition: the -10 and -35 elements (relative to the transcription start site) and a spacer of fairly consistent length that separates them. Eukaryotic promoters have similar elements that are common to many Pol II promoter sequences; however, these elements are much less consistent and less discriminating. Many Pol II promoters contain a TATA box between -25 and -40; however, the TATA box consensus is not a strong one (Bucher, 1990), and the element is missing altogether in many promoters. A second element often found in Pol II promoters is a CA dinucleotide straddling, or very near, the transcription start site (TSS). However, as with the TATA box, many promoter sequences lack this element, or it is displaced from the TSS. The CA dinucleotide, because of its small size, is also not very diagnostic (one in every 16 dinucleotides can be expected to be the CA pair by chance). Finally, the spacer between the TATA box and the CA dinucleotide is highly variable in length, even when both sequence elements are present. Among Pol II promoter sequences, very few contain both the TATA box and the CA dinucleotide separated by a consistent spacer.
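The prokaryotic case described above (two short elements separated by a spacer of nearly fixed length) can be illustrated with a minimal sketch. The hexamers TTGACA and TATAAT are the commonly cited E. coli consensus sequences and the 15 to 19 base spacer range is a typical textbook value; neither comes from the excerpt, and the function names are hypothetical. The second function simply restates the "one in 16" argument for a short element such as CA.

```python
# Assumed textbook E. coli consensus hexamers; they are not given in the excerpt.
MINUS_35 = "TTGACA"
MINUS_10 = "TATAAT"


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_promoter_candidates(seq, spacer_range=(15, 19), max_mismatches=1):
    """Return (start of -35 element, spacer length, total mismatches) tuples.

    A candidate is a -35-like hexamer followed, after a spacer whose length
    lies in spacer_range (inclusive), by a -10-like hexamer.
    """
    seq = seq.upper()
    k = len(MINUS_35)
    hits = []
    for i in range(len(seq) - k + 1):
        m35 = hamming(seq[i:i + k], MINUS_35)
        if m35 > max_mismatches:
            continue
        for spacer in range(spacer_range[0], spacer_range[1] + 1):
            j = i + k + spacer
            if j + len(MINUS_10) > len(seq):
                break  # longer spacers would run past the end of the sequence
            m10 = hamming(seq[j:j + len(MINUS_10)], MINUS_10)
            if m10 <= max_mismatches:
                hits.append((i, spacer, m35 + m10))
    return hits


def expected_random_hits(element_length, seq_length):
    """Expected chance occurrences of one fixed element in random sequence,
    assuming the four bases are equally likely and independent."""
    positions = seq_length - element_length + 1
    return positions / 4 ** element_length


if __name__ == "__main__":
    toy = "GG" + MINUS_35 + "A" * 17 + MINUS_10 + "GG"
    print(find_promoter_candidates(toy))
    print(expected_random_hits(2, 1001))  # a dinucleotide such as CA
```

The same search applied to a TATA box and a CA dinucleotide would be far less selective, because the spacer range would have to be wide and a two-base element matches by chance about once every 16 positions.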
Additionally, besides the TATA and CA elements, eukaryotic Pol II promoter sequences each contain several binding sites for a variety of transcription factors. The content and arrangement of these binding sites in a promoter sequence determine when, in what tissues, and under what conditions the promoter is functional. Each promoter has a unique selection and arrangement of these elements, giving each promoter a unique program of expression. At the present time, there are more than 2000 individual binding sites that have been described, representing the binding sites of hundreds of transcription factors. It is the structural complexity of Pol II promoters and the heterogeneity of transcriptional elements in promoters that make eukaryotic Pol II promoter recognition by computers difficult to approach.
There have been few previous attempts to build Pol II promoter recognition algorithms. Most notable are the past attempts of Claverie and Sauvaget, and of Bucher, and the recent efforts of Matis et al. and Knudsen. Claverie and Sauvaget developed an algorithm capable of recognizing a pattern of two or three sequence elements separated by spacers of defined length. The algorithm, while shown to recognize promoter element patterns with a fairly high degree of fidelity, could recognize only specific, manually predefined promoter patterns. Tests were limited to glucocorticoid and heat shock promoters. Another approach, potentially capable of recognizing a broader range of TATA-containing promoters, was Bucher's effort to refine a TATA box weight matrix and use it to recognize Pol II promoter sequences. However, while results were significant on a limited sequence test set, when the matrix is applied more generally to non-promoter sequences, the rate of false positives is much too high for the matrix alone to be a useful promoter discriminator (the false-positive rate is on the order of one per 100 bases of non-promoter sequence; author, data not shown). In addition to these past attempts, there have recently been two as yet unpublished attempts at Pol II promoter recognition, those of Matis et al. and Knudsen. The Matis program is associated with the Grail gene recognition program, while the Knudsen program, Promoter1, can be found as an independent program or in association with the GeneID gene recognition program. The Matis program uses an artificial intelligence method to recognize Pol II promoters, but is limited to recognizing only TATA-containing promoters. The Knudsen program, Promoter1, takes a more generalized approach to promoter recognition, being trained on vertebrate promoters (including TATA-containing promoters), with correspondingly better results at overall primate promoter recognition.
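The weight-matrix approach mentioned above can be sketched as follows. This is only an illustration of the general technique (a log-odds position weight matrix scanned along a sequence with a score threshold). The aligned sites in EXAMPLE_SITES are invented for the example and are not Bucher's matrix; the pseudocount, the uniform background, and the function names are likewise assumptions.

```python
import math

# Hypothetical aligned TATA-like sites, invented purely for illustration.
# This is NOT Bucher's TATA box matrix.
EXAMPLE_SITES = ["TATAAA", "TATATA", "TATAAG", "TTTAAA", "TATAAA"]


def build_log_odds_matrix(sites, pseudocount=1.0, background=0.25):
    """Build a log2-odds weight matrix from equal-length aligned sites.

    Each column maps a base to log2(observed frequency / background),
    with a pseudocount so that unseen bases get a finite score.
    """
    length = len(sites[0])
    matrix = []
    for pos in range(length):
        column = {}
        for base in "ACGT":
            count = sum(1 for site in sites if site[pos] == base)
            freq = (count + pseudocount) / (len(sites) + 4 * pseudocount)
            column[base] = math.log2(freq / background)
        matrix.append(column)
    return matrix


def scan(seq, matrix, threshold):
    """Return (position, score) for every window scoring at or above threshold."""
    seq = seq.upper()
    width = len(matrix)
    hits = []
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing ambiguous characters such as N
        score = sum(matrix[k][base] for k, base in enumerate(window))
        if score >= threshold:
            hits.append((i, round(score, 2)))
    return hits


if __name__ == "__main__":
    pwm = build_log_odds_matrix(EXAMPLE_SITES)
    print(scan("GGGCGCTATAAAGGCGC", pwm, threshold=5.0))
```

Lowering the threshold finds more true sites but also more chance matches in non-promoter sequence, which is the trade-off behind the false-positive rate discussed in the text.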