NAPP database

Documentation

1. NAPP home page

2. RNA-rich Cluster Page

3. Cluster contents page

4. Contig page

5. Profile page

6. File formats

7. References

NAPP (Nucleic Acids Phylogenetic Profiling [1]) is a clustering method that efficiently identifies noncoding RNA (ncRNA) elements in a bacterial genome. The intergenic regions of a reference genome are tiled into overlapping 50-nt segments, and all tiles and coding sequences are classified based on their occurrence profiles in other genomes. Tiles corresponding to actual ncRNAs tend to cluster together and with certain types of protein-coding genes. We term these "RNA-rich clusters". Any non-annotated tile in such a cluster can be considered a strong ncRNA candidate (sRNA, cis-acting RNA or other ncRNA). Furthermore, certain clusters are enriched for genes in specific functional classes, which makes it possible to draw hypotheses about the function of the associated ncRNAs [2].
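The tiling step can be illustrated with a minimal Python sketch. Only the 50-nt tile length comes from the description above; the 25-nt step between tile starts, the function name tile_region and the handling of the last tile are assumptions made for illustration, not the actual NAPP implementation.

```python
def tile_region(start, end, tile_len=50, step=25):
    """Split the interval [start, end] (1-based, inclusive) into overlapping tiles.

    tile_len=50 follows the text; step=25 is an assumed overlap setting.
    """
    tiles = []
    pos = start
    # Emit full-length tiles while they still fit inside the region.
    while pos + tile_len - 1 <= end:
        tiles.append((pos, pos + tile_len - 1))
        pos += step
    # Assumption: add one last tile flush with the region end so that
    # no nucleotide is left uncovered (regions shorter than a tile
    # yield a single shorter tile).
    if not tiles or tiles[-1][1] < end:
        tiles.append((max(start, end - tile_len + 1), end))
    return tiles


if __name__ == "__main__":
    for tile in tile_region(1, 110):
        print(tile)
```

With these assumed settings, consecutive tiles share half of their length, so an ncRNA shorter than a tile is still likely to fall mostly inside at least one tile.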

This web site provides comprehensive access to NAPP results for over 1000 bacterial and archaeal genomes.

1. NAPP home page

NAPP predictions are accessed one genome at a time. A genome is chosen by typing keywords (for instance "coli" or "bacillus") in the text area, or by selecting a species name from the drop-down menu. When the keywords match several genomes, a list of the matching genomes is displayed. Selecting a genome leads to the "RNA-rich Clusters" page.

2. RNA-rich Cluster Page

This page presents a table of RNA-rich clusters for the selected species, providing the following information for each cluster:

- Cluster number

- Number of elements in cluster

- Number of annotated genes in cluster

- Number of tiles in cluster

- Number of known RNAs in cluster

- ncRNA-enrichment P-value

The following actions are available:

Clicking on a cluster number or P-value leads to the corresponding cluster contents page.

The Make contigs button leads to the Contig page (see below).

The View profiles button leads to the phylogenetic profiles page (see below). NAPP profiles are generated on the fly, which may cause a delay of a few seconds.

The Search for position button shows all elements from RNA-rich clusters that are located in a given sequence interval (enter a position interval and select a chromosome name). Elements are presented in the same format as on the "cluster contents" page.
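The same kind of position search can be reproduced offline on exported data. The sketch below is only an illustration: the dictionary keys, the example elements and the choice of an overlap test (rather than strict containment) are assumptions, since the exact rule used by the web site is not documented here.

```python
def elements_in_interval(elements, chromosome, lo, hi):
    """Return the elements on `chromosome` that overlap the interval [lo, hi]."""
    return [
        e for e in elements
        if e["chromosome"] == chromosome and e["start"] <= hi and e["end"] >= lo
    ]


# Invented example elements; names and identifiers are placeholders.
example_elements = [
    {"name": "tile", "chromosome": "chr_A", "start": 1000, "end": 1049},
    {"name": "geneX", "chromosome": "chr_A", "start": 5000, "end": 5900},
    {"name": "tile", "chromosome": "chr_B", "start": 1000, "end": 1049},
]

if __name__ == "__main__":
    for hit in elements_in_interval(example_elements, "chr_A", 900, 1100):
        print(hit["name"], hit["start"], hit["end"])
```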

Export all clusters as CSV: Exports a CSV file containing the contents of all RNA-rich clusters. For each cluster element, the file gives: name, coordinates, GenBank ID of the chromosome, and description of the element.

Export all clusters as GFF: Exports a GFF file containing the contents of all RNA-rich clusters.

3. Cluster contents page

This page presents all elements found in the selected cluster. For each element, the following data is provided:

- Gene name or "tile" status

- Positions in the genome

- GenBank Id of chromosome

- Description of element

The header line at the top summarizes the characteristics of the current cluster: cluster ID, number of elements, number of tiles, number of genes, and P-value of ncRNA enrichment.

Export as CSV: Exports a CSV file containing the contents of the cluster. For each cluster element, the file gives: name, coordinates, GenBank ID of the chromosome, and description of the element.

Export as GFF: Exports a GFF file containing the contents of the cluster.

See section 6 for a description of the CSV and GFF formats.

4. Contig page

This page presents contigs produced by aggregating all adjacent tiles from ncRNA-rich clusters. A single contig may contain tiles from different clusters (the cluster number is indicated in parentheses as C.1, C.2, etc.). A gap of up to 100 nt between adjacent tiles is tolerated within a contig. Note that tiles and contigs are not oriented.
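The aggregation rule can be sketched as follows. This is an illustrative Python sketch, not the code behind the Make contigs button: the function name make_contigs, the tuple layout (start, end, cluster) and the way the gap is measured (number of nucleotides between the end of the current contig and the start of the next tile) are assumptions. Tiles are expected to come from a single chromosome.

```python
def make_contigs(tiles, max_gap=100):
    """Merge tiles (start, end, cluster) into contigs, tolerating gaps <= max_gap nt."""
    contigs = []
    for start, end, cluster in sorted(tiles):
        # Gap between the current contig end and this tile start;
        # it is negative when the tile overlaps the contig.
        if contigs and start - contigs[-1]["end"] - 1 <= max_gap:
            contigs[-1]["end"] = max(contigs[-1]["end"], end)
            contigs[-1]["tiles"].append((start, end, cluster))
        else:
            contigs.append({"start": start, "end": end,
                            "tiles": [(start, end, cluster)]})
    return contigs


# Invented tile coordinates, for illustration only.
example_tiles = [(100, 149, 1), (125, 174, 2), (260, 309, 1), (500, 549, 3)]

if __name__ == "__main__":
    for contig in make_contigs(example_tiles):
        clusters = ", ".join("C.%d" % c for _, _, c in contig["tiles"])
        print(contig["start"], contig["end"], clusters)
```

In this example the first three tiles end up in one contig although they belong to two different clusters, which mirrors the C.1, C.2 labels shown on the page.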

The table columns contain the following:

- Contig

- GenBank Id

- Tiles

- Description

Export contigs as CSV: Exports a CSV file containing the contents of the contigs. For each contig, the file gives: region, GenBank ID of the chromosome, tiles, and description of the elements.

Export contigs as GFF: Exports a GFF file of the contigs.

See section 6 for a description of the CSV and GFF formats.

5. Profile page

This page presents the phylogenetic profiles of RNA-rich clusters. Rows represent species, ordered as the leaves of a 16S rRNA phylogenetic tree. Each column represents the average phylogenetic profile of all members of an RNA-rich cluster. White squares indicate that members of this cluster do not have homologues in this species. Shaded squares indicate that members of this cluster have homologues in this species (darker: higher BLAST scores, lighter: lower BLAST scores).
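As an illustration of what an "average phylogenetic profile" is, the sketch below averages the per-species scores of all members of a cluster. It assumes that each member profile is a list of scores in the same species order, with 0 meaning no homologue; the score scale and the function name average_profile are assumptions, since the normalization of BLAST scores is not described on this page.

```python
def average_profile(profiles):
    """Return the element-wise mean of equal-length per-species score lists."""
    if not profiles:
        raise ValueError("at least one profile is required")
    n = len(profiles)
    # zip(*profiles) groups the scores of all members species by species.
    return [sum(scores) / n for scores in zip(*profiles)]


# Two invented member profiles over three species.
example_profiles = [
    [0.0, 1.0, 0.5],
    [0.0, 0.5, 0.5],
]

if __name__ == "__main__":
    print(average_profile(example_profiles))
```

A value of 0 in the result would correspond to a white square (no member has a homologue in that species), and larger values to darker squares.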

Moving the cursor across columns shows Gene Ontology term biases in the coding sequences (CDS) and a contents summary for the corresponding clusters.

Clicking on a cluster column leads to the corresponding cluster contents page.

6. File formats

CSV: The comma-separated values (CSV) format stores tabular data as text, with items separated by commas. CSV files can be opened with any spreadsheet program.
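Exported CSV files can also be read programmatically. The following sketch uses the Python standard csv module and the column order given for the cluster exports (name, coordinates, GenBank ID of the chromosome, description). The absence of a header row, the field names and the sample rows, including the coordinate notation, are assumptions made for illustration and should be checked against a real export.

```python
import csv
import io

# Assumed column order, taken from the description of the cluster exports.
FIELDS = ["name", "coordinates", "genbank_id", "description"]


def read_cluster_csv(handle):
    """Read the rows of a cluster CSV export into dictionaries keyed by FIELDS."""
    return [dict(zip(FIELDS, row)) for row in csv.reader(handle) if row]


# Invented sample content; a real file would be opened with open(path, newline="").
sample = io.StringIO(
    'tile,1000-1049,ID_PLACEHOLDER,"hypothetical element, intergenic"\n'
    'geneA,2000-2900,ID_PLACEHOLDER,example gene\n'
)
rows = read_cluster_csv(sample)

if __name__ == "__main__":
    for row in rows:
        print(row["name"], row["coordinates"], row["description"])
```

Using csv.reader rather than splitting on commas keeps quoted descriptions that contain commas in a single field.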

GFF3: GFF3 is a flat tab-delimited format used by most genome browsers. The first line of the file is a comment that identifies the file format and version. This is followed by a series of data lines, each one of which corresponds to an annotation:

Column 1: "seqid": The ID of the landmark used to establish the coordinate system for the current feature (here "Genbank id").

Column 2: "source": a free text qualifier intended to describe the algorithm or operating procedure that generated this feature (here "NAPP" or "Genbank").

Column 3: "type": The type of the feature (here the type of element: "TILE", "rRNA", "tRNA", "CDS" or "CONTIG").

Columns 4 & 5: "start" and "end": the start and end of the feature relative to the landmark given in column 1 (here start and end of element).

Column 6: "score": The score of the feature (here ".", meaning that no score is available).

Column 7: "strand": The strand of the feature. + for positive strand, - for minus strand, and . for features that are not stranded.

Column 8: "phase": For features of type "CDS", the phase indicates where the feature begins with reference to the reading frame (here "." or "0").

Column 9: "attributes": A list of feature attributes in the format tag=value. Multiple tag=value pairs are separated by semicolons. Here attributes include, depending on the element type: Internal NAPP ID, NAPP cluster number, Genbank annotation, RFAM match.
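The nine columns described above can be parsed with a few lines of Python. This is a minimal sketch rather than a full GFF3 parser: the function names are hypothetical, and the sample line, including its sequence ID and the attribute tags "ID" and "cluster", is invented for illustration, since the exact tags written by NAPP are not listed here.

```python
from urllib.parse import unquote

GFF3_COLUMNS = ["seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes"]


def parse_gff3_line(line):
    """Parse one tab-delimited GFF3 data line into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(fields))
    record = dict(zip(GFF3_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # Column 9 holds tag=value pairs separated by semicolons.
    attributes = {}
    for pair in record["attributes"].split(";"):
        if "=" in pair:
            tag, value = pair.split("=", 1)
            attributes[tag] = unquote(value)  # GFF3 values may be percent-encoded
    record["attributes"] = attributes
    return record


def read_gff3(lines):
    """Parse all data lines, skipping blank lines and '#' comment lines."""
    return [parse_gff3_line(l) for l in lines if l.strip() and not l.startswith("#")]


# Invented sample content; the tags in column 9 are placeholders.
example_lines = [
    "##gff-version 3\n",
    "SEQ_PLACEHOLDER\tNAPP\tTILE\t1000\t1049\t.\t.\t.\tID=tile_1;cluster=3\n",
]

if __name__ == "__main__":
    for rec in read_gff3(example_lines):
        print(rec["type"], rec["start"], rec["end"], rec["attributes"])
```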

7. References

[1] Ott A, Idali A, Marchais A & Gautheret D. (2011) NAPP: the nucleic acid phylogenetic profile database. Nucl. Acids Res. [Epub Ahead of print]

[2] Marchais A, Naville M, Bohn C, Bouloc P & Gautheret D. (2009) Single-Pass Classification of all Non-Coding Sequences in a Bacterial Genome Using Phylogenetic Profiles. Genome Res. 19:1084-92.

[3] Marchais A, Duperrier S, Durand S, Gautheret D & Stragier P. (2011) CsfG, a sporulation-specific, small non-coding RNA highly conserved in endospore formers. RNA Biol. 8(3).

For comments or questions, please e-mail napp.biologie@u-psud.fr