
PneumoWiki – The Wiki way of Streptococcus pneumoniae annotation

PneumoWiki is based on a pan-genome approach, currently including 41 Streptococcus pneumoniae strains. In addition to the pan-genome data, detailed information on the genes and proteins of S. pneumoniae strains TIGR4, D39, D39V [1], Hungary19A-6 and EF3030 is provided. The TIGR4 genome is used as a reference genome in the RegPrecise database [2]. The PneumoWiki gene pages contain information from a variety of sources, including data on protein function and localization, transcriptional regulation, and gene expression. Orthologous genes of the individual strains are linked by a pan-genome gene identifier and a unified gene name. The main objective of PneumoWiki is to facilitate the transfer of knowledge gained in studies with different S. pneumoniae strains, thus supporting functional annotation and a better understanding of this organism.

The scientific community is invited to extend the current information. This can be done via the Edit icon at the top right of each page or via the [edit] links at the top of each section or inside some sections. Changes can only be made by registered users, who can easily add information to the PneumoWiki as described here. A user account can be created by using the button with the small down arrow next to the Log in link.

Information preceded by a filled bullet point is fed from a database behind the PneumoWiki and cannot be changed. This is because sequence-based information in particular should be maintained consistently with the RefSeq annotation and other databases. An [edit] link makes it possible to add user-generated content or comments. User-added information is indicated by open bullets.

The Main Page presents an easy-to-use search mask, in which the user can search for locus tags, gene symbols or keywords. The [Getting Started] button brings you to the present page.

The row of tabs at the top of the PneumoWiki pages allows switching between the pan-genome pages and the strain-specific pages for a gene. The pages are linked via an orthologue reference table that was constructed as part of this work and can be downloaded using the [Downloads] button. The following sections explain what kind of information you can find in the PneumoWiki and where this information comes from.

The S. pneumoniae Pan Genome and the PneumoWiki Pan Genome Pages

For a comparative analysis of S. pneumoniae strains, an overall alignment of a publicly available set of 41 strain-specific S. pneumoniae genomes from the NCBI RefSeq database was performed.

On this basis, a positionally corrected pan-genome containing all genes was constructed. It was used to assign homologous genes (at least 50% identity at the DNA level) of different S. pneumoniae strains to orthologous gene groups, each identified by a so-called unified pan-genome gene ID. The corresponding pan-genome gene symbols are species-wide unified gene names, based primarily on the gene symbols of the D39V annotation [1] and supplemented by symbols of other strains (in particular TIGR4, D39, Hungary19A-6 and EF3030) from previous and current RefSeq annotations [3].
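The 50% identity criterion can be illustrated with a minimal sketch. It assumes two DNA sequences that are already aligned to equal length, with "-" marking gaps, and it assumes that identity is counted only over columns in which neither sequence has a gap. The source does not state how identity was computed, so this definition and the function names percent_identity and passes_threshold are illustrative only and are not the method used to build the PneumoWiki pan-genome.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over gap-free columns of two aligned sequences.

    Assumption: both inputs come from the same alignment and therefore
    have equal length; '-' marks a gap.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # skip columns that contain a gap
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def passes_threshold(aligned_a: str, aligned_b: str, minimum: float = 50.0) -> bool:
    """True if the pair reaches the minimum percent identity."""
    return percent_identity(aligned_a, aligned_b) >= minimum
```

With this definition, a pair such as "ATGC" and "ATGA" shares three of four compared positions and would pass a 50% cutoff.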

The pan-genome was determined by André Hennig (Group of Kay Nieselt, Center for Bioinformatics, University of Tübingen) using a progressiveMauve-based [4] whole-genome alignment, followed by an iterative refinement of orthologue groups with special attention to gene synteny using OrthologyPredator and PanGee (L. Backert, A. Hennig, unpublished data).

The so-called pan-genome pages, which can be accessed via the leftmost tab at the top of the PneumoWiki pages, present the following information:

Each page starts with a unique pan-genome gene ID (pan ID), which results from the order of all genes within the S. pneumoniae pan-genome, followed by the pan-genome gene symbol and a list of gene descriptions extracted from the RefSeq annotations of all 41 S. pneumoniae genomes. Next come the strand and the start and end positions of the pan-genome gene within the positionally standardized pan-genome. The occurrence entry shows in how many of the 41 S. pneumoniae genomes used for the pan-genome construction the gene has an orthologous gene.

In the following Orthologs section, all 41 S. pneumoniae strains are listed in a fixed order and, if the gene is present in the respective strain, the locus tag is shown together with the strain-specific gene name, if assigned, from the RefSeq annotation.
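The relationship between the Orthologs section and the occurrence entry can be sketched as a small data structure. This is a minimal illustration, not the PneumoWiki database schema: the class name PanGene, its field names, and all identifiers in the example (pan ID, gene symbol, locus tags) are hypothetical placeholders.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class PanGene:
    """One pan-genome entry (hypothetical layout for illustration)."""
    pan_id: str
    symbol: str
    # strain name -> locus tag, or None if the gene is absent in that strain
    orthologs: Dict[str, Optional[str]] = field(default_factory=dict)

    def occurrence(self) -> int:
        """Number of strains in which the gene has an orthologue."""
        return sum(1 for tag in self.orthologs.values() if tag is not None)

    def locus_tag(self, strain: str) -> Optional[str]:
        """Locus tag in the given strain, or None if absent or unknown."""
        return self.orthologs.get(strain)


# Placeholder values only; these are not real PneumoWiki identifiers.
example = PanGene(
    pan_id="PAN_EXAMPLE_0001",
    symbol="geneX",
    orthologs={"TIGR4": "TAG_T_0001", "D39": "TAG_D_0001", "EF3030": None},
)
```

Here example.occurrence() counts the two strains that carry a locus tag, and example.locus_tag("EF3030") returns None because the gene is recorded as absent in that strain.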

The Genome Viewer shows the respective genomic region of the TIGR4, D39, D39V, Hungary19A-6 and EF3030 strains. The colors of the gene arrows encode gene function assignments (see table below), which were made using a collection of TIGRFAMs Hidden Markov Models.

Meta Function            Gene Functional Class (TIGRFam Main Role)                      Color Code
-----------------------  -------------------------------------------------------------  ----------
Envelope                 Cell envelope
Cellular processes       Cellular processes
Metabolism               Amino acid biosynthesis
                         Biosynthesis of cofactors, prosthetic groups, and carriers
                         Central intermediary metabolism
                         Energy metabolism
                         Fatty acid and phospholipid metabolism
                         Purines, pyrimidines, nucleosides, and nucleotides
                         Transport and binding proteins
Genetic Info processing  DNA metabolism
                         Mobile and extrachromosomal element functions
                         Protein fate
                         Protein synthesis
                         Transcription
Regulation               Regulatory functions
                         Signal transduction
Unknown function         Hypothetical proteins
                         Unknown function
RNAs                     RNA genes
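The grouping in the table above can be expressed as a lookup from TIGRFam main role to meta function. The sketch below takes its entries from the table as read here; the names META_FUNCTIONS, MAIN_ROLE_TO_META and meta_function, and the fallback to "Unknown function" for unlisted roles, are assumptions made for illustration. Color values are not included because they are not given in the text.

```python
from typing import Dict, List

# Meta function -> TIGRFam main roles, as listed in the table.
META_FUNCTIONS: Dict[str, List[str]] = {
    "Envelope": ["Cell envelope"],
    "Cellular processes": ["Cellular processes"],
    "Metabolism": [
        "Amino acid biosynthesis",
        "Biosynthesis of cofactors, prosthetic groups, and carriers",
        "Central intermediary metabolism",
        "Energy metabolism",
        "Fatty acid and phospholipid metabolism",
        "Purines, pyrimidines, nucleosides, and nucleotides",
        "Transport and binding proteins",
    ],
    "Genetic Info processing": [
        "DNA metabolism",
        "Mobile and extrachromosomal element functions",
        "Protein fate",
        "Protein synthesis",
        "Transcription",
    ],
    "Regulation": ["Regulatory functions", "Signal transduction"],
    "Unknown function": ["Hypothetical proteins", "Unknown function"],
    "RNAs": ["RNA genes"],
}

# Inverted index: main role -> meta function.
MAIN_ROLE_TO_META: Dict[str, str] = {
    role: meta for meta, roles in META_FUNCTIONS.items() for role in roles
}


def meta_function(main_role: str) -> str:
    """Meta function for a main role; unlisted roles fall back to
    'Unknown function' (an assumption of this sketch)."""
    return MAIN_ROLE_TO_META.get(main_role, "Unknown function")
```

For example, meta_function("Energy metabolism") yields "Metabolism", which is the level at which the gene arrows in the Genome Viewer are grouped.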

The Strain Specific Pages

The Summary section at the top of each page contains the locus tag, the gene name and the function of the gene product from the RefSeq annotation, as well as the pan locus tag and the pan gene symbol. The following Genome View (rendered as SVG vector graphics) provides condensed genome information, initially aligned to the position of the respective gene. The position in the genome browser can be changed by dragging the slider. Clicking on a gene arrow takes the user to the corresponding gene page, which makes it possible to walk through the genome page by page. Colors correspond to the gene functional categories described above. For S. pneumoniae strains, the genome browser combines the earlier, well-established RefSeq annotation with the newly introduced RefSeq annotation, thus directly showing the differences in gene content and/or coordinates.

The Gene section contains the information about the gene. It covers basic information as in the Summary section, complemented by the gene coordinates, gene length, essentiality, DNA sequence, and external accession numbers with links to the gene-specific database entries.

The largest section of the page, the Protein section, is devoted to the encoded protein. It shows, among other things, the protein length, the molecular weight and isoelectric point, protein function assignments (see next paragraph), and subcellular localization. Finally, the Protein section contains database links (NCBI Protein database and UniProt), the protein sequence and experimental data, e.g. on protein localization and interaction partners.

Functional assignments have been generated as follows:

  1. For enzymes the catalytic activity is specified by the EC number (extracted from the NCBI RefSeq database), complemented by the corresponding enzyme name and reaction equation (extracted from ExPASy).
  2. The assignment of protein sequences to TIGRFAMs protein families [5] is based on TIGRFAMs Hidden Markov Models (HMMs) and uses hmmscan from the HMMER3 software package [6]. TIGRFAMs are displayed in order of their HMM scores, which serve as a significance measure of the assignment. The display has a tree-like structure that includes the TIGR role categories (main role and sub role) and an added meta level summarizing the TIGR main roles (see also color table above).
  3. As described for TIGRFAMs, the assignment of sequences to Pfam protein families [7] is based on HMMs and uses the HMMER package. Pfams with the highest HMM scores are shown first. A large proportion of Pfams are grouped into clans (evolutionarily related families), which are displayed on top of the Pfam annotation.
  4. Assignment of predicted protein functions is obtained from the SEED database [8], a comparative genomics database based on expert annotation of subsystems (sets of related functional roles). By default the lists of assigned predicted functions are collapsed and show only the hit with the highest score, but can be expanded by clicking the plus sign.
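The score-based ordering described for TIGRFAMs and Pfam can be sketched as follows. The example assumes that hmmscan was run with its --tblout option, which writes one whitespace-delimited line per hit with the family name in the first column, the query name in the third, and the full-sequence E-value and bit score in the fifth and sixth. The names Hit, parse_tblout and rank_hits are hypothetical, and the sample lines are invented for illustration; they are not PneumoWiki data and this is not the PneumoWiki pipeline itself.

```python
from typing import Iterable, List, NamedTuple


class Hit(NamedTuple):
    family: str    # target (HMM) name
    query: str     # query protein name
    evalue: float  # full-sequence E-value
    score: float   # full-sequence bit score


def parse_tblout(lines: Iterable[str]) -> List[Hit]:
    """Parse hit lines of an hmmscan --tblout table, skipping comments."""
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        # The last column (description) may contain spaces, so limit the split.
        cols = line.split(maxsplit=18)
        hits.append(Hit(cols[0], cols[2], float(cols[4]), float(cols[5])))
    return hits


def rank_hits(hits: Iterable[Hit]) -> List[Hit]:
    """Order hits so that the highest HMM score comes first."""
    return sorted(hits, key=lambda h: h.score, reverse=True)


# Invented sample lines in tblout layout (placeholder names and numbers).
SAMPLE = [
    "# target name  accession  query name  accession  E-value  score  bias ...",
    "FAM_A - protein_1 - 1.0e-20 70.5 0.1 2.0e-20 69.8 0.1 1.0 1 0 0 1 1 1 1 family A",
    "FAM_B - protein_1 - 1.0e-50 170.2 0.2 2.0e-50 169.0 0.2 1.0 1 0 0 1 1 1 1 family B",
]

ranked = rank_hits(parse_tblout(SAMPLE))
```

In this sample, FAM_B is listed before FAM_A because it has the higher score, which mirrors the display order described in the list above.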

The next section of the gene page, Expression & Regulation, provides the predicted operon structure obtained from MicrobesOnline [9], information on the regulation of gene expression, and gene expression profiles. Data on transcription factor regulons were retrieved from the RegPrecise database [2] and published literature.

All data are provided with links to the external data sources, including various databases and published literature. References are indicated by the book symbol. Details of the corresponding publication are displayed by mouse-over. A list of additional literature is found under "Relevant publications" at the bottom of the page.

References

  1. Jelle Slager, Rieza Aprianto, Jan-Willem Veening
    Deep genome annotation of the opportunistic human pathogen Streptococcus pneumoniae D39.
    Nucleic Acids Res: 2018, 46(19);9971-9989
    [PubMed:30107613]
  2. Pavel S Novichkov, Alexey E Kazakov, Dmitry A Ravcheev, Semen A Leyn, Galina Y Kovaleva, Roman A Sutormin, Marat D Kazanov, William Riehl, Adam P Arkin, Inna Dubchak, Dmitry A Rodionov
    RegPrecise 3.0--a resource for genome-scale exploration of transcriptional regulation in bacteria.
    BMC Genomics: 2013, 14;745
    [PubMed:24175918]
  3. Tatiana Tatusova, Michael DiCuccio, Azat Badretdin, Vyacheslav Chetvernin, Eric P Nawrocki, Leonid Zaslavsky, Alexandre Lomsadze, Kim D Pruitt, Mark Borodovsky, James Ostell
    NCBI prokaryotic genome annotation pipeline.
    Nucleic Acids Res: 2016, 44(14);6614-24
    [PubMed:27342282]
  4. Aaron E Darling, Bob Mau, Nicole T Perna
    progressiveMauve: multiple genome alignment with gene gain, loss and rearrangement.
    PLoS One: 2010, 5(6);e11147
    [PubMed:20593022]
  5. Daniel H Haft, Jeremy D Selengut, Roland A Richter, Derek Harkins, Malay K Basu, Erin Beck
    TIGRFAMs and Genome Properties in 2013.
    Nucleic Acids Res: 2013, 41(Database issue);D387-95
    [PubMed:23197656]
  6. Robert D Finn, Jody Clements, Sean R Eddy
    HMMER web server: interactive sequence similarity searching.
    Nucleic Acids Res: 2011, 39(Web Server issue);W29-37
    [PubMed:21593126]
  7. Robert D Finn, Penelope Coggill, Ruth Y Eberhardt, Sean R Eddy, Jaina Mistry, Alex L Mitchell, Simon C Potter, Marco Punta, Matloob Qureshi, Amaia Sangrador-Vegas, Gustavo A Salazar, John Tate, Alex Bateman
    The Pfam protein families database: towards a more sustainable future.
    Nucleic Acids Res: 2016, 44(D1);D279-85
    [PubMed:26673716]
  8. Ross Overbeek, Tadhg Begley, Ralph M Butler, Jomuna V Choudhuri, Han-Yu Chuang, Matthew Cohoon, Valérie de Crécy-Lagard, Naryttza Diaz, Terry Disz, Robert Edwards, Michael Fonstein, Ed D Frank, Svetlana Gerdes, Elizabeth M Glass, Alexander Goesmann, Andrew Hanson, Dirk Iwata-Reuyl, Roy Jensen, Neema Jamshidi, Lutz Krause, Michael Kubal, Niels Larsen, Burkhard Linke, Alice C McHardy, Folker Meyer, Heiko Neuweger, Gary Olsen, Robert Olson, Andrei Osterman, Vasiliy Portnoy, Gordon D Pusch, Dmitry A Rodionov, Christian Rückert, Jason Steiner, Rick Stevens, Ines Thiele, Olga Vassieva, Yuzhen Ye, Olga Zagnitko, Veronika Vonstein
    The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes.
    Nucleic Acids Res: 2005, 33(17);5691-702
    [PubMed:16214803]
  9. Paramvir S Dehal, Marcin P Joachimiak, Morgan N Price, John T Bates, Jason K Baumohl, Dylan Chivian, Greg D Friedland, Katherine H Huang, Keith Keller, Pavel S Novichkov, Inna L Dubchak, Eric J Alm, Adam P Arkin
    MicrobesOnline: an integrated portal for comparative and functional genomics.
    Nucleic Acids Res: 2010, 38(Database issue);D396-400
    [PubMed:19906701]