Humboldt-Universität zu Berlin - Mathematisch-Naturwissenschaftliche Fakultät - Wissensmanagement in der Bioinformatik

Poster for ECCB 2004 - COLUMBA - A database of annotated protein structures

COLUMBA - A database of annotated protein structures


Silke Trissl (1), Kristian Rother (2), Ulf Leser
(1) trissl@informatik.hu-berlin.de, Institute of Informatics, Humboldt-Universität zu Berlin, Germany
(2) kristian.rother@charite.de, Institute of Biochemistry, Charité, Berlin, Germany

Short abstract:

We present COLUMBA, a database of information on protein structures that integrates data from twelve different biological databases, including ENZYME, KEGG, SCOP, CATH, DSSP, and SwissProt. COLUMBA allows for the quick computation of sets of protein structures that share interesting properties according to the different data sources.


Long abstract:

The number of protein structures deposited in the Protein Data Bank (PDB; Berman et al. 2000) is increasing rapidly. This allows researchers in the life sciences to study complex relationships between macromolecular structures and their properties, such as biological function, folding classification, or secondary structure. Such studies require not only the three-dimensional (3D) structures but also the folding classification and several other properties of each protein. Gathering this information from web resources by following hyperlinks is tedious and time-consuming.
We have created COLUMBA (Rother et al. 2004), a database of information on protein structures that physically integrates information from twelve different data sources into a single relational data warehouse. We enrich the protein structures from the PDB with

  • structure classifications from SCOP and CATH,
  • computed secondary structures from the DSSP program,
  • functional annotation from ENZYME and GO,
  • participation in metabolic pathways from KEGG and the Boehringer map,
  • taxonomic information from the NCBI Taxonomy,
  • and further information from SwissProt.

In addition, each chain is assigned to a cluster of similar sequences by PISCES and SYSTERS.
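
The idea of physically integrating several sources into one relational warehouse can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the table names, the column names, and the placeholder rows are hypothetical and are not taken from the actual COLUMBA schema. It only shows how annotations from different sources, once keyed on the same chain, can be collected with a single query instead of by following hyperlinks.

```python
import sqlite3

# Hypothetical, heavily simplified schema. Table and column names are
# illustrative only and do not reproduce the real COLUMBA schema.
SCHEMA = """
CREATE TABLE pdb_entry (pdb_id TEXT PRIMARY KEY, resolution REAL);
CREATE TABLE chain (chain_id INTEGER PRIMARY KEY,
                    pdb_id TEXT REFERENCES pdb_entry(pdb_id),
                    chain_name TEXT);
CREATE TABLE scop_annotation (chain_id INTEGER REFERENCES chain(chain_id),
                              fold TEXT);
CREATE TABLE enzyme_annotation (chain_id INTEGER REFERENCES chain(chain_id),
                                ec_number TEXT);
"""


def build_demo_warehouse():
    """Create an in-memory warehouse filled with made-up placeholder rows."""
    con = sqlite3.connect(":memory:")
    con.executescript(SCHEMA)
    con.executemany("INSERT INTO pdb_entry VALUES (?, ?)",
                    [("demo1", 1.8), ("demo2", 2.5)])
    con.executemany("INSERT INTO chain VALUES (?, ?, ?)",
                    [(1, "demo1", "A"), (2, "demo2", "A")])
    con.executemany("INSERT INTO scop_annotation VALUES (?, ?)",
                    [(1, "fold X"), (2, "fold Y")])
    con.executemany("INSERT INTO enzyme_annotation VALUES (?, ?)",
                    [(1, "0.0.0.1")])
    return con


def annotations_for(con, pdb_id):
    """Collect the annotations of one entry from all sources in one query."""
    rows = con.execute(
        """SELECT c.chain_name, s.fold, e.ec_number
           FROM chain c
           LEFT JOIN scop_annotation s ON s.chain_id = c.chain_id
           LEFT JOIN enzyme_annotation e ON e.chain_id = c.chain_id
           WHERE c.pdb_id = ?
           ORDER BY c.chain_name""",
        (pdb_id,),
    )
    return rows.fetchall()
```

The left joins keep a chain in the result even when one source has no annotation for it, which matters because not every structure is covered by every data source.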

Web interface

We have created a user-friendly web interface, which is available at http://www.columba-db.de. It supports full-text search as well as data-source-specific queries, and it follows a "query refinement" paradigm to return the set of PDB entries that fulfill the stated conditions. A query is defined by entering restriction conditions in the form for the annotation of a specific data source. The user can combine queries on different data sources, which act as filters, to obtain the desired subset of PDB entries. The interface supports interactive and exploratory use: conditions can easily be added, deleted, tightened, or relaxed. A header called the "filter chain" shows the number of PDB entries remaining after each filter step.
The result set gives basic information on each entry returned. For any single entry, the user can open a detailed view that shows all annotation COLUMBA holds for it.
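
The query refinement paradigm and the filter chain header can be sketched as a sequence of predicates applied one after another, with the number of surviving entries recorded after each step. This is a minimal illustration and not the implementation behind the web interface; the names `Filter`, `apply_filter_chain`, and `DEMO_ENTRIES`, as well as the placeholder entries, are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Entry = Dict[str, object]

# Made-up placeholder entries, not real PDB records.
DEMO_ENTRIES: List[Entry] = [
    {"pdb_id": "demo1", "fold": "fold X", "resolution": 1.8},
    {"pdb_id": "demo2", "fold": "fold X", "resolution": 2.5},
    {"pdb_id": "demo3", "fold": "fold Y", "resolution": 1.5},
]


@dataclass
class Filter:
    """One condition on one data source, with a label for the header."""
    label: str
    predicate: Callable[[Entry], bool]


def apply_filter_chain(
    entries: List[Entry], filters: List[Filter]
) -> Tuple[List[Entry], List[Tuple[str, int]]]:
    """Apply the filters in order and record the entry count after each."""
    trace = [("all entries", len(entries))]
    current = list(entries)
    for f in filters:
        current = [e for e in current if f.predicate(e)]
        trace.append((f.label, len(current)))
    return current, trace


if __name__ == "__main__":
    chain = [
        Filter("fold = fold X", lambda e: e["fold"] == "fold X"),
        Filter("resolution < 2.0", lambda e: e["resolution"] < 2.0),
    ]
    result, trace = apply_filter_chain(DEMO_ENTRIES, chain)
    for label, count in trace:
        print(f"{label}: {count}")
```

Because each filter is an independent item in a list, adding, deleting, or relaxing a condition only changes that list, and the counts are recomputed by running the chain again.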

Applications

Through the web interface it is fairly simple to answer the following two questions:

  • Which structures contain chains with a TIM-barrel fold and have a resolution better than 2.0 Å?
  • Which proteins in the citric acid cycle have a resolved structure?

The first query is answered by entering the phrase 'TIM barrel' in one of the Protein Fold forms (as a fold in SCOP or as a keyword in CATH) and then entering '2.0' as the resolution condition in the PDB Structure form. This currently yields the desired set of 370 entries for CATH and 381 entries for SCOP.
The second query can be answered by using the Metabolism form of COLUMBA. The option 'path coverage' not only shows the enzymes participating in the selected pathway, but also the number of structures known for each enzyme.
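
The kind of summary that 'path coverage' provides can be sketched as a count of distinct structures per enzyme of a pathway. The function below is a hedged illustration only: `path_coverage`, its inputs, and the placeholder identifiers in the usage example are hypothetical, and the code does not claim to reproduce how COLUMBA computes this view.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def path_coverage(
    pathway_enzymes: List[str],
    structure_annotations: Iterable[Tuple[str, str]],
) -> Dict[str, int]:
    """Count distinct structures for every enzyme of a pathway.

    pathway_enzymes: enzyme identifiers that take part in the pathway.
    structure_annotations: (structure_id, enzyme_id) pairs.
    Enzymes without any known structure are reported with a count of 0.
    """
    wanted = set(pathway_enzymes)
    structures = defaultdict(set)
    for structure_id, enzyme_id in structure_annotations:
        if enzyme_id in wanted:
            # A set ignores repeated annotations of the same structure.
            structures[enzyme_id].add(structure_id)
    return {enzyme: len(structures[enzyme]) for enzyme in pathway_enzymes}


if __name__ == "__main__":
    # Placeholder identifiers, not real enzymes or PDB entries.
    pathway = ["enzyme_A", "enzyme_B", "enzyme_C"]
    annotations = [
        ("demo1", "enzyme_A"),
        ("demo2", "enzyme_A"),
        ("demo1", "enzyme_A"),
        ("demo3", "enzyme_C"),
        ("demo4", "enzyme_Z"),
    ]
    print(path_coverage(pathway, annotations))
```

Reporting a zero for enzymes without any structure is the point of such a view: it makes the gaps in a pathway visible alongside the enzymes that are already covered.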

References

Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N., and Bourne, P.E. 2000. The Protein Data Bank. Nucleic Acids Research 28: 235-242.
Rother, K., Müller, H., Trissl, S., Koch, I., Steinke, T., Preissner, R., Frömmel, C., and Leser, U. 2004. COLUMBA: Multidimensional Data Integration of Protein Annotations. In E. Rahm (Ed.): DILS 2004, LNBI 2994, 156-171.