Protein Sequence Databases

Protein sequence databases contain the amino acid translations extracted from nucleotide sequence database records that are annotated with one or more coding DNA sequence (CDS) features, together with experimental results reported in the published literature. This section provides a nonexhaustive list of protein sequence databases including PIR, PRF, RefSeq, Swiss-Prot, and TrEMBL.
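As a minimal sketch of how an amino acid translation can be derived from a CDS feature, the Python function below translates a coding sequence using the standard genetic code. The name translate_cds and the example sequences are illustrative assumptions and do not belong to any database's own tooling. Real records may specify alternative translation tables, partial codons, or translation exceptions, none of which this sketch handles.

```python
from itertools import product

# Standard genetic code, with bases ordered T, C, A, G at each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds: str) -> str:
    """Translate a coding sequence into one-letter amino acid codes.

    Translation stops at the first stop codon. Codons containing
    ambiguous bases are reported as 'X'.
    """
    cds = cds.upper().replace("U", "T")  # accept RNA input as well
    if len(cds) % 3 != 0:
        raise ValueError("CDS length must be a multiple of three")
    protein = []
    for i in range(0, len(cds), 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")
        if aa == "*":  # stop codon ends the translation
            break
        protein.append(aa)
    return "".join(protein)


if __name__ == "__main__":
    print(translate_cds("ATGGCCTAA"))
```

With the hypothetical input ATGGCCTAA, the function returns MA: ATG encodes methionine, GCC encodes alanine, and TAA is a stop codon.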

NAME: PIR, Protein Information Resource [1, 2]

DESC: The PIR is a worldwide protein information resource that is composed of a number of databases and computational tools designed for the identification and analysis of protein sequences. The PIR protein sequence database (PIR-PSD) contains information concerning all naturally occurring, wild-type proteins whose sequence is known. The PIR nonredundant protein sequence database (PIR-NREF) provides comprehensive, nonredundant data uniquely organized by homology and taxonomy. The WWW server provides keyword searching as well as sequence similarity searching against PIR-PSD and PIR-NREF.

GROUP: National Biomedical Research Foundation, Washington, DC, U.S.

EMAIL: [email protected]

WWW: http://www-nbrf.georgetown.edu/

QUERY: http://www-nbrf.georgetown.edu/pirwww/search/textpsd.shtml and http://www-nbrf.georgetown.edu/pirwww/search/pirnref.shtml

FTP: ftp://ftp.pir.georgetown.edu/pir_databases/, ftp://ftp.infobiogen.fr/pub/db/pir, ftp://ftp.ebi.ac.uk/databases/pir/, and ftp://ftp.ncbi.nih.gov/repository/PIR/ to download the database.

NAME: PRF/SEQDB, Protein Research Foundation/SEQuence DataBase

DESC: The PRF protein sequence database contains amino acid sequences of peptides and proteins, including sequences predicted from genes, together with manual annotations of amino acids, peptides, and proteins.

GROUP: Protein Research Foundation, Osaka, Japan

EMAIL: [email protected]

WWW: http://www.prf.or.jp/

QUERY: http://www.prf.or.jp/en/os.html to search the Amino Acid Sequence Database using short sequence segments (fewer than 20 amino acids) as probes.

FTP: ftp://ftp.genome.ad.jp/pub/db/genomenet/ to download the database

NAME: RefSeq, Reference Sequence Database [3]

DESC: The RefSeq contains nonredundant sets of sequences, including genomic DNA, transcript RNA, and protein products. The RefSeq NPs are a reference set of protein sequences, and the RefSeq XPs are a reference set of Homo sapiens model proteins provided by the human genome annotation process.

GROUP: NCBI, National Center for Biotechnology Information, U.S.

EMAIL: [email protected]

WWW: http://www.ncbi.nih.gov/RefSeq/

QUERY: http://www.ncbi.nih.gov/Entrez/ for the Entrez-based database retrieval and http://www.ncbi.nih.gov/BLAST to search the database using BLAST algorithms.

FTP: ftp://ftp.ncbi.nih.gov/refseq/ to download the database.

NAME: Swiss-Prot [4, 5]

DESC: The Swiss-Prot is a curated protein sequence database that strives to provide a high level of annotation (such as the description of the function of a protein, its domain structure, posttranslational modifications, variants, etc.), a minimal level of redundancy, and a high level of integration with other databases.

GROUP: Swiss Institute of Bioinformatics (SIB), Geneva, Switzerland

EMAIL: [email protected]

WWW: http://www.expasy.ch/sprot/sprot-top.html

QUERY: The WWW server (http://www.expasy.ch/cgi-bin/sprot-search-ful/, http://www.infobiogen.fr/srs/, and http://www.ebi.ac.uk/swissprot/) is available for keyword searching and sequence similarity searching. The Swiss-Shop (http://www.expasy.org/swiss-shop/) is available for an automatic database retrieval service against the noncumulative weekly additions of new protein sequences to the Swiss-Prot.

FTP: ftp://ftp.expasy.ch/databases/swiss-prot/, ftp://ftp.infobiogen.fr/pub/db/swissprot/, and ftp://ftp.ebi.ac.uk/pub/databases/swissprot/ to download the database.

NAME: TrEMBL, Translation from EMBL [6]

DESC: The TrEMBL is a protein sequence database derived from translations of the coding sequences in the EMBL nucleotide sequence database. TrEMBL is split into two main sections: SP-TrEMBL and REM-TrEMBL. SP-TrEMBL (Swiss-Prot TrEMBL) contains the entries that should be incorporated into Swiss-Prot. REM-TrEMBL (REMaining TrEMBL) contains the entries that EMBL does not want to include in Swiss-Prot for a variety of reasons.

GROUP: EMBL Outstation - European Bioinformatics Institute (EBI), U.K.

EMAIL: [email protected]

WWW: http://www.ebi.ac.uk/trembl/

QUERY: http://www.ebi.ac.uk/trembl/access.html and http://srs.embl-heidelberg.de:8000/

FTP: ftp://ftp.ebi.ac.uk/pub/databases/trembl/

4.2.2.2 Protein Family Databases

The rapid expansion of the nucleotide sequence databases has caused a massive influx of data into the protein sequence databases, which in turn has led to a corresponding influx of data into the protein family databases. This section provides a nonexhaustive list of protein domain, family, motif, and fingerprint databases, whose entries were delineated by assessing the computational results of automatic classification of protein sequences with sequence similarity/homology programs.

NAME: InterPro [7]

DESC: The InterPro is a database of protein families, domains, and functional sites in which identifiable features found in known proteins can be applied to unknown protein sequences. The InterPro contains high-quality annotations and cross references to other protein family databases including Pfam, PRINTS, ProDom, SMART, TIGRFAMs, and PROSITE as well as Swiss-Prot and TrEMBL. The InterPro database has almost 3,000 families classified by expert curators.

GROUP: EMBL Outstation - European Bioinformatics Institute (EBI), U.K.

EMAIL: [email protected]

WWW: http://www.ebi.ac.uk/interpro/ for a service for biological sequence analysis

QUERY: http://www.ebi.ac.uk/interproscan/ for InterProScan sequence similarity search against InterPro [8].

FTP: ftp://ftp.ebi.ac.uk/pub/databases/interpro/

NAME: iProClass [9, 10]

DESC: The iProClass database is a nonredundant protein database organized according to family relationships as defined collectively by PROSITE patterns and PIR superfamilies. PROSITE patterns are defined as sequences (from Swiss-Prot) with a common function (http://pir.georgetown.edu/pirwww/search/pattern_help.html). PIR superfamilies are defined as sequences (from the PIR protein sequence database) with the same function in various organisms (http://pir.georgetown.edu/iproclass/description.html).

GROUP: Georgetown University Hospital, Washington, DC, U.S.

EMAIL: [email protected]

WWW: http://pir.georgetown.edu/iproclass

QUERY: http://pir.georgetown.edu/pirwww/search/searchseq.html

FTP: ftp://ftp.pir.georgetown.edu/pir_databases/iproclass/

NAME: Pfam, Protein Families [11]

DESC: The Pfam consists of two parts: Pfam-A and Pfam-B. The Pfam-A is a comprehensive collection of annotated protein domain families, including multiple sequence alignments and Hidden Markov Models (HMMs) covering many common protein domains. The Pfam-B is a supplement to the Pfam-A and contains a large number of small families automatically clustered from the ProDom database. The Pfam 8.0, which came out in February 2003, contains 5,193 protein families.

GROUP: The Sanger Centre, Hinxton, U.K.

EMAIL: [email protected]

WWW: http://www.sanger.ac.uk/Pfam/

QUERY: http://www.sanger.ac.uk/software/Pfam/

FTP: ftp://ftp.sanger.ac.uk/pub/databases/Pfam/

NAME: PIR-ALN [12]

DESC: The PIR-ALN is a database of protein sequence alignments. Alignments are of sequences in the same family (less than 55% different from each other), of sequences representing various families within a superfamily, or of sequence segments corresponding to the same homology domain in different proteins.

GROUP: National Biomedical Research Foundation, Washington, DC, U.S.

EMAIL: [email protected]

WWW: http://www-nbrf.georgetown.edu/pir/alndb.html

QUERY: http://www-nbrf.georgetown.edu/pirwww/search/searchseq.html

FTP: ftp://ftp.pir.georgetown.edu/pir_databases/other_databases/piraln/
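The family criterion in the PIR-ALN entry (sequences less than 55% different from each other) can be illustrated with a short Python sketch that computes the percent difference between two rows of an alignment. The function names, the example rows, and the decision to skip columns containing a gap are assumptions made for illustration only; the source does not state how PIR treats gaps or alignment ends when applying the threshold.

```python
def percent_difference(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent of differing residues between two rows of one alignment.

    Columns where either row has a gap are skipped (an assumption of
    this sketch, not a documented PIR rule).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned rows must have the same length")
    compared = 0
    mismatches = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == gap or y == gap:
            continue  # ignore gapped columns
        compared += 1
        if x != y:
            mismatches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * mismatches / compared


def same_family(aligned_a: str, aligned_b: str, threshold: float = 55.0) -> bool:
    """Apply the 'less than 55% different' criterion to two aligned rows."""
    return percent_difference(aligned_a, aligned_b) < threshold


if __name__ == "__main__":
    # Hypothetical aligned rows: 2 mismatches in 8 compared columns.
    print(percent_difference("MKTAYIAK", "MKSAYLAK"))
    print(same_family("MKTAYIAK", "MKSAYLAK"))
```

For the hypothetical rows MKTAYIAK and MKSAYLAK, two of the eight columns differ, so the percent difference is 25.0 and the pair falls under the family threshold.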

NAME: PRINTS [13]

DESC: The PRINTS is a protein motif fingerprint database. Each protein family is represented by a fingerprint, which is a series of ungapped multiple alignments corresponding to the conserved motifs. The PRINTS obtains protein sequences from the Swiss-Prot and TrEMBL databases.

GROUP: UMBER, University of Manchester Bioinformatics Education and Research, U.K.

EMAIL: [email protected]

WWW: http://umber.sbs.man.ac.uk/dbbrowser/PRINTS/

QUERY: http://umber.sbs.man.ac.uk/dbbrowser/fingerPRINTScan/ for the PRINTS similarity search, which finds the PRINTS fingerprints that most closely match a user-specified protein sequence [14].

FTP: ftp://ftp.bioinf.man.ac.uk/pub/prints/, ftp://ftp.ebi.ac.uk/pub/databases/prints/, ftp://ftp.embl-heidelberg.de/pub/databases/, and ftp://ftp.ncbi.nih.gov/repository/PRINTS

NAME: ProDom [15]

DESC: The ProDom is a database of homologous domain families automatically generated from Swiss-Prot and TrEMBL. The database provides graphical displays that link related families through their shared sequences and supports pairwise comparison with every sequence in each family. The database had 365,172 entries as of February 19, 2003.

GROUP: INRA/CNRS, Laboratoire de Biologie Moleculaire, France

EMAIL: [email protected]

WWW: http://protein.toulouse.inra.fr/prodom.html

QUERY: http://protein.toulouse.inra.fr/prodom/current/html/form.php and http://prodes.toulouse.inra.fr/srs6/ for navigation between ProDom, Swiss-Prot, TrEMBL, PROSITE, Pfam-A, InterPro, and PDB.

FTP: ftp://ftp.infobiogen.fr/pub/db/prodom/ and ftp://ftp.ebi.ac.uk/pub/databases/prodom/

NAME: PROSITE [16]

DESC: The PROSITE is a database of protein families and domains and obtains protein sequences from Swiss-Prot. The PROSITE database contains biologically significant sites, patterns, and profiles that help to reliably identify to which known family of protein (if any) a new sequence belongs and to look for small motifs found in nonhomologous contexts. The PROSITE 17.46, which came out on May 11, 2003, contains 1,187 documented entries that describe 1,625 different patterns, rules, and profiles/matrices. The WWW server provides keyword searching as well as pattern match searching for classification of protein sequences.

GROUP: Swiss Institute of Bioinformatics (SIB), Geneva, Switzerland

EMAIL: [email protected]

WWW: http://www.expasy.org/prosite/

QUERY: The ScanProsite (http://www.expasy.org/tools/scanprosite/) allows users to scan a sequence against PROSITE or a pattern against Swiss-Prot or PDB and visualize matches on structures [17].

FTP: ftp://ftp.expasy.ch/databases/prosite/, ftp://ftp.infobiogen.fr/pub/db/prosite/, and ftp://ftp.ebi.ac.uk/pub/databases/prosite/
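To illustrate the kind of pattern match searching described in the PROSITE entry above, the following Python sketch converts a PROSITE-style pattern into a regular expression and scans a sequence with it. This is not ScanProsite's implementation. The function names and the test sequence are hypothetical, and the converter covers only the basic syntax elements (x, [..], {..}, repetition counts, and the < and > anchors as separate elements); anchors placed inside square brackets are not handled.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Convert a basic PROSITE-style pattern into a Python regex string."""
    pattern = pattern.strip().rstrip(".")  # drop the trailing period
    parts = []
    for element in pattern.split("-"):
        element = element.replace("<", "^").replace(">", "$")  # terminal anchors
        element = element.replace("x", ".")                     # any residue
        element = element.replace("{", "[^").replace("}", "]")  # excluded residues
        element = element.replace("(", "{").replace(")", "}")   # repetition counts
        parts.append(element)
    return "".join(parts)


def scan_sequence(pattern: str, sequence: str) -> list:
    """Return (0-based start, matched text) pairs, including overlapping hits."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start(), m.group(1)) for m in regex.finditer(sequence.upper())]


if __name__ == "__main__":
    # N-glycosylation site pattern; the sequence is a made-up example.
    print(prosite_to_regex("N-{P}-[ST]-{P}."))
    print(scan_sequence("N-{P}-[ST]-{P}.", "MKNGSAANPTQ"))
```

For the made-up sequence MKNGSAANPTQ, the pattern N-{P}-[ST]-{P} matches NGSA at position 2 (0-based), while the second asparagine is rejected because it is followed by proline.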

NAME: SMART, Simple Modular Architecture Research Tool [18]

DESC: The SMART allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. The SMART v3.5, which came out on April 28, 2003, contains 654 HMMs that detect domains found in signaling, extracellular, and chromatin-associated proteins. The focus of SMART is to search for evolutionarily conserved protein domains rather than small sites of posttranslational modification. The WWW server provides keyword searching as well as HMM searching for classification of protein sequences.

GROUP: European Molecular Biology Laboratory (EMBL), Heidelberg, Germany

EMAIL: [email protected]

WWW: http://smart.embl-heidelberg.de/

QUERY: http://smart.embl-heidelberg.de/index2.cgi for the SMART advanced search and http://dylan.embl-heidelberg.de/alert/ for the SMART alert service to be automatically informed each time a new protein with a defined domain composition is deposited in databases.

NAME: TIGRFAMs [19]

DESC: The TIGRFAMs is a database of protein families based on Hidden Markov Models.

GROUP: The Institute for Genomic Research (TIGR), Rockville, MD, U.S.

EMAIL: [email protected]

WWW: http://www.tigr.org/TIGRFAMs/index.shtml

QUERY: http://www.tigr.org/tigr-scripts/CMR2/find_hmm.spl?db=CMR for TIGRFAMs text search and http://tigrblast.tigr.org/web-hmm/ for TIGRFAMs sequence similarity search.

FTP: ftp://ftp.tigr.org/pub/data/TIGRFAMs/
