Gene Families in Compara

Introduction

Ensembl families are determined by clustering all Ensembl proteins together with metazoan sequences from UniProtKB. They therefore provide a way of exploring orthologues and closely related homologues across a range of animal species.

Ensembl Protein Family Report

Family ID
The family ID encodes the Ensembl version in which the family was built. For example, fam50v... denotes Ensembl version 50. The ID is not stable: it can change with each new Ensembl release. However, because the version is part of the ID, recording the family ID alone should be sufficient to find the family in the archive sites.
Consensus Annotation
For each cluster obtained, a consensus annotation is automatically generated from the UniProt/Swiss-Prot and UniProt/TrEMBL description lines of all UniProtKB members using the following approach:
If the description covers fewer than 40% of the UniProt members in the cluster, the family description is set to 'AMBIGUOUS'. If the annotation confidence score, described below, is zero, the description is set to 'UNKNOWN'. Be aware that 'UNCHARACTERIZED' is a UniProt description for a protein and does not reflect the score.
The annotation confidence score is the percentage of UniProtKB family members with this description, or part of it. Note that only family members with 'informative' UniProt descriptions are taken into account.
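The thresholds described above can be illustrated with a minimal Python sketch. It is not the Ensembl implementation and simplifies in several ways, all of which are assumptions: descriptions are compared by exact match after upper-casing (the real procedure also credits partial matches), the score is computed over the informative members only, and the `is_informative` filter and its placeholder set are hypothetical stand-ins for whatever rule Ensembl actually applies.

```python
from collections import Counter

# Hypothetical placeholder descriptions treated as uninformative (assumption).
PLACEHOLDERS = {"", "UNKNOWN", "HYPOTHETICAL PROTEIN"}


def is_informative(description):
    """Hypothetical filter: reject empty or placeholder descriptions."""
    return description.strip().upper() not in PLACEHOLDERS


def consensus_annotation(descriptions, informative=is_informative):
    """Return (family_description, confidence_score_percent).

    Simplified: exact-match counting over informative descriptions only.
    """
    kept = [d.strip().upper() for d in descriptions if informative(d)]
    if not kept:
        # No informative description at all: the score is zero.
        return "UNKNOWN", 0
    best, count = Counter(kept).most_common(1)[0]
    score = round(100 * count / len(kept))
    if score < 40:
        # The best description covers less than 40% of the members.
        return "AMBIGUOUS", score
    return best, score


if __name__ == "__main__":
    members = ["Histone H4", "Histone H4", "Histone H4", "Histone H3", ""]
    print(consensus_annotation(members))
```

Passing a different `informative` function is the intended way to experiment with other definitions of an informative description.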
Prediction Method
A brief summary of the protein family clustering algorithm is given.
Multiple Alignments
Ensembl provides pre-calculated multiple sequence alignments of all members for each cluster.
If the Java Runtime Environment (JRE) is properly installed on your computer, the buttons open a new window showing the multiple alignment of the family members in JalView. The first option includes the Ensembl protein predictions from the current species as well as from all other species supported by Ensembl. The second option also includes the UniProtKB members. You can also export a text file with the alignments of all the family members; a wide range of formats is available from the control panel.
Alternatively, export alignments using the Compara Perl API.

Clustering

The protein family database is generated by running the Markov Clustering (MCL) algorithm [1, 2, 3, 4] as initially proposed by A.J. Enright, S. van Dongen and C.A. Ouzounis [5]. Before clustering, an all-against-all BLASTP sequence similarity search is run on the super-set of all Ensembl protein predictions of all species, together with all metazoan sequences from UniProt/Swiss-Prot and UniProt/TrEMBL, to establish similarities. The MCL algorithm is then run on these similarities to establish the protein family clusters.
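To give a feel for how MCL turns pairwise similarities into clusters, here is a small pure-Python sketch of its two alternating steps, expansion and inflation, on a toy graph. It is an illustration only, not the MCL program used by Ensembl: the inflation value, the iteration count, the self-loop weight and the toy similarity weights are all assumptions, and real runs use BLASTP-derived weights on sparse matrices.

```python
def normalize_columns(m):
    """Scale each column to sum to 1 (column-stochastic matrix)."""
    n = len(m)
    out = [[0.0] * n for _ in range(n)]
    for j in range(n):
        total = sum(m[i][j] for i in range(n))
        for i in range(n):
            out[i][j] = m[i][j] / total if total else 0.0
    return out


def expand(m):
    """Expansion: square the matrix, i.e. take two-step random walks."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m, r):
    """Inflation: raise each entry to the power r, then renormalize columns."""
    return normalize_columns([[x ** r for x in row] for row in m])


def build_similarity(n, edges):
    """Symmetric similarity matrix from (i, j, weight) triples."""
    m = [[0.0] * n for _ in range(n)]
    for i, j, w in edges:
        m[i][j] = m[j][i] = w
    return m


def mcl(similarity, inflation=2.0, iterations=50, eps=1e-6):
    """Return clusters as sorted lists of node indices."""
    n = len(similarity)
    # Self-loops (weight 1.0 here, an assumption) are commonly added first.
    m = [[similarity[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(iterations):
        m = inflate(expand(m), inflation)
    # Each row that keeps non-negligible mass is an attractor; the columns
    # it attracts form one cluster. Identical clusters are merged by the set.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > eps)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


if __name__ == "__main__":
    # Two tightly linked triangles joined by one weak similarity (toy data).
    edges = [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
             (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0),
             (2, 3, 0.1)]
    print(mcl(build_similarity(6, edges)))
```

On this toy input the weak link between the two triangles should be cut, leaving each triangle as its own cluster. Raising the inflation parameter generally yields finer clusters, lowering it coarser ones.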

References

1. Stijn van Dongen
Graph Clustering by Flow Simulation.
PhD thesis, University of Utrecht, May 2000

2. Stijn van Dongen
A cluster algorithm for graphs.
Technical Report INS-R0010, National Research Institute for Mathematics and Computer Science in the Netherlands, Amsterdam, May 2000

3. Stijn van Dongen
A stochastic uncoupling process for graphs.
Technical Report INS-R0011, National Research Institute for Mathematics and Computer Science in the Netherlands, Amsterdam, May 2000

4. Stijn van Dongen
Performance criteria for graph clustering and Markov cluster experiments.
Technical Report INS-R0012, National Research Institute for Mathematics and Computer Science in the Netherlands, Amsterdam, May 2000

5. Anton J. Enright, Stijn van Dongen and Christos A. Ouzounis
An efficient algorithm for large-scale detection of protein families.
Nucleic Acids Res. 2002 Apr 1;30(7):1575-1584.