The Ensembl Release Cycle
Ensembl data is released on an approximately two-month cycle (occasionally longer if a lot of development work is being undertaken). Whatever its length, the cycle works as follows:
- Genebuild
This stage varies in length for each species, depending on the complexity of the genome(s) involved. Individual species are updated on an irregular schedule, depending on the availability of new assemblies and evidence. New species are added frequently from a number of sequencing projects around the world, and all species databases may receive minor updates. These can include patches to correct erroneous data and updates to data that changes regularly (such as cDNAs for human and mouse).
The genebuild team members take evidence for genes and transcripts, such as proteins and mRNAs, and combine it with manual annotation data in the analysis pipeline to create an Ensembl core database and, optionally, otherfeatures and cdna databases. Once these are complete, they are handed over to the other Ensembl data teams for further processing (see below).
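As a rough illustration of the databases a genebuild can hand over, the sketch below composes names for a core database and the optional otherfeatures and cdna databases. It assumes the `species_type_release_assembly` naming pattern seen on the public Ensembl MySQL servers (for example `homo_sapiens_core_110_38`). The function names, the release number and the assembly versions are hypothetical values chosen for illustration, not part of the Ensembl pipeline itself.

```python
from typing import Iterable, List


def ensembl_db_name(species: str, db_type: str, release: int, assembly: str) -> str:
    """Compose a database name, assuming the species_type_release_assembly pattern."""
    # Normalise "Homo sapiens" to "homo_sapiens".
    species_part = species.strip().lower().replace(" ", "_")
    return f"{species_part}_{db_type}_{release}_{assembly}"


def genebuild_db_names(
    species: str,
    release: int,
    assembly: str,
    optional_types: Iterable[str] = (),
) -> List[str]:
    """List the core database name plus any optional core-like databases."""
    # The core database is always produced; otherfeatures and cdna are optional.
    db_types = ["core", *optional_types]
    return [ensembl_db_name(species, t, release, assembly) for t in db_types]


if __name__ == "__main__":
    # Hypothetical release and assembly numbers, for illustration only.
    for name in genebuild_db_names("Homo sapiens", 110, "38", ("otherfeatures", "cdna")):
        print(name)
```

Keeping the optional database types as a parameter mirrors the text: every species gets a core database, while otherfeatures and cdna are produced only where the evidence warrants it.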
- Additional core data
The role of the core team is two-fold: to provide API support for the core and core-like (otherfeatures and cdna) databases, and to run scripts that add supplementary data to the database (e.g. gene counts) and check that the database contents are as complete and accurate as possible. These latter scripts, known as healthchecks, help to pick out any anomalous data produced by the automated pipeline, such as unusually long genes.
- Other databases
- Compara
The comparative genomics team runs a second pipeline which brings together the separate species databases, aligns sequences to identify syntenic regions, and predicts orthologues, paralogues, and protein family clusters. The resultant data is compiled into a single large database (although this is now becoming so big that there are plans to separate the content into multiple databases).
- Variation
The variation team brings together data from a variety of sources, including dbSNP, and also calls new variations from resequencing data. These are then used to create variation databases for the relevant species. Currently there are around a dozen species with variation data, including human, chimp, mouse, rat, dog and zebrafish.
- Functional Genomics
The functional genomics team collects experimental data from its collaborators and incorporates it into the regulatory build. This includes regulatory features determined by chromatin immunoprecipitation and epigenomic modifications, and other data with a regulatory focus such as CisRED. Currently only human, mouse and fruitfly have functional genomics databases.
- Mart
The mart team builds its own normalised database tables from the Ensembl data, so that the data can be accessed through the BioMart data-mining tool.
- Web
Whilst the genomic data is being prepared, the web team works on new displays and new website features. They then bring together all the finished databases and make the content available online in a number of ways:
- The website configuration is updated to access the new data
- The databases are copied to the public MySQL servers and also dumped in a variety of formats for downloading from our FTP site
- The database dumps are also used to create search indexes for the BLAST service
The web team also populates an additional database, ensembl_website, which contains help, news, and other web-specific information. If there are new displays, or if existing ones have changed substantially, the outreach team updates the help content.
- Release
When the new release is ready to go live, a copy of the current version is set up as an archive, and the webserver is updated to point to the new site.
This is necessarily a simplified account of a process that takes around 50 people several months to complete!