Ciona Genebuild
With a compact genome of about 117 Mb, Ciona intestinalis (the sea squirt) has the smallest genome of any experimentally manipulable chordate. C. intestinalis is evolutionarily distant from other well-characterised species and hence suffers from a lack of related sequence data. Because of this, we had to deviate from the normal principles used in the Ensembl genome annotation procedure. Normally, cDNA evidence is used mainly for the addition of UTRs to protein-based gene models, and EST-based predictions are not included in the final gene set. In the case of C. intestinalis, the bulk of the available sequence data consists of these two types (cDNA and EST), so we had to use them to generate protein-coding gene models.
The current gene build was conducted on the version 1.95 assembly published by the Joint Genome Institute, which represents 11x coverage. Presently the genome is assembled into 2242 scaffolds and has a total length of 117 Mb. Nearly half of the genome is contained in 136 scaffolds of greater than 234 kb in length. The gene build was conducted in a similar manner to the standard Ensembl mammalian pipeline, modified to make optimal choices of source proteins for each gene. This initial analysis resulted in 10,988 genes comprising 21,574 transcripts. Notable features of this automated annotation include:
- Repeats identified using RepeatMasker with a specific C. intestinalis repeat library (dated 13.07.2002).
- Ab initio gene predictions made by Genscan.
- Mappings of UniProt sequences and over 682,000 ESTs.
- A subset of gene predictions derived solely from Ciona-specific evidence (cDNA, EST and protein sequence).
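The scaffold statistic quoted above (nearly half of the genome in 136 scaffolds longer than 234 kb) is the kind of figure an N50-style calculation produces. The following is a minimal sketch of such a calculation, not the method used for this assembly; the function name `n50_stats` and the toy lengths are assumptions made for illustration.

```python
def n50_stats(lengths):
    """Return (n50, l50) for a list of scaffold lengths.

    n50 is the length of the scaffold at which the cumulative length,
    counted from the longest scaffold down, first reaches half of the
    total; l50 is the number of scaffolds needed to get there.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count
    raise ValueError("no scaffold lengths given")


# Toy lengths in kb (illustrative only, not the real assembly).
toy_scaffolds = [400, 300, 200, 50, 30, 20]
print(n50_stats(toy_scaffolds))  # prints (300, 2)
```

Applied to the real scaffold lengths, a calculation of this kind would return a length and a scaffold count comparable to the 234 kb and 136 scaffolds reported above.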
The gene prediction procedure applied to the C. intestinalis genome can be divided into distinct stages, each of which uses different underlying data. The results of all stages are combined in the final annotation step.
The first stage consists of aligning Ciona-specific proteins (~650 sequences) against the genome. These alignments were used to build a basic set of gene models supported by high-quality evidence. We used pmatch to align these proteins and genewise to produce gene structures from the alignments. These genes are classified as known genes.
In the next step, we aligned the exceptionally large number of Ciona-specific cDNA and EST sequences against the genome using exonerate and produced gene structures using genewise. This step differs from a standard pipeline run, where these data are normally used only to add UTR sequences to known genes.
Later steps used protein data from other species to build additional gene models.
Finally, the gene predictions from each of the incremental build stages were combined. Genes with the strongest supporting evidence were added to the final gene set first, followed by genes built using other evidence. Transcripts which displayed distant, terminal ('satellite') exons were discarded if they overlapped transcripts from other genes.
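The combination step can be sketched as follows. This is an illustrative sketch under stated assumptions, not the Ensembl implementation: the evidence ranking `EVIDENCE_RANK`, the cut-off `MAX_TERMINAL_INTRON`, the `Transcript` class and the use of whole-transcript spans to test overlap are all hypothetical choices, and strand and scaffold are ignored for brevity.

```python
from dataclasses import dataclass

# Hypothetical ranking of evidence types: lower number = stronger support.
EVIDENCE_RANK = {
    "ciona_protein": 0,
    "ciona_cdna_est": 1,
    "other_species_protein": 2,
}
# Hypothetical cut-off (bp) beyond which a terminal exon counts as "distant".
MAX_TERMINAL_INTRON = 50_000


@dataclass
class Transcript:
    gene_id: str
    evidence: str
    exons: list  # sorted, non-overlapping (start, end) pairs

    @property
    def span(self):
        return self.exons[0][0], self.exons[-1][1]


def overlaps(a, b):
    """True if two closed intervals (start, end) share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def has_satellite_exon(t):
    """True if the first or last exon is separated by a very long intron."""
    if len(t.exons) < 2:
        return False
    first_gap = t.exons[1][0] - t.exons[0][1]
    last_gap = t.exons[-1][0] - t.exons[-2][1]
    return max(first_gap, last_gap) > MAX_TERMINAL_INTRON


def build_final_set(transcripts):
    """Order by evidence strength, dropping satellite-exon transcripts
    whose span overlaps a transcript of another gene."""
    ordered = sorted(transcripts, key=lambda t: EVIDENCE_RANK[t.evidence])
    final = []
    for t in ordered:
        clash = any(
            o.gene_id != t.gene_id and overlaps(t.span, o.span)
            for o in transcripts
        )
        if has_satellite_exon(t) and clash:
            continue
        final.append(t)
    return final


# Toy input: g2 has a distant first exon and spans g1, so it is dropped.
a = Transcript("g1", "ciona_protein", [(1000, 1200), (1500, 1800)])
b = Transcript("g2", "other_species_protein",
               [(100, 200), (90_000, 90_300), (90_500, 90_800)])
c = Transcript("g3", "ciona_cdna_est", [(200_000, 200_400), (200_900, 201_300)])
print([t.gene_id for t in build_final_set([b, c, a])])  # prints ['g1', 'g3']
```

Comparing whole spans keeps the sketch short; a closer model would compare exon coordinates on the same scaffold and strand.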