Gene-building on the 2x genomes

The difficulties in trying to identify gene structures in draft genome sequence are acutely apparent when the assembly has been constructed from only 2x genome coverage. Erroneous sequence, mis-orientations and misplacements occur as they do in any genome assembly, but missing sequence and fragmentation are particular problems; many genes will be represented only partially (or not at all) in the assemblies, and many others (particularly those with large genomic extent) will be found in pieces, distributed across more than one scaffold. The standard Ensembl gene-build pipeline, which is based upon the alignment of complete proteins and cDNAs to the genome sequence, is therefore unsuitable for low-coverage genomes.

We have developed a new gene-building methodology for low-coverage genomes that relies on a whole-genome alignment (WGA) to an annotated reference genome. The WGA underlying each annotated gene structure in the reference genome is used to infer how scaffolds in the target genome should be arranged into "gene-scaffolds", each of which contains a complete gene structure.
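
As a rough illustration of how an alignment can imply a scaffold arrangement, the sketch below orders target scaffolds by where they first align along a reference gene. It is a simplified sketch and not the pipeline's actual code: the `Block` record, its field names and the `order_scaffolds` function are hypothetical, and conflicts such as one scaffold interleaved with another are ignored.

```python
from typing import List, NamedTuple, Tuple


class Block(NamedTuple):
    """One aligned block between a reference gene and a target scaffold (hypothetical record)."""
    ref_start: int   # start of the block on the reference genome
    ref_end: int     # end of the block on the reference genome
    scaffold: str    # name of the target scaffold
    strand: int      # +1 or -1: orientation of the scaffold relative to the reference


def order_scaffolds(blocks: List[Block]) -> List[Tuple[str, int]]:
    """Return (scaffold, strand) pairs in order of first appearance along the reference."""
    layout = []
    seen = set()
    for block in sorted(blocks, key=lambda b: b.ref_start):
        if block.scaffold not in seen:
            seen.add(block.scaffold)
            layout.append((block.scaffold, block.strand))
    return layout


# Example: a gene whose alignment touches two scaffolds, the second one reversed.
example_blocks = [
    Block(500, 600, "scaffold_2", -1),
    Block(100, 200, "scaffold_1", 1),
    Block(250, 300, "scaffold_1", 1),
]
gene_scaffold = order_scaffolds(example_blocks)
```

Here `gene_scaffold` lists `scaffold_1` in forward orientation followed by `scaffold_2` in reverse orientation, which is the order in which the two would be joined.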

The protein-coding transcripts of the reference gene structures are projected through the WGA onto the implied gene-scaffolds in the target genome. Small insertions or deletions that disrupt the reading frame of the resulting transcripts are corrected by inserting "frame-shift" introns into the structure.
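
The idea of a frame-shift intron can be sketched for the simplest case, a small insertion in the target relative to the reference. The function below is an illustrative assumption, not the pipeline's implementation: `split_on_insertion` is a hypothetical name, coordinates are half-open target positions, and deletions in the target are not handled.

```python
from typing import List, Tuple


def split_on_insertion(exon_start: int, exon_end: int,
                       ins_start: int, ins_len: int) -> List[Tuple[int, int]]:
    """Split a projected exon around an inserted run of bases.

    Coordinates are half-open [start, end) on the target. If the insertion
    length is a multiple of 3 the reading frame is unaffected and the exon
    is returned unchanged. Otherwise the inserted bases are excluded from
    the coding sequence by treating them as a very short intron.
    """
    if ins_len % 3 == 0:
        return [(exon_start, exon_end)]
    ins_end = ins_start + ins_len
    if not (exon_start < ins_start and ins_end < exon_end):
        raise ValueError("insertion must lie strictly inside the exon")
    return [(exon_start, ins_start), (ins_end, exon_end)]


# Example: a 61 bp target exon carrying a 1 bp insertion at position 130.
pieces = split_on_insertion(100, 161, 130, 1)
coding_length = sum(end - start for start, end in pieces)
```

In this example the exon is split into two 30 bp pieces, so the coding length returns to 60 bp and the downstream reading frame is preserved.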

When the WGA implies that the sequence containing an internal exon is missing from the assembly, and the location is consistent with an intra- or inter-scaffold gap, the exon is placed on the gap sequence. This results in a run of X's of the correct length in the translation.
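
The effect on the translation can be shown with a small sketch. It assumes that gap sequence is represented by N's and that any codon containing a non-ACGT base is translated as X; the `translate` function is written here for illustration only and uses the standard genetic code.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds: str) -> str:
    """Translate a coding sequence; codons not in the table (e.g. NNN) become X."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    return "".join(
        CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, usable, 3)
    )


# Example: a 6 bp internal exon placed on gap sequence (N's) between two real exons.
first_exon = "ATGGCC"
gap_exon = "N" * 6
last_exon = "TTTTAA"
protein = translate(first_exon + gap_exon + last_exon)
```

Because the gap exon has the length implied by the reference exon, the two X's in `protein` keep the flanking residues at their expected positions in the translation.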

The Ensembl-annotated human genome is used as the reference, and a whole-genome alignment between the human and target genomes is generated in-house using BLASTz. The resulting set of local alignments is linked into chains using the axtTools written by Jim Kent (W.J. Kent et al., PNAS 100:1484-9). A custom filter is applied to ensure that each base pair in the target genome is aligned to no more than one position in the reference genome. It is this processed alignment from which the gene-scaffolds are inferred and through which the reference annotation is projected.
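
The details of the custom filter are not given here, so the following is only one plausible way to obtain the stated property: a greedy pass that keeps the highest-scoring blocks and discards any block whose target interval overlaps one already kept. The tuple layout, the use of a score, and the choice to discard rather than trim overlapping blocks are all assumptions made for illustration.

```python
from typing import List, Tuple

# (score, target_scaffold, target_start, target_end), half-open target coordinates.
TargetBlock = Tuple[int, str, int, int]


def filter_unique_target(blocks: List[TargetBlock]) -> List[TargetBlock]:
    """Keep blocks greedily by score so that no target base is covered twice."""
    kept: List[TargetBlock] = []
    for block in sorted(blocks, key=lambda b: b[0], reverse=True):
        _, scaffold, start, end = block
        # A block is accepted only if it overlaps no kept block on the same scaffold.
        if all(k[1] != scaffold or end <= k[2] or start >= k[3] for k in kept):
            kept.append(block)
    return kept


# Example: two overlapping blocks on s1; only the higher-scoring one survives.
candidates = [
    (50, "s1", 0, 100),
    (90, "s1", 50, 150),
    (10, "s1", 150, 200),
    (70, "s2", 50, 150),
]
unique_blocks = filter_unique_target(candidates)
```

After filtering, every base of the target is covered by at most one block, so projection of the reference annotation through the alignment is unambiguous on the target side.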