Variation API Tutorial
Introduction
This tutorial is an introduction to the Ensembl Variation API. Knowledge of the Ensembl Core API and of the concepts and conventions in the Ensembl Core API tutorial is assumed. Documentation about the Variation database schema is available at http://cvs.sanger.ac.uk/cgi-bin/viewvc.cgi/ensembl-variation/schema/?root=ensembl. Although it is not necessary for this tutorial, an understanding of the database tables may help, as many of the adaptor modules are table-specific.
Code Conventions (and unconventions)
Refer to the Ensembl core tutorial for a good description of the coding conventions normally used in Ensembl. Please note that there may be exceptions to these rules in variation.
Connecting to an Ensembl variation database
Connecting to an Ensembl variation database is made simple by using the Bio::EnsEMBL::Registry module:
    use Bio::EnsEMBL::Registry;

    my $registry = 'Bio::EnsEMBL::Registry';

    $registry->load_registry_from_db(
        -host => 'ensembldb.ensembl.org',
        -user => 'anonymous'
    );
Using the registry ensures that you load the correct versions of the Ensembl databases for your software release from those available on the database instance. With the registry object, you can then create any number of database adaptors. Each of these adaptors is responsible for generating objects of one type. The Ensembl variation API uses a number of object types that relate to the data stored in the database. For example, in order to generate variation objects, you should first create a variation adaptor:
    use Bio::EnsEMBL::Registry;

    my $registry = 'Bio::EnsEMBL::Registry';

    $registry->load_registry_from_db(
        -host => 'ensembldb.ensembl.org',
        -user => 'anonymous'
    );

    my $variation_adaptor = $registry->get_adaptor(
        'human',      # species
        'variation',  # database
        'variation'   # object type
    );

    my $variation = $variation_adaptor->fetch_by_name('rs1333049');
The get_adaptor method will automatically create a connection to the relevant database; in the example above, a connection will be made to the variation database for human. The three parameters passed specify the species, database and object type you require. Below is a non-exhaustive list of the most commonly used Ensembl variation adaptors:
- IndividualAdaptor to fetch Bio::EnsEMBL::Variation::Individual objects
- LDFeatureContainerAdaptor to fetch Bio::EnsEMBL::Variation::LDFeatureContainer objects
- PopulationAdaptor to fetch Bio::EnsEMBL::Variation::Population objects
- ReadCoverageAdaptor to fetch Bio::EnsEMBL::Variation::ReadCoverage objects
- TranscriptVariationAdaptor to fetch Bio::EnsEMBL::Variation::TranscriptVariation objects
- VariationAdaptor to fetch Bio::EnsEMBL::Variation::Variation objects
- VariationFeatureAdaptor to fetch Bio::EnsEMBL::Variation::VariationFeature objects
Only some of these adaptors are used in this tutorial, where they are illustrated through commented Perl scripts.
Variations in the genome
One of the most important uses of the variation database is to get all variations in a certain region of the genome. Below is a simple commented Perl script that illustrates how to get all variations on chromosome 25 in zebrafish.
    use strict;
    use warnings;
    use Bio::EnsEMBL::Registry;

    my $registry = 'Bio::EnsEMBL::Registry';

    $registry->load_registry_from_db(
        -host => 'ensembldb.ensembl.org',
        -user => 'anonymous'
    );

    # get the database adaptor for Slice objects
    my $slice_adaptor = $registry->get_adaptor('danio_rerio', 'core', 'slice');

    # get chromosome 25 in zebrafish
    my $slice = $slice_adaptor->fetch_by_region('chromosome', 25);

    # get adaptor to VariationFeature object
    my $vf_adaptor = $registry->get_adaptor('danio_rerio', 'variation', 'variationfeature');

    # return ALL variations defined in $slice
    my $vfs = $vf_adaptor->fetch_all_by_Slice($slice);

    foreach my $vf (@{$vfs}) {
        print "Variation: ", $vf->variation_name,
              " with alleles ", $vf->allele_string,
              " in chromosome ", $slice->seq_region_name,
              " and position ", $vf->start, "-", $vf->end, "\n";
    }
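The allele string printed above is a slash-separated list such as "A/G". As a minimal sketch, the hypothetical helper below (split_allele_string is not part of the Ensembl API) splits such a string into a first allele and the remaining alleles. It assumes that the first allele listed is the reference allele, which should be checked against your data.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the Ensembl API.
# Splits an allele string such as "A/G/T" into the first allele
# (assumed here to be the reference) and a list of the other alleles.
sub split_allele_string {
    my ($allele_string) = @_;
    my ($ref, @alts) = split m{/}, $allele_string;
    return ($ref, \@alts);
}

# Example with a literal string standing in for $vf->allele_string
my ($ref_allele, $alt_alleles) = split_allele_string('A/G/T');
print "Reference: $ref_allele, alternatives: ",
      join(',', @{$alt_alleles}), "\n";
```

Inside the loop over variation features, the helper could be called as split_allele_string($vf->allele_string).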
Consequence type of variations
Another common use of the variation database is to retrieve the effects that variations have on a transcript. The example below shows how to get all variations in a particular chicken transcript and the effect of each variation on that transcript.
    use strict;
    use warnings;
    use Bio::EnsEMBL::Registry;

    my $registry = 'Bio::EnsEMBL::Registry';

    $registry->load_registry_from_db(
        -host => 'ensembldb.ensembl.org',
        -user => 'anonymous'
    );

    # this is the stable_id of a chicken transcript
    my $stable_id = 'ENSGALT00000007843';

    # get the adaptor to get the Transcript from the database
    my $transcript_adaptor = $registry->get_adaptor('gallus_gallus', 'core', 'transcript');

    # get the Transcript object
    my $transcript = $transcript_adaptor->fetch_by_stable_id($stable_id);

    # get the adaptor to get TranscriptVariation objects
    my $trv_adaptor = $registry->get_adaptor('gallus_gallus', 'variation', 'transcriptvariation');

    # get ALL effects of Variations in the Transcript
    my $trvs = $trv_adaptor->fetch_all_by_Transcripts([$transcript]);

    foreach my $tv (@{$trvs}) {
        # print the name of the variation and the effect (consequence_type)
        # of the variation in the Transcript
        print "SNP :", $tv->variation_feature->variation_name,
              " has a consequence/s ", join(",", @{$tv->consequence_type}),
              " in transcript ", $stable_id, "\n";
    }
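When a transcript contains many variations, a summary of how often each consequence occurs can be easier to read than one line per variation. The sketch below tallies consequence types from a list of array references, which is the shape that consequence_type returns in the script above. The helper name count_consequences and the sample consequence strings are illustrative assumptions, not values taken from the database.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the Ensembl API.
# Takes a reference to a list of array references (one per
# TranscriptVariation) and returns a hash reference mapping each
# consequence type to the number of times it was seen.
sub count_consequences {
    my ($consequence_lists) = @_;
    my %count;
    for my $list (@{$consequence_lists}) {
        $count{$_}++ for @{$list};
    }
    return \%count;
}

# Placeholder data standing in for the values of $tv->consequence_type
my @consequence_lists = (
    ['INTRONIC'],
    ['NON_SYNONYMOUS_CODING', 'SPLICE_SITE'],
    ['INTRONIC'],
);

my $tally = count_consequences(\@consequence_lists);
for my $type (sort keys %{$tally}) {
    print "$type\t$tally->{$type}\n";
}
```

With real data, the input could be built as [ map { $_->consequence_type } @{$trvs} ].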
Variations, Flanking sequences and Genes
Below is a complete example of how to use the variation API to retrieve different data from the database. For a list of variation names, it retrieves the alleles, flanking sequences, locations, effects of the variations on transcripts, the amino acid change (where the variation has a coding effect) and the genes containing the transcripts.
    use strict;
    use warnings;
    use Bio::EnsEMBL::Registry;

    my $registry = 'Bio::EnsEMBL::Registry';

    $registry->load_registry_from_db(
        -host => 'ensembldb.ensembl.org',
        -user => 'anonymous'
    );

    # get the different adaptors for the different objects needed
    my $va_adaptor   = $registry->get_adaptor('human', 'variation', 'variation');
    my $vf_adaptor   = $registry->get_adaptor('human', 'variation', 'variationfeature');
    my $gene_adaptor = $registry->get_adaptor('human', 'core', 'gene');

    my @rsIds = qw(rs1367827 rs1367830);

    foreach my $id (@rsIds) {
        # get the Variation from the database using the name
        my $var = $va_adaptor->fetch_by_name($id);
        get_VariationFeatures($var);
    }

    sub get_VariationFeatures {
        my $var = shift;

        # get all VariationFeature objects: might be more than 1 !!!
        foreach my $vf (@{$vf_adaptor->fetch_all_by_Variation($var)}) {
            print $vf->variation_name(), ",";                        # print rsID
            print $vf->allele_string(), ",";                         # print alleles
            print join(",", @{$vf->get_consequence_type()}), ",";    # print consequence types

            # print the last 10 bases of the 5' flank, the alleles in
            # brackets, then the first 10 bases of the 3' flank
            print substr($var->five_prime_flanking_seq, -10), "[", $vf->allele_string, "]";
            print substr($var->three_prime_flanking_seq, 0, 10), ",";

            # print position in the reference in the format Chr:start-end
            print $vf->seq_region_name, ":", $vf->start, "-", $vf->end;

            get_TranscriptVariations($vf);                           # get Transcript information
        }
    }

    sub get_TranscriptVariations {
        my $vf = shift;

        # get ALL the effects of the variation in different Transcripts:
        # might be more than 1 !!!
        my $transcript_variations = $vf->get_all_TranscriptVariations;

        if (defined $transcript_variations) {
            foreach my $tv (@{$transcript_variations}) {
                # the AA change, but only if it is in a coding region
                print ",", $tv->pep_allele_string if (defined $tv->pep_allele_string);

                # the gene stable ID, but only if the gene has an external name
                my $gene = $gene_adaptor->fetch_by_transcript_id($tv->transcript->dbID);
                print ",", $gene->stable_id if (defined $gene->external_name);
            }
        }
        print "\n";
    }
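The flanking-sequence output in the script above follows a simple pattern: the end of the 5' flank, the alleles in square brackets, then the start of the 3' flank. As a sketch only, the hypothetical helper below (format_flank is an illustrative name, not an API method) builds that string from plain scalars and guards against flanks shorter than the requested length. The sequences used are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the Ensembl API.
# Returns "<end of 5' flank>[alleles]<start of 3' flank>", keeping at
# most $length bases on each side (default 10).
sub format_flank {
    my ($five_prime, $alleles, $three_prime, $length) = @_;
    $length = 10 unless defined $length;

    # Take the whole flank when it is shorter than $length
    my $left = length($five_prime) > $length
             ? substr($five_prime, -$length)
             : $five_prime;
    my $right = substr($three_prime, 0, $length);

    return $left . '[' . $alleles . ']' . $right;
}

# Made-up sequences standing in for the flanking sequences and alleles
print format_flank('ACGTACGTAC', 'A/G', 'GGATCC', 3), "\n";
```

In the script above, the arguments would be $var->five_prime_flanking_seq, $vf->allele_string and $var->three_prime_flanking_seq.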
LD calculation
To use the LD calculation, you need to compile the C source code and install a module called IPC::Run. There is more information on how to do this in "Use LD calculation". The example below calculates the LD in a region of human chromosome 6 for a HapMap population, but only prints pairs of variations that are in high LD.
    use strict;
    use warnings;
    use Bio::EnsEMBL::Registry;

    my $registry = 'Bio::EnsEMBL::Registry';

    $registry->load_registry_from_db(
        -host => 'ensembldb.ensembl.org',
        -user => 'anonymous'
    );

    # defining the region in chromosome 6
    my $chr   = 6;
    my $start = 25_834_000;
    my $end   = 25_854_000;

    # we only want LD in this population
    my $population_name = 'CSHL-HAPMAP:HapMap-CEU';

    # get adaptor for Slice object
    my $slice_adaptor = $registry->get_adaptor('human', 'core', 'slice');

    # get slice of the region
    my $slice = $slice_adaptor->fetch_by_region('chromosome', $chr, $start, $end);

    # get adaptor for Population object
    my $population_adaptor = $registry->get_adaptor('human', 'variation', 'population');

    # get population object from database
    my $population = $population_adaptor->fetch_by_name($population_name);

    # get adaptor for LDFeatureContainer object
    my $ldFeatureContainerAdaptor = $registry->get_adaptor('human', 'variation', 'ldfeaturecontainer');

    # retrieve all LD values in the region
    my $ldFeatureContainer = $ldFeatureContainerAdaptor->fetch_by_Slice($slice, $population);

    foreach my $r_square (@{$ldFeatureContainer->get_all_r_square_values}) {
        # only print high LD, where high is defined as r2 > 0.8
        if ($r_square->{r2} > 0.8) {
            print "High LD between variations ",
                  $r_square->{variation1}->variation_name, "-",
                  $r_square->{variation2}->variation_name, "\n";
        }
    }
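The filtering step in the script above can be separated from the database access so that the threshold is easy to change. The sketch below keeps records whose r2 value exceeds a threshold and sorts them from highest to lowest. The helper name high_ld_pairs is hypothetical, and the records are made-up hashes in which variation1 and variation2 hold plain placeholder names rather than the objects that the API returns.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the Ensembl API.
# Returns a reference to the records with r2 above $threshold,
# sorted by decreasing r2.
sub high_ld_pairs {
    my ($records, $threshold) = @_;
    my @kept = grep { $_->{r2} > $threshold } @{$records};
    return [ sort { $b->{r2} <=> $a->{r2} } @kept ];
}

# Made-up records; the names are placeholders, not real identifiers
my @ld_records = (
    { variation1 => 'var_a', variation2 => 'var_b', r2 => 0.85 },
    { variation1 => 'var_a', variation2 => 'var_c', r2 => 0.40 },
    { variation1 => 'var_b', variation2 => 'var_c', r2 => 0.95 },
);

for my $pair (@{ high_ld_pairs(\@ld_records, 0.8) }) {
    print "High LD between variations ",
          "$pair->{variation1}-$pair->{variation2} (r2 = $pair->{r2})\n";
}
```

With real data, the first argument would be the value returned by get_all_r_square_values, and the names would be read with variation_name as in the script above.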
Specific strain information
With the advent of new sequencing technologies, the variation API now lets you work with a specific strain as if it were the reference, and compare it against others. In the example below, we create a StrainSlice object for a region in Craig Venter's sequence and compare it against the reference sequence.
    use strict;
    use warnings;
    use Bio::EnsEMBL::Registry;

    my $reg  = 'Bio::EnsEMBL::Registry';
    my $host = 'ensembldb.ensembl.org';
    my $user = 'anonymous';

    $reg->load_registry_from_db(
        -host => $host,
        -user => $user
    );

    # get slice adaptor from core
    my $sa = $reg->get_adaptor("human", "core", "slice");

    my $slice = $sa->fetch_by_region('chromosome', 8, 9213000, 9216000);

    # get strainSlice from the slice
    my $venter = $slice->get_by_strain("Venter");

    # get the positions where Venter's sequence differs from the reference
    my @differences = @{$venter->get_all_AlleleFeatures_Slice()};

    foreach my $diff (@differences) {
        print "Locus ", $diff->seq_region_start, "-", $diff->seq_region_end,
              ", Venter's alleles: ", $diff->allele_string, "\n";
    }
Further help
For additional information or help, email the ensembl-dev mailing list. You will need to subscribe to this mailing list to use it. More information on subscribing to any Ensembl mailing list is available from the Ensembl Contacts page.