Bio::EnsEMBL::Pipeline::Alignment
EvidenceAlignment
Toolbar
Summary
Bio::EnsEMBL::Pipeline::Alignment::EvidenceAlignment
Package variables
No package variables defined.
Included modules
Inherit
Synopsis
This module allows a transcript to be displayed with its
attached supporting evidence. It re-creates an alignment
from the database which is returned as an array of
multiply-aligned sequences. These can be printed as text
and used to display the alignment with an alignment
viewer/editor.
Quick start - use the following code if you want the
alignment of an ensembl transcript with the evidence
used to predict it:
my $evidence_alignment =
Bio::EnsEMBL::Pipeline::Alignment::EvidenceAlignment->new(
-transcript => $transcript,
-dbadaptor => $db,
-seqfetcher => $seqfetcher,
-padding => 10);
my $alignment =
$evidence_alignment->retrieve_alignment(
'-type' => 'all',
# Type can be 'all', 'nucleotide' or 'protein'
'-remove_introns' => 1);
my $align_seqs = $alignment->fetch_AlignmentSeqs;
foreach my $align_seq (@$align_seqs){
print $align_seq->seq . "\n";
}
The '-padding' option specifies the amount of tailing
sequence upstream and downstream of the transcript.
Other alignment presentation options exist. The intronic
regions of the transcript can be left in the alignment, but
in some cases can make it _very_ big:
my $alignment =
$align_tool->retrieve_alignment(
'-type' => 'all',
'-remove_introns' => 1);
If you are displaying protein sequences in your alignment, you
can use either single letter or three letter amino acid codes.
By default the alignment is generated with single letter codes,
which is fractionally faster than three letter codes. To
switch on the three letter codes specify the '-three_letter_aa'
argument when retrieving an alignment, like so:
my $alignment =
$evidence_alignment->retrieve_alignment(
'-type' => 'all',
'-three_letter_aa' => 1,);
As it might be mind-bogglingly useful to some folks, it is possible
to display a transcript with a set of external supporting features
that can come from any source. Where external features are used,
the supporting features attached to the transcript are ignored and
the external set used instead. This feature is only going to work if
the external features are mapped to the same assembly as the transcript.
It also usually helps if the external supporting features actually overlap
the transcript sequence :) External features are attached to the evidence
alignment object at the time of object creation:
my $evidence_alignment =
Bio::EnsEMBL::Pipeline::Alignment::EvidenceAlignment->new(
'-dbadaptor' => $db,
'-seqfetcher' => $seqfetcher,
'-transcript' => $transcript,
'-supporting_features' => \@supporting_features);
Description
Object for reconstructing alignments of predicted transcripts
and their associated supporting evidence. Produces an alignment
that is an array of Bio::Seq objects that can be printed in
fasta format (for example) and displayed using an alignment
viewer, such as SeaView or JalView.
Methods
Methods description
Arg [1] : String. Alignment type (one of 'all', 'nucleotide' or 'protein'). Arg [2] : int (1 or 0). Set to 1 if introns are to be truncated. Example : my $status = $self->_align('all', 1); Description: Top level internal method that builds alignment from the database. While other sudsidary methods perform tasks such as building sequences from cigar strings, this method does the difficult reconciliation of multiple pairwise alignments into a single multiple alignment. Returntype : int Exceptions : none. Caller : retrieve_alignment |
Arg [1] : Example : Description: Returntype : Exceptions : Caller : |
Arg [1] : Example : Description: Takes a single align feature and constructs an aligned sequence of the correct portion of the hit sequence including all of the right gaps. Creates a reference array that allows the correct insertion of gaps in the genomic sequence. Returntype : Exceptions : Caller : |
Arg [1] : Example : Description: Returntype : Exceptions : Caller : |
Arg [1] : Example : Description: Returntype : Exceptions : Caller : |
Arg [1] : Example : Description: Uses all supporting features to generate a list of unique evidence sequences. Returntype : Exceptions : Caller : |
Arg [1] : String. Alignment type ('all', 'nucleotide' or 'protein'). Example : $self->_create_Alignment_object('all') Description: Builds the final Bio::EnsEMBL::Pipeline::Alignment object and and is responsible for adding each aligned sequence. Returntype : Bio::EnsEMBL::Pipeline::Alignment Exceptions : none Caller : retrieve_alignment |
Arg [1] : Example : Description: Returntype : Exceptions : Caller : |
Arg [1] : Example : Description: Returntype : Exceptions : Caller : |
Arg [1] : Example : Description: Returntype : Exceptions : Caller : |
Arg [1] : Example : Description: Returntype : Exceptions : Caller : |
Arg [1] : Example : Description: Returntype : Exceptions : Caller : |
Arg [1] : Example : Description: Returntype : Exceptions : Caller : |
Arg [1] : Example : Description: Returntype : Exceptions : Caller : |
Arg [1] : String. Alignment type ('all', 'nucleotide' or 'protein'). Arg [2] : int (1 or 0). Set to 1 if alignment has been previously computed. Example : $self->_is_computed('nucleotide', 1) Description: Stores information about the type of alignment that may have been previously computed. Returntype : int Exceptions : Warns if alignment previous computed, but of another type. Throws if alignment type is unknown. Caller : retrieve_alignment, _align |
Arg [1] : Example : Description: Returntype : Exceptions : Caller : |
Arg [1] : Example : Description: A debugging subroutine for printing the coordinate locations of our gene, exons, feature and evidence. Returntype : Exceptions : Caller : |
Arg [1] : Example : Description: Returntype : Exceptions : Caller : |
Arg [1] : Example : Description: Returntype : Exceptions : Caller : |
Arg [1] : Example : Description: Returntype : Exceptions : Caller : |
Arg [1] : Example : Description: Returntype : Exceptions : Caller : |
Arg [1] : Example : Description: Returntype : Exceptions : Caller : |
Arg [1] : Example : Description: Returntype : Exceptions : Caller : |
Arg [1] : Example : Description: Returntype : Exceptions : Caller : |
Arg [1] : Example : Description: Returntype : Exceptions : Caller : |
Arg [1] : Example : Description: Toggle indicating whether translations are available. Returntype : Exceptions : Caller : |
Args : none Example : $self->_truncate_introns Description: This method is responsible for removing intronic regions from the overall sequence alignment. Returntype : int Exceptions : Warns (via info) if the truncation of any intron causes part of the evidence sequence to be discarded. Caller : _align |
Arg [1] : String. Alignment type ('all', 'nucleotide' or 'protein'). Example : $self->_type('all') Description: Stores the type of alignment that has been computed. Returntype : String Exceptions : Throws if alignment type is unknown Caller : _is_computed, _working_alignment, _corroborating_sequences, _truncate_introns, retrieve_alignment, new |
Arg [1] : String. Alignment sequence identity (or 'slot'). Should be one of 'genomic_sequence', 'exon_protein', 'exon_nucleotide', 'evidence', or 'unaligned'.
Arg [2] : Bio::EnsEMBL::Pipeline::Alignment::AlignmentSeq
OR string 'empty'. Pass an align seq to add it to the
slot or pass 'empty' to clear the slot.
Example : $self->('evidence', $align_seq) or perhaps
$self->('evidence', 'empty')
Description: Stores the different types of alignment sequences
while the alignment is being built.
Returntype : Bio::EnsEMBL::Pipeline::Alignment::AlignmentSeq or
a list of these objects ('evidence' is an array, while
each of the other slots are single sequences). Returns
0 if slot is empty.
Exceptions : Throws if slot type is unrecognised.
Throws if sequence is not a
Bio::EnsEMBL::Pipeline::Alignment::AlignmentSeq.
Caller : _create_Alignment_object, _truncate_introns, _align |
Arg [db] : a Bio::EnsEMBL::DBSQL::DBAdaptor object. Arg [transcript] : a Bio::EnsEMBL::Transcript object. Arg [seqfetcher] : a Bio::EnsEMBL::Pipeline::SeqFetcher object. Arg [padding] : (optional) int, default 10, transcript slice padding. Arg [evidence_identity_cutoff] : (optional) percentage id cutoff to be applied to supporting features. Arg [supporting_features] : (optional) a listref of Bio::EnsEMBL::SeqFeature. These will be displayed instead of the evidence attached to the transcript. Example : my $evidence_alignment = Bio::EnsEMBL::Pipeline::Alignment::EvidenceAlignment->new( -transcript => $transcript, -dbadaptor => $db, -seqfetcher => $seqfetcher, -padding => 10); Description: Constructs new EvidenceAlignment object. If a listref of supporting features is provided, these will be displayed instead of the features attached to the transcript. Returntype : Bio::EnsEMBL::Pipeline::Alignment::EvidenceAlignment Exceptions : Will throw if isnt passed DBAdaptor/Transcript/SeqFetcher. Warns if transcript does not have evidence attached, or if all evidence is discarded or unfetchable. Caller : General |
Arg [type] : String. Alignment type (all, nucleotide, protein). Arg [remove_introns] : 0 or 1. Truncate intron sequences from alignment. Arg [three_letter_aa] : 0 or 1. Display amino acid sequences with three-letter instead of single-letter codes. Example : my $alignment = $evidence_alignment->retrieve_alignment( '-type' => 'all', '-remove_introns' => 1, '-three_letter_aa' => 1,); Description: Retrieves an alignment object of the transcript passed during object construction. The alignment is not constructed prior to this call, hence this method drives the full alignment reconstruction process. Returntype : Bio::EnsEMBL::Pipeline::Alignment Exceptions : Warns if alignment generation fails. Caller : General. |
Methods code
sub _align
{ my ($self, $type, $truncate_introns) = @_;
my $evidence_sequences = $self->_corroborating_sequences;
return 0 unless $evidence_sequences;
my %deletion_sets;
my %all_deletions;
my %deletion_tracking;
my %evidence_sequence_hash;
foreach my $evidence_sequence (@$evidence_sequences) {
$evidence_sequence_hash{$evidence_sequence} = $evidence_sequence;
while (my ($deletion_coord, $length) = each %{$evidence_sequence->deletion_hash}){
$deletion_sets{$evidence_sequence}->{$deletion_coord} = $length;
$all_deletions{$deletion_coord} = $length
unless ((defined $all_deletions{$deletion_coord} &&
$all_deletions{$deletion_coord} > $length));
push @{$deletion_tracking{$deletion_coord}},
[$length, $evidence_sequence->name];
}
}
my @all_deletions = sort {$a <=> $b} keys %all_deletions;
$self->{_total_inserted_deletions} = 0;
foreach my $deletion_coord (@all_deletions) {
my $length = $all_deletions{$deletion_coord};
$self->_genomic_sequence->insert_gap($deletion_coord, $length);
$self->{_total_inserted_deletions}++; $self->_exon_nucleotide_sequence->insert_gap($deletion_coord, $length);
$self->_exon_protein_translation->insert_gap($deletion_coord, $length);
foreach my $evidence_name (keys %evidence_sequence_hash) {
if (! $deletion_sets{$evidence_name}->{$deletion_coord}) {
$evidence_sequence_hash{$evidence_name}->insert_gap($deletion_coord, $length)
} elsif (defined $deletion_sets{$evidence_name}->{$deletion_coord} &&
$deletion_sets{$evidence_name}->{$deletion_coord} < $length){
my $insert_coord =
$deletion_coord + $deletion_sets{$evidence_name}->{$deletion_coord};
my $deletion_length_difference =
$length - $deletion_sets{$evidence_name}->{$deletion_coord};
$evidence_sequence_hash{$evidence_name}->insert_gap($insert_coord,
$deletion_length_difference)
}
}
my %new_all_deletions;
for (my $i = 0; $i < scalar @all_deletions; $i++) {
if ($all_deletions[$i] > $deletion_coord) {
$all_deletions[$i] += $length;
$new_all_deletions{$all_deletions[$i]} =
$all_deletions{$all_deletions[$i] - $length};
}
}
%all_deletions = %new_all_deletions;
my %new_deletion_sets;
foreach my $evidence_name (keys %evidence_sequence_hash) {
my @coords = keys %{$deletion_sets{$evidence_name}};
for (my $i = 0; $i < scalar @coords; $i++) {
if ($deletion_sets{$evidence_name}->{$coords[$i]}){
$new_deletion_sets{$evidence_name}->{$coords[$i]+$length} =
$deletion_sets{$evidence_name}->{$coords[$i]};
} else {
$new_deletion_sets{$evidence_name}->{$coords[$i]} =
$deletion_sets{$evidence_name}->{$coords[$i]};
}
}
}
%deletion_sets = %new_deletion_sets;
my %new_deletion_tracking;
my @coords = keys %deletion_tracking;
for (my $i = 0; $i < scalar @coords; $i++) {
if ($coords[$i] > $deletion_coord){
$new_deletion_tracking{$coords[$i]+$length} = $deletion_tracking{$coords[$i]}
} else {
$new_deletion_tracking{$coords[$i]} = $deletion_tracking{$coords[$i]}
}
}
%deletion_tracking = %new_deletion_tracking;
}
$self->_working_alignment('genomic_sequence', $self->_genomic_sequence);
$self->_working_alignment('exon_nucleotide', $self->_exon_nucleotide_sequence);
if ($self->_translatable
&&(($type eq 'protein')||($type eq 'all'))) {
$self->_working_alignment('exon_protein', $self->_exon_protein_translation);
}
foreach my $evidence_key (keys %evidence_sequence_hash) {
$self->_working_alignment('evidence',
$evidence_sequence_hash{$evidence_key});
}
if ($truncate_introns) {
$self->_truncate_introns;
}
$self->_is_computed($type, 1);
return 1; } |
sub _all_supporting_features
{ my $self = shift;
if (@_) {
my $bafs = shift;
foreach my $baf (@$bafs){
throw("Evidence provided does not consist " .
"of Bio::EnsEMBL::BaseAlignFeature")
unless $baf->isa("Bio::EnsEMBL::BaseAlignFeature");
$baf = $baf->transfer($self->_slice);
push @{$self->{_all_supporting_features}}, $baf;
}
}
unless ($self->{_all_supporting_features}){
my $sfs = $self->_transcript->get_all_supporting_features;
if (@$sfs) {
push @{$self->{_all_supporting_features}}, @$sfs;
} else {
foreach my $exon (@{$self->_transcript->get_all_Exons}){
push @{$self->{_all_supporting_features}},
@{$exon->get_all_supporting_features};
}
}
$self->{_all_supporting_features} =
$self->_remove_duplicate_features($self->{_all_supporting_features});
}
return $self->{_all_supporting_features} } |
sub _build_evidence_seq
{ my ($self, $base_align_feature) = @_;
my $hstart = $base_align_feature->hstart;
my $hend = $base_align_feature->hend;
my $cigar = $base_align_feature->cigar_string;
my $hstrand = $base_align_feature->hstrand;
my @cigar_instructions = $self->_cigar_reader($cigar);
my $fetched_seq = $self->_fetch_sequence($base_align_feature->hseqname);
if ( ! $fetched_seq) {
info("Error fetching sequence [" .
$base_align_feature->hseqname .
"]. Ignoring.");
return 0;
} elsif (($fetched_seq->seq eq '')||
($fetched_seq->seq eq '-' x (length($fetched_seq->seq)))) {
warning("Error fetching sequence [" .
$base_align_feature->hseqname .
"]. Sequence is an empty string.");
return 0;
}
if ($base_align_feature->isa("Bio::EnsEMBL::DnaPepAlignFeature")){
unless ($hstrand == $self->_strand) {
print STDERR "Strand : " . $self->_strand . "\tHstrand : " . $hstrand . "\n";
warning("Protein align feature [". $base_align_feature->hseqname .
"] is reversed with respect to transcript - and I cant reverse " .
"compliment an amino acid sequence.");
return 0
}
my $aa_seq = $fetched_seq->seq;
if (($hstart > length($aa_seq))||
($hend > length($aa_seq))){
warning("Evidence sequence coordinates lie outside " .
"the bounds of that sequence. Data problem.");
}
$fetched_seq = substr($aa_seq, ($hstart - 1), ($hend - $hstart + 1));
unless ($fetched_seq ne ''){
throw("Error building amino acid sequence [" . $base_align_feature->hseqname
. "]. Feature does not seem to lie within the bounds of the " .
"evidence sequence.");
}
if ($self->_three_letter_aa) {
$fetched_seq =~ s/(.)/$self->{_aa_names}{$1}/g;
} else {
$fetched_seq =~ s/(.)/$1\-\-/g;
}
}
if ($base_align_feature->isa("Bio::EnsEMBL::DnaDnaAlignFeature")) {
$fetched_seq = $fetched_seq->seq;
$fetched_seq = substr($fetched_seq, ($hstart - 1), ($hend - $hstart + 1));
if ($hstrand == -1) {
my $backwards_seq = Bio::Seq->new('-seq' => $fetched_seq);
my $forwards_seq = $backwards_seq->revcom;
$fetched_seq = $forwards_seq->seq;
}
}
my $hit_position = 1;
my @deletions;
my $feature_deletions = 0;
my $feature_insertions = 0;
my $deletions_upstream_of_slice = 0;
my $slice_position = $base_align_feature->start;
my $slice_rev_position = $self->_slice->length - $base_align_feature->end + 1;
foreach my $instruction (@cigar_instructions) {
if ($instruction->{type} eq 'I') {
my $gap = '-' x $instruction->{length};
if ($hit_position <= length $fetched_seq){
$fetched_seq = substr($fetched_seq, 0, $hit_position - 1) .
$gap . substr($fetched_seq, $hit_position - 1);
} else {
throw("Gap in sequence [" . $base_align_feature->hseqname .
"] lies outside feature. Code problem.")
}
$hit_position += $instruction->{length};
$slice_position += $instruction->{length};
$slice_rev_position += $instruction->{length};
$feature_insertions += $instruction->{length};
} elsif ($instruction->{type} eq 'M') {
$hit_position += $instruction->{length};
$slice_position += $instruction->{length};
$slice_rev_position += $instruction->{length};
} elsif ($instruction->{type} eq 'D') {
if ($slice_position <= 0){
$deletions_upstream_of_slice++;
next
}
if ($slice_position <= $self->_slice->length) {
push @deletions, [$slice_position, $instruction->{length}];
$feature_deletions += $instruction->{length};
}
}
}
if (scalar @deletions) {
$self->_there_are_deletions(1);
}
if ($base_align_feature->start < 0 ||
$base_align_feature->start > $self->_slice->length ||
$base_align_feature->end > $self->_slice->length ||
$base_align_feature->end < 0) {
info("Feature [". $base_align_feature->hseqname . " start:" .
$base_align_feature->start . " end:" . $base_align_feature->end
."] extends\npast the start or end of genomic slice. Truncating\n" .
"overhanging sequence.");
if (($base_align_feature->start < 0 &&
$base_align_feature->end < 0)||
($base_align_feature->start > $self->_slice->length &&
$base_align_feature->end > $self->_slice->length)) {
info("Feature [" . $base_align_feature->hseqname .
"] lies completely outside the bounds of Slice. Chuck.");
return 0
} else {
my $start_overshoot = 0;
if ($base_align_feature->start < 0) {
info("Feature [". $base_align_feature->hseqname .
"] extends past the beginning of the slice. Truncating.\n");
$start_overshoot =
($base_align_feature->start * -1) + $deletions_upstream_of_slice;
$fetched_seq = substr($fetched_seq, $start_overshoot + 1);
}
if ($base_align_feature->end > $self->_slice->length) {
info("Feature [". $base_align_feature->hseqname .
"] extends past the end of the slice. Truncating.\n");
my $end_overshoot = $base_align_feature->end - $self->_slice->length - 1;
$fetched_seq = substr($fetched_seq, 0,
(length $fetched_seq) - $start_overshoot - $end_overshoot - 1);
}
}
}
my $genomic_start;
$genomic_start =
$base_align_feature->start > 0 ? $base_align_feature->start - 1 : 0;
if ($genomic_start <= $self->_slice->length){
my $feat_align_seq =
$self->_feature_sequence($base_align_feature->hseqname);
my $feature_sequence = $feat_align_seq->seq;
my $all_deletions = $feat_align_seq->deletion_coords;
my $upstream_deletion_count = 0;
my $total_deletion_count = $feature_deletions;
foreach my $deletion (@$all_deletions) {
if ($deletion <= $genomic_start) {
$upstream_deletion_count +=
$feat_align_seq->deletion_hash->{$deletion};
}
$total_deletion_count +=
$feat_align_seq->deletion_hash->{$deletion};
}
my $deletion_gap = '-' x $feature_deletions;
my $insert_end =
$genomic_start + $upstream_deletion_count + length($fetched_seq);
my $length_to_end =
$self->_slice->length + $total_deletion_count - $insert_end;
unless (($insert_end + $feature_deletions) <
($self->_slice->length + $total_deletion_count)) {
my $overrun =
$insert_end + $feature_deletions -
$self->_slice->length - $total_deletion_count;
$insert_end =
$self->_slice->length + $total_deletion_count - $feature_deletions;
$length_to_end = 0;
if (length($deletion_gap) > $overrun){
$deletion_gap = '-' x (length($deletion_gap) - $overrun);
} else {
$overrun -= length($deletion_gap);
$deletion_gap = '';
$fetched_seq =
substr($fetched_seq, 0, (length($fetched_seq) - $overrun));
}
}
$feature_sequence =
substr($feature_sequence, 0, ($genomic_start + $upstream_deletion_count))
. $fetched_seq . $deletion_gap .
substr($feature_sequence, $insert_end, $length_to_end);
if (length($feature_sequence) !=
$self->_slice->length + $total_deletion_count){
throw("Building a feature sequence [" . $base_align_feature->hseqname .
"] that is longer than our genomic slice.");
}
$self->_feature_sequence($base_align_feature->hseqname,
$feature_sequence,\@
deletions);
} else {
info("Feature [". $base_align_feature->hseqname . " start:" .
$base_align_feature->start . " end:" . $base_align_feature->end .
" strand:" . $base_align_feature->strand
."] starts beyond end of genomic slice. This feature most probably " .
"aligns to an exon that is not part of this transcript.");
return 0
}
return 1 } |
sub _build_sequence_cache
{ my ($self) = @_;
my %hash_of_accessions;
foreach my $supporting_feature (@{$self->_all_supporting_features}){
$hash_of_accessions{$supporting_feature->hseqname}++;
}
my @array_of_accessions = keys %hash_of_accessions;
my $fetched_seqs;
if ($self->_seq_fetcher->can("batch_fetch")){
eval {
$fetched_seqs = $self->_seq_fetcher->batch_fetch(@array_of_accessions);
};
if ($@){
info("Not all evidence sequences could be pfetched.\n".
"Ignoring missing sequences.\n$@\n");
}
} else {
foreach my $accession (@array_of_accessions){
my $fetched_seq;
eval {
$fetched_seq = $self->_seq_fetcher->get_Seq_by_acc($accession);
};
if ($@) {
warning("The seqfetcher is having trouble.\t$@\n");
}
push(@$fetched_seqs, $fetched_seq);
}
}
foreach my $fetched_seq (@$fetched_seqs){
next unless defined $fetched_seq;
$self->{_fetched_seq_cache}->{$fetched_seq->accession_number}
= $fetched_seq;
}
$self->{_cache_is_built} = 1; } |
sub _cigar_reader
{ my ($self, $cigar_string) = @_;
my @cigar_array = split //, $cigar_string;
my @cigar_elements;
my $current_digits;
while (scalar @cigar_array) {
my $next_char = shift @cigar_array;
if ($next_char =~ /[MDI]/) {
$current_digits = 1
unless ($current_digits && $current_digits > 0);
my %enduring_hash;
$enduring_hash{type} = $next_char;
$enduring_hash{length} = $current_digits;
push (@cigar_elements,\% enduring_hash);
$current_digits = '';
} elsif ($next_char =~ /\d/) {
$current_digits .= $next_char;
} else {
throw "There is an odd character in the CIGAR string retrieved from the database.\n" .
$cigar_string . "\n";
}
}
return @cigar_elements;
}
1; } |
sub _corroborating_sequences
{ my $self = shift;
my %name_vs_type;
FEATURE:
foreach my $base_align_feature (@{$self->_all_supporting_features}){
if ((($self->_type eq 'nucleotide')
&&($base_align_feature->isa("Bio::EnsEMBL::DnaPepAlignFeature")))
||(($self->_type eq 'protein')
&&($base_align_feature->isa("Bio::EnsEMBL::DnaDnaAlignFeature")))){
next FEATURE;
}
if ((defined $base_align_feature->percent_id)
&&($base_align_feature->percent_id < $self->{_evidence_identity_cutoff})) {
next FEATURE;
}
unless (defined $name_vs_type{$base_align_feature->hseqname}){
if ($base_align_feature->isa("Bio::EnsEMBL::DnaPepAlignFeature")){
$name_vs_type{$base_align_feature->hseqname} = 'protein';
} elsif ($base_align_feature->isa("Bio::EnsEMBL::DnaDnaAlignFeature")){
$name_vs_type{$base_align_feature->hseqname} = 'nucleotide';
}
}
$self->_build_evidence_seq($base_align_feature);
}
my $corroborating_sequences = $self->_feature_sequences;
for (my $i = 0; $i < scalar @$corroborating_sequences; $i++) {
$corroborating_sequences->[$i]->type($name_vs_type{$corroborating_sequences->[$i]->name});
}
unless (defined $corroborating_sequences &&
scalar @$corroborating_sequences > 0) {
warning("There are no displayable supporting features for this transcript [" .
$self->_transcript->stable_id . "]. " .
"Try setting the -type attribute to 'all', instead of just " .
"'protein' or 'nucleotide'.");
return 0
}
return $corroborating_sequences; } |
sub _create_Alignment_object
{ my ($self, $type) = @_;
my $alignment = Bio::EnsEMBL::Pipeline::Alignment->new(
'-name' => 'evidence alignment');
$alignment->add_sequence($self->_working_alignment('genomic_sequence'));
$alignment->add_sequence($self->_working_alignment('exon_nucleotide'));
if ($self->_translatable
&&(($type eq 'protein')||($type eq 'all'))) {
$alignment->add_sequence($self->_working_alignment('exon_protein'));
}
foreach my $evidence_sequence (@{$self->_working_alignment('evidence')}){
$alignment->add_sequence($evidence_sequence);
}
return $alignment; } |
sub _db
{ my $self = shift;
if (@_) {
$self->{_db_adaptor} = shift;
}
return $self->{_db_adaptor} } |
sub _exon_nucleotide_sequence
{ my ($self) = @_;
if (!defined $self->{_exon_nucleotide_sequence}) {
my @exon_list;
foreach my $exon (@{$self->_transcript->get_all_Exons}) {
my $exon_start;
$exon_start = $exon->start - 1;
push @exon_list, [$exon, $exon_start];
}
@exon_list = sort {$a->[1] <=> $b->[1]} @exon_list;
my $genomic_exon_seq = '';
my $gap_start = 1;
foreach my $exon_item (@exon_list) {
my $exon = $exon_item->[0];
my $exon_start = $exon_item->[1];
my $exon_seq = $exon->seq->seq;
my $gap_length = $exon_start - $gap_start + 1;
$genomic_exon_seq .= ('-' x $gap_length) . $exon_seq;
$gap_start += $gap_length + length($exon_seq);
}
my $final_gap_length = $self->_slice->length - $gap_start + 1;
$genomic_exon_seq .= ('-' x $final_gap_length);
$self->{_exon_nucleotide_sequence}
= Bio::EnsEMBL::Pipeline::Alignment::AlignmentSeq->new(
'-seq' => $genomic_exon_seq,
'-name' => 'exon_sequence',
'-type' => 'nucleotide');
}
return $self->{_exon_nucleotide_sequence}; } |
sub _exon_protein_translation
{
my ($self) = @_;
if (! defined $self->{_exon_protein_translation}) {
my @exon_list;
foreach my $exon (@{$self->_transcript->get_all_Exons}){
my $end_coding_exon = 0;
if (($exon->start < $self->_transcript->coding_region_end)&&
($exon->end >= $self->_transcript->coding_region_end)) {
$end_coding_exon = 1;
}
my $start_exon = 0;
if ($exon == $self->_transcript->start_Exon) {
$start_exon = 1;
}
my $seq = $exon->peptide($self->_transcript);
my $peptide = $seq->seq;
next unless (length($peptide));
$peptide =~ s/(.)/$1\-\-/g;
if ($exon->phase == 1){
$peptide = substr($peptide, 1);
} elsif ($exon->phase == 2){
$peptide = substr($peptide, 2);
}
if ($exon->end_phase == 2){
$peptide = substr($peptide, 0, length($peptide) - 1);
} elsif ($exon->end_phase == 1){
$peptide = substr($peptide, 0, length($peptide) - 2);
}
my $peptide_genomic_start = $exon->end - length($peptide) + 1;
if ($start_exon && $end_coding_exon){ my $utr = $self->_transcript->five_prime_utr;
my $utr_length = $utr ? length($utr->seq) : 0;
$peptide_genomic_start = $exon->start + $utr_length;
} elsif ($end_coding_exon) {
$peptide_genomic_start = $exon->start;
}
push @exon_list, [$peptide_genomic_start, $peptide];
}
my $exon_translation_string = '';
foreach my $exon_item (@exon_list){
my $genomic_start = $exon_item->[0];
my $peptide_seq = $exon_item->[1];
my $gap_length = $genomic_start - 1 - length($exon_translation_string);
$exon_translation_string .= ('-' x $gap_length) . $peptide_seq;
}
my $last_gap_length = $self->_slice->length - length($exon_translation_string);
$exon_translation_string .= ('-' x $last_gap_length);
if ($self->_three_letter_aa) {
$exon_translation_string =~ s/([^\-])\-\-/$self->{_aa_names}{$1}/g;
}
$self->{_exon_protein_translation} =
Bio::EnsEMBL::Pipeline::Alignment::AlignmentSeq->new(
'-seq' => $exon_translation_string,
'-name' => 'translated_exon_sequence',
'-type' => 'protein');
}
return $self->{_exon_protein_translation}; } |
sub _feature_sequence
{ my ($self, $name, $sequence, $deletion_coords) = @_;
unless (defined $name) {
throw("Cant do anything without a sequence name.")
}
unless (defined $self->{_feature_sequences}){
$self->{_feature_sequences} = {}
}
if (! defined $self->{_feature_sequences}->{$name}) {
if (! defined $sequence) {
$sequence = '-' x $self->_slice->length;
}
$self->{_feature_sequences}->{$name} =
Bio::EnsEMBL::Pipeline::Alignment::AlignmentSeq->new(
'-name' => $name,
'-seq' => $sequence);
$sequence = undef;
} elsif (defined $sequence) {
$self->{_feature_sequences}->{$name}->seq($sequence);
if (defined $deletion_coords){
foreach my $deletion (@$deletion_coords){
$self->{_feature_sequences}->{$name}->add_deletion($deletion->[0],
$deletion->[1])
}
}
}
return $self->{_feature_sequences}->{$name} } |
sub _feature_sequences
{ my $self = shift;
my @feat_seqs = values %{$self->{_feature_sequences}};
return\@ feat_seqs } |
sub _fetch_sequence
{ my ($self, $accession) = @_;
$self->_build_sequence_cache
unless $self->{_cache_is_built};
unless ($self->{_fetched_seq_cache}->{$accession}){
info("Sequence $accession could not be retrieved from cache.");
}
return $self->{_fetched_seq_cache}->{$accession};
}
} |
sub _genomic_sequence
{ my ($self) = @_;
if (!defined $self->{_genomic_sequence}) {
my $genomic_sequence;
$genomic_sequence = $self->_slice->seq;
$self->{_genomic_sequence} =
Bio::EnsEMBL::Pipeline::Alignment::AlignmentSeq->new(
'-seq' => $genomic_sequence,
'-name' => 'genomic_sequence',
'-type' => 'nucleotide');
}
return $self->{_genomic_sequence}; } |
sub _is_computed
{ my ($self, $type, $value) = @_;
if ((!defined $type)&&(!$self->_type)) {
return 0;
}
if ((!defined $type)&&($self->_type)) {
return 1;
}
if (!defined $self->{_is_computed}) {
$self->{_is_computed} = 0;
}
if ((!defined $value)&&($self->{_is_computed})&&($type ne $self->_type)) {
warning("Alignment has been previously computed, but was not\n" .
"of the same type. The previously computed alignment\n" .
"type was\' " . $self->_type . "\'.\n");
return 0;
}
if (defined $value && $value > 0) {
if ((defined $type)
&&(($type eq 'nucleotide')||
($type eq 'protein')||
($type eq 'all'))){
$self->_type($type);
} else {
throw("Unknown alignment type. Can be nucleotide, protein or all.\n");
return 0;
}
$self->{_is_computed} = 1;
}
return $self->{_is_computed}; } |
sub _padding
{ my $self = shift;
if (@_) {
$self->{_padding} = shift;
}
return $self->{_padding} ? $self->{_padding} : 0; } |
sub _print_tabulated_coordinates
{ my $self = shift;
print STDERR
"Slice : length - " . $self->_slice->length . "\n" .
" chr - " . $self->_slice->seq_region_name . "\n" .
" chr start - " . $self->_slice->start . "\n" .
" chr end - " . $self->_slice->end . "\n" .
" strand - " . $self->_slice->strand . "\n";
print STDERR "\n";
my $exons = $self->_transcript->get_all_Exons;
print STDERR
"Transcript : id - " . $self->_transcript->stable_id . "\n" .
" coding - (start) " . $self->_transcript->coding_region_start . " (end) " .
$self->_transcript->coding_region_end . "\n" .
" exon count - " . scalar @$exons . "\n";
foreach my $exon (@$exons) {
print STDERR
" exon - (start) " . $exon->start .
" (end) " . $exon->end . " (strand) " . $exon->strand . "\n";
}
print STDERR "\n";
my $supporting_features = $self->_all_supporting_features;
print STDERR
"Evidence : total features - " . @$supporting_features . "\n";
foreach my $feature (@$supporting_features){
print STDERR
" Feature : " . $feature->hseqname . "\n" .
" (start) " . $feature->start . "\t(end) " .
$feature->end . "\t(strand) " . $feature->strand . "\n" .
" (hstart) " . $feature->hstart . "\t(hend) " .
$feature->hend . "\t(hstrand) " . $feature->hstrand . "\n" .
" (CIGAR) " . $feature->cigar_string . "\n";
}
return 1 } |
sub _remove_duplicate_features
{ my ($self, $features) = @_;
unless (defined $features){
throw("No features passed to filter.")
}
my %feature_lookup;
my %feats_by_id;
my @filtered_features;
FEAT:
foreach my $feat (@$features){
my $feat_identifier
= $feat->hseqname . $feat->hstart . $feat->hend . $feat->hstrand;
unless ($feature_lookup{$feat_identifier}){
foreach my $same_id_feat (@{$feats_by_id{$feat->hseqname}}) {
if ($feat->end() >= $same_id_feat->start() &&
$feat->start() <= $same_id_feat->end()) {
next FEAT
}
}
push @filtered_features, $feat;
push(@{$feats_by_id{$feat->hseqname}}, $feat);
}
$feature_lookup{$feat_identifier}++
}
return\@ filtered_features
}
} |
sub _seq_fetcher
{ my ($self, $fetcher) = @_;
if (defined $fetcher) {
$self->{_seq_fetcher} = $fetcher;
return 1;
}
return $self->{_seq_fetcher};
}
} |
sub _set_aa_names
{ my $self = shift;
$self->{_aa_names} = {'A' => 'Ala',
'B' => 'Asx',
'C' => 'Cys',
'D' => 'Asp',
'E' => 'Glu',
'F' => 'Phe',
'G' => 'Gly',
'H' => 'His',
'I' => 'Ile',
'K' => 'Lys',
'L' => 'Leu',
'M' => 'Met',
'N' => 'Asn',
'P' => 'Pro',
'Q' => 'Gln',
'R' => 'Arg',
'S' => 'Ser',
'T' => 'Thr',
'V' => 'Val',
'W' => 'Trp',
'Y' => 'Tyr',
'Z' => 'Glx',
'-' => '---',
'X' => 'xxx'} } |
sub _slice
{ my ($self, $object) = @_;
my $transcript;
if (defined $object &&
$object->isa("Bio::EnsEMBL::Slice")){
$self->{_slice} = $object
} elsif (defined $object &&
$object->isa("Bio::EnsEMBL::Transcript")) {
$transcript = $object
}
if (! defined $self->{_slice} && defined $transcript) {
my $slice_adaptor = $self->_db->get_SliceAdaptor;
if ($transcript->stable_id){
$self->{_slice} =
$slice_adaptor->fetch_by_transcript_stable_id($transcript->stable_id,
$self->_padding);
} else {
$self->{_slice} =
$slice_adaptor->fetch_by_transcript_id($transcript->dbID,
$self->_padding);
}
if ($transcript->strand == -1){
$self->{_slice} = $self->{_slice}->invert;
}
}
return $self->{_slice}; } |
sub _strand
{ my $self = shift;
return $self->{_transcript}->strand; } |
sub _there_are_deletions
{ my $self = shift;
if (@_) {
$self->{_there_are_deletions} = shift;
}
return $self->{_there_are_deletions}
}
} |
sub _three_letter_aa
{ my $self = shift;
if (@_) {
$self->{_three_letter_aa} = shift;
}
if ($self->{_three_letter_aa}){
$self->_set_aa_names;
}
return $self->{_three_letter_aa} ? $self->{_three_letter_aa} : 0; } |
sub _transcript
{ my $self = shift;
if (@_){
$self->{_transcript} = shift;
}
unless (defined $self->{_transcript} &&
$self->{_transcript}->isa("Bio::EnsEMBL::Transcript")){
throw("Problem with transcript. It is either unset or is not a " .
"transcript. It is a [".$self->{_transcript}."].");
}
return $self->{_transcript} } |
sub _translatable
{ my $self = shift;
if (@_) {
$self->{_translatable} = shift;
}
return $self->{_translatable}; } |
sub _truncate_introns
{ my ($self) = @_;
my @sequences;
push (@sequences, $self->_working_alignment('genomic_sequence'));
push (@sequences, $self->_working_alignment('exon_nucleotide'));
if (($self->_type eq 'all')||($self->_type eq 'protein')){
push (@sequences, $self->_working_alignment('exon_protein'));
}
foreach my $aligned_seq (@{$self->_working_alignment('evidence')}){
push (@sequences, $aligned_seq);
}
if ($self->_working_alignment('unaligned')){
foreach my $unaligned_seq (@{$self->_working_alignment('unaligned')}){
push (@sequences, $unaligned_seq);
}
}
my @coordinates;
foreach my $exon (@{$self->_transcript->get_all_Exons}){
my $intron_coord_1 = $exon->start - 1;
my $intron_coord_2 = $exon->end + 1;
push(@coordinates, $intron_coord_1, $intron_coord_2);
}
my $genomic_gaps = $self->_genomic_sequence->all_gaps;
foreach my $gap_coord (@$genomic_gaps) {
for (my $i = 0; $i < scalar @coordinates; $i++) {
$coordinates[$i]++ if $coordinates[$i] >= $gap_coord;
}
}
@coordinates = sort {$b <=> $a} @coordinates;
shift @coordinates;
pop @coordinates;
foreach my $align_seq (@sequences){
my $seq = $align_seq->seq;
my @working_coordinates = @coordinates;
INTRON:
while (@working_coordinates){
my $intron_end = (shift @working_coordinates) - 1;
my $intron_start = (shift @working_coordinates);
next INTRON unless ($intron_start + $self->_padding + 22) < ($intron_end - $self->_padding - 22);
my $offcut = substr($seq, $intron_start + $self->_padding - 1,
($intron_end - 2*$self->_padding));
if ($self->_padding == 0) {
$seq =
substr($seq, 0, $intron_start - 1) .
substr($seq, $intron_end + 1, length($seq));
} else {
$seq =
substr($seq, 0, $intron_start + $self->_padding - 1) .
'---intron-truncated---' .
substr($seq, ($intron_end - 2*$self->_padding + 1), length($seq));
}
if (($offcut =~ /[atgc]/)&&($align_seq->name ne 'genomic_sequence')) {
info("Truncating intron sequences has caused some aligned evidence\n" .
"to be discarded. Try again with a higher amount of padding\n".
"around the exon sequences.\n");
}
}
$align_seq->seq($seq);
}
return 1; } |
sub _type
{ my ($self, $type) = @_;
if (defined $type) {
unless ($type eq 'all' ||
$type eq 'nucleotide' ||
$type eq 'protein') {
throw("Type of alignment to retrieve not recognised. Please use one " .
"of\' all\',\' nucleotide\' or\' protein\'.");
}
$self->{_computed_type} = $type;
}
throw("Type of alignment to retrieve not specified. Please use one " .
"of\' all\',\' nucleotide\' or\' protein\'.")
unless $self->{_computed_type};
return $self->{_computed_type};
}
} |
sub _working_alignment
{ my ($self, $slot, $align_seq) = @_;
unless (defined $slot &&
(($slot eq 'genomic_sequence')
||($slot eq 'exon_protein')
||($slot eq 'exon_nucleotide')
||($slot eq 'evidence')
||($slot eq 'unaligned'))){
throw("Was trying to retrieve or write working alignments to "
. "a slot that isnt allowed ($slot)");
}
if (defined $slot && defined $align_seq && $align_seq eq 'empty'){
undef $self->{_working_alignment_array}->{$slot};
return 1
}
if (defined $slot && defined $align_seq){
unless ($align_seq->isa("Bio::EnsEMBL::Pipeline::Alignment::AlignmentSeq")){
throw("Sequence passed to _working alignment was not an " .
"AlignmentSeq, it was a [$align_seq]")
}
push (@{$self->{_working_alignment_array}->{$slot}}, $align_seq);
} elsif (defined $slot && !defined $align_seq) {
my $slot_resident = $self->{_working_alignment_array}->{$slot};
if ((($slot eq 'genomic_sequence')||($slot eq 'exon_protein')||($slot eq 'exon_nucleotide'))
&& defined $slot_resident && scalar @$slot_resident == 1) {
return $slot_resident->[0];
}
if ($slot eq 'evidence' && $self->_type eq 'all') {
throw "All the evidence for this gene has been thrown away or skipped. Yuck."
unless $slot_resident;
@$slot_resident = sort {$a->type cmp $b->type} @$slot_resident;
}
return $slot_resident;
}
return 0;
}
} |
sub new
{ my ($class, @args) = @_;
my $self = bless {},$class;
my ($db,
$transcript,
$seqfetcher,
$padding,
$evidence_identity_cutoff,
$supporting_features) = rearrange([qw(DBADAPTOR
TRANSCRIPT
SEQFETCHER
PADDING
IDENTITY_CUTOFF
SUPPORTING_FEATURES
)],@args);
unless (defined $db && $db->isa("Bio::EnsEMBL::DBSQL::DBAdaptor")){
throw("No database adaptor passed to AlignmentTool. You passed a $db.");
}
unless (defined $transcript && $transcript->isa("Bio::EnsEMBL::Transcript")){
throw("No transcript passed to AlignmentTool. You passed a $transcript.");
}
unless (defined $seqfetcher && $seqfetcher->isa("Bio::DB::RandomAccessI")) {
throw("No sequence fetcher passed to AlignmentTool. You passed a $seqfetcher.");
}
$self->_db($db);
$self->_padding($padding) if (defined $padding);
$self->_seq_fetcher($seqfetcher);
$self->_slice($transcript);
$transcript = $transcript->transfer($self->_slice);
$self->_transcript($transcript);
if ($self->_transcript->translation){
$self->_translatable(1);
} else {
warning("Database doesn't contain translation. Subsequently, ".
"wont be able to display translations of each exon.");
$self->_type('nucleotide');
$self->_translatable(0);
}
if (defined $supporting_features) {
$self->_all_supporting_features($supporting_features)
}
if ($evidence_identity_cutoff) {
$self->{_evidence_identity_cutoff} = $evidence_identity_cutoff;
} else {
$self->{_evidence_identity_cutoff} = 0;
}
return $self; } |
sub retrieve_alignment
{ my $self = shift @_;
my ($type,
$remove_introns,
$three_letter_aa) = rearrange([qw(TYPE
REMOVE_INTRONS
THREE_LETTER_AA
)],@_);
$self->_type($type);
$self->_three_letter_aa($three_letter_aa) if $three_letter_aa;
unless ($self->_is_computed($type)){
my $alignment_success = $self->_align($type,
$remove_introns);
unless ($alignment_success) {
warning "Alignment generation failed. There probably were" .
" not any displayable sequences.";
return 0
}
}
return $self->_create_Alignment_object($type);
}
} |
General documentation
Post general queries to ensembl-dev@ebi.ac.uk