Bio::EnsEMBL::Compara::Production::GenomicAlignBlock
Ortheus
Toolbar
Summary
Bio::EnsEMBL::Compara::Production::GenomicAlignBlock::Ortheus
Package variables
Privates (from "my" definitions)
$max_pads = 100
$do_hack = 0
$create_block_frag_array = 1
$pad_char = "N"
$max_pads_percent = 1.0
Included modules
Inherit
Synopsis
Description
This module acts as a layer between the Hive system and the Bio::EnsEMBL::Analysis::Runnable::Ortheus
module since the ensembl-analysis API does not know about ensembl-compara
Ortheus wants the files to be provided in the same order as in the tree string. This module starts
by getting all the DnaFragRegions of the SyntenyRegion and then use them to edit the tree (some
nodes must be removed and other ones must be duplicated in order to cope with deletions and
duplications). The build_tree_string methods numbers the sequences in order and changes the
order of the dnafrag_regions array accordingly. Last, the dumpFasta() method dumps the sequences
according to the tree_string order.
This module can be used to include low coverage 2X genomes in the alignment. To do this, the pairwise BLASTZ_NET alignments between each 2X genome and a reference species (eg human) are retrieved from specified databases.
Ortheus also generates a set of aligned ancestral sequences. This module stores them in a core-like database.
Methods
_add_to_cluster | No description | Code |
_add_to_different_cluster | No description | Code |
_add_to_same_cluster | No description | Code |
_build_2x_composite_seq | No description | Code |
_build_tree_string | Description | Code |
_cluster_regions | No description | Code |
_create_block_frag_array | No description | Code |
_create_span_frag_array | No description | Code |
_dump_2x_fasta | No description | Code |
_dump_fasta | Description | Code |
_extract_sequence | No description | Code |
_find_longest_region_in_cluster | No description | Code |
_in_cluster | No description | Code |
_load_2XGenomes | Description | Code |
_load_DnaFragRegions | Description | Code |
_merge_clusters | No description | Code |
_print_cluster | No description | Code |
_trim_gab_left | No description | Code |
_trim_gab_right | No description | Code |
_update_tree | Description | Code |
_write_gerp_dataflow | No description | Code |
dnafrag_regions | No description | Code |
fasta_files | No description | Code |
fetch_input | Description | Code |
get_params | No description | Code |
get_species_tree | No description | Code |
max_block_size | No description | Code |
method_link_species_set_id | No description | Code |
options | No description | Code |
parse_results | No description | Code |
reference_species | No description | Code |
remove_empty_cols | No description | Code |
run | No description | Code |
species_order | No description | Code |
synteny_region_id | No description | Code |
tree_string | No description | Code |
write_output | No description | Code |
Methods description
Arg [1] : -none- Example : $self->_build_tree_string(); Description: This method sets the tree_string using the orginal species tree and the set of DnaFragRegions. The tree is edited by the _update_tree method which resort the DnaFragRegions (see _update_tree elsewwhere in this document) Returntype : -none- Exception : Warning : |
Arg [1] : -none- Example : $self->_dump_fasta(); Description: Dumps FASTA files in the order given by the tree string (needed by Pecan). Resulting file names are stored using the fasta_files getter/setter Returntype : 1 Exception : Warning : |
Arg [1] : int syteny_region_id Arg [2] : int analysis_data_id Description: Creates a fake assembly for each 2X genome by stitching together the BLASTZ_NET alignments found on this synteny_region between the reference species and each 2X genome. The list of the pairwise database locations and Bio::EnsEMBL::Compara::MethodLinkSpeciesSet ids are obtained from the analysis_data_id. Creates a listref of genomic_align fragments Returntype : Exception : Warning : |
Arg [1] : int syteny_region_id Example : $self->_load_DnaFragRegions(); Description: Gets the list of DnaFragRegions for this syteny_region_id. Resulting DnaFragRegions are stored using the dnafrag_regions getter/setter. Returntype : listref of Bio::EnsEMBL::Compara::DnaFragRegion objects Exception : Warning : |
Arg [1] : Bio::EnsEMBL::Compara::NestedSet $tree_root Example : $self->_update_nodes_names($tree); Description: This method updates the tree by removing or duplicating the leaves according to the orginal tree and the set of DnaFragRegions. The tree nodes will be renamed seq1, seq2, seq3 and so on and the DnaFragRegions will be resorted in order to match the names of the nodes (the first DnaFragRegion will correspond to seq1, the second to seq2 and so on). Returntype : Bio::EnsEMBL::Compara::NestedSet (a tree) Exception : Warning : |
Title : fetch_input Usage : $self->fetch_input Function: Fetches input data for repeatmasker from the database Returns : none Args : none |
Methods code
_add_to_cluster | description | prev | next | Top |
sub _add_to_cluster
{ my ($cluster, $region1) = @_;
if (!defined $cluster) {
$cluster->[0]->{$region1} = 1;
}
return $cluster; } |
sub _add_to_different_cluster
{ my ($cluster, $region1, $region2) = @_;
if (!defined $cluster) {
$cluster->[0]->{$region1} = 1;
$cluster->[1]->{$region2} = 1;
return $cluster;
}
my $cluster_size = @$cluster;
my $index1 = _in_cluster($cluster, $region1);
my $index2 = _in_cluster($cluster, $region2);
if ($index1 == -1) {
$cluster->[@$cluster]->{$region1} = 1;
}
if ($index2 == -1) {
$cluster->[@$cluster]->{$region2} = 1;
}
return $cluster; } |
sub _add_to_same_cluster
{ my ($cluster, $region1, $region2) = @_;
if (!defined $cluster) {
$cluster->[0]->{$region1} = 1;
$cluster->[0]->{$region2} = 1;
return $cluster;
}
my $cluster_size = @$cluster;
my $index1 = _in_cluster($cluster, $region1);
my $index2 = _in_cluster($cluster, $region2);
if ($index1 == -1 && $index2 == -1) {
$cluster->[$cluster_size]->{$region1} = 1;
$cluster->[$cluster_size]->{$region2} = 1;
} elsif ($index1 != -1 && $index2 == -1) {
$cluster->[$index1]->{$region2} = 1;
} elsif ($index1 == -1 && $index2 != -1) {
$cluster->[$index2]->{$region1} = 1;
} else {
$cluster = _merge_clusters($cluster, $index1, $index2);
}
return $cluster; } |
sub _build_2x_composite_seq
{ my ($self, $pairwise_dba, $ref_slice_adaptor, $target_slice_adaptor, $ga_frags) = @_;
my $slice_array;
my $composite_seq;
@$ga_frags = sort {$a->{ref_start} <=> $b->{ref_start}} @$ga_frags;
my $first_frag = $ga_frags->[0];
my $num_pads;
my $prev_end;
my $prev_frag;
my $dnafrag_adaptor = $pairwise_dba->get_DnaFragAdaptor;
$num_pads = $first_frag->{ref_start} - $first_frag->{ref_slice_start};
$num_pads = $max_pads if ($num_pads > $max_pads);
$num_pads = int($num_pads * $max_pads_percent);
$composite_seq .= $pad_char x $num_pads;
$first_frag->{first_pads} = $num_pads;
foreach my $ga_frag (@$ga_frags) {
my $dnafrag = $dnafrag_adaptor->fetch_by_dbID($ga_frag->{dnafrag_region}->dnafrag_id);
print "species " . $dnafrag->genome_db->name . " name " . $dnafrag->name . " start " . $ga_frag->{dnafrag_region}->dnafrag_start . " end " . $ga_frag->{dnafrag_region}->dnafrag_end . " len " . ($ga_frag->{dnafrag_region}->dnafrag_end-$ga_frag->{dnafrag_region}->dnafrag_start+1) . " strand " . $ga_frag->{dnafrag_region}->dnafrag_strand . " ref_name " . $ga_frag->{ref_dnafrag}->dnafrag->name . " ref_start " . $ga_frag->{ref_start} . " ref_end " . $ga_frag->{ref_end} . " ref_len " . ($ga_frag->{ref_end}-$ga_frag->{ref_start}+1) . "\n" if $self->debug;
if (defined($prev_frag)) {
print "prev_end " . $prev_frag->{ref_end} . " start " . $ga_frag->{ref_start} . "\n" if $self->debug;
$num_pads = $ga_frag->{ref_start} - $prev_frag->{ref_end} - 1;
print "before max_pads $num_pads\n" if $self->debug;
$num_pads = $max_pads if ($num_pads > $max_pads);
$num_pads = int($num_pads * $max_pads_percent);
print "pads $num_pads\n" if $self->debug;
$composite_seq .= $pad_char x $num_pads;
$prev_frag->{num_pads} = $num_pads;
}
my $ref_slice = $ref_slice_adaptor->fetch_by_region('toplevel', $ga_frag->{ref_dnafrag}->dnafrag->name, $ga_frag->{ref_dnafrag}->dnafrag_start, $ga_frag->{ref_dnafrag}->dnafrag_end, $ga_frag->{ref_dnafrag}->dnafrag_strand);
my $slice = $target_slice_adaptor->fetch_by_region('toplevel', $dnafrag->name, $ga_frag->{dnafrag_region}->dnafrag_start, $ga_frag->{dnafrag_region}->dnafrag_end, $ga_frag->{dnafrag_region}->dnafrag_strand);
my $seq = $slice->get_repeatmasked_seq(undef, 1)->seq;
if ($seq =~ /[^ACTGactgNnXx]/) {
print STDERR $slice->name, " contains at least one non-ACTGactgNnXx character. These have been replaced by N's\n";
$seq =~ s/[^ACTGactgNnXx]/N/g;
}
$composite_seq .= $seq;
$prev_frag = $ga_frag;
}
my $last_frag = $ga_frags->[-1];
$num_pads = $first_frag->{ref_slice_end} - $last_frag->{ref_end};
$num_pads = $max_pads if ($num_pads > $max_pads);
$num_pads = int($num_pads * $max_pads_percent);
$composite_seq .= $pad_char x $num_pads;
$last_frag->{num_pads} = $num_pads;
$composite_seq =~ s/(.{80})/$1\n/g;
chomp $composite_seq;
$first_frag->{seq} = $composite_seq;
return $composite_seq; } |
sub _build_tree_string
{ my $self = shift;
my $tree = $self->get_species_tree->copy;
return if (!$tree);
$tree = $self->_update_tree($tree);
return if (!$tree);
my $tree_string = $tree->newick_simple_format;
$tree_string =~ s/"(seq\d+)"/$1/g;
$tree_string =~ s/\:0\.0+(\D)/$1/g;
$tree_string =~ s/\:0([^\.\d])/$1/g;
$tree->release_tree;
$self->tree_string($tree_string); } |
sub _cluster_regions
{ my ($self, $found_overlap) = @_;
my $overlap_cnt = 0;
my $not_overlap_cnt = 0;
my $cluster;
foreach my $region1 (keys %$found_overlap) {
foreach my $region2 (keys %{$found_overlap->{$region1}}) {
print "FOUND OVERLAP $region1 $region2 " . $found_overlap->{$region1}{$region2} . "\n" if $self->debug;
if ($found_overlap->{$region1}{$region2}) {
$overlap_cnt++;
$cluster = _add_to_same_cluster($cluster, $region1, $region2);
} else {
$not_overlap_cnt++;
$cluster = _add_to_different_cluster($cluster, $region1, $region2);
}
}
}
print "overlap_cnt $overlap_cnt not $not_overlap_cnt\n" if $self->debug;
return $cluster;
}
} |
sub _create_block_frag_array
{ my ($self, $gab_adaptor, $ref_slice_adaptor, $pairwise_mlss, $ref_dnafrags) = @_;
my $ga_frag_array;
my $ga_num_ns = 0;
foreach my $ref_dnafrag (@$ref_dnafrags) {
print " " . $ref_dnafrag->dnafrag->name . " " . $ref_dnafrag->dnafrag_start . " " . $ref_dnafrag->dnafrag_end . " " . $ref_dnafrag->dnafrag_strand . "\n" if $self->debug;
my $slice = $ref_slice_adaptor->fetch_by_region('toplevel', $ref_dnafrag->dnafrag->name, $ref_dnafrag->dnafrag_start, $ref_dnafrag->dnafrag_end, $ref_dnafrag->dnafrag_strand);
print "ref_seq " . $slice->start . " " . $slice->end . " " . $slice->strand . " " . substr($slice->seq,0,120) . "\n" if $self->debug;
my $pairwise_gabs = $gab_adaptor->fetch_all_by_MethodLinkSpeciesSet_Slice($pairwise_mlss, $slice, undef,undef,1);
@$pairwise_gabs = sort {$a->reference_genomic_align->dnafrag_start <=> $b->reference_genomic_align->dnafrag_start} @$pairwise_gabs;
print " pairwise gabs " . scalar(@$pairwise_gabs) . "\n" if $self->debug;
next if (scalar(@$pairwise_gabs) == 0);
my $ga_frags;
foreach my $pairwise_gab (@$pairwise_gabs) {
my $gas = $pairwise_gab->get_all_non_reference_genomic_aligns;
my $ga = $gas->[0];
my $ref_start = $ga->genomic_align_block->reference_genomic_align->dnafrag_start;
my $ref_end = $ga->genomic_align_block->reference_genomic_align->dnafrag_end;
if ($slice->strand == -1) {
my $tmp_start = $ref_start;
$ref_start = $slice->end - $ref_end + $slice->start;
$ref_end = $slice->end - $tmp_start + $slice->start;
}
my $dnafrag_adaptor = $self->{'comparaDBA'}->get_DnaFragAdaptor;
my $dnafrag_region = new Bio::EnsEMBL::Compara::DnaFragRegion(
-dnafrag_id => $ga->dnafrag->dbID,
-dnafrag_start => $ga->dnafrag_start,
-dnafrag_end => $ga->dnafrag_end,
-dnafrag_strand => $ga->dnafrag_strand,
-adaptor => $dnafrag_adaptor
);
my $ga_fragment = {dnafrag_region => $dnafrag_region,
genome_db => $ga->dnafrag->genome_db,
genome_db_id => $ga->dnafrag->genome_db_id,
ref_dnafrag => $ref_dnafrag,
ref_start => $ref_start,
ref_end => $ref_end,
ref_slice_start => $slice->start,
ref_slice_end => $slice->end};
push @$ga_frags, $ga_fragment;
}
push @$ga_frag_array, $ga_frags;
}
return $ga_frag_array;
}
} |
sub _create_span_frag_array
{ my ($self, $gab_adaptor, $ref_slice_adaptor, $pairwise_mlss, $ref_dnafrags) = @_;
my $ga_frag_array;
my $ga_num_ns = 0;
foreach my $ref_dnafrag (@$ref_dnafrags) {
print " " . $ref_dnafrag->dnafrag->name . " " . $ref_dnafrag->dnafrag_start . " " . $ref_dnafrag->dnafrag_end . " " . $ref_dnafrag->dnafrag_strand . "\n" if $self->debug;
my $slice = $ref_slice_adaptor->fetch_by_region('toplevel', $ref_dnafrag->dnafrag->name, $ref_dnafrag->dnafrag_start, $ref_dnafrag->dnafrag_end, $ref_dnafrag->dnafrag_strand);
print "ref_seq " . $slice->start . " " . $slice->end . " " . $slice->strand . " " . substr($slice->seq,0,120) . "\n" if $self->debug;
my $pairwise_gabs = $gab_adaptor->fetch_all_by_MethodLinkSpeciesSet_Slice($pairwise_mlss, $slice, undef,undef,1);
@$pairwise_gabs = sort {$a->reference_genomic_align->dnafrag_start <=> $b->reference_genomic_align->dnafrag_start} @$pairwise_gabs;
print " pairwise gabs " . scalar(@$pairwise_gabs) . "\n" if $self->debug;
next if (scalar(@$pairwise_gabs) == 0);
my $ga_frags;
my $prev_group_id = $pairwise_gabs->[0]->group_id;
my $min_start;
my $max_end;
my $dnafrag_name;
my $genome_db_id;
my $genome_db;
my $dnafrag_strand;
my $prev_ga;
my $ref_min_start;
my $ref_max_end;
my $dnafrag;
foreach my $pairwise_gab (@$pairwise_gabs) {
my $gas = $pairwise_gab->get_all_non_reference_genomic_aligns;
my $ga = $gas->[0];
print " " . $ga->genome_db->name . " " . $ga->dnafrag->name . " " . $ga->dnafrag_start . " " . $ga->dnafrag_end . " " . $ga->dnafrag_strand . " " . $pairwise_gab->group_id . " " . $ga->dnafrag->coord_system_name . " " . $ga->genomic_align_block->reference_genomic_align->dnafrag_start . " " . $ga->genomic_align_block->reference_genomic_align->dnafrag_end . " " . $ga->genomic_align_block->reference_genomic_align->dnafrag_strand . "\n" if $self->debug;
my $ga_slice = $ga->get_Slice;
$ga_num_ns += $ga_slice->seq =~ tr/N/N/;
if ($prev_group_id == $pairwise_gab->group_id) {
if (!defined $min_start || $ga->dnafrag_start < $min_start) {
$min_start = $ga->dnafrag_start;
$ref_min_start = $ga->genomic_align_block->reference_genomic_align->dnafrag_start;
}
if (!defined $max_end || $ga->dnafrag_end > $max_end) {
$max_end = $ga->dnafrag_end;
$ref_max_end = $ga->genomic_align_block->reference_genomic_align->dnafrag_end;
}
} else {
if ($slice->strand == -1) {
$ref_min_start = $slice->end - $ref_min_start + $slice->start;
$ref_max_end = $slice->end - $ref_max_end + $slice->start;
}
my $ref_start;
if ($ref_min_start > $ref_max_end) {
$ref_start = $ref_max_end;
$ref_max_end = $ref_min_start;
$ref_min_start = $ref_start;
}
my $dnafrag_adaptor = $self->{'comparaDBA'}->get_DnaFragAdaptor;
my $dnafrag_region = new Bio::EnsEMBL::Compara::DnaFragRegion(
-dnafrag_id => $dnafrag->dbID,
-dnafrag_start => $min_start,
-dnafrag_end => $max_end,
-dnafrag_strand => $dnafrag_strand,
-adaptor => $dnafrag_adaptor
);
my $ga_fragment = {dnafrag_region => $dnafrag_region,
genome_db => $genome_db,
genome_db_id => $genome_db_id,
ref_dnafrag => $ref_dnafrag,
ref_start => $ref_min_start,
ref_end => $ref_max_end,
ref_slice_start => $slice->start,
ref_slice_end => $slice->end};
print "store frag $min_start $max_end " . ($max_end - $min_start) . "\n" if $self->debug;
print "final seq $ref_min_start $ref_max_end " . substr($dnafrag_region->slice->seq,0,10) . "\n" if $self->debug;
push @$ga_frags, $ga_fragment;
$min_start = $ga->dnafrag_start;
$ref_min_start = $ga->genomic_align_block->reference_genomic_align->dnafrag_start;
$max_end = $ga->dnafrag_end;
$ref_max_end = $ga->genomic_align_block->reference_genomic_align->dnafrag_end;
}
$dnafrag_name = $ga->dnafrag->name;
$genome_db_id = $ga->dnafrag->genome_db_id;
$genome_db = $ga->dnafrag->genome_db;
$dnafrag = $ga->dnafrag;
$dnafrag_strand = $ga->dnafrag_strand;
$prev_group_id = $pairwise_gab->group_id;
$prev_ga = $ga;
}
if ($slice->strand == -1) {
$ref_min_start = $slice->end - $ref_min_start + $slice->start;
$ref_max_end = $slice->end - $ref_max_end + $slice->start;
}
my $ref_start;
if ($ref_min_start > $ref_max_end) {
$ref_start = $ref_max_end;
$ref_max_end = $ref_min_start;
$ref_min_start = $ref_start;
}
print "store last $min_start $max_end $ref_min_start $ref_max_end\n " if $self->debug;
my $dnafrag_region = new Bio::EnsEMBL::Compara::DnaFragRegion(
-dnafrag_id => $dnafrag->dbID,
-dnafrag_start => $min_start,
-dnafrag_end => $max_end,
-dnafrag_strand => $dnafrag_strand
);
my $ga_fragment = {dnafrag_region => $dnafrag_region,
genome_db => $genome_db,
genome_db_id => $genome_db_id,
ref_dnafrag => $ref_dnafrag,
ref_start => $ref_min_start,
ref_end => $ref_max_end,
ref_slice_start => $slice->start,
ref_slice_end => $slice->end};
push @$ga_frags, $ga_fragment;
push @$ga_frag_array, $ga_frags;
}
return $ga_frag_array;
}
} |
sub _dump_2x_fasta
{ my ($self, $ga_frags, $file, $seq_id) = @_;
open F, ">$file" || throw("Couldn't open $file");
print F ">SeqID" . $seq_id . "\n";
print F $ga_frags->[0]->{seq},"\n";
close F;
push @{$self->fasta_files}, $file;
push @{$self->species_order}, $ga_frags->[0]->{genome_db_id};
}
1; } |
sub _dump_fasta
{ my $self = shift;
my $all_dnafrag_regions = $self->dnafrag_regions;
my @seqs;
if ($self->tree_string) {
@seqs = ($self->tree_string =~ /seq(\d+)/g);
} else {
@seqs = (1..scalar(@$all_dnafrag_regions));
}
foreach my $seq_id (@seqs) {
my $dfr = $all_dnafrag_regions->[$seq_id-1];
my $file = $self->worker_temp_directory . "/seq" . $seq_id . ".fa";
print "file $file\n" if $self->debug;
if (!UNIVERSAL::isa($dfr, 'Bio::EnsEMBL::Compara::DnaFragRegion')) {
print "FOUND 2X GENOME\n" if $self->debug;
print "num of frags " . @$dfr . "\n" if $self->debug;
$self->_dump_2x_fasta($dfr, $file, $seq_id);
next;
}
open F, ">$file" || throw("Couldn't open $file");
print F ">SeqID" . $seq_id . "\n";
print ">DnaFrag", $dfr->dnafrag_id, "|", $dfr->dnafrag->name, ".",
$dfr->dnafrag_start, "-", $dfr->dnafrag_end, ":", $dfr->dnafrag_strand,"\n" if $self->debug;
my $slice = $dfr->slice;
throw("Cannot get slice for DnaFragRegion in DnaFrag #".$dfr->dnafrag_id) if (!$slice);
my $seq = $slice->get_repeatmasked_seq(undef, 1)->seq;
if ($seq =~ /[^ACTGactgNnXx]/) {
print STDERR $slice->name, " contains at least one non-ACTGactgNnXx character. These have been replaced by N's\n";
$seq =~ s/[^ACTGactgNnXx]/N/g;
}
$seq =~ s/(.{80})/$1\n/g;
chomp $seq;
print F $seq,"\n";
close F;
push @{$self->fasta_files}, $file;
push @{$self->species_order}, $dfr->dnafrag->genome_db_id;
}
return 1; } |
sub _extract_sequence
{ my ($seq, $original_start, $original_end) = @_;
my $original_count = 0;
my $aligned_count = 0;
my $aligned_start;
my $aligned_end;
foreach my $subseq (grep {$_} split /(\-+)/, $seq) {
my $length = length($subseq);
if ($subseq !~ /\-/) {
if (!defined($aligned_start) && ($original_count + $length >= $original_start)) {
$aligned_start = $aligned_count + ($original_start - $original_count) - 1;
}
if (!defined($aligned_end) && ($original_count + $length >= $original_end)) {
$aligned_end = $aligned_count + $original_end - $original_count - 1;
last;
}
$original_count += $length;
}
$aligned_count += $length;
}
my $subseq = substr($seq, $aligned_start, ($aligned_end-$aligned_start+1));
return ($subseq, $aligned_start, $aligned_end);
}
} |
sub _find_longest_region_in_cluster
{ my ($self, $cluster, $sum_lengths) = @_;
my $max_frag = 0;
my $final_region = -1;
my @overlap_frag;
my $overlap_cnt = 0;
my $not_overlap_cnt = 0;
my $longest_clusters;
foreach my $this_cluster (@$cluster) {
my $longest_frag;
my $longest_region;
foreach my $region (keys %{$this_cluster}) {
if (!defined $longest_frag) {
$longest_frag = $sum_lengths->[$region];
$longest_region = $region;
}
if ($sum_lengths->[$region] >= $longest_frag) {
$longest_frag = $sum_lengths->[$region];
$longest_region = $region;
}
}
push @$longest_clusters, $longest_region;
}
print "overlap_cnt $overlap_cnt not $not_overlap_cnt\n" if $self->debug;
return $longest_clusters;
}
} |
sub _in_cluster
{ my ($cluster, $region) = @_;
for (my $i = 0; $i < @$cluster; $i++) {
if ($cluster->[$i]->{$region}) {
return $i;
}
}
return -1; } |
sub _load_2XGenomes
{ my ($self, $synteny_region_id, $analysis_data_id) = @_;
my $analysis_data_adaptor = $self->{hiveDBA}->get_AnalysisDataAdaptor();
my @parameters = split (" ",$analysis_data_adaptor->fetch_by_dbID($analysis_data_id));
if (scalar(@parameters) == 0) {
print "No 2x genomes to load\n" if $self->debug;
return;
}
my $genome_db_adaptor = $self->{'comparaDBA'}->get_GenomeDBAdaptor;
my $ref_genome_db = $genome_db_adaptor->fetch_by_name_assembly($self->reference_species);
my $ref_dba = $ref_genome_db->db_adaptor;
my $ref_slice_adaptor = $ref_dba->get_SliceAdaptor();
my $ref_dnafrags =[];
foreach my $dnafrag_region (@{$self->dnafrag_regions}) {
if ($dnafrag_region->genome_db->dbID == $ref_genome_db->dbID) {
push @$ref_dnafrags, $dnafrag_region;
}
}
if (scalar(@$ref_dnafrags) == 0) {
print "No " . $self->reference_species . " sequences found in syntenic block $synteny_region_id\n";
return;
}
print "Synteny region $synteny_region_id num copies " . scalar(@$ref_dnafrags) . "\n" if $self->debug;
foreach my $params (@parameters) {
my $param = eval($params);
my $target_species;
my $compara_db_url = $param->{'compara_db_url'};
my $compara_dba;
my $locator;
if ($compara_db_url =~ /mysql:\/\/.*@.*\/.+/) {
$locator = "Bio::EnsEMBL::Compara::DBSQL::DBAdaptor/url=>$compara_db_url";
} else {
throw "Invalid url $compara_db_url. Should be of the form: mysql://user:pass\@host:port/db_name\n";
}
$compara_dba = Bio::EnsEMBL::DBLoader->new($locator);
$self->{pairwise_compara_dba}->{$compara_dba->dbc->dbname} = $compara_dba;
my $gaba = $compara_dba->get_GenomicAlignBlockAdaptor;
my $p_mlss_adaptor = $compara_dba->get_MethodLinkSpeciesSetAdaptor;
my $pairwise_mlss = $p_mlss_adaptor->fetch_by_dbID($param->{'method_link_species_set_id'});
my $species_set = $pairwise_mlss->species_set;
foreach my $genome_db (@$species_set) {
if ($genome_db->name ne $self->reference_species) {
$target_species = $genome_db->name;
last;
}
}
my $target_genome_db = $genome_db_adaptor->fetch_by_name_assembly($target_species);
my $target_dba = $target_genome_db->db_adaptor;
my $target_slice_adaptor = $target_dba->get_SliceAdaptor();
my $ga_frag_array;
if ($create_block_frag_array) {
$ga_frag_array = $self->_create_block_frag_array($gaba, $ref_slice_adaptor, $pairwise_mlss, $ref_dnafrags);
} else {
$ga_frag_array = $self->_create_span_frag_array($gaba, $ref_slice_adaptor, $pairwise_mlss, $ref_dnafrags);
}
next if (!defined $ga_frag_array);
for (my $i = 0; $i < scalar(@$ga_frag_array); $i++) {
@{$ga_frag_array->[$i]} = sort {$a->{dnafrag_region}->dnafrag_start <=> $b->{dnafrag_region}->dnafrag_start} @{$ga_frag_array->[$i]};
}
my $sum_lengths;
for (my $i = 0; $i < scalar(@$ga_frag_array); $i++) {
for (my $j = 0; $j < scalar(@{$ga_frag_array->[$i]}); $j++) {
$sum_lengths->[$i] += ($ga_frag_array->[$i][$j]->{dnafrag_region}->dnafrag_end - $ga_frag_array->[$i][$j]->{dnafrag_region}->dnafrag_start + 1);
}
}
my $found_overlap;
my $j = 0;
if (@$ga_frag_array == 1) {
my $cluster;
$cluster = _add_to_cluster($cluster, 0);
_print_cluster($cluster) if $self->debug;
my $longest_ref_region = 0;
print "SIMPLE CASE: longest_region $longest_ref_region length " . $sum_lengths->[$longest_ref_region] . "\n" if $self->debug;
_build_2x_composite_seq($self, $compara_dba, $ref_slice_adaptor, $target_slice_adaptor, $ga_frag_array->[$longest_ref_region]);
push @{$self->{ga_frag}}, $ga_frag_array->[$longest_ref_region];
push @{$self->{'2x_dnafrag_region'}}, $ga_frag_array->[$longest_ref_region]->[0]->{dnafrag_region};
next;
}
for (my $region1 = 0; $region1 < scalar(@$ga_frag_array)-1; $region1++) {
for (my $region2 = $region1+1; $region2 < scalar(@$ga_frag_array); $region2++) {
if (!defined $found_overlap->{$region1}{$region2}) {
$found_overlap->{$region1}{$region2} = 0;
}
for (my $j = 0; ($j < @{$ga_frag_array->[$region1]}); $j++) {
last if ($found_overlap->{$region1}{$region2});
for (my $k = 0; ($k < @{$ga_frag_array->[$region2]}); $k++) {
last if ($found_overlap->{$region1}{$region2});
if ($ga_frag_array->[$region1][$j]->{seq_region_name} eq $ga_frag_array->[$region2][$k]->{seq_region_name}) {
if (($ga_frag_array->[$region1][$j]->{start} <= $ga_frag_array->[$region2][$k]->{end}) && ($ga_frag_array->[$region1][$j]->{end} >= $ga_frag_array->[$region2][$k]->{start})) {
$found_overlap->{$region1}{$region2} = 1;
print "found overlap $region1 $region2\n" if $self->debug;
last;
}
}
}
}
}
}
my $cluster = $self->_cluster_regions($found_overlap);
_print_cluster($cluster) if $self->debug;
my $longest_regions = $self->_find_longest_region_in_cluster($cluster, $sum_lengths);
my $slice_array;
foreach my $longest_ref_region (@$longest_regions) {
print "longest_ref_region $longest_ref_region length " . $sum_lengths->[$longest_ref_region] . "\n" if $self->debug;
_build_2x_composite_seq($self, $compara_dba, $ref_slice_adaptor, $target_slice_adaptor, $ga_frag_array->[$longest_ref_region]);
push @{$self->{ga_frag}}, $ga_frag_array->[$longest_ref_region];
push @{$self->{'2x_dnafrag_region'}}, $ga_frag_array->[$longest_ref_region]->[0]->{dnafrag_region};
}
} } |
sub _load_DnaFragRegions
{ my ($self, $synteny_region_id) = @_;
my $dnafrag_regions = [];
return $dnafrag_regions if (!$synteny_region_id);
my $sra = $self->{'comparaDBA'}->get_SyntenyRegionAdaptor;
my $sr = $sra->fetch_by_dbID($self->synteny_region_id);
foreach my $dfr (@{$sr->children}) {
$dfr->disavow_parent;
push(@{$dnafrag_regions}, $dfr);
}
$sr->release_tree;
$self->dnafrag_regions($dnafrag_regions); } |
sub _merge_clusters
{ my ($cluster, $index1, $index2) = @_;
if ($index1 != -1 && $index1 == $index2) {
return $cluster;
}
foreach my $region (keys %{$cluster->[$index2]}) {
$cluster->[$index1]->{$region} = 1;
}
splice(@$cluster, $index2, 1);
return $cluster; } |
sub _print_cluster
{ my ($cluster) = @_;
print "FINAL cluster ";
foreach my $this_cluster (@$cluster) {
print "(";
foreach my $group (keys %{$this_cluster}) {
print "$group ";
}
print "), ";
}
print "\n";
}
} |
sub _trim_gab_left
{ my ($gab) = @_;
if (!defined($gab)) {
return undef;
}
my $align_length = $gab->length;
my $gas = $gab->get_all_GenomicAligns();
my $d_length;
my $m_length;
my $min_d_length = $align_length;
my $found_min = 0;
foreach my $ga (@$gas) {
my ($cigLength, $cigType) = ( $ga->cigar_line =~ /^(\d*)([GMD])/ );
$cigLength = 1 unless ($cigLength =~ /^\d+$/);
if ($cigType eq "D" or $cigType eq "G") {
$d_length = $cigLength;
if ($d_length < $min_d_length) {
$min_d_length = $d_length;
}
} else {
$m_length = $cigLength;
$found_min++;
}
}
if ($found_min > 1) {
return $gab;
}
my $new_gab = ($gab->restrict_between_alignment_positions(
$min_d_length+1, $align_length, 1));
if ($new_gab->length == 0) {
return $new_gab;
}
if ($min_d_length < $m_length) {
return $new_gab;
}
return _trim_gab_left($new_gab);
}
} |
sub _trim_gab_right
{ my ($gab) = @_;
if (!defined($gab)) {
return undef;
}
my $align_length = $gab->length;
my $max_pos = 0;
my $gas = $gab->get_all_GenomicAligns();
my $found_max = 0;
my $d_length;
my $m_length;
my $min_d_length = $align_length;
foreach my $ga (@$gas) {
my ($cigLength, $cigType) = ( $ga->cigar_line =~ /(\d*)([GMD])$/ );
$cigLength = 1 unless ($cigLength =~ /^\d+$/);
if ($cigType eq "D" or $cigType eq "G") {
$d_length =$cigLength;
if ($d_length < $min_d_length) {
$min_d_length = $d_length;
}
} else {
$m_length = $cigLength;
$found_max++;
}
}
if ($found_max > 1) {
return $gab;
}
my $new_gab = $gab->restrict_between_alignment_positions(1, $align_length - $min_d_length, 1);
if ($new_gab->length == 0) {
return $new_gab;
}
if ($min_d_length < $m_length) {
return $new_gab;
}
return _trim_gab_right($new_gab); } |
sub _update_tree
{ my $self = shift;
my $tree = shift;
my $all_dnafrag_regions = $self->dnafrag_regions();
my $ordered_dnafrag_regions = [];
my $ordered_2x_genomes = [];
my $idx = 1;
my $all_leaves = $tree->get_all_leaves;
foreach my $this_leaf (@$all_leaves) {
my $these_dnafrag_regions = [];
my $these_2x_genomes = [];
foreach my $this_dnafrag_region (@$all_dnafrag_regions) {
if ($this_dnafrag_region->dnafrag->genome_db_id == $this_leaf->name) {
push (@$these_dnafrag_regions, $this_dnafrag_region);
}
}
my $index = 0;
foreach my $ga_frags (@{$self->{ga_frag}}) {
my $first_frag = $ga_frags->[0];
if ($first_frag->{genome_db_id} == $this_leaf->name) {
push(@$these_2x_genomes, $index);
}
$index++;
}
print "num " . @$these_dnafrag_regions . " " . @$these_2x_genomes . "\n" if $self->debug;
if (@$these_dnafrag_regions == 1) {
print "seq$idx genome_db_id=" . $these_dnafrag_regions->[0]->dnafrag->genome_db_id . "\n" if $self->debug;
$this_leaf->name("seq".$idx++);
push(@$ordered_dnafrag_regions, $these_dnafrag_regions->[0]);
} elsif (@$these_dnafrag_regions > 1) {
my $dfa = $self->dnafrag_regions;
foreach my $ga_frags (@{$self->{ga_frag}}) {
push @$dfa, $ga_frags;
}
$self->dnafrag_regions($dfa);
return undef;
} elsif (@$these_2x_genomes == 1) {
my $ga_frags = $self->{ga_frag}->[$these_2x_genomes->[0]];
print "number of frags " . @$ga_frags . "\n" if $self->debug;
print "2x seq$idx " . $ga_frags->[0]->{genome_db_id} . "\n" if $self->debug;
$this_leaf->name("seq".$idx++);
push(@$ordered_dnafrag_regions, $ga_frags);
} else {
$this_leaf->disavow_parent;
$tree = $tree->minimize_tree;
}
}
$self->dnafrag_regions($ordered_dnafrag_regions);
$self->{ordered_2x_genomes} = $ordered_2x_genomes;
if ($tree->get_child_count == 1) {
my $child = $tree->children->[0];
$child->parent->merge_children($child);
$child->disavow_parent;
}
return $tree;
}
} |
sub _write_gerp_dataflow
{ my ($self, $gab_id, $mlss) = @_;
my $species_set = "[";
my $genome_db_set = $mlss->species_set;
foreach my $genome_db (@$genome_db_set) {
$species_set .= $genome_db->dbID . ",";
}
$species_set .= "]";
my $output_id = "{genomic_align_block_id=>" . $gab_id . ",species_set=>" . $species_set . "}";
$self->dataflow_output_id($output_id);
}
} |
sub dnafrag_regions
{ my $self = shift;
$self->{'_dnafrag_regions'} = shift if(@_);
return $self->{'_dnafrag_regions'}; } |
sub fasta_files
{ my $self = shift;
$self->{'_fasta_files'} = [] unless (defined $self->{'_fasta_files'});
if (@_) {
my $value = shift;
push @{$self->{'_fasta_files'}}, $value;
}
return $self->{'_fasta_files'}; } |
sub fetch_input
{ my( $self) = @_;
$self->{'comparaDBA'} = Bio::EnsEMBL::Compara::Production::DBSQL::DBAdaptor->new(-DBCONN=>$self->db->dbc);
$self->{'hiveDBA'} = Bio::EnsEMBL::Hive::DBSQL::DBAdaptor->new(-DBCONN => $self->{'comparaDBA'}->dbc);
$self->get_params($self->parameters);
$self->get_params($self->input_id);
if (!$self->method_link_species_set_id) {
throw("MethodLinkSpeciesSet->dbID is not defined for this Pecan job");
}
$self->_load_DnaFragRegions($self->synteny_region_id);
if ($self->dnafrag_regions) {
$self->_load_2XGenomes($self->synteny_region_id, $self->{_pairwise_analysis_data_id});
if ($self->get_species_tree and $self->dnafrag_regions) {
$self->_build_tree_string;
print "seq_string ", $self->tree_string , "\n";
}
$self->_dump_fasta;
} else {
throw("Cannot start Pecan job because some information is missing");
}
return 1; } |
sub get_params
{ my $self = shift;
my $param_string = shift;
return unless($param_string);
print("parsing parameter string : ",$param_string,"\n");
my $params = eval($param_string);
return unless($params);
if(defined($params->{'synteny_region_id'})) {
$self->synteny_region_id($params->{'synteny_region_id'});
}
if(defined($params->{'method_link_species_set_id'})) {
$self->method_link_species_set_id($params->{'method_link_species_set_id'});
}
if(defined($params->{'java_options'})) {
$self->{_java_options} = $params->{'java_options'};
}
if(defined($params->{'options'})) {
$self->{_options} = $params->{'options'};
}
if(defined($params->{'tree_file'})) {
$self->{_tree_file} = $params->{'tree_file'};
}
if(defined($params->{'tree_analysis_data_id'})) {
$self->{_tree_analysis_data_id} = $params->{'tree_analysis_data_id'};
}
if(defined($params->{'pairwise_analysis_data_id'})) {
$self->{_pairwise_analysis_data_id} = $params->{'pairwise_analysis_data_id'};
}
if(defined($params->{'reference_species'})) {
$self->{_reference_species} = $params->{'reference_species'};
}
if(defined($params->{'max_block_size'})) {
$self->{_max_block_size} = $params->{'max_block_size'};
}
return 1; } |
sub get_species_tree
{ my $self = shift;
my $newick_species_tree;
if (defined($self->{_species_tree})) {
return $self->{_species_tree};
} elsif ($self->{_tree_analysis_data_id}) {
my $analysis_data_adaptor = $self->{hiveDBA}->get_AnalysisDataAdaptor();
$newick_species_tree = $analysis_data_adaptor->fetch_by_dbID($self->{_tree_analysis_data_id});
} elsif ($self->{_tree_file}) {
open(TREE_FILE, $self->{_tree_file}) or throw("Cannot open file ".$self->{_tree_file});
$newick_species_tree = join("", <TREE_FILE>);
close(TREE_FILE);
}
if (!defined($newick_species_tree)) {
return undef;
}
$newick_species_tree =~ s/^\s*//;
$newick_species_tree =~ s/\s*$//;
$newick_species_tree =~ s/[\r\n]//g;
$self->{'_species_tree'} =
Bio::EnsEMBL::Compara::Graph::NewickParser::parse_newick_into_tree($newick_species_tree);
return $self->{'_species_tree'}; } |
sub max_block_size
{ my $self = shift;
$self->{'_max_block_size'} = shift if(@_);
return $self->{'_max_block_size'}; } |
sub method_link_species_set_id
{ my $self = shift;
$self->{'_method_link_species_set_id'} = shift if(@_);
return $self->{'_method_link_species_set_id'}; } |
sub options
{ my $self = shift;
$self->{'_options'} = shift if(@_);
return $self->{'_options'}; } |
sub parse_results
{ my ($self) = @_;
my ($self, $run_number) = @_;
my $tree_file;
my $workdir;
if ($do_hack) {
$workdir = "/lustre/work1/ensembl/kb3/hive/tests/test_ortheus/job_1087";
$tree_file = $workdir . "/output.$$.tree";
} else {
$tree_file = $self->worker_temp_directory . "/output.$$.tree";
}
my $ordered_fasta_files = $self->fasta_files;
if (-e $tree_file) {
open(F, $tree_file) || throw("Could not open tree file <$tree_file>");
my ($newick, $files) = <F>;
close(F);
$newick =~ s/[\r\n]+$//;
$self->tree_string($newick);
$files =~ s/[\r\n]+$//;
my $all_files = [split(" ", $files)];
$ordered_fasta_files = $all_files;
$self->fasta_files(@$all_files);
print STDERR "**NEWICK: $newick\nFILES: ", join(" -- ", @$all_files), "\n";
}
my (@ordered_leaves) = $self->tree_string =~ /[(,]([^(:)]+)/g;
print "++NEWICK: ", $self->tree_string, "\nLEAVES: ", join(" -- ", @ordered_leaves), "\nFILES: ", join(" -- ", @{$self->fasta_files}), "\n";
my $alignment_file;
if ($do_hack) {
$alignment_file = $workdir . "/output.6525.mfa";
} else {
$alignment_file = $self->worker_temp_directory . "/output.$$.mfa";
}
my $this_genomic_align_block = new Bio::EnsEMBL::Compara::GenomicAlignBlock;
open(F, $alignment_file) || throw("Could not open $alignment_file");
my $seq = "";
my $this_genomic_align;
my $genomic_align_group;
my $tree = Bio::EnsEMBL::Compara::Graph::NewickParser::parse_newick_into_tree($self->tree_string);
$tree->print_tree(100);
print $tree->newick_format("simple"), "\n";
print join(" -- ", map {$_->name} @{$tree->get_all_leaves}), "\n";
print "Reading $alignment_file...\n";
my $ids;
foreach my $this_file (@$ordered_fasta_files) {
push(@$ids, qx""head -1 $this_file");
push(@$ids, undef); ## There is an internal node after each leaf..
}
pop(@$ids); ## ...except for the last leaf which is the end of the tree
#print join(" :: ", @$ids), "\n\n";
my $genomic_aligns_2x_array = [];
my @num_frag_pads;
while (<F>) {
next if (/^\s*$/);
chomp;
## FASTA headers correspond to the tree and the order of the leaves in the tree corresponds
## to the order of the files
if (/^>/) {
print "PARSING $_\n" if ($self->debug);
print $tree->newick_format(), "\n" if ($self->debug);
my ($name) = $_ =~ /^>(.+)/;
if (defined($this_genomic_align) and $seq) {
if (@$genomic_aligns_2x_array) {
print "*****FOUND 2x seq " . length($seq) . "\n" if ($self->debug);
#starting offset
#my $offset = $max_pads;
my $offset = $num_frag_pads[0];
#how many X's to add at the start of the cigar_line
my $start_X;
#how many X's to add to the end of the cigar_line
my $end_X;
my $align_offset = 0;
for (my $i = 0; $i < @$genomic_aligns_2x_array; $i++) {
my $genomic_align = $genomic_aligns_2x_array->[$i];
my $num_pads = $num_frag_pads[$i+1];
my $ga_length = $genomic_align->dnafrag_end-$genomic_align->dnafrag_start+1;
print "extract_sequence $offset " .($offset+$ga_length) . " num pads $num_pads\n" if ($self->debug);
my ($subseq, $aligned_start, $aligned_end) = _extract_sequence($seq, $offset+1, ($offset+$ga_length));
#Add aligned sequence
$genomic_align->aligned_sequence($subseq);
#Add X padding characters to ends of seq
$start_X = $aligned_start;
$end_X = length($seq) - ($start_X+length($subseq));
print "start_X $start_X end_X $end_X subseq_length " . length($subseq) . "\n" if ($self->debug);
$genomic_align->cigar_line($start_X . "X" .$genomic_align->cigar_line . $end_X . "X");
#my $aln_seq = "." x $start_X;
#$aln_seq .= $genomic_align->aligned_sequence();
#$aln_seq .= "." x $end_X;
#$genomic_align->aligned_sequence($aln_seq);
#free aligned_sequence now that I've used it to
#create the cigar_line
undef($genomic_align->{'aligned_sequence'});
#Add genomic align to genomic align block
$this_genomic_align_block->add_GenomicAlign($genomic_align);
$offset += $num_pads + $ga_length;
}
$genomic_aligns_2x_array = [];
undef @num_frag_pads;
} else {
print "add aligned_sequence " . $this_genomic_align->dnafrag_id . " " . $this_genomic_align->dnafrag_start . " " . $this_genomic_align->dnafrag_end . "\n" if $self->debug;
$this_genomic_align->aligned_sequence($seq);
$this_genomic_align_block->add_GenomicAlign($this_genomic_align);
}
}
my $header = shift(@$ids);
$this_genomic_align = new Bio::EnsEMBL::Compara::GenomicAlign;
if (!defined($header)) {
print "INTERNAL NODE $name\n" if ($self->debug);
my $this_node;
foreach my $this_leaf_name (split("_", $name)) {
if ($this_node) {
my $other_node = $tree->find_node_by_name($this_leaf_name);
if (!$other_node) {
throw("Cannot find node <$this_leaf_name>\n");
}
$this_node = $this_node->find_first_shared_ancestor($other_node);
} else {
print $tree->newick_format() if ($self->debug);
print "LEAF: $this_leaf_name\n" if ($self->debug);
$this_node = $tree->find_node_by_name($this_leaf_name);
}
}
print join("_", map {$_->name} @{$this_node->get_all_leaves}), "\n" if ($self->debug);
## INTERNAL NODE: dnafrag_id and dnafrag_end must be edited somewhere else
$this_genomic_align->dnafrag_id(-1);
$this_genomic_align->dnafrag_start(1);
$this_genomic_align->dnafrag_end(0);
$this_genomic_align->dnafrag_strand(1);
bless($this_node, "Bio::EnsEMBL::Compara::GenomicAlignTree");
#$this_node->genomic_align($this_genomic_align);
$genomic_align_group = new Bio::EnsEMBL::Compara::GenomicAlignGroup(
# -genomic_align_array => [$this_genomic_align],
-type => "epo");
$genomic_align_group->add_GenomicAlign($this_genomic_align);
$this_node->genomic_align_group($genomic_align_group);
$this_node->name($name);
#} elsif ($header =~ /^>DnaFrag(\d+)\|(.+)\.(\d+)\-(\d+)\:(\-?1)$/) {
} elsif ($header =~ /^>SeqID(\d+)/) {
#print "old $name\n";
print "leaf_name?? $name\n" if ($self->debug);
my $this_leaf = $tree->find_node_by_name($name);
if (!$this_leaf) {
print $tree->newick_format(), " ****\n" if ($self->debug);
die "";
}
#print "$this_leaf\n";
# print "****** $name -- $header -- ";
# if ($this_leaf) {
# $this_leaf->print_node();
# } else {
# print "[none]\n";
# }
#information extracted from fasta header
my $seq_id = ($1);
my $all_dnafrag_regions = $self->dnafrag_regions;
my $dfr = $all_dnafrag_regions->[$seq_id-1];
if (!UNIVERSAL::isa($dfr, 'Bio::EnsEMBL::Compara::DnaFragRegion')) {
print "FOUND 2X GENOME\n" if $self->debug;
print "num of frags " . @$dfr . "\n" if $self->debug;
#first pads
push @num_frag_pads, $dfr->[0]->{first_pads};
#create new genomic_align for each pairwise fragment
foreach my $ga_frag (@$dfr) {
my $genomic_align = new Bio::EnsEMBL::Compara::GenomicAlign;
print "2x dnafrag_id " . $ga_frag->{dnafrag_region}->dnafrag_id . "\n" if $self->debug;
$genomic_align->dnafrag_id($ga_frag->{dnafrag_region}->dnafrag_id);
$genomic_align->dnafrag_start($ga_frag->{dnafrag_region}->dnafrag_start);
$genomic_align->dnafrag_end($ga_frag->{dnafrag_region}->dnafrag_end);
$genomic_align->dnafrag_strand($ga_frag->{dnafrag_region}->dnafrag_strand);
push @num_frag_pads, $ga_frag->{num_pads};
push @$genomic_aligns_2x_array, $genomic_align;
}
#Add genomic align to genomic align group
$genomic_align_group = new Bio::EnsEMBL::Compara::GenomicAlignGroup(
#-genomic_align_array => $genomic_aligns_2x_array,
-type => "epo");
foreach my $this_genomic_align (@$genomic_aligns_2x_array) {
$genomic_align_group->add_GenomicAlign($this_genomic_align);
}
bless($this_leaf, "Bio::EnsEMBL::Compara::GenomicAlignTree");
$this_leaf->genomic_align_group($genomic_align_group);
print "size of array " . @$genomic_aligns_2x_array . "\n" if $self->debug;
print "store gag1 $this_leaf\n" if $self->debug;
#$self->{$this_leaf} = $genomic_align_group;
} else {
print "normal dnafrag_id " . $dfr->dnafrag_id . "\n" if $self->debug;
$this_genomic_align->dnafrag_id($dfr->dnafrag_id);
$this_genomic_align->dnafrag_start($dfr->dnafrag_start);
$this_genomic_align->dnafrag_end($dfr->dnafrag_end);
$this_genomic_align->dnafrag_strand($dfr->dnafrag_strand);
$genomic_align_group = new Bio::EnsEMBL::Compara::GenomicAlignGroup(
#-genomic_align_array => [$this_genomic_align],
-type => "epo");
$genomic_align_group->add_GenomicAlign($this_genomic_align);
bless($this_leaf, "Bio::EnsEMBL::Compara::GenomicAlignTree");
$this_leaf->genomic_align_group($genomic_align_group);
print "store gag2 $this_leaf\n" if $self->debug;
}
} else {
throw("Error while parsing the FASTA header. It must start by\" >DnaFrag }
$seq = "";
} else {
$seq .= $_;
}
}
close F;
print "Last genomic align\n" if ($self->debug);
if (@$genomic_aligns_2x_array) {
print "*****FOUND 2x seq " . length($seq) . "\n" if ($self->debug);
my $offset = $num_frag_pads[0];
my ($start_X , $end_X);
my $align_offset = 0;
for (my $i = 0; $i < @$genomic_aligns_2x_array; $i++) {
my $genomic_align = $genomic_aligns_2x_array->[$i];
my $num_pads = $num_frag_pads[$i+1];
my $ga_length = $genomic_align->dnafrag_end-$genomic_align->dnafrag_start+1;
print "extract_sequence $offset " .($offset+$ga_length) . " num pads $num_pads\n" if ($self->debug);
my ($subseq, $aligned_start, $aligned_end) = _extract_sequence($seq, $offset+1, ($offset+$ga_length));
$genomic_align->aligned_sequence($subseq);
$start_X = $aligned_start;
$end_X = length($seq) - ($start_X+length($subseq));
print "start_X $start_X end_X $end_X subseq_length " . length($subseq) . "\n" if ($self->debug);
$genomic_align->cigar_line($start_X . "X" .$genomic_align->cigar_line . $end_X . "X");
my $aln_seq = "." x $start_X;
$aln_seq .= $genomic_align->aligned_sequence();
$aln_seq .= "." x $end_X;
$genomic_align->aligned_sequence($aln_seq);
$this_genomic_align_block->add_GenomicAlign($genomic_align);
$offset += $num_pads + $ga_length;
}
} else {
if ($this_genomic_align->dnafrag_id == -1) {
} else {
$this_genomic_align->aligned_sequence($seq);
$this_genomic_align_block->add_GenomicAlign($this_genomic_align);
}
}
$self->remove_empty_cols($tree);
print $tree->newick_format("simple"), "\n";
print join(" -- ", map {$_."+".$_->node_id."+".$_->name} (@{$tree->get_all_nodes()})), "\n";
$self->{'_runnable'}->output([$tree]);
} |
sub reference_species
{ my $self = shift;
$self->{'_reference_species'} = shift if(@_);
return $self->{'_reference_species'};
}
} |
sub remove_empty_cols
{ my ($self, $tree) = @_;
my $gaa = $self->{'comparaDBA'}->get_GenomicAlignAdaptor;
my $seqs = {}; foreach my $this_leaf (@{$tree->get_all_leaves}) {
foreach my $this_genomic_align (@{$this_leaf->genomic_align_group->get_all_GenomicAligns}) {
my $cigar_line = $this_genomic_align->cigar_line;
my $pos = 1; foreach my $cig_elem (grep {$_} split(/(\d*[DMIGX])/, $cigar_line)) {
my ($num, $mode) = $cig_elem =~ /(\d*)([DMIGX])/;
$num = 1 if ($num eq "");
if ($mode eq "M" or $mode eq "I") {
my $start = $pos;
my $end = $pos + $num - 1;
unless (exists($seqs->{$start}) and $seqs->{$start} >= $end) {
$seqs->{$start} = $end;
}
}
$pos += $num;
}
}
}
my $last_start_pos = 0;
my $last_end_pos = 0;
my $gaps = {};
foreach my $start_pos (sort {$a <=> $b} keys %$seqs) {
my $end_pos = $seqs->{$start_pos};
print " $start_pos -> $end_pos\n" if $self->debug;
if ($end_pos <= $last_end_pos) {
print " XXX\n" if $self->debug;
next;
} elsif ($start_pos <= $last_end_pos + 1) {
$last_end_pos = $end_pos;
print " ---> $end_pos\n" if $self->debug;
} else {
$gaps->{$last_end_pos + 1} = $start_pos - 1 if ($last_end_pos);
print " ---> GAP (" . ($last_end_pos + 1) . "-" . ($start_pos - 1) . ")\n" if $self->debug;
$last_start_pos = $start_pos;
$last_end_pos = $end_pos;
}
}
foreach my $this_leaf (@{$tree->get_all_nodes}) {
foreach my $this_genomic_align (@{$this_leaf->genomic_align_group->get_all_GenomicAligns}) {
if (!defined $this_genomic_align->{'adaptor'}) {
$this_genomic_align->adaptor($gaa);
}
my $aligned_sequence = $this_genomic_align->aligned_sequence;
print "before cigar " . $this_genomic_align->cigar_line . "\n" if $self->debug;
foreach my $start_pos (sort {$b <=> $a} keys %$gaps) { my $end_pos = $gaps->{$start_pos};
substr($aligned_sequence, $start_pos - 1, ($end_pos - $start_pos + 1), "");
}
$this_genomic_align->{cigar_line} = undef;
$this_genomic_align->aligned_sequence($aligned_sequence);
print "after cigar " . $this_genomic_align->cigar_line . "\n" if $self->debug;
}
}
}
} |
sub run
{
my $self = shift;
my $runnable = new Bio::EnsEMBL::Analysis::Runnable::Ortheus(
-workdir => $self->worker_temp_directory,
-fasta_files => $self->fasta_files,
-tree_string => $self->tree_string,
-species_tree => $self->get_species_tree->newick_simple_format,
-species_order => $self->species_order,
-analysis => $self->analysis,
-parameters => $self->{_java_options},
-options => $self->options,
);
$self->{'_runnable'} = $runnable;
if ($self->{pairwise_compara_dba}) {
foreach my $dba (values %{$self->{pairwise_compara_dba}}) {
$dba->dbc->disconnect_if_idle;
}
}
my $ancestor_genome_db = $self->{'comparaDBA'}->get_GenomeDBAdaptor->fetch_by_name_assembly("Ancestral sequences");
my $ancestor_dba = $ancestor_genome_db->db_adaptor;
$ancestor_dba->dbc->disconnect_if_idle;
$self->{'comparaDBA'}->dbc->disconnect_if_idle;
if (!$do_hack) {
$runnable->run_analysis;
}
$self->parse_results(); } |
sub species_order
{ my $self = shift;
$self->{'_species_order'} = [] unless (defined $self->{'_species_order'});
if (@_) {
my $value = shift;
push @{$self->{'_species_order'}}, $value;
}
return $self->{'_species_order'}; } |
sub synteny_region_id
{ my $self = shift;
$self->{'_synteny_region_id'} = shift if(@_);
return $self->{'_synteny_region_id'}; } |
sub tree_string
{ my $self = shift;
$self->{'_tree_string'} = shift if(@_);
return $self->{'_tree_string'}; } |
sub write_output
{ my ($self) = @_;
print "WRITE OUTPUT\n" if $self->debug;
if ($self->{'_runnable'}->{tree_to_save}) {
my $meta_container = $self->{'comparaDBA'}->get_MetaContainer;
$meta_container->store_key_value("synteny_region_tree_".$self->synteny_region_id,
$self->{'_runnable'}->{tree_to_save});
}
my $mlssa = $self->{'comparaDBA'}->get_MethodLinkSpeciesSetAdaptor;
my $mlss = $mlssa->fetch_by_dbID($self->method_link_species_set_id);
my $dnafrag_adaptor = $self->{'comparaDBA'}->get_DnaFragAdaptor;
my $gaba = $self->{'comparaDBA'}->get_GenomicAlignBlockAdaptor;
my $gaa = $self->{'comparaDBA'}->get_GenomicAlignAdaptor;
my $gaga = $self->{'comparaDBA'}->get_GenomicAlignGroupAdaptor;
my $gata = $self->{'comparaDBA'}->get_GenomicAlignTreeAdaptor;
my $ancestor_genome_db = $self->{'comparaDBA'}->get_GenomeDBAdaptor->fetch_by_name_assembly("Ancestral sequences");
my $ancestor_dba = $ancestor_genome_db->db_adaptor;
my $slice_adaptor = $ancestor_dba->get_SliceAdaptor();
my $ancestor_coord_system_adaptor = $ancestor_dba->get_CoordSystemAdaptor();
my $ancestor_coord_system;
eval{
$ancestor_coord_system = $ancestor_coord_system_adaptor->fetch_by_name("ancestralsegment");
};
if(!$ancestor_coord_system){
$ancestor_coord_system = new Bio::EnsEMBL::CoordSystem(
-name => "ancestralsegment",
-VERSION => undef,
-DEFAULT => 1,
-SEQUENCE_LEVEL => 1,
-RANK => 1
);
$ancestor_coord_system_adaptor->store($ancestor_coord_system);
}
foreach my $genomic_align_tree (@{$self->{'_runnable'}->output}) {
foreach my $genomic_align_node (@{$genomic_align_tree->get_all_nodes}) {
foreach my $genomic_align (@{$genomic_align_node->genomic_align_group->get_all_GenomicAligns}) {
$genomic_align->adaptor($gaa);
$genomic_align->method_link_species_set($mlss);
$genomic_align->level_id(1);
if ($genomic_align->dnafrag_id == -1) {
my $length = length($genomic_align->original_sequence);
$slice_adaptor->dbc->db_handle->do("LOCK TABLES seq_region WRITE, dna WRITE");
my $last_id = $slice_adaptor->dbc->db_handle->selectrow_array("SELECT max(seq_region_id) FROM seq_region");
$last_id++;
my $name = "Ancestor$last_id";
my $slice = new Bio::EnsEMBL::Slice(
-seq_region_name => $name,
-start => 1,
-end => $length,
-seq_region_length => $length,
-strand => 1,
-coord_system => $ancestor_coord_system,
);
$slice_adaptor->store($slice,\$ genomic_align->original_sequence);
$slice_adaptor->dbc->db_handle->do("UNLOCK TABLES");
my $dnafrag = new Bio::EnsEMBL::Compara::DnaFrag(
-name => $name,
-genome_db => $ancestor_genome_db,
-length => $length,
-coord_system_name => "ancestralsegment",
);
$dnafrag_adaptor->store($dnafrag);
$genomic_align->dnafrag_id($dnafrag->dbID);
$genomic_align->dnafrag_end($length);
$genomic_align->dnafrag($dnafrag);
}
}
}
my $split_trees;
if ($self->max_block_size() &&
$genomic_align_tree->length > $self->max_block_size()) {
for (my $start = 1; $start <= $genomic_align_tree->length;
$start += $self->max_block_size()) {
my $end = $start+$self->max_block_size()-1;
if ($end > $genomic_align_tree->length) {
$end = $genomic_align_tree->length;
}
my $new_gat = $genomic_align_tree->restrict_between_alignment_positions($start, $end, "skip_empty_GenomicAligns");
push @$split_trees, $new_gat;
}
$gata->store_group($split_trees);
foreach my $tree (@$split_trees) {
$self->_write_gerp_dataflow($tree->modern_genomic_align_block_id,
$mlss);
}
} else {
$gata->store($genomic_align_tree);
$self->_write_gerp_dataflow(
$genomic_align_tree->modern_genomic_align_block_id,
$mlss);
}
}
chdir("$self->worker_temp_directory");
foreach(glob("*")){
unlink($_);
}
return 1;
}
} |
General documentation
The fetch_input method reads the parameters of the analysis (analysis.parameters) first and then
the input_id of the analysis job (analysis_job.input_id). Both are expected to be string
representing hash references like {key1 => "value1", key2 => "value2"}. Values defined in the
analysis_job.input_id column will overwrite values in the analysis.parameters.
* synteny_region_id (int)
Ortheus will align the segments defined in the SyntenyRegion with this dbID.
* method_link_species_set_id (int)
Ortheus will store alignments with this method_link_species_set_id
* java_options
Options used to run java eg: '-server -Xmx1000M'
* tree_file
FIXME
* tree_analysis_data_id (int)
FIXME
* pairwise_analysis_data_id (int)
Optional. A list of database locations and method_link_species_set_id pairs for the 2X geonome BLASTZ_NET alignments. The database locations should be identified using the url format.ie mysql://user:pass\@host:port/db_name.
* reference_species
Optional. The reference species for the 2X genome BLASTZ_NET alignments
* options
Additional pecan options eg "-p 15"
* max_block_size (int)
If an alignment is longer than this value, it will be split in several blocks in the database. All resulting blocks will share the same genomic_align_group_id.
Javier Herrero
The rest of the documentation details each of the object methods.
Internal methods are usually preceded with a _