Bio::EnsEMBL::Analysis::RunnableDB
UTR_Builder
Toolbar
Summary
Bio::EnsEMBL::Analysis::RunnableDB::UTR_Builder
Package variables
Privates (from "my" definitions)
$totalgenes = 0
Included modules
Inherit
Synopsis
my $utrbuilder_runnable = new Bio::EnsEMBL::Analysis::RunnableDB::UTR_Builder(
-db => $db,
-input_id => $input_id
);
$utrbuilder_runnable->fetch_input();
$utrbuilder_runnable->run();
$utrbuilder_runnable->output();
$utrbuilder_runnable->write_output(); #writes to DB
Description
This is the new version of the UTR-addition procedure.
It combines predictions made from proteins with cDNA alignments to add UTR regions to the
gene models. It can also inlcude ESTs and ditags. It uses code from Coalescer/Consensus to produce
score for the alternative models and chose the best option.
It also includes ("look-for-both") code to correct the phases of the transcripts unless they are "blessed"
and inculdes the option to check for predefined protein/cDNA pairing as a first step,
looking for NM/NPentries in a GeneBank file.
Config files to set-up are
Bio::EnsEMBL::Analysis::Config::GeneBuild::UTR_Builder
Bio::EnsEMBL::Analysis::Config::Databases
Bio::EnsEMBL::Analysis::Config::GeneBuild::TranscriptConsensus (just a copy of the example file)
Bio::EnsEMBL::Analysis::Config::GeneBuild::KillListFilter
Methods
Methods description
Arg [1] : optional Bio::EnsEMBL::DBSQL::DBAdaptor Description: get/set for db storing exonerate alignments of cDNAs Returntype : Bio::EnsEMBL::DBSQL::DBAdaptor Exceptions : Throws if input isn't DBAdaptor or db parameters haven't been defined |
Arg [1] : optional Bio::EnsEMBL::DBSQL::DBAdaptor Description: get/set for db storing exonerate alignments of ditags Returntype : Bio::EnsEMBL::DBSQL::DBAdaptor Exceptions : Throws if input isn't DBAdaptor or db parameters haven't been defined |
Arg [1] : optional Bio::EnsEMBL::DBSQL::DBAdaptor Description: get/set for db storing exonerate alignments of ESTs Returntype : Bio::EnsEMBL::DBSQL::DBAdaptor Exceptions : Throws if input isn't DBAdaptor or db parameters haven't been defined |
Arg [1] : optional Bio::EnsEMBL::DBSQL::DBAdaptor Description: get/set for db to read input genes from Returntype : Bio::EnsEMBL::DBSQL::DBAdaptor Exeptions : Throws if input isn't DBAdaptor or db parameters haven't been defined |
Arg [1] : optional Bio::EnsEMBL::DBSQL::DBAdaptor Description: get/set for db in which UTR modified genes should be stored Returntype : Bio::EnsEMBL::DBSQL::DBAdaptor Exceptions : Throws if input isn't DBAdaptor or db parameters haven't been defined |
Arg [1] : Description: Returntype : |
Arg [1] : Description: Returntype : |
Arg [1] : eg-transcript Arg [2] : first matching exon Arg [3] : (int) UTR bases of first matching exon Arg [4] : last matching exon Arg [5] : (int) UTR bases of first matching exon Description: add up genomic extend of UTR regions Returntype : int (basepairs) total UTR-length, left UTR-length, right UTR-lenght Exceptions : none |
Arg [1] : ref to array of Bio::EnsEMBL::Gene Arg [2] : flag whether these genes are EST genes rather than cDNA genes Description: This method checks the cDNA and EST genes translation is not checked as these genescome without translation Returntype : ref to array of filtered Bio::EnsEMBL::Gene |
Name : get_evidence_set( $logic_name_or_biotype ) Arg : String Func : returns the name of the evidence_set of a genee / PredictionTranscript Returnval: String describing evidence_set_name |
Arg [1] : optional hash ref Description: store the links between NM and NP entries Returntype : has ref |
Arg [1] : ref to array of Bio::EnsEMBL::Gene Description: merges adjacent exons if they are frameshifted; stores component exons as subSeqFeatures of the merged exon Returntype : ref to array of Bio::EnsEMBL::Gene |
Arg [1] : Bio::EnsEMBL::Transcript Arg [2] : Bio::EnsEMBL::Analysis::Tools::Algorithms::GeneCluster Description: Check whether any of the exons of the transcript overlap any of the exons of the cluster. Returntype : boolean, true if overlap exists |
Arg[1] : Bio::EnsEMBL::Transcript Arg[2] : strand of the transcript Description : a transcript is used as evidence for genomewise to recalculate the ORF. The idea is to use this when the peptide has been shortened, due to a genewise model being incompatible with the cdna splicing. This can happen when genewise cannot find very short exons and attaches them to one of the flanking exons. We tell genomewise to keep the splice boundaries pretty much static, so that we preserve the original splicing structure. Returntype : Bio::EnsEMBL::Transcript |
Arg [1] : reference to Bio::EnsEMBL::Tanscript $combined_transcript Arg [2] : reference to Bio::EnsEMBL::Transcript $cdna_transcript Description: transfers cdna evidence to combined transcript Returntype: Bio::EnsEMBL::Transcript |
Arg [1] : Description: $exoncount tells us which position in the array of e2g exons corresponds to the end of the genewise transcript so we add back exons 3 to that position. $exon and $transcript are references to Exon and Transcript objects. Returntype : |
Arg [1] : Description: Returntype : |
Arg [1] : Description: get/set for blessed gene array clones the genes Returntype : |
Arg [1] : genewise transcript Arg [2] : transcript adding UTR region Description: calculate a score favouring transcripts that add UTR on both sides Could also favour longer UTRs if desired Returntype : int score |
Arg [1] : Description: get/set for e2g gene array Returntype : |
Arg [1] : Bio::EnsEMBL::Gene CDS gene Arg [2] : Boolean to indicate blessed gene Description: check whether there is a specific cDNA assigned to the given gene model Returntype : Bio::EnsEMBL::Gene cDNA gene |
Arg [1] : ref to array of Bio::EnsEMBL::Gene Description: Separates CDSs according to strand and then clusters each set by calling Bio::EnsEMBL::Analysis::Tools::Algorithms::ClusterUtils:cluster_Genes Returntype : ref to array of Bio::EnsEMBL::Analysis::Tools::Algorithms::GeneCluster |
Arg [1] : genewise gene Arg [2] : cDNA gene Description: combine gene with matching cDNA Returntype : Bio::EnsEMBL::Transcript |
Arg [1] : ref to Bio::EnsEMBL::Gene Description: get/set for combined gene array Returntype : |
Arg [1] : ref to array of Bio::EnsEMBL::Gene Description: converts transcripts to Bio::EnsEMBL::Analysis::Tools::GeneBuildUtils::TranscriptExtended Returntype : ref to array of TranscriptExtended Exceptions : throws if gene has multiple transcripts |
Arg [1] : none Description: Read GenBank sequence file linking NP proteins to a specific NM cDNA to use as "known" UTR evidence Returns : nothing |
Arg [1] : (optional) ref to array with ditags Description: get/set ditags Returntype : array ref with ditag objects Exceptions : none |
Arg [1] : (optional) ref to array with ests Description: get/set for ests Returntype : array ref with ST objects |
Arg [1] : Description: $exon is the terminal exon in the genewise transcript, $transcript. We need to expand any frameshifts we merged in the terminal genewise exon. The expansion is made by putting $exon to be the last (3 end) component, so we modify its start but not its end. The rest of the components are added. The translation end will have to be modified, this happens in the method _transcript_from_multi_exon.... Returntype : |
Arg [1] : none Description: Get all raw data needed from the databases Returntype : none |
Arg [1] : ref to array of Bio::EnsEMBL::Gene Description: We do not want to add UTRs to fragments of genes otherwise they get boosted in importance during the final gene build and produce rubbish alternative transcripts and in the worst cases can misjoin clusters especially if there is some misassembled/duplicated sequence. One solution is to cluster the genewises at this stage in much the same way as we do during the GeneBuilder run, and attempt to add UTRs ony to the best ones eg fullest genomic extent. Also checks for existing UTR, these genes are just passed on Returntype : ref to array of Bio::EnsEMBL::Gene |
Arg [1] : ref to array of GeneClusters Description: screens UTR modified transcripts for those that potentially misjoin genewise clusters, and remove them Returntype : ref to array of Bio::EnsEMBL::Gene |
Arg [1] : Description: get/set for genewise clusters Returntype : |
Arg [1] : String protein-id Description: Find the cDNA-id a given protein-id is linked to as a "known" evidence Returntype : String cDNA-id |
Arg [1] : Description: get/set for genewise gene array Returntype : Exceptions : Example : |
Arg [1] : optional hash ref with kill list Description: get/set kill list object Returntype : ref to hash |
Arg [1] : Bio:EnsEMBL:Gene Description: a copy of Steve's look_for_both-script, checks phases, etc. Returntype : Bio:EnsEMBL:Gene |
Arg [1] : string representing genetyoe to be associated with genes Arg [2] : array of Bio::EnsEMBL::Transcript Description: Constructs Bio::EnsEMBL::Gene objects from UTR-modified transcripts Returntype : none; new genes are stored in self->combined_genes Exceptions : throws when missing genetype or analysis |
Arg [1] : Bio::EnsEMBL::Gene Arg [2] : int is_est Description: this method tries to find the cdnas that can be merged with the genewise genes. Basically it checks for exact internal splice matching in the 5' and 3' exons of the genewise gene. In order to match a starting genewise exon with an cDNA exon, we need to have a. exactly coinciding exon boundaries b. either cdna exon has start <= genewise exon start, OR cdna exon start lies within $exon_slop bp of genewise exon start AND the cdna transcript will add exra UTR exons. Substitute "end" for "start" for 3prime ends of transcripts BUT do not allow through any cDNA that will result just in a shortened peptide without additional UTR exons. Returntype : ref to array of Bio::EnsEMBL::Gene, ref to hash relating Bio::EnsEMBL::Gene to UTR length Exceptions : |
Arg [1] : Description: get/set for pairs of frameshift merged and unmerged genes. Key is merged gene, value is unmerged
Returntype : |
Arg [1] : Description: get/set for pairs of UTR modified and unmodified genes. Key is modified gene, value is unmodified Returntype : |
Arg [1] : Bio::EnsEMBL::Analysis::RunnableDB::UTR_Builder Function : instatiates object & check config file Returntype: Bio::EnsEMBL::Analysis::RunnableDB::UTR_Builder |
Arg [1] : optional RunnableDB output Description: get/set for result of Analysis Returntype : array ref with genes |
Arg [1] : String mol-type Description: read ids to ignore from KillList db Returntype : ref to hash |
Arg [1] : (optional) bool Description: get/set for option to prune genes |
Arg [1] : ref to array of GeneClusters Description: rejects duplicate CDS, makes sure they are kept on unmatched_genes or they (and their supporting evidence) will be lost for the rest of the build Note: pruning will not be used unless specified otherwise Returntype : ref to array of Bio::EnsEMBL::Gene |
Description: strictly speaking this is no longer remapping anything... checks translation, set start/stop, can set biotype Returntype : ref to array of Bio::EnsEMBL::Gene |
Arg [1] : Bio::EnsEMBL::Gene Description: Returns unmeregd, frameshifted version of input gene Returntype : Bio::EnsEMBL::Gene |
Arg [1] : Description: get/set for genewise clusters Returntype : Exceptions : Example : |
Description: general run method Returntype : none |
Arg [1] : ref to array of Bio::EnsEMBL::Gene Arg [2] : string representing genetype to be associated with UTR-modified genes Arg [3] : bool indicating that we are NOT trying to map to predefined cDNA (blessed genes) Description: main function which matches CDS (genewise) predictions to cDNA alignments considering only terminal CDS exons, looking for "knowns" first, optionally using ESTs and ditags. Returntype : none |
Arg [1] : array ref with mapped ditags to analyse Arg [2] : transcript to analyse Arg [3] : (optional) int index, where to start looking in the ditag-array Description: score ditags in the region of a exon, transcripts or est that might support it Give a higher score to those features that match well to ditags and have a high tag_count. Returntype : int score: high value indicates many/well matching ditags int index: updated array index Exceptions : none |
Arg [1] : Description: This method will actually do the combination of both cdna and genewise gene. Note that if there is a match on one end but not on the other, the code will extend one end, but will leave the other as it is in the genewise genes. This will explit cdna matches that look fine on one end and we disregard the mismatching part. Returntype : |
Arg [1] : Description: Returntype : |
Arg [1] : Description: Returntype : Exceptions : Example : |
Arg [1] : Description: This method will actually do the combination of both cdna and genewise gene. Note that if there is a match on one end but not on the other, the code will extend one end, but will leave the other as it is in the genewise genes. This will explit cdna matches that look fine on one end and we disregard the mismatching part. Returntype : |
Arg [1] : array of genes Description: get/set for unmatched gene array Returntype : none |
Arg [1] : Bio::EnsEMBL::Gene Description: checks start and end coordinates of each exon of each transcript are sane Returntype : 1 if gene is valid, otherwise zero |
Arg [1] : none Description: Do some cleaning up uf the result and store the genes to the database Returntype : none |
Methods code
sub BLESSED_DB
{ my( $self, $blessed_db ) = @_;
if ($blessed_db){
$self->{_blessed_db} = get_db_adaptor_by_string($blessed_db,1);
}
return $self->{_blessed_db}; } |
sub BLESSED_GENETYPES
{ my ($self,$value) = @_;
if (defined $value) {
$self->{_blessed_genetypes} = $value;
}
return $self->{_blessed_genetypes}; } |
sub BLESSED_UTR_GENETYPE
{ my ($self,$value) = @_;
if (defined $value) {
$self->{_blessed_utr_genetype} = $value;
}
return $self->{_blessed_utr_genetype}; } |
sub CDNA_DB
{ my( $self, $cdna_db ) = @_;
if ($cdna_db){
$self->{_cdna_db} = get_db_adaptor_by_string($cdna_db,1);
}
return $self->{_cdna_db}; } |
sub DITAG_DB
{ my( $self, $ditag_db ) = @_;
if ($ditag_db){
$self->{_ditag_db} = get_db_adaptor_by_string($ditag_db,1);
}
if((defined $self->{_ditag_db}) && (!$self->{_ditag_db}) && (scalar @{$self->DITAG_TYPE_NAMES})){
throw("You have defined DITAG_TYPE_NAMES, but no DITAG_DB parameters.\n");
}
return $self->{_ditag_db}; } |
sub DITAG_TYPE_NAMES
{ my ($self,$value) = @_;
if (defined $value) {
$self->{_ditag_type_names} = $value;
}
return $self->{_ditag_type_names}; } |
sub DITAG_WINDOW
{ my ($self,$value) = @_;
if (defined $value) {
$self->{_ditag_window} = $value;
}
return $self->{_ditag_window}; } |
sub EST_DB
{ my( $self, $est_db ) = @_;
if ($est_db){
$self->{_est_db} = get_db_adaptor_by_string($est_db,1);
}
return $self->{_est_db}; } |
sub EST_GENETYPE
{ my ($self,$value) = @_;
if (defined $value) {
$self->{_est_genetype} = $value;
}
return $self->{_est_genetype}; } |
EXTEND_BIOTYPE_OF_UNCHANGED_GENES | description | prev | next | Top |
sub EXTEND_BIOTYPE_OF_UNCHANGED_GENES
{ my ($self,$value) = @_;
if (defined $value) {
$self->{_extend_biotype_of_unchanged_genes} = $value;
}
return $self->{_extend_biotype_of_unchanged_genes}; } |
sub EXTEND_ORIGINAL_BIOTYPE
{ my ($self,$value) = @_;
if (defined $value) {
$self->{_extend_original_biotype} = $value;
}
return $self->{_extend_original_biotype}; } |
sub FILTER_ESTS
{ my ($self,$value) = @_;
if (defined $value) {
$self->{_filter_ests} = $value;
}
return $self->{_filter_ests}; } |
sub INPUT_DB
{ my( $self, $input_db ) = @_;
if ($input_db){
$self->{_input_db} = get_db_adaptor_by_string($input_db,1);
}
if(!$self->{_input_db}){
throw("Please define database parameters for input db.\n");
}
return $self->{_input_db}; } |
sub INPUT_GENETYPES
{ my ($self, $input_genetypes) = @_;
if($input_genetypes){
$self->{'_input_genetypes'} = $input_genetypes;
}
return $self->{'_input_genetypes'}; } |
sub KNOWNUTR_FILE
{ my ($self,$value) = @_;
if (defined $value) {
$self->{_knownutr_file} = $value;
}
return $self->{_knownutr_file};
}
1; } |
sub KNOWN_UTR_GENETYPE
{ my ($self,$value) = @_;
if (defined $value) {
$self->{_known_utr_genetype} = $value;
}
return $self->{_known_utr_genetype}; } |
sub LOOK_FOR_KNOWN
{ my ($self,$value) = @_;
if (defined $value) {
$self->{_look_for_known} = $value;
}
return $self->{_look_for_known}; } |
sub MAX_EXON_LENGTH
{ my ($self,$value) = @_;
if (defined $value) {
$self->{_max_exon_length} = $value;
}
return $self->{_max_exon_length}; } |
sub MAX_INTRON_LENGTH
{ my ($self,$value) = @_;
if (defined $value) {
$self->{_max_intron} = $value;
}
return $self->{_max_intron}; } |
sub OUTPUT_DB
{ my( $self, $output_db ) = @_;
if ($output_db){
$self->{_output_db} = get_db_adaptor_by_string($output_db,1);
}
if(!$self->{_output_db}){
throw("Please define database parameters for output db.\n");
}
return $self->{_output_db};
}
} |
sub PRUNE_GENES
{ my ($self,$value) = @_;
if (defined $value) {
$self->{_prune_genes} = $value;
}
return $self->{_prune_genes}; } |
sub UTR_GENETYPE
{ my ($self,$value) = @_;
if (defined $value) {
$self->{_utr_genetype} = $value;
}
return $self->{_utr_genetype}; } |
sub VERBOSE
{ my ($self,$value) = @_;
if (defined $value) {
$self->{_verbose} = $value;
}
return $self->{_verbose}; } |
sub _cdna_evidence
{ my ($self, $cdna_evidence) = @_;
if($cdna_evidence){
$self->{'_cdna_evidence'} = $cdna_evidence;
}
return $self->{'_cdna_evidence'}; } |
sub _cdna_slice
{ my ($self, $slice) = @_;
if($slice){
$self->{'_cdna_slice'} = $slice;
}
return $self->{'_cdna_slice'}; } |
sub _compute_UTRlength
{ my ($self, $transcript, $left_exon, $left_diff, $right_exon, $right_diff) = @_;
my $strand = $transcript->start_Exon->strand;
my @exons = sort { $a->start <=> $b->start } @{ $transcript->get_all_Exons };
my $UTRlength = 0;
my $in_UTR = 1;
my $start_flag = 0;
my $left_UTR = $left_diff;
my $right_UTR = $right_diff;
foreach my $exon ( @exons ){
if ( defined $left_exon && $exon == $left_exon ){
$UTRlength += $left_diff;
$in_UTR = 0;
}
elsif( defined $right_exon && $exon == $right_exon ){
$UTRlength += $right_diff;
$in_UTR = 1;
}
elsif( $in_UTR == 1 ){
$UTRlength += $exon->length;
if(!$start_flag){
$left_UTR += $UTRlength;
$start_flag = 1;
}
else{
$right_UTR += $UTRlength;
}
}
}
return($UTRlength, $left_UTR, $right_UTR); } |
sub _filter_cdnas
{ my ($self, $cdna_arrayref, $ests) = @_;
my @newcdna;
my %cdna_evidence;
cDNA_GENE:
foreach my $cdna (@{$cdna_arrayref}) {
cDNA_TRANSCRIPT:
foreach my $tran (@{$cdna->get_all_Transcripts}) {
$tran->sort;
if(!$ests){
foreach my $evidence (@{ $tran->get_all_supporting_features }){
my $evidence_name = $evidence->hseqname();
$evidence_name =~ s/\.\S+$//;
$cdna_evidence{$evidence_name} = $cdna;
}
}
next cDNA_TRANSCRIPT unless ( is_Transcript_sane($tran)
&& all_exons_are_valid($tran, $self->MAX_EXON_LENGTH, 1)
&& intron_lengths_all_less_than_maximum($tran, $self->MAX_INTRON_LENGTH)
&& has_no_unwanted_evidence($tran) );
push(@newcdna, $cdna);
}
}
if(!$ests){
$self->_cdna_evidence(\%cdna_evidence);
}
return\@ newcdna; } |
sub _get_evidence_set
{ my ($self, $logic_name_or_biotype) = @_ ;
my %ev_sets = %{ $self->{evidence_sets} } ;
my $result_set_name;
for my $set_name (keys %ev_sets){
my @logic_names = @{$ev_sets{$set_name}} ;
for my $ln (@logic_names ) {
if ($ln eq $logic_name_or_biotype){
$result_set_name = $set_name ;
}
}
}
return $result_set_name;
}
} |
sub _known_pairs
{ my ($self, $hashref) = @_;
if (defined $hashref) {
$self->{_known_pairs} = $hashref;
}
return $self->{_known_pairs};
}
} |
sub _merge_genes
{ my ($self, $genesref) = @_;
my @merged;
my $count = 1;
UNMERGED_GENE:
foreach my $unmerged (@{$genesref}){
my $gene = new Bio::EnsEMBL::Gene;
$gene->dbID($unmerged->dbID);
my @pred_exons;
my $ecount = 0;
my @trans = @{$unmerged->get_all_Transcripts};
if(scalar(@trans) != 1) {
throw("Gene with dbID " . $unmerged->dbID .
" has NO or more than one related transcript, where a 1-gene-to-1-transcript-relation is assumed.".
" Check preceding analysis\n ");
}
$trans[0]->sort;
if( !is_Transcript_sane($trans[0])
|| !all_exons_are_valid($trans[0], $self->MAX_EXON_LENGTH, 1)
|| !has_no_unwanted_evidence($trans[0])
|| !intron_lengths_all_less_than_maximum($trans[0], $self->MAX_INTRON_LENGTH) ){
print STDERR "transcript ".$unmerged->dbID."did NOT pass sanity check!\n";
print STDERR "found at ".$unmerged->seq_region_name.", ".$unmerged->seq_region_start.", ".$unmerged->seq_region_end."\n";
next UNMERGED_GENE;
}
my $cloned_translation = new Bio::EnsEMBL::Translation;
my @unmerged_exons = @{$trans[0]->get_all_Exons};
my $strand = $unmerged_exons[0]->strand;
my $previous_exon;
EXON:
foreach my $exon(@unmerged_exons){
$ecount++;
my $separation = 0;
my $merge_it = 0;
if (defined($previous_exon)){
if ($strand == 1){
$separation = $exon->start - $previous_exon->end - 1;
}
elsif( $strand == -1 ){
$separation = $previous_exon->end - $exon->start - 1;
}
if ($separation <=10){
$merge_it = 1;
}
}
if ( defined($previous_exon) && $merge_it == 1){
if ($strand == 1) {
$previous_exon->end($exon->end);
} else {
$previous_exon->start($exon->start);
}
$previous_exon->add_sub_SeqFeature($exon,'');
if ( $exon == $trans[0]->translation->end_Exon ){
$cloned_translation->end_Exon( $previous_exon );
$cloned_translation->end($trans[0]->translation->end);
}
my %evidence_hash;
foreach my $sf( @{$exon->get_all_supporting_features}){
if ( $evidence_hash{$sf->hseqname}{$sf->hstart}{$sf->hend}{$sf->start}{$sf->end} ){
next;
}
$evidence_hash{$sf->hseqname}{$sf->hstart}{$sf->hend}{$sf->start}{$sf->end} = 1;
$previous_exon->add_supporting_features($sf);
}
next EXON;
}
else{
my $cloned_exon = clone_Exon($exon);
$cloned_exon->add_sub_SeqFeature($exon,'');
if ( $exon == $trans[0]->translation->start_Exon ){
$cloned_translation->start_Exon( $cloned_exon );
$cloned_translation->start($trans[0]->translation->start);
}
if ( $exon == $trans[0]->translation->end_Exon ){
$cloned_translation->end_Exon( $cloned_exon );
$cloned_translation->end($trans[0]->translation->end);
}
$previous_exon = $cloned_exon;
push(@pred_exons, $cloned_exon);
}
}
my $merged_transcript = new Bio::EnsEMBL::Transcript;
$merged_transcript->dbID($trans[0]->dbID);
foreach my $pe(@pred_exons){
$merged_transcript->add_Exon($pe);
}
$merged_transcript->sort;
$merged_transcript->translation($cloned_translation);
my @seqeds = @{$trans[0]->translation->get_all_SeqEdits};
if (scalar(@seqeds)) {
print "Copying sequence edits\n" if $self->VERBOSE;
foreach my $se (@seqeds) {
my $new_se =
Bio::EnsEMBL::SeqEdit->new(
-CODE => $se->code,
-NAME => $se->name,
-DESC => $se->description,
-START => $se->start,
-END => $se->end,
-ALT_SEQ => $se->alt_seq
);
my $attribute = $new_se->get_Attribute();
$cloned_translation->add_Attributes($attribute);
}
}
my @support = @{$trans[0]->get_all_supporting_features};
if (scalar(@support)) {
$merged_transcript->add_supporting_features(@support);
}
$merged_transcript->add_Attributes(@{$trans[0]->get_all_Attributes});
$gene->add_Transcript($merged_transcript);
$gene->biotype($unmerged->biotype);
push(@merged, $gene);
$count++;
$self->merged_unmerged_pairs($gene,$unmerged);
}
return @merged; } |
sub _overlapping_genes
{ my ($gene1, $cluster) = @_;
if ($gene1->end < $cluster->start || $gene1->start > $cluster->end) {
return 0;
}
foreach my $exon1 (@{$gene1->get_all_Exons}){
foreach my $exon2 (@{$cluster->get_all_Exons}){
if ( ($exon1->overlaps($exon2)) && ($exon1->strand == $exon2->strand) ){
return 1;
}
}
}
return 0; } |
sub _recalculate_translation
{ my ($self, $mytranscript, $strand) = @_;
my $this_is_my_transcript = clone_Transcript($mytranscript);
compute_translation($mytranscript);
unless(validate_Translation_coords($mytranscript, 1)
&& contains_internal_stops($mytranscript)
&& $mytranscript->translate){
print STDERR "problem with the translation. Returning the original transcript\n";
return $this_is_my_transcript;
}
return $mytranscript; } |
sub _transfer_evidence
{ my ($self, $combined_transcript, $cdna_transcript) = @_;
my $first_support_id;
foreach my $combined_exon(@{$combined_transcript->get_all_Exons}){
foreach my $cdna_exon(@{$cdna_transcript->get_all_Exons}){
if($combined_exon->overlaps($cdna_exon)){
if($combined_exon->strand != $cdna_exon->strand){
print STDERR "OVERLAPPING-BUT-DIFFERENT_STRANDS!\n";
}
else{
transfer_supporting_evidence($cdna_exon, $combined_exon);
}
}
}
}
my $cdna_trans = $cdna_transcript->get_all_Transcripts()->[0];
foreach my $tsf (@{$cdna_trans->get_all_supporting_features}) {
print STDERR "adding supporting feature: ".$tsf->hseqname."\n" if $self->VERBOSE;
$combined_transcript->add_supporting_features($tsf);
}
return $combined_transcript; } |
sub add_3prime_exons
{ my ($self, $transcript, $exoncount, @e2g_exons) = @_;
my $c = $#e2g_exons;
my $modified = 0;
while($c > $exoncount){
my $newexon = new Bio::EnsEMBL::Exon;
my $oldexon = $e2g_exons[$c];
$newexon->start($oldexon->start);
$newexon->end($oldexon->end);
$newexon->strand($oldexon->strand);
$newexon->phase(-1);
$newexon->end_phase(-1);
$newexon->slice($oldexon->slice);
my %evidence_hash;
foreach my $sf( @{$oldexon->get_all_supporting_features }){
if ( $evidence_hash{$sf->hseqname}{$sf->hstart}{$sf->hend}{$sf->start}{$sf->end} ){
next;
}
$evidence_hash{$sf->hseqname}{$sf->hstart}{$sf->hend}{$sf->start}{$sf->end} = 1;
$newexon->add_supporting_features($sf);
}
$transcript->add_Exon($newexon);
$modified = 1;
$c--;
}
if ($modified == 1){
$transcript->translation->end_Exon->end_phase(-1);
} } |
sub add_5prime_exons
{ my ($self, $transcript, $exoncount, @e2g_exons) = @_;
my $c = 0;
my $modified = 0;
while($c < $exoncount){
my $newexon = new Bio::EnsEMBL::Exon;
my $oldexon = $e2g_exons[$c];
$newexon->start($oldexon->start);
$newexon->end($oldexon->end);
$newexon->strand($oldexon->strand);
$newexon->phase(-1);
$newexon->end_phase(-1);
$newexon->slice($oldexon->slice);
my %evidence_hash;
foreach my $sf( @{$oldexon->get_all_supporting_features} ){
if ( $evidence_hash{$sf->hseqname}{$sf->hstart}{$sf->hend}{$sf->start}{$sf->end} ){
next;
}
$evidence_hash{$sf->hseqname}{$sf->hstart}{$sf->hend}{$sf->start}{$sf->end} = 1;
$newexon->add_supporting_features($sf);
}
$transcript->add_Exon($newexon);
$modified = 1;
$c++;
}
if ($modified == 1){
$transcript->translation->start_Exon->phase(-1);
}
}
} |
sub blessed_genes
{ my ($self, $slice, $genes) = @_;
if (!defined($self->{'_blessed_genes'})) {
$self->{'_blessed_genes'} = [];
}
if (defined $slice && defined $genes && scalar(@{$genes})) {
OLDGENE:
foreach my $gene(@{$genes}){
foreach my $transcript(@{$gene->get_all_Transcripts}){
my $newgene = new Bio::EnsEMBL::Gene;
$newgene->biotype($gene->biotype);
foreach my $dblink (@{$gene->get_all_DBLinks()}){
$newgene->add_DBEntry($dblink);
}
my $newtranscript = new Bio::EnsEMBL::Transcript;
$newtranscript->slice($slice);
my $newtranslation = new Bio::EnsEMBL::Translation;
my @seqeds = @{$transcript->translation->get_all_SeqEdits};
if (scalar(@seqeds)) {
print "Copying sequence edits\n" if $self->VERBOSE;
foreach my $se (@seqeds) {
my $new_se =
Bio::EnsEMBL::SeqEdit->new(
-CODE => $se->code,
-NAME => $se->name,
-DESC => $se->description,
-START => $se->start,
-END => $se->end,
-ALT_SEQ => $se->alt_seq
);
my $attribute = $new_se->get_Attribute();
$newtranslation->add_Attributes($attribute);
}
}
my @support = @{$transcript->get_all_supporting_features};
if (scalar(@support)) {
$newtranscript->add_supporting_features(@support);
}
$newtranscript->add_Attributes(@{$transcript->get_all_Attributes});
$newtranscript->translation($newtranslation);
foreach my $exon(@{$transcript->get_all_Exons}){
my $newexon = clone_Exon($exon);
$newexon->stable_id('');
if ( $exon == $transcript->translation->start_Exon ){
$newtranslation->start_Exon( $newexon );
$newtranslation->start($transcript->translation->start);
}
if ( $exon == $transcript->translation->end_Exon ){
$newtranslation->end_Exon( $newexon );
$newtranslation->end($transcript->translation->end);
}
$newtranscript->add_Exon($newexon);
$newexon->add_supporting_features(@{$exon->get_all_supporting_features});
}
$newgene->add_Transcript($newtranscript);
push(@{$self->{'_blessed_genes'}}, $newgene);
}
}
}
return $self->{'_blessed_genes'}; } |
sub cDNA_GENETYPE
{ my ($self,$value) = @_;
if (defined $value) {
$self->{_cdna_genetype} = $value;
}
return $self->{_cdna_genetype}; } |
sub calculate_UTR_score
{ my ($self, $gw_gene, $matching_gene) = @_;
my $BONUS_FOR_BOTH_SIDES = 1;
my $UTR_score = 0;
if(($gw_gene->start - $matching_gene->start) && ($matching_gene->end - $gw_gene->end)){
$UTR_score += $BONUS_FOR_BOTH_SIDES;
}
return $UTR_score; } |
sub cdna_genes
{ my ($self, $genes) = @_;
if (!defined($self->{'_cdna_genes'})) {
$self->{'_cdna_genes'} = [];
}
if (defined $genes && scalar(@{$genes})) {
push(@{$self->{'_cdna_genes'}},@{$genes});
}
return ($self->{'_cdna_genes'}); } |
sub check_for_predefined_pairing
{ my ($self, $gene, $blessed) = @_;
my $cdna = undef;
my $protein_id;
my $cdna_id;
my $cdna_evidence = $self->_cdna_evidence();
EXON:
foreach my $exon(@{$gene->get_all_Exons}){
my @feat = @{$exon->get_all_supporting_features};
foreach my $feat(@feat){
$protein_id = $feat->hseqname;
last EXON if (defined $protein_id);
}
}
if ((!defined $protein_id || $protein_id eq '') && $blessed){
$protein_id = 'blessed';
my @xrefs = @{$gene->get_all_DBLinks()};
if(scalar @xrefs){
foreach my $xref (@xrefs){
if($xref->display_id =~ "^NM_"){
$cdna_id = $xref->display_id;
$cdna_id =~ s/\.\S+$//;
print STDERR "have xref $cdna_id\n" if $self->VERBOSE;
last;
}
}
}
else{ print STDERR "no suitable xref\n" if $self->VERBOSE; }
}
else{
if (!defined $protein_id || $protein_id eq ''){
print STDERR "no protein_id for gene.\n" if $self->VERBOSE;
return undef;
}
$protein_id =~ s/\.\S+$//;
$cdna_id = $self->get_cdna_id_from_protein_id($protein_id);
}
if (!defined $cdna_id || $cdna_id eq ''){
print STDERR "no predefined cDNA found.\n" if $self->VERBOSE;
return undef;
}
else{
print STDERR "found predefined cDNA $cdna_id for $protein_id\n" if $self->VERBOSE;
}
if(defined ($self->kill_list()->{$cdna_id})){
print STDERR "skipping " . $cdna_id . " present in kill list\n";
return undef;
}
my $cdna_gene = $cdna_evidence->{$cdna_id};
if($cdna_gene){
if($self->VERBOSE){ print STDERR "found cdna gene $cdna_gene / $cdna_id.\n" }
if( !(($gene->seq_region_name eq $cdna_gene->seq_region_name)
&& ($gene->strand == $cdna_gene->strand)
&& ($gene->seq_region_start < $cdna_gene->seq_region_end)
&& ($cdna_gene->seq_region_start < $gene->seq_region_end)
&& ($gene->get_all_Transcripts->[0]->get_all_Exons->[0]->strand
== $cdna_gene->get_all_Transcripts->[0]->get_all_Exons->[0]->strand)) ){
print STDERR "not overlapping properly, not using." if $self->VERBOSE;
$cdna_gene = undef;
}
}
else {
print STDERR "Unable to fetch cDNA gene [$cdna_id].\n";
return undef;
}
return $cdna_gene; } |
sub cluster_CDS
{ my ($self, $CDSref) = @_;
my @forward_CDS;
my @reverse_CDS;
foreach my $gene (@$CDSref){
my @exons = @{ $gene->get_all_Exons };
if ( $exons[0]->strand == 1 ){
push( @forward_CDS, $gene );
}
else{
push( @reverse_CDS, $gene );
}
}
my (@clusters, @forward_clusters, @reverse_clusters);
my ($forward_clusters, $forward_clusters_rest);
my ($reverse_clusters, $reverse_clusters_rest);
if ( @forward_CDS ){
($forward_clusters, $forward_clusters_rest) = cluster_Genes(\@ forward_CDS, $self->{evidence_sets} );
push(@forward_clusters, @$forward_clusters);
push(@forward_clusters, @$forward_clusters_rest);
}
if ( @reverse_CDS ){
($reverse_clusters, $reverse_clusters_rest) = cluster_Genes(\@ reverse_CDS, $self->{evidence_sets} );
push(@reverse_clusters, @$reverse_clusters);
push(@reverse_clusters, @$reverse_clusters_rest);
}
if ( scalar @forward_clusters ){
$self->forward_genewise_clusters(\@forward_clusters);
push( @clusters, @forward_clusters);
}
if ( scalar @reverse_clusters ){
$self->reverse_genewise_clusters(\@reverse_clusters);
push( @clusters, @reverse_clusters);
}
return\@ clusters; } |
sub combine_genes
{ my ($self, $gw, $e2g) = @_;
my $modified_peptide = 0;
my @combined_transcripts = ();
my @gw_tran = @{$gw->get_all_Transcripts};
$gw_tran[0]->sort;
my @gw_exons = @{$gw_tran[0]->get_all_Exons};
my @egtran = @{$e2g->get_all_Transcripts};
$egtran[0]->sort;
my @e2g_exons = @{$egtran[0]->get_all_Exons};
my $newtranscript = new Bio::EnsEMBL::Transcript;
if ( $self->EXTEND_ORIGINAL_BIOTYPE && (length($self->EXTEND_ORIGINAL_BIOTYPE)>0) ) {
$newtranscript->biotype($gw->biotype) ;
}
foreach my $exon(@gw_exons){
my $new_exon = clone_Exon($exon);
$newtranscript->add_Exon($new_exon);
}
my @support = @{$gw_tran[0]->get_all_supporting_features};
if (scalar(@support)) {
$newtranscript->add_supporting_features(@support);
}
$newtranscript->add_Attributes(@{$gw_tran[0]->get_all_Attributes});
my $translation = new Bio::EnsEMBL::Translation;
$translation->start($gw_tran[0]->translation->start);
$translation->end($gw_tran[0]->translation->end);
$translation->start_Exon($gw_tran[0]->translation->start_Exon);
$translation->end_Exon($gw_tran[0]->translation->end_Exon);
my @seqeds = @{$gw_tran[0]->translation->get_all_SeqEdits};
if (scalar(@seqeds)) {
print "Copying sequence edits\n" if $self->VERBOSE;
foreach my $se (@seqeds) {
my $new_se =
Bio::EnsEMBL::SeqEdit->new(
-CODE => $se->code,
-NAME => $se->name,
-DESC => $se->description,
-START => $se->start,
-END => $se->end,
-ALT_SEQ => $se->alt_seq
);
my $attribute = $new_se->get_Attribute();
$translation->add_Attributes($attribute);
}
}
$newtranscript->translation($translation);
$newtranscript->translation->start_Exon($newtranscript->start_Exon);
$newtranscript->translation->end_Exon($newtranscript->end_Exon);
my $eecount = 0;
my $modified_peptide_flag = 0;
EACH_E2G_EXON:
foreach my $ee (@e2g_exons){
if ($ee->strand != $gw_exons[0]->strand){
warn("gw and e2g exons have different strands - can't combine genes\n") ;
return undef;
}
if(scalar(@gw_exons) == 1) {
($newtranscript, $modified_peptide_flag) = $self->transcript_from_single_exon_genewise( $ee,
$gw_exons[0],
$newtranscript,
$translation,
$eecount,
@e2g_exons);
}
else {
($newtranscript, $modified_peptide_flag) = $self->transcript_from_multi_exon_genewise($ee,
$newtranscript,
$translation,
$eecount,
$gw,
$e2g)
}
if ( $modified_peptide_flag ){
$modified_peptide = 1;
}
$eecount++;
}
my $biotype = $gw->biotype;
if(defined($newtranscript) && $modified_peptide && (($self->{'blessed_type'}) =~ m/$biotype/)){ print STDERR "translation of blessed gene would need to be modified - not using combined gene.\n"; return undef;
}
if (defined($newtranscript)){
$newtranscript->sort;
foreach my $ex (@{$newtranscript->get_all_Exons}){
if($ex->sub_SeqFeature && scalar($ex->sub_SeqFeature) > 1 ){
my @sf = sort {$a->start <=> $b->start} $ex->sub_SeqFeature;
my $first = shift(@sf);
$ex->end($first->end);
foreach my $s(@sf){
$newtranscript->add_Exon($s);
$newtranscript->sort;
}
$ex->flush_sub_SeqFeature;
}
}
unless( is_Transcript_sane($newtranscript)
&& all_exons_are_valid($newtranscript, $self->MAX_EXON_LENGTH, 1)
&& intron_lengths_all_less_than_maximum($newtranscript, $self->MAX_INTRON_LENGTH)
&& has_no_unwanted_evidence($newtranscript) ){
print STDERR "problems with this combined transcript, return undef\n";
return undef;
}
unless( validate_Translation_coords($newtranscript, 1)
&& validate_Translation_coords($newtranscript, 1)
&& !contains_internal_stops($newtranscript)
&& $newtranscript->translate){
print STDERR "problems with this combined translation, return undef\n";
return undef;
}
my $foundtrans = 0;
my $newtrans;
if ( $modified_peptide ){
my $strand = $newtranscript->start_Exon->strand;
$newtrans = $self->_recalculate_translation($newtranscript, $strand);
unless( validate_Translation_coords($newtrans, 1) && is_Transcript_sane($newtrans)
&& all_exons_are_valid($newtrans, $self->MAX_EXON_LENGTH, 1)
&& intron_lengths_all_less_than_maximum($newtrans, $self->MAX_INTRON_LENGTH) ){
print STDERR "problems with this genomewise alternative model, returning original transript.\n";
$newtrans = $newtranscript
}
}
else{
$newtrans = $newtranscript;
}
return $newtrans;
}
else{
warn("No combination could be built\n");
return undef;
} } |
sub combined_genes
{ my ($self, $genesref) = @_;
if (!defined($self->{'_combined_genes'})) {
$self->{'_combined_genes'} = [];
}
if (defined $genesref) {
push(@{$self->{'_combined_genes'}}, $genesref);
}
return $self->{'_combined_genes'}; } |
sub convert_to_extended_genes
{ my ($self, $genes) =@_;
my @new_genes;
for my $g (@$genes){
my ( $set ) = $self->_get_evidence_set ( $g->biotype ) ;
for my $t (@{$g->get_all_Transcripts}){
throw ("gene has more than one trancript - only processing 1-gene-1-transcript-genes")
if (@{$g->get_all_Transcripts}>1);
my $gene_from_pt = Bio::EnsEMBL::Gene->new(
-start => $t->start ,
-end => $t->end ,
-strand => $t->strand ,
-slice => $t->slice ,
-biotype => $g->biotype,
-analysis => $t->analysis,
);
my $new_tr = Bio::EnsEMBL::Analysis::Tools::GeneBuildUtils::TranscriptExtended->new(
-BIOTYPE => $g->biotype ,
-ANALYSIS => $t->analysis ,
);
$new_tr->ev_set($set);
$new_tr->add_supporting_features(@{$t->get_all_supporting_features});
my @pt_exons = @{$t->get_all_Exons} ;
for (my $i=0 ; $i<scalar(@pt_exons) ; $i++) {
my $pte = $pt_exons[$i];
bless $pte,"Bio::EnsEMBL::Analysis::Tools::GeneBuildUtils::ExonExtended";
$pte->biotype($g->biotype);
$pte->ev_set($set);
$pte->end_phase(0);
$pte->phase(0);
$pte->next_exon($pt_exons[$i+1]) ;
if ($i == 0 ) {
$pte->prev_exon(0);
} else {
$pte->prev_exon($pt_exons[$i-1]);
}
$pte->transcript($new_tr) ;
$pte->analysis($t->analysis) ;
};
for (@pt_exons) {
$new_tr->add_Exon($_);
}
$new_tr->sort;
$gene_from_pt->add_Transcript($new_tr);
push @new_genes , $gene_from_pt;
}
}
return\@ new_genes; } |
sub create_predefined_pairing
{ my ($self, $known_utr_file) = @_;
print STDERR "Parsing GenBank file $known_utr_file for KnowUTR pairing.\n" if $self->VERBOSE;
open(REFSEQ, "<$known_utr_file") or die "Can't open ".$known_utr_file.": $@\n";
my $cdna_id;
my $protein_id;
my %predefined_pairing = ();
while(<REFSEQ>){
next unless /^VERSION|DBSOURCE/;
if(/VERSION/){
next unless /(NP\S+)/;
if(defined $protein_id){
die("previous protein_id [$protein_id] has not been cleared out\n");
}
if(defined $cdna_id){
die("previous cdna_id [$cdna_id] has not been cleared out ...\n");
}
$protein_id = $1;
}
if(/DBSOURCE/){
if(/(NC\_\S+)|(NG\_\S+)|(XM\S+)|(AC\_\S+)/){
$cdna_id = undef;
$protein_id = undef;
next;
}
if(!defined $protein_id){
die("something very wrong - no protein_id for $_\n");
}
if (defined $cdna_id){
die("previous cdna_id [$cdna_id] has not been cleared out ...\n");
}
next unless /(NM\S+)/;
$cdna_id = $1;
$cdna_id =~ s/\.\S+$//;
$protein_id =~ s/\.\S+$//;
if(exists($predefined_pairing{$protein_id})){ print STDERR $protein_id." allready exists!\n"; }
$predefined_pairing{$protein_id} = $cdna_id;
$cdna_id = undef;
$protein_id = undef;
}
}
close REFSEQ;
$self->_known_pairs(\%predefined_pairing); } |
sub ditags
{ my ($self, $ditags) = @_;
if (!defined($self->{'_ditags'})) {
$self->{'_ditags'} = [];
}
if (defined $ditags && scalar(@{$ditags})) {
push(@{$self->{'_ditags'}}, @{$ditags});
}
return($self->{'_ditags'}); } |
sub ests
{ my ($self, $ests) = @_;
if (!defined($self->{'_ests'})) {
$self->{'_ests'} = [];
}
if (defined $ests && scalar(@{$ests})) {
push(@{$self->{'_ests'}},@{$ests});
}
return($self->{'_ests'}); } |
sub expand_3prime_exon
{ my ($self, $exon, $transcript, $strand) = @_;
if(defined($exon->sub_SeqFeature) && (scalar($exon->sub_SeqFeature) > 1)){
my @sf = $exon->sub_SeqFeature;
my $last = pop(@sf);
$exon->start($last->start); $exon->dbID($last->dbID);
$exon->phase($last->phase);
foreach my $s(@sf){
$transcript->add_Exon($s);
$transcript->sort;
}
$exon->flush_sub_SeqFeature;
return 1;
}
return 0; } |
sub fetch_input
{ my ($self) = @_;
my $slice = $self->fetch_sequence(undef, $self->INPUT_DB);
if(!$slice){ throw "can't fetch input slice!\n" }
$self->query($slice);
foreach my $input_genetype (@{$self->INPUT_GENETYPES}) {
my $input_genes = $self->query->get_all_Genes_by_type($input_genetype);
print STDERR "got " . scalar(@{$input_genes}) . " $input_genetype genes [ ".
$self->INPUT_DB->dbname() . "@" . $self->INPUT_DB->host ." ]\n";
$self->gw_genes( $input_genes );
}
my $blessed_slice;
my @blessed_genes;
my $blessed_type;
if($self->BLESSED_DB and scalar(@{$self->BLESSED_GENETYPES})){
if ($self->BLESSED_DB){
$blessed_slice = $self->BLESSED_DB->get_SliceAdaptor->fetch_by_name($self->input_id);
}
else{
$blessed_slice = $self->query;
}
foreach my $bgt( @{$self->BLESSED_GENETYPES} ){
my $blessed_genes = $blessed_slice->get_all_Genes_by_type($bgt);
print STDERR "got " . scalar(@{$blessed_genes}) . " $bgt genes [ ".
$self->BLESSED_DB->dbname() . "@" . $self->BLESSED_DB->host ." ]\n";
$self->blessed_genes( $blessed_slice, $blessed_genes );
$blessed_type .= $bgt."";
}
$self->{'blessed_type'} = ($blessed_type.$self->BLESSED_UTR_GENETYPE);
}
if(!scalar @{$self->gw_genes} and !scalar @{$self->blessed_genes}){
print STDERR "\nNo genes found here.\n\n";
return 0;
}
if(!(defined($self->cDNA_GENETYPE) and $self->CDNA_DB)){
warn("NOT using any cDNAs!?\n");
}
else{
my $cdna_vc = $self->CDNA_DB->get_SliceAdaptor->fetch_by_name($self->input_id);
$self->_cdna_slice($cdna_vc);
foreach my $cdna_type (@{$self->cDNA_GENETYPE}){
my @cdna_genes = @{$self->_cdna_slice->get_all_Genes_by_type($cdna_type)};
print STDERR "got ".scalar(@cdna_genes)." ".$cdna_type." cDNAs.\n" if $self->VERBOSE;
$self->CDNA_DB->dbc->disconnect_when_inactive(1);
my $filtered_cdna = $self->_filter_cdnas(\@cdna_genes, 0);
$self->cdna_genes($filtered_cdna);
print STDERR "got " . scalar(@{$filtered_cdna}) . " cdnas after filtering.\n" if $self->VERBOSE;
}
}
if(defined($self->EST_GENETYPE) && $self->EST_DB){
my $est_vc = $self->EST_DB->get_SliceAdaptor->fetch_by_name($self->input_id);
my @est_genes = @{$est_vc->get_all_Genes_by_type($self->EST_GENETYPE)};
print STDERR "got " . scalar(@est_genes) . " ".$self->EST_GENETYPE." ESTs.\n";
if($self->FILTER_ESTS){
my $filtered_ests = $self->_filter_cdnas(\@est_genes, 1);
$self->ests($filtered_ests);
}
else{
$self->ests(\@est_genes);
}
print STDERR "got " . scalar(@{$self->ests}) . " ESTs after filtering.\n" if $self->VERBOSE;
}
my ($dfa, $ditag_slice);
my @ditags;
if((scalar @{$self->DITAG_TYPE_NAMES}) && $self->DITAG_DB){
$dfa = $self->DITAG_DB->get_DitagFeatureAdaptor;
$ditag_slice = $self->DITAG_DB->get_SliceAdaptor->fetch_by_name($self->input_id);
foreach my $ditag_type (@{$self->DITAG_TYPE_NAMES}) {
my @type_ditags = @{$dfa->fetch_pairs_by_Slice($ditag_slice, $ditag_type)};
print STDERR "got " . scalar(@type_ditags) . " ".$ditag_type." ditags.\n" if $self->VERBOSE;
push(@ditags, @type_ditags);
}
}
if(scalar @ditags){
@ditags = sort {($a->{'start'} <=> $b->{'start'}) or ($a->{'end'} <=> $b->{'end'})} @ditags;
$self->ditags(\@ditags);
print STDERR "got " . scalar(@ditags) ." ditags.\n";
}
else{
print STDERR "not using Ditags.\n";
}
$self->INPUT_DB->dbc->disconnect_when_inactive(1);
$self->CDNA_DB->dbc->disconnect_when_inactive(1)
if($self->CDNA_DB);
$self->EST_DB->dbc->disconnect_when_inactive(1)
if($self->EST_DB);
$self->BLESSED_DB->dbc->disconnect_when_inactive(1)
if($self->BLESSED_DB);
$self->DITAG_DB->dbc->disconnect_when_inactive(1)
if($self->DITAG_DB);
my (@est_biotypes, @cdna_logicnames);
push(@est_biotypes, $self->EST_GENETYPE);
my @simgw_biotypes = @{ $self->INPUT_GENETYPES };
push(@cdna_logicnames, @{$self->cDNA_GENETYPE});
$self->{evidence_sets} = {
'est' =>\@ est_biotypes,
'simgw' =>\@ simgw_biotypes,
'cdna' =>\@ cdna_logicnames,
};
$self->prune($self->PRUNE_GENES);
$self->{'remove_redundant'} = 0;
my $kill_list_type = "cDNA";
my $kill_list = $self->populate_kill_list($kill_list_type);
$self->kill_list($kill_list);
if($self->LOOK_FOR_KNOWN){
$self->create_predefined_pairing($self->KNOWNUTR_FILE);
} } |
sub filter_genes
{ my ($self, $genesref, $blessed) = @_;
my @tested_genes;
my $pruned_CDS;
GENES:
foreach my $test_gene (@$genesref){
if( $test_gene ){
my $test_transcript = $test_gene->get_all_Transcripts->[0];
$test_transcript->sort;
if($test_transcript->three_prime_utr or $test_transcript->five_prime_utr){
print STDERR "Gene has UTR already. Skipping.\n" if $self->VERBOSE;
$self->unmatched_genes($test_gene);
next GENES;
}
if($blessed){
if($test_transcript->start < 1 && $test_transcript->end > 1){
print STDERR "ignoring blessed gene: falls off the slice by its lower end\n" if $self->VERBOSE;
}
else{
push(@tested_genes, $test_gene);
}
}
else{
if( is_Transcript_sane($test_transcript)
&& all_exons_are_valid($test_transcript, $self->MAX_EXON_LENGTH)
&& intron_lengths_all_less_than_maximum($test_transcript, $self->MAX_INTRON_LENGTH)
&& ($test_transcript->start < $self->query->length)
&& ($test_transcript->end > 1)
&& has_no_unwanted_evidence($test_transcript) ) {
push(@tested_genes, $test_gene);
}
else{
if($test_transcript->start < 1 && $test_transcript->end > 1){
print STDERR "ignoring gene: falls off the slice by its lower end\n" if $self->VERBOSE;
}
else{
$self->unmatched_genes($test_gene);
}
}
}
}
else{
throw "\nERROR: Cant check genes!\n\n";
}
}
$genesref =\@ tested_genes;
if(!$blessed){
my $clustered_genes = $self->cluster_CDS($genesref);
if($self->prune){
$genesref = $self->prune_CDS($clustered_genes);
}
}
return $genesref; } |
sub find_cluster_joiners
{ my ($self, $transcript) = @_;
my @clusters;
my $transcript_start;
my $transcript_end;
my $overlaps_previous_cluster = 0;
my $matching_clusters = 0;
if ($transcript->get_all_Exons->[0]->strand == 1){
@clusters = @{$self->forward_genewise_clusters};
$transcript_start = $transcript->start_Exon->start;
$transcript_end = $transcript->end_Exon->end;
}
else{
@clusters = @{$self->reverse_genewise_clusters};
$transcript_start = $transcript->end_Exon->start;
$transcript_end = $transcript->start_Exon->end;
}
print STDERR "transcript spans $transcript_start - $transcript_end ".
$transcript->seq_region_start." - ".$transcript->seq_region_end."\n".
"clusters span:\n " if $self->VERBOSE;
if(!scalar(@clusters)){
print STDERR "Odd, no clusters\n" if $self->VERBOSE;
return 0;
}
foreach my $cluster(@clusters){
if( _overlapping_genes($transcript, $cluster) ){
$matching_clusters++;
}
if($matching_clusters>1){
print STDERRR "transcript joins clusters - discard it\n" if $self->VERBOSE;
return 1;
}
}
return 0; } |
sub forward_genewise_clusters
{ my ($self, $cluster_ref) = @_;
if (!defined($self->{'_forward_genewise_clusters'})) {
$self->{'_forward_genewise_clusters'} = [];
}
if (defined $cluster_ref && scalar(@{$cluster_ref})) {
push(@{$self->{'_forward_genewise_clusters'}},@{$cluster_ref});
@{$self->{'_forward_genewise_clusters'}} = sort { $a->start <=> $b->start } @{$self->{'_forward_genewise_clusters'}};
}
return $self->{'_forward_genewise_clusters'}; } |
sub get_cdna_id_from_protein_id
{ my ($self, $protein_id) = @_;
my $cdna_id = undef;
if(defined($self->_known_pairs()->{$protein_id})){
$cdna_id = $self->_known_pairs()->{$protein_id};
}
return $cdna_id; } |
sub gw_genes
{ my ($self, $genes) = @_;
if (!defined($self->{'_gw_genes'})) {
$self->{'_gw_genes'} = [];
}
if (defined $genes && scalar(@{$genes})) {
push(@{$self->{'_gw_genes'}},@{$genes});
}
return $self->{'_gw_genes'}; } |
sub kill_list
{ my ($self, $kill_list) = @_;
if($kill_list){
$self->{'_kill_list'} = $kill_list;
}
return $self->{'_kill_list'}; } |
sub look_for_both
{ my ($self, $gene) = @_;
my $time = time;
my $nupdated_start = 0;
my $nupdated_end = 0;
my $metcnt = 1;
my $maxterdist = 150;
foreach my $trans (@{$gene->get_all_Transcripts}) {
if ($trans->translation) {
my $tln = $trans->translation;
my $coding_start = $trans->cdna_coding_start;
my $orig_coding_start = $coding_start;
$trans->sort;
my $cdna_seq = uc($trans->spliced_seq);
my @pepgencoords = $trans->pep2genomic(1,1);
if(scalar(@pepgencoords) > 2) {
print STDERR "pep start does not map cleanly\n";
goto TLNEND; }
my $pepgenstart = $pepgencoords[0]->start;
my $pepgenend = $pepgencoords[$#pepgencoords]->end;
unless($pepgencoords[0]->isa('Bio::EnsEMBL::Mapper::Coordinate')) {
print STDERR "pep start maps to gap\n";
goto TLNEND; }
unless($pepgencoords[$#pepgencoords]->isa('Bio::EnsEMBL::Mapper::Coordinate')) {
print STDERR "pep start (end of) maps to gap\n";
goto TLNEND; }
print STDERR "Pep genomic location = " . $pepgenstart . " " . $pepgenend . "\n" if $self->VERBOSE;
my $startseq= substr($cdna_seq,$coding_start-1,3);
print "cdna seq for pep start = " . $startseq . "\n";
if ($startseq ne "ATG") {
if ($coding_start > 3) {
my $had_stop = 0;
while ($coding_start > 3 && !$had_stop) {
my $testseq = substr($cdna_seq,$coding_start-4,3);
if ($testseq eq "ATG") {
print_Translation($trans) if $self->VERBOSE;
my @coords = $trans->cdna2genomic($coding_start-3,$coding_start-1,$gene->strand);
my $new_start;
my $new_end;
if(scalar(@coords) > 2) {
print STDERR "Shouldn't happen - new start does not map cleanly - I'm out of here\n";
next;
} elsif (scalar(@coords) == 2) {
print STDERR "WOW ISN'T NATURE HORRIBLE: new start crosses intron\n";
print STDERR "coord[0] = " . $coords[0]->start . " " . $coords[0]->end ."\n";
print STDERR "coord[1] = " . $coords[1]->start . " " . $coords[1]->end ."\n";
if ($gene->strand == 1) {
$new_start = $coords[0]->start;
$new_end = $coords[$#coords]->end;
} else {
$new_start = $coords[0]->end;
$new_end = $coords[$#coords]->start;
}
} else {
$new_start = $coords[0]->start;
$new_end = $coords[0]->end;
}
unless($coords[0]->isa('Bio::EnsEMBL::Mapper::Coordinate')) {
print STDERR "Shouldn't happen - new start maps to gap - I'm out of here\n";
next;
}
unless($coords[$#coords]->isa('Bio::EnsEMBL::Mapper::Coordinate')) {
print STDERR "Shouldn't happen - new start (end of) maps to gap - I'm out of here\n";
next;
}
print STDERR "genomic pos for new start = " . $new_start . " " . $new_end . "\n" if $self->VERBOSE;
if ($new_end - $new_start == 2) {
$nupdated_start++;
my $newstartexon;
foreach my $exon (@{$trans->get_all_Exons}) {
if ($exon->end >= $new_start && $exon->start <= $new_start) {
$newstartexon = $exon;
last;
}
}
if ($newstartexon == $tln->start_Exon) {
if ($tln->start_Exon->strand == 1) {
$tln->start($new_start - $tln->start_Exon->start + 1);
} else {
$tln->start($tln->start_Exon->end - $new_end + 1);
}
$trans->{'transcript_mapper'} = undef;
$trans->{'coding_region_start'} = undef;
$trans->{'coding_region_end'} = undef;
$trans->{'cdna_coding_start'} = undef;
$trans->{'cdna_coding_end'} = undef;
} else {
if (!defined($newstartexon)) {
print STDERR "Failed finding new start exon - how can this be?\n";
next;
}
my $copyexon = new Bio::EnsEMBL::Exon(
-start => $tln->start_Exon->start,
-end => $tln->start_Exon->end,
-strand => $gene->strand,
);
my $copynewstartexon = new Bio::EnsEMBL::Exon(
-start => $newstartexon->start,
-end => $newstartexon->end,
-strand => $gene->strand,
);
$copyexon->end_phase($tln->start_Exon->end_phase);
$copyexon->contig($tln->start_Exon->contig);
if ($tln->start_Exon->stable_id) {
$copyexon->stable_id($tln->start_Exon->stable_id . "MET" . $metcnt++);
$copyexon->created($time);
$copyexon->modified($time);
$copyexon->version(1);
}
$copynewstartexon->phase($newstartexon->phase);
$copynewstartexon->contig($newstartexon->contig);
if ($newstartexon->stable_id) {
$copynewstartexon->stable_id($newstartexon->stable_id . "MET" . $metcnt++);
$copynewstartexon->created($time);
$copynewstartexon->modified($time);
$copynewstartexon->version(1);
}
if ($copynewstartexon->strand == 1) {
$tln->start($new_start - $copynewstartexon->start + 1);
} else {
$tln->start($copynewstartexon->end - $new_end + 1);
}
my @newexons;
my $inrange = 0;
foreach my $exon (@{$trans->get_all_Exons}) {
if ($inrange) {
$exon->phase( $newexons[$#newexons]->end_phase );
$exon->end_phase(($exon->length + $exon->phase) % 3);
}
if ($exon == $tln->start_Exon) {
$copyexon->phase( $newexons[$#newexons]->end_phase );
push @newexons,$copyexon;
$inrange = 0;
} elsif ($exon == $newstartexon) {
push @newexons,$copynewstartexon;
$copynewstartexon->end_phase(($exon->length - $tln->start + 1)%3);
print STDERR "Setting end_phase on new start exon to " . $copynewstartexon->end_phase .
" l = " . $exon->length . " ts = " . $tln->start . "\n" if $self->VERBOSE;
$inrange = 1;
} else {
push @newexons,$exon;
}
}
$trans->flush_Exons;
foreach my $exon (@newexons) {
$trans->add_Exon($exon);
}
if ($tln->end_Exon == $tln->start_Exon) {
$tln->end_Exon($copyexon);
}
$tln->start_Exon($copynewstartexon);
}
print_Translation($trans);
} else {
print STDERR "Across exons - not handling this\n";
}
last;
} else {
if ($testseq =~ /TAA/ or $testseq =~ /TGA/ or $testseq =~ /TAG/) {
$had_stop = 1;
} else {
$coding_start -= 3;
}
}
}
} else {
print STDERR "Not enough bases upstream - NOT looking into genomic\n"if $self->VERBOSE;
}
}
TLNEND:
{
my $coding_end = $trans->cdna_coding_end;
my $orig_coding_end = $coding_end;
$trans->sort;
my $peplen = $trans->translate->length;
my @pepgencoords = $trans->pep2genomic($peplen,$peplen);
if(scalar(@pepgencoords) > 2) {
print STDERR "pep end does not map cleanly\n";
next;
}
my $pepgenstart = $pepgencoords[0]->start;
my $pepgenend = $pepgencoords[$#pepgencoords]->end;
unless($pepgencoords[0]->isa('Bio::EnsEMBL::Mapper::Coordinate')) {
print STDERR "pep end maps to gap\n";
next;
}
unless($pepgencoords[$#pepgencoords]->isa('Bio::EnsEMBL::Mapper::Coordinate')) {
print STDERR "pep start (end of) maps to gap\n";
next;
}
my $endseq= substr($cdna_seq,$coding_end-3,3);
my $cdnalen = length($cdna_seq);
my $longendseq= substr($cdna_seq,$coding_end-6,12);
if ($endseq ne "TGA" and $endseq ne "TAA" and $endseq ne "TAG") {
if (($cdnalen-$coding_end) > 3) {
while (($cdnalen-$coding_end) > 3 && ($coding_end-$orig_coding_end) <= $maxterdist) {
my $testseq = substr($cdna_seq,$coding_end,3);
if ($testseq eq "TGA" or $testseq eq "TAA" or $testseq eq "TAG") {
my @coords = $trans->cdna2genomic($coding_end+1,$coding_end+3,$gene->strand);
my $new_start;
my $new_end;
if(scalar(@coords) > 2) {
throw("new end does not map cleanly\n");
next;
} elsif (scalar(@coords) == 2) {
print STDERR "WOW ISN'T NATURE HORRIBLE: new end crosses intron\n";
print STDERR "coord[0] = " . $coords[0]->start . " " . $coords[0]->end ."\n";
print STDERR "coord[1] = " . $coords[1]->start . " " . $coords[1]->end ."\n";
if ($gene->strand == 1) {
$new_start = $coords[0]->start;
$new_end = $coords[$#coords]->end;
} else {
$new_start = $coords[0]->end;
$new_end = $coords[$#coords]->start;
}
} else {
$new_start = $coords[0]->start;
$new_end = $coords[0]->end;
}
unless($coords[0]->isa('Bio::EnsEMBL::Mapper::Coordinate')) {
print STDERR "new start maps to gap\n";
next;
}
unless($coords[$#coords]->isa('Bio::EnsEMBL::Mapper::Coordinate')) {
print STDERR "new start (end of) maps to gap\n";
next;
}
if ($new_end - $new_start == 2) {
$nupdated_end++;
my $newendexon;
foreach my $exon (@{$trans->get_all_Exons}) {
if ($exon->end >= $new_start && $exon->start <= $new_start) {
$newendexon = $exon;
last;
}
}
if ($newendexon == $tln->end_Exon) {
if ($tln->end_Exon->strand == 1) {
$tln->end($new_end - $tln->end_Exon->start + 1);
} else {
$tln->end($tln->end_Exon->end - $new_start + 1);
}
$trans->{'transcript_mapper'} = undef;
$trans->{'coding_region_start'} = undef;
$trans->{'coding_region_end'} = undef;
$trans->{'cdna_coding_start'} = undef;
$trans->{'cdna_coding_end'} = undef;
} else {
if (!defined($newendexon)) {
print STDERR "Failed finding new end exon - how can this be?\n";
next;
}
my $copyexon = new Bio::EnsEMBL::Exon(
-start => $tln->end_Exon->start,
-end => $tln->end_Exon->end,
-strand => $gene->strand,
);
my $copynewendexon = new Bio::EnsEMBL::Exon(
-start => $newendexon->start,
-end => $newendexon->end,
-strand => $gene->strand,
);
$copyexon->phase($tln->end_Exon->phase);
$copyexon->end_phase($tln->end_Exon->end_phase);
$copyexon->contig($tln->end_Exon->contig);
if ($tln->end_Exon->stable_id) {
$copyexon->stable_id($tln->end_Exon->stable_id . "TER" . $metcnt++);
$copyexon->created($time);
$copyexon->modified($time);
$copyexon->version(1);
}
$copynewendexon->phase($newendexon->phase);
$copynewendexon->contig($newendexon->contig);
if ($newendexon->stable_id) {
$copynewendexon->stable_id($newendexon->stable_id . "TER" . $metcnt++);
$copynewendexon->created($time);
$copynewendexon->modified($time);
$copynewendexon->version(1);
}
if ($copynewendexon->strand == 1) {
$tln->end($new_end - $copynewendexon->start + 1);
} else {
$tln->end($copynewendexon->end - $new_start + 1 );
my $tercodon = $copynewendexon->seq->subseq($copynewendexon->end - $new_start-1, $copynewendexon->end - $new_start +1);
}
my @newexons;
my $inrange = 0;
foreach my $exon (@{$trans->get_all_Exons}) {
if ($inrange) {
print STDERR "in range exon before phase = " . $exon->phase . " endphase " . $exon->end_phase . "\n" if $self->VERBOSE;
$exon->phase( $newexons[$#newexons]->end_phase );
$exon->end_phase(($exon->length + $exon->phase) % 3);
print STDERR "in range exon after phase = " . $exon->phase . " endphase " . $exon->end_phase . "\n" if $self->VERBOSE;
}
if ($exon == $tln->end_Exon) {
my $phase = $exon->phase;
if ($phase == -1) {
$phase = 0;
}
if ($exon == $tln->start_Exon) {
$copyexon->end_phase(($exon->length - $tln->start + 1)%3);
} else {
$copyexon->end_phase(($exon->length + $exon->phase)%3);
}
print STDERR "Setting end_phase on old end exon to " . $copyexon->end_phase . " l = " . $exon->length . "\n" if $self->VERBOSE;
push @newexons,$copyexon;
$inrange = 1;
} elsif ($exon == $newendexon) {
$copynewendexon->phase( $newexons[$#newexons]->end_phase );
$copynewendexon->end_phase( -1);
push @newexons,$copynewendexon;
$inrange = 0;
} else {
push @newexons,$exon;
}
}
$trans->flush_Exons;
foreach my $exon (@newexons) {
$trans->add_Exon($exon);
}
if ($tln->end_Exon == $tln->start_Exon) {
$tln->start_Exon($copyexon);
}
$tln->end_Exon($copynewendexon);
}
print_Translation($trans);
} else {
print STDERR "Across exons - not handling this\n" if $self->VERBOSE;
}
last;
}
$coding_end += 3;
}
} else {
print STDERR "Not enough bases downstream - NOT looking into genomic\n" if $self->VERBOSE;
}
}
}
}
$trans->get_all_supporting_features();
my @exons= @{$trans->get_all_Exons};
my $get = $trans->translation;
$trans->_translation_id(undef);
foreach my $exon (@exons) {
$exon->stable_id;
$exon->contig($gene->slice);
$exon->get_all_supporting_features;
}
}
return($gene); } |
sub make_gene
{ my ($self, $genetype, @transcripts) = @_;
unless ( $genetype ){
throw("You must define UTR_GENETYPE in Bio::EnsEMBL::Analysis::Conf::GeneBuild::UTR_Builder");
}
my $analysis = $self->analysis;
unless ($analysis){
throw("You have to pass an analysis to this RunnableDB through new()");
}
my @genes;
my $count=0;
foreach my $trans(@transcripts){
$trans->sort;
unless ( is_Transcript_sane($trans)
&& intron_lengths_all_less_than_maximum($trans, $self->MAX_INTRON_LENGTH)
&& all_exons_are_valid($trans, $self->MAX_EXON_LENGTH, 1)
&& has_no_unwanted_evidence($trans) ){
print STDERR "\nrejecting transcript\n";
return;
}
my $gene = new Bio::EnsEMBL::Gene;
$gene->biotype($genetype);
$trans->biotype($genetype);
$trans->analysis($analysis);
$gene->add_Transcript($trans);
$gene->analysis($analysis);
if($self->validate_gene($gene)){
push (@genes,$gene);
$count++;
}
else{
print STDERR "\nCould not validate gene!\n";
}
}
return(\@genes); } |
sub match_protein_to_cdna
{ my ($self, $gw, $is_est) = @_;
print STDERR "Looking at ESTs.\n" if $is_est;
my (%UTR_hash, %UTR_side_indicator_hash);
my @other_genes;
my @matching_e2g;
my @gw_tran = @{$gw->get_all_Transcripts};
my @gw_exons = @{$gw_tran[0]->get_all_Exons};
my $strand = $gw_exons[0]->strand;
if($gw_exons[$#gw_exons]->strand != $strand){
warn("first and last gw exons have different strands ".
"- can't make a sensible combined gene\n with ".$gw_tran[0]->dbId );
return undef;
}
if ( @gw_exons ){
if ($strand == 1 ){
@gw_exons = sort { $a->start <=> $b->start } @gw_exons;
}
else{
@gw_exons = sort { $b->start <=> $a->start } @gw_exons;
}
}
else{
warn("gw gene without exons: ".$gw->dbID.", skipping it");
return undef;
}
my $exon_slop = 20;
my $cds_length = length($gw_tran[0]->translateable_seq());
my @genes;
if($is_est){
@genes = @{$self->ests};
}
else{
@genes = @{$self->cdna_genes};
}
cDNA:
foreach my $e2g(@genes){
my @egtran = @{$e2g->get_all_Transcripts};
my @eg_exons = @{$egtran[0]->get_all_Exons};
$strand = $eg_exons[0]->strand;
if($eg_exons[$#eg_exons]->strand != $strand){
warn("first and last e2g exons have different strands - skipping transcript ".$egtran[0]->dbID);
next cDNA;
}
if ($strand == 1 ){
@eg_exons = sort { $a->start <=> $b->start } @eg_exons;
}
else{
@eg_exons = sort { $b->start <=> $a->start } @eg_exons;
}
my $fiveprime_match = 0;
my $threeprime_match = 0;
my $left_exon;
my $right_exon;
my $left_diff = 0;
my $right_diff = 0;
if ($#gw_exons == 0) {
foreach my $current_exon (@eg_exons) {
if($current_exon->strand != $gw_exons[0]->strand){
next cDNA;
}
if ($gw_exons[0]->end <= $current_exon->end &&
$gw_exons[0]->start >= $current_exon->start){
$fiveprime_match = 1;
$threeprime_match = 1;
$left_exon = $current_exon;
$right_exon = $current_exon;
$left_diff = $gw_exons[0]->start - $current_exon->start;
$right_diff = $current_exon->end - $gw_exons[0]->end;
}
}
}
else {
cDNA_EXONS:
foreach my $current_exon (@eg_exons) {
if($current_exon->strand != $gw_exons[0]->strand){
next cDNA;
}
if($gw_exons[0]->strand == 1){
if ($gw_exons[0]->end == $current_exon->end &&
($current_exon->start <= $gw_exons[0]->start ||
(abs($current_exon->start - $gw_exons[0]->start) <= $exon_slop && $current_exon != $eg_exons[0]))){
$fiveprime_match = 1;
$left_exon = $current_exon;
$left_diff = $gw_exons[0]->start - $current_exon->start;
}
elsif($gw_exons[$#gw_exons]->start == $current_exon->start &&
($current_exon->end >= $gw_exons[$#gw_exons]->end ||
(abs($current_exon->end - $gw_exons[$#gw_exons]->end) <= $exon_slop &&
$current_exon != $eg_exons[$#eg_exons]))){
$threeprime_match = 1;
$right_exon = $current_exon;
$right_diff = $current_exon->end - $gw_exons[0]->end;
}
}
elsif($gw_exons[0]->strand == -1){
if ($gw_exons[0]->start == $current_exon->start &&
($current_exon->end >= $gw_exons[0]->end ||
(abs($current_exon->end - $gw_exons[0]->end) <= $exon_slop &&
$current_exon != $eg_exons[0]))){
$fiveprime_match = 1;
$right_exon = $current_exon;
$right_diff = $current_exon->end - $gw_exons[0]->end;
}
elsif ($gw_exons[$#gw_exons]->end == $current_exon->end &&
($current_exon->start <= $gw_exons[$#gw_exons]->start ||
(abs($current_exon->start - $gw_exons[$#gw_exons]->start) <= $exon_slop &&
$current_exon != $eg_exons[$#eg_exons]))){
$threeprime_match = 1;
$left_exon = $current_exon;
$left_diff = $gw_exons[0]->start - $current_exon->start;
}
}
}
}
if($fiveprime_match || $threeprime_match){
my ($UTR_length, $left_UTR_length, $right_UTR_length) =
$self->_compute_UTRlength($egtran[0], $left_exon, $left_diff, $right_exon, $right_diff);
my $UTR_diff = $egtran[0]->length;
if(($cds_length * 10) > $UTR_diff){
$UTR_hash{$e2g} = $UTR_length;
$UTR_side_indicator_hash{$e2g} = 1;
print STDERR "considering cDNA ".$e2g->seq_region_start."-".$e2g->seq_region_end."[".($cds_length * 10)." > ".$UTR_diff."]\n";
push(@matching_e2g, $e2g);
}
else{
print STDERR "didnt pass UTR length check [".($cds_length * 10)." > ".$UTR_diff."]: ".$e2g->seq_region_start."-".$e2g->seq_region_end."\n" if $self->VERBOSE;
}
}
}
print STDERR "returning ".(scalar @matching_e2g)." candidates.\n";
return (\@matching_e2g,\%UTR_hash,\% UTR_side_indicator_hash); } |
sub merged_unmerged_pairs
{ my ($self, $merged_gene, $unmerged_gene) = @_;
if (!defined($self->{'_merged_unmerged_pairs'})) {
$self->{'_merged_unmerged_pairs'} = {};
}
if ($unmerged_gene && $merged_gene) {
$self->{'_merged_unmerged_pairs'}{$merged_gene}= $unmerged_gene;
}
return $self->{'_merged_unmerged_pairs'}; } |
sub modified_unmodified_pairs
{ my ($self, $modified_gene, $unmodified_gene) = @_;
if (!defined($self->{'_modified_unmodified_pairs'})) {
$self->{'_modified_unmodified_pairs'} = {};
}
if ($unmodified_gene && $modified_gene) {
$self->{'_modified_unmodified_pairs'}{$modified_gene}= $unmodified_gene;
}
return $self->{'_modified_unmodified_pairs'}; } |
sub new
{ my ($class, @args) = @_;
my $self = $class->SUPER::new(@args);
$self->read_and_check_config($UTR_BUILDER_CONFIG_BY_LOGIC);
return $self;
}
my $totalgenes = 0; } |
sub output
{ my ($self,@genes) = @_;
if (!defined($self->{'_output'})) {
$self->{'_output'} = [];
}
if(@genes){
push(@{$self->{'_output'}},@genes);
}
return $self->{'_output'};
}
} |
sub populate_kill_list
{ my ($self, $type) = @_;
my $kill_list_object = Bio::EnsEMBL::KillList::KillList->new(-TYPE => $type);
my %kill_list = %{$kill_list_object->get_kill_list()};
return\% kill_list; } |
sub prune
{ my ($self, $bool) = @_;
if (!defined($self->{'_prune_genes'})) {
$self->{'_prune_genes'} = undef;
}
if (defined $bool) {
$self->{'_prune_genes'} = $bool;
}
return ($self->{'_prune_genes'}); } |
sub prune_CDS
{ my ($self, $gene_cluster_ref) = @_;
my $cluster_count = 0;
my @pruned_transcripts;
my @pruned_genes;
CLUSTER:
foreach my $gene_cluster ( @$gene_cluster_ref ){
$cluster_count++;
my @unpruned_genes;
foreach my $unpruned_gene ($gene_cluster->get_Genes){
if(scalar(@{$unpruned_gene->get_all_Transcripts}) > 1){
throw($unpruned_gene->dbID . " has >1 transcript - can't handle it yet\n ");
}
push(@unpruned_genes, $unpruned_gene);
}
@unpruned_genes = sort{$b->get_all_Transcripts->[0]->length <=> $a->get_all_Transcripts->[0]->length} @unpruned_genes;
my $maxexon_number = 0;
foreach my $gene (@unpruned_genes){
my $exon_number = scalar(@{$gene->get_all_Transcripts->[0]->get_all_Exons});
if ( $exon_number > $maxexon_number ){
$maxexon_number = $exon_number;
}
}
if ($maxexon_number == 1){ push (@pruned_genes, $unpruned_genes[0]);
print STDERR "VAC: found single_exon_transcript: " .$unpruned_genes[0]->dbID. "\n" if $self->VERBOSE;
shift @unpruned_genes;
$self->unmatched_genes(@unpruned_genes);
next CLUSTER;
}
my %pairhash;
my %exonhash;
GENE:
foreach my $gene (@unpruned_genes) {
my @exons = @{$gene->get_all_Transcripts->[0]->get_all_Exons};
my $i = 0;
my $found = 1;
EXONS:
for ($i = 0; $i < $#exons; $i++) {
my $foundpair = 0;
my $exon1 = $exons[$i];
my $exon2 = $exons[$i+1];
EXONHASH:
foreach my $first_exon_id (keys %pairhash) {
my $first_exon = $exonhash{$first_exon_id};
foreach my $second_exon_id (keys %{$pairhash{$first_exon}}) {
my $second_exon = $exonhash{$second_exon_id};
if ( $exon1->overlaps($first_exon) && $exon2->overlaps($second_exon) ) {
$foundpair = 1;
last EXONHASH;
}
}
}
if ($foundpair == 0) { $found = 0;
$exonhash{$exon1} = $exon1;
$exonhash{$exon2} = $exon2;
$pairhash{$exon1}{$exon2} = 1;
}
}
if ($found == 0) {
push(@pruned_genes, $gene);
}
elsif ($found == 1 && $#exons == 0){
$self->unmatched_genes($gene);
}
else {
$self->unmatched_genes($gene);
if ( $gene == $unpruned_genes[0] ){
}
}
} }
return\@ pruned_genes; } |
sub remap_genes
{ my ($self, $genes, $biotype_suffix) = @_;
my @remapped_genes = ();
my $blessed_type = $self->{'blessed_type'};
print STDERR "remapping ".scalar @$genes." genes.\n" if $self->VERBOSE;
GENE:
foreach my $gene (@$genes) {
my $biotype = $gene->biotype;
if($biotype_suffix && (length($biotype_suffix) > 0)){
$gene->biotype($gene->biotype().$biotype_suffix);
}
if(defined $blessed_type && $blessed_type =~ m/$biotype/){ print STDERR "not remapping ".$biotype."\n" if $self->VERBOSE; push(@remapped_genes, $gene);
next GENE;
}
print STDERR "remapping ".$gene->biotype."\n" if $self->VERBOSE;
my @t = @{$gene->get_all_Transcripts};
my $tran = $t[0];
unless(validate_Translation_coords($tran, 1)
&& !contains_internal_stops($tran)
&& $tran->translate){
print STDERR "\nERROR: Gene at ".$gene->seq_region_name." ".$gene->seq_region_start."-".$gene->seq_region_end." ".
$gene->seq_region_strand." doesn't translate!\n\n";
push(@remapped_genes, $gene); next GENE;
}
foreach my $transcript ( @{$gene->get_all_Transcripts} ){
if($biotype){
$transcript->biotype($gene->biotype);
$transcript->analysis($gene->analysis);
}
set_start_codon($transcript);
set_stop_codon($transcript);
}
push(@remapped_genes, $gene);
}
return\@ remapped_genes; } |
sub retrieve_unmerged_gene
{ my ($self, $merged_gene) = @_;
my %pairs = %{$self->merged_unmerged_pairs()};
if(!exists $pairs{$merged_gene}){
print STDERR "Can't retrieve unmerged\n ";
print STDERR "for gene ".$merged_gene->dbID."\n ";
}
return $pairs{$merged_gene}; } |
sub reverse_genewise_clusters
{ my ($self, $cluster_ref) = @_;
if (!defined($self->{'_reverse_genewise_clusters'})) {
$self->{'_reverse_genewise_clusters'} = [];
}
if (defined $cluster_ref && scalar(@{$cluster_ref})) {
push(@{$self->{'_reverse_genewise_clusters'}},@{$cluster_ref});
@{$self->{'_reverse_genewise_clusters'}} = sort { $a->start <=> $b->start } @{$self->{'_reverse_genewise_clusters'}};
}
return $self->{'_reverse_genewise_clusters'}; } |
sub run
{ my ($self) = @_;
print "TOTAL GENES IN: ".(scalar @{$self->gw_genes} + scalar @{$self->blessed_genes})."\n";
my $ininumber = scalar @{$self->gw_genes};
my $filtered_genes = $self->filter_genes($self->gw_genes);
print STDERR "filtered from $ininumber to ".(scalar @$filtered_genes). " genes.\n" if $self->VERBOSE && $filtered_genes;
$ininumber = scalar @{$self->blessed_genes};
my $filtered_blessed_genes = $self->filter_genes($self->blessed_genes, 1);
print STDERR "filtered blessed from $ininumber to ".(scalar @$filtered_blessed_genes). " genes.\n"
if $self->VERBOSE && $filtered_blessed_genes;
$self->run_matching($filtered_blessed_genes, $self->BLESSED_UTR_GENETYPE, 1);
$self->run_matching($filtered_genes, $self->UTR_GENETYPE, 0);
print STDERR "Have ".(scalar @{$self->combined_genes})." combined_genes & ".
(scalar @{$self->unmatched_genes})." unmatched_genes.\n" if $self->VERBOSE;
my @remapped;
push( @remapped, @{$self->remap_genes( $self->combined_genes, undef )} );
push( @remapped, @{$self->remap_genes( $self->unmatched_genes, $self->EXTEND_BIOTYPE_OF_UNCHANGED_GENES )} );
$self->output(@remapped);
print "TOTAL GENES OUT: ".(scalar @remapped)."\n";
if($self->VERBOSE){
foreach ( @remapped ) {
print "final biotype : " . $_->biotype . "\n";
}
} } |
sub run_matching
{ my ($self, $genesref, $combined_genetype, $blessed) = @_;
my @merged_genes = ();
@merged_genes = $self->_merge_genes($genesref);
print STDERR "\n ---\n got " . scalar(@merged_genes) . " merged " . $combined_genetype . " genes\n" if $self->VERBOSE;
@merged_genes = sort {
my $result = ( ($b->get_all_Transcripts->[0]->end - $b->get_all_Transcripts->[0]->start + 1) <=>
($a->get_all_Transcripts->[0]->end - $b->get_all_Transcripts->[0]->start + 1) );
unless ($result){
return ( ($b->get_all_Transcripts->[0]->length) <=>
($a->get_all_Transcripts->[0]->length) );
}
return $result;
} @merged_genes;
CDS:
foreach my $cds (@merged_genes){
print STDERR "--next cds: ".$cds->seq_region_start."-".$cds->seq_region_end." --\n" if $self->VERBOSE;
my @cds_tran = @{$cds->get_all_Transcripts};
my @cds_exons = @{$cds_tran[0]->get_all_Exons}; my $strand = $cds_exons[0]->strand;
my $cdna_match;
my $usingKnown = 0;
if($cds_exons[$#cds_exons]->strand != $strand){
warn("first and last cds exons have different strands - can't make a sensible combined gene\n");
my $unmerged_cds = $self->retrieve_unmerged_gene($cds);
$self->unmatched_genes($unmerged_cds);
next CDS;
}
my $predef_match = undef;
my $combined_transcript = undef;
if($self->LOOK_FOR_KNOWN){
$predef_match = $self->check_for_predefined_pairing($cds, $blessed);
if(defined $predef_match){
$combined_transcript = $self->combine_genes($cds, $predef_match);
$combined_transcript->sort;
if ( $combined_transcript
&& is_Transcript_sane($combined_transcript)
&& all_exons_are_valid($combined_transcript, $self->MAX_EXON_LENGTH, 1)
&& intron_lengths_all_less_than_maximum($combined_transcript, $self->MAX_INTRON_LENGTH)
&& validate_Translation_coords($combined_transcript, 1)
&& !contains_internal_stops($combined_transcript)
&& $combined_transcript->translate
&& has_no_unwanted_evidence($combined_transcript) ){
if($self->find_cluster_joiners($combined_transcript)){
print STDERR "Found a cluster_joiner!\n" if $self->VERBOSE;
print STDERR $combined_transcript->seq_region_start."/".$combined_transcript->seq_region_end."\n" if $self->VERBOSE;
$combined_transcript = undef;
}
}
else{
print SDERR "Did not pass tests.\n" if $self->VERBOSE;
$combined_transcript = undef;
}
}
}
if($predef_match && $combined_transcript){
print STDERR "Using predefined cDNA!\n" if $self->VERBOSE;
$cdna_match = $predef_match;
$usingKnown = 1;
}
else{
my ($matching_cdnas, $utr_length_hash, $UTR_side_indicator_hash) = $self->match_protein_to_cdna($cds, 0);
my %utr_length_hash = %$utr_length_hash;
if(!(scalar @$matching_cdnas)){
warn("Could not identify any matching cDNA!\n");
($matching_cdnas, $utr_length_hash, $UTR_side_indicator_hash) = $self->match_protein_to_cdna($cds, 1);
%utr_length_hash = %$utr_length_hash;
if(!(scalar @$matching_cdnas)){
my $unmerged_cds = $self->retrieve_unmerged_gene($cds);
warn("Could not identify any matching ESTs either!\n");
$self->unmatched_genes($unmerged_cds);
next CDS;
}
}
my $matching_extended_cdnas = $self->convert_to_extended_genes($matching_cdnas);
my ($clusters, $non_clusters) = cluster_Genes( $matching_extended_cdnas, $self->{evidence_sets} ) ;
my $genes_by_strand;
my @possible_transcripts;
foreach my $gene (@$matching_extended_cdnas){
push @{$genes_by_strand->{$gene->strand}}, $gene;
}
my @cluster_to_use = ();
if(scalar @$clusters){
push(@cluster_to_use, @$clusters);
}
if(scalar @$non_clusters){
push(@cluster_to_use, @$non_clusters);
}
foreach my $cluster (@cluster_to_use){
print STDERR "CLUSTER: ".$cluster->start." ".$cluster->end." ".$cluster->strand."\n" if $self->VERBOSE;
my $collapsed_cluster = Bio::EnsEMBL::Analysis::Runnable::TranscriptConsensus->collapse_cluster($cluster, $genes_by_strand);
my @potential_genes = $cluster->get_Genes;
foreach my $potential_gene (@potential_genes){
foreach my $potential_trans (@{$potential_gene->get_all_Transcripts}) {
my @exon_array = ( @{$potential_trans->get_all_Exons} );
my $new_transcript = Bio::EnsEMBL::Analysis::Runnable::TranscriptConsensus->transcript_from_exons(\@
exon_array, undef, $collapsed_cluster);
$new_transcript->add_supporting_features(@{$potential_trans->get_all_supporting_features});
print STDERR "new_transcript score :".$new_transcript->score()."\n" if $self->VERBOSE;
my ($ditag_score, $index) = $self->score_ditags($self->ditags, $new_transcript, 0);
$new_transcript->score($new_transcript->score() + $ditag_score);
my $UTR_score = $self->calculate_UTR_score($cds_tran[0], $new_transcript);
$new_transcript->score($new_transcript->score() + $UTR_score);
push(@possible_transcripts, $new_transcript);
}
}
}
@possible_transcripts = sort { $a->score <=> $b->score } @possible_transcripts if @possible_transcripts;
my ($cdnaname, $genename);
my $round = 0;
POS:
while(my $chosen_transcript = pop @possible_transcripts){
my $chosen_feats = $chosen_transcript->get_all_supporting_features;
$cdna_match = Bio::EnsEMBL::Gene->new;
$cdna_match->slice($chosen_transcript->slice);
$cdna_match->add_Transcript($chosen_transcript);
$cdna_match->analysis($chosen_transcript->analysis);
$cdna_match->get_all_Transcripts->[0]->add_supporting_features(@$chosen_feats);
$combined_transcript = $self->combine_genes($cds, $cdna_match);
if ( defined($combined_transcript)){
$combined_transcript->sort;
if(is_Transcript_sane($combined_transcript)
&& all_exons_are_valid($combined_transcript, $self->MAX_EXON_LENGTH, 1)
&& intron_lengths_all_less_than_maximum($combined_transcript, $self->MAX_INTRON_LENGTH)
&& has_no_unwanted_evidence($combined_transcript)
){
if($self->find_cluster_joiners($combined_transcript)){
print STDERR "Found a cluster_joiner!\n" if $self->VERBOSE;
print STDERR $combined_transcript->seq_region_start."/".$combined_transcript->seq_region_end."\n" if $self->VERBOSE;
$combined_transcript = undef;
}
}
else{
print STDERR "Didn't pass checks.\n" if $self->VERBOSE;
$combined_transcript = undef;
}
}
else{
$combined_transcript = undef;
}
if(defined $combined_transcript){
last POS;
}
}
}
if (defined $combined_transcript){
$combined_transcript = $self->_transfer_evidence($combined_transcript, $cdna_match);
my $genetype;
if ( $self->EXTEND_ORIGINAL_BIOTYPE && (length($self->EXTEND_ORIGINAL_BIOTYPE)>0)) {
my $sep = "";
if($usingKnown){
$sep = "_" unless ( $self->KNOWN_UTR_GENETYPE =~ m/^_/ ); $genetype = $combined_transcript->biotype. $sep. $self->KNOWN_UTR_GENETYPE;
}
else{
$sep = "_" unless ( $self->UTR_GENETYPE =~ m/^_/ ); $genetype = $combined_transcript->biotype. $sep. $self->UTR_GENETYPE;
}
}
else {
if($usingKnown){
$genetype = $self->KNOWN_UTR_GENETYPE;
}
else{
$genetype = $combined_genetype;
}
}
my $combined_genes = $self->make_gene($genetype, $combined_transcript);
my $combined_gene;
if(!$blessed){
$combined_gene = $self->look_for_both($combined_genes->[0]);
my $combined_transcript = $combined_gene->get_all_Transcripts->[0];
$combined_transcript->sort;
if(! (is_Transcript_sane($combined_transcript)
&& all_exons_are_valid($combined_transcript, $self->MAX_EXON_LENGTH, 1)
&& intron_lengths_all_less_than_maximum($combined_transcript, $self->MAX_INTRON_LENGTH)
&& validate_Translation_coords($combined_transcript, 1)
&& !contains_internal_stops($combined_transcript)
&& $combined_transcript->translate
&& has_no_unwanted_evidence($combined_transcript) )
){
$combined_gene = $combined_genes->[0];
}
}
else{
$combined_gene = $combined_genes->[0];
}
$self->combined_genes($combined_gene);
print STDERR "RESULT-CombinedGene: ".$combined_gene->seq_region_name." ".$combined_gene->seq_region_start.
"-".$combined_gene->seq_region_end."\n";
}
else{
my $unmerged_cds = $self->retrieve_unmerged_gene($cds);
$self->unmatched_genes($unmerged_cds);
print STDERR "RESULT-UnmergedGene: ".$unmerged_cds->seq_region_name." ".$unmerged_cds->seq_region_start.
"-".$unmerged_cds->seq_region_end."\n";
next CDS;
}
}
print STDERR "At the end of the matching I have ".(scalar @{$self->combined_genes})." combined_genes genes".
" and ".(scalar @{$self->unmatched_genes})." unmatched_genes\n" if $self->VERBOSE;
if(!$blessed && ($self->{'remove_redundant'})){
my $unique_unmatched_genes = $self->remove_redundant_models($self->combined_genes, $self->unmatched_genes);
$self->{'_unmatched_genes'} = [];
$self->unmatched_genes(@{$unique_unmatched_genes});
print STDERR "without redundant models, we have ".(scalar @{$self->combined_genes})." combined_genes genes".
" and ".(scalar @{$self->unmatched_genes})." unmatched_genes\n" if $self->VERBOSE;
} } |
sub score_ditags
{ my ($self, $ditags, $feature, $index) = @_;
my $ditag_score = 0;
if(!$index){ $index = 0 }
for(my $i = $index; $i < (scalar @$ditags); $i++){
my $ditag = $ditags->[$i];
if(($ditag->{'start'} < $feature->end) && ($ditag->{'end'} > $feature->start)){
my $start_distance = abs($ditag->{'start'} - $feature->start);
my $end_distance = abs($ditag->{'end'} - $feature->end);
if(($start_distance < $self->DITAG_WINDOW) && ($end_distance < $self->DITAG_WINDOW)){
my $score = 0;
$score += ($ditag->{'tag_count'} / (($start_distance +1) * 0.9)); $score += ($ditag->{'tag_count'} / (($end_distance +1) * 0.9));
if($score > 0){ $ditag_score += $score; }
}
}
if($ditag->{'end'} < $feature->start){
$index++;
}
elsif($ditag->{'start'} > $feature->start){
last;
}
}
print STDERR " returning ditagscore $ditag_score.\n" if $self->VERBOSE;
return($ditag_score, $index); } |
sub transcript_from_multi_exon_genewise
{ my ($self, $current_exon, $transcript, $translation, $exoncount, $gw_gene, $eg_gene) = @_;
my @gwtran = @{$gw_gene->get_all_Transcripts};
$gwtran[0]->sort;
my @gwexons = @{$gwtran[0]->get_all_Exons};
my @egtran = @{$eg_gene->get_all_Transcripts};
$egtran[0]->sort;
my @egexons = @{$egtran[0]->get_all_Exons};
if($gwexons[0]->strand == 1){
return $self->transcript_from_multi_exon_genewise_forward($current_exon, $transcript, $translation,
$exoncount, $gw_gene, $eg_gene);
}
elsif( $gwexons[0]->strand == -1 ){
return $self->transcript_from_multi_exon_genewise_reverse($current_exon, $transcript, $translation,
$exoncount, $gw_gene, $eg_gene);
} } |
sub transcript_from_multi_exon_genewise_forward
{ my ($self, $current_exon, $transcript, $translation, $exoncount, $gw_gene, $eg_gene) = @_;
my $modified_peptide = 0;
my @gwtran = @{$gw_gene->get_all_Transcripts};
$gwtran[0]->sort;
my @gwexons = @{$gwtran[0]->get_all_Exons};
my @egtran = @{$eg_gene->get_all_Transcripts};
$egtran[0]->sort;
my @egexons = @{$egtran[0]->get_all_Exons};
my $orig_tend = $translation->end;
my $exon_slop = 20;
if ( $gwexons[0]->end == $current_exon->end &&
($current_exon->start <= $gwexons[0]->start ||
(abs($current_exon->start - $gwexons[0]->start) <= $exon_slop &&
$current_exon != $egexons[0]))){
my $current_start = $current_exon->start;
my $gwstart = $gwexons[0]->start;
my $ex = $transcript->start_Exon;
$ex->phase(-1);
if ( $current_exon->start < $gwexons[0]->start ){
$ex->start($current_exon->start);
}
elsif( $current_exon->start == $gwexons[0]->start ){
$ex->start($gwstart);
$ex->phase($gwexons[0]->phase);
}
elsif( $current_start > $gwstart && $exoncount != 0 ) {
$ex->start($current_exon->start);
}
transfer_supporting_evidence($current_exon, $ex);
$self->add_5prime_exons($transcript, $exoncount, @egexons);
if($gwstart >= $current_start){
my $tstart = $translation->start;
print STDERR "Forward 5': original translation start: $tstart "; $tstart += ($gwstart - $current_start);
$translation->start($tstart);
print STDERR "re-setting translation start to: $tstart\n"; }
elsif($gwstart < $current_start && $exoncount != 0){
$modified_peptide = 1;
my $diff = $current_start - $gwstart;
my $tstart = $translation->start;
warn("this is a case where gw translation starts at $tstart > 1") if ($tstart>1);
print STDERR "gw translation start: ".$tstart."\n";
if($diff % 3 == 0) {
$translation->start(1);
}
elsif ($diff % 3 == 1) {
$translation->start(3);
}
elsif ($diff % 3 == 2) {
$translation->start(2);
}
else {
$translation->start(1);
warn("very odd - $diff mod 3 = " . $diff % 3 . "\n");
}
}
else{
}
throw("setting very dodgy translation start: " . $translation->start. "\n")
unless $translation->start > 0;
}
elsif ( $gwexons[$#gwexons]->start == $current_exon->start &&
($current_exon->end >= $gwexons[$#gwexons]->end ||
(abs($current_exon->end - $gwexons[$#gwexons]->end) <= $exon_slop &&
$current_exon != $egexons[$#egexons]))){
my $end_translation_shift = 0;
my $ex = $transcript->end_Exon;
$ex->end_phase(-1);
if ( $current_exon->end > $gwexons[$#gwexons]->end ){
$ex->end($current_exon->end);
}
elsif( $current_exon->end == $gwexons[$#gwexons]->end ){
$ex->end($gwexons[$#gwexons]->end);
$ex->end_phase($gwexons[$#gwexons]->end_phase);
}
elsif ( $current_exon->end < $gwexons[$#gwexons]->end && $exoncount != $#egexons ){
$modified_peptide = 1;
my $diff = $gwexons[$#gwexons]->end - $current_exon->end;
my $tend = $translation->end;
my $gw_exon_length = $gwexons[$#gwexons]->end - $gwexons[$#gwexons]->start + 1;
my $cdna_exon_length = $current_exon->end - $current_exon->start + 1;
my $length_diff = $gw_exon_length - $cdna_exon_length;
$ex->end($current_exon->end);
if($diff % 3 == 0) {
$translation->end($cdna_exon_length);
$end_translation_shift = $length_diff;
}
elsif ($diff % 3 == 1) {
$translation->end($cdna_exon_length - 2);
$end_translation_shift = $length_diff + 2;
}
elsif ($diff % 3 == 2) {
$translation->end($cdna_exon_length - 1);
$end_translation_shift = $length_diff + 1;
}
else {
$translation->end($cdna_exon_length);
warn("very odd - $diff mod 3 = " . $diff % 3 . "\n");
}
}
my $end_ex = $transcript->end_Exon;
$translation->end_Exon($end_ex);
my $expanded = $self->expand_3prime_exon($ex, $transcript, 1);
if($expanded){
$translation->end($orig_tend - $end_translation_shift);
}
transfer_supporting_evidence($current_exon, $ex);
$self->add_3prime_exons($transcript, $exoncount, @egexons);
}
return ($transcript,$modified_peptide); } |
sub transcript_from_multi_exon_genewise_reverse
{ my ($self, $current_exon, $transcript, $translation, $exoncount, $gw_gene, $eg_gene) = @_;
my $modified_peptide = 0;
my @gwtran = @{$gw_gene->get_all_Transcripts};
$gwtran[0]->sort;
my @gwexons = @{$gwtran[0]->get_all_Exons};
my @egtran = @{$eg_gene->get_all_Transcripts};
$egtran[0]->sort;
my @egexons = @{$egtran[0]->get_all_Exons};
my $orig_tend = $translation->end;
my $exon_slop = 20;
if ($gwexons[0]->start == $current_exon->start &&
($current_exon->end >= $gwexons[0]->end ||
(abs($current_exon->end - $gwexons[0]->end) <= $exon_slop &&
$current_exon != $egexons[0]))){
my $tstart = $translation->start;
if($current_exon->end >= $gwexons[0]->end){
$tstart += $current_exon->end - $gwexons[0]->end;
$translation->start($tstart);
}
elsif( $current_exon->end < $gwexons[0]->end && $current_exon != $egexons[0] ){
$modified_peptide = 1;
my $diff = $gwexons[0]->end - $current_exon->end;
my $gwstart = $gwexons[0]->end;
my $current_start = $current_exon->end;
my $tstart = $translation->start;
if ($diff % 3 == 0) { $translation->start(1); }
elsif ($diff % 3 == 1) { $translation->start(3); }
elsif ($diff % 3 == 2) { $translation->start(2); }
else {
$translation->start(1);
warn("very odd - $diff mod 3 = " . $diff % 3 . "\n");}
}
throw("setting very dodgy translation start: " . $translation->start. "\n")
unless $translation->start > 0;
my $ex = $transcript->start_Exon;
$ex->phase(-1);
if ( $current_exon->end > $gwexons[0]->end){
$ex->end($current_exon->end);
$ex->phase(-1);
}
elsif ( $current_exon->end == $gwexons[0]->end){
$ex->end($gwexons[0]->end);
$ex->phase($gwexons[0]->phase);
}
elsif ( ($current_exon->end < $gwexons[0]->end) && ($current_exon != $egexons[0]) ){
$ex->end($current_exon->end);
}
$translation->start_Exon($ex);
transfer_supporting_evidence($current_exon, $ex);
$self->add_5prime_exons($transcript, $exoncount, @egexons);
}
elsif ($gwexons[$#gwexons]->end == $current_exon->end &&
($current_exon->start <= $gwexons[$#gwexons]->start ||
(abs($current_exon->start - $gwexons[$#gwexons]->start) <= $exon_slop &&
$current_exon != $egexons[$#egexons]))){
my $end_translation_shift = 0;
my $ex = $transcript->end_Exon;
$ex->end_phase(-1);
if ( $current_exon->start < $gwexons[$#gwexons]->start ){
$ex->start($current_exon->start);
$ex->end_phase(-1);
}
elsif( $current_exon->start == $gwexons[$#gwexons]->start){
$ex->start($gwexons[$#gwexons]->start);
$ex->end_phase($gwexons[$#gwexons]->end_phase);
}
elsif ( $current_exon->start > $gwexons[$#gwexons]->start && $exoncount != $#egexons ){
$modified_peptide = 1;
my $diff = $current_exon->start - $gwexons[$#gwexons]->start;
my $tend = $translation->end;
my $gw_exon_length = $gwexons[$#gwexons]->end - $gwexons[$#gwexons]->start + 1;
my $cdna_exon_length = $current_exon->end - $current_exon->start + 1;
my $length_diff = $gw_exon_length - $cdna_exon_length;
$ex->start($current_exon->start);
if($diff % 3 == 0) {
$translation->end($cdna_exon_length);
$end_translation_shift = $length_diff;
}
elsif ($diff % 3 == 1) {
$translation->end($cdna_exon_length - 2);
$end_translation_shift = $length_diff + 2;
}
elsif ($diff % 3 == 2) {
$translation->end($cdna_exon_length - 1);
$end_translation_shift = $length_diff + 1;
}
else {
$translation->end($cdna_exon_length);
warn("very odd - $diff mod 3 = " . $diff % 3 . "\n");
}
}
my $expanded = $self->expand_3prime_exon($ex, $transcript, -1);
my $end_ex = $transcript->end_Exon;
$translation->end_Exon($end_ex);
if($expanded){
$translation->end($orig_tend - $end_translation_shift);
}
transfer_supporting_evidence($current_exon, $ex);
$self->add_3prime_exons($transcript, $exoncount, @egexons);
}
return ($transcript, $modified_peptide); } |
sub transcript_from_single_exon_genewise
{ my ($self, $eg_exon, $gw_exon, $transcript, $translation, $exoncount, @e2g_exons) = @_;
my $orig_tend = $translation->end;
if ($gw_exon->start >= $eg_exon->start && $gw_exon->end <= $eg_exon->end){
my $egstart = $eg_exon->start;
my $egend = $eg_exon->end;
my $gwstart = $gw_exon->start;
my $gwend = $gw_exon->end;
my $ex = $transcript->start_Exon;
$ex->start($eg_exon->start);
$ex->end($eg_exon->end);
$translation->start_Exon($ex);
$translation->end_Exon($ex);
if($gw_exon->strand == 1){
my $diff = $gwstart - $egstart;
my $tstart = $translation->start;
my $tend = $translation->end;
$translation->start($tstart + $diff);
$translation->end($tend + $diff);
if($translation->start < 0){
warn("Forward strand: setting very dodgy translation start: " . $translation->start. "\n");
return(undef, 1);
}
if($translation->end > $translation->end_Exon->length){
warn("Forward strand: setting dodgy translation end: " . $translation->end .
" exon_length: " . $translation->end_Exon->length . "\n");
return(undef, 1);
}
}
elsif($gw_exon->strand == -1){
my $diff = $egend - $gwend;
my $tstart = $translation->start;
my $tend = $translation->end;
$translation->start($tstart+$diff);
$translation->end($tend + $diff);
if($translation->start < 0){
warn("Forward strand: setting very dodgy translation start: " . $translation->start. "\n");
return(undef, 1);
}
if($translation->end > $translation->end_Exon->length){
warn("Forward strand: setting dodgy translation end: " . $translation->end .
" exon_length: " . $translation->end_Exon->length . "\n");
return(undef, 1);
}
}
if(defined($ex->sub_SeqFeature) && (scalar($ex->sub_SeqFeature) > 1)){
my @sf = $ex->sub_SeqFeature;
my $cstart = $ex->start;
my $cend = $ex->end;
my $exlength = $ex->length;
my $first = shift(@sf);
$ex->end($first->end);
my $last = pop(@sf);
$last->end($cend);
$transcript->add_Exon($last);
$translation->end_Exon($last);
$translation->end($orig_tend);
foreach my $s(@sf){
$transcript->add_Exon($s);
}
$transcript->sort;
$ex->flush_sub_SeqFeature;
}
$self->add_5prime_exons($transcript, $exoncount, @e2g_exons);
$self->add_3prime_exons($transcript, $exoncount, @e2g_exons);
}
return ($transcript,0); } |
sub unmatched_genes
{ my ($self, @genes) = @_;
if (!defined($self->{'_unmatched_genes'})) {
$self->{'_unmatched_genes'} = [];
}
if(@genes){
push(@{$self->{'_unmatched_genes'}},@genes);
}
return $self->{'_unmatched_genes'}; } |
sub validate_gene
{ my ($self, $gene) = @_;
my @transcripts = @{$gene->get_all_Transcripts};
if(scalar(@transcripts) != 1) {
print STDERR "Rejecting gene - should have one transcript, not " . scalar(@transcripts) . "\n";
return 0;
}
foreach my $transcript(@transcripts){
foreach my $exon(@{$transcript->get_all_Exons}){
unless( validate_Exon_coords($exon, 1) ){
print STDERR "Rejecting gene because of invalid exon\n";
return 0;
}
}
}
return 1; } |
sub write_output
{ my($self) = @_;
my $gene_adaptor = $self->OUTPUT_DB->get_GeneAdaptor;
print STDERR "Have ".scalar (@{$self->output})."(".$totalgenes.") genes to write\n";
GENE:
foreach my $gene (@{$self->output}) {
if(!$gene->analysis ||
$gene->analysis->logic_name ne $self->analysis->logic_name){
$gene->analysis($self->analysis);
}
$gene->recalculate_coordinates;
foreach my $exon (@{$gene->get_all_Exons}) {
$exon->adaptor(undef);
$exon->dbID(undef);
}
$gene->dbID(undef) ;
$gene->adaptor(undef);
for ( @{$gene->get_all_Transcripts} ) {
for ( @{$_->get_all_supporting_features}){
$_->dbID(undef);
$_->adaptor(undef);
}
$_->dbID(undef);
$_->adaptor(undef);
}
eval {
$gene_adaptor->store($gene);
print STDERR "wrote gene dbID " . $gene->dbID . "\t".$gene->biotype . "\n" ;
};
if( $@ ) {
print STDERR "UNABLE TO WRITE GENE " . $gene->dbID. "of type " . $gene->type . "\n\n$@\n\nSkipping this gene\n";
}
} } |
General documentation
The rest of the documentation details each of the object methods. Internal methods are
usually preceded with a '_'
Arg [1] : optional Bio::EnsEMBL::DBSQL::DBAdaptor
Description: get/set for db storing blessed gene structures
Returntype : Bio::EnsEMBL::DBSQL::DBAdaptor
Exceptions : Throws if input isn't DBAdaptor or db parameters haven't been defined
various_config_variable_methods | Top |
Arg [1] : optional parameter
Description: setter / getter for config vars
Returntype : config value