sub _save_hit {
    my ($self, $rares, $raalign) = @_;
    if (@$rares) {
        # Normalise the result row to 13 fields.
        my @nres;
        if (@$rares == 13) {
            # 13 fields: put the last three fields into reverse order.
            @nres = ( @$rares[0..9, 12, 11, 10] );
        } elsif (@$rares == 12) {
            # 12 fields: insert the forward-strand marker 'W' at index 8.
            @nres = ( @$rares[0..7], 'W', @$rares[8..11] );
        }
        my $raw_align;
        if (@$raalign) {
            my $st1 = $nres[5];
            my $en1 = $nres[6];
            my $st2;
            my $en2;
            # Initialised to '' so that forward-strand hits interpolate
            # cleanly instead of raising "uninitialized value" warnings.
            my $dirl = '';
            if ($$raalign[2] eq 'C') {
                # Complemented hit: reverse both alignment strings and
                # swap the start/end coordinates of the second sequence.
                $$raalign[3] = 'C';
                $$raalign[0] = reverse($$raalign[0]);
                $$raalign[1] = reverse($$raalign[1]);
                $dirl = 'C';
                $st2  = $nres[11];
                $en2  = $nres[10];
            } else {
                $st2 = $nres[10];
                $en2 = $nres[11];
            }
            my $l1 = length($$raalign[0]);
            my $l2 = length($$raalign[1]);
            if ($l1 != $l2) {
                print "Lengths of alignment are different [$l1,$l2]\n$$raalign[0]\n$$raalign[1]\n";
                print join(',', @nres) . "\n";
                # throw() is assumed to be imported elsewhere in the module.
                throw("failed at CrossMatch :_save_hit");
            }
            $raw_align = "$st1:$dirl$st2";
            my $seq1 = $$raalign[0];
            my $seq2 = $$raalign[1];
            # Bare block used as a loop: 'redo' repeats it for the next
            # gap, 'next' leaves it.
            {
                my ($s1a, $s1b, $s1c, $s2a, $s2b, $s2c);
                # Split each sequence into: leading bases, first gap, rest.
                if ($seq1 =~ /^([^\-]+)(\-+)(\S+)$/) {
                    ($s1a, $s1b, $s1c) = ($1, $2, $3);
                } else {
                    $s1a = $seq1;
                    $s1b = $s1c = '';
                }
                if ($seq2 =~ /^([^\-]+)(\-+)(\S+)$/) {
                    ($s2a, $s2b, $s2c) = ($1, $2, $3);
                } else {
                    $s2a = $seq2;
                    $s2b = $s2c = '';
                }
                # Neither sequence contains a further gap: leave the block.
                next if (length($s1c) == 0 && length($s2c) == 0);
                my $lab;
                if ( length($s1a.$s1b) == 0 ||
                     length($s2a.$s2b) == 0 ) {
                    print STDERR "Dodgy alignment processing catch! Bugging out\n";
                    next;
                }
                if (length($s1a.$s1b) < length($s2a.$s2b)) {
                    $lab = length($s1a.$s1b);
                    $st1 += length($s1a);
                    $seq1 = $s1c;
                    $seq2 =~ s/^\S{$lab}//;
                } else {
                    $lab = length($s2a);
                    $seq2 = $s2c;
                    my $l2ab = length($s2a.$s2b);
                    $seq1 =~ s/^\S{$l2ab}//;
                    $st1 += $l2ab;
                }
                if ($dirl eq 'C') {
                    $st2 -= $lab;
                } else {
                    $st2 += $lab;
                }
                $raw_align .= ",$st1:$dirl$st2";
                redo;
            }
            $raw_align .= ",$en1:$dirl$en2";
        }
        $self->hit(@nres, $raw_align);
        @$rares = ();
        @$raalign = ();
    }
}
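The routine above walks a gapped pairwise alignment and records where each ungapped stretch begins. As a rough illustration of that idea only, the sketch below does a plain column-by-column walk rather than the regex-based stepping used above, and handles the forward strand only. The helper name ungapped_blocks and the [a_start, b_start, length] layout are hypothetical and are not part of CrossMatch.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of CrossMatch): given two gapped alignment
# strings of equal length and the start coordinate of each sequence, return
# one [a_start, b_start, length] triple per ungapped block.
# ASSUMPTION: forward strand only, '-' is the gap character.
sub ungapped_blocks {
    my ($seq_a, $seq_b, $a_pos, $b_pos) = @_;
    die "alignment strings differ in length\n"
        if length($seq_a) != length($seq_b);

    my @blocks;
    my $open;    # reference to the block currently being extended, if any
    for my $i (0 .. length($seq_a) - 1) {
        my $ca = substr($seq_a, $i, 1);
        my $cb = substr($seq_b, $i, 1);
        if ($ca ne '-' && $cb ne '-') {
            if ($open) {
                $open->[2]++;                    # extend the current block
            } else {
                $open = [ $a_pos, $b_pos, 1 ];   # start a new block
                push @blocks, $open;
            }
        } else {
            undef $open;                         # a gap closes the block
        }
        $a_pos++ if $ca ne '-';
        $b_pos++ if $cb ne '-';
    }
    return @blocks;
}

my @blocks = ungapped_blocks('ACGT--ACGT', 'ACGTTTAC-T', 1, 101);
print join(',', @$_), "\n" for @blocks;
```

Each printed line is one block; a complemented hit would need the b coordinate to count downwards instead, as the routine above does with its $dirl flag.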
new
Create a new CrossMatch factory object. The optional arguments to new are a
list of directories to search (see dir below).
crossMatch
Do a cross_match on the two fasta-formatted sequence files supplied as
arguments, returning a CrossMatch object. The CrossMatch object records the
parameters used in the match, and the match data returned by cross_match (if
any).
Note: The two fasta files may each contain multiple sequences.
dir
Change the list of directories in which crossMatch searches for files. The
list of directories supplied to dir completely replaces the existing list.
Directories are searched in the order supplied. This list defaults to the
current working directory when the CrossMatch::Factory object is created.
extn
A convenience function which allows you to supply a list of filename
extensions to be considered by crossMatch when finding a file to pass to
cross_match. For example:
$matcher->extn( 'seq', 'aa' );
$cm = $matcher->crossMatch( 'dJ334P19', 'cB49C12' );
crossMatch will look for the files "dJ334P19.seq" and "dJ334P19.aa". If no
extns had been set, then only "dJ334P19" would have been searched for.
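The interaction between dir and extn can be pictured as building a list of candidate paths. The sketch below is only an illustration: candidate_paths is a hypothetical helper, the '/' separator assumes a Unix-style filesystem, and trying every extension within one directory before moving to the next directory is an assumption, since the text above does not state the exact order.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of CrossMatch::Factory): list the paths that
# might be tried for $name, given search directories and optional extensions.
# ASSUMPTION: all extensions are tried in a directory before the next one.
sub candidate_paths {
    my ($name, $dirs, $extns) = @_;

    # With no extensions set, only the bare name is considered.
    my @files = ($name);
    @files = map { "$name.$_" } @$extns if @$extns;

    my @paths;
    for my $dir (@$dirs) {                 # directories in the order supplied
        push @paths, map { "$dir/$_" } @files;
    }
    return @paths;
}

my @with_extn = candidate_paths('dJ334P19', [ '.', '/data/seq' ], [ 'seq', 'aa' ]);
my @no_extn   = candidate_paths('dJ334P19', [ '.' ], []);
print "$_\n" for @with_extn, @no_extn;
```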
minMatch
Set the minmatch parameter passed to cross_match. Defaults to 50.
minScore
Set the minscore parameter passed to cross_match. Defaults to 50.
maskLevel
Set the masklevel parameter passed to cross_match. Defaults to 101, which
displays all overlapping matches.
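As an illustration of how these three settings might end up on the cross_match command line, here is a minimal sketch. The helper cross_match_args is hypothetical, and the way the factory actually assembles its command is an assumption; only the option names and the defaults of 50, 50 and 101 come from the descriptions above.

```perl
use strict;
use warnings;

# Hypothetical helper (not the module's real code): assemble an argument
# list for the cross_match program from two file names and any overrides.
sub cross_match_args {
    my ($file_a, $file_b, %override) = @_;

    # Defaults as documented above; later keys in the list win.
    my %param = ( minmatch => 50, minscore => 50, masklevel => 101, %override );

    return ( $file_a, $file_b,
             map { ( "-$_", $param{$_} ) } qw(minmatch minscore masklevel) );
}

my @args = cross_match_args('a.seq', 'b.seq', minscore => 80);
print "cross_match @args\n";
```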
alignments
Causes the full cross_match alignments to be parsed and stored in the
CrossMatch object. These can be accessed by the a2b and fp methods.
Data stored in CrossMatch objects, generated by a CrossMatch::Factory, can be
retrieved with a variety of queries. All the matches are stored internally in
an array, in the order in which they were generated by cross_match. You can
apply filters and sorts (sorts are not yet implemented) to the object, which
re-order or hide matches. These operations save their changes by altering the
active list, which is simply an array of indices which refer to elements in
the array of matches. The data access functions all operate through this
active list, and the active list can be reset to show all the matches found
by calling the unfilter method.
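The active-list idea can be modelled in a few lines. The class below, ActiveListDemo, is a hypothetical stand-in written for this illustration and is not the real CrossMatch implementation; it keeps the hits in one array and an array of indices alongside, and every operation only ever rewrites the indices.

```perl
use strict;
use warnings;

package ActiveListDemo;    # hypothetical stand-in, not the real CrossMatch class

sub new {
    my ($class, @hits) = @_;
    my $self = bless { hits => [@hits], active => [] }, $class;
    $self->unfilter;
    return $self;
}

# Reset the active list to every hit, in original order.
sub unfilter {
    my ($self) = @_;
    $self->{active} = [ 0 .. $#{ $self->{hits} } ];
    return @{ $self->{active} };
}

sub list  { my ($self) = @_; return @{ $self->{active} } }
sub count { my ($self) = @_; return scalar @{ $self->{active} } }

# Keep only the active indices for which the callback returns true.
# The callback receives the object and the index of a match.
sub filter {
    my ($self, $test) = @_;
    $self->{active} = [ grep { $test->($self, $_) } @{ $self->{active} } ];
    return $self->list;
}

sub score { my ($self, $i) = @_; return $self->{hits}[$i]{score} }

package main;

my $demo = ActiveListDemo->new( { score => 120 }, { score => 45 },
                                { score => 300 }, { score => 80 } );

$demo->filter( sub { my ($m, $i) = @_; $m->score($i) >= 80 } );
my @after_first = $demo->list;

$demo->filter( sub { my ($m, $i) = @_; $m->score($i) < 200 } );
my @after_second = $demo->list;

my @reset = $demo->unfilter;
print "@after_first | @after_second | @reset\n";
```

The hits themselves are never copied or deleted, which is why unfilter can always restore the full set.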
SYNOPSIS
$firstName = $cm->aName;
@allscores = $cm->score();
@someScores = $cm->score(0,3,4);
FUNCTIONS
count
Returns the number of hits in the active list.
list
Returns a list of array indices for the active list.
filter
$cm->filter(\&CrossMatch::endHits);
Takes a reference to a subroutine (or an anonymous subroutine), and
alters the active list to contain only those hits for which the
subroutine returns true. The subroutine should expect to be passed a
CrossMatch object, and an integer corresponding to the index of a
match. Applying a series of filters to the same object removes
successively more hits from the active list.
unfilter
Resets the active list to include all the hits, in the order in which
they were generated by cross_match. Returns a list of the active
indices.
FILTERS
endHits
A filter which returns true if one of the sequences in a hit has its
end in the hit.
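The exact test used by endHits is not given here, so the following is only a guess at what such a predicate could look like. It assumes that a hit contains an end of a sequence when it starts at base 1 or leaves no bases remaining, and it ignores any strand-specific subtleties. The function touches_end and the plain hash layout are hypothetical.

```perl
use strict;
use warnings;

# ASSUMPTION: a hit "contains an end" of a sequence when it starts at base 1
# or leaves no bases remaining. The real endHits may use a different test.
sub touches_end {
    my ($hit) = @_;
    return (    $hit->{aStart} == 1 || $hit->{aRemain} == 0
             || $hit->{bStart} == 1 || $hit->{bRemain} == 0 ) ? 1 : 0;
}

my @hits = (
    { aStart => 1,   aRemain => 900, bStart => 350, bRemain => 20 },  # begins at the start of a
    { aStart => 200, aRemain => 150, bStart => 40,  bRemain => 75 },  # internal to both sequences
    { aStart => 600, aRemain => 80,  bStart => 10,  bRemain => 0  },  # runs to the end of b
);

# Indices of the hits that would survive the filter.
my @kept = grep { touches_end( $hits[$_] ) } 0 .. $#hits;
print "@kept\n";
```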
PARAMETERS
The parameters used by cross_match when doing the match can be retrieved from
the resulting CrossMatch object. The two sequences matched are labelled a and
b, where a is the sequence from the first file passed to crossMatch, and b the
sequence from the second file. For example:
$path_to_a = $cm->aFile;
retrieves the path of the file supplied as the first argument to
cross_match.
aFile bFile
The full paths to the files used for sequences a and b.
minM
The minmatch parameter used.
minS
The minscore parameter used.
mask
The masklevel parameter used.
DATA FIELDS
Syntax:
$field = $match->FIELD();
$field = $match->FIELD(INDEX);
@fields = $match->FIELD();
@fields = $match->FIELD(LIST);
Examples:
$firstScore = $match->score(0);
@aEnds = $match->aEnd();
These methods provide access to all the data fields for each hit found
by cross_match. Each of the fields is described below. In scalar
context, a single data field is returned, which is the field from the
first row in the active list if no INDEX is given. In list context it
returns a list with the FIELD from all the hits in the active list when
called without arguments, or from the hits specified by a LIST of indices.
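Context-dependent accessors of this kind are usually built on Perl's wantarray. The sketch below shows one way the behaviour described above could be produced. It is not the module's own code: the file-scoped @hits and @active arrays are hypothetical, and treating INDEX values as indices into the array of matches (rather than positions within the active list) is an assumption.

```perl
use strict;
use warnings;

my @hits   = map { +{ score => $_ } } ( 120, 45, 300, 80 );
my @active = ( 2, 0, 3 );    # hypothetical active list: indices into @hits

# Accessor sketch: rows default to the active list; the calling context
# decides whether one value or all values are returned.
# ASSUMPTION: arguments are indices into @hits.
sub score {
    my @rows = @_ ? @_ : @active;
    my @vals = map { $hits[$_]{score} } @rows;
    return wantarray ? @vals : $vals[0];
}

my $first = score();         # scalar context, no INDEX: first active row
my $one   = score(1);        # scalar context with an INDEX
my @all   = score();         # list context: every hit in the active list
my @some  = score( 0, 3 );   # list context with a LIST of indices

print "$first $one | @all | @some\n";
```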
score
The Smith-Waterman score of a hit, adjusted for the complexity of the
matching sequence.
pSub
The percent substitutions in the hit.
pDel
The percent deletions in the hit (sequence a relative to sequence b).
pIns
The percent insertions in the hit (sequence a relative to sequence b).
aName bName
ID of the first sequence (sequence a) and second sequence (b) in
the match, respectively.
aStart bStart
Start of hit in a, or b, respectively.
aEnd bEnd
End of hit in a, or b, respectively.
aRemain
Number of bases in a after the end of the hit. ("0" if the hit
extends to the end of a.)
bRemain
The equivalent for sequence b of aRemain, but note that if the
strand of b which matches is the reverse strand, then this is the
number of bases left in b after the hit, back towards the beginning
of b.
strand
The strand on sequence b which was matched. This is "W" for the
forward (or parallel or Watson) strand, and "C" for the reverse
(or anti-parallel or Crick) strand.
a2b
Returns the coordinate in sequence b equivalent to the coordinate in
sequence a passed to the method. If there is no corresponding base,
returns -1.
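A coordinate lookup of this kind can be pictured as a search through the ungapped blocks of an alignment. The sketch below is only an illustration, assuming a forward-strand hit whose blocks are already available as [a_start, b_start, length] triples. Both that layout and the name a2b_sketch are hypothetical, and the real a2b method may work differently.

```perl
use strict;
use warnings;

# Hypothetical ungapped blocks for one forward-strand hit:
# [a_start, b_start, length]
my @blocks = ( [ 1, 101, 4 ], [ 5, 107, 2 ], [ 8, 109, 1 ] );

# Map a coordinate in a to the matching coordinate in b, or return -1 when
# the base in a falls in a gap or outside the hit.
# ASSUMPTION: forward strand only, so b counts upwards with a.
sub a2b_sketch {
    my ($a_pos, @blocks) = @_;
    for my $blk (@blocks) {
        my ($a_start, $b_start, $len) = @$blk;
        my $offset = $a_pos - $a_start;
        return $b_start + $offset if $offset >= 0 && $offset < $len;
    }
    return -1;
}

print a2b_sketch( 6, @blocks ), "\n";   # inside the second block
print a2b_sketch( 7, @blocks ), "\n";   # base in a with no partner in b
```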
fp
Returns an array of strings containing the full list of ungapped alignment
fragment coordinates, filtered by the score value passed to the method.
QUERYING BY SEQUENCE NAME
These methods allow access to the start, end, and remain fields for those
occasions where you know the name of your sequence, but don't necessarily
know if it is sequence a or b. Syntax and behaviour are the same as for the
DATA FIELDS functions, but the first argument to the function is the name of
the sequence you want to get data about.
Note: These methods perform a filtering operation, reducing the
active list to only those hits which match the name given (in either
the a or b columns). You'll therefore need to unfilter your
match if you need data from hits which have now become hidden.
If the sequence name is found in both columns a and b, then a
warning is printed, and the a column is chosen.
For example, suppose you have matched two fasta files containing only
the sequences cN75H12 and bK1109B5 (in that order), then the following
calls retrieve the same data:
@ends = $match->aEnd();
@ends = $match->end('cN75H12');
$start = $match->bStart(0);
$start = $match->start('bK1109B5', 0);
A warning is printed to STDERR if a match contains hits from the name
supplied in both columns, and only hits from the a column are
returned.
start end remain
Access the aStart or bStart, aEnd or bEnd, and aRemain or bRemain
fields, depending upon whether the name supplied matches the aName or
bName fields respectively.
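The column selection described in this section can be sketched as follows. This is an illustration only: start_by_name and the plain array of hashes are hypothetical, and unlike the real methods the sketch does not alter any active list.

```perl
use strict;
use warnings;

# Hypothetical hit records using the field names documented above.
my @hits = (
    { aName => 'cN75H12', bName => 'bK1109B5', aStart => 10,  bStart => 200 },
    { aName => 'cN75H12', bName => 'bK1109B5', aStart => 500, bStart => 40  },
);

# Return the start coordinates for the named sequence, choosing the a or b
# column by where the name is found. Column a wins, with a warning, when
# the name appears in both.
sub start_by_name {
    my ($name, @hits) = @_;
    my $in_a = grep { $_->{aName} eq $name } @hits;   # count of matches in a
    my $in_b = grep { $_->{bName} eq $name } @hits;   # count of matches in b
    warn "'$name' found in both columns; using column a\n" if $in_a && $in_b;

    my $col = $in_a ? 'a' : 'b';
    return map  { $_->{"${col}Start"} }
           grep { $_->{"${col}Name"} eq $name } @hits;
}

my @a_starts = start_by_name( 'cN75H12',  @hits );
my @b_starts = start_by_name( 'bK1109B5', @hits );
print "@a_starts | @b_starts\n";
```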