GFF documentation.

  use Bio::DB::GFF;

  # Open the sequence database
  my $db      = Bio::DB::GFF->new( -adaptor => 'dbi::mysqlopt',
                                   -dsn     => 'dbi:mysql:elegans',
                                   -fasta   => '/usr/local/fasta_files'
                                 );

  # fetch a 1 megabase segment of sequence starting at landmark "ZK909"
  my $segment = $db->segment('ZK909', 1 => 1000000);

  # pull out all transcript features
  my @transcripts = $segment->features('transcript');

  # for each transcript, total the length of the introns
  my %totals;
  for my $t (@transcripts) {
    my @introns = $t->Intron;
    $totals{$t->name} += $_->length foreach @introns;
  }

  # Sort the exons of the first transcript by position
  my @exons = sort {$a->start <=> $b->start} $transcripts[0]->Exon;

  # Get a region 1000 bp upstream of first exon
  my $upstream = $exons[0]->segment(-1000,0);

  # get its DNA
  my $dna = $upstream->dna;

  # and get all curated polymorphisms inside it
  my @polymorphisms = $upstream->contained_features('polymorphism:curated');

  # get all feature types in the database
  my @types = $db->types;

  # count all feature types in the segment
  my %type_counts = $segment->types(-enumerate=>1);

  # get an iterator on all curated features of type 'exon' or 'intron'
  my $iterator = $db->get_seq_stream(-type     => ['exon:curated','intron:curated']);

  while (my $s = $iterator->next_seq) {
      print $s,"\n";
  }

  # find all transcripts annotated as having function 'kinase'
  $iterator = $db->get_seq_stream(-type       => 'transcript',
                                  -attributes => {Function => 'kinase'});
  while (my $s = $iterator->next_seq) {
      print $s,"\n";
  }

Description

Bio::DB::GFF provides fast indexed access to a sequence annotation
database. It supports multiple database types (ACeDB, relational),
and multiple schemas through a system of adaptors and aggregators.
The following operations are supported by this module:

  - retrieving a segment of sequence based on the ID of a landmark
  - retrieving the DNA from that segment
  - finding all annotations that overlap with the segment
  - finding all annotations that are completely contained within the
    segment
  - retrieving all annotations of a particular type, either within a
    segment, or globally
  - conversion from absolute to relative coordinates and back again,
    using any arbitrary landmark for the relative coordinates
  - using a sequence segment to create new segments based on relative 
    offsets
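The conversion between absolute and relative coordinates listed above comes down to simple arithmetic. The following is a minimal pure-Perl sketch of that arithmetic, not the module's own implementation. It assumes 1-based inclusive coordinates and that relative position 1 is the landmark's first base in its own orientation; the sub names and the landmark hash are hypothetical.

```perl
use strict;
use warnings;

# Convert an absolute (reference-sequence) position to a position relative
# to a landmark. Assumes 1-based, inclusive coordinates, with relative
# position 1 at the landmark's first base in its own orientation.
sub abs_to_rel {
    my ($abs, $lm) = @_;
    return $lm->{strand} eq '-'
        ? $lm->{stop} - $abs + 1
        : $abs - $lm->{start} + 1;
}

# Inverse of abs_to_rel: map a landmark-relative position back to the
# reference sequence.
sub rel_to_abs {
    my ($rel, $lm) = @_;
    return $lm->{strand} eq '-'
        ? $lm->{stop} - $rel + 1
        : $lm->{start} + $rel - 1;
}

# Hypothetical landmark: a feature at 939627..942410 on the forward strand.
my %landmark = (start => 939627, stop => 942410, strand => '+');

my $rel = abs_to_rel(940000, \%landmark);
my $abs = rel_to_abs($rel, \%landmark);
print "relative: $rel, absolute: $abs\n";
```

In practice the segment objects returned by the module take care of this bookkeeping; the sketch only shows why any landmark can serve as the origin.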

The data model used by Bio::DB::GFF is compatible with the GFF flat
file format (http://www.sanger.ac.uk/localsw/GFF). The module can
load a set of GFF files into the database, and serves objects that
have methods corresponding to GFF fields.

The objects returned by Bio::DB::GFF are compatible with the
SeqFeatureI interface, allowing their use by the Bio::Graphics and
Bio::DAS modules.

The bioperl distribution includes several scripts that make it easier
to work with Bio::DB::GFF databases. They are located in the scripts
directory under a subdirectory named Bio::DB::GFF:

bp_load_gff.pl

This script will load a Bio::DB::GFF database from a flat GFF file of
sequence annotations. Only the relational database version of
Bio::DB::GFF is supported. It can be used to create the database from
scratch, as well as to incrementally load new data.

This script takes a --fasta argument to load raw DNA into the database
as well. However, GFF databases do not require access to the raw DNA
for most of their functionality.

bp_load_gff.pl also has a --upgrade option, which will perform a
non-destructive upgrade of older schemas to newer ones.

bp_bulk_load_gff.pl

This script will populate a Bio::DB::GFF database from a flat GFF file
of sequence annotations. Only the MySQL database version of
Bio::DB::GFF is supported. It uses the "LOAD DATA INFILE" query in
order to accelerate loading considerably; however, it can only be used
for the initial load, and not for updates.

This script takes a --fasta argument to load raw DNA into the database
as well. However, GFF databases do not require access to the raw DNA
for most of their functionality.

bp_fast_load_gff.pl

This script is as fast as bp_bulk_load_gff.pl but uses Unix pipe
tricks to allow for incremental updates. It only supports the MySQL
database version of Bio::DB::GFF and is guaranteed not to work on
non-Unix platforms.

Arguments are the same as for bp_load_gff.pl.

gadfly_to_gff.pl

This script will convert the GFF-like format used by the Berkeley
Drosophila Sequencing project into a format suitable for use with this
module.

sgd_to_gff.pl

This script will convert the tab-delimited feature files used by the
Saccharomyces Genome Database into a format suitable for use with this
module.

The GFF format is a flat tab-delimited file, each line of which
corresponds to an annotation, or feature. Each line has nine columns
and looks like this:

 Chr1  curated  CDS 365647  365963  .  +  1  Transcript "R119.7"

The 9 columns are as follows:

1. reference sequence

This is the ID of the sequence that is used to establish the
coordinate system of the annotation. In the example above, the
reference sequence is "Chr1".

2. source

The source of the annotation. This field describes how the annotation
was derived. In the example above, the source is "curated" to
indicate that the feature is the result of human curation. The names
and versions of software programs are often used for the source field,
as in "tRNAScan-SE/1.2".

3. method

The annotation method. This field describes the type of the
annotation, such as "CDS". Together the method and source describe
the annotation type.

4. start position

The start of the annotation relative to the reference sequence.

5. stop position

The stop of the annotation relative to the reference sequence. Start
is always less than or equal to stop.

6. score

For annotations that are associated with a numeric score (for example,
a sequence similarity), this field describes the score. The score
units are completely unspecified, but for sequence similarities, it is
typically percent identity. Annotations that don't have a score can
use "."

7. strand

For those annotations which are strand-specific, this field is the
strand on which the annotation resides. It is "+" for the forward
strand, "-" for the reverse strand, or "." for annotations that are
not stranded.

8. phase

For annotations that are linked to proteins, this field describes the
phase of the annotation on the codons. It is a number from 0 to 2, or
"." for features that have no phase.

9. group

GFF provides a simple way of generating annotation hierarchies ("is
composed of" relationships) by providing a group field. The group
field contains the class and ID of an annotation which is the logical
parent of the current one. In the example given above, the group is
the Transcript named "R119.7".

The group field is also used to store information about the target of
sequence similarity hits, and miscellaneous notes. See the next
section for a description of how to describe similarity targets.

The format of the group field is "Class ID" with a single space (not
a tab) separating the class from the ID. It is VERY IMPORTANT to
follow this format, or grouping will not work properly.

The sequences used to establish the coordinate system for annotations
can correspond to sequenced clones, clone fragments, contigs or
super-contigs. Thus, this module can be used throughout the lifecycle
of a sequencing project.

In addition to a group ID, the GFF format allows annotations to have a
group class. For example, in the ACeDB representation, RNA
interference experiments have a class of "RNAi" and an ID that is
unique among the RNAi experiments. Since not all databases support
this notion, the class is optional in all calls to this module, and
defaults to "Sequence" when not provided.

Double-quotes are sometimes used in GFF files around components of the
group field. Strictly, this is only necessary if the group name or
class contains whitespace.

Some annotations do not need to be individually named. For example,
it is probably not useful to assign a unique name to each ALU repeat
in a vertebrate genome. Others, such as predicted genes, correspond
to named biological objects; you probably want to be able to fetch the
positions of these objects by referring to them by name.

To accommodate named annotations, the GFF format places the object
class and name in the group field. The name identifies the object,
and the class prevents similarly-named objects, for example clones and
sequences, from colliding.

A named object is shown in the following excerpt from a GFF file:

 Chr1  curated transcript  939627 942410 . +  . Transcript Y95B8A.2

This object is a predicted transcript named Y95B8A.2. In this case,
the group field is used to identify the class and name of the object,
even though no other annotation belongs to that group.

It now becomes possible to retrieve the region of the genome covered
by transcript Y95B8A.2 using the segment() method:

  $segment = $db->segment(-class=>'Transcript',-name=>'Y95B8A.2');
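To make the nine-column layout and the "Class ID" group convention concrete, here is a minimal pure-Perl sketch that splits a GFF line and extracts the class and name from a simple group field. It is an illustration only, not the parser the module uses; the sub names parse_gff_line and parse_group are hypothetical, and the regular expression handles only the plain "Class ID" form with optional double quotes.

```perl
use strict;
use warnings;

# Split one GFF line into its nine columns. Real files are tab-delimited;
# the excerpts in this document are shown with spaces for readability.
sub parse_gff_line {
    my ($line) = @_;
    chomp $line;
    my @f = split /\t/, $line, 9;
    return unless @f == 9;
    my %feature;
    @feature{qw(ref source method start stop score strand phase group)} = @f;
    return \%feature;
}

# Split a simple "Class ID" group field; quotes around the ID are optional.
sub parse_group {
    my ($group) = @_;
    my ($class, $name) = $group =~ /^\s*(\S+)\s+"?([^";]+?)"?\s*$/
        or return;
    return ($class, $name);
}

# The named-transcript line from the text, rebuilt with real tabs.
my $line = join "\t", qw(Chr1 curated transcript 939627 942410 . + .),
                      'Transcript Y95B8A.2';
my $f = parse_gff_line($line);
my ($class, $name) = parse_group($f->{group});
print "$class $name spans $f->{ref}:$f->{start}..$f->{stop}\n";
```

The class and name recovered here are the same two values passed as -class and -name in the segment() call.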

It is not necessary for the annotation's method to correspond to the
object class, although this is commonly the case.

As explained above, each annotation in a GFF file refers to a
reference sequence. It is important that each reference sequence also
be identified by a line in the GFF file. This allows the Bio::DB::GFF
module to determine the length and class of the reference sequence,
and makes it possible to do relative arithmetic.

For example, if "Chr1" is used as a reference sequence, then it should
have an entry in the GFF file similar to this one:

 Chr1 assembly chromosome 1 14972282 . + . Sequence Chr1

This indicates that the reference sequence named "Chr1" has length
14972282 bp, method "chromosome" and source "assembly". In addition,
as indicated by the group field, Chr1 has class "Sequence" and name
"Chr1".
The object class "Sequence" is used by default when the class is not
specified in the segment() call. This allows you to use a shortcut
form of the segment() method:

 $segment = $db->segment('Chr1');          # whole chromosome
 $segment = $db->segment('Chr1',1=>1000);  # first 1000 bp

For your convenience, if Bio::DB::GFF encounters a line like the
following while loading a GFF file:

  ##sequence-region Chr1 1 14972282

it will automatically generate the following entry:

 Chr1 reference Component 1 14972282 . + . Sequence Chr1

This is sufficient to use Chr1 as a reference point.
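The rewrite of a ##sequence-region directive into a reference entry, as described above, can be sketched in a few lines of pure Perl. This is only an illustration of the mapping shown in the text, not the loader's actual code; the sub name region_to_entry is hypothetical.

```perl
use strict;
use warnings;

# Turn a "##sequence-region" directive into the equivalent reference entry
# (tab-delimited, nine columns). Returns undef for any other line.
sub region_to_entry {
    my ($line) = @_;
    my ($name, $start, $stop) =
        $line =~ /^##sequence-region\s+(\S+)\s+(-?\d+)\s+(-?\d+)/
        or return;
    return join "\t", $name, 'reference', 'Component',
                      $start, $stop, '.', '+', '.', "Sequence $name";
}

print region_to_entry('##sequence-region Chr1 1 14972282'), "\n";
```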
The ##sequence-region line is frequently found in the GFF files
distributed by annotation groups.

There are two cases in which an annotation indicates the relationship
between two sequences. The first case is a similarity hit, where the
annotation indicates an alignment. The second case is a map assembly,
in which the annotation indicates that a portion of a larger sequence
is built up from one or more smaller ones.

Both cases are indicated by using the Target tag in the group
field. For example, a typical similarity hit will look like this:

 Chr1 BLASTX similarity 76953 77108 132 + 0 Target Protein:SW:ABL_DROME 493 544

The group field contains the Target tag, followed by an identifier for
the biological object referred to. The GFF format uses the notation
Class:Name for the biological object, and even though this is
stylistically inconsistent, that's the way it's done. The object
identifier is followed by two integers indicating the start and stop
of the alignment on the target sequence.
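As an illustration of the Target notation, the following pure-Perl sketch pulls the class, name, and target coordinates out of such a group field. It is a sketch under stated assumptions rather than the module's parser: the sub name parse_target is hypothetical, and only the first colon is treated as the class separator so that names such as "SW:ABL_DROME" survive intact.

```perl
use strict;
use warnings;

# Parse a group field of the form: Target Class:Name start stop
# The name may itself contain colons, so only the first colon
# separates the class from the name.
sub parse_target {
    my ($group) = @_;
    my ($class, $name, $tstart, $tstop) =
        $group =~ /^Target\s+"?([^:"\s]+):([^"\s]+)"?\s+(\d+)\s+(\d+)/
        or return;
    return { class => $class, name => $name,
             start => $tstart, stop => $tstop };
}

my $t = parse_target('Target Protein:SW:ABL_DROME 493 544');
print "$t->{class} $t->{name} $t->{start}..$t->{stop}\n";
```

The sketch deliberately does not reorder the two target coordinates, so their original order is preserved.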

Unlike the main start and stop columns, it is possible for the target
start to be greater than the target end. The previous example
indicates that the section of Chr1 from 76,953 to 77,108 aligns to
the protein SW:ABL_DROME starting at position 493 and extending to
position 544.

A similar notation is used for sequence assembly information as shown
in this example:

 Chr1        assembly Link   10922906 11177731 . . . Target Sequence:LINK_H06O01 1 254826
 LINK_H06O01 assembly Cosmid 32386    64122    . . . Target Sequence:F49B2       6 31742

This indicates that the region between bases 10922906 and 11177731 of
Chr1 is composed of LINK_H06O01 from bp 1 to bp 254826. The region
of LINK_H06O01 between 32386 and 64122 is, in turn, composed of
bases 6 to 31742 of cosmid F49B2.

While not intended to serve as a general-purpose sequence database
(see bioperl-db for that), GFF allows you to tag features with
arbitrary attributes. Attributes appear in the Group field following
the initial class/name pair. For example:

 Chr1  cur trans  939 942 . +  . Transcript Y95B8A.2 ; Gene sma-3 ; Alias sma3

This line tags the feature named Transcript Y95B8A.2 as being "Gene"
named sma-3 and having the Alias "sma3". Features having these
attributes can be looked up using the fetch_feature_by_attribute() method.
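A group field that carries attributes, like the one in the example line, might be taken apart as in the following pure-Perl sketch. It is not the module's own parser: the sub name parse_group_attributes is hypothetical, and the sketch assumes that semicolons never occur inside quoted values.

```perl
use strict;
use warnings;

# Split a group field into its leading "Class Name" pair and any
# attributes that follow, separated by semicolons.
sub parse_group_attributes {
    my ($group) = @_;
    my @pairs;
    for my $part (split /\s*;\s*/, $group) {
        $part =~ s/^\s+//;
        $part =~ s/\s+$//;
        next unless length $part;
        my ($tag, $value) = split /\s+/, $part, 2;
        $value = '' unless defined $value;
        $value =~ s/^"(.*)"$/$1/;    # strip optional surrounding quotes
        push @pairs, [$tag, $value];
    }
    my $first = shift @pairs or return;
    my %attributes;
    # An attribute tag may appear more than once, so collect values in lists.
    push @{ $attributes{ $_->[0] } }, $_->[1] for @pairs;
    return { class => $first->[0], name => $first->[1],
             attributes => \%attributes };
}

my $g = parse_group_attributes('Transcript Y95B8A.2 ; Gene sma-3 ; Alias sma3');
print "$g->{class} $g->{name}\n";
print "  $_ = @{ $g->{attributes}{$_} }\n" for sort keys %{ $g->{attributes} };
```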
Two attributes have special meaning: "Note" is for backward
compatibility and is used for unstructured text remarks. "Alias" is
considered as a synonym for the feature name and will be consulted
when looking up a feature by its name.

This module uses a system of adaptors and aggregators to make it
adaptable to a variety of databases.

Adaptors

The core of the module handles the user API, annotation coordinate
arithmetic, and other common issues. The details of fetching
information from databases are handled by an adaptor, which is
specified during Bio::DB::GFF construction. The adaptor encapsulates
database-specific information such as the schema, user authentication
and access methods.

Currently there are two adaptors: 'dbi::mysql' and 'dbi::mysqlopt'.
The former is an interface to a simple MySQL schema. The latter is an
optimized version of dbi::mysql which uses a binning scheme to
accelerate range queries and the Bio::DB::Fasta module for rapid
retrieval of sequences. Note the double-colon between the words.
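To give a feel for how a binning scheme can speed up range queries, here is a minimal pure-Perl sketch of one common approach: each feature is assigned to the smallest bin that contains it entirely, with bin sizes growing by powers of ten. This is an illustration of the general technique and should not be read as the exact scheme used by dbi::mysqlopt; the sub name bin_for and the minimum bin size of 1000 are assumptions.

```perl
use strict;
use warnings;

# Assign a feature to the smallest bin that contains it entirely.
# Bin tiers grow by powers of ten from a minimum size ($min_bin).
# Returns the tier (bin size) and the index of the bin within that tier.
sub bin_for {
    my ($start, $stop, $min_bin) = @_;
    my $tier = $min_bin;
    while (int($start / $tier) != int($stop / $tier)) {
        $tier *= 10;
    }
    return ($tier, int($start / $tier));
}

# A short feature fits in a small bin.
my ($tier, $index) = bin_for(365647, 365963, 1000);
print "tier $tier, bin $index\n";

# A longer feature that crosses small-bin boundaries lands in a larger tier.
my ($tier2, $index2) = bin_for(939627, 942410, 1000);
print "tier $tier2, bin $index2\n";
```

With features indexed this way, a range query only has to examine the handful of bins in each tier that overlap the requested range, rather than scanning every feature on the reference sequence.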

Aggregators

The GFF format uses a "group" field to indicate aggregation properties
of individual features. For example, a set of exons and introns may
share a common transcript group, and multiple transcripts may share
the same gene group.

Aggregators are small modules that use the group information to
rebuild the hierarchy. When a Bio::DB::GFF object is created, you
indicate that it use a set of one or more aggregators. Each
aggregator provides a new composite annotation type. Before the
database query is generated, each aggregator is called to
"disaggregate" its annotation type into a list of component types
contained in the database. After the query is generated, each
aggregator is called again in order to build composite annotations
from the returned components.

For example, during disaggregation, the standard
"processed_transcript" aggregator generates a list of component
feature types including "UTR", "CDS", and "polyA_site". Later, it
aggregates these features into a set of annotations of type
"processed_transcript".

During aggregation, the list of aggregators is called in reverse
order. This allows aggregators to collaborate to create multi-level
structures: the transcript aggregator assembles transcripts from
introns and exons; the gene aggregator then assembles genes from sets
of transcripts.

Three default aggregators are provided:

      transcript   assembles transcripts from features of type
                   exon, CDS, 5'UTR, 3'UTR, TSS, and PolyA
      clone        assembles clones from Clone_left_end, Clone_right_end
                   and Sequence features.
      alignment    assembles gapped alignments from features of type
                   "similarity".

In addition, this module provides the optional "wormbase_gene"
aggregator, which accommodates the WormBase representation of genes.
This aggregator aggregates features of method "exon", "CDS", "5'UTR",
"3'UTR", "polyA" and "TSS" into a single object. It also expects to
find a single feature of type "Sequence" that spans the entire gene.

The existing aggregators are easily customized.

Note that aggregation will not occur unless you specifically request
the aggregation type. For example, this call:

  @features = $segment->features('alignment');

will generate an array of aggregated alignment features. However,
this call:

  @features = $segment->features();

will return a list of unaggregated similarity segments.

For more information, see the manual pages for
Bio::DB::GFF::Aggregator::processed_transcript, Bio::DB::GFF::Aggregator::clone,
etc.
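The aggregation step described in this section can be pictured with a small pure-Perl sketch that builds composite features from components sharing a group. It is a simplified illustration, not the code of any Bio::DB::GFF::Aggregator class; the sub name aggregate, the hash-based feature records, and the sample data are all hypothetical.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Build composite features from components that share a group.
# Each feature is a hash with method, start, stop and group keys.
sub aggregate {
    my ($composite_type, $component_types, @features) = @_;
    my %wanted = map { $_ => 1 } @$component_types;

    # Collect the wanted component features under their group.
    my %by_group;
    for my $f (@features) {
        next unless $wanted{ $f->{method} };
        push @{ $by_group{ $f->{group} } }, $f;
    }

    # One composite per group, spanning all of its parts.
    my @composites;
    for my $group (sort keys %by_group) {
        my @parts = sort { $a->{start} <=> $b->{start} } @{ $by_group{$group} };
        push @composites, {
            method => $composite_type,
            group  => $group,
            start  => min(map { $_->{start} } @parts),
            stop   => max(map { $_->{stop} } @parts),
            parts  => \@parts,
        };
    }
    return @composites;
}

my @features = (
    { method => 'exon',       start => 100, stop => 200, group => 'Transcript T1' },
    { method => 'exon',       start => 300, stop => 450, group => 'Transcript T1' },
    { method => 'CDS',        start => 120, stop => 430, group => 'Transcript T1' },
    { method => 'exon',       start => 900, stop => 990, group => 'Transcript T2' },
    { method => 'similarity', start => 10,  stop => 50,  group => 'Target X' },
);

my @transcripts = aggregate('transcript', ['exon', 'CDS'], @features);
for my $t (@transcripts) {
    my $n = @{ $t->{parts} };
    print "$t->{group}: $t->{start}..$t->{stop} ($n parts)\n";
}
```

Running a second aggregator over the output of the first, for example grouping transcripts into genes, is what produces the multi-level structures mentioned above.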

Methods

_delete	No description	Code
_delete_features	Description	Code
_delete_groups	No description	Code
_feature_by_attribute	No description	Code
_feature_by_id	Description	Code
_feature_by_name	Description	Code
_features	Description	Code
_multiple_return_args	No description	Code
_split_gff2_group	Description	Code
_split_gff3_group	Description	Code
abs_segment	No description	Code
abscoords	Description	Code
absolute	Description	Code
add_aggregator	Description	Code
aggregators	Description	Code
all_seqfeatures	Description	Code
attributes	Description	Code
automerge	Description	Code
classes	Description	Code
clear_aggregators	Description	Code
contained_features	Description	Code
contained_in	Description	Code
debug	Description	Code
default_aggregators	Description	Code
default_class	No description	Code
default_meta_values	Description	Code
delete	Description	Code
delete_features	Description	Code
delete_groups	Description	Code
dna	Description	Code
dna_chunk_size	No description	Code
do_attributes	Description	Code
do_initialize	Description	Code
do_load_gff	Description	Code
error	Description	Code
fast_queries	Description	Code
features	Description	Code
features_in_range	No description	Code
finish_load	Description	Code
get_Seq_by_accession	Description	Code
get_Seq_by_id	Description	Code
get_Stream_by_group	Description	Code
get_Stream_by_id	Description	Code
get_Stream_by_name	No description	Code
get_abscoords	Description	Code
get_dna	Description	Code
get_feature_by_attribute	Description	Code
get_feature_by_gid	Description	Code
get_feature_by_id	Description	Code
get_feature_by_name	Description	Code
get_feature_by_target	Description	Code
get_features	Description	Code
get_features_iterator	Description	Code
get_seq_stream	Description	Code
get_types	Description	Code
initialize	Description	Code
insert_sequence	No description	Code
insert_sequence_chunk	No description	Code
load_fasta	Description	Code
load_gff	Description	Code
load_gff_line	Description	Code
load_sequence	Description	Code
load_sequence_string	Description	Code
lock_on_load	Description	Code
make_aggregated_feature	No description	Code
make_feature	Description	Code
make_match_sub	Description	Code
make_object	Description	Code
meta	Description	Code
new	Description	Code
overlapping_features	Description	Code
parse_types	Description	Code
refclass	No description	Code
segment	Description	Code
setup_argv	No description	Code
setup_load	Description	Code
setup_segment_args	No description	Code
split_group	Description	Code
strict_bounds_checking	Description	Code
types	Description	Code
unescape	No description	Code

Methods description

_delete_features(), _delete_groups(), _delete()