Compara Database Schema
This document refers to the schema of EnsEMBL Compara version 51.
Introduction
EnsEMBL Compara DB is divided into two parts:
- Genomic Alignments
- This part is dedicated to DNA-DNA alignments. It includes alignments made with different alignment tools, and a list of syntenic regions.
- Homologues and Families (or Protein Clusters)
- This part is dedicated to protein-protein alignments and it includes protein families, protein trees and homologues defined from the protein trees.
The protein families are clusters of proteins. They are defined using the full set of Ensembl proteins plus all the metazoan proteins from SwissProt and SPTrEMBL.
The protein trees are built using the longest transcript of each protein coding Ensembl gene. Homologies are derived from them.
-
General Tables
-
Genomic Alignment Tables
-
Orthologues and Protein Tree Tables
-
Original sequence: AACGCTT
- ortholog_one2one
- apparent_ortholog_one2one
- ortholog_one2many
- ortholog_many2many
- between_species_paralog
- within_species_paralog
- Homo sapiens
- Amniota
- Diptera
- Etc...
-
First peptide sequence: SERCQVVVISIGPISVLSMILDFY
-
Second peptide sequence: SDRCQVLVISILSMIGLDFY
-
First corresponding cigar line: 20MD4M
-
Second corresponding cigar line: 11M5D9M
List of Tables
General Tables
meta
Contains configuration variables.
Field | Type | Null | Key | Default | Extra | Description |
---|---|---|---|---|---|---|
meta_id | int(10) unsigned | NO | PRI | NULL | auto_increment | internal unique ID |
species_id | int(10) unsigned | NO | PRI | 1 | Only used in core databases | |
meta_key | varchar(40) | NO | MUL | |||
meta_value | varchar(255) | NO | MUL |
Meta table stores miscelaneous values:
mysql> SELECT * FROM meta WHERE meta_key = "schema_version";
+---------+------------+----------------+------------+
| meta_id | species_id | meta_key | meta_value |
+---------+------------+----------------+------------+
| 1 | NULL | schema_version | 51 |
+---------+------------+----------------+------------+
This entry defines which API version must be used to access this database.
ncbi_taxa_node
Contains a description of the taxonomic relathionships between all the taxa used in this database. It is usually read together with the ncbi_taxa_name table
Field | Type | Null | Key | Default | Extra | Description |
---|---|---|---|---|---|---|
taxon_id | int(10) unsigned | NO | PRI | the NCBI taxonomy ID | ||
parent_id | int(10) unsigned | NO | MUL | the parent taxonomy ID for this node (refers to ncbi_taxa_node.taxon_id) | ||
rank | char(32) | NO | MUL | e.g. kingdom, family, genus, etc. | ||
genbank_hidden_flag | tinyint(1) | NO | 0 | boolean value which defines whether this rank is used or not in the abbreviated lineage | ||
left_index | int(10) | NO | sub-set left index. All sub-nodes have left_index and right_index values larger than this left_index | |||
right_index | int(10) | NO | sub-set right index. All sub-nodes have left_index and right_index values smaller than this right_index | |||
root_id | int(10) | 1 | the root taxonomy ID for this node (refers to ncbi_taxa_node.taxon_id) |
(c.f. ncbi_taxa_name for examples)
ncbi_taxa_name
Contains descriptions the taxonimc nodes defined in the ncbi_taxa_node table.
Field | Type | Null | Key | Default | Extra | Description |
---|---|---|---|---|---|---|
taxon_id | int(10) unsigned | NO | MUL | external reference to ncbi_taxa_node.taxon_id | ||
name | varchar(255) | YES | MUL | NULL | information assigned to this taxon_id | |
name_class | varchar(255) | YES | MUL | NULL | type of information. e.g. common name, genbank_synonym, scientif name, etc. |
Here is an example on how to get the taxonomic ID for a species:
mysql> SELECT * FROM ncbi_taxa_name WHERE name_class = "scientific name" AND name = "Homo sapiens"; +----------+--------------+-----------------+ | taxon_id | name | name_class | +----------+--------------+-----------------+ | 9606 | Homo sapiens | scientific name | +----------+--------------+-----------------+
Here is another example on how to get the lineage for the hominidae family:
mysql> SELECT * FROM ncbi_taxa_node WHERE left_index > 351172 AND left_index < 351217; +----------+-----------+---------+---------------------+------------+-------------+---------+ | taxon_id | parent_id | rank | genbank_hidden_flag | left_index | right_index | root_id | +----------+-----------+---------+---------------------+------------+-------------+---------+ | 40017 | 40016 | genus | 0 | 351187 | 351198 | 1 | | 40018 | 40017 | species | 1 | 351190 | 351191 | 1 | | 119900 | 40016 | genus | 0 | 351199 | 351216 | 1 | | 119901 | 119900 | species | 1 | 351214 | 351215 | 1 | | 119902 | 119900 | species | 1 | 351208 | 351209 | 1 | | 119903 | 40017 | species | 1 | 351196 | 351197 | 1 | | 121093 | 40017 | species | 1 | 351192 | 351193 | 1 | | 178106 | 119900 | species | 1 | 351212 | 351213 | 1 | | 323204 | 40016 | genus | 0 | 351177 | 351186 | 1 | | 323205 | 323204 | species | 1 | 351184 | 351185 | 1 | | 323207 | 323204 | species | 1 | 351178 | 351179 | 1 | | 323208 | 323204 | species | 1 | 351182 | 351183 | 1 | | 323209 | 119900 | species | 1 | 351206 | 351207 | 1 | | 323210 | 119900 | species | 1 | 351204 | 351205 | 1 | | 323211 | 119900 | species | 1 | 351202 | 351203 | 1 | | 323212 | 40017 | species | 1 | 351194 | 351195 | 1 | | 323213 | 40017 | species | 1 | 351188 | 351189 | 1 | | 323220 | 147820 | species | 1 | 351174 | 351175 | 1 | | 323222 | 323204 | species | 1 | 351180 | 351181 | 1 | | 323223 | 119900 | species | 1 | 351210 | 351211 | 1 | | 376202 | 119900 | species | 1 | 351200 | 351201 | 1 | +----------+-----------+---------+---------------------+------------+-------------+---------+
genome_db
Contains information about the version of the genome assemblies used in this database.
Field | Type | Null | Key | Default | Extra | Description |
---|---|---|---|---|---|---|
genome_db_id | int(10) unsigned | NO | PRI | NULL | auto_increment | internal unique ID |
taxon_id | int(10) unsigned | NO | MUL | 0 | external reference to ncbi_taxa_node.taxon_id | |
name | varchar(40) | NO | MUL | name of the species | ||
assembly | varchar(100) | NO | assembly version of the genome | |||
assembly_default | tinyint(1) | YES | 1 | boolean value describing if this assembly is the default one or not, so that we can handle more than one assempbly version for a given species. | ||
genebuild | varchar(100) | NO | version of the genebuild | |||
locator | varchar(255) | YES | NULL | used for production purposes or for user configuration in in-house installation. |
Eg the rows:
mysql> SELECT * FROM genome_db WHERE name IN ("Homo sapiens", "Gallus gallus"); +--------------+----------+---------------+----------+------------------+-----------------+---------+ | genome_db_id | taxon_id | name | assembly | assembly_default | genebuild | locator | +--------------+----------+---------------+----------+------------------+-----------------+---------+ | 42 | 9031 | Gallus gallus | WASHUC2 | 1 | 2006-08-Ensembl | | | 22 | 9606 | Homo sapiens | NCBI36 | 1 | 2006-12-Ensembl | | +--------------+----------+---------------+----------+------------------+-----------------+---------+
correspond to the Human and Chicken genomes
species_set
Contains groups or sets of species which are used in the method_link_species_set table.
Field | Type | Null | Key | Default | Extra | Description |
---|---|---|---|---|---|---|
species_set_id | int(10) unsigned | NO | PRI | NULL | auto_increment | internal (non-unique) ID |
genome_db_id | int(10) unsigned | NO | PRI | 0 | external reference to genome_db.genome_db_id |
E.g. the rows
mysql> SELECT * FROM species_set WHERE species_set_id = 171; +----------------+--------------+ | species_set_id | genome_db_id | +----------------+--------------+ | 171 | 3 | | 171 | 22 | +----------------+--------------+
correspond to the species_set for the Human and Rat genomes:
mysql> SELECT species_set_id, name, assembly FROM species_set LEFT JOIN genome_db USING (genome_db_id) WHERE species_set_id = 171; +----------------+-------------------+----------+ | species_set_id | name | assembly | +----------------+-------------------+----------+ | 171 | Rattus norvegicus | RGSC3.4 | | 171 | Homo sapiens | NCBI36 | +----------------+-------------------+----------+
method_link
Contains the list of alignment methods used to find links (homologies) between entities in compara.
Field | Type | Null | Key | Default | Extra | Description |
---|---|---|---|---|---|---|
method_link_id | int(10) unsigned | NO | PRI | NULL | auto_increment | internal unique ID |
type | varchar(50) | NO | MUL | the common name of the linking method between species. | ||
class | varchar(50) | NO | Description of type of data associated with the \"type\" field and the main table to find these data |
Current values are:
mysql> SELECT * FROM method_link; +----------------+--------------------------+---------------------------------------+ | method_link_id | type | class | +----------------+--------------------------+---------------------------------------+ | 1 | BLASTZ_NET | GenomicAlignBlock.pairwise_alignment | | 6 | TRANSLATED_BLAT | GenomicAlignBlock.pairwise_alignment | | 7 | TRANSLATED_BLAT_NET | GenomicAlignBlock.pairwise_alignment | | 101 | SYNTENY | SyntenyRegion.synteny | | 201 | ENSEMBL_ORTHOLOGUES | Homology.homology | | 202 | ENSEMBL_PARALOGUES | Homology.homology | | 301 | FAMILY | Family.family | | 9 | MLAGAN | GenomicAlignBlock.multiple_alignment | | 401 | PROTEIN_TREES | ProteinTree.protein_tree_node | | 204 | ENSEMBL_HOMOLOGUES | Homology.homology | | 10 | PECAN | GenomicAlignBlock.multiple_alignment | | 11 | GERP_CONSTRAINED_ELEMENT | GenomicAlignBlock.constrained_element | | 501 | GERP_CONSERVATION_SCORE | ConservationScore.conservation_score | | 12 | ORTHEUS | GenomicAlignTree.tree_alignment | +----------------+--------------------------+---------------------------------------+
NOTE: We use method_link_ids between 1 and 100 for DNA-DNA alignments, between 101 and 200 for genomic syntenies, between 201 and 300 for protein homologies, between 301 and 400 for protein families and between 401 and 500 for protein trees. Each category corresponds to data stored in different tables
method_link_species_set
This table contains information about the comparisons stored in the database. A given method_link_species_set_id exist for each comparison made and relates a method_link.method_link_id with a set of species (species_set.species_set_id.
Field | Type | Null | Key | Default | Extra | Description |
---|---|---|---|---|---|---|
method_link_species_set_id | int(10) unsigned | NO | PRI | NULL | auto_increment | internal unique ID. |
method_link_id | int(10) unsigned | YES | MUL | NULL | external reference to method_link.method_link_id | |
species_set_id | int(10) unsigned | NO | 0 | external reference to species_set.species_set_id | ||
name | varchar(255) | NO | NULL | human-readable description for this method_link_species_set | ||
source | varchar(255) | NO | ensembl | source of the data. Currently either ensembl or ucsc if data were imported from UCSC | ||
url | varchar(255) | NO | A URL where you can find the orignal data if they were imported. |
E.g. the rows
mysql> SELECT * FROM method_link_species_set WHERE method_link_id = 12; +----------------------------+----------------+----------------+---------------------------+---------+-----+ | method_link_species_set_id | method_link_id | species_set_id | name | source | url | +----------------------------+----------------+----------------+---------------------------+---------+-----+ | 340 | 12 | 31348 | 4 catarrhini primates EPO | ensembl | | | 341 | 12 | 31349 | 9 eutherian mammals EPO | ensembl | | | 342 | 12 | 31350 | 21 eutherian mammals EPO | ensembl | | +----------------------------+----------------+----------------+---------------------------+---------+-----+
correspond to all the EPO alignments in this database.
analysis
This table is mainly used for production purposes.
Field | Type | Null | Key | Default | Extra | Description |
---|---|---|---|---|---|---|
analysis_id | int(10) unsigned | NO | PRI | NULL | auto_increment | internal unique ID |
created | datetime | NO | 0000-00-00 00:00:00 | date to distinguish newer and older versions off the same analysis. Not well maintained so far. | ||
logic_name | varchar(40) | NO | UNI | string to identify the analysis. Used mainly inside pipeline. | ||
db | varchar(120) | YES | NULL | db should be a database name, db version the version of that db | ||
db_version | varchar(40) | YES | NULL | |||
db_file | varchar(120) | YES | NULL | the file system location of that database, probably wiser to generate from just db and configurations | ||
program | varchar(80) | YES | NULL | The binary used to create a feature. Similar semantic to above | ||
program_version | varchar(40) | YES | NULL | |||
program_file | varchar(80) | YES | NULL | |||
parameters | varchar(255) | YES | NULL | a parameter string which is processed by the perl module | ||
module | varchar(80) | YES | NULL | Perl module names (RunnableDBS usually) executing this analysis | ||
module_version | varchar(40) | YES | NULL | |||
gff_source | varchar(40) | YES | NULL | how to make a gff dump from features with this analysis | ||
gff_feature | varchar(40) | YES | NULL |
analysis_description
This table has been added to comply with core Bio::EnsEMBL::DBSQL::AnalysisAdaptor requirements.
Field | Type | Null | Key | Default | Extra | Description |
---|---|---|---|---|---|---|
analysis_id | int(10) unsigned | NO | PRI | external reference to analysis.analysis_id | ||
description | text | YES | NULL | |||
display_label | varchar(255) | YES | NULL | |||
displayable | tinyint(1) | NO | 1 | |||
web_data | text | YES | NULL |
Genomic Alignment Tables
dnafrag
This table defines the genomic sequences used in the comparative genomics analyisis. It is used by the genomic_align_block table to define aligned sequences. It is also used by the dnafrag_region table to define syntenic regions.
Field | Type | Null | Key | Default | Extra | Description |
---|---|---|---|---|---|---|
dnafrag_id | int(10) unsigned | NO | PRI | NULL | auto_increment | internal unique ID |
length | int(11) | NO | 0 | |||
name | varchar(40) | NO | name of the DNA sequence (e.g., the name of the chromosome) | |||
genome_db_id | int(10) unsigned | NO | MUL | 0 | external reference to genome_db.genome_db_id | |
coord_system_name | varchar(40) | YES | NULL | refers to the coord system in which this dnafrag has been defined |
E.g. the row
mysql> SELECT dnafrag.* FROM dnafrag LEFT JOIN genome_db USING (genome_db_id) WHERE dnafrag.name = "14" AND genome_db.name = "Homo sapiens"; +------------+-----------+------+--------------+-------------------+ | dnafrag_id | length | name | genome_db_id | coord_system_name | +------------+-----------+------+--------------+-------------------+ | 905410 | 106368585 | 14 | 22 | chromosome | +------------+-----------+------+--------------+-------------------+
refer to the chromosome 14 of the Human genome (genome_db.genome_db_id = 22 refers to Human genome in this example) which is 106368585 nucleotides long.
genomic_align_block
This table is the key table for the genomic alignments. The software used to align the genomic blocks is refered as an external key to the method_link table. Nevertheless, actual aligned sequences are defined in the genomic_align table.
Tree alignments (EPO alignments) are best accessed through the genomic_align_tree table although the alignments are also indexed in this table. This allows the user to also access the tree alignments as normal multiple alignments.
Field | Type | Null | Key | Default | Extra | Description |
---|---|---|---|---|---|---|
genomic_align_block_id | bigint(20) unsigned | NO | PRI | NULL | auto_increment | internal unique ID |
method_link_species_set_id | int(10) unsigned | NO | MUL | 0 | external reference to method_link_species_set.method_link_species_set_id | |
score | double | YES | NULL | score returned by the homology search program | ||
perc_id | tinyint(3) unsigned | YES | NULL | Used for pairwise comparison. Defines the percentage of identity between both sequences | ||
length | int(10) | YES | NULL | total length of the alignment | ||
group_id | bigint(20) unsigned | YES | NULL | used to group alignments |
E.g. the row
mysql> SELECT * FROM genomic_align_block WHERE genomic_align_block_id = 3400000000010; +------------------------+----------------------------+-------+---------+--------+----------+ | genomic_align_block_id | method_link_species_set_id | score | perc_id | length | group_id | +------------------------+----------------------------+-------+---------+--------+----------+ | 3400000000010 | 340 | NULL | NULL | 2448 | NULL | +------------------------+----------------------------+-------+---------+--------+----------+
will refer to a primates EPO alignment (method_link_species_set.method_link_species_set_id = 340) with a length of 2448 nucleotides. The actual sequences corresponding to this aligment are defined in the genomic_align table.
genomic_align
This table contains the coordinates and all the information needed to rebuild genomic alignments. Every entry corresponds to one of the aligned sequences. It also contains an external key to the method_link_species_set which refers to the software and set of species used for getting the corresponding alignment. The aligned sequence is defined by an external reference to the dnafrag table, the starting and ending position within this dnafrag, the strand and a cigar_line.
The original aligned sequence is not stored but it can be retrieved using the cigar_line field and the original sequence. The cigar line defines the sequence of matches/mismatches and deletions (or gaps). For example, this cigar line 2MD3M2D2M will mean that the alignment contains 2 matches/mismatches, 1 deletion (number 1 is omitted in order to save some space), 3 matches/mismatches, 2 deletions and 2 matches/mismatches. If the original sequence is:
The aligned sequence will be:
M | M | D | M | M | M | D | D | M | M |
---|---|---|---|---|---|---|---|---|---|
A | A | - | C | G | C | - | - | T | T |
Field | Type | Null | Key | Default | Extra | Description |
---|---|---|---|---|---|---|
genomic_align_id | bigint(20) unsigned | NO | PRI | NULL | auto_increment | unique internal id |
genomic_align_block_id | bigint(20) unsigned | NO | MUL | external reference to genomic_align_block.genomic_align_block_id | ||
method_link_species_set_id | int(10) unsigned | NO | 0 | external reference to method_link_species_set.method_link_species_set_id. This information is redundant because it also appears in the genomic_align_block table but it is used to speed up the queries | ||
dnafrag_id | int(10) unsigned | NO | MUL | 0 | external reference to dnafrag.dnafrag_id | |
dnafrag_start | int(10) | NO | 0 | starting position within the dnafrag defined by dnafrag_id | ||
dnafrag_end | int(10) | NO | 0 | ending position within the dnafrag defined by dnafrag_id | ||
dnafrag_strand | tinyint(4) | NO | 0 | strand in the dnafrag defined by dnafrag_id | ||
cigar_line | mediumtext | YES | NULL | internal description of the aligned sequence | ||
level_id | tinyint(2) unsigned | NO | 0 | level of orhologous layer. 1 corresponds to the first layer of orthologous sequences found, 2 and over are addiotional layers. Use for building the syntenies (based on level_id = 1 only) |
E.g. the rows
mysql> SELECT * FROM genomic_align WHERE genomic_align_block_id = 3400000000010; +------------------+------------------------+----------------------------+------------+---------------+-------------+----------------+-----------------------------------------------------------------------+----------+ | genomic_align_id | genomic_align_block_id | method_link_species_set_id | dnafrag_id | dnafrag_start | dnafrag_end | dnafrag_strand | cigar_line | level_id | +------------------+------------------------+----------------------------+------------+---------------+-------------+----------------+-----------------------------------------------------------------------+----------+ | 3400000000019 | 3400000000010 | 340 | 1045571 | 29797288 | 29799670 | 1 | 31MD30M4D234M9D8M2D76M7D812M3D350M6D72MD395M9D28M10D156M6D36M4D155M3D | 1 | | 3400000000020 | 3400000000010 | 340 | 2046338 | 30364209 | 30366482 | 1 | 68M6D102MD133M7D658M4D126MD19M18D787M110D197M4D177M23D7M | 1 | | 3400000000021 | 3400000000010 | 340 | 905386 | 29989810 | 29992066 | 1 | 68M6D102MD133M9D656M4D126MD19M18D787M110D10M10D25M2D150M4D177M23D4M3D | 1 | +------------------+------------------------+----------------------------+------------+---------------+-------------+----------------+-----------------------------------------------------------------------+----------+
correspond to the three sequences included in the alignment described above (see genomic_align_block table description). The first sequence includes the nucleotides from 29797288 to 29799670 in the forward strand of the chromosome 9 of the Rhesus genome (dnafrag.dnafrag_id = 1045571). The second sequence includes the nucleotides from 30364209 to 30366482 in the forward strand of the chromosome 10 of the Chimp genome (dnafrag.dnafrag_id = 2046338). Last, the third sequence includes the nucleotides from 29989810 to 29992066 in the forward strand of the chromosome 10 of the Human genome (dnafrag.dnafrag_id = 905386)
Here is a better way to get this by joining the dnafrag and genome_db tables:
mysql> SELECT genome_db.name, dnafrag.name, dnafrag_start, dnafrag_end, dnafrag_strand str, cigar_line FROM genomic_align LEFT JOIN dnafrag USING (dnafrag_id) LEFT JOIN genome_db USING (genome_db_id) WHERE genomic_align_block_id = 3400000000010; +-----------------+------+---------------+-------------+-----+-----------------------------------------------------------------------+ | name | name | dnafrag_start | dnafrag_end | str | cigar_line | +-----------------+------+---------------+-------------+-----+-----------------------------------------------------------------------+ | Macaca mulatta | 9 | 29797288 | 29799670 | 1 | 31MD30M4D234M9D8M2D76M7D812M3D350M6D72MD395M9D28M10D156M6D36M4D155M3D | | Pan troglodytes | 10 | 30364209 | 30366482 | 1 | 68M6D102MD133M7D658M4D126MD19M18D787M110D197M4D177M23D7M | | Homo sapiens | 10 | 29989810 | 29992066 | 1 | 68M6D102MD133M9D656M4D126MD19M18D787M110D10M10D25M2D150M4D177M23D4M3D | +-----------------+------+---------------+-------------+-----+-----------------------------------------------------------------------+
The aligned sequences can be rebuild using the original sequences fetched from the corresponding core databases and the cigar lines as explained before.
genomic_align_tree
This table is used to index tree alignments, e.g. EPO alignments. These alignments include inferred ancestral sequences. The tree required to index these sequences is stored in this table. This table stores the sturcture of the tree. Each node links to an entry in the genomic_align_group table, which links to one or several entries in the genomic_align table.
Field | Type | Null | Key | Default | Extra | Description |
---|---|---|---|---|---|---|
node_id | bigint(20) unsigned | NO | PRI | NULL | auto_increment | internal unique id |
parent_id | bigint(20) unsigned | NO | MUL | 0 | link to the parent node | |
root_id | bigint(20) unsigned | NO | MUL | 0 | link to the root node | |
left_index | int(10) | NO | MUL | 0 | internal index. See below | |
right_index | int(10) | NO | MUL | 0 | internal index. See below | |
left_node_id | bigint(10) | NO | 0 | link to the node on the left side of this node | ||
right_node_id | bigint(10) | NO | 0 | link to the node on the right side of this node | ||
distance_to_parent | double | NO | 1 | phylogenetic distance between this node and its parent |
Left_index and right_index are used to speed up fetching trees from the database. Any given node has its left_index larger than the left_index of its parent node and its right index smaller than the right_index of its parent node. In other words, all descendent nodes of a given node can be obtained by fetching all the node with a left_index (or right_index or both) between the left_index and the right_index of that node.
E.g. the row
mysql > SELECT * FROM genomic_align_tree WHERE node_id = 3400000000017; +---------------+-----------+---------------+------------+-------------+--------------+---------------+--------------------+ | node_id | parent_id | root_id | left_index | right_index | left_node_id | right_node_id | distance_to_parent | +---------------+-----------+---------------+------------+-------------+--------------+---------------+--------------------+ | 3400000000017 | 0 | 3400000000017 | 33 | 42 | 0 | 0 | 0 | +---------------+-----------+---------------+------------+-------------+--------------+---------------+--------------------+correspond to the root of a tree, because parent_id = 0 and root_id = node_id. In order to fetch all the nodes of this tree, one can use the left_index and right_index values:
mysql > SELECT * FROM genomic_align_tree WHERE left_index >= 33 and left_index <= 42; +---------------+---------------+---------------+------------+-------------+---------------+---------------+--------------------+ | node_id | parent_id | root_id | left_index | right_index | left_node_id | right_node_id | distance_to_parent | +---------------+---------------+---------------+------------+-------------+---------------+---------------+--------------------+ | 3400000000017 | 0 | 3400000000017 | 33 | 42 | 0 | 0 | 0 | | 3400000000018 | 3400000000017 | 3400000000017 | 34 | 39 | 3400000001404 | 3400000000683 | 0 | | 3400000000021 | 3400000000018 | 3400000000017 | 35 | 36 | 3400000001400 | 3400000000686 | 0 | | 3400000000020 | 3400000000018 | 3400000000017 | 37 | 38 | 3400000001402 | 3400000000689 | 0 | | 3400000000019 | 3400000000017 | 3400000000017 | 40 | 41 | 3400000001403 | 3400000000687 | 0 | +---------------+---------------+---------------+------------+-------------+---------------+---------------+--------------------+
The API rebuilds the tree using this information. In this example, the tree is:
3400000000017 (root) +-3400000000018 (internal_node) | +-3400000000021 (leaf) | +-3400000000020 (leaf) +-3400000000019 (leaf)
Please, refer to the genomic_align_group table for more examples on how to access tree alignments
genomic_align_group
This table is used to group genomic_aligns. It is used to support 2X genomes in the EPO alignments. These 2X genomes are split into very small regions. We can use more than one of these regions in one single EPO alignments. This table allows us to link one single genomic_align_tree.node_id with several genomic_align entries.
Field | Type | Null | Key | Default | Extra | Description |
---|---|---|---|---|---|---|
group_id | bigint(20) unsigned | NO | MUL | NULL | auto_increment | external reference to genomic_align_tree.node_id |
type | varchar(40) | NO | This field allow us to group genomic_aligns in several ways (types) | |||
genomic_align_id | bigint(20) unsigned | NO | MUL | external reference to genomic_align.genomic_align_id |
E.g. the rows
mysql> SELECT * FROM genomic_align_group WHERE group_id = 3420000000097; +---------------+------+------------------+ | group_id | type | genomic_align_id | +---------------+------+------------------+ | 3420000000097 | epo | 3420000000328 | | 3420000000097 | epo | 3420000000329 | | 3420000000097 | epo | 3420000000330 | | 3420000000097 | epo | 3420000000331 | | 3420000000097 | epo | 3420000000332 | +---------------+------+------------------+
correspond to several segments from a 2X genome (Mouse lemur in this case) that are linked together in this EPO alignment.
mysql> SELECT group_id, genome_db.name, dnafrag.name, dnafrag_start, dnafrag_end, dnafrag_strand FROM genomic_align_group LEFT JOIN genomic_align USING (genomic_align_id) LEFT JOIN dnafrag USING (dnafrag_id) LEFT JOIN genome_db USING (genome_db_id) WHERE group_id = 3420000000097; +---------------+--------------------+-------------------+---------------+-------------+----------------+ | group_id | name | name | dnafrag_start | dnafrag_end | dnafrag_strand | +---------------+--------------------+-------------------+---------------+-------------+----------------+ | 3420000000097 | Microcebus murinus | GeneScaffold_762 | 110550 | 111895 | 1 | | 3420000000097 | Microcebus murinus | GeneScaffold_358 | 15752 | 15933 | 1 | | 3420000000097 | Microcebus murinus | GeneScaffold_613 | 199673 | 200191 | 1 | | 3420000000097 | Microcebus murinus | GeneScaffold_3358 | 1003 | 4254 | -1 | | 3420000000097 | Microcebus murinus | GeneScaffold_613 | 200192 | 204424 | 1 | +---------------+--------------------+-------------------+---------------+-------------+----------------+
conservation_score
Contains conservation scores calculated from the whole-genome multiple alignments stored in the genomic_align_block table.
Field | Type | Null | Key | Default | Extra | Description | |
---|---|---|---|---|---|---|---|
genomic_align_block_id | bigint(20) unsigned | NO | MUL | external reference to genomic_align_block.node_id | |||
window_size | smallint(5) unsigned | NO | The scores are stored at different resolution levels. This column defines the window size used to calculate the average score. | ||||
position | int(10) unsigned | NO | |||||
expected_score | blob | YES | MUL | expected score | |||
diff_score | blob | YES | MUL | diff_score is teh difference between the expected and observed variation, i.e. the conservation score. |
Several scores are stored per row. expected_score and diff_score are binary columns and you need to use the Perl API to access these data.
constrained_element
Contains constrained elements calculated from the whole-genome multiple alignments stored in the genomic_align_block table.
Field | Type | Null | Key | Default | Extra | Description |
---|---|---|---|---|---|---|
constrained_element_id | bigint(20) unsigned | NO | MUL | internal ID | ||
dnafrag_id | int(12) unsigned | NO | MUL | external reference to dnafrag.dnafrag_id | ||
dnafrag_start | int(12) | NO | start of the constrained element | |||
dnafrag_end | int(12) | NO | end of the constrained element | |||
method_link_species_set_id | int(10) unsigned | NO | MUL | external reference to method_link_species_set.method_link_species_set_id | ||
p_value | mediumtext | YES | NULL | p-value derived from Gerp | ||
taxonomic_level | mediumtext | YES | NULL | indicates the taxonomic range of the alignments from which the constrained elements were derived | ||
score | double | NO | 0 | score derived from Gerp |
E.g. the row
mysql> select * from constrained_element limit 1; +------------------------+------------+---------------+-------------+----------------------------+---------+---------------------+-------+ | constrained_element_id | dnafrag_id | dnafrag_start | dnafrag_end | method_link_species_set_id | p_value | taxonomic_level | score | +------------------------+------------+---------------+-------------+----------------------------+---------+---------------------+-------+ | 3390000000001 | 905404 | 114071407 | 114071438 | 339 | NULL | amniota vertebrates | 27.5 | +------------------------+------------+---------------+-------------+----------------------------+---------+---------------------+-------+
refers to a constrained element from human chromosome 7 located between nucleotide positions 114071407 and 114071438.
synteny_region
Contains all the syntenic relationships found and the relative orientation of both syntenic regions.
Field | Type | Null | Key | Default | Extra | Description |
---|---|---|---|---|---|---|
synteny_region_id | int(10) unsigned | NO | PRI | NULL | auto_increment | internal unique ID |
method_link_species_set_id | int(10) unsigned | NO | MUL | external reference to method_link_species_set.method_link_species_set_id. |
E.g. the row
mysql> SELECT * FROM synteny_region WHERE synteny_region_id = 27845; +-------------------+----------------------------+ | synteny_region_id | method_link_species_set_id | +-------------------+----------------------------+ | 27845 | 10003 | +-------------------+----------------------------+
means that the syntenic region 27845 corresponds to a synteny relationship between the Human and Rat genomes. See dnafrag_region table for more details.
dnafrag_region
Contains the genomic regions corresponding to every synteny relationship found. There are two genomic regions for every synteny relationship.
Field | Type | Null | Key | Default | Extra | Description |
---|---|---|---|---|---|---|
synteny_region_id | int(10) unsigned | NO | MUL | 0 | external reference to synteny_region.synteny_region_id | |
dnafrag_id | int(10) unsigned | NO | MUL | 0 | external reference to dnafrag.dnafrag_id | |
dnafrag_start | int(10) unsigned | NO | 0 | first nucleotide from this dnafrag which is in synteny | ||
dnafrag_end | int(10) unsigned | NO | 0 | last nucleotide from this dnafrag which is in synteny | ||
dnafrag_strand | tinyint(4) | NO | 0 | strand of this region |
E.g. the rows
mysql> SELECT * FROM dnafrag_region WHERE synteny_region_id = 27845; +-------------------+------------+---------------+-------------+----------------+ | synteny_region_id | dnafrag_id | dnafrag_start | dnafrag_end | dnafrag_strand | +-------------------+------------+---------------+-------------+----------------+ | 27845 | 175248 | 498 | 12057776 | -1 | | 27845 | 905405 | 115311888 | 125710881 | 1 | +-------------------+------------+---------------+-------------+----------------+
Here is a better way to get this by joining the dnafrag and genome_db tables:
A
mysql> SELECT genome_db.name, dnafrag.name, dnafrag_start, dnafrag_end, dnafrag_strand FROM dnafrag_region LEFT JOIN dnafrag USING (dnafrag_id) LEFT JOIN genome_db USING (genome_db_id) WHERE synteny_region_id = 27845; +-------------------+------+---------------+-------------+----------------+ | name | name | dnafrag_start | dnafrag_end | dnafrag_strand | +-------------------+------+---------------+-------------+----------------+ | Rattus norvegicus | X | 498 | 12057776 | -1 | | Homo sapiens | X | 115311888 | 125710881 | 1 | +-------------------+------+---------------+-------------+----------------+
correspond to both genomic regions of the synteny region 27845. In this case, the first genomic region corresponds to the negative strand of the sequence from 498 to 12057776 of the chromosome X of the Rat genome (dnafrag_id = 175248 for this chromosome) and the second one corresponds to the sequence from 115311888 to 125710881 of the chromosome X of the Human genome (dnafrag_id = 905405 for this chromosome).
Tables For Orthologues and Protein Clusters
member
This table links sequences to the EnsEMBL core DB or to external DBs.
Field | Type | Null | Key | Default | Extra | Description |
---|---|---|---|---|---|---|
member_id | int(10) unsigned | NO | PRI | NULL | auto_increment | internal unique ID |
stable_id | varchar(40) | NO | MUL | EnsEMBL stable ID or external ID (for Uniprot/SWISSPROT and Uniprot/SPTREMBL) | ||
version | int(10) | YES | 0 | version of the stable ID (see EnsEMBL core DB) | ||
source_name | varchar(40) | NO | MUL | describe the source of the member (Uniprot/SWISSPROT, Uniprot/SPTREMBL, ENSEMBLGENE, ENSEMBLPEP) | ||
taxon_id | int(10) unsigned | NO | MUL | external reference to taxon.taxon_id | ||
genome_db_id | int(10) unsigned | YES | MUL | NULL | external reference to genome_db.genome_db_id | |
sequence_id | int(10) unsigned | YES | MUL | NULL | external reference to sequence.sequence_id. May be 0 when the sequence is not available in the sequence table, e.g. for a gene instance. |
|
gene_member_id | int(10) unsigned | YES | MUL | NULL | external reference to member.memebr_id to allow linkage from peptides to genes. | |
description | text | YES | NULL | the description of the gene/protein as described in the core database or from the Uniprot entry | ||
chr_name | char(40) | YES | NULL | chromosome where this sequence is located | ||
chr_start | int(10) | YES | NULL | first nucleotide of this chromosome which corresponds to this member | ||
chr_end | int(10) | YES | NULL | last nucleotide of this chromosome which corresponds to this member | ||
chr_strand | tinyint(1) | NO | strand of the chromosome in which the member is | |||
display_label | varchar(128) | YES | NULL |
E.g. the row
mysql> SELECT * FROM member WHERE stable_id = "ENSP00000309431"; +-----------+-----------------+---------+-------------+----------+--------------+-------------+----------------+-... | member_id | stable_id | version | source_name | taxon_id | genome_db_id | sequence_id | gene_member_id | +-----------+-----------------+---------+-------------+----------+--------------+-------------+----------------+-... | 41115 | ENSP00000309431 | 3 | ENSEMBLPEP | 9606 | 22 | 22932 | 41114 | +-----------+-----------------+---------+-------------+----------+--------------+-------------+----------------+-... ...+------------------------------------------------------------------------------+----------+-----------+---------+------------+---------------+ | description | chr_name | chr_start | chr_end | chr_strand | display_label | ...+------------------------------------------------------------------------------+----------+-----------+---------+------------+---------------+ | Transcript:ENST00000308911 Gene:ENSG00000173213 Chr:18 Start:37358 End:39626 | 18 | 37390 | 39557 | -1 | NULL | ...+------------------------------------------------------------------------------+----------+-----------+---------+------------+---------------+
refers to the human (taxon_id = 9606 or genome_db_id = 22) peptide ENSP00000309431 which is located in the chromosome 18 (from 37390 to 39557, in the reverse strand). This peptide is described as "Transcript:ENST00000308911 Gene:ENSG00000173213 Chr:18 Start:37358 End:39626" and the sequence can be found on the sequence table.
sequence
This table contains the protein sequences present in the member table used in the protein alignment part of the EnsEMBL Compara DB.
Field | Type | Null | Key | Default | Extra | Description |
---|---|---|---|---|---|---|
sequence_id | int(10) unsigned | NO | PRI | NULL | auto_increment | internal unique ID |
length | int(10) | NO | the length of the sequence | |||
sequence | longtext | NO | MUL | the sequence |
E.g. the row
mysql> SELECT * FROM sequence WHERE sequence_id = 118; +-------------+--------+----------------------------------------------+ | sequence_id | length | sequence | +-------------+--------+----------------------------------------------+ | 118 | 44 | FGLLPQHPESPFKSFLRVVPRHRITVRRVILAPFFLSSCAYELP | +-------------+--------+----------------------------------------------+
contains a 44 aminoacids long sequence.
peptide_align_feature
This tables stores the raw HSP local alignment results of peptide to peptide alignments returned by a BLAST run it is translated from a FeaturePair object.
Field | Type | Null | Key | Default | Extra | Description |
---|---|---|---|---|---|---|
peptide_align_feature_id | int(10) unsigned | NO | PRI | NULL | auto_increment | internal unique ID |
qmember_id | int(10) unsigned | NO | MUL | external reference to member.member_id for the query peptide | ||
hmember_id | int(10) unsigned | NO | MUL | external reference to member.member_id for the hit peptide | ||
qgenome_db_id | int(10) unsigned | NO | MUL | external reference to genome_db.genome_db_id for the query peptide (for query optimization) | ||
hgenome_db_id | int(10) unsigned | NO | MUL | external reference to genome_db.genome_db_id for the hit peptide (for query optimization) | ||
analysis_id | int(10) unsigned | NO | MUL | external reference to analysis.analyis_id | ||
qstart | int(10) | NO | 0 | starting position in the query peptide sequence | ||
qend | int(10) | NO | 0 | ending position in the query peptide sequence | ||
hstart | int(11) | NO | 0 | starting position in the hit peptide sequence | ||
hend | int(11) | NO | 0 | ending position in the hit peptide sequence | ||
score | double(16,4) | NO | 0.0000 | blast score for this HSP | ||
evalue | double | YES | NULL | blast evalue for this HSP | ||
align_length | int(10) | YES | NULL | alignment length of HSP | ||
identical_matches | int(10) | YES | NULL | blast HSP match score | ||
perc_ident | int(10) | YES | NULL | percent identical matches in the HSP length | ||
positive_matches | int(10) | YES | NULL | blast HSP positive score | ||
perc_pos | int(10) | YES | NULL | precent positive matches in the HSP length | ||
hit_rank | int(10) | YES | NULL | rank in blast result | ||
cigar_line | mediumtext | YES | NULL | cigar string coding the actual alignment |
E.g. the rows:
mysql> SELECT * FROM peptide_align_feature_homo_sapiens_22 WHERE peptide_align_feature_id = 139843937; +--------------------------+------------+------------+---------------+---------------+-... | peptide_align_feature_id | qmember_id | hmember_id | qgenome_db_id | hgenome_db_id | +--------------------------+------------+------------+---------------+---------------+-... | 139843937 | 439410 | 439488 | 22 | 22 | +--------------------------+------------+------------+---------------+---------------+-... ...+-------------+--------+------+--------+------+----------+---------+--------------+... | analysis_id | qstart | qend | hstart | hend | score | evalue | align_length | ...+-------------+--------+------+--------+------+----------+---------+--------------+... | 67 | 1 | 124 | 1 | 124 | 649.0000 | 7.2e-65 | 124 | ...+-------------+--------+------+--------+------+----------+---------+--------------+... ...+-------------------+------------+------------------+----------+----------+------------+ | identical_matches | perc_ident | positive_matches | perc_pos | hit_rank | cigar_line | ...+-------------------+------------+------------------+----------+----------+------------+ | 123 | 99 | 123 | 99 | 2 | | ...+-------------------+------------+------------------+----------+----------+------------+
corresponds to a particular hit found between two human protiens:
mysql> SELECT m1.stable_id, m2.stable_id, score, evalue FROM peptide_align_feature_homo_sapiens_22 LEFT JOIN member m1 ON (qmember_id = m1.member_id) LEFT JOIN member m2 ON (hmember_id = m2.member_id) WHERE peptide_align_feature_id = 139843937; +-----------------+-----------------+----------+---------+ | stable_id | stable_id | score | evalue | +-----------------+-----------------+----------+---------+ | ENSP00000320207 | ENSP00000350502 | 649.0000 | 7.2e-65 | +-----------------+-----------------+----------+---------+
homology
Contains all the genomic homologies found. There are two homology_member entries for each homology entry for now, but both the schema and the API can handle more than just pairwise relations.
Field | Type | Null | Key | Default | Extra | Description |
---|---|---|---|---|---|---|
homology_id | int(10) unsigned | NO | PRI | NULL | auto_increment | internal unique ID |
stable_id | varchar(40) | YES | NULL | stable ID of the pairwise homology relationship | ||
method_link_species_set_id | int(10) unsigned | MUL | external reference to method_link_species_set.method_link_species_set_id | |||
description | varchar(40) | YES | NULL | describes the type of homology found:
|
||
subtype | varchar(40) | NO | gives the taxonomy name of the latest common taxonomic ancestor for the considered homology e.g.
|
|||
dn | float(10,5) | YES | NULL | number of nonsynonymous substitutions per nonsynonymous site | ||
ds | float(10,5) | YES | NULL | number of synonymous substitutions per synonymous site | ||
n | float(10,1) | YES | NULL | number of nonsynonymous substitutions | ||
s | float(10,1) | YES | NULL | number of synonymous substitutions | ||
lnl | float(10,3) | YES | NULL | maximum likelihood test value | ||
threshold_on_ds | float(10,5) | YES | NULL | used by the EnsEMBL Web Browser to decide whether or not to display dN/DS ratio | ||
ancestor_node_id | int(10) unsigned | NO | external reference to protein_tree_node.node_id that represents the ancestor node between the 2 genes involved in the homology relation | |||
tree_node_id | int(10) unsigned | NO | MUL | external reference to protein_tree_node.node_id that represents the root node of the tree that relates the 2 genes involved in the homology relation |
dN, dS, N, S and lnL are statistical values given by the codeml program of the Phylogenetic Analysis by Maximum Likelihood (PAML) package.
E.g. the row
mysql> SELECT * FROM homology WHERE homology_id = 4650; +-------------+-----------+----------------------------+------------------------+--------------+------+------+------+------+------+-----------------+------------------+--------------+ | homology_id | stable_id | method_link_species_set_id | description | subtype | dn | ds | n | s | lnl | threshold_on_ds | ancestor_node_id | tree_node_id | +-------------+-----------+----------------------------+------------------------+--------------+------+------+------+------+------+-----------------+------------------+--------------+ | 4650 | NULL | 20239 | within_species_paralog | Homo sapiens | NULL | NULL | NULL | NULL | NULL | NULL | 706673 | 10159 | +-------------+-----------+----------------------------+------------------------+--------------+------+------+------+------+------+-----------------+------------------+--------------+
defines a pair of within-species paralogues. See homology_member for more details
homology_member
Contains the sequences corresponding to every genomic homology relationship found. There are two homology_member entries for each pairwise homology entry. As written in the homology table section, both schema and API can deal with more than pairwise relationships.
The original alignment is not stored but it can be retrieved using the cigar_line field and the original sequences. The cigar line defines the sequence of matches or mismatches and deletions in the alignment.
The alignment will be:
First peptide cigar line | M | M | M | M | M | M | M | M | M | M | M | M | M | M | M | M | M | M | M | M | D | M | M | M | M |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
First aligned peptide | S | E | R | C | Q | V | V | V | I | S | I | G | P | I | S | V | L | S | M | I | - | L | D | F | Y |
Second aligned peptide | S | D | R | C | Q | V | L | V | I | S | I | - | - | - | - | - | L | S | M | I | G | L | D | F | Y |
Second peptide cigar line | M | M | M | M | M | M | M | M | M | M | M | D | D | D | D | D | M | M | M | M | M | M | M | M | M |
Field | Type | Null | Key | Default | Extra | Description |
---|---|---|---|---|---|---|
homology_id | int(10) unsigned | NO | PRI | external reference to homology.homology_id | ||
member_id | int(10) unsigned | NO | PRI | external reference to member.member_id. It refers to the corresponding gene (ENSEMBL_GENE). | ||
peptide_member_id | int(10) unsigned | YES | MUL | NULL | external reference to member.member_id. It refers to the peptide/protein (ENSEMBL_PEP). | |
peptide_align_feature_id | int(10) unsigned | YES | MUL | NULL | external reference to peptide_align_feature.peptide_align_feature_id | |
cigar_line | mediumtext | YES | NULL | an internal description of the alignment. It contains mathces/mismatches (M) and delations (D)and refers to the corresponding peptide_member_id sequence. | ||
cigar_start | int(10) | YES | NULL | defines the first aligned aminoacid | ||
cigar_end | int(10) | YES | NULL | defines the last aligned aminoacid | ||
perc_cov | int(10) | YES | NULL | defines the percentage of the peptide which has been aligned | ||
perc_id | int(10) | YES | NULL | defines the percentage of identity between both homologues | ||
perc_pos | int(10) | YES | NULL | defines the percentage of positivity (similarity) between both homologues |
E.g. the rows
mysql> SELECT * FROM homology_member WHERE homology_id = 4650; +-------------+-----------+-------------------+--------------------------+------------+-------------+-----------+----------+---------+----------+ | homology_id | member_id | peptide_member_id | peptide_align_feature_id | cigar_line | cigar_start | cigar_end | perc_cov | perc_id | perc_pos | +-------------+-----------+-------------------+--------------------------+------------+-------------+-----------+----------+---------+----------+ | 4650 | 805594 | 805595 | NULL | 387M | NULL | NULL | 100 | 96 | 97 | | 4650 | 808352 | 808461 | NULL | 387M | NULL | NULL | 100 | 96 | 97 | +-------------+-----------+-------------------+--------------------------+------------+-------------+-----------+----------+---------+----------+
refer to the two homologue sequences defined by the homology.homology_id 4650. The gene corresponding to the first sequence can be retrieved using the member.member_id 805594 and the corresponding peptide using the member.member_id 805595. Gene and peptide sequence of the second homologue can retrieved in the same way.
family
Contains all the group homologies found. There are several family_member entries for each family entry.
Field | Type | Null | Key | Default | Extra | Description |
---|---|---|---|---|---|---|
family_id | int(10) unsigned | NO | PRI | NULL | auto_increment | internal unique ID |
stable_id | varchar(40) | NO | UNI | stable family ID | ||
method_link_species_set_id | int(10) unsigned | NO | MUL | external reference to method_link_species_set.method_link_species_set_id | ||
description | varchar(255) | YES | MUL | NULL | description of the family as found using the Longest Common String (LCS) of the descriptions of the member proteins. | |
description_score | double | YES | NULL | Scores the accuracy of the annotation (max. 100) |
E.g. the row
mysql> SELECT * FROM family WHERE family_id = 30208; +-----------+-------------------+----------------------------+--------------+-------------------+ | family_id | stable_id | method_link_species_set_id | description | description_score | +-----------+-------------------+----------------------------+--------------+-------------------+ | 30208 | fam51v00000030208 | 30017 | CATHELICIDIN | 100 | +-----------+-------------------+----------------------------+--------------+-------------------+
defines a family of "CATHELICIDIN" proteins. The score of the description is 100.
NOTE: stable_id are currently not stable. We are working in getting IDs stable between releases.
family_member
Contains the proteins corresponding to protein family relationship found. There are several family_member entries for each family entry.
Field | Type | Null | Key | Default | Extra | Description |
---|---|---|---|---|---|---|
family_id | int(10) unsigned | NO | PRI | external reference to family.family_id | ||
member_id | int(10) unsigned | NO | PRI | external reference to member.member_id | ||
cigar_line | mediumtext | YES | NULL | internal description of the multiple alignment (see homology_member table) |
E.g. the rows
mysql> SELECT * FROM family_member WHERE family_id = 30208; +-----------+-----------+------------+ | family_id | member_id | cigar_line | +-----------+-----------+------------+ | 30208 | 2221971 | 148M | | 30208 | 2241749 | 11D137M | | 30208 | 2241750 | 148M | | 30208 | 2258940 | 132M4D12M | +-----------+-----------+------------+
refer to the four members of the protein family 30208. The proteins can be retieved using the member_ids. The multiple alignment can be restored using the cigar_lines.
domain
Not used by now
Field | Type | Null | Key | Default | Extra | Description |
---|---|---|---|---|---|---|
domain_id | int(10) unsigned | PRI | NULL | auto_increment | internal unique ID | |
stable_id | varchar(40) | |||||
method_link_species_set | int(10) unsigned | 0 | external reference to method_link_species_set.method_link_species_set_id | |||
description | varchar(255) | YES | NULL |
domain_member
Not used by now
Field | Type | Null | Key | Default | Extra | Description |
---|---|---|---|---|---|---|
domain_id | int(10) unsigned | MUL | 0 | |||
member_id | int(10) unsigned | MUL | 0 | |||
member_start | int(10) | YES | NULL | |||
member_end | int(10) | YES | NULL |
protein_tree_node
Contains the data structure of each tree.
Field | Type | Null | Key | Default | Extra | Description |
---|---|---|---|---|---|---|
node_id | int(10) unsigned | NO | PRI | NULL | auto_increment | internal unique ID |
parent_id | int(10) unsigned | NO | MUL | link to the parent node | ||
root_id | int(10) unsigned | MUL | 0 | link to the root node | ||
left_index | int(10) | MUL | 0 | internal index. See below | ||
right_index | int(10) | MUL | 0 | internal index. See below | ||
distance_to_parent | double | 1 | phylogenetic distance between this node and its parent |
When parent_id=0, then node_id is the root. In our data structure, all trees are arbitrarily connected to the same root. This allows storing in the same database the data from independant tree building analysis. Hence the "biological root" of each tree are the children nodes of the main root. For instance the following SQL query
mysql> SELECT COUNT(*) FROM protein_tree_node WHERE parent_id = root_id; +----------+ | COUNT(*) | +----------+ | 27514 | +----------+
will return the number of independant trees stored in the database.
protein_tree_member
Contains the data about the leaf (joining the table with a member_id from the member table, all ENSEMBLPEP members)
present in each tree. The method_link_species_set should link to an
entry that has a method_link.type="PROTEIN_TREES". For the aligment,
the cigar_* column hold the needed information to rebuild the alignment
using the Perl API, as it is done in the family/family_member tables.
As we store global multiple aligment you will notice that cigar_start
and cigar_end are always NULL. The table however is able to local
multiple alignment, then cigar_start/cigar_end should be set to their
correct value.
Field | Type | Null | Key | Default | Extra | Description |
---|---|---|---|---|---|---|
node_id | int(10) unsigned | NO | PRI | external reference to protein_tree_node.node_id | ||
member_id | int(10) unsigned | NO | MUL | external reference to member.member_id. Should always link to a protein (usually member.source_name='ENSEMBLPEP') | ||
method_link_species_set_id | int(10) unsigned | NO | external reference to method_link_species_set.method_link_species_set_id | |||
cigar_line | mediumtext | YES | NULL | |||
cigar_start | int(10) | YES | NULL | |||
cigar_end | int(10) | YES | NULL |
protein_tree_tag
Contains several tag/value data attached to node_ids
Field | Type | Null | Key | Default | Extra | Description |
---|---|---|---|---|---|---|
node_id | int(10) unsigned | NO | MUL | external reference to protein_tree_node.node_id | ||
tag | varchar(50) | YES | MUL | NULL | ||
value | mediumtext | YES | NULL |
sitewise_aln
Contains site-wise omega values found in the multiple alignments underlining the protein trees.
Field | Type | Null | Key | Default | Extra | Description |
---|---|---|---|---|---|---|
sitewise_id | int(10) unsigned | NO | PRI | auto_increment | internal unique ID | |
aln_position | int(10) unsigned | NO | MUL | |||
node_id | int(10) unsigned | NO | MUL | |||
tree_node_id | int(10) unsigned | NO | ||||
omega | float(10,5) | YES | NULL | |||
omega_lower | float(10,5) | YES | NULL | |||
omega_upper | float(10,5) | YES | NULL | |||
threshold_on_branch_ds | float(10,5) | YES | NULL | |||
type | varchar(10) | NO |