EnsEMBL Core Schema Documentation
Introduction
This document gives a high-level description of the tables that make up the EnsEMBL core schema. Tables are grouped into logical groups, and the purpose of each table is explained. It is intended to allow people to familiarise themselves with the schema when encountering it for the first time, or when they need to use some tables that they've not used before. Note that while some of the more important columns in some of the tables are discussed, this document makes no attempt to enumerate all of the names, types and contents of every single table. Some concepts which are referred to in the table descriptions are given at the end of this document; these are linked to from the table description where appropriate.
Different tables are populated throughout the gene build process:
Step | Process |
---|---|
0 | Create empty schema, populate meta table |
1 | Load DNA - populates dna, clone, contig, chromosome, assembly tables |
2 | Analyze DNA (raw computes) - populates genomic feature/analysis tables |
3 | Build genes - populates exon, transcript and other gene-related tables |
4a | Analyze genes - populates protein_feature, xref and interpro tables |
4b | ID mapping |
This document refers to version 54 of the EnsEMBL core schema.
Quick links to tables:
Fundamental tables
- assembly
- assembly_exception
- attrib_type
- coord_system
- dna
- dnac
- exon
- exon_stable_id
- exon_transcript
- gene
- gene_stable_id
- karyotype
- meta
- meta_coord
- prediction_exon
- prediction_transcript
- seq_region
- seq_region_attrib
- supporting_feature
- transcript
- transcript_attrib
- transcript_stable_id
- translation
- translation_attrib
- translation_stable_id
Features and analyses
- alt_allele
- analysis
- analysis_description
- density_feature
- density_type
- dna_align_feature
- map
- marker
- marker_feature
- marker_map_location
- marker_synonym
- misc_attrib
- misc_feature
- misc_feature_misc_set
- misc_set
- prediction_transcript
- protein_align_feature
- protein_feature
- qtl
- qtl_feature
- qtl_synonym
- repeat_consensus
- repeat_feature
- simple_feature
ID Mapping (Tables involved in mapping identifiers between releases)
External references (Tables used for storing links to and details about objects that are stored in other databases)
Miscellaneous (Tables that don't fit anywhere else.)
Fundamental tables
seq_region
Stores information about sequence regions. The primary key is used as a pointer into the dna table so that actual sequence can be obtained, and the coord_system_id allows sequence regions of multiple types to be stored. Clones, contigs and chromosomes are all now stored in the seq_region table. Contigs are stored with the co-ordinate system 'contig'. The relationships between contigs and clones, between contigs and chromosomes, and between contigs and supercontigs are all stored in the assembly table.
See also:
Tables:
- dna - 1:1 relationship to the dna table.
- coord_system - Describes which co-ordinates a particular feature is stored in.
coord_system
Stores information about the available co-ordinate systems for the species identified through the species_id field. Note that for each species, there must be one co-ordinate system that has the attribute "top_level" and one that has the attribute "sequence_level".
See also:
Tables:
- seq_region - Has coord_system_id foreign key to allow joins with the coord_system table.
- meta - Holds meta information about each of the species in a database, no matter if it's a multi-species or single-species database.
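As a rough illustration of how seq_region and coord_system fit together, the following Python sketch builds two heavily simplified tables in an in-memory SQLite database and joins them on coord_system_id. Only coord_system_id, species_id and the "top_level" and "sequence_level" attributes come from the descriptions above. The other column names (name, length, attrib), the helper regions_in and all sample rows are assumptions made for the example, and the real tables have more columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE coord_system (
    coord_system_id INTEGER PRIMARY KEY,
    species_id      INTEGER NOT NULL DEFAULT 1,
    name            TEXT NOT NULL,
    attrib          TEXT            -- assumed column for 'top_level' / 'sequence_level'
);
CREATE TABLE seq_region (
    seq_region_id   INTEGER PRIMARY KEY,
    name            TEXT NOT NULL,
    coord_system_id INTEGER NOT NULL REFERENCES coord_system(coord_system_id),
    length          INTEGER NOT NULL
);
INSERT INTO coord_system VALUES (1, 1, 'chromosome', 'top_level');
INSERT INTO coord_system VALUES (2, 1, 'contig', 'sequence_level');
INSERT INTO seq_region VALUES (10, 'X', 1, 5000);
INSERT INTO seq_region VALUES (20, 'contig_a', 2, 1200);
INSERT INTO seq_region VALUES (21, 'contig_b', 2, 900);
""")

def regions_in(conn, coord_system_name, species_id=1):
    """Return (name, length) for every seq_region in the named co-ordinate system."""
    rows = conn.execute(
        """
        SELECT sr.name, sr.length
        FROM seq_region AS sr
        JOIN coord_system AS cs USING (coord_system_id)
        WHERE cs.name = ? AND cs.species_id = ?
        ORDER BY sr.name
        """,
        (coord_system_name, species_id),
    )
    return rows.fetchall()

contigs = regions_in(conn, "contig")
```

Because every sequence region carries a coord_system_id, the same query shape works for chromosomes, clones or contigs; only the co-ordinate system name changes.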
seq_region_attrib
Allows "attributes" to be defined for certain seq_regions. Provides a way of storing extra information about particular seq_regions without adding extra columns to the seq_region table.
See also:
Tables:
- seq_region -
- attrib_type - Provides codes, names and descriptions of attribute types.
attrib_type
Provides codes, names and descriptions of attribute types.
See also:
Tables:
- seq_region_attrib - Associates seq_regions with attributes.
dna
Contains DNA sequence. This table has a 1:1 relationship with the seq_region table.
See also:
Tables:
- seq_region - Relates sequence to features.
- external_synonym - Allows xrefs to have more than one name
dnac
Stores compressed DNA sequence.
assembly
The assembly table states which parts of seq_regions are exactly equal, which makes it possible to transform co-ordinates between seq_regions. Typically it describes how chromosomes are made of contigs, clones of contigs, and chromosomes of supercontigs. It also allows you to artificially chunk chromosome sequence into smaller parts.
See also:
Tables:
- seq_region - Stores extra information about both the assembled object and its component parts
Concepts:
- supercontigs - The mapping between contigs and supercontigs is also stored in the assembly table.
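The co-ordinate transformation that the assembly table enables can be sketched as follows. This is only an illustration under stated assumptions: the field names asm_start, asm_end, cmp_start, cmp_end and ori are not taken from the description above, and the AssemblyRow class and component_to_assembled function are hypothetical helpers. Each row is assumed to say that a 1-based, inclusive slice of a component (for example a contig) is exactly equal to a slice of an assembled region (for example a chromosome), possibly on the opposite strand.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AssemblyRow:
    """One assumed assembly row: a component slice equals an assembled slice."""
    asm_start: int   # 1-based, inclusive, on the assembled region
    asm_end: int
    cmp_start: int   # 1-based, inclusive, on the component
    cmp_end: int
    ori: int         # +1 same strand, -1 component is reverse-complemented

def component_to_assembled(row: AssemblyRow, cmp_pos: int) -> int:
    """Map a component position onto the assembled region using one row."""
    if not row.cmp_start <= cmp_pos <= row.cmp_end:
        raise ValueError("position lies outside the mapped part of the component")
    offset = cmp_pos - row.cmp_start
    if row.ori == 1:
        return row.asm_start + offset
    # On the reverse strand the component's first base lands on asm_end.
    return row.asm_end - offset

forward = AssemblyRow(asm_start=1001, asm_end=1500, cmp_start=1, cmp_end=500, ori=1)
reverse = AssemblyRow(asm_start=1501, asm_end=1800, cmp_start=101, cmp_end=400, ori=-1)
```

With these sample rows, base 1 of the first component corresponds to base 1001 of the assembled region, while base 101 of the reversed component corresponds to base 1800.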
assembly_exception
Allows multiple sequence regions to point to the same sequence, analogous to a symbolic link in a filesystem pointing to the actual file. This mechanism has been implemented specifically to support haplotypes and pseudo-autosomal regions (PARs), but may be useful for other similar structures in the future.
See also:
Tables:
- assembly -
karyotype
Describes bands that can be stained on the chromosome.
exon
Stores data about exons. Associated with transcripts via exon_transcript. Allows access to seq_regions.
See also:
Tables:
- exon_transcript - Used to associate exons with transcripts.
exon_stable_id
Relates exon IDs in this release to release-independent stable identifiers.
See also:
Concepts:
- stable_id - Describes the rationale behind the use of stable identifiers.
transcript
Stores information about transcripts. Has seq_region_start, seq_region_end and seq_region_strand for faster retrieval and to allow storage independently of genes and exons. Note that a transcript is usually associated with a translation, but may not be, e.g. in the case of pseudogenes and RNA genes (those that code for RNA molecules).
transcript_stable_id
Relates transcript IDs in this release to release-independent stable identifiers.
See also:
Concepts:
- stable_id - Describes the rationale behind the use of stable identifiers.
transcript_attrib
Enables storage of attributes that relate to transcripts.
translation_attrib
Enables storage of attributes that relate to translations.
exon_transcript
Relationship table linking exons with transcripts. The rank column indicates the 5' to 3' position of the exon within the transcript, i.e. a rank of 1 means the exon is the 5' most within this transcript.
See also:
Tables:
- exon - One of the entities related by the exon_transcript table.
- transcript - One of the entities related by the exon_transcript table.
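A small sketch of how the rank column orders exons within a transcript is given below. The rows, the IDs and the exons_in_order helper are invented for the example; only the idea that rank 1 is the 5' most exon comes from the description above.

```python
# (exon_id, transcript_id, rank) rows; all IDs are made up for the example.
exon_transcript = [
    (503, 1, 3),
    (501, 1, 1),
    (502, 1, 2),
    (501, 2, 1),   # in this sample data, exon 501 is linked to two transcripts
    (504, 2, 2),
]

def exons_in_order(rows, transcript_id):
    """Return the exon IDs of one transcript from 5' to 3', i.e. sorted by rank."""
    mine = [(rank, exon_id) for exon_id, t_id, rank in rows if t_id == transcript_id]
    return [exon_id for rank, exon_id in sorted(mine)]

order_t1 = exons_in_order(exon_transcript, 1)
order_t2 = exons_in_order(exon_transcript, 2)
```

Storing the rank in the relationship table rather than on the exon itself lets one exon take a different position in each transcript it is linked to.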
gene
Allows transcripts to be related to genes.
gene_stable_id
Relates gene IDs in this release to release-independent stable identifiers.
See also:
Concepts:
- stable_id - Describes the rationale behind the use of stable identifiers.
translation
Describes which parts of which exons are used in translation. The seq_start and seq_end columns are 1-based offsets into the relative coordinate system of start_exon_id and end_exon_id, i.e. if the translation starts at the first base of the exon, seq_start would be 1. Transcripts are related to translations by the transcript_id key in this table.
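To make the 1-based offsets concrete, here is a minimal sketch of cutting the translated part out of a spliced transcript. It assumes the exon sequences are already available in 5' to 3' order on the transcript's strand; the coding_sequence function, its parameter names and the sample sequences are hypothetical and not part of the schema.

```python
def coding_sequence(exons, start_exon, end_exon, seq_start, seq_end):
    """Extract the translated part of a spliced transcript.

    exons      -- list of (exon_id, sequence) in 5' to 3' order
    start_exon -- ID of the exon in which translation starts
    end_exon   -- ID of the exon in which translation ends
    seq_start  -- 1-based offset of the first coding base within start_exon
    seq_end    -- 1-based offset of the last coding base within end_exon
    """
    ids = [exon_id for exon_id, _ in exons]
    first, last = ids.index(start_exon), ids.index(end_exon)
    if first > last:
        raise ValueError("start exon must not come after end exon")
    parts = []
    for i in range(first, last + 1):
        seq = exons[i][1]
        lo = seq_start - 1 if i == first else 0   # 1-based -> 0-based
        hi = seq_end if i == last else len(seq)   # inclusive end -> slice end
        parts.append(seq[lo:hi])
    return "".join(parts)

exons = [(1, "GGATGAA"), (2, "CCCGGG"), (3, "TTTTTAACC")]
cds = coding_sequence(exons, start_exon=1, end_exon=3, seq_start=3, seq_end=7)
```

Here the first two bases of exon 1 and the last two bases of exon 3 are untranslated, so cds comes out as ATGAACCCGGGTTTTTAA. When translation starts and ends in the same exon, both offsets apply to that one exon.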
translation_stable_id
Relates translation IDs in this release to release-independent stable identifiers.
See also:
Concepts:
- stable_id - Describes the rationale behind the use of stable identifiers.
supporting_feature
Describes the exon prediction process by linking exons to DNA or protein alignment features. As in several other tables, the feature_id column is a foreign key; the feature_type column specifies which table feature_id refers to.
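The feature_type / feature_id pairing works like a foreign key whose target table is chosen per row. The sketch below imitates that lookup with plain Python dictionaries. It assumes that feature_type holds the name of the target table; the hit_name field, the sample rows and the evidence_for helper are invented for illustration.

```python
# Stand-ins for two feature tables, keyed by their primary key.
tables = {
    "dna_align_feature": {7: {"hit_name": "cdna_example_1"}},
    "protein_align_feature": {7: {"hit_name": "protein_example_1"}},
}

# (exon_id, feature_type, feature_id) rows, modelled on supporting_feature.
supporting_feature = [
    (501, "dna_align_feature", 7),
    (501, "protein_align_feature", 7),
]

def evidence_for(exon_id):
    """Resolve each supporting_feature row of an exon to the feature it points at."""
    found = []
    for row_exon_id, feature_type, feature_id in supporting_feature:
        if row_exon_id == exon_id:
            # feature_type picks the table; feature_id is only unique within it.
            found.append((feature_type, tables[feature_type][feature_id]))
    return found

evidence = evidence_for(501)
```

Note that the same feature_id (7 here) can occur in both tables, which is why feature_id cannot be interpreted without feature_type.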
prediction_transcript
Stores transcripts that are predicted by ab initio gene finder programs (e.g. genscan, SNAP). Unlike EnsEMBL transcripts they are not supported by any evidence.
prediction_exon
Stores exons that are predicted by ab initio gene finder programs. Unlike EnsEMBL exons they are not supported by any evidence.
meta
Stores data about the data in the current schema. Taxonomy information, version information and the default value for the type column in the assembly table are stored here. Unlike other tables, data in the meta table is stored as key-value pairs. Also stores (via assembly.mapping keys) the relationships between co-ordinate systems in the assembly table.
The species_id field of the meta table is used in multi-species databases and makes it possible to have species-specific meta key-value pairs. The species-specific meta key-value pairs need to be repeated for each species_id. Entries in the meta table that are not specific to any one species, such as the schema.version key and any other schema-related information, must have their species_id field set to NULL. The default species_id, and the only species_id value allowed in single-species databases, is 1.
See also:
Tables:
- assembly - The default value for assembly.type is stored in the meta table.
- coord_system - The species_id in the meta table is linked to the species_id in the coord_system table.
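The key-value layout and the role of a NULL species_id can be illustrated with a simplified meta table in an in-memory SQLite database. The column names meta_id, meta_key and meta_value, the key example.species_name, the sample values and the meta_values helper are all assumptions made for this sketch; only species_id and the schema.version key are taken from the description above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE meta (
    meta_id    INTEGER PRIMARY KEY,
    species_id INTEGER,              -- NULL for entries shared by all species
    meta_key   TEXT NOT NULL,
    meta_value TEXT NOT NULL
);
INSERT INTO meta (species_id, meta_key, meta_value) VALUES
    (NULL, 'schema.version', '54'),
    (1,    'example.species_name', 'species one'),
    (2,    'example.species_name', 'species two');
""")

def meta_values(conn, key, species_id=1):
    """Return values for a key from species-specific or species-independent rows."""
    rows = conn.execute(
        "SELECT meta_value FROM meta "
        "WHERE meta_key = ? AND (species_id = ? OR species_id IS NULL) "
        "ORDER BY meta_id",
        (key, species_id),
    )
    return [value for (value,) in rows]
```

A lookup for species 2 therefore sees its own example.species_name row as well as the shared schema.version row, but not the rows that belong to species 1.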
meta_coord
Describes which co-ordinate systems the different feature tables use.
Features and analyses
analysis
Usually describes a program and some database that together are used to create a feature on a piece of sequence. Each feature is marked with an analysis_id. The most important column is logic_name, which the web team uses to render a feature correctly on contigview (or even to retrieve the right feature). The logic_name is also used in the pipeline to identify the analysis that has to run in a given state of the pipeline. The module column tells the pipeline which Perl module performs the analysis, typically a RunnableDB module.
See also:
Tables:
- analysis_description - Gives more details about each analysis.
analysis_description
Allows the storage of a textual description of the analysis, as well as a "display label", primarily for the EnsEMBL web site.
See also:
Tables:
- analysis - Holds most analysis information.
dna_align_feature
Stores DNA sequence alignments generated from Blast (or Blast-like) comparisons.
See also:
Concepts:
- cigar_line - Used to encode gapped alignments.
protein_align_feature
Stores translation alignments generated from Blast (or Blast-like) comparisons.
See also:
Concepts:
- cigar_line - Used to encode gapped alignments.
repeat_feature
Describes sequence repeat regions.
marker_feature
Used to describe marker positions.
See also:
Tables:
- marker - Stores details about the markers themselves.
- marker_map_location -
- marker_synonym - Holds alternative names for markers.
qtl_feature
Describes Quantitative Trait Loci (QTL) positions as obtained from inbreeding experiments. Note that the values in this table are in chromosomal co-ordinates. Also, this table is not populated in all schemas.
See also:
Tables:
- qtl - Describes the markers used to define a QTL.
- qtl_synonym - Stores alternative names for QTLs
prediction_transcript
Stores information about ab initio gene transcript predictions.
simple_feature
Describes general genomic features that don't fit into any of the more specific feature tables.
protein_feature
Describes features on the translations (as opposed to the DNA sequence itself), i.e. parts of the peptide. Positions are in peptide co-ordinates rather than contig co-ordinates.
See also:
Tables:
- analysis - Describes how protein features were derived.
density_feature
.
density_type
.
qtl
Describes the markers (of which there may be up to three) which define Quantitative Trait Loci. Note that QTL analysis is a statistical technique used to find links between certain expressed traits and regions in a genetic map.
See also:
Tables:
- qtl_synonym - Describes alternative names for QTLs
qtl_synonym
Describes alternative names for Quantitative Trait Loci (QTLs).
marker
Stores data about the marker itself - e.g. the primer sequences used.
See also:
Tables:
- marker_synonym - Stores alternative names for markers.
- marker_map_location -
marker_map_location
Allows storage of information about the position of a marker - these are positions on genetic or radiation hybrid maps (as opposed to positions on the assembly, which EnsEMBL has determined and which are stored in marker_feature).
See also:
Tables:
- marker - Stores marker data.
- marker_feature - Stores marker positions on the assembly.
marker_synonym
Stores alternative names for markers, as well as their sources.
See also:
Tables:
- marker - Stores the original marker.
map
Stores the names of different genetic or radiation hybrid maps, for which there is marker map information.
See also:
Tables:
- marker - Stores the original marker.
repeat_consensus
Stores consensus sequences obtained from analysing repeat features.
misc_feature
Allows for storage of arbitrary features.
See also:
Tables:
- misc_attrib - Allows storage of arbitrary attributes for the misc_features.
misc_attrib
Stores arbitrary attributes about the features in the misc_feature table.
misc_set
Defines "sets" that the features held in the misc_feature table can be grouped into.
See also:
Tables:
- misc_feature_misc_set - Defines which features are in which set
misc_feature_misc_set
Defines which of the features in misc_feature are in which of the sets defined in misc_set.
See also:
Tables:
- misc_feature -
- misc_set -
alt_allele
Stores information about genes on haplotypes that may be orthologous.
ID Mapping
Tables involved in mapping identifiers between releases
mapping_session
Stores details of ID mapping sessions. A mapping session represents the session when stable IDs were mapped from one database to another. Details of the "old" and "new" databases are stored.
See also:
Tables:
- stable_id_event - Stores details of what happened during the mapping session.
Concepts:
- stable_id - Describes the need for ID mapping.
stable_id_event
Represents what happened to all gene, transcript and translation stable IDs during a mapping session. This includes which IDs were deleted, created and related to each other. Each event is represented by one or more rows in the table.
See also:
Tables:
- mapping_session - Describes the session when events stored in this table occurred.
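One possible way to picture the events of a mapping session is as pairs of old and new stable IDs. The sketch below is an assumption-laden illustration rather than a description of the actual table: the pairing into old and new IDs, the use of None (standing in for NULL) for a missing side, the identifiers and the classify helper are all invented for the example.

```python
# (old_stable_id, new_stable_id) pairs for one mapping session; None stands in
# for SQL NULL. The identifiers are invented for the example.
events = [
    ("GENE_A", "GENE_A"),   # carried over unchanged
    ("GENE_B", None),       # no successor
    (None, "GENE_C"),       # no predecessor
    ("GENE_D", "GENE_E"),   # one old ID related to two new ones ...
    ("GENE_D", "GENE_F"),   # ... so this event spans two rows
]

def classify(events):
    """Group stable IDs by whether they appear on the old side, the new side or both."""
    old_ids = {old for old, new in events if old is not None}
    new_ids = {new for old, new in events if new is not None}
    return {
        "deleted": sorted(old_ids - new_ids),
        "created": sorted(new_ids - old_ids),
        "kept": sorted(old_ids & new_ids),
    }

summary = classify(events)
```

Under this reading, an ID that exists only on the old side counts as deleted and one that exists only on the new side counts as created, while the GENE_D rows show how a single event can need more than one row.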
gene_archive
Contains a snapshot of the stable IDs associated with genes deleted or changed between releases. Includes gene, transcript and translation stable IDs.
peptide_archive
Contains the peptides for deleted or changed translations.
External references
Tables used for storing links to and details about objects that are stored in other databases
xref
Holds data about objects which are external to EnsEMBL, but need to be associated with EnsEMBL objects. Information about the database that the external object is stored in is held in the external_db table entry referred to by the external_db column.
See also:
Tables:
- external_db - Describes the database that xrefs are stored in
- external_synonym - Allows xrefs to have more than one name
external_db
Stores data about the external databases in which the objects described in the xref table are stored.
See also:
Tables:
- xref - Holds data about the external objects that are stored in the external_dbs.
external_synonym
Some xref objects can be referred to by more than one name. This table relates names to xref IDs.
See also:
Tables:
- xref - Holds most of the data about xrefs.
object_xref
Describes links between EnsEMBL objects and objects held in external databases. The EnsEMBL object can be one of several types; the type is held in the ensembl_object_type column. The ID of the particular EnsEMBL gene, translation or other object is given in the ensembl_id column. The xref_id points to the entry in the xref table that holds data about the external object. Each EnsEMBL object can be associated with zero or more xrefs. An xref object can be associated with one or more EnsEMBL objects.
See also:
Tables:
- xref - Stores the data about each externally-referenced object.
- go_xref - Stores extra data for relationships to GO objects.
- identity_xref - Stores data about how 'good' the relationships are
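The chain from an EnsEMBL object through object_xref and xref to external_db can be sketched with three cut-down tables in an in-memory SQLite database. The columns ensembl_id, ensembl_object_type and xref_id follow the description above. The columns external_db_id, db_name and accession, the object type strings, the sample rows and the xrefs_for helper are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE external_db (external_db_id INTEGER PRIMARY KEY, db_name TEXT NOT NULL);
CREATE TABLE xref (
    xref_id        INTEGER PRIMARY KEY,
    external_db_id INTEGER NOT NULL REFERENCES external_db(external_db_id),
    accession      TEXT NOT NULL
);
CREATE TABLE object_xref (
    ensembl_id          INTEGER NOT NULL,
    ensembl_object_type TEXT NOT NULL,
    xref_id             INTEGER NOT NULL REFERENCES xref(xref_id)
);
INSERT INTO external_db VALUES (1, 'ExampleDB');
INSERT INTO xref VALUES (100, 1, 'EX0001'), (101, 1, 'EX0002');
INSERT INTO object_xref VALUES
    (42, 'Gene', 100),
    (42, 'Gene', 101),
    (42, 'Translation', 100);   -- same numeric ID, different object type
""")

def xrefs_for(conn, object_type, ensembl_id):
    """List (db_name, accession) pairs linked to one EnsEMBL object."""
    rows = conn.execute(
        """
        SELECT edb.db_name, x.accession
        FROM object_xref AS ox
        JOIN xref AS x ON x.xref_id = ox.xref_id
        JOIN external_db AS edb ON edb.external_db_id = x.external_db_id
        WHERE ox.ensembl_object_type = ? AND ox.ensembl_id = ?
        ORDER BY x.accession
        """,
        (object_type, ensembl_id),
    )
    return rows.fetchall()
```

Filtering on both ensembl_object_type and ensembl_id matters because the same numeric ID can belong to objects of different types, as the sample rows for ID 42 show.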
go_xref
Links between EnsEMBL objects and external objects produced by GO (Gene Ontology) require some additional data which is not stored in the object_xref table.
See also:
Tables:
- object_xref - Stores basic, non GO-specific information for GO xrefs
Links:
- GO - Gene Ontology website
identity_xref
Describes how well a particular xref object matches the EnsEMBL object.
See also:
Tables:
- object_xref - Stores basic information about EnsEMBL object-xref mapping
Miscellaneous
Tables that don't fit anywhere else.
interpro
Allows storage of links to the InterPro database. InterPro is a database of protein families, domains and functional sites in which identifiable features found in known proteins can be applied to unknown protein sequences.
See also:
Links:
- InterPro - The InterPro website
Concepts
co-ordinates
There are several different co-ordinate systems used in the EnsEMBL database and API. For every co-ordinate system, the fundamental unit is one base. The differences between co-ordinate systems lie in where a particular numbered base lies, and the start position it is relative to. CONTIG co-ordinates, also called 'raw contig' co-ordinates or 'clone fragments' are relative to the first base of the first contig of a clone. Note that the numbering is from 1, i.e. the very first base of the first contig of a clone is numbered 1, not 0. In CHROMOSOMAL co-ordinates, the co-ordinates are relative to the first base of the chromosome. Again, numbering is from 1. The seq_region table can store sequence regions in any of the co-ordinate systems defined in the coord_system table.
supercontigs
A supercontig is made up of a group of adjacent or overlapping contigs.
sticky_rank
The sticky_rank differentiates between fragments of the same exon, i.e. for exons that span multiple contigs, all the fragments would have the same ID but different sticky_rank values.
stable_id
Gene predictions have changed over the various releases of the EnsEMBL databases. To allow the user to track particular gene predictions over changing co-ordinates, each gene-related prediction is given a 'stable identifier'. If a prediction looks similar between two releases, we try to give it the same name, even though it may have changed position and/or had some sequence changes.
cigar_line
This allows the compact storage of gapped alignments by storing the maximum extent of the matches and then a text string which encodes the placement of gaps inside the alignment. Colloquially inside EnsEMBL this is called a cigar line, and its adoption has shrunk the number of rows in the feature table around 4-fold.
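The text above does not spell out the encoding of the gap string, so the following sketch assumes a common convention: a run of tokens, each an optional count followed by a single operation letter, for example 5M2D3M. The letters, the rule that a missing count means 1, and the parse_cigar and op_totals helpers are assumptions for illustration, not a statement of the exact EnsEMBL format.

```python
import re

# An optional run of digits followed by exactly one letter.
_CIGAR_TOKEN = re.compile(r"(\d*)([A-Za-z])")

def parse_cigar(cigar):
    """Split a string such as '5M2D3M' into (length, operation) pairs.

    A missing count is read as 1. Raises ValueError if any part of the
    string is not a valid token.
    """
    pairs = []
    pos = 0
    for match in _CIGAR_TOKEN.finditer(cigar):
        if match.start() != pos:
            raise ValueError(f"unexpected text at offset {pos}")
        count, op = match.groups()
        pairs.append((int(count) if count else 1, op.upper()))
        pos = match.end()
    if pos != len(cigar):
        raise ValueError(f"unexpected text at offset {pos}")
    return pairs

def op_totals(cigar):
    """Sum the lengths recorded for each operation letter."""
    totals = {}
    for length, op in parse_cigar(cigar):
        totals[op] = totals.get(op, 0) + length
    return totals
```

Storing one row with such a string, instead of one row per ungapped block, is what reduces the row count: the example 5M2D3M describes two matched blocks and the gap between them in a single value.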