Ensembl Variation Schema Documentation

Introduction

This document gives a high-level description of the tables that make up the Ensembl variation schema. Tables are grouped into logical groups, and the purpose of each table is explained. It is intended to allow people to familiarise themselves with the schema when encountering it for the first time, or when they need to use some tables that they've not used before. Note that this document makes no attempt to enumerate all of the names, types and contents of every single table.

This document refers to version 53 of the Ensembl variation schema.

A PDF document of the schema is available here.

Samples and populations

individual
individual_population
individual_type
sample
sample_synonym
population
population_structure

Simple variations

source
variation
variation_synonym
flanking_sequence
failed_variation
failed_description
allele
variation_feature
transcript_variation
tagged_variation_feature

Multi-marker variations

allele_group
allele_group_allele
variation_group
variation_group_feature
variation_group_variation
httag

Genotypes, read coverage and annotations

compressed_genotype_single_bp
individual_genotype_multiple_bp
population_genotype
read_coverage
variation_annotation
phenotype

Miscellaneous

meta
meta_coord
seq_region

Samples and populations

individual

Stores information about an identifiable individual, including gender and the identifiers of the individual's parents (if known).

See also:

Tables:

sample
individual_type
individual_population
individual_genotype_multiple_bp
compressed_genotype_single_bp

individual_population

This table resolves the many-to-many relationship between the individual and population tables; i.e. an individual may belong to more than one population, and a population may contain many individuals. Hence it is composed of rows of individual and population identifiers.

See also:

Tables:

individual
population
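
The many-to-many link described for individual_population can be sketched with a toy in-memory SQLite database. This is only an illustration: the real schema lives in MySQL, and the column names individual_sample_id and population_sample_id, the cut-down sample table and the example rows are all assumptions made for this sketch rather than details taken from this document.

```python
import sqlite3


def populations_for_individual(conn, individual_id):
    """Return the names of every population one individual belongs to."""
    rows = conn.execute(
        """
        SELECT s.name
        FROM individual_population ip
        JOIN sample s ON s.sample_id = ip.population_sample_id
        WHERE ip.individual_sample_id = ?
        ORDER BY s.name
        """,
        (individual_id,),
    ).fetchall()
    return [name for (name,) in rows]


# Toy data: hypothetical column names and rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sample (sample_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE individual_population (
        individual_sample_id INTEGER,
        population_sample_id INTEGER
    );
    INSERT INTO sample VALUES (1, 'IND_A'), (10, 'POP_X'), (11, 'POP_Y');
    INSERT INTO individual_population VALUES (1, 10), (1, 11);
    """
)
pops = populations_for_individual(conn, 1)
```

Here the single individual appears in two link rows, one per population, which is the situation the table exists to represent.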

individual_type

This table contains a set of common individual types and their descriptions. These types are currently: “fully_inbred”, “partly_inbred”, “outbred” and “mutant”.

See also:

Tables:

individual

sample

Sample is a generic catch-all term covering individuals, populations and strains. This table stores a name and description for each sample, along with a size where the sample is a population.

See also:

Tables:

individual
population

sample_synonym

This table stores alternative names for populations when data come from multiple sources.

See also:

Tables:

sample
population
source

population

A table consisting simply of sample_ids representing populations; all data relating to the populations are stored in separate tables (see below).

See also:

Tables:

sample
sample_synonym
individual_population
population_structure
population_genotype
allele
allele_group
tagged_variation_feature

population_structure

This table stores hierarchical relationships between populations by relating them as populations and sub-populations.

See also:

Tables:

population
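
The population and sub-population relationship described for population_structure forms a hierarchy that can be walked to find every nested population. The sketch below assumes each row is a (super-population ID, sub-population ID) pair; that row shape and the example IDs are assumptions for illustration, not details from this document.

```python
from collections import defaultdict


def all_sub_populations(pairs, root):
    """Collect every population nested under root, at any depth."""
    children = defaultdict(list)
    for super_id, sub_id in pairs:
        children[super_id].append(sub_id)

    found = []
    seen = {root}  # guards against accidental cycles in the data
    stack = [root]
    while stack:
        current = stack.pop()
        for child in children[current]:
            if child not in seen:
                seen.add(child)
                found.append(child)
                stack.append(child)
    return sorted(found)


# Hypothetical rows: population 1 contains 2 and 3, and 3 contains 4.
pairs = [(1, 2), (1, 3), (3, 4)]
subs = all_sub_populations(pairs, 1)
```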

Simple variations

source

This table contains details of the source from which a variation is derived. Most commonly this is NCBI's dbSNP; other sources include SNPs called by Ensembl.

See also:

Tables:

variation
variation_synonym
variation_feature
variation_annotation
variation_group
allele_group
sample_synonym
httag

variation

This is the schema's generic representation of a variation, defined as a genetic feature that varies between individuals of the same species. The most common type is the single nucleotide polymorphism (SNP), though the schema also accommodates copy number variations (CNVs) and structural variations (SVs). A variation is defined by its flanking sequence rather than its mapped location on a chromosome; a variation may in fact have multiple mappings across a genome. This table stores a variation's name (commonly an ID of the form rs123456, assigned by dbSNP), along with a validation status and ancestral (or reference) allele.

See also:

Tables:

variation_synonym
flanking_sequence
failed_variation
variation_feature
variation_group_variation
allele
allele_group_allele
individual_genotype_multiple_bp

variation_synonym

This table allows for a variation to have multiple IDs, generally given by multiple sources.

See also:

Tables:

source
variation

flanking_sequence

This table contains the upstream and downstream sequence surrounding a variation. Since each variation is defined by its flanking sequence, this table has a one-to-one relationship with the variation table.

See also:

Tables:

seq_region
variation

failed_variation

For various reasons it may be necessary to store information about a variation that has failed quality checks in the Variation pipeline. This table acts as a flag for such failures.

See also:

Tables:

failed_description
variation

failed_description

This table contains descriptions of reasons for a variation being flagged as failed.

See also:

Tables:

failed_variation

allele

This table stores information about each of a variation's alleles, along with population frequencies.

See also:

Tables:

variation
population

variation_feature

This table represents mappings of variations to genomic locations. It stores an allele string representing the different possible alleles found at that locus (e.g. “A/T” for a SNP), as well as a “worst case” consequence of the mutation. It also acts as part of the relationship between variations and transcripts.

See also:

Tables:

variation
tagged_variation_feature
transcript_variation
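
The allele string described for variation_feature is a slash-separated list, so it can be split into its component alleles with very little code. The helper names below are hypothetical, and the rule used to recognise a SNP (two or more alleles, each a single A, C, G or T) is an assumption for this sketch rather than a definition from the schema.

```python
BASES = {"A", "C", "G", "T"}


def parse_allele_string(allele_string):
    """Split an allele string such as "A/T" into a list of alleles."""
    return allele_string.split("/")


def looks_like_snp(allele_string):
    """Assumed rule: at least two alleles, each exactly one nucleotide."""
    alleles = parse_allele_string(allele_string)
    return len(alleles) >= 2 and all(a in BASES for a in alleles)


alleles = parse_allele_string("A/T")
```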

transcript_variation

This table relates a variation_feature to a transcript (see Core documentation). It contains the consequence of the variation (e.g. intronic, non-synonymous coding, stop), along with the change in amino acid in the resulting protein if applicable.

See also:

Tables:

variation_feature

tagged_variation_feature

This table lists variation features that are tagged by another variation feature. Tag pairs are defined as having an r² > 0.99.

See also:

Tables:

variation_feature
population
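
The r² threshold quoted for tagged_variation_feature refers to the standard linkage disequilibrium measure between two sites. The sketch below computes r² from phased two-site haplotypes for biallelic sites and applies the 0.99 cut-off from the text. It is not the pipeline's own code: the function names and the input format (a list of allele pairs) are assumptions made for illustration.

```python
TAG_THRESHOLD = 0.99  # cut-off quoted in the text above


def r_squared(haplotypes):
    """Compute r^2 for two biallelic sites.

    haplotypes is a list of (allele_at_site_1, allele_at_site_2) tuples,
    one per phased chromosome.
    """
    n = len(haplotypes)
    a, b = haplotypes[0]  # pick one allele per site to count
    p_a = sum(1 for h in haplotypes if h[0] == a) / n
    p_b = sum(1 for h in haplotypes if h[1] == b) / n
    p_ab = sum(1 for h in haplotypes if h == (a, b)) / n
    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denominator == 0:
        return 0.0  # at least one site is monomorphic, so r^2 is undefined
    d = p_ab - p_a * p_b  # linkage disequilibrium coefficient D
    return d * d / denominator


def is_tag_pair(haplotypes):
    """Apply the r^2 > 0.99 rule to a pair of sites."""
    return r_squared(haplotypes) > TAG_THRESHOLD


linked = [("A", "C")] * 5 + [("G", "T")] * 5
unlinked = [("A", "C"), ("A", "T"), ("G", "C"), ("G", "T")]
```

In the first data set the allele at one site fully predicts the allele at the other, so the pair qualifies as a tag pair; in the second the sites are independent and it does not.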

Multi-marker variations

These tables are designed to represent alleles and haplotypes that consist of more than one variation.

allele_group

This table, along with allele_group_allele, represents a particular multi-marker allele of a given multi-marker variation, or haplotype. It stores an associated population frequency.

See also:

Tables:

allele_group_allele
variation_group
population
source

allele_group_allele

This table represents an allele of one variation in a multi-marker variation, or haplotype. It stores a string of the allele.

See also:

Tables:

allele_group
variation

variation_group

This table represents the equivalent of a variation for a multi-marker variation.

See also:

Tables:

variation_group_variation
allele_group
variation
source
httag

variation_group_variation

This table represents an individual variation that makes up a multi-marker variation, and resolves the many-to-many relationship between variation and variation_group.

See also:

Tables:

variation_group
variation

variation_group_feature

This table represents the equivalent of a variation_feature for multi-marker variations, mapping a haplotype to a chromosomal coordinate system.

See also:

Tables:

variation_group
seq_region

httag

This table represents the equivalent of a tagged_variation_feature for multi-marker variations, representing an instance where a haplotype is tagged by or tags another marker.

See also:

Tables:

variation_group
source

Genotypes, read coverage and annotations

compressed_genotype_single_bp

This table holds genotypes compressed using Perl's pack() function. These genotypes are mapped to particular genomic locations rather than variation objects. The data have been compressed to reduce table size and increase the speed of the web code.

See also:

Tables:

individual
seq_region
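
To give a feel for what packing genotypes into a binary string involves, the sketch below uses Python's struct module as a stand-in for Perl's pack(). The record layout (a 16-bit big-endian gap from the previous position followed by two one-byte alleles) and the gap encoding are assumptions chosen for illustration; this document does not specify the actual format, so the sketch should not be used to decode real data without checking.

```python
import struct

# Assumed record: unsigned 16-bit big-endian gap, then two one-byte alleles.
RECORD = struct.Struct(">Hcc")


def pack_genotypes(start, genotypes):
    """Pack (position, allele_1, allele_2) tuples, sorted by position.

    Each gap must fit in 16 bits (0 to 65535) under this assumed layout.
    """
    blob = b""
    previous = start
    for position, allele_1, allele_2 in genotypes:
        blob += RECORD.pack(position - previous, allele_1.encode(), allele_2.encode())
        previous = position
    return blob


def unpack_genotypes(start, blob):
    """Reverse pack_genotypes, rebuilding absolute positions from gaps."""
    genotypes = []
    position = start
    for gap, allele_1, allele_2 in RECORD.iter_unpack(blob):
        position += gap
        genotypes.append((position, allele_1.decode(), allele_2.decode()))
    return genotypes


original = [(1000, "A", "G"), (1250, "C", "C")]
blob = pack_genotypes(1000, original)
restored = unpack_genotypes(1000, blob)
```

Each genotype costs four bytes in this layout, which shows why a packed representation keeps the table small compared with one row per genotype.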

individual_genotype_multiple_bp

This table holds uncompressed genotypes for given variations.

See also:

Tables:

individual
variation

population_genotype

This table stores alleles and frequencies for variations in given populations.

See also:

Tables:

population
variation

read_coverage

This table stores the read coverage in the resequencing of individuals. Each row contains an individual ID, chromosomal coordinates and a read coverage level.

See also:

Tables:

individual
seq_region
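
A typical use of the read_coverage rows described above is to ask what coverage an individual has at a given position. The sketch below assumes each row is a tuple of (individual ID, seq_region ID, start, end, level) with inclusive coordinates, and reports the highest level among the rows covering the position. The tuple layout, the example rows and the choice to take the maximum are all assumptions for illustration.

```python
def coverage_at(rows, individual_id, seq_region_id, position):
    """Return the highest coverage level at a position, or 0 if uncovered."""
    levels = [
        level
        for ind, region, start, end, level in rows
        if ind == individual_id
        and region == seq_region_id
        and start <= position <= end
    ]
    return max(levels, default=0)


# Hypothetical rows: (individual_id, seq_region_id, start, end, level)
rows = [
    (1, 7, 100, 200, 1),
    (1, 7, 150, 180, 2),
    (2, 7, 100, 200, 1),
]
level = coverage_at(rows, 1, 7, 160)
```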

variation_annotation

This table stores information linking genotypes and phenotypes. It stores various fields pertaining to the study conducted, along with the associated gene, risk allele frequency and a p-value.

See also:

Tables:

variation
phenotype
source

phenotype

This table stores details of the phenotypes associated with variation annotations.

See also:

Tables:

variation_annotation

Miscellaneous

Tables that don't fit anywhere else.

seq_region

This table stores the relationship between Ensembl's internal coordinate system identifiers and traditional chromosome names.

See also:

Tables:

variation_feature
variation_group_feature
flanking_sequence
compressed_genotype_single_bp
read_coverage

meta_coord

This table gives the coordinate system used by various tables in the database.