
Regulatory Build

The regulatory build provides a single "best guess" set of regulatory elements. These elements are based on the information contained within the Ensembl functional genomics database.

Regulatory Feature Construction

The main principle behind building regulatory features is overlap analysis of the annotations in each data set. Core regions are identified using the focus sets, which are chosen to define a set of potential regulatory regions. Focus sets tend to be marks with broad genomic coverage but narrowly focused signal, making them likely candidates for different types of regulatory elements or motifs. Historically we have used DNase1 hypersensitivity, which marks accessible chromatin; H3K4me3, which has been associated with promoter regions; and CTCF, which characterises 'insulator/enhancer' elements. As such, the core regions of regulatory features are likely to be positioned on or around any potential regulatory motif. Core regions are extended only in the case of direct overlap with another focus feature. To maintain resolution and to avoid chaining regulatory features together across regions dense in regulatory elements, we have imposed a 2KB cut-off on focus features. A focus feature that exceeds this cut-off is treated as an attribute feature and so does not extend the core region.
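The core-region step described above can be sketched as a simple interval merge. This is a minimal illustration in Python, not the Ensembl build code: the `Feature` class, the `build_core_regions` function and the example coordinates are all hypothetical, and it assumes the 2KB cut-off applies to the length of each individual focus feature.

```python
from dataclasses import dataclass

MAX_FOCUS_LENGTH = 2000  # assumed reading of the 2KB cut-off, in bp


@dataclass(frozen=True)
class Feature:
    start: int  # inclusive
    end: int    # inclusive
    name: str

    @property
    def length(self) -> int:
        return self.end - self.start + 1


def build_core_regions(focus_features):
    """Merge directly overlapping focus features into core regions.

    Returns (cores, demoted). cores is a list of [start, end] pairs;
    demoted holds focus features longer than the cut-off, which would be
    handled as attribute features instead of extending a core.
    """
    short = [f for f in focus_features if f.length <= MAX_FOCUS_LENGTH]
    demoted = [f for f in focus_features if f.length > MAX_FOCUS_LENGTH]

    cores = []
    for f in sorted(short, key=lambda f: (f.start, f.end)):
        if cores and f.start <= cores[-1][1]:
            # Direct overlap with the current core: extend it.
            cores[-1][1] = max(cores[-1][1], f.end)
        else:
            # No overlap: start a new core region.
            cores.append([f.start, f.end])
    return cores, demoted


# Hypothetical focus features on one chromosome.
focus = [
    Feature(100, 400, "DNase1"),
    Feature(350, 600, "CTCF"),
    Feature(5000, 9000, "DNase1"),   # longer than 2KB, so demoted
    Feature(12000, 12200, "CTCF"),
]
cores, demoted = build_core_regions(focus)
print(cores)
print([f.name for f in demoted])
```

In this sketch the first two features overlap and form a single core, while the 4 kb DNase1 feature is set aside rather than merged, which is how the cut-off prevents long features from chaining neighbouring cores together.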

The arms or bounds of a regulatory feature are defined by the overlap of attribute features with respect to the core region. Attribute feature sets are generally less discriminatory and less focused annotations, which are used to extend the immediate structure of the regulatory feature and to inform the annotation analysis. Directly overlapping attribute features are said to have one degree of separation. Attributes with two degrees of separation are only included if they are within 2KB of the core region and are entirely contained within another, longer attribute feature. This is done to capture information adjacent to and indirectly associated with the core region, whilst avoiding longer-range and potentially anomalous associations.
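The inclusion rules for attribute features can be illustrated with a short Python sketch. It is an interpretation rather than the actual build logic: the function names and coordinates are hypothetical, and it assumes that the "longer attribute feature" containing a two-degree attribute is itself one that directly overlaps the core.

```python
MAX_DISTANCE = 2000  # assumed 2KB limit for indirectly associated attributes


def overlaps(a, b):
    """True if intervals a and b (start, end; inclusive) share a base."""
    return a[0] <= b[1] and b[0] <= a[1]


def gap(a, b):
    """Distance in bp between two intervals; 0 if they overlap."""
    return max(0, a[0] - b[1], b[0] - a[1])


def contains(outer, inner):
    """True if inner lies entirely within outer."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def feature_bounds(core, attributes):
    """Return ((start, end), direct, indirect) for one core region."""
    # One degree of separation: the attribute overlaps the core itself.
    direct = [a for a in attributes if overlaps(a, core)]
    # Two degrees: close to the core and inside a longer direct attribute.
    indirect = [
        a for a in attributes
        if not overlaps(a, core)
        and gap(a, core) <= MAX_DISTANCE
        and any(
            contains(d, a) and (d[1] - d[0]) > (a[1] - a[0])
            for d in direct
        )
    ]
    included = direct + indirect
    start = min([core[0]] + [a[0] for a in included])
    end = max([core[1]] + [a[1] for a in included])
    return (start, end), direct, indirect


# Hypothetical core region and attribute features.
core = (1000, 1200)
attributes = [
    (800, 3500),   # overlaps the core: one degree of separation
    (2000, 2400),  # 800 bp from the core, inside (800, 3500): two degrees
    (3300, 3450),  # inside (800, 3500) but 2100 bp from the core: excluded
    (5000, 5200),  # unrelated: excluded
]
bounds, direct, indirect = feature_bounds(core, attributes)
print(bounds, direct, indirect)
```

Under this reading the bounds are set by the directly overlapping attributes, and two-degree attributes contribute supporting evidence for annotation without stretching the feature any further.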

Annotation

Regulatory features are classified by analysing the whole set of patterns of constituent attribute and focus features with respect to several classes of genomic feature. This enables us to identify clusters of patterns which are significantly over-represented and to generate annotations based on these patterns. Preliminary analysis has identified patterns for the following annotations that are both common and statistically significant:

  • non-gene-associated = patterns under-represented in genic regions
  • gene-associated = patterns over-represented in genic regions
  • promoter-associated = patterns over-represented in the region of the first exon and 2500 bp upstream of protein-coding genes, but not in the downstream 'gene-body'.
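As a rough illustration of how pattern-based labels like these might be derived, the Python sketch below encodes each regulatory feature as a string of 0/1 flags (one per feature set) and tests whether a pattern occurs in genic regions more or less often than the background rate. Everything here is an assumption made for the example: the one-sided binomial test, the `alpha` threshold, the "unclassified" label and the data are hypothetical, and no multiple-testing correction is applied. The statistical method actually used in the build is not described on this page.

```python
from collections import Counter
from math import comb


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


def classify_patterns(features, alpha=0.05):
    """Label each pattern by its representation in genic regions.

    features: iterable of (pattern, is_genic) pairs, where pattern is a
    string of 0/1 flags marking which feature sets are present.
    Returns {pattern: label}.
    """
    features = list(features)
    total = Counter(p for p, _ in features)
    genic = Counter(p for p, g in features if g)
    # Background rate: fraction of all regulatory features that are genic.
    background = sum(genic.values()) / len(features)

    labels = {}
    for pattern, n in total.items():
        k = genic[pattern]
        if binomial_tail(k, n, background) < alpha:
            labels[pattern] = "gene-associated"
        elif binomial_tail(n - k, n, 1 - background) < alpha:
            labels[pattern] = "non-gene-associated"
        else:
            labels[pattern] = "unclassified"  # hypothetical fallback label
    return labels


# Hypothetical observations: (pattern, overlaps a genic region?)
observed = (
    [("110", True)] * 10
    + [("001", False)] * 10
    + [("101", True)] * 2
    + [("101", False)] * 2
)
labels = classify_patterns(observed)
print(labels)
```

The same approach could be repeated against a promoter window (first exon plus 2500 bp upstream) and against the gene body to separate promoter-associated patterns from gene-associated ones.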

These data sets can be displayed along the chromosome in 'Region in Detail', displayed for a gene in the 'GeneRegulation' view or mined from the functional genomics database.

Source data

Regulatory features are generated using a variety of genome-wide epigenomic data sets, currently comprising the following:

Human Regulatory Build Version 4

Focus Sets                                               Data type   Reference
DNase1 Hypersensitivity site (GM06990 and CD4+ T cells)  ChIP-Seq    1
CCCTC-binding factor (CTCF)                              ChIP-Chip*  2
CCCTC-binding factor (CTCF)                              ChIP-Seq    4

Attribute Sets      Data type   Reference
H4K20me3            ChIP-Chip   3
H3K27me3            ChIP-Chip   3
H3K36me3            ChIP-Chip   3
H3K79me3            ChIP-Chip   3
H3K9me3             ChIP-Chip   3
H2AZ                ChIP-Seq    4
H2BK5me1            ChIP-Seq    4
H3K27me1, me2, me3  ChIP-Seq    4
H3K36me1, me3       ChIP-Seq    4
H3K4me1, me2, me3   ChIP-Seq    4
H3K79me1, me2, me3  ChIP-Seq    4
H3K9me1, me2, me3   ChIP-Seq    4
H3R2me1, me2        ChIP-Seq    4
H4K20me1, me3       ChIP-Seq    4
H4R3me2             ChIP-Seq    4
PolII               ChIP-Seq    4
CD4_H2AK5ac         ChIP-Seq    5
CD4_H2AK9ac         ChIP-Seq    5
CD4_H2BK120ac       ChIP-Seq    5
CD4_H2BK12ac        ChIP-Seq    5
CD4_H2BK20ac        ChIP-Seq    5
CD4_H2BK5ac         ChIP-Seq    5
CD4_H3K14ac         ChIP-Seq    5
CD4_H3K18ac         ChIP-Seq    5
CD4_H3K23ac         ChIP-Seq    5
CD4_H3K27ac         ChIP-Seq    5
CD4_H3K36ac         ChIP-Seq    5
CD4_H3K4ac          ChIP-Seq    5
CD4_H3K9ac          ChIP-Seq    5
CD4_H4K12ac         ChIP-Seq    5
CD4_H4K16ac         ChIP-Seq    5
CD4_H4K5ac          ChIP-Seq    5
CD4_H4K8ac          ChIP-Seq    5
CD4_H4K91ac         ChIP-Seq    5

Mouse Regulatory Build Version 1

Focus Sets                                                       Data type   Reference
DNase1 Hypersensitivity site (ES Cells)                          ChIP-Seq    6

Attribute Sets                                                   Data type   Reference
ES: H3me, H3K4me3, H3K9me3, H3K27me3, H3K36me3, H4K20me3, PolII  ChIP-Seq    7
ES Hybrid: H3K4me3, H3K9me3, H3K36me3                            ChIP-Seq    7
MEF: H3K4me3, H3K27me3, H3K36me3                                 ChIP-Seq    7
NPC: H3K4me3, H3K9me3, H3K36me3                                  ChIP-Seq    7

References

1. Genome-wide identification of DNaseI hypersensitive sites was performed by Greg Crawford and Terry Furey (Duke University) using a whole genome DNase-sequencing protocol (Crawford et al., Genome Research 2006).
DNase-sequencing was performed using the Illumina (Solexa) sequencing by synthesis method from a DNase treated library generated from the GM06990 cell line (Crawford and Furey, unpublished).

2. Kim, T.H.; Abdullaev, Z.K.; Smith, A.D.; Ching, K.A.; Loukinov, D.I.; Green, R.D.; Zhang, M.Q.; Lobanenkov, V.V. & Ren, B.
Analysis of the vertebrate insulator protein CTCF-binding sites in the human genome.
Cell, 2007, 128, 1231-1245

3. Hirst, M.; Hurd, P.J.; Bainbridge, M.; Robertson, G.; Kirmizis, A.; Nelson, C.; Zhao, Y.; Zeng, T.; Pandoh, P.; Tam, A.; Prabhu, A.; Dhalla, N.; Sa, D.; Delaney, A.; Bilenky, M.; Jones, S.; Kouzarides, T.; Marra, M. (In preparation)

4. A. Barski, S. Cuddapah, K. Cui, T.Y. Roh, D.E. Schones, Z. Wang, G. Wei, I. Chepelev and K. Zhao (2007). High-resolution profiling of histone methylations in the human genome. Cell 129, pp. 823-837.

5. Wang Z, Zang C, Rosenfeld JA, Schones DE, Barski A, Cuddapah S, Cui K, Roh TY, Peng W, Zhang MQ, Zhao K. Combinatorial patterns of histone acetylations and methylations in the human genome. Nat Genet. 2008 Jul;40(7):897-903. Epub 2008 Jun 15. PMID: 18552846

6. DNase1-sequencing data were produced as a collaboration between Ensembl, David Adams (Wellcome Trust Sanger Institute), and Greg Crawford (Duke University).

7. Mikkelsen TS, Ku M, Jaffe DB, Issac B, Lieberman E, Giannoukos G, Alvarez P, Brockman W, Kim TK, Koche RP, Lee W, Mendenhall E, O'Donovan A, Presser A, Russ C, Xie X, Meissner A, Wernig M, Jaenisch R, Nusbaum C, Lander ES, Bernstein BE. Genome-wide maps of chromatin state in pluripotent and lineage-committed cells. Nature. 2007 Aug 2;448(7153):548-9. PMID: 17603471

* The CTCF data was processed with the Nessie HMM (Flicek, unpublished).