Ensembl Glossary


Accession number
A unique identifier given to a sequence when it is submitted to one of the DNA repositories (GenBank, EMBL, DDBJ).
Algorithm
A sequence of computational tasks or actions that carry out a specific function.
Alignment
A comparison between two or more sequences by matching identical and/or similar residues and assigning a score to the match.
Allele
One of the alternate forms of a specific gene. Each allele is an individual member of a gene pair and is inherited from one parent. When genes are considered "simply" as segments of a nucleotide sequence, then it refers to each of the possible alternative nucleotides at a specific position in the sequence.
Alu
A dispersed intermediately repetitive DNA sequence found in the human genome in about one million copies. The sequence is about 300 bp long and is found commonly in introns, 3' untranslated regions of genes, and intergenic genomic regions. The name Alu comes from the recognition site for the AluI endonuclease that cleaves it. The Alu universal primer sequence is as follows: 5'-GTG GAT CAC CTG AGG TCA GGA GTT TC-3' (26-mer).
API (Application Programming Interface)
A series of routines that applications can use to make the operating system request and carry out lower-level services.
ATV (A Tree Viewer)
An application (Java tool) for the visualisation of phylogenetic trees. It also allows data to be edited and exported. See Zmasek et al.
BAC (Bacterial Artificial Chromosome)
A vector used to clone DNA fragments (100 to 300-kb insert size; average, 150 kb) from another species so that it can be replicated in bacteria.
BLAST (Basic Local Alignment Search Tool)
A sequence comparison algorithm optimised for speed which is used to search sequence databases for optimal local alignments to a query. (Altschul et al., J Mol Biol 215:403-410; 1990)
BLAT (BLAST-Like Alignment Tool)
An mRNA/DNA and cross-species protein sequence analysis tool to quickly find sequences of 95% and greater similarity of length 40 bases or more. (Kent, W.J. 2002. BLAT -- The BLAST-Like Alignment Tool. Genome Research 12(4): 656-664)
BLOSUM 62 (Blocks Substitution Matrix)
A matrix that defines scores for amino acid substitutions, reflecting the similarity of physicochemical properties and observed substitution frequencies. The BLOSUM 62 matrix is tailored using sequences sharing no more than 62% identity (sequences that are evolutionarily closer were represented by a single sequence in the alignment to avoid bias from using related family members). (Henikoff and Henikoff, Proc Natl Acad Sci U S A 89:10915-10919; 1992)
CCDS (Consensus CDS)
A core set of human protein-coding regions that are consistently annotated between Ensembl, VEGA and RefSeq. The long term goal is to support convergence towards a standard set of gene annotations on the human genome.
cDNA (Complementary DNA)
DNA obtained by reverse transcription of an mRNA template. In bioinformatics jargon, cDNA is thought of as a DNA version of the mRNA sequence. Generally, cDNAs are denoted in coding or 'sense' orientation.
CDS (Coding sequence)
The portion of a gene or an mRNA that codes for a protein. Introns are not coding sequences, nor are the 5' or 3' UTR. The coding sequence in a cDNA or mature mRNA includes everything from the start codon through to the stop codon, inclusive.
Centimorgan (cM)
A unit of genetic distance, determined by how frequently two genes on the same chromosome are inherited together. One centimorgan corresponds to a 1% recombination frequency (1% recombinant offspring). In humans, 1 cM is about 1 x 10^6 bp.
Cigar (Compact Idiosyncratic Gapped Alignment Report)
The cigar line defines the sequence of matches/mismatches and deletions (or gaps). For example, the cigar line 2MD3M2D2M means that the alignment contains 2 matches/mismatches, 1 deletion (the number 1 is omitted to save space), 3 matches/mismatches, 2 deletions and 2 matches/mismatches. If the original sequence is:

  • Original sequence: AACGCTT

The aligned sequence will be:

cigar line: 2MD3M2D2M
A A - C G C - - T T
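The expansion described above can be sketched in a few lines of Python. This is a minimal illustration, not Ensembl code: the function name `apply_cigar` is hypothetical, and it assumes that only the M and D operations shown in the example occur and that a missing count means 1.

```python
import re


def apply_cigar(sequence: str, cigar: str) -> str:
    """Insert gap characters into `sequence` as directed by a cigar line.

    Assumption: only 'M' (match/mismatch) and 'D' (deletion/gap) appear,
    and a missing count means 1, as in the glossary example.
    """
    aligned = []
    pos = 0  # next unread position in the original sequence
    for count, op in re.findall(r"(\d*)([MD])", cigar):
        n = int(count) if count else 1
        if op == "M":
            # Copy n residues from the original sequence.
            aligned.append(sequence[pos:pos + n])
            pos += n
        else:
            # A deletion consumes no residues; it only adds gap characters.
            aligned.append("-" * n)
    if pos != len(sequence):
        raise ValueError("cigar line does not account for the whole sequence")
    return "".join(aligned)


print(apply_cigar("AACGCTT", "2MD3M2D2M"))  # AA-CGC--TT
```

The M operations in the cigar line must add up to the length of the original sequence (here 2 + 3 + 2 = 7), which is what the final check enforces.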
Clone
A segment of DNA that has been inserted into a vector molecule, such as a plasmid, and then replicated to form many identical copies.
Codon
Three base pairs in either DNA or RNA that code for an amino acid (or stop translation).
A contig is a contiguous stretch of DNA sequence without gaps that has been assembled solely based on direct sequencing information. Contig can be used in other contexts: A clone contig is a group of cloned fragments of DNA covering overlapping regions of a particular chromosome. A sequence contig is an extended sequence created by merging sequences that overlap. A contig map shows the regions of a chromosome where contiguous DNA segments overlap. These maps allow the study of a complete segment of a genome by examining a series of overlapping clones covering a region of interest.
Cosmid
DNA from a bacterial virus spliced with a small fragment of a genome (up to 50 kb) to be amplified and sequenced.
Coverage
Refers to the number of overlapping sequences used to build a region of the assembly. High coverage indicates a good amount of sequence information while low coverage reflects a low amount of sequence information.
cytogenetic map
A banding pattern on a chromosome resulting from staining and examination by microscopy. Cytogenetic abnormalities such as deletions or inverted nucleotide sequences may be detected by examining and comparing banding patterns.
DAS (Distributed Annotation System)
A protocol for requesting and returning annotation data for genomic regions. See the BioDAS site for more information.
DDBJ (DNA Data Bank of Japan)
DDBJ is the sole DNA data bank in Japan, which is officially certified to collect DNA sequences from researchers and to issue the internationally recognized accession number to data submitters. Data is exchanged with EMBL/EBI and GenBank/NCBI on a daily basis, and the three data banks share virtually the same data at any given time.
Domain
A region of special biological interest within a single protein sequence. However, a domain may also be defined as a region within the three-dimensional structure of a protein that may encompass regions of several distinct protein sequences that accomplishes a specific function. A domain class is a group of domains that share a common set of well-defined properties or characteristics.
Ensembl DotterView is based on the program Dotter, a dot-matrix program with dynamic threshold control suited for genomic DNA and protein sequence analysis. The Dotter tool provides a visual display of the sequence alignment it represents. The dotplot displays a detailed comparison of two sequences. Every residue in one sequence is compared to every residue in the other sequence. The first sequence runs along the x-axis and the second sequence along the y-axis. In regions where the two sequences are similar to each other, a row of high scores will run diagonally across the dot matrix. If you're comparing a sequence against itself to find internal repeats, you'll notice that the main diagonal scores maximally, since it's the 100% perfect self-match. To make the score matrix more intelligible, the pairwise scores are averaged over a sliding window that runs diagonally. The averaged score matrix forms a three-dimensional landscape, with the two sequences in two dimensions and the height of the peaks in the third. This landscape is projected onto two dimensions with the aid of grayscales: higher peaks are indicated by darker grays. Dotter was written by Erik L.L. Sonnhammer and Richard Durbin (Gene 167: GC1-10; 1995).
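The sliding-window idea can be illustrated with a short Python sketch. It is not Dotter's implementation: the function name `dot_matrix` and the `window` parameter are hypothetical, and a simple identity score (1 for identical residues, 0 otherwise) stands in for a real substitution matrix.

```python
def dot_matrix(seq_x: str, seq_y: str, window: int = 3) -> list[list[float]]:
    """Return scores[y][x]: the fraction of identical residues in a
    diagonal window of length `window` starting at column x, row y.

    Assumption: identity scoring only; Dotter itself uses score matrices.
    """
    if window < 1 or window > min(len(seq_x), len(seq_y)):
        raise ValueError("window must fit inside both sequences")
    rows = len(seq_y) - window + 1
    cols = len(seq_x) - window + 1
    return [
        [
            # Average the pairwise scores along the diagonal.
            sum(seq_x[x + k] == seq_y[y + k] for k in range(window)) / window
            for x in range(cols)
        ]
        for y in range(rows)
    ]


# Self-comparison: the main diagonal scores 1.0 everywhere.
scores = dot_matrix("GATTACA", "GATTACA", window=3)
for row in scores:
    print(" ".join(f"{value:.2f}" for value in row))
```

Mapping each score to a shade of gray (darker for higher values) would give the two-dimensional projection described above.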
Dust
A standalone application that looks for low complexity sequences.
DWGA (Derived from Whole Genome Alignments)
Human versus Chimpanzee exception: The human versus chimpanzee orthologue predictions were obtained in a completely different manner. Since the current chimpanzee genome sequence assembly is the result of low-coverage sequencing, the assembled sequence is of too poor quality to generate a gene set on the classical Ensembl gene build pipeline. The chimpanzee gene set produced by Ensembl has rather been generated by "projecting" human genes to the chimpanzee genome through whole genome BLASTz alignments between both species and filtering for orthologue sequence alignments. The result of this procedure is de facto the human - chimpanzee orthologue set that has been Derived from Whole Genome Alignments (DWGA). See the Prediction Method section on a relevant Ensembl Gene Report page.
EMBL (European Molecular Biology Laboratory)
Europe's primary nucleotide sequence resource. The main sources of the DNA and RNA sequences in the database are submissions from individual researchers, genome sequencing projects and patent applications.
ENCODE (ENCyclopedia Of DNA Elements)
The ENCODE project uses defined regions of the Human genome to test and evaluate different methods and technologies for finding various functional elements in Human DNA. The two main criteria for manually selected regions were presence of well-studied genes or other known sequence elements, and existence of a substantial amount of comparative sequence data. A total of 14.82Mb of sequence was manually selected using this approach, consisting of 14 targets that range in size from 500kb to 2Mb.
Ensembl genes
Set of Ensembl gene predictions based on experimental evidence from protein sequences and/or near-full-length cDNA available from public sequence databases. "Ensembl known genes" are predicted on the basis of species-specific database entries from manually curated UniProt/Swiss-Prot, partially manually curated RefSeq and UniProt/TrEMBL databases. Predictions of "Ensembl novel genes" are based on other experimental evidence such as protein and cDNA sequence information from related species. Golden genes are the result of a merge between a Havana transcript (manually curated) and an Ensembl gene prediction from the annotation pipeline. See "havana transcript".
Eponine is a probabilistic method for detecting transcription start sites (TSS) in mammalian genomic sequence, with good specificity and excellent positional accuracy. Eponine models consist of a set of DNA weight matrices recognizing specific sequence motifs. Each of these is associated with a position distribution relative to the TSS.
EST (Expressed Sequence Tags)
Coarse sequence reads from flanking vector regions into the inserts of cDNA libraries. ESTs act as physical markers for cloning and full length sequencing of the cDNAs of expressed genes. Typically identified by purifying mRNAs, converting to cDNAs, and then sequencing a portion of the cDNAs. Usually short, single reads from a tissue or stage in development.
EST genes
Set of Ensembl gene predictions solely based on EST evidence. The process of EST gene prediction uses a combination of Exonerate, BLAST and Est2Genome to map ESTs onto the genomic sequence. Redundant ESTs are merged, before GenomeWise is used to assign 5' and 3' UTRs to the longest found ORF. See Eyras et al. for a more complete explanation of the EST gene prediction process.
Exon
The part of the genomic sequence that remains in the transcript (mRNA) after introns have been spliced out.
Exonerate
A fast gapped DNA-DNA alignment algorithm. It can be used for aligning various types of sequences such as genomic DNA, cDNAs/ESTs, and proteins.
Feature
Any annotation on a specific location in the genomic sequence.
FGENES, also known as Find Genes, is a Human gene predictor that is based on pattern recognition of different types of exons, promoters and poly A signals. It is built based on linear discriminant functions of internal, 5'-coding, and 3'-coding exon recognition. It is designed to find the optimal combination of these components and to construct a set of gene models along a given sequence.
Flanking sequence
Sequence 5' or 3' to a DNA or RNA sequence of interest (for example gene, transcript, SNP or repeat).
GeneWise is a sequence analysis tool for comparing proteins or profile HMMs to DNA sequences, allowing for introns and frameshifts. The Wise2 package was written by Ewan Birney. More information about the package can be obtained at:
Genotype
Specific alleles present in an individual's genome, or the genetic makeup of one organism.
Genscan
An application for identification of complete gene structures in genomic DNA (Burge, C. and Karlin, S. (1997) Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268, 78-94). The splice site models used are described in more detail in: Burge, C. B. (1998) Modeling dependencies in pre-mRNA splicing signals. In Salzberg, S., Searls, D. and Kasif, S., eds. Computational Methods in Molecular Biology, Elsevier Science, Amsterdam, pp. 127-163.
GO (Gene Ontology)
An organized hierarchy of terms produced by the Gene Ontology Consortium, used to describe biological processes, cellular component, and molecular function. Specific GO terms are as follows: Molecular Function Ontology. Tasks performed by individual gene products; examples are carbohydrate binding and ATPase activity. Biological Process Ontology. Broad biological goals, such as mitosis or purine metabolism, are accomplished by ordered assemblies of molecular functions. Cellular Component Ontology. Subcellular structures, locations, and macromolecular complexes; examples include nucleus, telomere, and origin recognition complex. A gene may be indexed under many GO terms depending on GO classification system. A gene product has one or more molecular functions and is used in one or more biological processes; it might be associated with one or more cellular components. For instance, cytochrome c can be described by the molecular function term electron transporter activity, the biological process terms oxidative phosphorylation and induction of cell death, and the cellular component terms mitochondrial matrix and mitochondrial inner membrane.
Haplotype
A set of genes or markers on one chromosome that are inherited together. Often refers to SNPs that are closely linked (i.e. have a high linkage disequilibrium (LD) value, and are inherited together.)
Havana transcript
A transcript resulting from manual curation of genome annotation for vertebrate species. The Havana team is a subset of Vega (See "Vega genes".)
Homologues
Specific sequences that are descended from the same common sequence in an ancestor. See orthologues or paralogues.
Identity
A measure of how similar two sequences are, specifically, what percent of amino acids are the same in type and position between the two sequences.
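As a rough illustration, the percentage can be computed from two already-aligned sequences as sketched below. The function name `percent_identity` is hypothetical, and the choice of denominator (aligned positions where neither sequence has a gap) is an assumption; tools differ on this point.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percentage of compared positions holding the same residue.

    Assumption: positions with a gap in either sequence are skipped, so
    the denominator is the number of residue-to-residue positions.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


print(percent_identity("MKT-LV", "MRTALV"))  # 80.0
```

In the example, five positions are compared (the gapped column is skipped) and four of them match.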
In-del (Insertion-deletion)
A mutation or polymorphism in which one or more base pairs have been inserted into or removed from a genomic sequence.
InterPro is an integrated resource for protein families, domains and sites, combining information from several different protein signature databases. InterPro IDs are linked to the summary of information about that domain or family. InterPro is managed by EBI. A number of databases (SwissProt, TrEMBL, PROSITE, PRINTS, Pfam, ProDom, SMART, TIGRFAMs, PIR SuperFamilies and SUPERFAMILY) with different approaches to biological information are used to derive protein signatures. ProteinView, GeneView and DomainView provide links to the relevant InterPro entries.
Intron
The part of the genomic sequence that is transcribed and then spliced out of the transcript (mRNA). Noncoding.
Jalview is a multiple alignment editor used by the EBI ClustalW server and the Pfam protein domain database. It is also available as a general purpose alignment editor.
Known genes
Known genes are predicted transcripts that have been mapped by Ensembl to full-length or near-full-length protein sequences already available in the public sequence databases.
LD (Linkage Disequilibrium)
A measure of how often two SNPs or specific sequences are inherited together.
Linkage
A measure of how often features (genes, specific sequences) on a chromosome are inherited together.
Low-complexity region
A region in the sequence with a biased composition (i.e. repeated sequences or residues.)
Marker
A short sequence whose placement on the genome is known.
MBRH (Multiple Best Reciprocal Hit)
When, due to gene duplications, there are multiple 'best' hits with identical score, E-value, % identity and % positivity, one is unable to pick a unique orthologue for a gene. This results in more complex graphs of 'best' relationships. This often occurs when different genes have identical translations, which could be due to a duplication event, an assembly error, or chance. On average 3% of the genes have an identical translation to some other gene either within its genome or in another genome.

  • MBRH / DUP 1.# - MBRH set where in one genome there is only one gene, but the other genome has multiple genes, all on the same chromosome and within 1.5 megabases of each other. This could be due to recent gene duplication events where sequences have not diverged, or to a mis-assembly of the genome sequence leading to artificial, apparent gene duplications. (e.g. MBRH / DUP 1.2 or MBRH / DUP 1.4)
  • MBRH / SYN - This is a more complex MBRH set where there are multiple genes in each genome split across multiple chromosomes. The one(s) labeled MBRH/SYN satisfies both the MBRH criteria and the RHS search criteria.
  • MBRH / COMPLEX - This is a more complex MBRH set where there are multiple genes in each genome split across multiple chromosomes. This MBRH pair does not satisfy the RHS criteria.
MGI (Mouse Genome Informatics)
Houses a database that provides integrated access to data on the genetics, genomics, and biology of mouse (Mus musculus).
Microsatellite
A region in the genomic sequence containing short tandem repeats.
miRNA (micro RiboNucleic Acid)
Single-stranded RNA, typically 21-23 nucleotides long, that is thought to be involved in gene regulation (especially inhibition of protein expression).
Motif
A conserved region of sequence with a specific function/ structure.
Mutation
A modification (insertion, deletion, or alteration) in the genomic or amino acid sequence.
Novel genes
Novel genes are genes that have been predicted by Ensembl on the basis of similarity to protein or cDNA sequences. They cannot be mapped with confidence to existing entries in any public sequence database.
OMIM (Online Mendelian Inheritance in Man)
A genetic knowledge database, first published in 1966 as Mendelian Inheritance in Man (MIM; currently in its 12th edition), that includes information and references, including links to MEDLINE and to sequence data. Ensembl links both to OMIM entries for any gene, where available, and to a subset of this database, the OMIM Morbid Map, which presents the syndrome and disease-associated genes described in OMIM.
ORF (Open Reading Frame)
A DNA sequence that possesses a start codon and a large window of sequence with no stop codon that could potentially code for a protein.
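The definition can be made concrete with a small forward-strand scan. This is a simplified sketch, not a gene finder: the function name `find_orfs` and the `min_codons` threshold are hypothetical, it assumes the standard start codon ATG and stop codons TAA, TAG and TGA, and it ignores the reverse strand.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 2) -> list[tuple[int, int]]:
    """Return (start, end) pairs for ORFs in the three forward frames.

    `start` is the index of the ATG; `end` is exclusive and includes the
    stop codon. `min_codons` counts codons from the ATG up to, but not
    including, the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a candidate reading frame
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next start codon
    return orfs


sequence = "CCATGAAATTTTAGGG"
for start, end in find_orfs(sequence):
    print(start, end, sequence[start:end])  # 2 14 ATGAAATTTTAG
```

A realistic search would use a much larger `min_codons` value, since short stop-free windows occur frequently by chance.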
Orthologues are genes derived from a common ancestor through vertical descent (or speciation) and can be thought of as the direct evolutionary counterpart. In contrast, paralogues are genes within the same genome that have evolved by duplication.
Paralogues
Sequences (homologues) that have evolved by duplication.
PDB (Protein Data Bank)
A repository for 3-D biological macromolecular structure data. PDB archives protein structures deduced from crystallography and nuclear magnetic resonance (NMR) experiments. The Protein Data Bank (PDB) is operated by Rutgers, The State University of New Jersey; the San Diego Supercomputer Center at the University of California, San Diego; and the Center for Advanced Research in Biotechnology of the National Institute of Standards and Technology -- three members of the Research Collaboratory for Structural Bioinformatics (RCSB). The RCSB PDB is supported by funds from the National Science Foundation, the Department of Energy, and the National Institutes of Health.
Pfam is a large collection of multiple sequence alignments and hidden Markov models covering many common protein domains and families. Pfam can be used to view the domain organization of proteins, to view multiple alignments, protein domain architectures, protein structures, and species distributions.
Pmatch is a fast, exact matching program for aligning protein sequences with either protein or DNA sequence.
Pre-release site
Initial annotations of upcoming Ensembl genomes, usually without gene predictions or validation, are regularly made available on the pre-release site.
The PRINTS protein fingerprint database is a compendium of protein fingerprints. A fingerprint is a group of conserved motifs used to characterise a protein family; its diagnostic power is refined by iterative scanning of a SwissProt/TrEMBL composite. Usually the motifs do not overlap, but are separated along a sequence, though they may be contiguous in 3D-space. Fingerprints can encode protein folds and functionalities more flexibly and powerfully than can single motifs, full diagnostic potency deriving from the mutual context provided by motif neighbors.
PROSITE is a database of protein families and domains run by the Expert Protein Analysis System (ExPASy) proteomics server of the Swiss Institute of Bioinformatics (SIB). It consists of biologically significant sites, patterns and profiles that help to reliably identify to which known protein family (if any) a new sequence belongs.
Processed pseudogenes result from reverse transcription of a mature mRNA and reinsertion into the genomic sequence. Ensembl detects potential processed pseudogenes among the Ensembl transcript predictions. For more information about how Ensembl predicts pseudogenes, see the following: Curwen et al.
QTL (Quantitative Trait Locus)
Genetic loci where allelic variation is associated with variation in a quantitative trait (e.g. blood pressure). The presence of QTL is inferred from genetic mapping. Total variation is partitioned into components linked to a number of discrete, mapped chromosome markers described by statistical association to quantitative variation in a particular phenotypic trait that is thought to be controlled by the cumulative action of alleles at multiple loci.
Query %id
Query %id indicates the percentage of the query sequence matching the target sequence.
Reference SNP (Reference Single Nucleotide Polymorphism)
A SNP assigned to eliminate redundancy in the NCBI dbSNP database. All SNPs submitted at the position of a reference SNP are given the reference SNP identifier (a number preceded by 'rs').
NCBI's Reference Sequences (RefSeq) database is a curated database of GenBank's genomes, mRNAs and proteins. RefSeq attempts to provide a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, tRNA, and protein products, providing a stable reference for gene identification and characterization, mutation analysis, expression studies, polymorphism discovery, and comparative analyses.
Repeat
Repetitive DNA in which the same sequence occurs multiple times.
Repeat Masking
The method by which repeated sequences and low-complexity regions are hidden, usually used in searches by alignment and homology-searching programs.
RepeatMasker (AFA Smit & P Green) is a standard software tool used in computational genomics to identify repetitive elements and low-complexity sequences.
RH map (Radiation Hybrid map)
Technique for identifying landmarks (STS) every 100 kb in the human genome. The ordering is relative to the frequency with which the landmarks are separated by radiation-induced breaks. This frequency is assayed by analysing a panel of human-hamster hybrid cell lines.
rRNA (ribosomal RiboNucleic Acid)
rRNAs are a main component of ribosomes and make up at least 80% of the RNA molecules found in a typical eukaryotic cell. The rRNA molecules have several roles in protein synthesis.
SARA (Same As Reference Assembly)
An acronym used to indicate a SNP (single nucleotide polymorphism) that has the same sequence as the strain used in the assembly.
Scaffolds are sets of ordered, oriented contigs positioned relative to each other by mate pairs whose reads are in adjacent contigs.
scRNA (small cytoplasmic RiboNucleic Acid)
Small cytoplasmic RNAs are mainly found in the cytoplasm and sometimes the nucleus of a eukaryote. They are usually complexed with proteins in scRNPs.
Seg divides sequences into contrasting segments of low-complexity and high-complexity. Low-complexity segments defined by the algorithm represent "simple sequences" or "compositionally-biased regions". Segment lengths and the number of segments per sequence are determined automatically by the algorithm.
SGD (Saccharomyces Genome Database)
Canonical database for the molecular biology and genetics of Saccharomyces cerevisiae.
Shotgun method
(also whole genome shotgun) Semi-automated sequencing method that involves randomly sequencing cloned pieces of the genome (size selected, usually 2, 10, 50 and 150 kb), with no prior knowledge of their location. The clones are then sequenced from both ends. The two ends of the same clone are referred to as mate pairs. The distance between two "mate pairs" can be inferred if the library size is known and has a narrow window of deviation. This approach can be contrasted with "directed" strategies, in which pieces of DNA from known chromosomal locations are sequenced.
Shotgun sequencing
A method in which small, random DNA sequences are generated that overlap. The fragments are sequenced and the full, connected sequence determined through the overlaps.
The SignalP application predicts the presence and location of signal peptide cleavage sites in amino acid sequences from different organisms: Gram-positive prokaryotes, Gram-negative prokaryotes, and eukaryotes. The method incorporates a prediction of cleavage sites and a signal peptide/non-signal peptide prediction based on a combination of several artificial neural networks. Signal peptides indicate a protein that will be secreted. Prediction of signal peptides is quite accurate; however, care must be exercised and these regions should be verified by other means. (Henrik Nielsen, Jacob Engelbrecht, Søren Brunak and Gunnar von Heijne. Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Protein Engineering 10: 1-6; 1997)
Similarity
How well one sequence matches another determined by calculation by an alignment program of identical and conserved residues.
SNAP
  1. (Synonymous/Non-synonymous Analysis Program) A program which calculates synonymous and non-synonymous substitution rates based on a set of codon-aligned nucleotide sequences, based on the method of Nei and Gojobori, incorporating a statistic developed in Ota and Nei.
  2. An ab initio gene prediction program developed by Ian Korf that models protein coding sequences in genomic DNA by means of hidden Markov models.
SNP (Single Nucleotide Polymorphism)
SNPs are common variations that occur in DNA with a frequency of about 0.1%. Ensembl displays SNPs obtained from dbSNP (the SNP repository maintained by NCBI), the Human Genic Bi-Allelic Sequences Database (HGVBase) and The SNP Consortium Ltd. (TSC).
SSAHA (Sequence Search and Alignment by Hashing Algorithm)
A search designed to detect exact matches, or nearly exact matches, in DNA or protein databases. The SSAHA search has been optimized for alignments of high percentage identity and displays as results the most significant matches for ungapped alignments between sequences. Each exact match in an SSAHA alignment is analogous to finding a high-scoring segment pair in BLAST. A number of consecutive matches on a contig may represent features of a gene such as exons or 5' and 3' untranslated regions, depending on the nature of the query sequence.
STS markers
STS markers are short sequences of genomic DNA that can be uniquely amplified by the polymerase chain reaction (PCR) using a pair of primers. Because each is unique, STSs are often used in linkage and radiation hybrid mapping techniques. STSs serve as landmarks on the physical map of the human genome.
Assemblies consist of sequence contigs combined into scaffolds, also known as supercontigs. Supercontigs are combined and ordered according to their orientation and linking information provided by mated sequences from the ends of genomic sub-clones. For some species, supercontigs are combined into ultracontigs, in which neighboring supercontigs are organized into their proper order and orientation using linking information provided by the physical map of BAC clones independently assembled using restriction fragment patterns and the FPC program.
The term synteny was originally defined to mean that two gene loci share the same chromosome. In a genomic context we refer to syntenic regions if both sequence and gene order is conserved between two (closely related) species.
tandem repeats
Multiple copies of the same base sequence on a chromosome; used as markers in physical mapping.
Target % id
Target %id indicates the percentage of the target sequence matching the query sequence.
translation start site
The position within an mRNA at which synthesis of a protein begins. The translation start site is usually an AUG codon, but occasionally, GUG or CUG codons are used to initiate protein synthesis.
tRNA (Transfer RNA)
A class of ribonucleic acids with triplet nucleotide sequences that are complementary to the triplet nucleotide coding sequences of mRNA. The role of tRNAs in protein synthesis is to bond with amino acids and transfer them to the ribosomes, where proteins are assembled according to the genetic code carried by mRNA.
TSC (The SNP Consortium)
A non-profit foundation that makes SNP-related information available to the public without intellectual property restrictions.
UniGene is an experimental system for automatically partitioning GenBank sequences into a non-redundant set of gene-oriented clusters. Each UniGene cluster contains sequences that represent a unique gene, as well as related information such as the tissue types in which the gene has been expressed and map location.
UniProt (Universal Protein Resource) is the world's most comprehensive catalogue of information on proteins. UniProt/Swiss-Prot is a curated protein sequence database that provides a high level of annotation, a minimal level of redundancy and a high level of integration with other databases. SwissProt is maintained collaboratively by the Swiss Institute for Bioinformatics (SIB) and the European Bioinformatics Institute (EBI).
SPTrEMBL is a subset of TrEMBL (Translated EMBL database) containing the computer-annotated protein translations of all coding sequences (CDS) present in the EMBL nucleotide sequence database that are not yet incorporated into the UniProt/SwissProt database.
UniSTS is a NCBI resource for non-redundant Sequence Tagged Sites (STS) markers. For each marker, UniSTS displays the primer sequences, product size, and mapping information, as well as cross references to dbSNP, RHdb, GDB, MGD, etc. The marker report also lists GenBank and RefSeq records that contain the primer sequences determined by ePCR.
UTR (Untranslated Region)
The 5' UTR is the portion of an mRNA from the 5' end to the position of the first codon used in translation. The 3' UTR is the portion of an mRNA from the position of the last codon that is used in translation to the 3' end.
Vega genes
Vega genes from the Vertebrate Genome Annotation (VEGA) database include manual annotation of specific Human, Mouse, and Zebrafish clones. Annotation is performed on a clone-by-clone basis using a combination of similarity searches against DNA and protein databases and ab initio gene prediction applications (Genscan, Fgenes). Comparative analysis using vertebrate datasets is used to aid novel gene discovery. The data gathered in these steps is then used to manually annotate the clone, adding gene structures, descriptions and poly-A features. The annotation is based on supporting evidence only.
YAC (Yeast Artificial Chromosome)
Originating from a bacterial plasmid, a YAC contains a yeast centromeric region (CEN), a yeast origin of DNA replication, a cluster of unique restriction sites, a selectable marker, and a telomere region at the end of each arm. YACs are capable of cloning extremely large segments of DNA (over 1 megabase long) into a host cell, where the DNA is propagated along with the other chromosomes of the yeast cell.
ZFIN (ZebraFish Information Network)
A database for the zebrafish model organism that holds information on wild-type stocks, mutants, genes, gene expression data, and map markers.