Microarray Probeset Mapping
Ensembl annotates microarray probe sets on the genome sequence when the manufacturer discloses the individual probe set sequences for a given microarray. For Affymetrix® gene expression arrays, the mapping process is a two-step procedure. In the Ensembl browser, the individual probe positions on the current genome assembly, as determined in step one, can be displayed in the 'Region in detail' view. Probes that match a transcript can be seen in the 'Oligos' view, accessible from a transcript page.
Step One: Genome Sequence Mapping
In the first step, individual probes (oligonucleotides) are mapped to the genome sequence. The Ensembl analysis and annotation pipeline uses the Exonerate sequence comparison and alignment tool (Slater et al., 2005) and tolerates at most a 1 bp mismatch between the probe and the genome sequence assembly. Probes that hit 100 or more locations (e.g. suspected Alu repeats) are discarded and not stored in the database.
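The two filters described above can be sketched in a few lines of Python. This is only an illustration of the stated thresholds, not the pipeline's actual code: the function name, the constants and the (probe_id, location, mismatches) tuple layout are assumptions made for the example, and the alignment itself (done by Exonerate in the pipeline) is taken as given.

```python
from collections import defaultdict

# Thresholds taken from the text above; the names are hypothetical.
MAX_MISMATCHES = 1    # an alignment may contain at most 1 bp mismatch
MAX_LOCATIONS = 100   # probes with 100 or more hits are discarded


def filter_probe_hits(hits):
    """Apply the mismatch and promiscuity filters to raw probe alignments.

    `hits` is an iterable of (probe_id, location, mismatches) tuples,
    an assumed layout standing in for real alignment output.
    Returns a dict mapping each retained probe to its accepted locations.
    """
    accepted = defaultdict(list)
    for probe_id, location, mismatches in hits:
        # Keep only alignments within the mismatch tolerance.
        if mismatches <= MAX_MISMATCHES:
            accepted[probe_id].append(location)

    # Drop probes that hit too many locations (e.g. suspected repeats).
    return {
        probe_id: locations
        for probe_id, locations in accepted.items()
        if len(locations) < MAX_LOCATIONS
    }
```

In this sketch a probe with 99 accepted locations is kept, while one with exactly 100 is dropped, matching the "100 or more" wording.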
Step Two: Ensembl Transcript Mapping
In the second step, we aim to associate microarray probe sets with Ensembl transcript predictions (ENST...). Individual probes are grouped into probe sets, and it is generally required that more than 50% of the probes in a probe set hit a given transcript sequence. Probe set sizes are determined dynamically on a per probe set basis, rather than taking the array-wide value documented by the manufacturer. Transcript cDNA sequences are extended by the length of the UTR. Where annotated UTRs are absent, a default UTR length is used, calculated for both the five prime and three prime UTRs as the greater of the mean and the median length of all annotated UTRs for a given species. Probes mapping across exon boundaries are not currently captured, as the transcript annotations are based on the genomic mappings from step one.
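The default UTR length and the 50% rule can be illustrated with a short Python sketch. It is a hedged reading of the description above rather than the pipeline's implementation: both function names and their inputs (a list of annotated UTR lengths, and collections of probe identifiers) are assumptions introduced for the example.

```python
from statistics import mean, median


def default_utr_length(annotated_utr_lengths):
    """Default UTR length: the greater of the mean and the median.

    `annotated_utr_lengths` is an assumed input: the lengths of all
    annotated UTRs of one kind (five prime or three prime) for a species.
    """
    return max(mean(annotated_utr_lengths), median(annotated_utr_lengths))


def probeset_hits_transcript(probes_in_set, probes_hitting_transcript):
    """Return True if more than 50% of the probe set hits the transcript.

    The probe set size is taken from the set itself (per probe set),
    not from an array-wide value.
    """
    probe_set = set(probes_in_set)
    matched = probe_set & set(probes_hitting_transcript)
    return len(matched) > 0.5 * len(probe_set)
```

For example, a probe set of four probes with two probes hitting a transcript does not qualify under this reading, because exactly 50% is not "more than 50%"; three of four does.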