Ensembl BlastView Configuration

Contents

  1. Introduction
  2. Registering Sequence Databases
  3. Registering Search Methods
  4. Registering Method vs. Database Links
  5. Configuring the ENSEMBL_BLAST Database
  6. Conclusion

Introduction

This document provides details for configuring the Ensembl BlastView web interface.

Ensembl Configuration Overview

The Ensembl distribution is usually stored within a filesystem directory (hereafter referred to as $ENSEMBL_ROOT), e.g.:

$ setenv ENSEMBL_ROOT /usr/local/ensembl

Ensembl web site configuration data are stored in text files within $ENSEMBL_ROOT, in the following directory:

$ENSEMBL_ROOT/conf

Species-independent configuration is stored in the following file (hereafter referred to as MULTI.ini):

$ENSEMBL_ROOT/conf/ini-files/MULTI.ini

Species-default configuration is stored in the following file (hereafter referred to as DEFAULTS.ini):

$ENSEMBL_ROOT/conf/ini-files/DEFAULTS.ini

Species-specific configuration is stored in one file per species (hereafter referred to as <SPECIES>.ini), for example:

$ENSEMBL_ROOT/conf/ini-files/Homo_sapiens.ini
$ENSEMBL_ROOT/conf/ini-files/Danio_rerio.ini

The .ini files are separated into sections containing key-value pairs using the format:

[SECTION_HEADING]
key1 = value1
key2 = value2
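
As a minimal sketch of this format, the following Python snippet reads a small fragment with the standard-library configparser module. Ensembl itself reads these files with its own Perl code, so Python is used here purely for illustration; the read_ini helper and the sample text are assumptions made for this example.

```python
import configparser

# Hypothetical sample text using the section/key layout described above.
SAMPLE = """
[BLAST_DATASOURCES]
# DB_TYPE = DB_LABEL
LATESTGP = Genomic sequence
DEFAULT = LATESTGP
"""


def read_ini(text):
    """Return {section: {key: value}} for an .ini-style string."""
    parser = configparser.ConfigParser(interpolation=None, comment_prefixes=("#",))
    parser.optionxform = str  # keep keys such as LATESTGP in upper case
    parser.read_string(text)
    return {name: dict(parser[name]) for name in parser.sections()}


config = read_ini(SAMPLE)
print(config["BLAST_DATASOURCES"]["DEFAULT"])  # LATESTGP
```

Interpolation is switched off so that values containing a percent sign are returned unchanged.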

BlastView Configuration

BlastView configuration consists of the following components:

  • Register of sequence databases to search against
  • Register of methods (executables) that run the search
  • Register of search method vs. sequence database linkage
  • Location of MySQL database for search report storage (ENSEMBL_BLAST)

Databases and methods are generally species independent, and their configurations are therefore stored in the DEFAULTS.ini file. Method vs. sequence database linkage, however, is species specific, and stored in the <SPECIES>.ini files.

To keep the interface responsive, search reports are parsed once and the results stored in a database. The location of this database is configured in the MULTI.ini file.

Registering Sequence Databases

Sequence database configuration requires the following data:

  • A "type" string, used for grouping of databases across species and methods (e.g. CDNA_ALL, PEP_ALL).
  • A "label" string, used for display in the interface (e.g. "Ensembl cDNAs", "Ensembl Peptides").

To register databases, type/label pairs must be entered in the BLAST_DATASOURCES configuration section of DEFAULTS.ini. An additional DEFAULT key sets the type of the default database, e.g.:

[BLAST_DATASOURCES]
# Register of BlastView sequence databases. Use format:
# DB_TYPE = DB_LABEL
# DB_TYPE is the internal identifier (no whitespace)
# DB_LABEL is the human-readable label.
LATESTGP = Genomic sequence
CDNA_ALL = Ensembl cDNAs
CDNA_GENSCAN = Genscan cDNAs
PEP_ALL = Ensembl Peptides
PEP_GENSCAN = Genscan Peptides
# Default sequence database. Use format:
# DEFAULT = DB_TYPE
DEFAULT = LATESTGP

As for all Ensembl configuration sections, DEFAULTS.ini section headers must also be included in each relevant <SPECIES>.ini file. For example, Homo_sapiens.ini contains the following:

[BLAST_DATASOURCES]
# Register of BlastView sequence databases.
# Accept defaults

It is also possible to register species-specific database types in the corresponding <SPECIES>.ini files. However, as non-empty sections in <SPECIES>.ini take precedence over those in DEFAULTS.ini, one must copy all other types relevant to the species. For example, in Homo_sapiens.ini:

[BLAST_DATASOURCES]
# Homo_sapiens-specific database type:
HAPLO = Haplotype blocks
# Standard types copied from DEFAULTS.ini
LATESTGP = Genomic sequence
CDNA_ALL = Ensembl cDNAs
...

Given the problems with registering species-specific database types, it is probably best to simply register all types in DEFAULTS.ini. There is an alternative way of selecting database types per species, which is covered later.
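
The precedence rule described in this section can be sketched in Python as follows. This is an illustration only: the effective_section helper is hypothetical, and it assumes that a non-empty species section replaces the DEFAULTS.ini section wholesale, which is how this document describes the behaviour.

```python
def effective_section(defaults, species):
    """Resolve one section: a non-empty species section replaces the defaults wholesale."""
    return dict(species) if species else dict(defaults)


defaults = {"LATESTGP": "Genomic sequence", "CDNA_ALL": "Ensembl cDNAs"}
stub = {}  # an "accept defaults" stub section
override = {"HAPLO": "Haplotype blocks"}  # species-specific type only

print(effective_section(defaults, stub))
# {'LATESTGP': 'Genomic sequence', 'CDNA_ALL': 'Ensembl cDNAs'}
print(effective_section(defaults, override))
# {'HAPLO': 'Haplotype blocks'}
```

The second call shows why the standard types have to be copied into the species file: once the section is non-empty, nothing is inherited.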

Registering Search Methods

Within the context of BlastView, a search method is an algorithm/executable that takes a query sequence of a particular type (DNA/peptide) and a sequence database of a particular type (DNA/peptide), and computes some measure of sequence similarity between the two. This means that BLASTN, TBLASTX, BLASTP, SSAHA etc. are entirely separate at the configuration level.

Before use with BlastView, code must be written to "wrap" each executable in a Perl module with a "runnable" interface. These wrapper modules hide the differences in calling and report handling between methods from the BlastView interface code. Coding of these wrappers for different methods/system configurations is covered in the BlastView tech document.

The key attributes of BlastView methods are therefore:

  • A "type" string used for grouping methods across species, also used as human-readable label
  • A "module" string used to identify the appropriate wrapper class.

To register methods, type/module pairs must be entered in the ENSEMBL_BLAST_METHODS configuration section of DEFAULTS.ini, e.g.:

[ENSEMBL_BLAST_METHODS]
# Register of BlastView methods. Use format:
# ME_TYPE = ME_WRAP
# ME_TYPE is the internal identifier (no whitespace)
# ME_WRAP is the Bio::Tools::Run::Search wrapper class
BLASTN = ensembl_wublastn
BLASTX = ensembl_wublastx
BLASTP = ensembl_wublastp
TBLASTN = ensembl_wutblastn
TBLASTX = ensembl_wutblastx
SSAHA = ensembl_ssaha

For example, the ensembl_wublastn wrapper contains the logic to run a wu-blastn search using the Compaq BSUB job submission system as used for the Ensembl Blast cluster. Conversely, the ensembl_ssaha wrapper runs SSAHA searches over TCP/IP using a client-server model. Further wrappers are being developed to, for example, run blast searches on the same machine as the web server.

Unfortunately, it is not currently possible to override the blast method wrappers on a per-species basis. This is a known limitation of the system, and may be addressed in the future. However, you still need to add a "stub" ENSEMBL_BLAST_METHODS section for each species, e.g. within Homo_sapiens.ini:

[ENSEMBL_BLAST_METHODS]
# Register of BlastView methods.
# Accept defaults

In addition to the ENSEMBL_BLAST_METHODS section, there are several attributes in the general section that also affect method configuration. These are:

# Path to binaries on local machine
ENSEMBL_BINARIES_PATH = /usr/local/bin
# Path to binaries on remote machine
ENSEMBL_BLAST_BIN_PATH = /usr/remote/bin
# Path to blast databases
ENSEMBL_BLAST_DATA_PATH = /usr/remote/data
# Path to blast filter directory
ENSEMBL_BLAST_FILTER = /usr/remote/blast/filter
# Path to blast matrix directory
ENSEMBL_BLAST_MATRIX = /usr/remote/blast/matrix
# Path to RepeatMasker executable
ENSEMBL_REPEATMASKER = /usr/remote/RepeatMasker
# Names of method executables. Names correspond to method types in ENSEMBL_BLAST_METHODS
ENSEMBL_SSAHA_PROGRAM_NAME = ssahaClient3.1
ENSEMBL_BLASTN_PROGRAM_NAME = wublastn
ENSEMBL_BLASTX_PROGRAM_NAME = wublastx
ENSEMBL_BLASTP_PROGRAM_NAME = wublastp
ENSEMBL_TBLASTN_PROGRAM_NAME = wutblastn
ENSEMBL_TBLASTX_PROGRAM_NAME = wutblastx
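
Because the executable-name keys follow the pattern ENSEMBL_<METHOD>_PROGRAM_NAME, a quick consistency check is easy to script. The Python sketch below is not part of Ensembl; missing_program_names and the sample dictionaries are hypothetical, and the settings are assumed to have already been read into plain dictionaries.

```python
def missing_program_names(methods, general):
    """List registered method types that lack an ENSEMBL_<METHOD>_PROGRAM_NAME key."""
    return [m for m in methods if f"ENSEMBL_{m}_PROGRAM_NAME" not in general]


# Hypothetical, deliberately incomplete configuration.
methods = ["BLASTN", "SSAHA", "TBLASTX"]
general = {
    "ENSEMBL_BLASTN_PROGRAM_NAME": "wublastn",
    "ENSEMBL_SSAHA_PROGRAM_NAME": "ssahaClient3.1",
}

print(missing_program_names(methods, general))  # ['TBLASTX']
```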

Registering Method vs. Database Links

As the blast/ssaha databases are species-specific, they must be configured in their corresponding <Genus_species>.ini files.

Each of the method types registered in DEFAULTS.ini (e.g. BLASTN, SSAHA etc.) must have a corresponding section named [<METHOD>_DATASOURCES] in the <Genus_species>.ini file, e.g. [BLASTN_DATASOURCES].

The [<METHOD>_DATASOURCES] section contains:

  1. A DATASOURCE_TYPE = dna or DATASOURCE_TYPE = peptide key=value pair to specify the query type (dna/peptide) that the search method expects as input (see example below).
  2. <DATABASE> = <LOCATOR> (KEY = value) pairs which are links to one or more sequence databases.
    • <DATABASE> is one of the database types registered in DEFAULTS.ini (e.g. CDNA_ALL, LATESTGP, PEP_KNOWN)
    • <LOCATOR> refers to the filesystem or TCP/IP location of the database. For blast databases, <LOCATOR> can either be the full name of the file (see example below), or the file name can be replaced with %_, e.g.
      LATESTGP      = %_

      In the latter case, the file name is generated automatically on server startup, using files named Genus_species.Assembly.Release.sequencetype.subset.fa
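
The naming pattern for autogenerated file names can be illustrated with a short Python sketch. The blast_file_name helper is hypothetical, and how the server derives the sequence type and subset from a database type is not described here, so both are passed in explicitly.

```python
def blast_file_name(species, assembly, release, sequence_type, subset):
    """Build Genus_species.Assembly.Release.sequencetype.subset.fa from its parts."""
    return ".".join([species, assembly, str(release), sequence_type, subset, "fa"])


print(blast_file_name("Homo_sapiens", "NCBI36", 40, "cdna", "all"))
# Homo_sapiens.NCBI36.40.cdna.all.fa
```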

Example:

[BLASTN_DATASOURCES]
DATASOURCE_TYPE = dna
LATESTGP      = Homo_sapiens.NCBI36.40.dna.seqlevel.fa 
CDNA_ALL      = Homo_sapiens.NCBI36.40.cdna.all.fa
CDNA_ABINITIO = Homo_sapiens.NCBI36.40.cdna.abinitio.fa

[TBLASTX_DATASOURCES]
DATASOURCE_TYPE = dna
LATESTGP      = Homo_sapiens.NCBI36.40.dna.seqlevel.fa 
CDNA_ALL      = Homo_sapiens.NCBI36.40.cdna.all.fa
CDNA_ABINITIO = Homo_sapiens.NCBI36.40.cdna.abinitio.fa

[TBLASTN_DATASOURCES]
DATASOURCE_TYPE = peptide
LATESTGP      = Homo_sapiens.NCBI36.40.dna.seqlevel.fa 
CDNA_ALL      = Homo_sapiens.NCBI36.40.cdna.all.fa
CDNA_ABINITIO = Homo_sapiens.NCBI36.40.cdna.abinitio.fa

[BLASTP_DATASOURCES]
DATASOURCE_TYPE = peptide
PEP_ALL       = Homo_sapiens.NCBI36.40.pep.all.fa
PEP_ABINITIO  = Homo_sapiens.NCBI36.40.pep.abinitio.fa

[BLASTX_DATASOURCES]
DATASOURCE_TYPE = dna
PEP_ALL       = Homo_sapiens.NCBI36.40.pep.all.fa
PEP_ABINITIO  = Homo_sapiens.NCBI36.40.pep.abinitio.fa

[SSAHA_DATASOURCES]
DATASOURCE_TYPE = dna
LATESTGP = ssaha01:50001
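
The rules for [<METHOD>_DATASOURCES] sections lend themselves to an automated check. The following Python sketch is not an Ensembl tool; check_method_links and its inputs are hypothetical, and the configuration is assumed to have been loaded into dictionaries beforehand.

```python
VALID_QUERY_TYPES = {"dna", "peptide"}


def check_method_links(methods, databases, species_config):
    """Return a list of problems found in the [<METHOD>_DATASOURCES] sections."""
    problems = []
    for method in methods:
        section_name = f"{method}_DATASOURCES"
        section = species_config.get(section_name)
        if section is None:
            problems.append(f"missing section [{section_name}]")
            continue
        if section.get("DATASOURCE_TYPE") not in VALID_QUERY_TYPES:
            problems.append(f"[{section_name}] needs DATASOURCE_TYPE = dna or peptide")
        for key in section:
            if key != "DATASOURCE_TYPE" and key not in databases:
                problems.append(f"[{section_name}] links unregistered database {key}")
    return problems


# Hypothetical input: two registered methods, three registered database types.
methods = ["BLASTN", "BLASTP"]
databases = {"LATESTGP", "CDNA_ALL", "PEP_ALL"}
species_config = {
    "BLASTN_DATASOURCES": {
        "DATASOURCE_TYPE": "dna",
        "LATESTGP": "%_",
        "CDNA_ABINITIO": "Homo_sapiens.NCBI36.40.cdna.abinitio.fa",
    },
}

for problem in check_method_links(methods, databases, species_config):
    print(problem)
```

With this sample input the check reports the unregistered CDNA_ABINITIO link first, then the missing [BLASTP_DATASOURCES] section.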

Configuring the ENSEMBL_BLAST Database

To improve interface responsiveness, search reports are parsed once, and the results stored in a database. The location of this database is configured in the MULTI.ini file. Configuration is similar to that of all other Ensembl databases, with the name of the database being ENSEMBL_BLAST. The main difference is that ENSEMBL_BLAST is a read/write database, so the configured database user must have write permission. For example:

[databases]
ENSEMBL_BLAST = ensembl_blast
[ENSEMBL_BLAST]
HOST = localhost
PORT = 3306
USER = admin_user
PASS = secret

The schema of the ENSEMBL_BLAST database is distributed within the Perl code, rather than being available by FTP download. First, create an empty database (note that mysql requires no space between -p and the password), e.g.:

$ mysql -u admin_user -psecret -e "create database ensembl_blast"

Next, run the script that creates the database tables automatically, e.g.:

$ $ENSEMBL_ROOT/utils/utils/blast_database.pl

The correct execution of this script can be checked as follows:

$ mysqldump -u admin_user -psecret ensembl_blast

The above should result in output similar to the following. Note that the blast_result, blast_hit and blast_hsp tables are timestamped. Re-running the blast_database.pl script will cause the blast_result, blast_hit and blast_hsp tables to be rotated, meaning that, whilst old results/hsps/hits will still be available, new searches will be stored in the new tables. It is simple, therefore, to maintain the ENSEMBL_BLAST database by dropping old tables. See the $ENSEMBL_ROOT/utils/blast_cleaner.pl script for an example of how this is done.

--
-- Table structure for table "blast_table_log"
--
CREATE TABLE blast_table_log (
table_id int(10) unsigned NOT NULL auto_increment,
table_name varchar(32) default NULL,
table_type enum("TICKET","RESULT","HIT","HSP") default NULL,
table_status enum("CURRENT","FILLED","DELETED") default NULL,
use_date date default NULL,
create_time datetime default NULL,
delete_time datetime default NULL,
num_objects int(10) default NULL,
PRIMARY KEY (table_id),
KEY table_name (table_name),
KEY table_type (table_type),
KEY use_date (use_date),
KEY table_status (table_status)
) TYPE=MyISAM;

--
-- Table structure for table "blast_ticket"
--
CREATE TABLE blast_ticket (
ticket_id int(10) unsigned NOT NULL auto_increment,
create_time datetime NOT NULL default "0000-00-00 00:00:00",
update_time datetime NOT NULL default "0000-00-00 00:00:00",
ticket varchar(32) NOT NULL default "",
object longblob,
PRIMARY KEY (ticket_id),
UNIQUE KEY ticket (ticket),
KEY create_time (create_time),
KEY update_time (update_time)
) TYPE=MyISAM;

--
-- Table structure for table "blast_result20030821"
--
CREATE TABLE blast_result20030821 (
result_id int(10) unsigned NOT NULL auto_increment,
ticket varchar(32) default NULL,
object longblob,
PRIMARY KEY (result_id),
KEY ticket (ticket)
) TYPE=MyISAM;

--
-- Table structure for table "blast_hit20030821"
--
CREATE TABLE blast_hit20030821 (
hit_id int(10) unsigned NOT NULL auto_increment,
ticket varchar(32) default NULL,
object longblob,
PRIMARY KEY (hit_id),
KEY ticket (ticket)
) TYPE=MyISAM;

--
-- Table structure for table "blast_hsp20030821"
--
CREATE TABLE blast_hsp20030821 (
hsp_id int(10) unsigned NOT NULL auto_increment,
ticket varchar(32) default NULL,
object longblob,
chr_name varchar(32) default NULL,
chr_start int(10) unsigned default NULL,
chr_end int(10) unsigned default NULL,
PRIMARY KEY (hsp_id),
KEY ticket (ticket)
) TYPE=MyISAM MAX_ROWS=705032704 AVG_ROW_LENGTH=4000;
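
Since the rotated tables carry a date suffix (e.g. blast_hsp20030821), candidates for dropping can be identified from their names alone. The Python sketch below only lists such candidates and does not touch the database. It is not the logic of the cleaner script mentioned above, which is not reproduced here and may well consult blast_table_log instead; stale_tables and its parameters are hypothetical.

```python
import re
from datetime import date, timedelta

# Matches rotated tables such as blast_hsp20030821 (name + YYYYMMDD suffix).
TABLE_RE = re.compile(r"^blast_(result|hit|hsp)(\d{4})(\d{2})(\d{2})$")


def stale_tables(table_names, today, keep_days):
    """Return rotated tables whose date suffix is more than keep_days before today."""
    cutoff = today - timedelta(days=keep_days)
    stale = []
    for name in table_names:
        match = TABLE_RE.match(name)
        if match is None:
            continue  # blast_ticket, blast_table_log and anything unrecognised
        year, month, day = (int(part) for part in match.groups()[1:])
        if date(year, month, day) < cutoff:
            stale.append(name)
    return stale


tables = ["blast_ticket", "blast_table_log", "blast_hsp20030821", "blast_hsp20030901"]
print(stale_tables(tables, today=date(2003, 9, 5), keep_days=7))
# ['blast_hsp20030821']
```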

Conclusion

By following the above steps, the BlastView interface should be available for use on the next restart of the Ensembl web server. However, if the configuration .ini files have been changed, the following file should be deleted before server restart (this is a filesystem cache of the config):

$ENSEMBL_ROOT/conf/config.packed

For further details about how BlastView works, please see the BlastView technical documentation. (In preparation. Contact whs@sanger.ac.uk for a draft.)

Please address all comments/suggestions on this document, or on BlastView in general to: helpdesk@ensembl.org (cc whs@sanger.ac.uk).