PDF version

Table of Contents

Setting up your project

The easy way

If you're on a Mac or Linux machine, run setup.pl from your project root to

  • create data directory
  • download all sequences
  • download the query file
  • create the database
  • index the database
  • create stub files

Be sure to read through the script before executing it.

Installing BLAST

Download one of the installers from the NCBI

Project layout


├── data/
│   ├── NC_003997.fna
│   ├── NC_005945.fna
│   ├── NC_012581.fna
│   ├── NC_012659.fna
│   ├── anthraxDB.fna
│   └── queryNuc.txt
├── setup.pl
└── src/
    ├── query_nuc.pl
    └── query_prot.pl


READMEHow to run your program, etc.
data/Contains data for your project
setup.plAutomate setting up data
src/Source code
src/query_nuc.plPerforms nucleotide searches
src/query_prot.plPerforms protein searches

Downloading sequences

Downloading from the command-line

Use curl -O URL
Use wget URL
Use your browser

Preparing the DB

Need to merge all of our FNA files into anthraxDB.fna so BLAST can search it

Mac & Linux

cd data
cat *.fna > anthraxDB.fna


cd data
copy /a *.fna anthraxDB.fna

Index the database

From your project's working directory (i.e., above data), type

makeblastdb -in data/anthraxDB.fna -dbtype nucl

Download the query file

Save queryNuc.txt to data/ (we'll use it in the next tutorial)

Running BLAST

From the shell

Do a search against the database:

blastn -db data/anthraxDB.fna -query data/query.txt
  • look at the E-values (smaller is better)


Remember our goal

We are trying to automate queries against BLAST to determine whether ~100 fragments are from the A0248 strain.

Example Run

$ perl query_nuc.pl
best hit:       ambiguous
best hit:       ambiguous
best hit:       Bacillus anthracis str. A0248
best hit:       ambiguous

votes for ambiguous: XX
votes for Bacillus anthracis str. A0248: YY

Input & Output

results from a BLAST command
how many hits were ambiguous or conclusive

Process (pseudocode)

  1. Read the sequences in from the query file (queryNuc.txt)
  2. For each query sequence:
    1. Write the query to a text file to use in a BLAST command
    2. Run the BLAST command
    3. Parse the output to determine what strain
  3. Print out how many ambiguous and conclusive strains were found


At the top…

use strict;
use warnings;
my $db = 'data/anthraxDB.fna'; # path to BLAST DB
my $query_fn = 'tmp_query.txt'; # file we generate each seq
# this is defined on the "snippets" page -- paste
# the definition into your file
my @queries = fasta_to_array('data/queryNuc.txt');
my $num_ambiguous  = 0; # number of ambiguous hits
my $num_conclusive = 0; # number of conclusive hits
# command to r un for each query (i.e., 100 times)
my $cmd = "blastn -db $db -query $query_fn -evalue 1e-10";

In the middle…

for my $query (@queries) {
    # create a BLAST-query for the current sequence ($query)
    # ...write it to $query_fn

    # execute BLAST
    # my $result = `$cmd`;

    # see what we got (parse the output)

At the bottom…

print "Total ambiguous: $num_ambiguous\n";
print "Total conclusive: $num_conclusive\n";

Date: 2011-09-19 Mon

Author: Jon-Michael Deldin

Org version 7.7 with Emacs version 23