AQUESA: Aligning Q-grams Using Enhanced Suffix Arrays

Introduction

AQUESA is a suite of programs for high throughput sequence alignment.

Q-gram or seed-based alignment comes from the idea that a certain number of contiguous bases must match for an alignment to exist. If a sequence is L nucleotides long and it is desired that up to M mismatches (single base mismatch, single base insertion, single base deletion) be allowed, then the sequence can be divided into q-grams of length q = floor(L / (M + 1)). The M + 1 q-grams can be thought of as bins to be filled with M items. Regardless of how the M items are arranged, at least one bin will be empty. The empty bin corresponds to a perfect match. Sequence alignment then becomes an issue of finding q-grams in the reference sequence and query sequence that match and extending them to check for an alignment. The q-grams in the reference sequence should overlap while the q-grams in the query sequences should not (or vice-versa).

A suffix array is a permutation of the indices of a string so the suffixes starting at each index are in a sorted order. An enhanced suffix array augments this array with longest common prefix information.

The first step in the alignment is to index the reference sequence and select valid q-grams. The second step is to select valid q-grams in the query sequences and index them. (Preparing the query sequences this way is typically more efficient than selecting valid q-grams after indexing.) The third step is to scan the indices to find perfect seed matches and then extend in both directions to check for an alignment.

AQUESA was inspired by the PASS sequence alignment algorithm.

A poster about AQUESA was presented at ISMB 2008 (July 20-23) in Toronto.

Availability

This code is experimental. Send questions, comments and bugs to wilsonjr at umich dot edu.

Download AQUESA 0.01 for Linux x86-64 here.

Usage

  1. Convert the reference sequence and generate the reverse complement.
    ./superfasta -r reference.fasta reference
  2. Build the suffix array, inverse suffix array and longest common prefix array.
    ./sufarr -S128M reference
  3. Select q-grams.
    ./incvec -S128M -r 9 reference q9
  4. Convert the query sequences.
    ./superfasta query.fasta query
  5. Select q-grams and build the suffix array and longest common prefix array.
    ./sufarr -S128M -q9,q9 query
  6. Perform the alignment.
    ./align -n 2 -d 27 9 reference q9 query q9 result
  7. Convert the binary results to text.
    ./aligno reference query result > result.txt