NGSQC: Next Generation Sequencing Quality Control

While the accuracy and precision of deep sequencing data are significantly better than those of the earlier generation of hybridization-based high-throughput technologies, the digital nature of deep sequencing output often leads to unwarranted confidence in its reliability.

Next generation sequencing platforms have their own share of quality issues, and there can be significant lab-to-lab, batch-to-batch and even within-chip/slide variation.

The NGSQC pipeline provides a set of novel quality control measures for quickly detecting a wide variety of quality issues in deep sequencing data derived from two-dimensional surfaces, regardless of the assay technology used. It also enables researchers to determine whether the sequences behind their most interesting biological discoveries are affected by sequencing quality issues. NGSQC can thus help to ensure that biological conclusions, in particular those based on relatively rare sequences, are not artifacts of low-quality sequencing.

Our publication:
M. Dai, R. Thompson, C. Maher, R. Contreras, M. Kaplan, D. Markovitz, G. Omenn and F. Meng (2010) NGSQC: Cross-Platform Quality Analysis Pipeline for Deep Sequencing Data. BMC Genomics 11 (Suppl 4):S7
(http://www.biomedcentral.com/1471-2164/11/S4/S7)
 

Demo Output
Explanation of Output
Download
Usage

Demo Output:

1000 Genome project, NGSQC V0.4

1000 Genome project, NGSQC V0.3
 

Explanation of Output:

The following is a list of example graphic outputs of our pipeline and their explanations:

  1. Full Sample View: Provides a sample-level overview of several QC measures, including the distribution of base/color codes, genomic or other target hit counts under different mismatch counts and target hit levels (unique or multiple), sequencing read density, and quality score. Each measure is based on the corresponding average value for each tile/panel used by a sample. The results are presented in the same spatial layout as the deep sequencing assay to facilitate quick identification of trends/patterns of quality issues across the whole sample assay.

The Full Sample View includes the following graphs:

Full sample base/color code bias graphs: The heatmap color is determined by the percentage of the specific base/color code (A, C, T, G, 0, 1, 2, 3, or N for an unreadable base) in each tile of the corresponding graph.

base/color A_percent
base/color C_percent
base/color G_percent
base/color N_percent
base/color T_percent
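
The per-tile percentages behind these heatmaps can be sketched in Python as follows. This is a minimal illustration, not the NGSQC implementation: the function name `base_percent_by_tile` and the input format (pairs of tile identifier and read sequence) are assumptions made for the example.

```python
from collections import Counter, defaultdict

def base_percent_by_tile(reads):
    """Percent of each base/color code per tile.

    reads: iterable of (tile_id, sequence) pairs (hypothetical input format).
    Returns {tile_id: {symbol: percent_of_all_symbols_in_that_tile}}.
    """
    counts = defaultdict(Counter)
    for tile, seq in reads:
        counts[tile].update(seq.upper())  # works for A/C/G/T/N and 0/1/2/3 codes

    result = {}
    for tile, symbol_counts in counts.items():
        total = sum(symbol_counts.values())
        result[tile] = {s: 100.0 * n / total for s, n in symbol_counts.items()}
    return result
```

A tile whose N percentage stands out from its neighbors in the resulting table would show up as a hot spot in the N_percent heatmap.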

Full sample genome hit graphs: The heatmap color is based on the number of genome hits of the specific type (multiple or unique hits with 0, 1 or 2 mismatches) in each tile/panel. The genomehit_overall graph combines both multiple and unique hits with <=2 mismatches in a single graph. The default multiple-hit limit is <=10 hits on the target genome.
genomehit_multi_mis0
genomehit_multi_mis1
genomehit_multi_mis2
genomehit_overall
genomehit_unique_mis0
genomehit_unique_mis1
genomehit_unique_mis2
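
As an illustration of how reads could be sorted into these categories, here is a hedged Python sketch. It assumes each aligned read is summarized as (tile, number of genome hits, mismatch count), that reads with more than 10 hits or more than 2 mismatches are left out, and that one read contributes one count; the helper names are hypothetical and the pipeline's own bookkeeping may differ.

```python
from collections import Counter

MULTI_HIT_LIMIT = 10  # default multiple-hit limit stated above
MAX_MISMATCH = 2      # graphs cover 0, 1 and 2 mismatches

def hit_category(n_hits, mismatches):
    """Return the graph name a read falls under, or None if it is excluded."""
    if not 1 <= n_hits <= MULTI_HIT_LIMIT:
        return None
    if not 0 <= mismatches <= MAX_MISMATCH:
        return None
    kind = "unique" if n_hits == 1 else "multi"
    return f"genomehit_{kind}_mis{mismatches}"

def count_hits_by_tile(alignments):
    """alignments: iterable of (tile_id, n_hits, mismatches) (assumed format).

    Returns a Counter keyed by (tile_id, graph_name); every counted read also
    contributes to that tile's genomehit_overall entry.
    """
    counts = Counter()
    for tile, n_hits, mismatches in alignments:
        category = hit_category(n_hits, mismatches)
        if category is not None:
            counts[(tile, category)] += 1
            counts[(tile, "genomehit_overall")] += 1
    return counts
```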

Full sample quality score graph: The heatmap is created using the average quality score of all bases/color codes from a tile/panel.
qual_mean

Full sample read count graph: The total sequence read count in each tile/panel is used to generate this heatmap.
read_count
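
The two per-tile values above (average quality score and read count) can be computed in one pass. The following Python sketch assumes reads arrive as (tile, list of per-base quality scores); the function name and input format are illustrative only.

```python
from collections import defaultdict

def tile_quality_and_count(reads):
    """reads: iterable of (tile_id, [quality scores]) (hypothetical format).

    Returns {tile_id: {"read_count": int, "qual_mean": float or None}}, where
    qual_mean averages over all bases/color codes of all reads in the tile.
    """
    qual_sum = defaultdict(int)
    n_bases = defaultdict(int)
    n_reads = defaultdict(int)
    for tile, quals in reads:
        n_reads[tile] += 1
        qual_sum[tile] += sum(quals)
        n_bases[tile] += len(quals)

    return {
        tile: {
            "read_count": n_reads[tile],
            "qual_mean": qual_sum[tile] / n_bases[tile] if n_bases[tile] else None,
        }
        for tile in n_reads
    }
```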



  2. All Tiles/Panels Summary View: Presents the above quality measures aggregated over all tiles/panels by individual x-y location on the two-dimensional tile/panel surface. This set of QC graphs is designed for detecting QC problems that are repeated in every tile/panel, such as optical setup issues.

The All Tiles/Panels Summary View includes the following graphs:

All tiles/panels base/color code bias graphs: The heatmap color is determined by the percentage of the specific base/color code (A, C, T, G, 0, 1, 2, 3, or N for an unreadable base) at each x-y coordinate from all tiles/panels of the corresponding sample.

base/colorA_percent
base/colorC_percent
base/colorG_percent
base/colorN_percent
base/colorT_percent

All tiles/panels genome hit graphs: The heatmap color is based on the number of genome hits of the specific type (multiple or unique hits with 0, 1 or 2 mismatches) at each x-y coordinate from all tiles/panels. The genomehit_alltiles_overall graph combines both multiple and unique hits with <=2 mismatches in a single graph. The default multiple-hit limit is <=10 hits on the target genome.

genomehit_alltiles_multi_mis0
genomehit_alltiles_multi_mis1
genomehit_alltiles_multi_mis2
genomehit_alltiles_overall
genomehit_alltiles_unique_mis0
genomehit_alltiles_unique_mis1
genomehit_alltiles_unique_mis2

All tiles/panels quality score graph: The heatmap is created using the average quality score of all bases/color codes at each x-y coordinate from all tiles/panels in a sample.
qual_mean

All tiles/panels read count graph: The total sequence read count at each x-y coordinate from all tiles/panels is used to generate this heatmap.
read_count
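
The idea of the summary view, stacking every tile on top of the others and counting by within-tile position, can be sketched as below. This Python example covers the read count only and assumes non-negative within-tile x-y coordinates and a square bin of `bin_size` units; both the bin size and the function name are assumptions for illustration.

```python
from collections import Counter

def alltiles_read_count(reads, bin_size=100):
    """Read count per within-tile (x, y) bin, summed over all tiles.

    reads: iterable of (tile_id, x, y) with within-tile coordinates
    (hypothetical format). The tile id is deliberately ignored so that a
    defect at the same position in every tile accumulates in one bin.
    """
    counts = Counter()
    for _tile, x, y in reads:
        counts[(int(x // bin_size), int(y // bin_size))] += 1
    return counts
```

A bin that is depleted here but not in any single tile's map points to a problem repeated across tiles rather than a local defect.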



  3. Individual tile/panel QC: Individual tile QC maps can be used for identifying quality issues in individual tiles. To facilitate quick identification of problematic tiles/panels, we rank tiles/panels by the unevenness of two measures, the read count and the genomic hit count, across x-y coordinates. Currently we use a simple fixed grid for detecting unevenness.
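
The text does not specify the unevenness statistic, so the following Python sketch substitutes one plausible choice: divide each tile into a fixed n-by-n grid, count reads per cell, and use the coefficient of variation of the cell counts as the score. The grid size, the statistic and all function names are assumptions, not the pipeline's documented method.

```python
from statistics import mean, pstdev

def grid_counts(points, width, height, n=4):
    """Count (x, y) points in each cell of a fixed n-by-n grid over the tile."""
    cells = [0] * (n * n)
    for x, y in points:
        col = min(int(x * n / width), n - 1)   # clamp points on the far edge
        row = min(int(y * n / height), n - 1)
        cells[row * n + col] += 1
    return cells

def unevenness(points, width, height, n=4):
    """Coefficient of variation of the grid cell counts (0.0 means even)."""
    cells = grid_counts(points, width, height, n)
    average = mean(cells)
    return pstdev(cells) / average if average else 0.0

def rank_tiles(tiles, width, height, n=4):
    """tiles: {tile_id: [(x, y), ...]}. Returns tile ids, most uneven first."""
    return sorted(
        tiles,
        key=lambda tile: unevenness(tiles[tile], width, height, n),
        reverse=True,
    )
```

The same scoring could be applied to genomic hit positions instead of all read positions to obtain the second ranking.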



  4. Cycle-based QC plot: The average of quality measures from all tiles/panels, as well as from rows and columns of tiles/panels, plotted against the base/color position in the sequence reads. The cycle-based plots for all tiles/panels are designed to provide an overview of cycle-related quality variations for all sequence reads in the sample. The plots for individual columns and rows are for detecting outlier tile/panel columns/rows. These graphs help users identify not only sequencing cycle-specific issues but also spatial issues related to tile/panel rows and columns.

Cycle-based base/color bias plots: For detecting base biases in the sequencing process.

021009_s_2_1_all_base/colorA
021009_s_2_1_all_base/colorC
021009_s_2_1_all_base/colorG
021009_s_2_1_all_base/colorN
021009_s_2_1_all_base/colorT
021009_s_2_1_col_base/colorA
021009_s_2_1_col_base/colorC
021009_s_2_1_col_base/colorG
021009_s_2_1_col_base/colorN
021009_s_2_1_col_base/colorT
021009_s_2_1_row_base/colorA
021009_s_2_1_row_base/colorC
021009_s_2_1_row_base/colorG
021009_s_2_1_row_base/colorN
021009_s_2_1_row_base/colorT

Cycle-based mismatch percentage plots: For assessing mismatch rate (based on uniquely aligned sequences) at different positions in the sequences.

021009_s_2_1_all_misbp_pct
021009_s_2_1_col_misbp_pct
021009_s_2_1_row_misbp_pct
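
A per-position mismatch percentage can be sketched as follows. The Python example assumes each uniquely aligned read is available together with the reference sequence it aligned to, both in read orientation and without gaps; that input format and the function name are assumptions for illustration.

```python
def mismatch_pct_by_cycle(pairs):
    """Percent of mismatching bases at each read position.

    pairs: iterable of (read_sequence, reference_sequence) for uniquely
    aligned reads (hypothetical format, ungapped, same orientation).
    Returns a list where element i is the mismatch percentage at position i.
    """
    mismatches = []
    totals = []
    for read, ref in pairs:
        for i, (read_base, ref_base) in enumerate(zip(read, ref)):
            if i == len(totals):  # first time this position is seen
                totals.append(0)
                mismatches.append(0)
            totals[i] += 1
            if read_base != ref_base:
                mismatches[i] += 1
    return [100.0 * m / t for m, t in zip(mismatches, totals)]
```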

Cycle-based quality score plots: For visualizing average quality score at each base/color code location in the sequences.

021009_s_2_1_all_qual_mean
021009_s_2_1_col_qual_mean
021009_s_2_1_row_qual_mean
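
The "all", "col" and "row" variants differ only in how reads are grouped before averaging. A minimal Python sketch, assuming each read carries a group label (for example a tile column index, or a single constant label for the "all" plot) and a list of per-cycle quality scores; the function name and input format are hypothetical.

```python
from collections import defaultdict

def qual_mean_by_cycle(reads):
    """Average quality score at each cycle, computed separately per group.

    reads: iterable of (group, [quality scores]) (hypothetical format).
    Returns {group: [mean quality at cycle 0, cycle 1, ...]}.
    """
    sums = defaultdict(list)
    counts = defaultdict(list)
    for group, quals in reads:
        group_sums, group_counts = sums[group], counts[group]
        for i, q in enumerate(quals):
            if i == len(group_sums):  # extend when a longer read appears
                group_sums.append(0)
                group_counts.append(0)
            group_sums[i] += q
            group_counts[i] += 1
    return {
        g: [total / n for total, n in zip(sums[g], counts[g])]
        for g in sums
    }
```

Plotting one curve per group makes an outlier column or row visible as a curve that separates from the rest.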

  5. Target hit plot: These graphs present sequence alignment results across the target genome or transcriptome sequences. The x-axis is the target location scaled to the display. The y-axis is the sequence count at each target location on the positive strand (positive values) and on the negative strand (negative values). They are useful for identifying uneven distribution of sequences on the targets and can help to identify sequence structural differences between the sample and the reference genome/transcriptome.
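
The signed, display-scaled counts described above can be sketched like this. The Python example assumes 0-based alignment start positions, a strand given as "+" or "-", and a fixed number of display bins; these conventions and the function name are assumptions rather than the pipeline's actual data model.

```python
def strand_coverage(hits, target_length, n_bins=10):
    """Bin alignment starts along a target, keeping strands separate.

    hits: iterable of (position, strand) with 0-based positions and strand
    "+" or "-" (hypothetical format).
    Returns (plus, minus): plus holds counts >= 0 per bin, minus holds
    counts <= 0 per bin, so both can be drawn around a shared zero line.
    """
    plus = [0] * n_bins
    minus = [0] * n_bins
    for position, strand in hits:
        b = min(position * n_bins // target_length, n_bins - 1)
        if strand == "+":
            plus[b] += 1
        else:
            minus[b] -= 1
    return plus, minus
```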


  6. QC for user-defined sequence lists (Link 1 and Link 2): If a user analyzes lists of sequences related to specific biological conclusions, the resulting QC data will be listed under the above categories in the output. The side-by-side presentation of the user-defined sequences with the corresponding QC graphs from all sequences enables users to quickly detect whether sequences related to a specific biological conclusion come from low-quality regions of sequencing.


  7. Library QC for paired-end/mate-pair sequencing: For detecting problems in paired-end or mate-pair libraries.


Library Overview: The paired-end/mate-pair library overview graph is a bar chart presenting the percentage of good pairs (correct orientation on the same chromosome), unpaired reads, chimeric pairs from different chromosomes, chimeric pairs with wrong orientation on the same chromosome, and chimeric pairs whose distance is less than or greater than the user-defined library fragment range.
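
The categories in this bar chart can be expressed as a small decision procedure. The Python sketch below assumes each end is either unaligned (None) or a (chromosome, start, strand) tuple, and that the expected orientation is opposite strands unless `same_strand_expected` is set; which orientation counts as correct depends on the library type, so that parameter, the category labels and the function names are all assumptions.

```python
from collections import Counter

def classify_pair(end1, end2, min_dist, max_dist, same_strand_expected=False):
    """Assign one read pair to a library overview category.

    end1, end2: None if unaligned, else (chromosome, start, strand).
    min_dist, max_dist: user-defined library fragment range.
    """
    if end1 is None or end2 is None:
        return "unpaired"
    if end1[0] != end2[0]:
        return "chimeric_different_chromosome"
    if (end1[2] == end2[2]) != same_strand_expected:
        return "chimeric_wrong_orientation"
    distance = abs(end1[1] - end2[1])
    if distance < min_dist:
        return "chimeric_too_short"
    if distance > max_dist:
        return "chimeric_too_long"
    return "good"

def library_overview(pairs, min_dist, max_dist, same_strand_expected=False):
    """Percentage of pairs in each category, as plotted in the bar chart."""
    counts = Counter(
        classify_pair(a, b, min_dist, max_dist, same_strand_expected)
        for a, b in pairs
    )
    total = sum(counts.values())
    return {category: 100.0 * n / total for category, n in counts.items()}
```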


Distance Distribution: The pair distance distribution plot can be used to judge whether the matched paired-end/mate-pair reads exhibit the correct distance distribution. The distance between the two ends of each good pair is calculated as the starting position of the first end minus the starting position of the second end, so the distance values can be negative. We also plot pairs hitting different strands of the target separately. As a result, the pair distance distribution plot can also be used to detect strand bias.
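
The signed distance and the per-strand separation can be sketched in Python as below. The example assumes each good pair is given as (start of first end, start of second end, strand), where the strand label is whichever one the plot is split on, and that distances are grouped into bins of `bin_size` bases; the bin size and function name are assumptions.

```python
from collections import Counter

def pair_distance_histogram(pairs, bin_size=50):
    """Histogram of signed pair distances, kept separately per strand.

    pairs: iterable of (start_first, start_second, strand) for good pairs,
    with strand "+" or "-" (hypothetical format).
    Distance is start_first - start_second, so it can be negative.
    Returns {"+": Counter, "-": Counter} keyed by the lower edge of each bin.
    """
    histogram = {"+": Counter(), "-": Counter()}
    for start_first, start_second, strand in pairs:
        distance = start_first - start_second
        bin_start = distance // bin_size * bin_size  # floors negatives too
        histogram[strand][bin_start] += 1
    return histogram
```

If the two strand histograms differ markedly in height or shape, that asymmetry is the strand bias the plot is meant to reveal.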

Download: free for academic users; contact <brainarray at umich.edu> for commercial use.

NGSQC v0.4 (We gratefully acknowledge Dr. Nandi Tannistha and Dr. Patrick Tan at the Genome Institute of Singapore for providing the SOLiD panel arrangement file.)


Usage: please refer to the 'Readme' file in the zip file for details.

The following software is required to run the NGSQC pipeline:
1. gnuplot. Most Linux distributions have it in their repositories; it just needs to be installed.
2. bowtie. See http://bowtie-bio.sourceforge.net/index.shtml
3. sed and awk. Most Linux distributions include them in the default installation.
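
Before running the pipeline, the presence of these tools on the PATH can be verified with a few lines of Python. This is an independent sketch and not a copy of what 'make check' does; the executable names are taken from the list above, and the `which` parameter exists only so the lookup can be replaced in tests.

```python
import shutil

# Executable names assumed from the requirements list above.
REQUIRED_TOOLS = ("gnuplot", "bowtie", "sed", "awk")

def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools from `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]
```

Calling `missing_tools()` returns an empty list when everything is installed, and otherwise the names that still need to be installed.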

A simple usage example:
1. Download 'Pipeline with Sample data' and extract it.
2. Go to the folder 'ngsqc_<VERSION>'.
3. Run 'make check' to check whether all required software is installed.
4. Run 'make'.