polySNP: an analysis tool for quantitative sequencing

Damon P. Little¹ and Gerod S. Hall²

¹Lewis B. and Dorothy Cullman Program for Molecular Systematic Studies, The New York Botanical Garden, Bronx, NY, USA
²Department of Ecology and Evolutionary Biology, Cornell University, Ithaca, NY, USA

description

polySNP is a PERL script for automatically extracting peak area or height data for multiple Single Nucleotide Polymorphisms (SNPs) from sequence chromatograms. The data can automatically be transformed with user–input linear standard curves to yield relative template concentrations. This script will allow quantitative sequencing data to be more widely employed for the estimation of relative template frequency in pooled DNA samples. polySNP outputs a CSV (comma separated values) listing of SNP peak areas. The script is distributed under the GNU General Public License (GPL).

background

Increasingly, direct sequencing is being used to estimate relative template frequency in pooled DNA samples—particularly when specific real–time PCR primers and/or TaqMan probes cannot be designed (Amos et al. 2000; Higuchi et al. 2000; Pozhitkov et al. 2005; Wilkening et al. 2005). Upon the generation of a sequence chromatogram, the relative frequency of each template in the pooled sample can be estimated by measuring relative peak areas or heights. Such measurements can be accomplished using the output from PHRED or BIO::SCF—a PERL interface to io_lib.

script features

The inputs to polySNP are:
(1) A reference alignment containing two sequences in FASTA format. The SNPs that differentiate these sequences are inferred by polySNP. The positions of the SNPs in the reference alignment along with the sequence labels are used for output. Therefore, in order to produce easily comparable data all chromatograms for a particular research project should be processed with the same reference alignment.
(2) A sequencing chromatogram of experimental data made from a pooled DNA sample—ABI, ALF, CTF, SCF, and ZTR file formats are supported.
(3) Optionally, a user–defined standard curve can be input. The standard curve is read via a comma–separated–values text file containing the SNP position (matched to the reference file), the reference base that the curve was fit to, the slope of the line (in decimal notation), and the y–intercept (usually 0). Currently, only linear standard curves are supported. Raw data and ratios are always output, so later adjustments to the standard curve do not necessitate reprocessing of the chromatograms with polySNP.
(4) The user may specify which of four possible analysis procedures polySNP should conduct. The procedures vary by the use of peak area (measured by PHRED) versus peak height (measured either by PHRED or BIO::SCF) and the method of chromatogram trimming. As invoked by polySNP, PHRED trims the chromatograms using a “-trim_cutoff” value of 0.10 (other values can be specified by the user with the “-c” argument). Either PHRED or qclip may be used to trim chromatograms for BIO::SCF.
(5) Additional command–line arguments can optionally be used to specify the format of the output (“-l” outputs a line of data labels before outputting the data and “-a” outputs the alignment of the experimental data to the reference sequences).
The output from polySNP is in the form of comma–separated–values text.

algorithm

When PHRED is used exclusively, the algorithm used by polySNP is:
(1) Produce .seq, .phd, and .poly files by running PHRED on each chromatogram file.
(2) Align a set of pre–aligned reference sequences to the .seq file with MUSCLE (a FASTA format file containing the called bases). The alignment of the reference sequences to the experimental data uses the “-profile” option of MUSCLE so as not to disrupt the reference alignment.
(3) Use the alignment to locate dimorphic Single Nucleotide Polymorphism (SNP) positions in the .phd file (a text file containing the called bases, a quality assessment for each base, and location of the peak in the chromatogram).
(4) Use the peak location to find the corresponding line in the .poly file (a text file containing peak areas and heights for each chromatogram position).
(5) Select the peak areas or heights. Although empirical tests comparing the ratio of peak height versus peak area have not found substantial differences between the two measures (Leclair et al. 2004 Wilkening et al. 2005), the ratios output by polySNP may noticeably differ depending upon whether area or height are used when data are retrieved from “noisy” chromatograms.
(6) Confirm that the peaks correspond to the known variation in the SNP (using the reference sequences). PHRED only reports the areas for the “primary” and “secondary” peaks which may, or may not, correspond to the two possibilities for a given SNP. If either of the values do not match the variation enumerated in the reference alignment, polySNP will not output area ratios for that SNP. In addition, a “noisy” chromatogram may feature peaks that span multiple sequencer scan intervals. When this occurs, polySNP will issue a warning and refuse to output peak ratios for that SNP. In contrast, it is possible to retrieve peak heights for all four bases, at a given site As a result polySNP will always output peak height ratios—even from “noisy” chromatograms. The output of peak height data under these circumstances is not necessarily a positive attribute, as the data may be untrustworthy.
(7) Calculate the relative peak areas or heights for each sequence contributing to the SNP.
(8) Optionally, use a predetermined standard curve to correct for bias in the ratio of peak areas or heights.

When BIO::SCF is used, an analogous algorithm is applied—the only differences are that the peak heights are extracted directly from the chromatogram file rather than the PHRED output and the user can specify whether PHRED or qclip is used to trim the chromatogram. Peak heights retrieved from chromatograms differ somewhat depending upon whether PHRED or BIO::SCF are used.

script usage

polySNP -r reference_file -t trace_file [-s standard_curve_file] [-l] [-a] [-p 0|1|2|3] [-c 0.XX]
            “reference_file” is a FASTA format file containing two aligned sequences (SNPs will be inferred)
            “trace_file” is a chromatogram file readable by PHRED or convert_trace
            “standard_curve_file” is a CSV text file containing reference position (in “reference_file”), reference base, slope (decimal notation), y–intercept
            -l outputs a line of data labels before outputting the data
            -a outputs the alignment along with the data
            -p 0 base call and trim with PHRED, use PHRED values for peak area measurement (default)
            -p 1 base call and trim with PHRED, use PHRED values for peak height measurement
            -p 2 base call and trim with PHRED, use BIO::SCF values for peak height measurment
            -p 3 use existing base calls, trim with qclip, use BIO::SCF values for peak height measurement
            For -p 0, 1, or 2 an optional -c 0.XX argument can be appended to specify the PHRED “-trim_cutoff” value. If no -c 0.XX value is specified, the script defaults to 0.10.

requirements

PERL interpreter
MUSCLE
convert_trace (only used if non–SCF files are used)
qclip (only used if PHRED is not used for clipping)
PHRED (only used if BIO::SCF is not used)
BIO::SCF (only used if PHRED is not used)

citation

Little, D. P. and G. S. Hall. 2006. polySNP: an analysis tool for quantitative sequencing. Program distributed by the authors.

download

polySNP
example input/output