Open source software for systematics research

Researchers at NYBG have developed open source software to assist in the analysis of DNA sequence data and other aspects of systematics research. The scripts are designed to reformat DNA sequence alignments so that they can be used with a variety of downstream analysis programs; identify specimens using their DNA sequences (DNA barcoding); analyze DNA sequencing chromatograms to assess their quality; and streamline the process of compiling taxonomic monographs.

polySNP: an analysis tool for quantitative sequencing

polySNP is a Perl script for automatically extracting peak area or height data for multiple Single Nucleotide Polymorphisms (SNPs) from sequence chromatograms. The data can automatically be transformed with user-input linear standard curves to yield relative template concentrations. This script will allow quantitative sequencing data to be more widely employed for the estimation of relative template frequency in pooled DNA samples. polySNP outputs a CSV (comma-separated values) listing of SNP peak areas. The script is distributed under the GNU General Public License (GPL).
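
The standard-curve step described above can be sketched in Python. This is only an illustration, not the polySNP code: the function names, the two-allele peak layout, and the form of the curve (estimate = slope * raw fraction + intercept) are assumptions made for the example.

```python
import csv
import io
from typing import Dict


def allele_fraction(peak_a: float, peak_b: float) -> float:
    """Share of the total signal at one SNP contributed by allele A."""
    total = peak_a + peak_b
    if total <= 0:
        raise ValueError("peak areas must sum to a positive value")
    return peak_a / total


def apply_standard_curve(raw_fraction: float, slope: float, intercept: float) -> float:
    """Apply a user-supplied linear curve (assumed form: slope * x + intercept)."""
    return slope * raw_fraction + intercept


def to_csv(estimates: Dict[str, float]) -> str:
    """Write one row per SNP, mirroring the idea of a CSV listing."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["snp", "relative_concentration"])
    for name, value in estimates.items():
        writer.writerow([name, f"{value:.3f}"])
    return buffer.getvalue()


# Hypothetical peak areas (allele A, allele B) and curves (slope, intercept).
peaks = {"snp1": (300.0, 100.0), "snp2": (50.0, 150.0)}
curves = {"snp1": (1.0, 0.0), "snp2": (0.8, 0.1)}

estimates = {
    name: apply_standard_curve(allele_fraction(a, b), *curves[name])
    for name, (a, b) in peaks.items()
}
print(to_csv(estimates))
```

Each SNP gets its own curve here because the text speaks of standard curves in the plural; a single shared curve would work the same way.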

Citation:
Little, D. P. and G. S. Hall. 2006. polySNP: an analysis tool for quantitative sequencing. Program distributed by the authors.

Simple Indel coding

2matrix.pl is a Perl script for merging and translating phylogenetic datasets. Input datasets can be either DNA/RNA/amino acid alignments (FASTA format) or morphological matrices (an xread file or a CSV table; see below for the specification of the latter). Available output formats include xread, extended PHYLIP (RAxML), and NEXUS (the latter being compatible with Garli and MrBayes). Indel characters can be coded using the "simple" gap coding method of Simmons and Ochoterena (2000. Gaps as characters in sequence-based phylogenetic analysis. Systematic Biology 49: 369-381).

Citation:
Salinas, N. R. and D. P. Little. 2014. 2matrix: A Utility for Indel Coding and Phylogenetic Matrix Concatenation. Applications in Plant Sciences 2(1): 1300083.

degenbar: a simple SIDE (sequence identification engine)

This pair of scripts is designed to transform a set of FASTA formatted sequences into a queryable DNA–BAR reference database for DNA barcoding. These scripts were first used by Little and Stevenson (2007). The script “degenbar-in.pl” generates an input file from FASTA formatted sequences for the “degenbar” executable* which implements the DNA–BAR method of DasGupta et al. (2005). The resulting matrix of distinguishers is then queried by the “degenbar” script. The query sequence is scored for the presence or absence of each distinguisher (10–50 nucleotides in length). The reference sequence or sequences with the greatest number of matching presence/absence scores are taken to be the identification.
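
The scoring step described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the degenbar code: the function names are hypothetical, distinguishers are matched as exact substrings (no degenerate bases), and the toy distinguishers are 4 nucleotides long for readability rather than the 10–50 nucleotides used by the real method.

```python
from typing import Dict, List, Tuple


def presence_profile(sequence: str, distinguishers: List[str]) -> List[bool]:
    """Presence (True) or absence (False) of each distinguisher in a sequence."""
    return [d in sequence for d in distinguishers]


def identify(
    query: str, references: Dict[str, str], distinguishers: List[str]
) -> Tuple[List[str], int]:
    """Return the best-matching reference name(s) and their score.

    The score is the number of distinguishers whose presence/absence
    state is the same in the query and in the reference.
    """
    query_profile = presence_profile(query, distinguishers)
    scores = {
        name: sum(
            q == r
            for q, r in zip(query_profile, presence_profile(seq, distinguishers))
        )
        for name, seq in references.items()
    }
    best = max(scores.values())
    return sorted(name for name, s in scores.items() if s == best), best


# Toy reference set and distinguishers (hypothetical data).
references = {
    "species_A": "ACGTACGTTTGACCA",
    "species_B": "ACGTACGAAAGACCA",
    "species_C": "TTTTACGAAAGAGGG",
}
distinguishers = ["CGTT", "GAAA", "GACC", "TTTT"]

print(identify("GGACGTACGAAAGACCATT", references, distinguishers))
```

Ties are returned together, which mirrors the statement that more than one reference sequence can be taken as the identification.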

Citation:
Little, D. P. 2007. degenbar: a simple SIDE (sequence identification engine). Program distributed by the author.

BRONX: Barcode Recognition Obtained with Nucleotide eXposés version 2.0

This pair of scripts is designed to transform a set of FASTA formatted sequences into a queryable BRONX reference database for DNA barcoding. This updated version of BRONX is noticeably faster (particularly for database creation) and more portable (it is a pure Perl implementation). The increased speed comes at the cost of increased memory usage, but this should not be an issue for most users. BRONX2 retains, and in some cases improves upon, the performance of the original BRONX. The original version of BRONX remains available.

BRONX2 features new output options (in HTML and plain text format) that better explain the barcode identification and the reference sequences used to make it. There is also an output format that is more amenable to use in pipelines.

Citations:
Little, D. P. 2012. BRONX2: Barcode Recognition Obtained with Nucleotide eXposés. Program distributed by the author.
Little, D. P. 2011. DNA barcode sequence identification incorporating taxonomic hierarchy and within taxon variability. PLoS ONE. 6 (8): e20552.

Simple pairwise matching for DNA barcoding

This suite of scripts is designed to evaluate the relative discriminatory power of a set of loci for DNA barcoding. These scripts were used by the Plant Working Group to evaluate candidate loci for land plant DNA barcoding. Please see the README.txt file for instructions and examples.

Citation:
Little, D. P. 2009. Simple pairwise matching for DNA barcoding. Program distributed by the author.

DOME ID (Diagnostic Oligo Motifs for Explicit IDentification): a simple SIDE (sequence identification engine)

This set of scripts is designed to transform a set of FASTA formatted sequences into a queryable DNA barcoding reference database. These scripts were first used by Little and Stevenson (2007).

Citation:
Little, D. P. 2007. DOME ID (Diagnostic Oligo Motifs for Explicit IDentification): a simple SIDE (sequence identification engine). Program distributed by the author.

ATIM (Alignment-free Tree-based Identification Method): a simple SIDE (sequence identification engine)

This set of scripts is designed to transform a set of FASTA formatted sequences into a queryable DNA barcoding reference database. These scripts were first used by Little and Stevenson (2007).

Citation:
Little, D. P. 2007. ATIM (Alignment-free Tree-based Identification Method): a simple SIDE (sequence identification engine). Program distributed by the author.

B: an index of sequence quality and contig overlap for DNA barcoding

Barcode quality index (B) is a novel, unified measure of sequence quality and contig overlap tailored to the needs of DNA barcoding.

Citations:
Little, D. P. 2010. B: an index of sequence quality and contig overlap for DNA barcoding. Program distributed by the author.
Little, D. P. 2010. A unified index of sequence quality and contig overlap for DNA barcoding. Bioinformatics. 26 (21): 2780–2781.

Electric LAMP: virtual Loop–mediated isothermal AMPlification

This informatic tool identifies combinations of template(s) and primer set(s) that comply with LAMP primer specifications. Search queries can be exact or approximate, depending on the parameters set by the user. Approximate matching is carried out using the agrep algorithm, as implemented in TRE (on Ubuntu, install it by typing “sudo apt-get install tre-agrep”). A graphical interface written in Perl/Tk is also available.

Citation:
Salinas, N. R. and D. P. Little. 2012. Electric LAMP: virtual Loop–mediated isothermal AMPlification. ISRN Bioinformatics. 2012: 696758.

monographaR: An R package to facilitate the production of plant taxonomic monographs

Taxonomic data are essential in many fields of biology. However, the production of taxonomic treatments is usually very time consuming and few programs are available for taxonomic use. The R-based package monographaR automates the production of some ubiquitous components of plant taxonomic studies, generating a monograph skeleton and figures for publication. The package includes functions to convert tables into taxonomic descriptions, lists of collectors, and lists of examined specimens. Wrapper functions to batch-generate phenology histograms and distributional maps are also available. Automated workflows, such as the one provided by the monographaR package, can facilitate the production of taxonomic treatments, potentially speeding up the generation of taxonomic data and preventing formatting mistakes in repetitive tasks that are otherwise performed manually.

Citations:
Reginato, M. 2016. monographaR: An R package to facilitate the production of plant taxonomic monographs. Brittonia DOI 10.1007/s12228-015-9407-z: 1–5.
Reginato, M. 2015. monographaR: Taxonomic Monographs Tools. Program distributed by the author.

Monographia: Open-source software to automate revisionary systematic studies

Monographia is a web application designed to automate descriptive systematic studies. It will be used to collect, view, and curate specimen data; to link specimen data to individual morphological and molecular observations; and to provide dynamically compiled data summaries in multiple formats, diagnostic interactive graphics, and cross references to other data types. Monographia is far more than a dynamic compilation tool: It will facilitate collaborative research among systematists and allow data to be exchanged across projects and databases—eliminating duplicated effort and improving data quality. It will preserve information that is usually lost soon after publication (e.g. individual morphological measurements) thus facilitating data reuse. It will enable simultaneous communication to a variety of audiences using an integrated multilingual design.

Citation:
Little, D. P. 2016. Monographia. Program distributed by the author.