DNA Sequence Analysis and Manipulation

Patterns of DNA polymorphism observed within populations can be used to understand the processes of demography, adaptation, and ultimately speciation. These patterns are typically quantified using the site frequency spectrum, whose shape is understood theoretically under genetic drift, under various demographic scenarios, and under some forms of natural selection. DNA Sequence Analysis and Manipulation (DnaSAM) addresses the challenges of data manipulation, summary statistic estimation, and statistical hypothesis testing for large-scale resequencing projects. The program can perform a large number of standard and newly designed tests of neutrality on multiple sequence alignments of resequenced gene loci. In addition, it allows hypothesis testing against complex, user-specified null models.
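To make the site frequency spectrum concrete, here is a minimal Python sketch that tabulates a folded spectrum from a small alignment and derives two standard summary statistics from it: Watterson's estimator and nucleotide diversity, the two quantities that Tajima's D compares. It is an illustration only, not DnaSAM's own code (DnaSAM is a Perl program). The function names are hypothetical, and the sketch assumes an alignment with no gaps or missing data.

```python
from collections import Counter


def folded_sfs(alignment):
    """Folded site frequency spectrum of a list of equal-length sequences.

    Entry k-1 counts the biallelic sites whose minor allele is carried by
    k sequences, for k = 1 .. n // 2. Assumes no gaps or missing data.
    """
    n = len(alignment)
    sfs = [0] * (n // 2)
    for column in zip(*alignment):
        counts = Counter(column)
        if len(counts) != 2:  # skip monomorphic and multi-allelic sites
            continue
        sfs[min(counts.values()) - 1] += 1
    return sfs


def watterson_theta(sfs, n):
    """Watterson's estimator: segregating sites divided by a_n."""
    a_n = sum(1.0 / i for i in range(1, n))
    return sum(sfs) / a_n


def nucleotide_diversity(sfs, n):
    """Mean number of pairwise differences, computed from the spectrum."""
    pairs = n * (n - 1) / 2
    return sum(k * (n - k) * c for k, c in enumerate(sfs, start=1)) / pairs


toy = ["ACGTA", "ACGTT", "ATGTA", "ATGAA"]  # made-up example data
spectrum = folded_sfs(toy)
print(spectrum, watterson_theta(spectrum, len(toy)),
      nucleotide_diversity(spectrum, len(toy)))
```

In this toy alignment two sites are singletons and one site splits the sample two against two, so the spectrum is [2, 1].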

DnaSAM is written in Perl (version 5) and has been tested on both Apple’s OS 10.5 and Fedora Core Linux 8. It should be compatible with any environment where Perl 5 is installed.

DnaSAM uses output from Richard Hudson’s ms program, and he has made the source code and documentation for ms available at the following site:
http://home.uchicago.edu/~rhudson1/source/mksamples.html
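
For readers unfamiliar with ms, its default output is plain text: a header (the command line and the random seeds), followed by one block per simulated replicate that starts with `//`, gives a `segsites:` count and a `positions:` line, and then lists one 0/1 haplotype string per sampled chromosome (1 marks the derived allele). The Python sketch below reads that layout. It only illustrates the format and does not show how DnaSAM itself reads ms output. The function names are hypothetical, and it assumes ms was run without options that add extra output, such as `-T`.

```python
def parse_ms(text):
    """Split default ms output into a list of replicate dictionaries."""
    replicates = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line == "//":  # start of a new replicate
            current = {"segsites": 0, "positions": [], "haplotypes": []}
            replicates.append(current)
        elif current is None or not line:
            continue  # header lines (command, seeds) or blank lines
        elif line.startswith("segsites:"):
            current["segsites"] = int(line.split()[1])
        elif line.startswith("positions:"):
            current["positions"] = [float(x) for x in line.split()[1:]]
        else:
            current["haplotypes"].append(line)  # one 0/1 string per chromosome
    return replicates


def derived_counts(replicate):
    """Number of chromosomes carrying the derived allele at each site."""
    haps = replicate["haplotypes"]
    return [sum(h[j] == "1" for h in haps)
            for j in range(replicate["segsites"])]
```

A replicate with no segregating sites has only the `segsites: 0` line, so its positions and haplotype lists stay empty.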

If you already have ms installed on your computer, you can download the dnasam.<version>_source.tar.gz file and follow the instructions in the dnasam.pdf manual to install DnaSAM and create a symbolic link to the ms binary.

As a convenience, we’ve compiled ms binaries for both GNU/Linux and Apple’s OS 10.5 and included them in the dnasam.<version>_linux_x86.tar.gz and dnasam.<version>_osx_x86.tar.gz files. If you are running one of these operating systems and do not wish to compile and install ms yourself, you can download one of these files instead.

Version 20100621 update notice: The calculation of Fu’s Fs has been updated to handle a rare numerical error that occurred in some larger data sets and caused the program to exit without printing results. Thanks to Karen Lundy at UCLA for reporting this error and for helping us reproduce it.

Version 20100503 update notice: This version fixes a bug affecting alignments that include an outgroup sequence and have S greater than zero for the ingroup. When missing data in the outgroup caused all polymorphic sites in the ingroup to be dropped from the analyses that require an outgroup, all p-values were reported as ‘NA’, including p-values that should be calculated for the ingroup regardless of whether an outgroup is present. Note that no erroneous numerical p-values were reported, but p-values that should have been reported were not. (Many thanks to Elena Mosca for discovering and reporting this.)

Version 20100409 update notice: This version fixes a bug that affects alignments in which an outgroup is present and data are missing at biallelic sites in the ingroup. These sites were not being dropped properly when the Xi array was created, which affected the calculations of ThetaNe, ThetaH and ThetaL (and in turn those of D_outgroup, F_outgroup, H, normH and normE). Users who previously ran DnaSAM on data fitting this description will want to upgrade and rerun their analyses. (Many thanks to Jingjing Li at the University of Toronto, who recognized the problem and provided us with a data set that allowed us to debug and fix this issue.)
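
To illustrate why dropped sites matter for these statistics, the Python sketch below builds an unfolded spectrum (xi, where entry i counts sites at which i ingroup sequences carry the derived allele) and computes the standard estimators usually written theta_H and theta_L from it. It is a hedged illustration, not DnaSAM's code: the function names, the set of characters treated as missing, and the rule of discarding any site with missing data in either the ingroup or the outgroup are assumptions made for this example. The rules DnaSAM actually applies are described in its manual.

```python
MISSING = set("N?-")  # assumed codes for missing data and gaps


def unfolded_sfs(ingroup, outgroup):
    """xi[i-1] = number of sites where i ingroup sequences are derived."""
    n = len(ingroup)
    xi = [0] * (n - 1)
    for j, ancestral in enumerate(outgroup):
        column = [seq[j] for seq in ingroup]
        if ancestral in MISSING or any(b in MISSING for b in column):
            continue  # drop the whole site if any data are missing
        alleles = set(column)
        if len(alleles) != 2 or ancestral not in alleles:
            continue  # keep only biallelic sites polarized by the outgroup
        derived = sum(1 for b in column if b != ancestral)
        xi[derived - 1] += 1
    return xi


def theta_h(xi):
    """theta_H = 2 * sum(i^2 * xi_i) / (n * (n - 1))."""
    n = len(xi) + 1
    return 2.0 * sum(i * i * c for i, c in enumerate(xi, start=1)) / (n * (n - 1))


def theta_l(xi):
    """theta_L = sum(i * xi_i) / (n - 1)."""
    n = len(xi) + 1
    return sum(i * c for i, c in enumerate(xi, start=1)) / (n - 1)


ingroup = ["ACGTA", "ACGTT", "ATGNA", "ATGAA"]  # made-up data; N is missing
xi = unfolded_sfs(ingroup, "ACGTA")
print(xi, theta_h(xi), theta_l(xi))
```

If the fourth site, which contains an N, were counted instead of dropped, xi would change and both estimators would shift with it. This is how a mishandled site in the spectrum propagates to every statistic built on it.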