SFS

SFS is a simple Perl script that calculates the site frequency spectrum from a polytable (a table showing polymorphic SNPs).
NEW and improved version 2.2

Documentation


SFS.pl outputs the site frequency spectrum (SFS) from a polytable. SFS.pl also lets you scale the SFS of your observed sample to match an arbitrary sample size. Scaling down (to a sample size smaller than the observed data) is straightforward and follows the hypergeometric scaling suggested in Nielsen et al. 2005. To scale the data up to a larger sample size, the code uses similar methods (manuscript in prep). Note, however, that scaling up adds a lot of noise to your data.
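As an illustration of the downscaling step, here is a minimal Python sketch of hypergeometric projection. It is not the code inside SFS.pl; the function name `downscale_sfs` and its arguments are made up for this example. It assumes each site with derived count i in the observed sample of size N contributes probability C(i, j) * C(N - i, n - j) / C(N, n) to class j of the new sample of size n.

```python
from math import comb

def downscale_sfs(derived_counts, n_obs, n_new):
    """Project per-site derived counts onto a smaller sample size (hypothetical helper)."""
    if n_new > n_obs:
        raise ValueError("this sketch only scales down")
    sfs = [0.0] * (n_new + 1)
    total = comb(n_obs, n_new)  # ways to draw n_new chromosomes out of n_obs
    for i in derived_counts:
        for j in range(n_new + 1):
            # math.comb(n, k) is 0 when k > n, so impossible classes contribute nothing
            sfs[j] += comb(i, j) * comb(n_obs - i, n_new - j) / total
    return sfs

# Per-site derived counts reported for example.txt with the -f option
counts = [1, 3, 1, 3, 6, 2, 4, 6, 8, 2]
scaled = downscale_sfs(counts, n_obs=13, n_new=10)
print(" ".join(str(x) for x in scaled))
```

For the example.txt counts, class 0 comes out as 156/286 (about 0.5455) and class 1 as 724/286 (about 2.5315), which agrees with the --unrounded output shown in the Example section. The classes sum to S=10 because every site is distributed across the new classes.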

Points to be aware of:

Please check that your polytables are formatted similarly to the examples below. Minor formatting differences in polytables can cause erroneous results.

Because of the sampling probabilities used here and in Nielsen et al. 2005, there is often a nonzero probability of observing a site at a count of 0 (missing) or n (fixed) in a newly scaled sample of size n.

If no ancestral state is available at a site (it is missing, a gap, or an "N"), SFS.pl will assume the minor allele is the derived state.

Fixed sites should be excluded from a polytable.

If there are more than two mutations segregating at a site, SFS.pl will arbitrarily ignore the third mutation it comes across (i.e. treat it as ancestral).

If you calculate a folded SFS and scale downward, the program may return values higher than half the size of the scaled-down sample.
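The last point can be illustrated with a small Python sketch. It assumes, as one plausible explanation that this README does not confirm, that folding is applied to the observed counts before the hypergeometric rescaling. The helper names `fold` and `downscale_site` are hypothetical.

```python
from math import comb

def fold(count, n):
    """Minor-allele count of a site in a sample of size n."""
    return min(count, n - count)

def downscale_site(i, n_obs, n_new):
    """Hypergeometric probabilities of seeing j = 0..n_new copies in the smaller sample."""
    total = comb(n_obs, n_new)
    return [comb(i, j) * comb(n_obs - i, n_new - j) / total
            for j in range(n_new + 1)]

n_obs, n_new = 13, 10
folded = fold(7, n_obs)  # derived count 7 of 13 folds to a minor count of 6
probs = downscale_site(folded, n_obs, n_new)

# Probability mass landing above half of the new sample size (classes 6..10)
above_half = sum(probs[n_new // 2 + 1:])
print(folded, above_half)
```

Here a site with minor count 6 in 13 chromosomes has probability 35/286 (about 0.12) of being sampled at count 6 in 10 chromosomes, which is above n/2 = 5. Under the stated assumption, this is how a folded, downscaled spectrum can carry weight in classes above half the new sample size.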

Usage:

SFS.pl -F [number of files] [filenames]

The [filenames] argument accepts the wildcard character * to match multiple files.

Optional command line arguments:

-q quiet mode, only prints counts
-n sample size to use for output (default is observed sample size of first file)
-f prints out observed frequency count of the derived mutation at every site
--folded calculates folded SFS
--version prints out the version and license information and exits
--unrounded prints out probabilities in their full decimal splendor
--conditional prints out probabilities conditional on the site being polymorphic (ignores fixed sites in the rescaled samples)
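The per-site count reported by -f depends on how the derived allele is chosen. The Python sketch below illustrates the rules described under "Points to be aware of": fall back to the minor allele when the ancestral state is missing, a gap, or an "N", and treat any third variant as ancestral. The function `derived_count`, its arguments, and the representation of a site as a list of characters are assumptions made for this example and do not describe SFS.pl's polytable parser.

```python
from collections import Counter

# Assumed encodings for an unusable ancestral state
UNKNOWN = {None, "-", "N"}

def derived_count(alleles, ancestral=None):
    """Count derived alleles at one site (hypothetical helper)."""
    if ancestral in UNKNOWN:
        # No ancestral state: take the minor allele as derived.
        # With two alleles this is the less common one.
        ordered = Counter(alleles).most_common()
        return ordered[1][1] if len(ordered) > 1 else 0
    derived = None
    count = 0
    for a in alleles:
        if a == ancestral:
            continue
        if derived is None:
            derived = a  # first non-ancestral allele encountered
        if a == derived:
            count += 1   # any further allele is treated as ancestral
    return count

print(derived_count(list("AAGAG"), "A"))  # ancestral known
print(derived_count(list("GGGAG"), "N"))  # falls back to the minor allele
print(derived_count(list("AAGTG"), "A"))  # third variant T is ignored
```

The three calls return 2, 1, and 2. Which variant counts as "the third" depends on the order in which samples are read, which is why the README calls the choice arbitrary.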

Download

SFS.pl, the script
example.txt, a sample polytable

Example

“perl SFS.pl -F 1 example.txt -f -n 10” should return:


SFS version 2.2 run on Thu Jun 10 19:17:55 2010
example.txt
observed n=13 new n: 10 S=10
12 1
115 3
123 1
224 3
225 6
391 2
468 4
551 6
552 8
562 2

0 1 2 3 4 5 6 7 8 9 10
0.545 2.531 2.286 1.482 1.027 1.076 0.734 0.279 0.034 0 0


and “perl SFS.pl -F 1 example.txt” returns the SFS for n=13, the observed sample size:


SFS version 2.2 run on Thu Jun 10 19:22:30 2010
example.txt
observed n=13 new n: 13 S=10

0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 2 2 2 1 0 2 0 1 0 0 0 0 0
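Without rescaling, the SFS is just a tally of the per-site derived counts. A minimal Python sketch follows, using the counts that the -f option reports for example.txt; the function name `observed_sfs` is hypothetical and this is not the script's own code.

```python
def observed_sfs(derived_counts, n):
    """Tally per-site derived counts into classes 0..n."""
    sfs = [0] * (n + 1)
    for c in derived_counts:
        sfs[c] += 1
    return sfs

counts = [1, 3, 1, 3, 6, 2, 4, 6, 8, 2]
sfs = observed_sfs(counts, 13)
print(" ".join(map(str, sfs)))
# prints: 0 2 2 2 1 0 2 0 1 0 0 0 0 0
```

The result matches the n=13 output above, and the entries sum to S=10, the number of segregating sites.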

“perl SFS.pl -F 1 example.txt --unfolded -n 15” scales the data up to sample size n=15:

SFS version 2.2 run on Thu Jun 10 19:41:25 2010
example.txt
observed n=13 new n: 15 S=10

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0.753 1.12 1.24 1.216 1.117 0.989 0.856 0.726 0.601 0.48 0.362 0.253 0.157 0.083 0.033 0.007

“perl SFS.pl -F 1 example.txt -n 10 -f --unrounded” scales the SFS down to n=10, prints the observed derived count at every site, and does not round the estimates:

SFS version 2.2 run on Thu Jun 10 19:41:38 2010
example.txt
observed n=13 new n: 10 S=10
12 1
115 3
123 1
224 3
225 6
391 2
468 4
551 6
552 8
562 2

0 1 2 3 4 5 6 7 8 9 10
0.545454545454546 2.53146853146853 2.28671328671329 1.48251748251748 1.02797202797203 1.07692307692308 0.734265734265734 0.27972027972028 0.034965034965035 0 0