Our software work has shifted from building tools (e.g., DNA Dictionary, genome browsers) to machine learning classification experiments (LeBlanc et al. 2012, 2013).
Sharing a slice of experimental time: a Suite of Scripts
Note: For each script to work properly, you must download the whole suite and save all of the scripts in a common directory.
Extract Genomes from Local Database
This script connects to a database, retrieves the list of organisms stored on the server, and gathers some basic information about each one.
Cut Genomes into Chunks
cutter.pl (“Script #1”) is the second in a suite of scripts designed to assist in the analysis of DNA. This particular script breaks a large DNA sequence into smaller chunks of a user-determined size.
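The chunking step can be illustrated with a minimal Python sketch. This is not the code of cutter.pl itself (the suite is written in Perl); the function name `cut_sequence`, the `keep_remainder` option, and the choice of consecutive, non-overlapping chunks are all assumptions made for illustration, since the description above does not say how overlaps or a short final chunk are handled.

```python
def cut_sequence(sequence, chunk_size, keep_remainder=True):
    """Split a DNA string into consecutive, non-overlapping chunks.

    Assumption: chunks do not overlap. If the sequence length is not a
    multiple of chunk_size, the final shorter chunk is kept unless
    keep_remainder is False.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    chunks = [
        sequence[start:start + chunk_size]
        for start in range(0, len(sequence), chunk_size)
    ]
    # Optionally drop a trailing chunk that is shorter than requested.
    if not keep_remainder and chunks and len(chunks[-1]) < chunk_size:
        chunks.pop()
    return chunks


if __name__ == "__main__":
    demo = "ACGTACGTAC"  # made-up sequence, for illustration only
    for number, chunk in enumerate(cut_sequence(demo, 4), start=1):
        print(number, chunk)
```

Dropping the remainder may be preferable when every chunk must have the same length for later frequency comparisons; keeping it preserves the whole sequence.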
Frequency Counts of Motifs
This script assumes that cutter.pl has already been run. It goes through all the files created by cutter.pl that match the type of data specified on the command line and counts the number of times each unique lmer appears in the genome, as well as in its reverse complement. The results are written to a series of .xls files, one for each combination of lmer size and input file.
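The counting step might look roughly like the following Python sketch. It is an illustration, not the script's actual implementation: the names `reverse_complement` and `count_lmers` are hypothetical, and it assumes that counts from the forward strand and from the reverse-complement strand are added into a single tally, with overlapping windows counted and any window containing a character other than A, C, G, or T skipped.

```python
from collections import Counter

# Translation table mapping each base to its complement.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")
_BASES = set("ACGT")


def reverse_complement(sequence):
    """Return the reverse complement of an uppercase DNA string."""
    return sequence.translate(_COMPLEMENT)[::-1]


def count_lmers(sequence, l):
    """Count every overlapping lmer of length l on both strands.

    Assumption: forward-strand and reverse-complement counts are summed
    into one Counter; windows with ambiguous bases (e.g. N) are skipped.
    """
    if l <= 0:
        raise ValueError("l must be a positive integer")
    sequence = sequence.upper()
    counts = Counter()
    for strand in (sequence, reverse_complement(sequence)):
        for start in range(len(strand) - l + 1):
            lmer = strand[start:start + l]
            if set(lmer) <= _BASES:
                counts[lmer] += 1
    return counts


if __name__ == "__main__":
    # Made-up input, for illustration only.
    for lmer, count in sorted(count_lmers("AACG", 2).items()):
        print(lmer, count)
```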
Prepare Data for R
This script takes the various motif counts created by the motifCounts.pl script, combines them into a single .xls file for use in statistical analysis, and adds some additional metadata.
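As a rough Python sketch of the combining step, the per-file counts can be merged into one table with a row per input file and a column per motif. Everything here is an assumption for illustration: the function names `combine_counts` and `write_table`, the `sample` column, the tab-delimited layout, and the use of 0 for motifs missing from a file are not taken from the script, whose actual file format is not described above.

```python
import csv
import io


def combine_counts(counts_by_sample, metadata_by_sample=None):
    """Merge per-sample motif counts (and optional metadata) into rows.

    Returns a list of rows; the first row is the header. Motifs absent
    from a sample are filled with 0, missing metadata with "".
    """
    metadata_by_sample = metadata_by_sample or {}
    motifs = sorted({m for counts in counts_by_sample.values() for m in counts})
    fields = sorted({f for meta in metadata_by_sample.values() for f in meta})
    rows = [["sample"] + fields + motifs]
    for sample in sorted(counts_by_sample):
        counts = counts_by_sample[sample]
        meta = metadata_by_sample.get(sample, {})
        rows.append(
            [sample]
            + [meta.get(f, "") for f in fields]
            + [counts.get(m, 0) for m in motifs]
        )
    return rows


def write_table(rows, handle):
    """Write rows as tab-delimited text to an open text handle."""
    csv.writer(handle, delimiter="\t", lineterminator="\n").writerows(rows)


if __name__ == "__main__":
    # Hypothetical counts and metadata, for illustration only.
    counts = {"chunk_b": {"AA": 2}, "chunk_a": {"AA": 1, "AC": 3}}
    metadata = {"chunk_a": {"genus": "ExampleGenus"}}
    buffer = io.StringIO()
    write_table(combine_counts(counts, metadata), buffer)
    print(buffer.getvalue(), end="")
```

A tab-delimited table of this shape can be loaded in R with `read.delim`.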
prepare4R.zip – ReadMe (pdf)
This particular script queries a database to gather metadata about the organisms in the data directory. Data gathered includes the organism’s reference sequence, superkingdom, group, genus, species, strain, oxygen requirements, habitat, temperature range, and pathogenic data.