This pipeline of three online tools lets you first "scrub" (clean) your text(s), then cut a single text into chunks, and finally build dendrograms (trees) that show the relationships within and between chunks of your text(s).
- scrubber v1.0 -- strip tags, remove stop words, apply lemma list: prepare text for diviText
- diviText v1.2.1 -- cut texts into chunks in one of three ways, count words, .zip results
- treeView v1.2 -- build a dendrogram and save the output as .pdf or phyloXML
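The scrubbing, cutting, and counting steps listed above can be sketched roughly as follows. This is a minimal illustration, not the code behind scrubber or diviText: the function names, the tiny stop-word and lemma lists, and the choice of fixed-size chunks are all assumptions made for the example (the list above says only that diviText cuts texts in one of three ways, without naming them here).

```python
import re
from collections import Counter

# Hypothetical stand-ins: the real tools work from user-supplied lists.
STOP_WORDS = {"the", "a", "and", "of"}
LEMMAS = {"kings": "king", "was": "be"}


def scrub(text, stop_words=STOP_WORDS, lemmas=LEMMAS):
    """Strip tags, lowercase, drop stop words, then apply the lemma list."""
    no_tags = re.sub(r"<[^>]+>", " ", text)            # remove <...> markup
    words = re.findall(r"[^\W\d_]+", no_tags.lower())  # runs of letters only
    return [lemmas.get(w, w) for w in words if w not in stop_words]


def cut_by_size(words, chunk_size):
    """Cut a word list into consecutive chunks of chunk_size words.

    Assumed cutting rule for this sketch; the last chunk may be shorter.
    """
    return [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]


def count_words(chunks):
    """Return one word-frequency Counter per chunk."""
    return [Counter(chunk) for chunk in chunks]


sample = "<p>The king was old and the kings of old</p>"
words = scrub(sample)           # ['king', 'be', 'old', 'king', 'old']
chunks = cut_by_size(words, 3)  # [['king', 'be', 'old'], ['king', 'old']]
counts = count_words(chunks)
```

Per-chunk counts like these are the kind of input a clustering step would need to build a dendrogram.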
Tutorials and transcripts for these tools can be found here.
Download the software for these three open-source tools:
Advanced (offline, in progress) tools:
- trueTree v1.0 -- cluster validation ... just how good is that clade?
- topWords v1.0 -- find significant discriminating words between clades
Early command-line scripts:
Before developing our online tools, we wrote this suite of command-line scripts. The scripts convert data into the formats needed for your experimental analyses of texts, including statistical summaries of word usage across selected groups (or chunks) of texts, authorship attribution techniques, and clustering and classification methods.
Note: For each script to work properly, you must download the whole suite of scripts.
- Merge the counts into one file
This script can be used either after you've created a Virtual Manuscript or after running as3_countWords. Besides collecting some statistics on your collection of texts, its main goal is to merge the counts into one file in preparation for further analysis, for example in R (see below).
mergeWordCounts.zip - ReadMe
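As a rough illustration of what merging the counts into one file can mean, the sketch below combines per-chunk word counts into a single tab-separated table, one row per chunk and one column per word. It is not the mergeWordCounts script itself: the function names, the chunk labels, and the table layout are assumptions made for the example.

```python
import csv
import io
from collections import Counter


def merge_counts(counts_by_chunk):
    """Return (vocabulary, rows) with one zero-filled row of counts per chunk."""
    vocab = sorted(set().union(*counts_by_chunk.values()))
    rows = [[name] + [counts.get(word, 0) for word in vocab]
            for name, counts in counts_by_chunk.items()]
    return vocab, rows


def write_merged(counts_by_chunk, out):
    """Write the merged table to a file-like object as tab-separated text."""
    vocab, rows = merge_counts(counts_by_chunk)
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["chunk"] + vocab)  # header: label column, then each word
    writer.writerows(rows)


# Hypothetical per-chunk counts, standing in for the output of a counting step.
counts_by_chunk = {
    "chunk1": Counter({"king": 1, "be": 1, "old": 1}),
    "chunk2": Counter({"king": 1, "old": 1}),
}

buffer = io.StringIO()
write_merged(counts_by_chunk, buffer)
merged = buffer.getvalue()
```

Writing to a real file instead of the in-memory buffer gives a table that R can load with `read.delim`. Words missing from a chunk are filled with 0 so that every row has the same columns.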