Just what is lexomics?

The term “lexomics” was originally coined to describe the computer-assisted detection of “words” (short sequences of bases) in genomes. Applied to literature, as we do here, lexomics is the analysis of the frequency, distribution, and arrangement of words in large-scale patterns. More specifically, with our current suite of tools we segment a text (or texts), count the number of times each word appears in each segment (or chunk), and then apply cluster analysis to build dendrograms (branching diagrams, or trees) that show relationships between the chunks. Watch our Project Videos, which introduce some of the problems we have been working on, and our online demos, in which we apply our tools to those tasks.
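In outline, the chunk-and-count step of that pipeline looks like the following minimal Python sketch (an illustration only, not the Lexos implementation; the chunk size and the whitespace tokenization are arbitrary choices here):

```python
from collections import Counter

def chunk_and_count(text, chunk_size=1000):
    """Split a text into fixed-size chunks of words and count
    how often each word appears in each chunk."""
    words = text.lower().split()  # naive tokenization for illustration
    chunks = [words[i:i + chunk_size]
              for i in range(0, len(words), chunk_size)]
    return [Counter(chunk) for chunk in chunks]

# Each Counter is one chunk's word-frequency vector; cluster analysis
# (e.g. hierarchical clustering via scipy.cluster.hierarchy) then turns
# the list of vectors into a dendrogram.
counts = chunk_and_count("hwaet we gardena in geardagum " * 500)
```

The list of per-chunk frequency vectors is the input to the clustering step; the distance measure and linkage method used there are choices the scholar makes, not properties of the text.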

Will lexomics replace me as a scholar?

Nope. Lexomics, like most computational text-mining tools, is what John Burrows calls a “middle game” technique. You (the scholar) have much scholarship to do before running the tools (collecting texts, forming hypotheses about where to segment them, and so on), and much work to do afterward to interpret the results and form new hypotheses. The computer is just a tool “in the middle”, albeit a very powerful tool that today’s scholars of texts want in their arsenal.

Where do I start?

If you are new to lexomics, we strongly recommend that you review our tutorial materials.

What browser should I use?

On macOS, Windows, or Linux, we recommend the Chrome browser; Firefox also works. Safari and Internet Explorer are not supported.

What languages (other than Old English, Latin, etc) does Lexomic analysis work with?

We have worked hard on the “scrubbing” functionality so that it handles texts in almost any file type (raw .txt, Unicode, .sgml, .html, .xml) and language. Note, however, that our HTML and XML handling does not use full-fledged parsers.

How reliable are Lexomic results? Is there any way to verify their reliability?

Excellent question. Every set of chunks from a text or texts will produce a dendrogram; the real question is how to assign confidence measures to any and all clades (groups) in that dendrogram. We are currently prototyping a new tool (not yet available) called “trueTree” that uses a bootstrapping method to assign levels of confidence to our dendrograms.
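The general idea behind bootstrapping can be sketched as follows (a toy illustration of word-level resampling, not the trueTree code, which has not been released):

```python
import random
from collections import Counter

def bootstrap_counts(chunk_words, n_samples=100, seed=0):
    """Resample a chunk's words with replacement, producing many
    plausible re-draws of its word-frequency profile. Clades that
    persist when the dendrogram is rebuilt from many such resamples
    earn higher confidence."""
    rng = random.Random(seed)  # seeded for reproducibility
    resampled = []
    for _ in range(n_samples):
        sample = [rng.choice(chunk_words) for _ in chunk_words]
        resampled.append(Counter(sample))
    return resampled

samples = bootstrap_counts("the dragon guarded the hoard".split())
# Rebuilding the dendrogram once per resampled set, then counting how
# often each clade recurs, yields a confidence score for that clade.
```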

Is lexomics analysis in general and are dendrograms in particular unique to your research group?

No. Scholars have been counting words and clustering and classifying texts for years. Our contribution is to build tools that make this experimental process, especially cluster analysis (dendrograms), relatively easy for scholars to perform. This includes our group and our undergraduate students! But don’t stop with dendrograms: use our CSV-generator tool to jumpstart your subsequent analysis with other tools and languages (e.g., analyses in R).
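If you take the CSV route, the exported matrix is easy to pull into other environments. Here is a minimal Python reader as one example; the layout shown (words across the first row, chunk names down the first column, raw counts in the cells) is an assumption for illustration, so check the orientation of your actual export:

```python
import csv
import io

# A small stand-in for a CSV export (layout assumed, not verified):
sample_csv = """chunk,beowulf,grendel,heorot
chunk1,12,3,5
chunk2,7,9,0
"""

def read_matrix(fileobj):
    """Parse a word-count matrix into {chunk_name: {word: count}}."""
    reader = csv.reader(fileobj)
    header = next(reader)
    words = header[1:]
    return {row[0]: {w: int(n) for w, n in zip(words, row[1:])}
            for row in reader}

matrix = read_matrix(io.StringIO(sample_csv))
```

From here the counts can feed any downstream analysis: distance matrices, topic models, or a hand-off to R via the same CSV file.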

Who wrote the software for the Lexomics Tools?

Undergraduate students at Wheaton College (Norton, MA) wrote almost all the software, with Professor Scott Kleinman of California State University, Northridge serving as software lead. LeBlanc manages the software efforts. As of 2015, Cheng Zhang ’18 is our lead developer. See our About Us page to see our smiling faces and learn more about our community of student programmers over the years.

How do I cite use of the Lexomics tools?

The Lexomics tools are Open Source Software [GPLv3]. For research use, please remember to cite Lexomics:

Kleinman, S., LeBlanc, M.D., Drout, M., and Zhang, C. (2017). Lexos v3.1.1. https://github.com/WheatonCS/Lexos/.

Have any unanswered questions?

Be sure to send us any additional questions at [mdrout at wheatoncollege dot edu]; we’d like to answer them here!