The purpose of this page is to explore and demonstrate some of the possibilities of marrying close and distant reading. The hope is that combining the two processes yields greater comprehension and understanding of a corpus than either close or distant reading alone. (This text might also be republished at dh.crc.nd.edu/sandbox/thatcamp-2015/)
To give this exploration a go, two texts are being used to form a corpus: 1) Machiavelli’s The Prince and 2) Emerson’s Representative Men. Both texts were printed and bound into a single book (codex). The book is intended to be read in the traditional manner, and the layout includes extra wide margins allowing the reader to liberally write/draw in the margins. While the glue was drying on the book, the plain text versions of the texts were evaluated using a number of rudimentary text mining techniques, and the results are made available here. Both the traditional reading and the text mining are aimed at answering a few questions. How do both Machiavelli and Emerson define a “great” man? What characteristics do “great” men have? What sorts of things have “great” men accomplished?
|Feature||The Prince||Representative Men|
|Author||Niccolò di Bernardo dei Machiavelli (1469 – 1527)||Ralph Waldo Emerson (1803 – 1882)|
|Title||The Prince||Representative Men|
|Fulltext||plain text, HTML, PDF, TEI/XML||plain text, HTML, PDF, TEI/XML|
|Length||31,179 words||59,600 words|
|Frequencies||unigrams, bigrams, trigrams, quadgrams, quintgrams||unigrams, bigrams, trigrams, quadgrams, quintgrams|
|Parts-of-speech||nouns, pronouns, adjectives, verbs, adverbs||nouns, pronouns, adjectives, verbs, adverbs|
I consider this project a qualified success.
First, I was able to print and bind my book, and while the glue is still drying, I’m confident the final results will be more than usable. The real tests of the bound book are to see whether: 1) I actually read it, 2) I annotate it using my personal method, and 3) I am able to identify answers to my research questions, above.
Second, the text mining services turned out to be more of a compare & contrast methodology as opposed to a question-answering process. For example, I can see that one book was written hundreds of years before the other. The second book is almost twice as long as the first. Judging by the readability scores, Machiavelli was almost certainly written for a more educated audience, and Emerson is easier to read. The frequencies and parts-of-speech are enumerative, but not necessarily illustrative. There are a number of ways the frequencies and parts-of-speech could be improved. For example, just about everything could be visualized as histograms or word clouds. The verbs ought to be lemmatized. The frequencies ought to be depicted as ratios relative to the length of each text, so that counts from the shorter and the longer book can be compared directly. Other measures could be created as well. For example, my Great Books Coefficient could be employed.
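To make the "frequencies as ratios" idea concrete, here is a minimal sketch in Python. It is not part of the project's code: the function name `relative_frequencies`, the crude regular-expression tokenizer, and the sample sentence are all assumptions made for illustration.

```python
import re
from collections import Counter

def relative_frequencies(text, top=5):
    """Return the `top` most common words, each count divided by the text length."""
    # Crude tokenizer (an assumption): lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    counts = Counter(words)
    # Dividing by the total word count makes texts of different lengths comparable.
    return [(word, count / total) for word, count in counts.most_common(top)]

if __name__ == "__main__":
    # Hypothetical sample sentence, not a quotation from either book.
    sample = "a great man is a man who serves other men"
    for word, ratio in relative_frequencies(sample, top=3):
        print(f"{word}\t{ratio:.3f}")
```

Run against the plain text of each book, a table like this would let a word's share of The Prince be set beside its share of Representative Men, rather than comparing raw counts from texts of very different lengths.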
How do Emerson and Machiavelli define a “great” man? Hmmm… Well, I’m not sure. It is relatively easy to get “definitions” of men in both books (The Prince or Representative Men). And the network diagrams illustrating what words are used “in the same breath” as the word “man” in both works are not very dissimilar:
“man” in The Prince
“man” in Representative Men
I think I’m going to have to read the books to find the answer. Really.
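One simple way to approximate words used “in the same breath” as a given word is to count the words that fall within a small window around each occurrence. The sketch below assumes that approach; it is not necessarily how the network diagrams above were produced, and the function name `cooccurrences`, the window size, and the sample sentence are hypothetical.

```python
import re
from collections import Counter

def cooccurrences(text, target, window=3):
    """Count words appearing within `window` words of each occurrence of `target`."""
    # Crude tokenizer (an assumption): lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    neighbors = Counter()
    for i, word in enumerate(words):
        if word != target:
            continue
        start = max(0, i - window)
        # Words before and after the target, excluding the target occurrence itself.
        context = words[start:i] + words[i + 1:i + 1 + window]
        neighbors.update(context)
    return neighbors

if __name__ == "__main__":
    # Hypothetical sample sentence, not a quotation from either book.
    sample = "a great man is a wise man and a wise man reads"
    for word, count in cooccurrences(sample, "man", window=2).most_common(5):
        print(f"{word}\t{count}")
```

The resulting counts could serve as edge weights in a network diagram, with the target word at the center. Filtering out stop words first would probably make the picture more telling.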
Bunches o’ code was written to produce the reports:
- concordance.cgi – the simple search engine
- fathom.pl – used to compute the readability scores
- file2pos.py – create a parts-of-speech file for later use
- network.cgi – used to display words used “in the same breath” as a given word
- ngrams.pl – compute ngrams
- pos.py – count and tabulate parts-of-speech from a previously created file
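For readers who do not want to unpack the archive, the n-gram counting listed above can be illustrated with a short Python sketch. This is not a port of ngrams.pl; the function name `ngrams`, the tokenizer, and the sample sentence are assumptions for illustration only.

```python
import re
from collections import Counter

def ngrams(text, n=2):
    """Count every run of `n` consecutive words in `text`."""
    # Crude tokenizer (an assumption): lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    # Zipping n staggered copies of the word list yields each n-word tuple once.
    return Counter(zip(*(words[i:] for i in range(n))))

if __name__ == "__main__":
    # Hypothetical sample sentence, not a quotation from either book.
    sample = "the great man and the great book"
    for gram, count in ngrams(sample, n=2).most_common(3):
        print(" ".join(gram), count)
```

Setting `n` to 1 through 5 would give the unigrams through quintgrams named in the table.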
You can download this entire project — code and all — from dh.crc.nd.edu/sandbox/thatcamp-2015/reports/thatcamp-2015.tar.gz.