Marrying close and distant reading: A THATCamp project

Eric Lease Morgan — Tue, 07 Apr 2015 15:26:18 +0000

The purpose of this page is to explore and demonstrate some of the possibilities of marrying close and distant reading. By combining both of these processes, the hope is that greater comprehension and understanding of a corpus can be gained than by using close or distant reading alone. (This text might also be republished at dh.crc.nd.edu/sandbox/thatcamp-2015/)

To give this exploration a go, two texts are being used to form a corpus: 1) Machiavelli’s The Prince and 2) Emerson’s Representative Men. Both texts were printed and bound into a single book (codex). The book is intended to be read in the traditional manner, and the layout includes extra wide margins allowing the reader to liberally write/draw in the margins. As the glue is drying on the book, the plain text versions of the texts were evaluated using a number of rudimentary text mining techniques, and the results are made available here. Both the traditional reading and the text mining are aimed at answering a few questions. How do both Machiavelli and Emerson define a “great” man? What characteristics do “great” men have? What sorts of things have “great” men accomplished?

Comparison

Feature	The Prince	Representative Men
Author	Niccolò di Bernardo dei Machiavelli (1469 – 1527)	Ralph Waldo Emerson (1803 – 1882)
Title	The Prince	Representative Men
Date	1532	1850
Fulltext	plain text | HTML | PDF | TEI/XML	plain text | HTML | PDF | TEI/XML
Length	31,179 words	59,600 words
Fog score	23.1	14.6
Flesch score	33.5	52.9
Kincaid score	19.7	11.5
Frequencies	unigrams, bigrams, trigrams, quadgrams, quintgrams	unigrams, bigrams, trigrams, quadgrams, quintgrams
Parts-of-speech	nouns, pronouns, adjectives, verbs, adverbs	nouns, pronouns, adjectives, verbs, adverbs
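
The fog, Flesch, and Kincaid rows above are the sort of numbers produced by the standard Gunning fog, Flesch reading ease, and Flesch-Kincaid grade level formulas. What follows is a minimal Python sketch of those formulas, not the fathom.pl script itself. The vowel-group syllable counter is a rough assumption of mine, and the function expects non-empty text, so its output will not exactly match the scores in the table.

```python
import re


def count_syllables(word):
    """Rough heuristic (an assumption): one syllable per group of vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def readability(text):
    """Return fog, Flesch, and Kincaid scores for a non-empty text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    # "Complex" words are taken here to be words of three or more syllables.
    complex_words = sum(1 for w in words if count_syllables(w) >= 3)

    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)

    return {
        "fog": 0.4 * (words_per_sentence + 100 * complex_words / len(words)),
        "flesch": 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word,
        "kincaid": 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59,
    }
```

Higher fog and Kincaid scores and a lower Flesch score all point to a harder text, which is the pattern the table shows for The Prince.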

Search

Search for “man or men” in The Prince. Search for “man or men” in Representative Men.

Observations

I observe this project to be a qualified success.

First, I was able to print and bind my book, and while the glue is still drying, I’m confident the final results will be more than usable. The real tests of the bound book are to see if: 1) I actually read it, 2) I annotate it using my personal method, and 3) I am able to identify answers to my research questions, above.

bookmaking tools

almost done

Second, the text mining services turned out to be more of a compare & contrast methodology as opposed to a question-answering process. For example, I can see that one book was written hundreds of years before the other. The second book is almost twice as long as the first. Judging by the readability scores, Machiavelli is almost certainly written for the more educated and Emerson is easier to read. The frequencies and parts-of-speech are enumerative, but not necessarily illustrative. There are a number of ways the frequencies and parts-of-speech could be improved. For example, just about everything could be visualized as histograms or word clouds. The verbs ought to be lemmatized. The frequencies ought to be depicted as ratios relative to the length of each text. Other measures could be created as well. For example, my Great Books Coefficient could be employed.
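
As an illustration of depicting frequencies as ratios, here is a minimal Python sketch. It is not part of the project’s code; the function names and the simple lowercase tokenizer are my own assumptions. Dividing each count by the total number of tokens makes the numbers from a 31,179-word text and a 59,600-word text directly comparable.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple alphabetic tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def relative_frequencies(text):
    """Map each word to its count divided by the total number of tokens."""
    tokens = tokenize(text)
    total = len(tokens)
    counts = Counter(tokens)
    return {word: count / total for word, count in counts.items()}
```

With this in hand, something like relative_frequencies(prince).get("man", 0) can be set beside the same lookup for Representative Men, where prince is assumed to hold the plain text of the book.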

How do Emerson and Machiavelli define a “great” man? Hmmm… Well, I’m not sure. It is relatively easy to get “definitions” of men in both books (The Prince or Representative Men). And network diagrams illustrating what words are used “in the same breath” as the word man in both works are not very dissimilar:

“man” in The Prince

“man” in Representative Men

I think I’m going to have to read the books to find the answer. Really.

Code

Bunches o’ code was written to produce the reports:

concordance.cgi – the simple search engine
fathom.pl – used to compute the readability scores
file2pos.py – create a parts-of-speech file for later use
network.cgi – used to display words used “in the same breath” as a given word
ngrams.pl – compute ngrams
pos.py – count and tabulate parts-of-speech from a previously created file
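
To give a flavor of what a simple concordance search does, here is a minimal keyword-in-context sketch in Python. It is not the concordance.cgi listed above; the function name, the width parameter, and the use of a regular expression for the query are all assumptions made for illustration.

```python
import re


def concordance(text, query, width=30):
    """Return each match of query with `width` characters of context per side.

    `query` is treated as a regular expression, so "man|men" finds both forms.
    """
    flat = " ".join(text.split())  # collapse newlines and runs of whitespace
    pattern = re.compile(r"\b(?:%s)\b" % query, re.IGNORECASE)
    lines = []
    for match in pattern.finditer(flat):
        left = flat[max(0, match.start() - width):match.start()]
        right = flat[match.end():match.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}}{match.group(0)}{right}")
    return lines
```

Printing the result of concordance(text, "man|men") line by line gives a listing similar in spirit to the “man or men” searches above, where text is assumed to hold the plain text of one of the books.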

You can download this entire project — code and all — from dh.crc.nd.edu/sandbox/thatcamp-2015/reports/thatcamp-2015.tar.gz.

Text Mining – THATCamp ND 2015
