Marrying close and distant reading: A THATCamp project
Text Mining – THATCamp ND 2015 (http://nd2015.thatcamp.org)
Tue, 07 Apr 2015 15:26:18 +0000, http://nd2015.thatcamp.org/2015/04/07/close-and-distant/

The purpose of this page is to explore and demonstrate some of the possibilities of marrying close and distant reading. The hope is that combining the two processes yields greater comprehension and understanding of a corpus than either close or distant reading alone. (This text might also be republished at dh.crc.nd.edu/sandbox/thatcamp-2015/)

To give this exploration a go, two texts are being used to form a corpus: 1) Machiavelli’s The Prince and 2) Emerson’s Representative Men. Both texts were printed and bound into a single book (codex). The book is intended to be read in the traditional manner, and the layout includes extra wide margins allowing the reader to liberally write/draw in them. As the glue is drying on the book, the plain text versions of the texts were evaluated using a number of rudimentary text mining techniques, and the results are made available here. Both the traditional reading and the text mining are aimed at answering a few questions. How do both Machiavelli and Emerson define a “great” man? What characteristics do “great” men have? What sorts of things have “great” men accomplished?

Comparison

Feature          The Prince                                          Representative Men
Author           Niccolò di Bernardo dei Machiavelli (1469 – 1527)   Ralph Waldo Emerson (1803 – 1882)
Title            The Prince                                          Representative Men
Date             1532                                                1850
Fulltext         plain text | HTML | PDF | TEI/XML                   plain text | HTML | PDF | TEI/XML
Length           31,179 words                                        59,600 words
Fog score        23.1                                                14.6
Flesch score     33.5                                                52.9
Kincaid score    19.7                                                11.5
Frequencies      unigrams, bigrams, trigrams, quadgrams, quintgrams  unigrams, bigrams, trigrams, quadgrams, quintgrams
Parts-of-speech  nouns, pronouns, adjectives, verbs, adverbs         nouns, pronouns, adjectives, verbs, adverbs
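
For readers curious about what the three readability rows measure, here is a minimal sketch of the standard published formulas. It assumes the Fog, Flesch, and Kincaid rows correspond to the Gunning Fog index, Flesch Reading Ease, and Flesch-Kincaid grade level. It is not the code in fathom.pl, and the counts below are hypothetical numbers chosen only to exercise the functions.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease: higher scores mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid grade level: roughly a U.S. school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def gunning_fog(words, sentences, complex_words):
    """Gunning Fog index; complex_words are words of three or more syllables."""
    return 0.4 * ((words / sentences) + 100 * (complex_words / words))


# Hypothetical counts for illustration only, not taken from either book.
words, sentences, syllables, complex_words = 100, 5, 150, 10

print(round(flesch_reading_ease(words, sentences, syllables), 3))
print(round(flesch_kincaid_grade(words, sentences, syllables), 2))
print(round(gunning_fog(words, sentences, complex_words), 1))
```

Note the direction of each scale: a lower Flesch score and higher Fog and Kincaid scores all point to harder text, which is the pattern The Prince shows in the table.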

Search

Search for “man or men” in The Prince. Search for “man or men” in Representative Men.

Observations

I observe this project to be a qualified success.

First, I was able to print and bind my book, and while the glue is still drying, I’m confident the final results will be more than usable. The real tests of the bound book are to see if: 1) I actually read it, 2) I annotate it using my personal method, and 3) I am able to identify answers to my research questions, above.


bookmaking tools

almost done

Second, the text mining services turned out to be more of a compare & contrast methodology as opposed to a question-answering process. For example, I can see that one book was written hundreds of years before the other. The second book is almost twice as long as the first. Readability score-wise, The Prince comes out harder on all three measures (Fog 23.1 vs. 14.6, Flesch 33.5 vs. 52.9, Kincaid 19.7 vs. 11.5), so Machiavelli seems to be written for the more educated and Emerson is easier to read. The frequencies and parts-of-speech are enumerative, but not necessarily illustrative. There are a number of ways the frequencies and parts-of-speech could be improved. For example, just about everything could be visualized as histograms or word clouds. The verbs ought to be lemmatized. The frequencies ought to be depicted as ratios relative to the length of each text. Other measures could be created as well. For example, my Great Books Coefficient could be employed.
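
To illustrate the point about ratios, here is a small sketch, not part of the project's code, that scales word counts to occurrences per 1,000 words so that texts of different lengths can be compared. The function names and the two sample texts are made up for the example, and the tokenizer is a simplifying assumption (lowercase ASCII letters only).

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters (accents are dropped)."""
    return re.findall(r"[a-z]+", text.lower())


def relative_frequencies(text, per=1000):
    """Return each word's count scaled to occurrences per `per` words."""
    tokens = tokenize(text)
    total = len(tokens)
    if total == 0:
        return {}
    counts = Counter(tokens)
    return {word: count * per / total for word, count in counts.items()}


# Made-up sample sentences, not quotations from either book.
short_text = "A great man leads. A great man acts."
long_text = "The man reads and the man writes and the man thinks and then the man rests."

print(relative_frequencies(short_text)["man"])
print(relative_frequencies(long_text)["man"])
```

The raw counts differ (2 versus 4), but both samples come out at 250 occurrences per 1,000 words, which is the kind of comparison a raw frequency list hides when one text is almost twice as long as the other.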

How do Emerson and Machiavelli define a “great” man? Hmmm… Well, I’m not sure. It is relatively easy to get “definitions” of men in both books (The Prince or Representative Men). And network diagrams illustrating what words are used “in the same breath” as the word man in both works are not very dissimilar:


“man” in The Prince

“man” in Representative Men

I think I’m going to have to read the books to find the answer. Really.
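
One way to approximate the “in the same breath” idea behind the network diagrams is to count which words fall within a few tokens of a target word. The sketch below makes that assumption. It is not the logic of network.cgi, and the window size, function name, and sample sentence are all invented for illustration.

```python
import re
from collections import Counter


def cooccurrences(text, target, window=2):
    """Count words found within `window` tokens of each occurrence of `target`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    neighbors = Counter()
    for i, token in enumerate(tokens):
        if token != target:
            continue
        start = max(0, i - window)
        # Words before the target plus words after it, excluding the target itself.
        context = tokens[start:i] + tokens[i + 1:i + 1 + window]
        neighbors.update(context)
    return neighbors


# Made-up sentence, not a quotation from either book.
sample = "A wise man and a great man may differ, but a great man acts."
counts = cooccurrences(sample, "man", window=2)
print(counts["great"], counts["wise"], counts["a"])
```

Even in this tiny sample the function word “a” outnumbers “great”, so a stop-word filter would probably be needed before the neighbors of “man” say much about greatness.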

Code

Bunches o’ code was written to produce the reports:

  • concordance.cgi – the simple search engine
  • fathom.pl – used to compute the readability scores
  • file2pos.py – create a parts-of-speech file for later use
  • network.cgi – used to display words used “in the same breath” as a given word
  • ngrams.pl – compute ngrams
  • pos.py – count and tabulate parts-of-speech from a previously created file
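
For a sense of what two of these reports involve, here is a minimal Python sketch of n-gram counting and a keyword-in-context concordance. It is an illustration under simple assumptions (lowercase ASCII tokens, whole-token regular expression matching), not the contents of ngrams.pl or concordance.cgi, and the function names and sample sentence are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters."""
    return re.findall(r"[a-z]+", text.lower())


def ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def concordance(tokens, pattern, width=3):
    """Return keyword-in-context lines for tokens that fully match `pattern`."""
    regex = re.compile(pattern)
    lines = []
    for i, token in enumerate(tokens):
        if regex.fullmatch(token):
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines


# Made-up sentence, not a quotation from either book.
tokens = tokenize("Great men are rare, and a great man is rarer than great men suppose.")

print(ngrams(tokens, 2).most_common(1))
for line in concordance(tokens, "m[ae]n", width=2):
    print(line)
```

The pattern `m[ae]n` mirrors the “man or men” search above; any regular expression that matches a whole token would work in its place.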

You can download this entire project — code and all — from dh.crc.nd.edu/sandbox/thatcamp-2015/reports/thatcamp-2015.tar.gz.
