24 Days of R: Day 11

I don't know how often Michael Caine has appeared in a Shakespearean work, but I'm sure that he has and I'm sure that he was excellent. I'm a bit pressed for time today, so this is just a simple word cloud featuring the full text of King Lear. I found the text at a website that I presume is associated with a university in Cambridge: http://shakespeare.mit.edu/lear/full.html. I stored a local copy.

My sister lives in Stratford-upon-Avon and can't stop talking about Shakespeare. Today's post is dedicated to her.

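library(tm)         # Corpus(), tm_map(), TermDocumentMatrix()
library(wordcloud)  # wordcloud()

# Read the local copy of the play, one element per line
aFile = readLines("./Data/Lear.txt")

myCorpus = Corpus(VectorSource(aFile))

# Normalise the text. Newer tm releases need tolower wrapped:
# tm_map(myCorpus, content_transformer(tolower))
myCorpus = tm_map(myCorpus, tolower)
myCorpus = tm_map(myCorpus, removePunctuation)
myCorpus = tm_map(myCorpus, removeNumbers)
myCorpus = tm_map(myCorpus, removeWords, stopwords("english"))

# Terms in rows, documents (lines of the play) in columns
myDTM = TermDocumentMatrix(myCorpus, control = list(minWordLength = 1))

m = as.matrix(myDTM)

# Total count of each term across the play, most frequent first
v = sort(rowSums(m), decreasing = TRUE)

wordcloud(names(v), v, min.freq = 15)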

[Figure: word cloud of the words appearing at least 15 times in King Lear (chunk ReadData)]

A lot of “king”, “lear”, “thee”, “thy” and “thou”.
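For anyone curious what the tm steps amount to, here is a rough base R sketch of the same idea: lowercase, strip punctuation and digits, drop stopwords, count. The toy lines and the tiny toy_stopwords list are made up for illustration (they are not quotations, and the list tm returns for stopwords("english") is much longer), so treat this as a sketch of the idea rather than what tm does internally.

```r
# Toy lines standing in for the play text (made up, not actual quotations).
toy_lines <- c("KING LEAR",
               "No, thou shalt not go; thy place is here.",
               "KENT",
               "My lord, I go where thou goest.")

# A tiny, hypothetical stopword list used only for this illustration.
toy_stopwords <- c("i", "is", "not", "where", "my")

clean <- tolower(toy_lines)               # same idea as tm_map(..., tolower)
clean <- gsub("[[:punct:]]", "", clean)   # same idea as removePunctuation
clean <- gsub("[[:digit:]]", "", clean)   # same idea as removeNumbers

# Split into words, then drop empty strings and stopwords.
words <- unlist(strsplit(clean, "[[:space:]]+"))
words <- words[nzchar(words) & !(words %in% toy_stopwords)]

# Named integer vector of counts, most frequent first (like v above).
counts    <- table(words)
word_freq <- sort(setNames(as.integer(counts), names(counts)), decreasing = TRUE)

print(head(word_freq, 5))
```

The result has the same shape as v in the post, so it could be passed to wordcloud(names(word_freq), word_freq) in the same way. It also hints at why “thee”, “thy” and “thou” survive: a stopword list built for modern English presumably has no reason to include them.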

And of course, in searching for a reference for the code above (I modified it from something else), I came across this: Text mining Shakespeare. I feel even lazier than I did before.

I can't leave it at that, so I'll very quickly determine the most frequent two- and three-word phrases in the text.

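library(tau)  # provides textcnt()

# Count every two-word sequence, then sort by descending frequency
bigrams = textcnt(aFile, n = 2, method = "string")
bigrams = bigrams[order(bigrams, decreasing = TRUE)]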
## king lear 
##       209
## my lord 
##      76
trigrams = textcnt(aFile, n = 3, method = "string")
trigrams = trigrams[order(trigrams, decreasing = TRUE)]
## king lear no 
##           13
## i know not 
##         12

No surprise that the most frequent bigram is “king lear”, at 209 occurrences, and “my lord” is the sort of thing one would expect in a play of Shakespeare's era. I like that the most frequent trigram is “king lear no”, at 13. I'll have to have a look at the text to see what's behind that.
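One plausible culprit, assuming the local copy keeps each speaker's name in capitals on its own line, is that the speaker label runs straight into the first word of the speech. Here is a small base R sketch of that effect on made-up lines. The count_ngrams helper is hypothetical and written just for this illustration; it is not how textcnt from tau is implemented, and the all-capitals rule for spotting labels is an assumption about the file's layout.

```r
# Toy lines (made up, not actual quotations). Speaker labels are assumed
# to be lines consisting only of capital letters and spaces.
toy_lines <- c("KING LEAR", "No, I will not.",
               "KENT",      "My lord, hear me.",
               "KING LEAR", "No more of that.")

# Hypothetical helper: count n-word phrases across the whole text.
count_ngrams <- function(lines, n) {
  text  <- paste(lines, collapse = " ")
  text  <- tolower(gsub("[^[:alpha:][:space:]]", " ", text))
  words <- strsplit(text, "[[:space:]]+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(integer(0))
  starts <- seq_len(length(words) - n + 1)
  grams  <- vapply(starts,
                   function(i) paste(words[i:(i + n - 1)], collapse = " "),
                   character(1))
  counts <- table(grams)
  sort(setNames(as.integer(counts), names(counts)), decreasing = TRUE)
}

# Trigrams with the speaker labels left in.
with_labels <- count_ngrams(toy_lines, 3)

# Drop the assumed speaker-label lines and count again.
is_label       <- grepl("^[[:upper:][:space:]]+$", toy_lines)
without_labels <- count_ngrams(toy_lines[!is_label], 3)

print(head(with_labels, 3))
print(head(without_labels, 3))
```

On the toy lines, “king lear no” tops the list only while the labels are present. Whether the same filter suits the real file depends on how the speaker names are laid out in it, so it is worth eyeballing is_label before trusting the result.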

## R version 3.0.2 (2013-09-25)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## other attached packages:
## [1] wordcloud_2.4      RColorBrewer_1.0-5 Rcpp_0.10.6       
## [4] knitr_1.4.1        RWordPress_0.2-3   tau_0.0-15        
## [7] tm_0.5-9.1        
## loaded via a namespace (and not attached):
##  [1] digest_0.6.3   evaluate_0.4.7 formatR_0.9    parallel_3.0.2
##  [5] RCurl_1.95-4.1 slam_0.1-30    stringr_0.6.2  tools_3.0.2   
##  [9] XML_3.98-1.1   XMLRPC_0.3-0

10 thoughts on “24 Days of R: Day 11”

  1. ‘No’ is a frequent first word of many of King Lear’s lines, so “King Lear: ‘No …’” occurs often. Maybe you could strip out the names of the speakers before the analysis.

  2. I’ve been meaning to get off my cognitive hiney and explore making word clouds, and I think your post just may have pushed me over the edge. As an aside, have you had any experience scraping text from web pages for word clouds? The word cloud idea that has been floating around in my head for quite some time involves collecting post-game press conferences from the same coach over an entire season, and creating word clouds of the post-victory and post-loss conferences. I think I can manage to put together the right URLs, but I’m not keen enough yet on scraping the XML…

    1. I had seen that article and referenced it (I think via R-bloggers) on the page. Apologies for not citing you personally. I seem to recall having read it when it first appeared and then it was one of the first pages which appeared when I Googled “text mining in R”. Hats off for taking on the complete works!

  3. Oddly, when running this in RStudio, I got a bunch of warnings that some of the most frequent words could not be plotted. However, when running it in the normal RGui, it worked fine.

    “1: In wordcloud(names(v), v, min.freq = 15) :
    thou could not be fit on page. It will not be plotted.
    2: In wordcloud(names(v), v, min.freq = 15) :
    lear could not be fit on page. It will not be plotted.
    3: In wordcloud(names(v), v, min.freq = 15) :
    gloucester could not be fit on page. It will not be plotted.
    4: In wordcloud(names(v), v, min.freq = 15) :
    shall could not be fit on page. It will not be plotted.
    5: In wordcloud(names(v), v, min.freq = 15) :
    king could not be fit on page. It will not be plotted.
    6: In wordcloud(names(v), v, min.freq = 15) :
    make could not be fit on page. It will not be plotted.”

    1. This is probably to do with spatial limitations in RStudio’s plotting window. Often complex plots will produce warnings because RStudio can’t render them. If you maximize the window where the plot appears, the warnings may go away.


      1. Thanks — I’ll have to try it out later!

        Thanks also for posting all of it — I’m loving it! Merry Christmas!

  4. Thanks for sharing this wonderful tutorial. Quick question though: For R n00bs like myself, how does one go about incorporating the bigrams and trigrams into one word cloud in R? I can’t get that last step on my own unfortunately.
