Part 2 Corpus Cleaning in R

2.1 Basic cleaning

I used R and the tm package (Feinerer and Hornik 2020) as the first tools for cleaning the RBMA corpus. This was an iterative process, taking around five to ten passes before the resulting versions of the corpus felt good enough to work with.

Here are the main steps and takeaways from the process:

  1. I ran word counts on the documents to find and remove empty files (there was only one). I then manually renamed the files to the lecturer name only (they were originally named with the URL of the lecture page) and removed the introductory paragraphs from each, as these created additional noise; a sketch of the word-count check appears after this list. I initially tried to use regular expressions to simplify and speed up this process but got stuck, so I moved forward manually. Eventually, with some help from a friend, I was able to write a regex solution, which I implemented in the Python workflow for a similar problem (and which we’ll see in the following notebook).

  2. I decided to create two different custom stopword lists. The core of each was based on the frequent noise words I knew existed in the corpus (such as “applause”, “audience member”, and the names of the hosts), alongside some of the top words gleaned from the Distant Reader processing. The second list also includes the lecturer names, as these were additional noise within each lecture, repeated every time the person spoke.

  3. The source lectures were plain text files encoded in UTF-8, but I ran tm’s punctuation removal method with its ASCII default, which meant I ended up having to run additional custom content transformers to catch the remaining loose punctuation (curly quotes, dashes and ellipses).

  4. After testing the cleaning process a few times, I built a Document Term Matrix (DTM) and inspected the top 15 words by frequency, using this to create a second custom stopword list. This one focused on catching words that had emerged from the previous steps (such as “don’t” becoming “don”) as well as removing irrelevant frequent words found via the DTM (such as “get” and “say”). I then also created some additional content transformers to sweep up the fragments left behind by the removal of pronouns, such as “ll” for “I’ll” and “d” for “I’d”.
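
As a reference for step 1, below is a minimal sketch of the kind of word-count check that can flag empty files; the directory path and object names are illustrative rather than the exact ones I used.

files<- list.files("FILES/rbma-lectures-master/data", full.names = TRUE); #list the raw lecture transcripts (illustrative path)
word_counts<- sapply(files, function(f) length(scan(f, what = "character", quiet = TRUE))); #rough word count per file
files[word_counts == 0] #any file with zero words is empty and can be removed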

Below are cells showing the code I used for the cleaning and some relevant outputs.7

library(tm) #load library
rbma_corpus<- VCorpus(DirSource("FILES/rbma-lectures-master/data_no_empty_files_new_file_names_no_intro")); #create corpus as Volatile Corpus to ensure that anything I did remained confined to the R object
writeLines(as.character(rbma_corpus[[1]])[0:5]); #print snippet of first doc to compare with final result 
## A guy called Gerald
## Hello. Hello. Thank you for turning up to the lecture. My name is A Guy Called Gerald and this is my... He’s doing the presenting, actually. 
## Torsten Schmidt
## Well, actually, you’re making the job a whole lot easier here because I just ran up the stairs and spent the last few hours looking at Excel sheets. I have to say ... This might be even more complicated from first sight, but it’s a lot more pleasant. 
## A guy called Gerald
rbma_corpus<- tm_map(rbma_corpus, content_transformer(tolower)); #lowercase everything 
rbma_corpus<- tm_map(rbma_corpus, removeNumbers); #remove numbers 
rbma_corpus<- tm_map(rbma_corpus, removeWords, stopwords("en")) #remove stopwords using tm inbuilt list
rbma_stopwords = read.csv("FILES/rbma_stop_words.csv", header = TRUE); #load in my first custom stopword list
rbma_stopwords_vec<- as.vector(rbma_stopwords$RBMA.stop.words); #vectorise it so it can be applied through tm_map
rbma_stopwords_vec #check it
##  [1] "chal ravens"            "hanna bächer"           "benji b"               
##  [4] "rollie pemberton"       "david nerattini"        "davide nerattini"      
##  [7] "cognito"                "teri gender bender"     "frosty"                
## [10] "brendan m gillen"       "jeff “chairman” mao"    "jeff mao"              
## [13] "jeff chang"             "hattie collins"         "deepti datt"           
## [16] "dj rekha"               "lauren martin"          "eothen “egon” alapatt" 
## [19] "johnny hockin"          "kenneth lobo"           "davide bortot"         
## [22] "julian brimmers"        "anupa mistry"           "noz"                   
## [25] "serko fu"               "todd l. burns"          "egon"                  
## [28] "patrick thévenin"       "geraldine sarratia"     "gerd janson"           
## [31] "tony nwachukwu"         "monk one"               "heinz reich"           
## [34] "nelson george"          "fabio de luca"          "nick dwyer"            
## [37] "fergus murphy"          "christine kakaire"      "tim sweeney"           
## [40] "duane jones"            "dj soulscape"           "brian reitzell"        
## [43] "osunlade"               "om’mas keith"           "kimberly drew"         
## [46] "carl wilson"            "masaaki hara"           "toby laing"            
## [49] "tito del aguila"        "erin macleod"           "shawn reynaldo"        
## [52] "audience member"        "rui miguel abreu"       "vivian host"           
## [55] "sacha jenkins"          "anthony obst"           "andrew barber"         
## [58] "susumu kunisaki"        "calle dernulf"          "shaheen ariefdien"     
## [61] "patrick pulsinger"      "adam baindridge"        "christina lee"         
## [64] "brian “b.dot” miller"   "jospeh ghosn"           "miss info"             
## [67] "ian christie"           "translator"             "alvin blanco"          
## [70] "étienne menu"           "denis boyarinov"        "todd burns"            
## [73] "torsten schmidt"        "emma warren"            "red bull music academy"
## [76] "yeah"                   "like"                   "applause"              
## [79] "laughs"                 "really"                 "rbma"                  
## [82] "kinda"                  "kind of"                "audience member"       
## [85] "just"                   "laughter"               "aaron gonsher"
rbma_corpus<- tm_map(rbma_corpus, removeWords, rbma_stopwords_vec); #remove custom sw 
rbma_corpus<- tm_map(rbma_corpus, removePunctuation); #remove punctuation only at this step otherwise it would affect elements of the custom stopword list
punctuation_clean<- content_transformer(function(x, pattern) {return (gsub(pattern, " ", x))});
rbma_corpus<- tm_map(rbma_corpus, punctuation_clean, "’");
rbma_corpus<- tm_map(rbma_corpus, punctuation_clean, "‘");
rbma_corpus<- tm_map(rbma_corpus, punctuation_clean, "“");
rbma_corpus<- tm_map(rbma_corpus, punctuation_clean, "”");
rbma_corpus<- tm_map(rbma_corpus, punctuation_clean, "–");
rbma_corpus<- tm_map(rbma_corpus, punctuation_clean, "…");
rbma_corpus<- tm_map(rbma_corpus, punctuation_clean, "—");
rbma_corpus<- tm_map(rbma_corpus, punctuation_clean, "dâmfunk") #apply the content transformer to catch various loose punctuation marks, as well as one lecturer name spelling that wouldn't stick through the stopword list, likely due to the accented a
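#Aside (assumption, not part of the pipeline above): in more recent versions of tm, removePunctuation() also accepts a ucp argument that uses Unicode character properties, which could catch curly quotes and dashes in a single pass, e.g.:
rbma_corpus_ucp<- tm_map(rbma_corpus, removePunctuation, ucp = TRUE) #alternative sketch only; the custom transformers above are what I actually ran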
dtm_stopwords = read.csv("FILES/dtm_stopwords.csv", header = FALSE); #load a second custom stopword list based on examining the DTM
dtm_stopwords_vec<- as.vector(dtm_stopwords$V1);
dtm_stopwords_vec
##  [1] "don"       "can"       "going"     "get"       "got"       "something"
##  [7] "lot"       "thing"     "get"       "things"    "one"       "kind"     
## [13] "stuff"     "know"      "want"      "well"      "didn"
rbma_corpus<- tm_map(rbma_corpus, removeWords, dtm_stopwords_vec); #remove those new stopwords 
rbma_corpus<- tm_map(rbma_corpus, content_transformer(function(x) gsub(x, pattern = "\\st\\s", replacement = "")));
rbma_corpus<- tm_map(rbma_corpus, content_transformer(function(x) gsub(x, pattern = "\\ss\\s", replacement = "")));
rbma_corpus<- tm_map(rbma_corpus, content_transformer(function(x) gsub(x, pattern = "\\sre\\s", replacement = "")));
rbma_corpus<- tm_map(rbma_corpus, content_transformer(function(x) gsub(x, pattern = "\\sm\\s", replacement = "")));
rbma_corpus<- tm_map(rbma_corpus, content_transformer(function(x) gsub(x, pattern = "\\sll\\s", replacement = "")));
rbma_corpus<- tm_map(rbma_corpus, content_transformer(function(x) gsub(x, pattern = "\\sve\\s", replacement = ""))); #content transformers with regexes to remove all the single/double letters that result from pronouns being removed
rbma_corpus<- tm_map(rbma_corpus, stripWhitespace) #finally strip whitespace 
writeLines(as.character(rbma_corpus[[1]])[0:5]); #print snippet of first doc to compare with initial result 
##  guy called gerald
## hello hello thank turning lecture name guy called gerald presenting actually 
## 
##  actually making job whole easier ran stairs spent last hours looking excel sheets say might even complicated first sight pleasant 
##  guy called gerald
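
For reference, the dtm_stopwords list loaded above came from inspecting the most frequent terms in a Document Term Matrix built on an intermediate pass of the corpus (step 4 above). A minimal sketch of that kind of inspection, with illustrative object names:

dtm_check<- DocumentTermMatrix(rbma_corpus); #build a DTM from the partially cleaned corpus
term_freqs<- sort(colSums(as.matrix(dtm_check)), decreasing = TRUE); #sum each term's frequency across all documents and sort
head(term_freqs, 15) #inspect the top 15 terms for residual noise words worth adding to the stopword list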

The resulting corpus was labelled RBMA CLEAN R V1. I then created a second version following the same process but using the second stopword list, which includes all the lecturer names; this version was labelled RBMA CLEAN R V2.
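
Each cleaned version was written back out to disk so it could be reloaded in later notebooks. A sketch of that step, using tm’s writeCorpus() as one way to do it (not necessarily the exact call I used):

writeCorpus(rbma_corpus, path = "FILES/RBMA_CLEAN_R_V1") #write each document of the cleaned corpus to its own text file in the target directory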


2.2 Lemmatizing vs Stemming

Once I had the basic cleaning done I ran some stemming tests, but decided that returning stems wasn’t useful for what I wanted to look at. I focused instead on lemmatizing, which leaves the lemma behind (rather than a potentially nonsensical stem) and could remove some additional noise, such as multiple forms of the same verb.
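
A stemming test of the kind described above can be run with tm’s stemDocument() (which uses the SnowballC stemmer); a minimal sketch, not necessarily the exact call I used:

rbma_corpus_stem<- tm_map(rbma_corpus, stemDocument); #stem each document with the Porter stemmer
writeLines(as.character(rbma_corpus_stem[[1]])[1:5]) #inspect the resulting stems (e.g. “turning” becomes “turn”)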

Lemmatization in turn threw up some challenges, especially in trying to combine the tm and textstem (Rinker 2018) packages. One major issue was being unable to use a dictionary for the lemmatization process, as the lemmatize_strings function from textstem has to be applied to a tm VCorpus object via tm_map and a content_transformer function. I tried using the built-in dictionaries, and I also built one from a DTM of the corpus, but in every case, when I passed the dictionary via tm_map, R would simply run endlessly until I force quit it. In the end the only thing that worked was to pass lemmatize_strings as is. Below I show the Document Term Matrix summary outputs for RBMA CLEAN R V1 and its lemmatized equivalent, which show the difference in total unique term counts.
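
In practice that call looked essentially like the following, wrapping lemmatize_strings in a content_transformer so tm_map can apply it to each document (object names illustrative):

library(textstem); #load the lemmatization package (Rinker 2018)
rbma_corpus_lemm<- tm_map(rbma_corpus, content_transformer(lemmatize_strings)) #lemmatize each document with the package defaults, since passing a custom dictionary caused R to hang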

rbma_corpus_v1<- VCorpus(DirSource("FILES/RBMA_CLEAN_R_V1"));
rbma_corpus_v1_lemm<- VCorpus(DirSource("FILES/RBMA_CLEAN_R_LEMM_V1"));
dtm_v1<- DocumentTermMatrix(rbma_corpus_v1);
dtm_v1_lemm<- DocumentTermMatrix(rbma_corpus_v1_lemm);
dtm_v1; 
## <<DocumentTermMatrix (documents: 468, terms: 55227)>>
## Non-/sparse entries: 635651/25210585
## Sparsity           : 98%
## Maximal term length: 63
## Weighting          : term frequency (tf)
dtm_v1_lemm
## <<DocumentTermMatrix (documents: 468, terms: 45314)>>
## Non-/sparse entries: 521915/20685037
## Sparsity           : 98%
## Maximal term length: 63
## Weighting          : term frequency (tf)

After the lemmatization process I had another two versions of the corpus: R V1 LEMM and R V2 LEMM.

Iterating through all these steps multiple times and repeatedly testing the results, for example by inspecting the output for specific corpus documents, was essential to catching as much noise as possible and creating a corpus that would be as useful as possible. As I’ll explain in the following notebooks, I undertook the text analysis and mining with the R V2 LEMM version of the corpus, but having multiple versions to compare at first was useful for understanding which cleaning choices to make, as well as the strength of the various noise factors.

References

Feinerer, Ingo, and Kurt Hornik. 2020. Tm: Text Mining Package. http://tm.r-forge.r-project.org/.
Rinker, Tyler. 2018. Textstem: Tools for Stemming and Lemmatizing Text. http://github.com/trinker/textstem.

  1. I used Eight 2 Late’s “Gentle introduction to text mining using R” as a primary resource for navigating the key steps of corpus cleaning with the tm package.