Part 3 Corpus Cleaning in Python

3.1 Basic cleaning pt 2

After cleaning the corpus with R, I attempted to replicate the process in Python both as a learning exercise and to see what might be different. For Python I focused on using the nltk library8, which seems to be equivalent in popularity to R’s tm package.

Here are the main steps and takeaways from the process:

  1. Regular expressions are essential for doing ‘better’ cleaning in Python. After a lot of trial and error devising patterns, I got some help from a friend and settled on a simple one that strips each document of the lecturer or host’s name when it appears on its own line rather than throughout the text. I counted this as a small win and a primary point of difference between the Python and R cleaning processes. In R, I removed lecturer/host names from the entire corpus, even when someone might be mentioned in another lecture, which could be an issue for any meaningful attempt at creating networks between people (the repeated use of names throughout each file is also a huge noise factor for any textual analysis). With this regex pattern I was able to remove only the instances of names that are pure noise.

  2. Another major difference that emerged through this comparison was the ease of processing stopwords. R allows n-grams in a stopword list, whereas in Python this needs to be handled with additional code. I tried a few regex solutions but couldn’t get anything working beyond unigrams, so this became another point of difference between the two versions: in Python I broke all the custom stopwords down into unigrams. The main issue with this is that one of the custom stopwords is “Red Bull Music Academy,” which meant the word “music” was removed from the Python version of the corpus, and as we’ll see it is one of the most prominent words in the corpus (a possible workaround is sketched below).
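
For reference, one way to handle multi-word stopwords in Python would be a regex pass over the raw string before tokenizing, so that phrases are removed as units and never broken into unigrams. The sketch below shows the idea with an illustrative phrase list rather than my actual stoplist, as a record of what I was aiming for.

#sketch: strip multi-word stopwords before tokenizing so a word like 'music' survives on its own
ngram_stopwords = ['red bull music academy', 'audience member'] #illustrative phrases only
ngram_pattern = re.compile(r'\b(' + '|'.join(re.escape(p) for p in ngram_stopwords) + r')\b')
def remove_ngram_stopwords(input_string):
    #apply to the raw, lowercased string before the tokenizing step
    return ngram_pattern.sub(' ', input_string.lower())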

Below are cells showing the code I used for the cleaning and some relevant outputs.

Note that, to avoid running unnecessary code when rendering the notebooks, Python functions that require heavier processing are shown as plain code with their outputs included as images; code chunks are only executed, and their output shown, when they require minimal processing.

library(reticulate)
#import needed libraries
import pandas as pd
import nltk
from nltk.corpus import PlaintextCorpusReader, wordnet
import re
import os.path
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.tokenize.treebank import TreebankWordDetokenizer
#set raw corpus to be cleaned
rbma_corpus = PlaintextCorpusReader('/Users/laurentfintoni/Desktop/University/COURSE DOCS/YEAR 1/Q3/DIGITAL TEXT/PROJECT/rbma-lectures-master/data_no_empty_files_new_file_names_no_intro', '.*\.txt')
#load custom stoplist as df 
lecturer_names_df = pd.read_csv('/Users/laurentfintoni/Desktop/University/COURSE DOCS/YEAR 1/Q3/DIGITAL TEXT/PROJECT/rbma_stop_words_2.csv')
#set export directories
export_directory = '/Users/laurentfintoni/Desktop/University/COURSE DOCS/YEAR 1/Q3/DIGITAL TEXT/PROJECT/RBMA_CLEAN_PY_V1'
export_directory_two = '/Users/laurentfintoni/Desktop/University/COURSE DOCS/YEAR 1/Q3/DIGITAL TEXT/PROJECT/RBMA_CLEAN_PY_V2'
export_directory_three = '/Users/laurentfintoni/Desktop/University/COURSE DOCS/YEAR 1/Q3/DIGITAL TEXT/PROJECT/RBMA_CLEAN_PY_LEMM_V1'
export_directory_four = '/Users/laurentfintoni/Desktop/University/COURSE DOCS/YEAR 1/Q3/DIGITAL TEXT/PROJECT/RBMA_CLEAN_PY_LEMM_V2'
export_directory_five = '/Users/laurentfintoni/Desktop/University/COURSE DOCS/YEAR 1/Q3/DIGITAL TEXT/PROJECT/RBMA_CLEAN_PY_LEMM_V1_POS'
export_directory_six = '/Users/laurentfintoni/Desktop/University/COURSE DOCS/YEAR 1/Q3/DIGITAL TEXT/PROJECT/RBMA_CLEAN_PY_LEMM_V2_POS'
#clean each doc in corpus of punct, standard stopwords, custom stopwords and lecturer/interviewer names (only on new lines)
def filter_stop_punct(input_string):
    stopwords = nltk.corpus.stopwords.words('english') #nltk stopwords 
    custom_stopwords = ['red', 'bull', 'music', 'academy', 'yeah', 'like', 'applause', 'laughs', 'really', 'rbma', 'kinda', 'audience', 'member', 'just', 'can', 'going', 'get', 'got', 'something', 'lot', 'thing', 'things', 'one', 'kind', 'stuff', 'know', 'want', 'well'] #list of custom stopwords 
    lecturer_names = lecturer_names_df['RBMA stop words'].tolist() #import lecturer/host names 
    cleaning_regex = re.compile(r'\n\s*(' + '|'.join(lecturer_names) + r')\s*\n') #cleaning pattern for names on their own line 
    punct_tokenizer = nltk.RegexpTokenizer(r"\w+") #catch punct at tokenizing stage 
    newline_string = '\n' + input_string #add a new line to catch first mention of lecturer/host
    text_lower = newline_string.lower() #lower everything 
    text_clean = re.sub(cleaning_regex, '\n', text_lower) #remove names on their own line only
    text_tokenised = punct_tokenizer.tokenize(text_clean) #tokenize string
    text_clean = [w for w in text_tokenised if w not in stopwords] #remove stopwords
    text_clean_2 = [w for w in text_clean if w not in custom_stopwords] #remove custom stopwords 
    text_clean_3 = TreebankWordDetokenizer().detokenize(text_clean_2) #detokenize to get clean string back 
    text_clean_4 = re.sub('’|‘|–|“|”|…|—|dâm funk', '', text_clean_3) # catch loose bits
    text_clean_5 = re.sub(r'\d+', '', text_clean_4) #remove numbers
    text_clean_6 = re.sub(r'\ss\s', ' ', text_clean_5) #remove stray standalone 's' tokens (replace with a space so neighbouring words don't merge)
    text_final = ' '.join(text_clean_6.split()) #strip whitespace
    return text_final

#iterate through input corpus, clean, export to new directory 
def export_corpus(input_corpus):
    for d in input_corpus.fileids():
        clean_string = filter_stop_punct(input_corpus.raw(d))
        filename = d
        filepath = os.path.join(export_directory, filename)
        outfile = open(filepath, 'w')
        outfile.write(clean_string)
        outfile.close()
    return 'Process complete!'
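
For completeness, the export itself is a single call on the raw corpus; it is not re-executed here to avoid rewriting the files.

export_corpus(rbma_corpus)
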
#print the first few lines of the same lecture before and after cleaning 
rbma_corpus = PlaintextCorpusReader('/Users/laurentfintoni/Desktop/University/COURSE DOCS/YEAR 1/Q3/DIGITAL TEXT/PROJECT/rbma-lectures-master/data_no_empty_files_new_file_names_no_intro', '.*\.txt')
rbma_corpus_py_clean_v1 = PlaintextCorpusReader('/Users/laurentfintoni/Desktop/University/COURSE DOCS/YEAR 1/Q3/DIGITAL TEXT/PROJECT/RBMA_CLEAN_PY_V1', '.*\.txt')
print('Raw version:\n', rbma_corpus.raw('jam-and-lewis.txt')[0:200])
## Raw version:
##  Jeff Mao
## Our lecturers today are a songwriting and producing team from Minneapolis, Minnesota, that you may have heard of. They happen to be two of the best to have ever done it. So please welcome Jim
print('Clean version:\n', rbma_corpus_py_clean_v1.raw('jam-and-lewis.txt')[0:200])
## Clean version:
##  lecturers today songwriting producing team minneapolis minnesota may heard happen two best ever done please welcome jimmy jam terry lewis thank thank us yes talk talk would love us listen first reset

The resulting corpus was labelled as RBMA CLEAN PY V1.

I then ran a second version of the cleaning function which kept the custom stopwords in and only removed lecturer and host names when they appeared on their own line, as I felt this might give me an interesting corpus version that still included the key word “music” and kept host and lecturer names only where they are relevant rather than as noise throughout.

#clean each doc in corpus but no custom stopwords and lecturer and host names only on new lines 
def filter_names(input_string):
    stopwords = nltk.corpus.stopwords.words('english') #nltk stopwords 
    lecturer_names = lecturer_names_df['RBMA stop words'].tolist() #import lecturer/host names 
    cleaning_regex = re.compile(r'\n\s*(' + '|'.join(lecturer_names) + r')\s*\n') #cleaning pattern for names on their own line 
    punct_tokenizer = nltk.RegexpTokenizer(r"\w+") #catch punct at tokenizing stage 
    newline_string = '\n' + input_string #add a new line to catch first mention of lecturer/host
    text_lower = newline_string.lower() #lower everything 
    text_clean = re.sub(cleaning_regex, '\n', text_lower) #remove names on their own line only
    text_tokenised = punct_tokenizer.tokenize(text_clean) #tokenize string
    text_clean = [w for w in text_tokenised if w not in stopwords] #remove stopwords
    text_clean_2 = TreebankWordDetokenizer().detokenize(text_clean) #detokenize to get clean string back 
    text_clean_3 = re.sub('’|‘|–|“|”|…|—|dâm funk', '', text_clean_2) # catch loose bits
    text_clean_4 = re.sub(r'\d+', '', text_clean_3) #remove numbers
    text_clean_5 = re.sub(r'\ss\s', ' ', text_clean_4) #remove stray standalone 's' tokens (replace with a space so neighbouring words don't merge)
    text_final = ' '.join(text_clean_5.split()) #strip whitespace
    return text_final

#iterate through input corpus, clean, export to new directory 
def export_corpus_names(input_corpus):
    for d in input_corpus.fileids():
        clean_string = filter_names(input_corpus.raw(d))
        filename = d
        filepath = os.path.join(export_directory_two, filename)
        outfile = open(filepath, 'w')
        outfile.write(clean_string)
        outfile.close()
    return 'Process complete!'

The resulting corpus was labelled as RBMA CLEAN PY V2.
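
In hindsight, filter_stop_punct and filter_names differ only in whether the custom stopword filter is applied, so they could be collapsed into one parameterised function. A minimal sketch, assuming the custom_stopwords list is lifted out to module level:

#sketch: a single cleaning function with a switch for the custom stopwords
def filter_corpus(input_string, use_custom_stopwords=True):
    text = filter_names(input_string) #shared cleaning steps
    if use_custom_stopwords:
        text = ' '.join(w for w in text.split() if w not in custom_stopwords)
    return text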

#print the first few lines of the same lecture to show differences btw v1 and v2
rbma_corpus_py_clean_v2 = PlaintextCorpusReader('/Users/laurentfintoni/Desktop/University/COURSE DOCS/YEAR 1/Q3/DIGITAL TEXT/PROJECT/RBMA_CLEAN_PY_V2', '.*\.txt')
print('Python V1:\n', rbma_corpus_py_clean_v1.raw('jam-and-lewis.txt')[0:300])
## Python V1:
##  lecturers today songwriting producing team minneapolis minnesota may heard happen two best ever done please welcome jimmy jam terry lewis thank thank us yes talk talk would love us listen first reset clean ears whatever else play actually album produced janet jackson last year cut entitled broken he
print('Python V2:\n', rbma_corpus_py_clean_v2.raw('jam-and-lewis.txt')[0:300])
## Python V2:
##  lecturers today songwriting producing team minneapolis minnesota may heard happen two best ever done please welcome jimmy jam terry lewis applause thank thank us yes lot music talk lot things talk would love us listen something first reset clean ears whatever else want play something actually album

3.2 Lemmatizing vs Stemming pt 2

For the lemmatizing process in Python I opted for the WordNetLemmatizer module bundled with nltk. The first few tries yielded no changes because I was calling the lemmatizer method directly on the raw strings; I eventually realised I needed to tokenize the raw strings first, call the method on each token, and then detokenize the result using nltk’s TreebankWordDetokenizer. The code for this is shown below.9

lemmatizer = WordNetLemmatizer()
#lemmatize and export cleaned corpus
def lemma_corpus(input_corpus):
    for d in input_corpus.fileids():
        tokenize_str = word_tokenize(input_corpus.raw(d))
        lemmatized = [lemmatizer.lemmatize(w) for w in tokenize_str]
        detokenise_str = TreebankWordDetokenizer().detokenize(lemmatized)
        filename = d
        filepath = os.path.join(export_directory_three, filename)
        outfile = open(filepath, 'w')
        outfile.write(detokenise_str)
        outfile.close()
    return 'Process complete!'
#print the first few lines of the same lecture to show differences btw clean and lemmatized 
rbma_corpus_py_clean_lemm_v1 = PlaintextCorpusReader('/Users/laurentfintoni/Desktop/University/COURSE DOCS/YEAR 1/Q3/DIGITAL TEXT/PROJECT/RBMA_CLEAN_PY_LEMM_V1', '.*\.txt')
print('Clean version V1:\n', rbma_corpus_py_clean_v1.raw('jam-and-lewis.txt')[0:500])
## Clean version V1:
##  lecturers today songwriting producing team minneapolis minnesota may heard happen two best ever done please welcome jimmy jam terry lewis thank thank us yes talk talk would love us listen first reset clean ears whatever else play actually album produced janet jackson last year cut entitled broken hearts heal jimmy jam terry lewis janet jackson janet jackson broken hearts heal bad huh ok popular songs album popular album successful album story behind every song curious story behind song share us
print('Clean and lemmatized version V1:\n', rbma_corpus_py_clean_lemm_v1.raw('jam-and-lewis.txt')[0:500])
## Clean and lemmatized version V1:
##  lecturer today songwriting producing team minneapolis minnesota may heard happen two best ever done please welcome jimmy jam terry lewis thank thank u yes talk talk would love u listen first reset clean ear whatever else play actually album produced janet jackson last year cut entitled broken heart heal jimmy jam terry lewis janet jackson janet jackson broken heart heal bad huh ok popular song album popular album successful album story behind every song curious story behind song share u broken h

As we can see the process isn’t foolproof, with words like “us” being lemmatized to “u” and verb forms like “produced” not reduced to their base lemma. This is a by-product of the default settings of the WordNetLemmatizer, which treats every token as a noun unless told otherwise. The process can be refined further by using part-of-speech (POS) tagging and custom dictionaries to indicate which terms should be captured and how they should be tagged. Custom dictionaries are quite a bit of work so I skipped them for now, but I did tweak the lemmatization function to add baseline POS tagging for adjectives, nouns, verbs, and adverbs, which improved the accuracy of the process a little.

def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)

def lemma_pos_corpus(input_corpus):
    for d in input_corpus.fileids():
        tokenize_str = word_tokenize(input_corpus.raw(d))
        lemmatized = [lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in tokenize_str]
        detokenise_str = TreebankWordDetokenizer().detokenize(lemmatized)
        filename = d
        filepath = os.path.join(export_directory_four, filename)
        outfile = open(filepath, 'w')
        outfile.write(detokenise_str)
        outfile.close()
    return 'Process complete!' 
#print the first few lines of the same lecture to show differences btw clean and lemmatized 
rbma_corpus_py_clean_lemm_v1_pos = PlaintextCorpusReader('/Users/laurentfintoni/Desktop/University/COURSE DOCS/YEAR 1/Q3/DIGITAL TEXT/PROJECT/RBMA_CLEAN_PY_LEMM_V1_POS', '.*\.txt')
print('Lemmatized version V1:\n', rbma_corpus_py_clean_lemm_v1.raw('jam-and-lewis.txt')[0:500])
## Lemmatized version V1:
##  lecturer today songwriting producing team minneapolis minnesota may heard happen two best ever done please welcome jimmy jam terry lewis thank thank u yes talk talk would love u listen first reset clean ear whatever else play actually album produced janet jackson last year cut entitled broken heart heal jimmy jam terry lewis janet jackson janet jackson broken heart heal bad huh ok popular song album popular album successful album story behind every song curious story behind song share u broken h
print('Lemmatized with POS V1:\n', rbma_corpus_py_clean_lemm_v1_pos.raw('jam-and-lewis.txt')[0:500])
## Lemmatized with POS V1:
##  lecturer today songwriting produce team minneapolis minnesota may heard happen two best ever do please welcome jimmy jam terry lewis thank thank u yes talk talk would love u listen first reset clean ear whatever else play actually album produce janet jackson last year cut entitle broken heart heal jimmy jam terry lewis janet jackson janet jackson broken heart heal bad huh ok popular song album popular album successful album story behind every song curious story behind song share u broken heart h

We can see a slight increase in accuracy, especially with verb forms, but we’re still getting issues with “us” and with “heard” not being recognised as a verb form, likely because each token is POS-tagged in isolation and the tagger has no surrounding context to work with.
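
Had I gone down the custom dictionary route mentioned above, one lightweight option would be a small override dictionary checked before the lemmatizer is called; the entries below are purely illustrative:

#sketch: protect known problem tokens with an override dictionary before lemmatizing
lemma_overrides = {'us': 'us', 'heard': 'hear'} #illustrative entries
def lemmatize_with_overrides(word):
    if word in lemma_overrides:
        return lemma_overrides[word]
    return lemmatizer.lemmatize(word, get_wordnet_pos(word))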

The second version of the Python clean corpus was also lemmatized, resulting in eight versions of the original raw corpus:

  • R clean v1 (custom stopwords)
  • R clean v2 (custom stopwords + all lecturer and host names removed)
  • R clean and lemmatized v1 and v2 (using textstem)
  • Python clean v1 (custom stopwords as unigrams)
  • Python clean v2 (no custom stopwords + lecturer and host names removed only from new lines)
  • Python clean and lemmatized v1 and v2 (using WordNetLemmatizer and POS)

Lastly, I ran a quick check on total word counts for each version of the corpus to make sure everything looked ok. Below is the code and the resulting dataframe, which I’d pickled in advance.

#Set corpora roots 

rbma_corpus = PlaintextCorpusReader('/Users/laurentfintoni/Desktop/University/COURSE DOCS/YEAR 1/Q3/DIGITAL TEXT/PROJECT/rbma-lectures-master/data_no_empty_files_new_file_names_no_intro', '.*\.txt')
rbma_corpus_clean_v1 = PlaintextCorpusReader('/Users/laurentfintoni/Desktop/University/COURSE DOCS/YEAR 1/Q3/DIGITAL TEXT/PROJECT/RBMA_CLEAN_R_V1', '.*\.txt')
rbma_corpus_clean_v2 = PlaintextCorpusReader('/Users/laurentfintoni/Desktop/University/COURSE DOCS/YEAR 1/Q3/DIGITAL TEXT/PROJECT/RBMA_CLEAN_R_V2', '.*\.txt')
rbma_corpus_clean_lemm_v1 = PlaintextCorpusReader('/Users/laurentfintoni/Desktop/University/COURSE DOCS/YEAR 1/Q3/DIGITAL TEXT/PROJECT/RBMA_CLEAN_R_LEMM_V1', '.*\.txt')
rbma_corpus_clean_lemm_v2 = PlaintextCorpusReader('/Users/laurentfintoni/Desktop/University/COURSE DOCS/YEAR 1/Q3/DIGITAL TEXT/PROJECT/RBMA_CLEAN_R_LEMM_V2', '.*\.txt')
rbma_corpus_py_clean_v1 = PlaintextCorpusReader('/Users/laurentfintoni/Desktop/University/COURSE DOCS/YEAR 1/Q3/DIGITAL TEXT/PROJECT/RBMA_CLEAN_PY_V1', '.*\.txt')
rbma_corpus_py_clean_v2 = PlaintextCorpusReader('/Users/laurentfintoni/Desktop/University/COURSE DOCS/YEAR 1/Q3/DIGITAL TEXT/PROJECT/RBMA_CLEAN_PY_V2', '.*\.txt')
rbma_corpus_py_clean_lemm_v1 = PlaintextCorpusReader('/Users/laurentfintoni/Desktop/University/COURSE DOCS/YEAR 1/Q3/DIGITAL TEXT/PROJECT/RBMA_CLEAN_PY_LEMM_V1_POS', '.*\.txt')
rbma_corpus_py_clean_lemm_v2 = PlaintextCorpusReader('/Users/laurentfintoni/Desktop/University/COURSE DOCS/YEAR 1/Q3/DIGITAL TEXT/PROJECT/RBMA_CLEAN_PY_LEMM_V2_POS', '.*\.txt')

#make a list of the corpora and give them pretty titles
rbma_corpus_list = [rbma_corpus, rbma_corpus_clean_v1, rbma_corpus_clean_v2, rbma_corpus_clean_lemm_v1, rbma_corpus_clean_lemm_v2, rbma_corpus_py_clean_v1, rbma_corpus_py_clean_v2, rbma_corpus_py_clean_lemm_v1, rbma_corpus_py_clean_lemm_v2]
title_list = ['RBMA Raw', 'RBMA R V1', 'RBMA R V2', 'RBMA R V1 LEMM', 'RBMA R V2 LEMM', 'RBMA PY V1', 'RBMA PY V2', 'RBMA PY V1 LEMM', 'RBMA PY V2 LEMM']

#get total length of corpora to compare 
def corpus_lengths(corpus_list):
    total_length = [len(c.words()) for c in corpus_list]
    length_df = pd.DataFrame({'Corpus': title_list, 'Length': total_length}).set_index('Corpus')
    return length_df 

corpus_lengths(rbma_corpus_list)
corpus_lengths_pkl = '/Users/laurentfintoni/Desktop/University/COURSE DOCS/YEAR 1/Q3/DIGITAL TEXT/PROJECT/PICKLES/corpus_lengths.pkl'
rbma_corpus_lengths = pd.read_pickle(corpus_lengths_pkl)
rbma_corpus_lengths
##                   Length
## Corpus                  
## RBMA Raw         6520133
## RBMA R V1        2045574
## RBMA R V2        1963253
## RBMA R V1 LEMM   2045696
## RBMA R V2 LEMM   1963375
## RBMA PY V1       1952929
## RBMA PY V2       2248159
## RBMA PY V1 LEMM  1952837
## RBMA PY V2 LEMM  2248066

Counting all the words in the nltk corpus objects shows a slight increase in total words between the R clean and lemmatized versions; however, the Document Term Matrices from 2.2 confirm that there has been a reduction in unique words. I’m not sure why this increase in total word count happened but assume it is a by-product of the lemmatizing process creating additional tokens. Aside from this everything looks correct: R V2 has fewer words because all host and lecturer names were removed; Python V2 has more words because the custom stopwords were kept; and the lemmatized versions of the Python corpora have slightly lower total word counts but a more meaningful drop in unique terms if we inspect their DTMs.
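
The same check could be run directly on vocabulary size rather than going back to the DTMs; a quick sketch using the corpus and title lists defined above (not executed here):

#sketch: unique word counts per corpus, to complement the totals above
def corpus_vocab_sizes(corpus_list):
    vocab_size = [len(set(c.words())) for c in corpus_list]
    return pd.DataFrame({'Corpus': title_list, 'Unique words': vocab_size}).set_index('Corpus')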

I then proceeded with three versions of the corpus:

  1. The raw version, as a benchmark
  2. R V2 LEMM, the version with the most custom stopwords and names removed
  3. PY V2 LEMM, the version with the fewest custom stopwords and names removed

Overall, the experience of iteratively cleaning the corpus in both R and Python in a variety of ways really underlined how this process can quickly grow out of hand and needs to be reined in at some point, as chasing the idea of a perfectly clean corpus can feel like an endless pursuit.


  1. https://www.nltk.org/book/↩︎

  2. The following were used as reference for lemmatizing functions and code: Selva Prabhakaran’s Lemmatization Approaches with Examples in Python and Hafsa Jabeen’s Stemming and Lemmatization in Python↩︎