Standardizing Spelling and Locations with Python: Pyspellchecker and Mordecai
An important step in preprocessing text data is making sure that strings with essentially the same meaning but different forms are standardized so that they can be used in analysis. A simple form of this is accomplished by lemmatization, which replaces words with their root forms. This tutorial will explain how to standardize spelling and location names in the dataset for Kaggle’s “Real or Not? NLP with Disaster Tweets” competition.
Standardizing Spelling with Pyspellchecker
There are multiple algorithms and Python libraries for standardizing spelling. This section will explain how to use the Pyspellchecker library.
Pyspellchecker works by generating every string within a predetermined edit distance of a given word. Put another way, it finds all the strings that can be created by inserting, deleting, replacing, and transposing characters in that word. It compares each candidate to a dictionary of known words and their frequencies and returns the known word with the highest frequency as the most probable correct spelling. For more details on how the algorithm works, check out the documentation.
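The idea described above can be sketched in a few lines of plain Python. This is an illustration of the general approach, in the style of Peter Norvig’s well-known spelling corrector, and not Pyspellchecker’s actual source code. The tiny WORD_FREQ dictionary (with made-up counts) and the function names edits1 and correct are assumptions introduced for the example.

```python
import string

# Toy frequency dictionary (made-up counts, for illustration only).
WORD_FREQ = {"the": 500, "then": 80, "ten": 40, "tea": 30}

def edits1(word):
    """Return all strings reachable from `word` with a single edit.

    The word itself is included, because replacing a letter with the
    same letter counts as a (trivial) replacement here.
    """
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:]
                for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    """Pick the most frequent known word among the candidates."""
    if word in WORD_FREQ:
        return word
    known = [w for w in edits1(word) if w in WORD_FREQ]
    # Fall back to the original word when nothing in the dictionary matches.
    return max(known, key=WORD_FREQ.get) if known else word

print(correct("teh"))  # the
```

Here ‘teh’ has three known candidates one edit away (‘the’, ‘ten’, and ‘tea’), and ‘the’ wins because it has the highest count in the toy dictionary.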
The default edit distance is 2, which means that it will find all candidates that can be created by editing the original word no more than 2 times. We will reduce the edit distance to 1, however, because our dataset contains over 20,000 words, some of which are very long, and generating the distance-2 candidates for each one takes a prohibitively long time.
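To get a feel for why the larger edit distance is so much more expensive, the sketch below counts the candidates for a single short word at distance 1 and distance 2. It uses the same illustrative edits1 helper idea as a Norvig-style corrector; the function names and the example word are assumptions for this example, not part of Pyspellchecker’s API.

```python
import string

def edits1(word):
    """Return all strings reachable from `word` with a single edit."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:]
                for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    """Return all strings reachable with up to two edits (edits of edits)."""
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}

word = "flood"
n1 = len(edits1(word))
n2 = len(edits2(word))
print(f"distance 1: {n1} candidates, distance 2: {n2} candidates")
```

The distance-2 set is built by editing every distance-1 candidate again, so its size grows roughly with the square of the distance-1 count, and both grow with word length.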
Now that you understand the basics of how Pyspellchecker works, let’s get started with implementing it. First, install Pyspellchecker and import it into your Python notebook. Then import the data you’ll be spellchecking. For this tutorial, we’ll use the Kaggle disaster tweets training data that has been preprocessed so that the text of the tweets is one-hot encoded as a column for each word. For more details on the initial preprocessing, see the previous tutorial. Since the words are encoded as columns, we will be running the spellchecker on column names.
%pip install pyspellchecker
from spellchecker import SpellChecker
import pandas as pd

#import data
data = pd.read_csv("data/train_preprocessed.csv")
Once the data is imported, create a spellchecker instance as shown below. Don’t forget to reduce the edit distance so that it runs faster on our large dataset. If you want to play around with the spellchecker before running it on the entire dataset, you can test it on a short list of words. The method we’ll be using is correction(), which returns the single most probable correct spelling for a given word.
## initialize spellchecker and test on a few words
spell = SpellChecker(distance=1)
test = ['na', 'lief']
for word in test:
    # Get the one `most likely` answer
    print(spell.correction(word))
Now we can spell check each word in the dataset. For our preprocessed data, this means we’ll spell check all of the column names except for those that contain the original data (the first four columns). We’ll iterate through each word, filter out those that contain hashtags, mentions, or URLs, and then run the spellchecker on the word. If it returns a corrected word that isn’t the same as the input, we’ll store it in a dictionary that we’ll use to change the column names later on. For debugging and monitoring purposes, we’ll also count and print each correction.
#store all of the columns that encode words (all but the first 4, which contain the original tweet data)
bag_of_words = data.columns[4:]

## for each word (one-hot encoded as column names) replace with spellcheck version
column_map = {} #dictionary to store map for corrected words
n_corrected = 0 #count number of corrections made

#iterate through all columns that represent tweet words
for word in bag_of_words:
    #filter out hashtags, mentions, and urls
    if '#' not in word and 'http://' not in word and '@' not in word:
        #use spell check to get correct version of word
        corrected = spell.correction(word)
        #if word is corrected by spellchecker
        #(some versions of the library return None when no candidate is found, so skip those):
        if corrected is not None and word != corrected:
            #store correct version in dictionary, then print and count
            column_map[word] = corrected
            n_corrected += 1
            print(word, 'corrected to ', corrected)
Finally, we’ll print out the number of words corrected to confirm that it doesn’t seem abnormally high or low. Our program corrected approximately 7% of words, which seems reasonable without doing more research into expected typo rates. Then we’ll replace each word with its corrected version.
#print out percentage of words corrected
print('% corrected: ', n_corrected / len(bag_of_words))
Output: % corrected: 0.07122426970506869
#rename columns to corrected version
data.rename(columns=column_map, inplace=True)
Now that all words have been spellchecked, some columns that previously encoded different strings might now represent the same corrected word. We can check how many column names are now duplicates of each other by comparing the length of the overall list of column names and the length of the list of unique column names. This reveals that there are over 700 duplicate columns, which we want to combine into a single column for the correctly spelled word.
print(len(data.columns), len(set(data.columns))) #after combining these numbers should be the same
Output: 21331 20554
We’ll do this by identifying the columns that encode the same corrected word and combining them by taking the maximum of the values in each row. The resulting single column will contain a value of 1 for any tweet that contains one of the word variations. We’ll create a modified, slightly smaller dataset by iteratively deleting the duplicate columns and adding the combined ones.
## combine corrected words that are now duplicates
#list the duplicated names up front so that dropping columns doesn't disturb the loop
duplicate_words = data.columns[data.columns.duplicated()].unique()
for this_word in duplicate_words:
    #combine duplicate columns by taking the maximum value;
    #if any row contains a 1 for a spelling variation, the combined output will also contain a 1
    combined_columns = data[this_word].max(axis=1, skipna=False)
    #delete duplicate columns
    data.drop(labels=this_word, axis=1, inplace=True)
    #insert combined column into dataset
    data[this_word] = combined_columns
    print(this_word, 'duplicate')
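To see what the row-wise maximum does on a small case, here is a toy version in plain Python, without pandas. The column names, the values, and the combine_duplicates helper are made up for illustration; the notebook itself works on the DataFrame as shown above.

```python
# Toy table: a header with a duplicated name, then one row per tweet.
# Imagine 'fier' was corrected to 'fire', producing two 'fire' columns.
header = ["fire", "flood", "fire"]
rows = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
]

def combine_duplicates(header, rows):
    """Merge columns that share a name by taking the maximum value in each row."""
    # Group column positions by name, keeping first-seen order.
    positions = {}
    for idx, name in enumerate(header):
        positions.setdefault(name, []).append(idx)
    new_header = list(positions)
    new_rows = [[max(row[i] for i in idxs) for idxs in positions.values()]
                for row in rows]
    return new_header, new_rows

new_header, new_rows = combine_duplicates(header, rows)
print(new_header)  # ['fire', 'flood']
print(new_rows)    # [[1, 0], [0, 1], [1, 0]]
```

The first and third tweets both end up with a 1 in the single ‘fire’ column, regardless of which spelling they originally used. Unlike the DataFrame loop, this sketch keeps each combined column at the position where its name first appeared.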
If the number of column names and the number of unique column names are the same after combining all the words, then all duplicates have been successfully resolved.
print(len(data.columns), len(set(data.columns))) #after combining these numbers should be the same
Output: 20554 20554
Limitations
At this point, we’re technically done and ready to store the final dataset in a .csv for use in other notebooks. Before we move on, however, let’s discuss some of the limitations of this approach.
Edit distance: One limitation that is explicitly in the code is the maximum edit distance used to generate permutations for each word, which we manually set to 1. As discussed earlier, this limit saves us time, but it also means that our spellchecker would miss typos that were further removed from the original word.
Informal language and slang vs. dictionary language: If you scan through the spelling corrections, you’ll notice that some of the changed words were likely spelled as the user intended. ‘Prolly’, for example, is a common informal expression for ‘probably’, but the spellchecker corrects it to ‘polly.’ This is because the corpus used to identify the most probable permutations of each word is mostly based on standard or formal English, not informal spellings commonly used on Twitter and other social media platforms. If we had a corpus of word frequencies used in informal digital settings, some of these slang words may have been handled better by our spellchecker.
Lack of context: Pyspellchecker looks at words in isolation, not in context. It therefore can’t account for how the most likely word might change depending on the words that surround it. In some contexts, words that are rare overall might be more probable. A more advanced spellchecker would use the surrounding words to modify the probabilities of each permutation of the word it is correcting.
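One possible way to soften the informal-language limitation is to keep a small allow-list of slang that bypasses the spellchecker. The sketch below is only an illustration: the SLANG set, the correct_unless_slang helper, and the stand-in fake_correction function are all hypothetical, and in the notebook the corrector passed in would be spell.correction.

```python
# Hypothetical allow-list of informal spellings to leave untouched.
SLANG = {"prolly", "gonna", "lol"}

def correct_unless_slang(word, correct):
    """Apply `correct` to `word` unless the word is on the allow-list."""
    return word if word.lower() in SLANG else correct(word)

# Stand-in corrector that mimics the behaviour described above.
# It is not the real spellchecker, just a lookup table for the demo.
def fake_correction(word):
    return {"prolly": "polly", "worrd": "word"}.get(word, word)

print(correct_unless_slang("prolly", fake_correction))  # prolly
print(correct_unless_slang("worrd", fake_correction))   # word
```

The allow-list has to be maintained by hand, so this only helps with slang you already know to expect.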
You can view the Python notebook used to run the spellcheck code here.
Standardizing Locations with Mordecai
Now that spellchecking is done, we’ll standardize the locations represented in the tweet data using geoparsing, or extracting locations from text and matching them to a known place using an index of world locations. As with spellchecking, there are multiple Python geoparsing libraries to choose from. We’ll use a library called Mordecai because it is especially fast and handles non-US place names well compared to other options, such as fuzzy matching locations with the NLTK US-centric gazetteers corpus.
Given a text to geoparse, Mordecai first uses the spaCy NLP library to extract location entities. It then searches the GeoNames index, a huge global dataset of place names, for potential matches to the extracted entity. It decides which potential match is most likely correct using a neural net that was trained on labelled English texts. If one of the matches exceeds a probability cutoff (the default value is 0.6), the output will include the matched place name as well as information like the country where the location is and its latitude and longitude coordinates. For more information on how Mordecai works, see the full documentation and source code.
Setting up Mordecai can be difficult and time-consuming, but the result is fast and informative geoparsing. It has some very specific dependencies, so it’s strongly recommended that you run it in its own virtual environment. One of Mordecai’s dependencies (Tensorflow) doesn’t run on the newest version of Python, so I used a Python 3.7 virtual environment and recommend that you do the same.
Once you’ve created a virtual environment, follow the documentation to install Mordecai and other dependencies. You’ll also need to install Docker, which Mordecai uses to search the GeoNames index. Note that the ‘geonames_index.tar.gz’ file downloaded in the third setup step takes a long time to load (about 30 minutes). Make sure that you don’t quit the command terminal before the ‘wget https://andrewhalterman.com/files/geonames_index.tar.gz --output-file=wget_log.txt’ command has finished executing.
Once the installation process is complete, set up a new Python notebook for geoparsing as follows:
from mordecai import Geoparser
import pandas as pd
import matplotlib.pyplot as plt
If there’s an error importing Mordecai, it’s likely because you’re not running the notebook in the correct virtual environment. You can run ‘! which python’ in your notebook to check which environment it’s using. If it’s the wrong one, create a new notebook from the command line of the correct virtual environment.
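A shell command like ‘! which python’ reports whatever the shell finds first on its path, which may not be the interpreter behind the notebook kernel. As a complementary check, the short sketch below asks the running interpreter directly, using only the standard library.

```python
import sys

# Path of the interpreter running this kernel; it should point inside
# the virtual environment created for Mordecai.
print(sys.executable)

# Major and minor version of that interpreter (the setup above used 3.7).
print(sys.version_info[:2])

# sys.prefix differs from sys.base_prefix when a venv-style environment
# is active. Conda environments are not detected by this comparison.
in_venv = sys.prefix != getattr(sys, "base_prefix", sys.prefix)
print("inside a virtual environment:", in_venv)
```

If the printed path points at your system Python rather than the environment you created, the notebook is running on the wrong kernel.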
First, import the preprocessed training data using the same method as the spellchecking example above.
#import training data
train = pd.read_csv("data/train.csv")
We’ll test geoparsing with a small sample before running it on the entire dataset. The text of the second tweet mentions a location, so we can use that as our test string.
test_string = train.at[1, 'text']
print(test_string)
Output: Forest fire near La Ronge Sask. Canada
Next we’ll instantiate a geoparser, geoparse our test string, and print out the results.
geo = Geoparser()
test_result = geo.geoparse(test_string)
print(test_result)
Output:
[{'word': 'La Ronge Sask', 'spans': [{'start': 17, 'end': 30}], 'country_predicted': 'USA', 'country_conf': 0.2353872}, {'word': 'Canada', 'spans': [{'start': 32, 'end': 38}], 'country_predicted': 'CAN', 'country_conf': 0.9516948, 'geo': {'admin1': 'NA', 'lat': '60.10867', 'lon': '-113.64258', 'country_code3': 'CAN', 'geonameid': '6251999', 'place_name': 'Canada', 'feature_class': 'A', 'feature_code': 'PCLI'}}]
We can see that geoparse() returns a list containing one dictionary of information for each extracted location. For our test string, it extracted two locations (‘La Ronge Sask.’ and ‘Canada’) and therefore returned two dictionaries. The first location, ‘La Ronge Sask.’, was not matched to a place. The geoparser identified the most probable country for the location as ‘USA’, but its confidence level (given by ‘country_conf’) is less than the default 0.6 cutoff, so the match wasn’t completed. The second location, ‘Canada’, was successfully matched to the country Canada, with an approximately 95% confidence level. Since the match was successful, the output includes a ‘geo’ dictionary of information about the matched place, including its coordinates and standardized place name.
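The structure described above can be explored without running the geoparser by pasting the printed output in as a literal. The sketch below separates matched from unmatched entities by checking for the ‘geo’ key; the variable names matched and unmatched are just for this example.

```python
# The geoparser output printed above, copied in as a literal.
test_result = [
    {"word": "La Ronge Sask", "spans": [{"start": 17, "end": 30}],
     "country_predicted": "USA", "country_conf": 0.2353872},
    {"word": "Canada", "spans": [{"start": 32, "end": 38}],
     "country_predicted": "CAN", "country_conf": 0.9516948,
     "geo": {"admin1": "NA", "lat": "60.10867", "lon": "-113.64258",
             "country_code3": "CAN", "geonameid": "6251999",
             "place_name": "Canada", "feature_class": "A",
             "feature_code": "PCLI"}},
]

# Split the entities into matched (has a 'geo' entry) and unmatched.
matched = [r for r in test_result if "geo" in r]
unmatched = [r for r in test_result if "geo" not in r]

for r in matched:
    geo = r["geo"]
    # Coordinates arrive as strings, so convert before doing arithmetic.
    print(geo["place_name"], float(geo["lat"]), float(geo["lon"]))
for r in unmatched:
    print(r["word"], "not matched; country_conf =", r["country_conf"])
```

Note that the latitude and longitude are strings in the output, which matters later when they are stored in numeric columns.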
Now that we understand what the geoparser output looks like, we can use it to extract location information from our dataset. We’ll first run it on the ‘location’ column, which contains the location part of the Twitter user’s profile. Note that many of these values are missing because it is common for users to not provide their location; it’s also possible for the value to be a non-location, like ‘Worldwide!!’. We’ll run our geoparser on each location entry and store all of the ‘geo’ dictionaries for successfully matched places. The result is a ‘geoparsed_location’ column that contains the raw output of the geoparser.
##iterate through every row and geoparse location data
#add column for geoparsed locations
train.insert(0, 'geoparsed_location', [[] for i in range(len(train))])

for i, row in train.iterrows():
    print(i, ': start')
    this_location = row['location']
    this_geoparsed_locations = []
    #skip locations that aren't strings (missing values are read in as NaN)
    if isinstance(this_location, str):
        this_result = geo.geoparse(this_location)
        #iterate through the results for each location entity identified
        for location in this_result:
            #append the geo dictionary if it exists/if a location was successfully identified
            if 'geo' in location:
                this_geoparsed_locations.append(location['geo'])
    #store in dataframe
    train.at[i, 'geoparsed_location'] = this_geoparsed_locations
    #print for checking accuracy
    print(this_location, this_geoparsed_locations)
Next, we’ll repeat the process using the ‘text’ column, which contains the text of each tweet. If a user mentioned a location in their tweet, as is the case with our test string, the geoparser should identify it and match it to a place. Once again, the result is a ‘geoparsed_text’ column that contains the raw output of the geoparser.
##repeat, but for text data instead of location data
train.insert(0, 'geoparsed_text', [[] for i in range(len(train))])

for i, row in train.iterrows():
    print(i, ': start')
    this_location = row['text']
    this_geoparsed_locations = []
    #skip values that aren't strings
    if isinstance(this_location, str):
        try: #handle errors by replacing with empty list
            this_result = geo.geoparse(this_location)
        except Exception:
            this_result = []
        #iterate through the results for each location entity identified
        for location in this_result:
            #append the geo dictionary if it exists/if a location was successfully identified
            if 'geo' in location:
                this_geoparsed_locations.append(location['geo'])
    #store in dataframe
    train.at[i, 'geoparsed_text'] = this_geoparsed_locations
    #print for checking accuracy
    print(this_location, this_geoparsed_locations)
Since geoparsing the ‘location’ and ‘text’ columns takes a while, I recommend storing your dataset as a preliminary .csv file at this point so that you don’t have to re-geoparse the data if you have to restart your notebook for some reason.
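One caveat when saving at this point: a CSV file stores every cell as text, so a column holding lists of dictionaries comes back as strings when the file is reloaded. The sketch below demonstrates the round trip with the standard library’s csv module and an in-memory buffer, then restores the list with ast.literal_eval. The sample row is a shortened copy of the ‘geo’ dictionary from the test output; applying the same conversion to a reloaded DataFrame column is left as an assumption about your workflow.

```python
import ast
import csv
import io

# A cell shaped like the 'geoparsed_text' column: a list of 'geo' dictionaries.
places = [{"place_name": "Canada", "country_code3": "CAN",
           "lat": "60.10867", "lon": "-113.64258"}]

# Write to an in-memory CSV; the list is stored as its string representation.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["id", "geoparsed_text"])
writer.writerow([1, places])

# Read it back: the cell is now a plain string, not a list.
buffer.seek(0)
row = next(csv.DictReader(buffer))
print(type(row["geoparsed_text"]))

# ast.literal_eval safely turns that string back into Python objects.
restored = ast.literal_eval(row["geoparsed_text"])
print(restored[0]["place_name"])  # Canada
```

ast.literal_eval only evaluates literals such as lists, dictionaries, strings, and numbers, which makes it a safer choice than eval for this purpose.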
We’ll now extract relevant information from the geoparser outputs and store it in new columns. To do this, we’ll first check how many locations were identified for each ‘location’ and ‘text’ value.
##check number of locations identified for location and text data
plt.hist([len(r['geoparsed_location']) for i, r in train.iterrows()])
plt.title('Number of locations identified in location entry')
plt.show()

plt.hist([len(r['geoparsed_text']) for i, r in train.iterrows()])
plt.title('Number of locations identified in text')
plt.show()

#it looks like both text and location entries mostly had fewer than 2 geoparsed locations, so we will only store two for each
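The counts behind the histograms can also be tabulated directly, which makes it easy to check how many rows would lose information under a two-place cap. The sketch below uses a small made-up list in place of the real ‘geoparsed_text’ column, so the numbers it prints are illustrative only.

```python
from collections import Counter

# Made-up stand-in for the 'geoparsed_text' column: one list of places per tweet.
geoparsed_text = [
    [],
    [{"place_name": "Canada"}],
    [],
    [{"place_name": "A"}, {"place_name": "B"}],
    [],
]

# Count how many tweets have 0, 1, 2, ... matched places.
counts = Counter(len(places) for places in geoparsed_text)
for n_places in sorted(counts):
    print(n_places, "place(s):", counts[n_places], "tweet(s)")

# Share of tweets that would be truncated if only two places are kept.
share_truncated = sum(c for n, c in counts.items() if n > 2) / len(geoparsed_text)
print("share with more than two places:", share_truncated)
```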
Though there are some exceptions, most strings contain fewer than two geoparsed places. We’ll therefore store information for a maximum of two places geoparsed from the ‘location’ data and two places geoparsed from the ‘text’ data. I decided that the most important information is the place name, coordinates, and country, so I inserted new columns as shown below.
##insert columns to store info for geoparsed locations in location and text data
#store name, coordinates, and country for each location

train.insert(7, 'gp_loc_1_place_name', ['' for i in range(len(train))])
train.insert(7, 'gp_loc_1_country', ['' for i in range(len(train))])
train.insert(7, 'gp_loc_1_lat', [0.0 for i in range(len(train))])
train.insert(7, 'gp_loc_1_long', [0.0 for i in range(len(train))])

train.insert(7, 'gp_loc_2_place_name', ['' for i in range(len(train))])
train.insert(7, 'gp_loc_2_country', ['' for i in range(len(train))])
train.insert(7, 'gp_loc_2_lat', [0.0 for i in range(len(train))])
train.insert(7, 'gp_loc_2_long', [0.0 for i in range(len(train))])

train.insert(7, 'gp_txt_1_place_name', ['' for i in range(len(train))])
train.insert(7, 'gp_txt_1_country', ['' for i in range(len(train))])
train.insert(7, 'gp_txt_1_lat', [0.0 for i in range(len(train))])
train.insert(7, 'gp_txt_1_long', [0.0 for i in range(len(train))])

train.insert(7, 'gp_txt_2_place_name', ['' for i in range(len(train))])
train.insert(7, 'gp_txt_2_country', ['' for i in range(len(train))])
train.insert(7, 'gp_txt_2_lat', [0.0 for i in range(len(train))])
train.insert(7, 'gp_txt_2_long', [0.0 for i in range(len(train))])

#check that columns were inserted successfully
train.columns
Output: Index(['geoparsed_text', 'geoparsed_location', 'id', 'keyword', 'location',
'text', 'target', 'gp_loc_1_place_name', 'gp_txt_2_long',
'gp_txt_2_lat', 'gp_txt_2_country', 'gp_txt_2_place_name',
'gp_txt_1_long', 'gp_txt_1_lat', 'gp_txt_1_country',
'gp_txt_1_place_name', 'gp_loc_2_long', 'gp_loc_2_lat',
'gp_loc_2_country', 'gp_loc_2_place_name', 'gp_loc_1_long',
'gp_loc_1_lat', 'gp_loc_1_country'],
dtype='object')
Now that we’ve decided what information to store, all we need to do is fill in the correct values from the geoparsed outputs we stored previously. We’ll do that for both the ‘location’ and ‘text’ data by iterating through a maximum of two geoparsed results for each value, accessing the ‘geo’ dictionary, and storing the result in the appropriate column.
#fill in columns with stored geoparsing results
for index, row in train.iterrows():
    #fill in geoparsed LOCATION data:
    if len(row['geoparsed_location']) > 0: #check if geoparsed data was collected
        info = row['geoparsed_location']
        for i in range(min(len(info), 2)): #iterate through info for max two geoparsed places
            geo_dict = info[i]
            ##store relevant info in correct column
            train.at[index, 'gp_loc_' + str(i + 1) + '_place_name'] = geo_dict['place_name']
            train.at[index, 'gp_loc_' + str(i + 1) + '_country'] = geo_dict['country_code3']
            #coordinates are strings in the geoparser output, so convert them to floats
            train.at[index, 'gp_loc_' + str(i + 1) + '_lat'] = float(geo_dict['lat'])
            train.at[index, 'gp_loc_' + str(i + 1) + '_long'] = float(geo_dict['lon'])
    #fill in geoparsed TEXT data:
    if len(row['geoparsed_text']) > 0: #check if geoparsed data was collected
        info = row['geoparsed_text']
        for i in range(min(len(info), 2)): #iterate through info for max two geoparsed places
            geo_dict = info[i]
            ##store relevant info in correct column
            train.at[index, 'gp_txt_' + str(i + 1) + '_place_name'] = geo_dict['place_name']
            train.at[index, 'gp_txt_' + str(i + 1) + '_country'] = geo_dict['country_code3']
            #coordinates are strings in the geoparser output, so convert them to floats
            train.at[index, 'gp_txt_' + str(i + 1) + '_lat'] = float(geo_dict['lat'])
            train.at[index, 'gp_txt_' + str(i + 1) + '_long'] = float(geo_dict['lon'])
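The two nearly identical branches above follow one pattern: take up to two ‘geo’ dictionaries and spread their fields across prefixed columns. As a sketch of that pattern in isolation, the hypothetical helper below (flatten_places is not part of the original notebook) builds the column values as a plain dictionary, using the same column naming scheme.

```python
def flatten_places(places, prefix, max_places=2):
    """Turn the first `max_places` 'geo' dictionaries into flat column values."""
    flat = {}
    for i, geo in enumerate(places[:max_places], start=1):
        flat[f"{prefix}_{i}_place_name"] = geo["place_name"]
        flat[f"{prefix}_{i}_country"] = geo["country_code3"]
        # Coordinates are stored as strings in the 'geo' dictionary.
        flat[f"{prefix}_{i}_lat"] = float(geo["lat"])
        flat[f"{prefix}_{i}_long"] = float(geo["lon"])
    return flat

# Values copied from the 'Canada' match in the test output.
canada = {"place_name": "Canada", "country_code3": "CAN",
          "lat": "60.10867", "lon": "-113.64258"}
print(flatten_places([canada], "gp_txt"))
```

Calling it once with the prefix ‘gp_loc’ and once with ‘gp_txt’ would cover both branches, and an empty list simply yields an empty dictionary.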
Limitations
And with that, we have our geoparsed dataset! As with spellchecking, let’s briefly describe the limitations of this approach.
Lack of context: The geoparser considers each extracted location entity independently, so it may miss classifications that are apparent to humans from the broader context of the text. Our test example demonstrated this limitation. ‘La Ronge Sask. Canada’ clearly refers to a single place in Canada, but the geoparser examined ‘La Ronge Sask.’ and ‘Canada’ as two separate entities. It correctly identified Canada, but failed to use that information to identify La Ronge as a town in the province of Saskatchewan. Instead, it guessed that ‘La Ronge Sask.’ was a place in the USA, though it assigned a suitably low probability to that incorrect guess.
Relies on GeoNames index: The GeoNames index is remarkably comprehensive, but it doesn’t include all possible locations, such as those that are very small or remote. On the other hand, the detailed locations that are present in the GeoNames index sometimes result in matches that are too specific. For example, the geoparser matched ‘Northern California’ to ‘Golden Gate Baptist Theological Seminary Northern California Campus’ instead of ‘California’ in general.
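One possible way to reduce overly specific matches is to filter on the ‘feature_class’ field that appears in the ‘geo’ dictionary. In GeoNames, class ‘A’ covers countries and administrative regions and class ‘P’ covers populated places, while class ‘S’ covers spots such as buildings. The sketch below is an assumption about how such a filter could look, not something the original notebook does; the second sample entry is hypothetical, and its class and code are assumed rather than taken from real output.

```python
# GeoNames feature classes to keep: 'A' (countries, states, regions)
# and 'P' (cities, towns, villages).
BROAD_CLASSES = {"A", "P"}

def keep_broad_places(geo_dicts):
    """Drop matches whose feature class suggests a building or other small spot."""
    return [g for g in geo_dicts if g.get("feature_class") in BROAD_CLASSES]

matches = [
    # From the test output earlier in this tutorial.
    {"place_name": "Canada", "feature_class": "A", "feature_code": "PCLI"},
    # Hypothetical entry for illustration only.
    {"place_name": "Example Seminary Campus", "feature_class": "S",
     "feature_code": "SCH"},
]
print([g["place_name"] for g in keep_broad_places(matches)])  # ['Canada']
```

A filter like this discards the overly specific match rather than replacing it with the broader region, so the row would end up with no place at all unless another candidate is available.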
You can view the Python notebook used to run the geoparsing code here.
Conclusion
Now that we’ve finished spellchecking and geoparsing the tweet data, we’ll merge the two datasets we created into one.
complete = geoparsed.merge(
    spellchecked,
    left_on=['id', 'keyword', 'location', 'text', 'target', 'Unnamed: 0'],
    right_on=['id', 'keyword', 'location_data', 'text_data', 'target_data', 'Unnamed: 0'],
)
The resulting dataset can then be used to train a model to predict which tweets are about ‘real’ disasters. The next tutorial will explain how to do that.