
Standardizing Spelling and Locations with Python: Pyspellchecker and Mordecai

An important step in preprocessing text data is making sure that strings with essentially the same meaning but different forms are standardized so that they can be used in analysis. A simple form of this is accomplished by lemmatization, which replaces words with their root forms. This tutorial will explain how to standardize spelling and location names in the dataset for Kaggle’s “Real or Not? NLP with Disaster Tweets” competition.

Standardizing Spelling with Pyspellchecker

There are multiple algorithms and Python libraries for standardizing spelling. This section will explain how to use the Pyspellchecker library.

Pyspellchecker works by finding all permutations of characters in a word within a predetermined edit distance. Put another way, it finds all the strings that can be created by inserting, deleting, replacing, and transposing the characters in a given word. It compares each permutation to a dictionary of known words and their frequencies and returns the word with the highest frequency as the most probable correct spelling. For more details on how the algorithm works, check out the documentation.
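To make this concrete, here's a minimal sketch of those two ideas using the library directly (the typo 'happend' is just an illustrative example):

from spellchecker import SpellChecker

spell = SpellChecker()  # default maximum edit distance of 2

# candidates() returns every known word within the edit distance of the input,
# while correction() returns the single candidate with the highest frequency
print(spell.candidates('happend'))
print(spell.correction('happend'))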

The default edit distance is 2 characters, which means that it will find all permutations that can be created by editing the original word no more than 2 times. We will reduce the edit distance to 1, however, because our dataset contains over 20,000 words, some of which are very long, and generating all edit-distance-2 permutations for each one takes a prohibitively long time.

Now that you understand the basics of how Pyspellchecker works, let’s get started with implementing it. First, install Pyspellchecker and import it into your Python notebook. Then import the data you’ll be spellchecking. For this tutorial, we’ll use the Kaggle disaster tweets training data that has been preprocessed so that the text of the tweets is one-hot encoded, with a column for each word. For more details on the initial preprocessing process, see the previous tutorial. Since the words are encoded as columns, we will be running the spellchecker on column names.
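A setup cell along these lines should work; the file name below is just a placeholder for wherever you saved the preprocessed data:

!pip install pyspellchecker

import pandas as pd
from spellchecker import SpellChecker

# load the one-hot encoded training data produced in the previous tutorial
train = pd.read_csv('preprocessed_train.csv')
train.head()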

Once the data is imported, create a spellchecker instance as shown below. Don’t forget to reduce the edit distance so that it runs faster on our large dataset. If you want to play around with the spellchecker before running it on the entire dataset, you can test it on a short list of words. The method we’ll be using is correction(), which returns the single most probable correct spelling for a given word.
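Here's a sketch of that setup; the short word list is made up purely for the sanity check:

spell = SpellChecker(distance=1)  # reduce the edit distance from the default of 2

# try the spellchecker on a few sample words before touching the full dataset
for word in ['storm', 'stom', 'evacuatin', 'fire']:
    print(word, '->', spell.correction(word))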

Now we can spell check each word in the dataset. For our preprocessed data, this means we’ll spell check all of the column names except for those that contain the original data (the first four columns). We’ll iterate through each word, filter out those that contain hashtags, mentions, or URLs, and then run the spellchecker on the word. If it returns a corrected word that isn’t the same as the input, we’ll store it in a dictionary that we’ll use to change the column names later on. For debugging and monitoring purposes, we’ll also count and print each correction.
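A sketch of that loop is below. The exact string checks for hashtags, mentions, and URLs are assumptions about how the preprocessed column names look, so adjust them to match your own data:

corrections = {}
num_corrected = 0

# the first four columns hold the original data, so skip them
for word in train.columns[4:]:
    # leave hashtags, mentions, and URLs alone
    if word.startswith('#') or word.startswith('@') or 'http' in word:
        continue
    corrected = spell.correction(word)
    # newer versions of Pyspellchecker return None when no correction is found
    if corrected is not None and corrected != word:
        corrections[word] = corrected
        num_corrected += 1
        print(word, '->', corrected)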

Finally, we’ll print out the number of words corrected to confirm that it doesn’t seem abnormally high or low. Our program corrected approximately 7% of words, which seems reasonable without doing more research into expected typo rates. Then we’ll replace each word with its corrected version.
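For example:

print(num_corrected, 'words corrected,',
      round(100 * num_corrected / len(train.columns[4:]), 1), '% of all words')

# swap each misspelled column name for its corrected version
train = train.rename(columns=corrections)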

Now that all words have been spellchecked, some columns that previously encoded different strings might now represent the same corrected word. We can check how many column names are now duplicates of each other by comparing the length of the overall list of column names and the length of the list of unique column names. This reveals that there are over 700 duplicate columns, which we want to combine into a single column for the correctly spelled word.
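One way to check:

print('all column names:   ', len(train.columns))
print('unique column names:', len(set(train.columns)))
print('duplicates:         ', len(train.columns) - len(set(train.columns)))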

We’ll do this by identifying the columns that encode the same corrected word and combining them by taking the maximum of the values in each row. The resulting single column will contain a value of 1 for any tweet that contains one of the word variations. We’ll create a modified, slightly smaller dataset by iteratively deleting the duplicate columns and adding the combined ones.
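Here's a minimal sketch of that combination step. It assumes that, after renaming, the duplicate columns literally share the same label, so selecting that label returns all of them at once:

from collections import Counter

duplicated = [word for word, count in Counter(train.columns).items() if count > 1]

for word in duplicated:
    # row-wise maximum across every column that now shares this name, so the
    # combined column is 1 whenever any of the original spellings appeared
    combined = train[word].max(axis=1)
    train = train.drop(columns=[word])  # drops every column with this label
    train[word] = combined

# confirm that all duplicates have been resolved
assert len(train.columns) == len(set(train.columns))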

If the total number of column names matches the number of unique column names after combining all the words, then all duplicates have been successfully resolved.

Limitations

At this point, we’re technically done and ready to store the final dataset in a .csv for use in other notebooks. Before we move on, however, let’s discuss some of the limitations of this approach.

Edit distance: One limitation that is explicitly in the code is the maximum edit distance used to generate permutations for each word, which we manually set to 1. As discussed earlier, this limit saves us time, but it also means that our spellchecker would miss typos that were further removed from the original word.

Informal language and slang vs. dictionary language: If you scan through the spelling corrections, you’ll notice that some of the changed words were likely spelled as the user intended. ‘Prolly’, for example, is a common informal expression for ‘probably’, but the spellchecker corrects it to ‘polly.’ This is because the corpus used to identify the most probable permutations of each word is mostly based on standard or formal English, not informal spellings commonly used on Twitter and other social media platforms. If we had a corpus of word frequencies used in informal digital settings, some of these slang words may have been handled better by our spellchecker.

Lack of context: Pyspellchecker looks at words in isolation, not in context. It therefore can’t account for how the most likely word might change depending on the words that surround it. In some contexts, words that are rare overall might be more probable. A more advanced spellchecker would use the surrounding words to modify the probabilities of each permutation of the word it is correcting.

You can view the Python notebook used to run the spellcheck code here.

Standardizing Locations with Mordecai

Now that spellchecking is done, we’ll standardize the locations represented in the tweet data using geoparsing, or extracting locations from text and matching them to a known place using an index of world locations. As with spellchecking, there are multiple Python geoparsing libraries to choose from. We’ll use a library called Mordecai because it is especially fast and handles non-US place names well compared to other options, such as fuzzy matching locations with the NLTK US-centric gazetteers corpus.

Given a text to geoparse, Mordecai first uses the spaCy NLP library to extract location entities. It then searches the GeoNames index, a huge global dataset of place names, for potential matches to the extracted entity. It decides which potential match is most likely correct using a neural net that was trained on labelled English texts. If one of the matches exceeds a probability cutoff (the default value is 0.6), the output will include the matched place name as well as information like the country where the location is and its latitude and longitude coordinates. For more information on how Mordecai works, see the full documentation and source code.

Setting up Mordecai can be difficult and time-consuming, but the result is fast and informative geoparsing. It has some very specific dependencies, so it’s strongly recommended that you run it in its own virtual environment. One of Mordecai’s dependencies (Tensorflow) doesn’t run on the newest version of Python, so I used a Python 3.7 virtual environment and recommend that you do the same.

Once you’ve created a virtual environment, follow the documentation to install Mordecai and other dependencies. You’ll also need to install Docker, which Mordecai uses to search the GeoNames index. Note that the ‘geonames_index.tar.gz’ downloaded in the third setup step takes a long time to download (about 30 minutes). Make sure that you don’t quit the command terminal before the ‘wget https://andrewhalterman.com/files/geonames_index.tar.gz --output-file=wget_log.txt’ command has finished executing.

Once the installation process is complete, set up a new Python notebook for geoparsing as follows:
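Something like the following, assuming you're working in a Jupyter notebook launched from inside the new virtual environment:

from mordecai import Geoparser
import pandas as pd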

If there’s an error importing Mordecai, it’s likely because you’re not running the notebook in the correct virtual environment. You can run '! which python' in your notebook to check which virtual environment it’s using. If it isn’t the correct one, create a new notebook from the correct environment by launching Jupyter from that environment’s command line.

First, import the preprocessed training data using the same method as the spellchecking example above.
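For example (the file name is again a placeholder):

train = pd.read_csv('preprocessed_train.csv')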

We’ll test geoparsing with a small sample before running it on the entire dataset. The text of the second tweet mentions a location, so we can use that as our test string.
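Assuming the tweet text lives in the 'text' column and the DataFrame has the default integer index, the second tweet is:

test_string = train['text'][1]
print(test_string)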

Next we’ll instantiate a geoparser, geoparse our test string, and print out the results.
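Roughly:

geo = Geoparser()  # the first call takes a while because it loads the model and connects to the GeoNames index
output = geo.geoparse(test_string)
print(output)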

We can see that geoparse() returns a dictionary of information for each of the extracted locations. For our test string, it extracted two locations — ‘La Ronge Sask.’ and ‘Canada’ — and therefore returned two dictionaries. The first location, ‘La Ronge Sask.’, was not matched to a place. The geoparser identified the most probable country for the location as ‘USA’, but its confidence level (given by ‘country_conf’) is less than the default 0.6 cutoff, so the match wasn’t completed. The second location, ‘Canada’, was successfully matched to the country Canada with an approximately 95% confidence level. Since the match was successful, the output includes a ‘geo’ dictionary of information about the matched place, including its coordinates and standardized place name.

Now that we understand what the geoparser output looks like, we can use it to extract location information from our dataset. We’ll first run it on the ‘location’ column, which contains the location part of the Twitter user’s profile. Note that many of these values are missing because it is common for users to not provide their location; it’s also possible for the value to be a non-location, like ‘Worldwide!!’. We’ll run our geoparser on each location entry and store all of the ‘geo’ dictionaries for successfully matched places. The result is a ‘geoparsed_location’ column that contains the raw output of the geoparser.
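Here's a sketch of this step. The helper name geoparse_matches is just something I made up for the example; it keeps only the 'geo' dictionaries of successfully matched places, as described above:

def geoparse_matches(value):
    # geoparse a string and return the 'geo' dictionaries of all matched places
    if pd.isnull(value):
        return []
    results = geo.geoparse(str(value))
    return [r['geo'] for r in results if 'geo' in r]

train['geoparsed_location'] = train['location'].apply(geoparse_matches)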

Next, we’ll repeat the process using the ‘text’ column, which contains the text of each tweet. If a user mentioned a location in their tweet, as is the case with our test string, the geoparser should identify it and match it to a place. Once again, the result is a ‘geoparsed_text’ column that contains the raw output of the geoparser.
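Reusing the same hypothetical helper:

train['geoparsed_text'] = train['text'].apply(geoparse_matches)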

Since geoparsing the ‘location’ and ‘text’ columns takes a while, I recommend storing your dataset as a preliminary .csv file at this point so that you don’t have to re-geoparse the data if you have to restart your notebook for some reason.
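For example:

train.to_csv('geoparsed_preliminary.csv', index=False)  # file name is just a suggestion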

We’ll now extract relevant information from the geoparser outputs and store it in new columns. To do this, we’ll first check how many locations were identified for each ‘location’ and ‘text’ value.
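One way to do this, assuming the geoparsed columns still hold Python lists (they'll come back as strings if you reload them from the .csv):

print(train['geoparsed_location'].apply(len).value_counts())
print(train['geoparsed_text'].apply(len).value_counts())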

Though there are some exceptions, most strings contain fewer than two geoparsed places. We’ll therefore store information for a maximum of two places geoparsed from the ‘location’ data and two places geoparsed from the ‘text’ data. I decided that the most important information is the place name, coordinates, and country, so I inserted new columns as shown below.
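The column names below are just one possible naming scheme:

new_columns = []
for source in ['location', 'text']:
    for i in [1, 2]:  # up to two places per source
        for field in ['place_name', 'lat', 'lon', 'country']:
            new_columns.append(source + '_place' + str(i) + '_' + field)

for col in new_columns:
    train[col] = None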

Now that we’ve decided what information to store, all we need to do is fill in the correct values from the geoparsed outputs we stored previously. We’ll do that for both the ‘location’ and ‘text’ data by iterating through a maximum of two geoparsed results for each value, accessing the ‘geo’ dictionary, and storing the result in the appropriate column.
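A sketch of that loop is below. The 'geo' dictionary keys used here ('place_name', 'lat', 'lon', 'country_code3') follow Mordecai's output format, but double-check them against your own geoparse() results:

for idx, row in train.iterrows():
    for source in ['location', 'text']:
        # look at up to the first two matched places for this value
        for i, geo_info in enumerate(row['geoparsed_' + source][:2], start=1):
            prefix = source + '_place' + str(i) + '_'
            train.at[idx, prefix + 'place_name'] = geo_info.get('place_name')
            train.at[idx, prefix + 'lat'] = geo_info.get('lat')
            train.at[idx, prefix + 'lon'] = geo_info.get('lon')
            train.at[idx, prefix + 'country'] = geo_info.get('country_code3')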

Limitations

And with that, we have our geoparsed dataset! As with spellchecking, let’s briefly describe the limitations of this approach.

Lack of context: The geoparser considers each extracted location entity independently, so it may miss classifications that are apparent to humans from the broader context of the text. Our test example demonstrated this limitation. ‘La Ronge Sask. Canada’ apparently refers to a single place in Canada. But the geoparser examined ‘La Ronge Sask.’ and ‘Canada’ as two separate entities. It correctly identified Canada, but failed to use that information to identify La Ronge as a town in the Saskatchewan province. Instead, it guessed that ‘La Ronge Sask.’ was a place in the USA, but correctly assigned a low probability to that incorrect guess.

Relies on GeoNames index: The GeoNames index is remarkably comprehensive, but it doesn’t include all possible locations, such as those that are very small or remote. On the other hand, the detailed locations that are present in the GeoNames index sometimes result in matches that are too specific. For example, the geoparser matched ‘Northern California’ to ‘Golden Gate Baptist Theological Seminary Northern California Campus’ instead of ‘California’ in general.

You can view the Python notebook used to run the geoparsing code here.

Conclusion

Now that we’ve finished spellchecking and geoparsing the tweet data, we’ll merge the two datasets we created into one.

complete = geoparsed.merge(spellchecked,
                           left_on=['id', 'keyword', 'location', 'text', 'target', 'Unnamed: 0'],
                           right_on=['id', 'keyword', 'location_data', 'text_data', 'target_data', 'Unnamed: 0'])

The resulting dataset can then be used to train a model to predict which tweets are about ‘real’ disasters. The next tutorial will explain how to do that.
