Plugging Holes in Kaggle’s Titanic Dataset: An Introduction to Combining Datasets with FuzzyWuzzy and Pandas
Kaggle’s “Titanic: Machine Learning from Disaster” competition is one of the first projects many aspiring data scientists tackle. Before you can start fitting regressions or attempting anything fancier, however, you need to clean the data and make sure your model can process it. A key part of this process is resolving missing data.
In this tutorial, you will learn how to fill in missing age information in Kaggle’s Titanic dataset by combining it with another dataset that contains most of the missing ages. We’ll cover basic approaches to handling missing data, locating and verifying the accuracy of secondary sources of information, using fuzzy string matching to combine the datasets, and checking the accuracy of the final combined data.
Handling Missing Data: The Basics
In addition to formatting and encoding values, a key challenge in data cleaning is handling missing data. Most datasets are missing at least some values, and the Kaggle dataset is no exception. Here’s a heatmap of missing data in the combined train and test Titanic datasets Kaggle provides for the competition (white lines are missing data):
Thankfully, many columns seem to be completely filled out! But the Age and Cabin columns are missing significant amounts of information, and it looks like about one passenger is missing data in the Embarked column (the missing Survived data is intentional, because those rows are in the test dataset).
If you want to use any of these columns to train a model or analyze the data, you’ll have to figure out what to do about the missing data. One simple solution is to ignore the column with missing data. This is what we will do for the Cabin column. Most of the Cabin values — approximately 77% — are missing, so the column isn’t providing much information at the moment. Besides, many simple models for the Titanic dataset focus on the Pclass, Age, and Sex columns, which are easy to work with and allow us to answer common questions like “Were women and children more likely to survive?” and “Were rich passengers more likely to survive?”
With the Cabin column gone, we only need to decide what to do about the missing ages and the few missing embarked values. The Age and Embarked columns are mostly complete, so we don’t want to drop them entirely. A second basic approach to missing data is dropping all of the rows that contain a missing value. That can work well with big datasets, but the Titanic dataset is relatively small and we would lose over 260 valuable rows if we dropped those that didn’t have Ages and Embarked information recorded.
A third and more common approach is to infer the missing values using what we already know. This could be as simple as calculating the mean or mode of the column and using it to fill in the missing values. We could replace all the missing ages with 30 — the average age — and replace the missing embarked with S for Southampton, the most common point of embarkation. We’ll use this simple approach for the Embarked column which, even more so than the Cabin column, might not have much of an impact on our analysis anyways. I, at least, am not immediately convinced that the place where a passenger boarded the Titanic had a significant impact on their chances of surviving in the middle of the ocean.
But mean-filling the Age column changes the underlying distribution for the values and might make our analysis less accurate, as the histograms below show:
A more sophisticated approach might fill each missing age value with a random sample from the original ages so that the underlying distribution is maintained. You could also develop a machine learning model to predict the missing ages based on other characteristics like sex and class.
Filling in Missing Data with Another Dataset
These approaches can work well for predictions, but the best approach to handling missing data is simply to fill in the values with the correct information. Finding other sources of information about the Titanic passengers is easy given how infamous the disaster is. A quick google search yields this website, which appears to list ages for many of the passengers that are missing that information in Kaggle’s dataset. As an added bonus, it also lists an occupation for many passengers, which isn’t our main focus now but might be interesting to analyze later.
Retrieving the contents of this website and storing them in a csv file is simple if you understand the basics of web scraping. This is the code I used to scrape the website, and the resulting csv file is here. Web scraping isn’t the focus of this article, but if you want to learn more I recommend this tutorial.
Ok, so we now have a dataset from Kaggle that is missing some age information, and a dataset from another website that hopefully contains accurate values for most of that missing age information. All we have to do is merge them together and then finally get started with making predictions based on the data. Easy, right?
The first thing we need to reexamine is where both of these datasets come from. After all, data and predictions based on it are only useful if they accurately represent a given scenario — the sinking of the Titanic, in this case. Knowing where a dataset came from allows us to critically evaluate how much we can trust it. It can also enable us to extract more information from the dataset — if we know more about the context in which the data was collected, we might find that data values code for more than we originally expected.
Kaggle doesn’t list any source for the data they provide. When I asked about the source of the data on the competition discussion forum, I got one response that pointed towards a few other publicly available Titanic datasets. This one, from OpenML, closely resembles the Kaggle dataset — the columns are mostly the same and they contain the same number of passengers. The OpenML dataset lists its source as a dataset hosted on the website of a Vanderbilt University professor. That website notes that “the principal source for data about Titanic passengers is the Encyclopedia Titanica” and adds that “one of the original sources” is a 1994 book that contains “a passenger list created by many researchers and edited by Michael A. Findlay.”
Ok, so if Encyclopedia Titanica and the research it draws on is the source for the OpenML dataset, and the OpenML dataset and the Kaggle dataset share the same source, then what is Encyclopedia Titanica? According to the website, it’s a crowdsourced attempt to “tell the story of each and every person that travelled on the Titanic.” As mentioned in the paragraph above, the 1994 passenger list edited by Michael Findlay is the starting point for its passenger biographies, but contributors have pulled in information from “newspaper articles, books, broadcasts, and perhaps most importantly of all, the recollections of survivors and the families of victims.” As Encyclopedia Titanica acknowledges on its disclaimer page, the resulting dataset likely contains some inaccuracies. But it appears to be the most complete, accurate, and accessible dataset available online, and it’s probably the best option for anyone who wants to study the Titanic disaster but doesn’t have the time to conduct original research.
Encyclopedia Titanica — or at least a version of it archived in 2013, which we’re using because it’s publicly available (the current version is behind a paywall) — is also the source of the second dataset I created via web scraping. If you want to find more information about a specific passenger, including sources of biographical data about them, you can click on their name. But why do the Kaggle dataset and the scraped dataset contain different information if they both supposedly came from the same source? My guess would be that the Kaggle dataset is more closely based on the original 1994 passenger list we keep seeing references to, while the page I scraped contains more information filled in from other sources. Without better documentation from Kaggle, however, it’s impossible to know for sure.
What’s important, however, is that both datasets seem to come from the most reputable source of Titanic information available online. I’m inclined to trust the scraped dataset more because it contains more sources, citations, and information, allowing us to better understand the data even if it contains discrepancies. Given that we’re working on a Kaggle challenge, however, we’ll use that dataset as the standard of accuracy and just fill in as many holes as possible with the scraped dataset.
Combining the Datasets: Fuzzy Matching and Merging
Ok, so we’ve established that both our datasets are mostly trustworthy, or at least the most trustworthy ones available to us. Now we have to do the actual work of combining them. I’ll only explain and show the key parts of the code here, so if you want to see the full details you can look at this Jupyter notebook.
The first step is to clean the name columns in both datasets so that they are formatted similarly. I won’t go into details about the data cleaning process here, but we’ll clean out punctuation and make all the letters lowercase to start off with. The Kaggle dataset also identifies married women by giving their husband’s name and then their own name, whereas the scraped dataset gives just the woman’s name. We can address that by removing the woman’s husband’s name from her cleaned name in the Kaggle dataset.
The scraped dataset also contains crew members, not just passengers, so we’ll remove them to match the Kaggle dataset. The final scraped dataset contains 1352 rows and the Kaggle dataset contains 1309. It isn’t immediately clear why they’re different, but after exploring the Encyclopedia Titanica site some more it seems likely that the scraped dataset lists the servants who accompanied passengers, whereas the Kaggle dataset only lists passengers. There are 43 servants listed on the Encyclopedia Titanica website, which is the exact difference between the Kaggle and scraped datasets.
It’s worth questioning why the Kaggle dataset apparently excludes both crew members and servants — while they didn’t buy tickets, they were no less impacted by the disaster than the paying passengers. To me, it’s a strange omission that isn’t clearly justified in Kaggle’s description of the challenge. Kaggle mentions that there were over 2,000 people on board including the crew and poses the question ““what sorts of people were more likely to survive?” but only suggests answering it using passenger data. I personally think that Kaggle could make this challenge more interesting and relevant to actually understanding the Titanic disaster by including the servant and crew data. But Kaggle makes the rules here, so we’ll keep going with just the passenger data.
Once the data is cleaned and formatted, we’ll merge the datasets together by matching the names of passengers, the only unique identifier available in both sources (ticket numbers seem promising, but they aren’t completely unique and aren’t available for all passengers). If we were matching the datasets using numerical identifiers or perfectly standardized and formatted names, this would be as easy as running the Pandas merge function once. Let’s try that and see what happens:
We matched over 800 names using the simple Pandas merge, which isn’t bad! But it missed almost 500 passengers that we still need combined data for. Unfortunately, this is inevitable because we are working with messy variations of names from multiple cultures and naming conventions that were recorded over a century ago. Many other names that occur in both datasets contain typos or other variations that prevent them from being easily matched by a computer, even if a human could quickly tell that they are the same.
This is where fuzzy matching comes in — it’s a method of matching strings based on how similar they are, not whether they match exactly. For this tutorial, we’ll rely on FuzzyWuzzy, a Python library for calculating how similar strings are given different metrics. For each unmatched passenger in the Kaggle dataset, we’ll use the FuzzyWuzzy extractBests function to pull out the two names in the scraped dataset that are closest to it. We specify that all potential matches must be above a similarity threshold to prevent it from calculating any extremely inaccurate matches. We also specify a scoring function, which tells FuzzyWuzzy how to calculate how similar the strings are.
After testing out a couple different options, I decided to write my own scoring function for the Titanic dataset. It compares each word in the shorter string to each word in the longer string, finds the best match by calculating the Levenshtein edit distance ratio, and then averages these matches for each word in the shorter string. This scoring function is designed to effectively match names when one option contains a few extra words — like a middle name or a wife’s maiden name — but the words otherwise mostly match, though spelling variations and typos are still common.
Once we’ve specified all the parameters for matching names, we find a maximum of two possible matches for each name in the Kaggle dataset (you could specify a different number of possible matches, but I chose two because it keeps things from getting two complicated while also allowing for the possibility that in some cases the match with the highest fuzzy matching score might not be correct). The result is a list of dictionaries for each passenger in the Kaggle dataset. Each dictionary contains the index and name of the passenger in Kaggle and the names, indices, and similarity scores of its possible matches. It’s important to store the index, not just the name, for each row because this is how we will relocate them in the dataset and merge them together.
Once we have the list of possible matches, we’ll automatically match all of the rows that have a similarity score of 100 — this means that each word of the shorter name has a perfect match in the longer string. To do this, we simply loop through each possible match dictionary and match those that have a score of 100. We’ll deal with less precise matches in another for loop to ensure that we don’t accidentally make an imperfect match between two names when one perfectly matches another.
After making all of the perfect matches, 1052 names in the Kaggle dataset have been matched and 261 still need to be matched. Not bad! Next, we’ll handle less precise matches — anything with a similarity score less than 100 — using a similar for loop. For most rows, we’ll automatically match the Kaggle name with the scraped name that has the highest similarity score to it.
But what about cases where both potential matches have similar scores or the same score? You could devise a tie-breaking function or some other method to handle this computationally, but it might be challenging to create an algorithm that accurately matches all of the many variations of names contained in these datasets. And, as evidenced by the fact that we are now over 2,000 words into an explanation of a solution to a problem most people resolve with one line of code, I value accuracy over pure efficiency. So I decided to manually resolve very similar matches by looking at them myself and telling Python which (if any) of the two options is the best match for a given name. Fortunately, there aren’t too many very similar matches in this dataset, so it isn’t very hard to match them all. If it was more difficult, then writing a function to handle these situations would be more appealing.
After manually going through close matches and automatically matching the rest, we’re left with 17 unmatched passengers from the Kaggle dataset and 60 potential matches for them in the scraped dataset. I matched these manually by searching through the entire unmatched scraped dataset (as opposed to the last section, where I only had to manually choose between two possible options). In some cases, the names were too dissimilar for the fuzzy matching function to detect that they match. In others, the names could only be matched using other information in the dataset, like ticket numbers and points of embarkation.
Once the tedious manual matching is completed, we’ve successfully compiled a dataset of 1313 rows that contain data from both the Kaggle and scraped datasets. You might remember, however, that the original Kaggle dataset only had 1309 rows. How did we somehow add 4 new rows? We can confirm that both the Kaggle and combined datasets contain the same number of unique names — 1307. Looking at the value counts for the names shows us that there are two names in the Kaggle dataset that are exactly the same for two passengers, and each of these passengers was matched to both of the passengers that shared their name in the scraped dataset instead of only their own name. To resolve the issue, we can simply drop the indices that contain mismatched passengers with the same name.
Now that we have our final list of passengers, the last step is to get the information we combined datasets for in the first place — passenger ages. We’ll use a simple function to fill out cleaned ages for each passenger, which will keep their Kaggle age if it exists or fill in the scraped age if it doesn’t. There are unfortunately still five passengers that don’t have ages listed in either dataset, which we could try to fill in by doing more research or using one of the missing data handling methods described above. But we’ve successfully filled in over 250 ages that were previously missing. Hooray!
Final Accuracy Checks and Conclusion
But wait! Before you go riding off into the machine learning sunset with your shiny new filled-in dataset, let’s do a few more accuracy checks. Let’s first look for any discrepancies between passengers that had ages listed in both the Kaggle and scraped datasets. There are 250 passengers that had ages listed in both datasets that weren’t the same. We can check the distribution of the discrepancies to see how large the differences are:
Most of the differences between the Kaggle and scraped dataset ages are close to 0. These small age differences shouldn’t significantly impact the results of any model trained on this data, so this is a reassuring finding. Some of those discrepancies, however, are concerningly large. The biggest difference is for Mrs Julia Florence Siegel Cavendish, whose age is recorded as 76 in the Kaggle dataset and 25 in the scraped dataset. It might be easy to mistake a 25 year old for a 20 year old, or even a 40 year old for a 30 year old, but how would someone confuse a septuagenarian for a 25 year old, or vice versa?
The Encyclopedia Titanica website appears to provide the answer. In its biography of Cavendish, it explains that she was 25 years old while on board the Titanic but survived and later died in 1963, at the age of 76. Whoever compiled the Kaggle dataset seems to have mistaken her age at death for the age at which she was on board the Titanic, but survived the disaster. The same error was made for at least one other passenger — Mr. Algernon Henry Barkworth, who according to Encyclopedia Titanica was 47 when he survived the Titanic disaster and later died at the age of 80, which is the age Kaggle recorded for him.
Barkworth and Cavendish had the largest gaps between the age Kaggle listed for them and the one in the scraped dataset. Without further research, it’s hard to tell whether these two were exceptions or if Kaggle made the same mistake, just with less obvious consequences, for other passengers as well. There are also some cases where Kaggle recorded a much lower age for the passenger than the age listed in the scraped dataset, though there is no clear explanation for these discrepancies.
While we aren’t going to conduct a full historical audit of Kaggle’s dataset right now, it’s worth noting that mistaking someone’s age at death for the age at which they were on board the Titanic could lead us to dramatically incorrect conclusions. Kaggle’s dataset suggests that the oldest passengers were more likely to survive, when it appears to actually be the case that passengers who survived were more likely to live to a very old age that was then incorrectly recorded as their age when they boarded the Titanic.
If we really want to understand the Titanic disaster and answer Kaggle’s “what sorts of people were more likely to survive?” question, we might be better off ignoring the data provided by Kaggle all together and just using the seemingly more correct, better cited data available on Encyclopedia Titanica. Understanding how to handle data with Pandas and train machine learning models on it is important, but the ultimate goal of data analysis is to create valid insights into the real world — which this data can’t completely do.