An important step in preprocessing text data is standardizing strings that have essentially the same meaning but different forms, so that they are treated as the same value in analysis. A simple form of this is lemmatization, which replaces words with their dictionary (base) forms. This tutorial will explain how to standardize spelling and location names in the dataset for Kaggle’s “Real or Not? NLP with Disaster Tweets” competition.

Standardizing Spelling with Pyspellchecker

There are multiple algorithms and Python libraries for standardizing spelling. This section will explain how to use the Pyspellchecker library.

Pyspellchecker works by finding all permutations of characters in a…
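
The general idea behind this kind of spelling correction can be sketched without the library. The following is a minimal illustration, not Pyspellchecker's actual implementation: it generates every string one edit away from a misspelled word and keeps the candidate that appears most often in a small word list. The word counts, the sample tweet, and the function names `edits1` and `correct` are all assumptions made for this example.

```python
import string
from collections import Counter

# Hypothetical toy word-frequency table; a real spell checker ships a large dictionary.
WORD_COUNTS = Counter({"fire": 50, "flood": 30, "forest": 20, "earthquake": 10})


def edits1(word):
    """Return every string one delete, transpose, replace, or insert away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    ]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def correct(word):
    """Return the most frequent known word within one edit, or the word unchanged."""
    word = word.lower()
    if word in WORD_COUNTS:
        return word
    candidates = [w for w in edits1(word) if w in WORD_COUNTS]
    if not candidates:
        return word  # nothing close enough is known, so leave the word alone
    return max(candidates, key=WORD_COUNTS.__getitem__)


tweet = "Forest fier near the flod zone"  # made-up example text
print(" ".join(correct(w) for w in tweet.split()))
```

Words with no known neighbor, such as "near" here, pass through unchanged, which is why a realistic dictionary matters. With the library itself, the comparable call would be `SpellChecker().correction(word)` after `from spellchecker import SpellChecker`.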

Kaggle’s “Titanic: Machine Learning from Disaster” competition is one of the first projects many aspiring data scientists tackle. Before you can start fitting regressions or attempting anything fancier, however, you need to clean the data and make sure your model can process it. A key part of this process is resolving missing data.

In this tutorial, you will learn how to fill in missing age information in Kaggle’s Titanic dataset by combining it with another dataset that contains most of the missing ages. …
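
As a rough sketch of what combining the two datasets could look like, the example below fills missing ages from a lookup table built from a second source. The rows, passenger names, and ages are made up for illustration, and matching on the `Name` field is an assumption; the real datasets may need a different key or some name cleaning before they line up.

```python
# Hypothetical rows standing in for the Kaggle data; None marks a missing age.
kaggle_rows = [
    {"Name": "Passenger A", "Age": 29.0},
    {"Name": "Passenger B", "Age": None},
    {"Name": "Passenger C", "Age": None},
]

# Hypothetical second dataset, reduced to a name -> age lookup.
# It covers most, but not all, of the missing ages.
other_ages = {"Passenger A": 29.0, "Passenger B": 34.0}


def fill_missing_ages(rows, age_lookup):
    """Return copies of rows with missing ages filled from age_lookup where possible."""
    filled = []
    for row in rows:
        new_row = dict(row)  # copy so the original data is left untouched
        if new_row["Age"] is None:
            # .get returns None when the second dataset also lacks this passenger
            new_row["Age"] = age_lookup.get(new_row["Name"])
        filled.append(new_row)
    return filled


filled_rows = fill_missing_ages(kaggle_rows, other_ages)
still_missing = [row["Name"] for row in filled_rows if row["Age"] is None]
print(filled_rows)
print("Still missing:", still_missing)
```

Ages already present in the first dataset are never overwritten, and passengers absent from the second dataset stay missing so they can be handled separately.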

Emma Stiefel
