An important step in preprocessing text data is making sure that strings with essentially the same meaning but different forms are standardized so that they can be used in analysis. A simple form of this is accomplished by lemmatization, which replaces each word with its dictionary base form, or lemma (for example, "running" and "ran" both become "run"). This tutorial will explain how to standardize spelling and location names in the dataset for Kaggle’s “Real or Not? NLP with Disaster Tweets” competition.
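To illustrate the idea, here is a toy lookup-table lemmatizer. This is only a sketch of the concept, not a real implementation: the `LEMMAS` table and the example words are invented for illustration, whereas production lemmatizers (such as NLTK's WordNetLemmatizer or spaCy's) rely on full dictionaries and part-of-speech information.

```python
# Toy lemma lookup table -- a real lemmatizer uses a full dictionary
# plus part-of-speech tagging; these three entries are illustrative only.
LEMMAS = {"flooding": "flood", "fires": "fire", "ran": "run"}

def lemmatize(token: str) -> str:
    """Return the token's root form if known, otherwise the token unchanged."""
    return LEMMAS.get(token.lower(), token)

words = ["Flooding", "fires", "nearby"]
roots = [lemmatize(w) for w in words]  # ['flood', 'fire', 'nearby']
```

After this step, "Flooding" and "flood" count as the same token in any downstream frequency analysis, which is exactly the standardization the preprocessing step is after.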
There are multiple algorithms and Python libraries for standardizing spelling. This section will explain how to use the pyspellchecker library.
Kaggle’s “Titanic: Machine Learning from Disaster” competition is one of the first projects many aspiring data scientists tackle. Before you can start fitting regressions or attempting anything fancier, however, you need to clean the data and make sure your model can process it. A key part of this process is resolving missing data.
In this tutorial, you will learn how to fill in missing age information in Kaggle’s Titanic dataset by combining it with another dataset that contains most of the missing ages. …
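The combining step can be sketched with pandas. The tiny `train` and `ages` frames below are stand-ins (the real Kaggle training set and the supplementary age dataset are much larger, and the exact column names of the supplementary file are an assumption); the pattern is a left join on a shared key followed by `fillna`.

```python
import pandas as pd

# Stand-in for Kaggle's Titanic training set: one passenger's age is missing.
train = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Moran, Mr. James", "Heikkinen, Miss. Laina"],
    "Age": [22.0, None, 26.0],
})

# Stand-in for a supplementary dataset that contains the missing age.
ages = pd.DataFrame({
    "Name": ["Moran, Mr. James"],
    "Age": [27.0],
})

# Left-join the supplementary ages onto the training set by passenger name,
# then fill gaps in the original Age column with the looked-up values.
merged = train.merge(ages, on="Name", how="left", suffixes=("", "_lookup"))
merged["Age"] = merged["Age"].fillna(merged["Age_lookup"])
merged = merged.drop(columns="Age_lookup")
```

A left join keeps every row of the training set whether or not a match exists, so passengers absent from the supplementary data simply retain their original (possibly missing) age.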