Text Data Cleaning Methods for Language Processing


Data is the new decision-making language, and when used effectively, it can totally transform how a business functions.

Data cleaning and preprocessing are among the most important parts of any project because they directly affect model performance. Text is one widely available type of data that can yield commercial value. For businesses, getting the right data at sufficient quality is one of the major hurdles to making good decisions. Natural Language Processing (NLP) is what allows Alexa, Siri, and Google Assistant to understand and reply to you.

When we use bad data to make critical judgments, we are more likely to harm ourselves.

What's NLP:


NLP is a set of techniques that gives computers the ability to understand text and spoken words, including the contextual nuances of the language.

NLP is used to power computer programs that translate text from one language to another, respond to spoken commands, and quickly summarise vast amounts of material, even in real time. Most of us have already interacted with NLP in our daily lives, for example through a GPS voice or the smart assistants in phones and other devices.

Before any NLP modelling, the data has to be preprocessed. This step largely determines the quality of all further analysis.

Some of the preprocessing steps are:

  • Tokenization
  • Removing punctuation such as . , ! $ * ( ) % @
  • Converting text to lowercase
  • Removing URLs
  • Removing Stop words
  • Stemming
  • Lemmatization

Tokenization

Tokenization is the process of splitting a string of text into a list of tokens. Because every later step operates on these tokens, it has a significant impact on the rest of your pipeline. A tokenizer breaks unstructured natural language text into chunks that can be treated as separate elements.
Text can be tokenized into sentences, words, subwords, or characters. Dividing a text into sentences is called sentence tokenization; dividing it into words is called word tokenization.

Ex:
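A minimal sketch of sentence and word tokenization, using only Python's re module. The function names simple_sentence_tokenize and simple_word_tokenize are ours, and the regular expressions are a simplified stand-in for the tokenizers that NLP libraries provide; they will mis-split abbreviations such as "Dr." and do not handle subwords.

```python
import re

def simple_sentence_tokenize(text):
    """Split text into sentences after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def simple_word_tokenize(text):
    """Return runs of word characters, plus each punctuation mark as its own token."""
    return re.findall(r"\w+|[^\w\s]", text)

text = "NLP is fun. Tokenization comes first!"
print(simple_sentence_tokenize(text))
# ['NLP is fun.', 'Tokenization comes first!']
print(simple_word_tokenize(text))
# ['NLP', 'is', 'fun', '.', 'Tokenization', 'comes', 'first', '!']
```

Keeping punctuation as separate tokens at this stage leaves the decision about removing it to a later step.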

Punctuation Removal:

Punctuation marks can often be removed because, for many NLP tasks, they add little information. For example, Oh and Oh! mean the same thing. Whether to remove punctuation, and which marks to remove, depends on the use case, so the set of symbols should be chosen with care.
In Python, string.punctuation contains these symbols: ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
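As a rough sketch, punctuation can be stripped with str.translate and string.punctuation. The helper name remove_punctuation and its keep parameter are our own additions, meant to show how selected symbols can be kept when the use case needs them.

```python
import string

def remove_punctuation(text, keep=""):
    """Delete every character in string.punctuation except those listed in `keep`."""
    to_remove = "".join(ch for ch in string.punctuation if ch not in keep)
    # str.maketrans with a third argument maps those characters to None (deletion).
    return text.translate(str.maketrans("", "", to_remove))

sample = "Oh! Really, it costs $5?"
print(remove_punctuation(sample))            # Oh Really it costs 5
print(remove_punctuation(sample, keep="$"))  # Oh Really it costs $5
```

The second call shows why the selection matters: dropping "$" turns a price into a bare number.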

Case of the text

Letter case is another important thing to clean. Running text can contain the same word in lowercase, in uppercase, or with only the first letter capitalized, and this variation adds noise to the data.
For example, “Data”, “data”, and “DATA” all represent the same word, but a model that receives them as input treats them as three different words/vectors.
To overcome this, the usual practice is to convert all text to lowercase.
In Python, we can use the str.lower() method, for example text.lower().
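A short illustration of the effect, assuming the tokens have already been split out; the variable names are ours.

```python
words = ["Data", "data", "DATA"]

# Before normalization, the three spellings count as three distinct tokens.
distinct_before = len(set(words))

# str.lower() maps every spelling to the same token.
normalized = [w.lower() for w in words]
distinct_after = len(set(normalized))

print(distinct_before, distinct_after)  # 3 1
```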

Removal of URL

When text is scraped from a website, it often contains hyperlinks and other reference links. For most text analysis tasks these URLs add no value: they carry little information compared with the surrounding sentences, so they are usually removed.
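One possible way to strip URLs is a regular expression. The pattern below is a deliberately simple assumption: it only catches links that start with http://, https://, or www., and it also removes any punctuation attached directly to the end of a link. The name remove_urls is ours.

```python
import re

# Matches http://..., https://..., or www.... up to the next whitespace character.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")

def remove_urls(text):
    """Remove URLs, then collapse the extra spaces they leave behind."""
    cleaned = URL_PATTERN.sub("", text)
    return re.sub(r"\s{2,}", " ", cleaned).strip()

print(remove_urls("See https://example.com/page for details"))
# See for details
```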

Removal of Stop Words

Stop words are a set of commonly used words in a language. English stop words include "a," "the," "is," "are," and others. These are among the most common words in any language (articles, prepositions, pronouns, conjunctions, etc.) and do not add much information to the text. As in any information system, the words that occur most frequently carry the least information. By deleting these terms we remove low-information content from our text and can focus on the words that matter.
For many tasks, removing such words does not hurt the model we are training, although tasks that depend on words like "not" may need a customized stop word list. Removing stop words also reduces the dataset size: fewer tokens are involved in training, which reduces training time.
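The idea can be sketched with a plain set lookup. The STOP_WORDS set below is a tiny hand-picked list for illustration only; in practice a fuller list, such as the ones shipped with NLP libraries, would be used.

```python
# Illustrative stop word list; real lists contain far more entries.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "and", "to"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only tokens whose lowercase form is not a stop word."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = "The quality of the data is a major hurdle".split()
filtered = remove_stop_words(tokens)
print(filtered)                    # ['quality', 'data', 'major', 'hurdle']
print(len(tokens), len(filtered))  # 9 4
```

Here nine tokens shrink to four, which is the reduction in dataset size described above.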

Stemming

Documents use different forms of a word for grammatical reasons, for example ask, asking, and asked.
Stemming is the process of reducing a word to its stem by stripping affixes (suffixes and prefixes). Unlike a lemma, the resulting stem need not be a dictionary word. Stemming is a heuristic that chops affixes off the ends of words in the hope of getting it right most of the time. Three commonly used stemmers are the Lovins Stemmer, the Porter Stemmer, and the Paice Stemmer. Although the algorithms look simple, stemming has a problem called overstemming: so much of a word is chopped off that two or more words with different meanings are reduced to the same stem when they should have been given different stems. For example, a stemmer may reduce both universe and university to “univers”, even though the two words mean different things.
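To make the heuristic concrete, here is a toy suffix-stripping stemmer. It is not the Lovins, Porter, or Paice algorithm; the suffix list, the min_stem_len parameter, and the name toy_stem are assumptions chosen only to reproduce the examples in this section.

```python
# Suffixes are tried in this order; the first one that fits is stripped.
SUFFIXES = ("ing", "ity", "ed", "es", "s", "e")

def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least `min_stem_len` characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for w in ["ask", "asking", "asked", "universe", "university"]:
    print(w, "->", toy_stem(w))
# ask -> ask
# asking -> ask
# asked -> ask
# universe -> univers
# university -> univers
```

The last two lines show overstemming: two words with different meanings collapse to the same stem.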

Lemmatization

Lemmatization is used to overcome these issues with stemming. It converts a word to its base dictionary form, the lemma, grouping the various inflected forms of a word under a single form with the same meaning.

Lemmatization is one effective way to help chatbots better understand your customers' questions. Because it involves morphological analysis of the words, the chatbot can take the contextual form of each word into account and obtain a better understanding of the overall meaning of the sentence.
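A toy sketch of dictionary-based lemmatization. The lookup table LEMMA_TABLE, its part-of-speech labels, and the function toy_lemmatize are all hypothetical; real lemmatizers rely on large lexical resources and morphological rules rather than a handful of entries.

```python
# (word, part of speech) -> lemma. A tiny hand-written table for illustration.
LEMMA_TABLE = {
    ("asked", "verb"): "ask",
    ("asking", "verb"): "ask",
    ("better", "adj"): "good",
    ("mice", "noun"): "mouse",
    ("meeting", "verb"): "meet",
    ("meeting", "noun"): "meeting",
}

def toy_lemmatize(word, pos):
    """Look up the lemma for (word, pos); fall back to the lowercased word."""
    key = (word.lower(), pos)
    return LEMMA_TABLE.get(key, key[0])

print(toy_lemmatize("better", "adj"))    # good
print(toy_lemmatize("meeting", "verb"))  # meet
print(toy_lemmatize("meeting", "noun"))  # meeting
```

The two results for "meeting" show why the part of speech matters: the same surface form maps to different lemmas depending on its role in the sentence.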

When it comes to stemming and lemmatization, what's the difference?

While stemming and lemmatization both aim to reduce each word's inflectional form to a single base or root, they are not the same thing.
Because the fundamental algorithms differ, the outcomes they create vary as well.
Stemming removes the end or beginning of a word using a list of frequent prefixes and suffixes found in inflected words. Lemmatization is a morphological analysis of a word that uses dictionaries to link it to its lemma, so it always returns a dictionary form of the word.
Lemmatization is more complicated than stemming because it requires knowing each word's part of speech and inflected form.
Stemming is faster than lemmatization because it slices words without considering their context in the sentence.

Written by:

Saitharun Sriram

Data Scientist
