Dealing with outliers and missing values in a dataset
In a random sampling from a population, an outlier is defined as an observation that deviates abnormally from the standard data. In simple words, an outlier is used to define those data values which are far away from the general values in a dataset. An outlier can be broken down into out-of-line data.
For example, let us consider a row of data [10,15,22,330,30,45,60]. In this dataset, we can easily conclude that 330 is way off from the rest of the values in the dataset, thus 330 is an outlier. It was easy to figure out the outlier in such a small dataset, but when the dataset is huge, we need various methods to determine whether a certain value is an outlier or necessary information.
There are three types of outliers
- Global Outliers
- Collective Outliers
- Contextual Outliers
The data point or points whose values are far outside everything else in the dataset are global outliers. Suppose we look at a taxi service company’s number of rides every day. The rides suddenly dropped to zero due to the pandemic-induced lockdown. This sudden decrease in the number is a global outlier for the taxi company.
Contextual outliers are those values of data points that deviate quite a lot from the rest of the data points that are in the same context, however, in a different context, it may not be an outlier at all. For example, a sudden surge in orders for an e-commerce site at night can be a contextual outlier.
Some data points collectively as a whole deviates from the dataset. These data points individually may not be a global or contextual outlier, but they behave as outliers when aggregated together. For example, closing all shops in a neighborhood is a collective outlier as individual shops keep on opening and closing, but all shops together never close down; hence, this scenario will be considered a collective outlier.
Outliers can lead to vague or misleading predictions while using machine learning models. Specific models like linear regression, logistic regression, and support vector machines are susceptible to outliers. Outliers decrease the mathematical power of these models, and thus the output of the models becomes unreliable. However, outliers are highly subjective to the dataset. Some outliers may portray extreme changes in the data as well.
- Box plots are a simple way to visualize data through quantiles and detect outliers. IQR(Interquartile Range) is the basic mathematics behind boxplots. The top and bottom whiskers can be understood as the boundaries of data, and any data lying outside it will be an outlier.
- Scatter Plots are also a simple yet intuitive way to visualize outliers. Scatter plots are used to show the relationship between two variables in graphical form. Scatter plots usually have a pattern, and those data points not following the pattern are considered outliers.
import matplotlib.pyplot as plt
plt.title("Box Plot to detect Outliers")
Removing and modifying the outliers using statistical detection techniques is a widely followed method. Some of these techniques are:
- Density-based spatial clustering
- Regression Analysis
- Proximity-based clustering
- IQR Scores
In a dataset, we often see the presence of empty cells, rows, and columns, also referred to as Missing values. They make the dataset inconsistent and unable to work on. Many machine learning algorithms return an error if parsed with a dataset containing null values. Detecting and treating missing values is essential while analyzing and formulating data for any purpose.
There are several ways to detect missing values in Python. isnull() function is widely used for the same purpose.
dataframe.isnull().values.any() allows us to find whether we have any null values in the dataframe.
dataframe.isnull().sum() this function displays the total number of null values in each column.
What should be done with the outlier and missing values then? Drop them? Replace them with another value? This entirely depends on the kind of data we are dealing with. There are various methods to treat null values and outliers, and after a deep analysis of the data present, one can estimate the way to be used for treating the null values.
Deleting missing values
This is not a good practice as it can lose valuable insights from the data. A good chunk of data can be lost if one drops the rows containing null values.
Replacing or predicting the data in the stead of the missing values or outliers present is referred to as imputing the data. Various imputation methods are followed to treat the outliers and missing values. Some of these are:
- Filling in zero : The easiest way to treat null values is to fill the missing values as zero or replace the outliers with a zero. It would not be the best method.
- Filling in with a number : One can fill all the null values with a single number by using .fillna() function. For example, if we want to replace every null value with 125.
- Filling with median value : Simply replace all null values with the median value of the column. One major disadvantage of this method is it does not account for the covariance between the data features i.e it considers each entry independently and can cause data leakage.
- Filling with the mean : Replacing the missing value or the outlier with the net mean of the data or a moving average of previous n-data cells is also a widely followed method and is helpful in time series data.
- Interpolation : Using the .interpolate function is a powerful way to fill missing values. This technique takes the above and below values of a missing value and calculates the mean to fill in place of the missing value.
- Using some other algorithms for predicting the values (linear regression, KNN, Naive Bayes, Random Forest) : Using some machine learning based algorithms like K- Nearest Neighbours, Naive Bayes, Decision Trees and Random Forests can help in predicting values without compromising with the data dependence on the other categorical variables. Some of these can be trivial to implement in Python but they are the best methods to replace and predict the missing values or outliers in the dataset.
#Replacing using median
median = df['column_name'].median()
Treating missing values and outliers is an essential step of data cleansing and preparation and should be one of the first operations done on a dataset. Only after these steps are administered is the data considered usable to build models and take insights from. Only after the data is treated of the outliers and the missing values and other deformities,it is taken to the next step of the data analysis cycle i.e data pre-processing.
Your Trusted Partner for Data Visualization
We specialize in Power BI, Tableau, AWS Quicksight, Looker and Google Data Studio Implementation
TABLEAU vs POWERBI - WHICH ONE TO CHOOSE?
June 02, 2021
Power BI Custom Visualization - Polar Scatter Plot
June 28, 2021
Predictive Customer Lifetime Value
March 28, 2018