Outlier Detection Methods

June 15, 2023

By Admin


Outliers occur naturally in most datasets. They can carry hidden patterns or meaning, and identifying them can improve model performance, because unnecessary or erroneous data points can then be removed from the analysis.

Outlier detection is usually performed during the Exploratory Data Analysis stage of a data science project, and how we decide to deal with outliers determines how well or how poorly the model performs on the business problem at hand. The model, and hence the entire workflow, is greatly affected by the presence of outliers.

OUTLIER DETECTION METHODS

1. Statistical Methods

A simple starting point is visual analysis: box plots (box-and-whisker plots) for univariate data and scatter plots for pairs of variables help reveal extreme values in the data. Assuming a normal distribution, we can calculate the z-score, which is the number of standard deviations (σ) a data point lies from the sample's mean. The Empirical Rule says that about 68% of the data falls within one standard deviation of the mean, 95% within two standard deviations, and 99.7% within three standard deviations, so data points that lie more than three standard deviations from the mean can be identified as outliers.
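The z-score rule can be sketched in a few lines of Python. This is a minimal illustration, assuming NumPy is available; the function name zscore_outliers, the default threshold of 3 and the sample readings are made up for the example and are not from any particular library.

```python
import numpy as np

def zscore_outliers(values, threshold=3.0):
    """Return a boolean mask marking points whose |z-score| exceeds threshold."""
    values = np.asarray(values, dtype=float)
    std = values.std()  # population standard deviation (ddof=0)
    if std == 0:
        # All values are identical, so nothing can be an outlier.
        return np.zeros(values.shape, dtype=bool)
    z = (values - values.mean()) / std
    return np.abs(z) > threshold

# Hypothetical sample: 19 readings between 10 and 13, plus one extreme value.
readings = np.array([10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
                     12, 10, 11, 13, 12, 11, 12, 10, 13, 95])
print(readings[zscore_outliers(readings)])
```

On this sample only the value 95 is flagged. Note that in very small samples the extreme value itself inflates the standard deviation, so the three-sigma rule may flag nothing; it works best with a reasonably large sample.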

Another way is to use the Interquartile Range (IQR), the difference between the third and the first quartile, as a criterion: data points that lie more than 1.5 times the IQR below the first quartile or above the third quartile are treated as outliers.
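The IQR criterion can be sketched the same way. Again this assumes NumPy; the function name iqr_outliers, the factor argument and the sample readings are illustrative choices, and np.percentile is used with its default linear interpolation, which is only one of several ways to compute quartiles.

```python
import numpy as np

def iqr_outliers(values, factor=1.5):
    """Return a boolean mask for points outside [Q1 - factor*IQR, Q3 + factor*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower = q1 - factor * iqr
    upper = q3 + factor * iqr
    return (values < lower) | (values > upper)

# Hypothetical sample: 19 readings between 10 and 13, plus one extreme value.
readings = np.array([10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
                     12, 10, 11, 13, 12, 11, 12, 10, 13, 95])
print(readings[iqr_outliers(readings)])
```

Because quartiles are barely affected by a single extreme value, this rule does not rely on the normality assumption that the z-score approach makes.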

2. Proximity Methods

Proximity-based methods assume that an object is an outlier if its nearest neighbors are far away in feature space; that is, the proximity of the object to its neighbors deviates significantly from the proximity of most other objects to their neighbors in the same data set. A common variant uses clustering techniques to identify the clusters in the data and find the centroid of each cluster. The usual approach is as follows: fix a threshold, evaluate the distance of each data point from its cluster centroid, remove the data points whose distance exceeds the threshold, and then go ahead with the modeling.
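As a rough sketch of the proximity idea, the example below flags points whose k-th nearest neighbor is farther away than a fixed threshold. Note that it uses the nearest-neighbor formulation rather than cluster centroids, to keep the example deterministic. It assumes NumPy; the function name knn_distance_outliers, the values of k and threshold, and the sample points are all hypothetical.

```python
import numpy as np

def knn_distance_outliers(X, k=2, threshold=3.0):
    """Flag rows of X whose distance to their k-th nearest neighbor exceeds threshold."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances, shape (n, n).
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # After sorting each row, column 0 is the point itself (distance 0),
    # so column k holds the distance to the k-th nearest neighbor.
    kth = np.sort(dists, axis=1)[:, k]
    return kth > threshold

# Hypothetical data: two tight groups and one isolated point (last row).
points = np.array([
    [0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5],
    [10, 10], [10, 11], [11, 10], [11, 11], [10.5, 10.5],
    [5, 20],
])
mask = knn_distance_outliers(points, k=2, threshold=3.0)
print(points[mask])
```

The pairwise distance matrix costs memory proportional to the square of the number of points, so this sketch only suits small datasets. The threshold also depends on the scale of the features, so it has to be chosen per dataset.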

3. Projection Methods

Projection methods use techniques such as PCA to model the data in a lower-dimensional subspace using linear correlations. After that, the distance of each data point to the plane that fits the subspace is calculated. This distance can then be used to find the outliers. Projection methods are simple and easy to apply and can highlight irrelevant values.
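One way to illustrate the projection idea is to compute PCA with a singular value decomposition and measure how far each point lies from the fitted subspace (its reconstruction error). This is a minimal sketch assuming NumPy; the function name pca_reconstruction_error, the choice of one component and the sample points are assumptions made for the example.

```python
import numpy as np

def pca_reconstruction_error(X, n_components=1):
    """Distance of each row of X from the subspace spanned by the top principal components."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    # Project onto the subspace, then map back to the original space.
    reconstructed = centered @ components.T @ components
    return np.linalg.norm(centered - reconstructed, axis=1)

# Hypothetical data: 20 points on the line y = 2x, plus one point far off the line.
xs = np.arange(20)
points = np.vstack([np.column_stack([xs, 2 * xs]), [[10, -10]]])
errors = pca_reconstruction_error(points, n_components=1)
print(int(np.argmax(errors)))  # index of the point farthest from the fitted line
```

Points with a large error are candidates for outliers; a threshold on the error still has to be chosen. Because the outlier is included when the components are fitted, it also tilts the fitted subspace slightly toward itself, which is worth keeping in mind for small datasets.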

CHALLENGES OF OUTLIER DETECTION

1. Effective Identification:

Defining an outlier is highly subjective and depends on the domain and the application scenario. The grey area between normal observations and outliers is often very small, and even a small oversight can lead to treating an outlier as a normal observation or vice versa. Hence, we must be very cautious when selecting the outlier detection method.

2. Application-Specific Challenges:

Choosing the similarity or distance measure and the relationship model used to describe data objects is of utmost importance in outlier detection. Unfortunately, these choices are often application-dependent. Different applications may have very different requirements; for example, in datasets from the medical field, even a slight deviation from the rest of the dataset may be enough to mark a data point as an outlier. Hence, individual outlier detection methods dedicated to specific applications must be developed.

3. Handling Noise:

Noise in the data tends to resemble actual outliers, which makes it difficult to distinguish noise from genuine outliers and remove it. We must understand that outliers and noise are two different entities. Because noise is present in almost all kinds of collected data, it brings many challenges to outlier detection by blurring the difference between normal observations and outliers. Noise hides outlier objects and thus reduces the effectiveness of the outlier detection algorithm.

Interview Questions:

1. What is Outlier Detection?

2. What are the outlier detection methods?

3. What are the challenges of outlier detection?