In machine learning, statistics plays a significant role in almost all traditional algorithms and techniques. The mean and the median are two of the most important statistical measures. Both are used to check the central tendency of the data, analyze data distribution plots, and make appropriate decisions according to the data, and the choice between them matters most when outliers are present.

Although the mean and median are both critical and widely used, the median is considered a better choice than the mean when outliers are present, both for statistical reasons and because it can improve the performance of the model.

In this article, we will discuss the core idea behind the mean and median, the cases where they are used, and why the median is a better choice than the mean in the case of outliers, with code examples.

This article will help one understand the core intuition behind using mean and median and the statistical reason behind using median, which will also help to answer interview questions related to the same.

So before jumping directly to the answer to this question, let us discuss the basic idea behind mean and median and their use cases.

What is Mean?

The mean, or average, is a statistical measure defined as the ratio of the sum of all the values to the total number of observations.

The formula for the mean is,

Mean = Sum of All Observations / Total Number of Observations

Sometimes in machine learning and data science, a weighted mean is used instead, where each observation is given a weight that specifies its importance.

In such cases, observations that are more helpful or informative are given a higher weight. Note that the weights are usually normalized so that they sum to 1; if they do not, the weighted sum is divided by the sum of the weights.
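As a minimal sketch of the two formulas above, the snippet below computes a plain mean and a weighted mean with NumPy. The values and weights are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical observations
values = np.array([10.0, 20.0, 30.0, 40.0])

# Plain mean: sum of all observations / total number of observations
mean = values.sum() / len(values)

# Hypothetical weights expressing the importance of each observation.
# They sum to 1 here; np.average divides by the weight sum in any case.
weights = np.array([0.1, 0.2, 0.3, 0.4])
weighted_mean = np.average(values, weights=weights)

print("Mean:", mean)
print("Weighted mean:", weighted_mean)
```

Because the larger observations carry larger weights in this example, the weighted mean comes out higher than the plain mean.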

Use Cases of Mean in Data Science

The mean is used in many machine learning and data science techniques, but in a few essential ones it is a significant property of the dataset.

1. Missing Value Imputation

In machine learning, the data may have some missing values, which were either not filled in while acquiring the data or intentionally removed. In such cases, the missing values of a feature can be filled, or imputed, with the mean of the observed values of that feature.
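A minimal sketch of mean imputation with pandas is shown below. The `ages` series is a hypothetical feature with two missing entries.

```python
import numpy as np
import pandas as pd

# Hypothetical feature with two missing values
ages = pd.Series([22.0, 25.0, np.nan, 31.0, np.nan, 30.0], name="age")

# Series.mean() skips NaN values by default
mean_age = ages.mean()

# Fill every missing entry with the mean of the observed values
imputed = ages.fillna(mean_age)

print("Mean of observed values:", mean_age)
print(imputed.tolist())
```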

2. Normalization

While training machine learning models, the features of the dataset may not have the same scale, so normalization techniques are used to bring them to a common scale. The mean is used in several of these techniques, such as standardization (z-score scaling) and mean normalization, to center the data.
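One common technique that relies on the mean is standardization, where each feature is centered on its mean and divided by its standard deviation. The sketch below uses a small hypothetical feature matrix with two columns on very different scales.

```python
import numpy as np

# Hypothetical feature matrix: two features on very different scales
features = np.array([
    [1.0, 100.0],
    [2.0, 200.0],
    [3.0, 300.0],
    [4.0, 400.0],
])

# Per-feature (column-wise) mean and standard deviation
col_mean = features.mean(axis=0)
col_std = features.std(axis=0)

# Standardize: subtract the mean, divide by the standard deviation
standardized = (features - col_mean) / col_std

print(standardized)
```

After this step, both columns have mean 0 and standard deviation 1, so neither feature dominates because of its scale.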

3. Outliers Handling

Before training the machine learning model, outliers should be detected and handled properly to get a reliable and representative model. The mean is used to identify outliers (for example, in the z-score method), and in some cases it is also used to replace the outlier values.

4. Loss Function

In machine learning, the loss function measures the performance of the model by comparing actual and predicted values. Many loss functions, such as mean squared error and mean absolute error, average the per-sample errors, so the mean is part of how the performance and reliability of the model are measured.
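As an illustration, mean squared error is one widely used loss that is literally a mean of per-sample errors. The sketch below uses hypothetical actual and predicted values.

```python
import numpy as np

# Hypothetical actual and predicted values
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 8.0])

# Per-sample squared errors: 1, 0, 4, 1
squared_errors = (y_true - y_pred) ** 2

# Mean squared error is the mean of the squared errors
mse = squared_errors.mean()

print("MSE:", mse)
```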

What is the Median?

In statistics, the median is the middle value of the dataset when its values are sorted in either ascending or descending order.

The median is calculated by sorting the data values first and then taking the middle value. If the number of observations is even, the median is the average of the two middle values.
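The calculation can be sketched by hand as follows. `manual_median` is a hypothetical helper written for illustration; in practice `np.median` does the same job.

```python
import numpy as np

def manual_median(values):
    """Return the median by sorting and picking the middle value(s)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle value
        return ordered[mid]
    # Even count: average of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_data = [7, 1, 5, 3, 9]   # sorted: 1, 3, 5, 7, 9
even_data = [7, 1, 5, 3]     # sorted: 1, 3, 5, 7

print(manual_median(odd_data), np.median(odd_data))
print(manual_median(even_data), np.median(even_data))
```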

In machine learning and data science, the median is widely used in many techniques and helps train and enhance the performance of the models. In fact, in some cases, it is preferred over the mean.

Use Cases of Median in Data Science

The median is also widely used in many machine learning and data science techniques, but in a few of them it is used extensively and carries particular significance.

1. Data Distribution Studies

As the median is the middle value of the dataset, it is used to study the data distribution before training the models. In the case of skewed distribution, the median is used to study the data distribution and get an idea about the central tendency of the dataset.
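To see how the two measures differ on a skewed distribution, the sketch below draws hypothetical right-skewed data from a log-normal distribution (an assumption made purely for illustration) and compares the mean and the median.

```python
import numpy as np

# Seeded generator so the sample is reproducible
rng = np.random.default_rng(0)

# Hypothetical right-skewed data: most values are small,
# with a long tail of large values
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)

sample_mean = skewed.mean()
sample_median = np.median(skewed)

print("Mean:  ", sample_mean)
print("Median:", sample_median)
```

For right-skewed data like this, the long tail pulls the mean above the median, which is why the median is often the more representative measure of the typical value.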

2. Outliers Handling

The median is very helpful in handling outliers. Outliers can pull the mean away from the typical values, whereas the median is more robust and hence preferred over the mean when outliers are present.

3. Classification Threshold

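The median can also be used as a threshold to split data into two classes. For example, a continuous variable or a set of predicted scores can be cut at its median so that each class receives about half of the observations. Note that algorithms like logistic regression use a probability threshold of 0.5 by default; a median-based threshold is an alternative choice, not the default.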

Why is the Median Better Than the Mean in the Case of Outliers?

Outliers are observations whose values are very different from the other data values, either very high or very low. When outliers are present, it is essential to detect and treat them appropriately. Once the outliers are detected, they are either removed or replaced with other values. If we have a sufficient amount of data, the outliers are usually removed, but in the case of limited data, they are preferably replaced with other values.

Normally, if we choose to replace the outliers with other values, the mean and median are considered, as both are straightforward and involve few calculations.

The mean is calculated from all the data values, so it also includes the outliers. As a result, the mean is pulled towards the outliers and ends up lower or higher than the mean of the data without them.

In short, if the mean is calculated in the presence of outliers, the outliers will bias it towards the lower or higher side, depending on their values.

In such cases, the mean should not be used to replace the outlier values: it is itself a biased value and will not be representative of the data.

Now if we consider the median instead of the mean, its value is largely unaffected by the outliers, as the median is simply the 50th percentile of the data observations. As long as the outliers make up less than half of the data, the 50th percentile cannot itself be an outlier, so the median stays close to the typical values.

In such cases, the median is preferred over the mean, as its value is barely affected by the outliers, and it can be a good choice for training a representative and reliable model.
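A tiny numeric illustration of this argument, using five hypothetical values and a single added outlier:

```python
import numpy as np

# Hypothetical clean data centered around 50
clean = np.array([48.0, 50.0, 51.0, 49.0, 52.0])

# The same data with one extreme outlier appended
with_outlier = np.append(clean, 5000.0)

print("Mean without outlier:  ", clean.mean())             # 50.0
print("Median without outlier:", np.median(clean))         # 50.0
print("Mean with outlier:     ", with_outlier.mean())      # 875.0
print("Median with outlier:   ", np.median(with_outlier))  # 50.5
```

A single extreme value moves the mean from 50 to 875, while the median only shifts from 50 to 50.5.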

Now let us try to understand the same with the help of code examples.

Mean Vs. Median With Outliers: Code Example

Here we will use a dummy dataset and go through the comparison step by step.

Step 1: Import Required Libraries

Let us import the required libraries that will be used to generate the dataset, get the mean and median values, and plot the graphs.

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

Step 2: Generate the Dataset

Here, we will generate a dummy dataset with 500 data points and 20 outliers.

# Generate a dataset with 500 data points and 20 outliers with increased values

np.random.seed(42)

data = np.concatenate((np.random.normal(loc=50, scale=10, size=480), np.random.normal(loc=200, scale=50, size=20)))

# Add outliers with significantly increased values

outliers_indices = np.random.choice(range(len(data)), size=20, replace=False)

data[outliers_indices] = np.random.uniform(low=5000, high=10000, size=20)

# Create a DataFrame from the data

df = pd.DataFrame(data, columns=['Values'])

Step 3: Calculate the Mean and Median

Let us calculate the mean and median of the dataset so that we can later replace the outlier values with them.

# Calculate mean with outliers

mean_with_outliers = np.mean(data)

# Calculate median

median = np.median(data)

Step 4: Detect the Outliers

Now let us use the z-score method to identify the outliers from the dataset.

# Identify outliers using z-score

z_scores = (data - mean_with_outliers) / np.std(data)

outliers = np.where(np.abs(z_scores) > 3)[0]

Step 5: Replace the Outliers with Mean and Median

Now let us replace the detected outlier values with the mean and median of the dataset.

# Replace outliers with mean and median

df_mean_replaced = df['Values'].copy()

df_mean_replaced.iloc[outliers] = mean_with_outliers

df_median_replaced = df['Values'].copy()

df_median_replaced.iloc[outliers] = median

Step 6: Print the Results

Now let us print the mean and median values before and after replacing the outliers.

# Print the mean and median values before and after replacing outliers

print("Mean (before replacing outliers):", mean_with_outliers)

print("Median (before replacing outliers):", median)

print("Mean (after replacing outliers):", df_mean_replaced.mean())

print("Median (after replacing outliers):", df_median_replaced.median())

Step 7: Plot the Mean and Median Values Before and After Replacement

Now let us plot the mean and median before replacement alongside the values that replace the outliers, to get a better idea of which is the better fit for replacement.

# Plot the mean and median values before and after the outlier replacement

plt.figure(figsize=(8, 6))

plt.plot(range(len(outliers)), [mean_with_outliers] * len(outliers), 'b--', label='Mean Before')

plt.plot(range(len(outliers)), [median] * len(outliers), 'g--', label='Median Before')

plt.plot(range(len(outliers)), df_mean_replaced[outliers], 'r-', label='Mean After')

plt.plot(range(len(outliers)), df_median_replaced[outliers], 'm-', label='Median After')

plt.xlabel('Outlier Index')

plt.ylabel('Value')

plt.title('Mean and Median Before/After Outlier Replacement')

plt.legend()

plt.grid(True)

plt.show()

Output

Step 8: Plot the Mean Replaced Graph

Now let us plot the final graph of original and mean replaced values.

# Plot the mean-replaced values

plt.figure(figsize=(10, 6))

plt.scatter(range(len(df)), df['Values'], color='lightblue', label='Original', alpha=0.7)

plt.scatter(outliers, df['Values'].iloc[outliers], color='red', label='Outliers')

plt.scatter(range(len(df)), df_mean_replaced, color='orange', label='Mean Replaced')

plt.axhline(mean_with_outliers, color='red', linestyle='--', linewidth=1.5, label='Mean (with Outliers)')

plt.xlabel('Index')

plt.ylabel('Values')

plt.title('Mean Replacement of Outliers')

plt.legend()

plt.grid(True)

plt.show()

Output

In the above graph, we can clearly see that the replacement values of the outliers are much higher than the original data points, as the mean was biased by the high-value outliers.

Step 9: Plot the Median Replaced Graph

Now, let us plot the graph of original and median replaced values.

# Plot the median-replaced values

plt.figure(figsize=(10, 6))

plt.scatter(range(len(df)), df['Values'], color='lightblue', label='Original', alpha=0.7)

plt.scatter(outliers, df['Values'].iloc[outliers], color='red', label='Outliers')

plt.scatter(range(len(df)), df_median_replaced, color='lightgreen', label='Median Replaced')

plt.axhline(median, color='green', linestyle='--', linewidth=1.5, label='Median')

plt.xlabel('Index')

plt.ylabel('Values')

plt.title('Median Replacement of Outliers')

plt.legend()

plt.grid(True)

plt.show()

Output

As we can see in the above plot, the median-replaced outlier values are almost the same as the other original data points, as the median is barely affected by the presence of outliers.

Now let us discuss the key points to remember from the article.

Key Takeaways

1. Mean and median are statistical measures that are used to get an idea about the central tendency of the dataset.

2. Mean is the sum of all the observations divided by the total number of observations.

3. The median is the middle value of the dataset, simply the 50th percentile of the dataset.

4. When outliers are present, the mean will be higher or lower than the mean of the data without outliers, as the mean includes the outlier values in its calculation.

5. The median is largely unaffected by the outliers, as it is the 50th percentile of the dataset, which cannot be an outlier as long as outliers make up less than half of the data.

Conclusion

In this article, we discussed the core idea behind the mean and median, how they are calculated, their use cases in data science, and the reason for using the median over the mean in the case of outliers.

To conclude, a measure that is robust to outliers can be used to replace the outlier values in the dataset, and the median fits this situation best.

This article will help one to understand the core intuition behind the mean and median, their use cases, and the statistical reason behind using the median over the mean in the case of outliers. It will also help one to answer interview questions on the topic and to apply the same approach while dealing with outliers.

Parth Shukla

Why is Median Better than Mean in Case of Outliers?
