Gaussian Distribution, Normal Distribution, and Standard Deviation
The image above is a typical normal distribution graph.
Let's look at the core idea without complex formulas.
Many kinds of real-world data cluster around the mean in roughly this shape.
For example, if the average height of Korean men is 173 cm, more men are around 170 cm tall than around 160 cm.
Looking at the graph above, you can see that the closer to the mean, the higher the curve, and the farther away, the lower it drops.
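To put rough numbers on the height example, here is a minimal sketch that evaluates the normal curve at a few heights. The mean of 173 cm comes from the example above; the standard deviation of 6 cm is a made-up value used only for illustration, and `normal_pdf` is a helper written for this sketch.

```python
import math

def normal_pdf(x, mean, std):
    """Height of the normal curve at x for the given mean and standard deviation."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

MEAN_CM = 173.0  # mean from the example in the text
STD_CM = 6.0     # hypothetical spread, chosen only for illustration

for height in (160, 170, 173):
    print(f"{height} cm: density {normal_pdf(height, MEAN_CM, STD_CM):.4f}")
```

The density is highest at the mean and shrinks as the height moves away from it, which is exactly what the shape of the curve shows.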
How widely the data is spread around the mean is expressed by the standard deviation.
When the standard deviation is small, the data is tightly packed around the mean.
When the standard deviation is large, the data is spread far from the mean.
Understanding Through an Example
Of the two graphs below, Graph 1 has a smaller standard deviation than Graph 2.
This means Graph 1's data is more concentrated around the mean.
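The same comparison can be reproduced with simulated data. This sketch draws two samples with the same mean but different standard deviations (1 and 3 are arbitrary values picked for illustration, not the ones behind the graphs) and measures how much of each sample falls close to the mean. `share_within` is a helper defined here, not a NumPy function.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the run is repeatable

# Same mean, different standard deviations (both values are illustrative).
narrow = rng.normal(loc=0.0, scale=1.0, size=10_000)
wide = rng.normal(loc=0.0, scale=3.0, size=10_000)

def share_within(values, mean, radius):
    """Fraction of values lying within `radius` of `mean`."""
    return np.mean(np.abs(values - mean) <= radius)

print(f"narrow: std={narrow.std():.2f}, share within 1 of the mean: {share_within(narrow, 0.0, 1.0):.0%}")
print(f"wide:   std={wide.std():.2f}, share within 1 of the mean: {share_within(wide, 0.0, 1.0):.0%}")
```

The sample with the smaller standard deviation should keep a much larger share of its points near the mean.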
In machine learning, understanding the characteristics of your data is critically important.
So whenever you get a new dataset, the first thing you should do is check the mean, standard deviation, min, and max to develop an intuition for the data.
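As a minimal sketch of that first look, assuming the data is already in a NumPy array (the numbers below are invented for illustration):

```python
import numpy as np

# Hypothetical measurements standing in for a freshly loaded dataset.
values = np.array([12.1, 9.8, 11.4, 10.3, 35.0, 10.9, 11.7])

summary = {
    "mean": values.mean(),
    "std": values.std(),
    "min": values.min(),
    "max": values.max(),
}
for name, stat in summary.items():
    print(f"{name}: {stat:.2f}")
```

Even this quick check is informative: a maximum that sits far from the mean compared with the standard deviation, like the 35.0 here, is a hint that the data may contain an outlier worth inspecting.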
Standard Deviation Formula
The standard deviation is calculated in the following steps:
First, calculate the mean (average) of all data points.
Then compute the difference between each data point and the mean.
Square each difference.
Calculate the average of the squared differences (this is called the variance).
Take the square root of the variance. This value is the standard deviation.
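Written in symbols, the steps above correspond to the following. The notation is an assumption made here: N is the number of data points, x_i is the i-th data point, μ is the mean, and σ is the standard deviation. Because the squared differences are averaged over all N points, this is the population form of the formula.

```latex
\mu = \frac{1}{N} \sum_{i=1}^{N} x_i,
\qquad
\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2,
\qquad
\sigma = \sqrt{\sigma^2}
```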
Once you understand the concept, the formula is straightforward.
In Python
Standard deviation can be easily calculated using NumPy:
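import numpy as np
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
mean = np.mean(data)  # arithmetic mean
std = np.std(data)    # population standard deviation (default ddof=0, divides by N)
var = np.var(data)    # population variance, the square of std
print(f"Mean: {mean}")
print(f"Standard Deviation: {std}")
print(f"Variance: {var}")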
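To connect the NumPy call back to the five steps of the formula, here is a sketch that computes the standard deviation by hand and compares it with `np.std`. The helper `population_std` is written for this example. By default `np.std` divides by N, which matches the steps; passing `ddof=1` divides by N - 1 instead and gives the sample standard deviation.

```python
import math
import numpy as np

def population_std(values):
    """Standard deviation computed step by step, dividing by N."""
    mean = sum(values) / len(values)            # step 1: mean
    differences = [x - mean for x in values]    # step 2: difference from the mean
    squared = [d ** 2 for d in differences]     # step 3: square each difference
    variance = sum(squared) / len(squared)      # step 4: average of the squares (variance)
    return math.sqrt(variance)                  # step 5: square root of the variance

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
manual = population_std(data)

print(f"Step by step:        {manual:.4f}")
print(f"np.std (ddof=0):     {np.std(data):.4f}")
print(f"np.std with ddof=1:  {np.std(data, ddof=1):.4f}")
```

The first two values should agree, while the `ddof=1` result comes out slightly larger because it divides by a smaller number.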
With Pandas
You can also compute it using Pandas:
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3, 4, 5]})
print(df.describe())  # count, mean, std (sample, ddof=1), min, quartiles, max
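One detail is easy to miss: the `std` row reported by `describe()` is the sample standard deviation (pandas defaults to `ddof=1`), while `np.std` defaults to the population version (`ddof=0`). The sketch below, using the same small DataFrame, shows how the two line up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5]})

sample_std = df['A'].std()         # pandas default: ddof=1, divides by N - 1
pop_std = df['A'].std(ddof=0)      # divides by N, like np.std's default

print(f"describe() std: {df.describe().loc['std', 'A']:.4f}")
print(f"sample std:     {sample_std:.4f}")
print(f"population std: {pop_std:.4f}")
print(f"np.std default: {np.std(df['A'].to_numpy()):.4f}")
```

With only five values the gap between the two is visible; on large datasets it becomes negligible, but it is worth knowing which one a library reports when comparing results.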
That wraps up the explanation of Gaussian distribution and standard deviation.
The concepts of mean and standard deviation are foundational to machine learning and statistics.
I hope this article helped you build a solid understanding of these important concepts.