A Normal (or Gaussian) distribution is used to model continuous random variables. The height and BMI of people, among many other natural phenomena, tend to follow it approximately. Its density has the familiar bell-curve shape. Let's see this in action.
A normal distribution is defined by two parameters: the mean and the standard deviation. This is how you can define the distribution using the stats module from SciPy.
import numpy as np
from matplotlib import pyplot as plt
from scipy.stats import norm
import scipy
x = np.linspace(0,700,1000000)  # Create evenly spaced numbers from 0 to 700
r1 = norm.rvs(loc=350,scale=50,size=1000000)  # Create samples with mean=350 and stdev=50
Notice the rvs method of norm. We will talk about it in a while. Let's see what the plot for this distribution looks like.
fig, ax = plt.subplots(1, 1)
ax.set_xlim([0, 700])
ax.hist(r1, density=True, bins='auto', histtype='stepfilled', alpha=0.2)
ax.plot(x, norm.pdf(x,loc=350,scale=50),
'r-', lw=5, alpha=0.6)
plt.show()
Mathematically, the density of a normal distribution with mean μ and standard deviation σ is:
f(x) = 1 / (σ √(2π)) · exp( -(x - μ)² / (2σ²) )
The rvs method is kind of tricky to explain. My understanding is that it draws a number of random samples x that follow the distribution under study. In this case, we will find many values of x close to 350 and far fewer to the left of 100 or to the right of 500. When we plot a histogram of these values, it follows a nice bell curve.
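To make the idea of sampling more concrete, here is a minimal sketch that draws samples with rvs and checks that they behave as described. The seed value and the names samples, near_mean and far_tails are only illustrative choices, not part of the original notebook.

```python
import numpy as np
from scipy.stats import norm

# Draw samples from a normal distribution with mean 350 and stdev 50.
# random_state fixes the seed so the run is repeatable (the value 0 is arbitrary).
samples = norm.rvs(loc=350, scale=50, size=100_000, random_state=0)

# The sample statistics should land close to the parameters we asked for.
sample_mean = samples.mean()
sample_std = samples.std()

# Share of samples within one stdev of the mean versus far out in the tails.
near_mean = np.mean((samples >= 300) & (samples <= 400))
far_tails = np.mean((samples < 100) | (samples > 500))

print(sample_mean, sample_std, near_mean, far_tails)
```

Most of the samples fall in the 300 to 400 band, and only a very small share falls below 100 or above 500, which is what gives the histogram its bell shape.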
Let's see if we can derive some of these values on our own without relying on scipy.stats. We will define our own normal distribution function.
mean=350
stdev=50
def normal(x,mean,stdev):
    # Density: exp(-(x-mean)^2 / (2*stdev^2)) / (stdev*sqrt(2*pi))
    return 1/(stdev*np.sqrt(2*np.pi)) * np.exp((-1*(x-mean)**2)/(2*stdev**2))
normal(x[500000:500000+10],mean,stdev) ##Our definition
--->array([0.00797885, 0.00797885, 0.00797885, 0.00797885, 0.00797885,
0.00797885, 0.00797885, 0.00797885, 0.00797885, 0.00797885])
n1 = norm.pdf(x,mean,stdev) ###Stats function
n1[500000:500000+10]
--->array([0.00797885, 0.00797885, 0.00797885, 0.00797885, 0.00797885,
0.00797885, 0.00797885, 0.00797885, 0.00797885, 0.00797885])
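The ten values printed above all sit right next to the mean, where the exponent is almost zero, so they would look the same even if the exponent were written incorrectly. A fuller check, sketched below, compares a hand-written density against norm.pdf over a whole grid. The names normal_pdf and grid are illustrative and the coarser grid is an assumption made to keep the example quick.

```python
import numpy as np
from scipy.stats import norm

def normal_pdf(x, mean, stdev):
    """Normal density: exp(-(x - mean)^2 / (2 * stdev^2)) / (stdev * sqrt(2 * pi))."""
    return np.exp(-((x - mean) ** 2) / (2 * stdev ** 2)) / (stdev * np.sqrt(2 * np.pi))

# Coarse grid from 0 to 700 that covers the tails as well as the centre.
grid = np.linspace(0, 700, 1001)

ours = normal_pdf(grid, 350, 50)
scipy_values = norm.pdf(grid, loc=350, scale=50)

# True only if the two agree everywhere on the grid, not just near the mean.
matches = np.allclose(ours, scipy_values)
print(matches)
```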
As we can see, our own function reproduces the values returned by norm.pdf. Let's create two more normal distributions with the same mean but different standard deviations.
r1 = norm.rvs(loc=350,scale=50,size=1000000)
r2 = norm.rvs(loc=350,scale=75,size=1000000) ###Create a number of samples which follows the normal distribution
r3 = norm.rvs(loc=350,scale=125,size=1000000) ###Create a number of samples which follows the normal distribution
fig, ax = plt.subplots(1, 1)
ax.set_xlim([0, 700])
ax.hist(r1, density=True, bins='auto', histtype='stepfilled', alpha=0.2)
ax.hist(r2, density=True, bins='auto', histtype='stepfilled', alpha=0.2)
ax.hist(r3, density=True, bins='auto', histtype='stepfilled', alpha=0.2)
ax.plot(x, norm.pdf(x,loc=350,scale=50),
'r-', lw=5, alpha=0.6, label='stdev 50')
ax.plot(x, norm.pdf(x,loc=350,scale=75),
'b-', lw=5, alpha=0.6, label='stdev 75')
ax.plot(x, norm.pdf(x,loc=350,scale=125),
'g-', lw=5, alpha=0.6, label='stdev 125')
ax.legend(loc='best', frameon=False)
plt.show()
We see that the standard deviation determines the shape of the distribution. The distribution with a standard deviation of 50 is narrower and taller, while the one with a standard deviation of 125 is flatter and wider.
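The "taller versus flatter" effect can also be read off the density formula: at the mean the exponential term equals 1, so the peak height is 1 / (stdev * sqrt(2 * pi)). The sketch below compares that value with norm.pdf for the three standard deviations used here. The names center and peaks are illustrative.

```python
import numpy as np
from scipy.stats import norm

center = 350
peaks = {}
for stdev in (50, 75, 125):
    # Height of the curve at its highest point, the mean.
    peaks[stdev] = norm.pdf(center, loc=center, scale=stdev)
    # Closed-form peak height from the density formula.
    expected = 1 / (stdev * np.sqrt(2 * np.pi))
    print(stdev, peaks[stdev], expected)
```

Because the area under each curve must stay at 1, a wider curve has to be lower, which is why the peak shrinks as the standard deviation grows.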
Important Properties of Normal Distribution
- A normal distribution is defined by two parameters: the mean and the standard deviation.
- The highest point of the distribution is at its mean. The median and mode are the same as the mean.
- The larger the standard deviation, the more variation in the data and the flatter the curve.
- A normal distribution is symmetric: the shape of the curve to the left of the mean mirrors the shape to the right. In effect, skewness is 0.
- 68.3% of the values of a normal random variable lie within 1 standard deviation of the mean.
- 95.4% of the values of a normal random variable lie within 2 standard deviations of the mean.
- 99.7% of the values of a normal random variable lie within 3 standard deviations of the mean.
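A few of the properties above (median and mode equal to the mean, and symmetry) can be checked numerically. This is a minimal sketch using a "frozen" distribution object. The names dist, grid and offsets are illustrative, and the mode is only located on a grid with step 1, so it is an approximation by construction.

```python
import numpy as np
from scipy.stats import norm

dist = norm(loc=350, scale=50)  # distribution with fixed mean and stdev

# Median: the point that splits the probability mass in half.
median = dist.median()
half_point = dist.ppf(0.5)  # same quantity via the inverse CDF

# Mode: location of the highest density value on a grid that contains 350.
grid = np.linspace(0, 700, 701)
mode = grid[np.argmax(dist.pdf(grid))]

# Symmetry: the density at equal distances on either side of the mean should match.
offsets = np.array([10, 50, 120])
symmetric = np.allclose(dist.pdf(350 - offsets), dist.pdf(350 + offsets))

print(median, half_point, mode, symmetric)
```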
mean,variance,skewness,kurtosis = norm.stats(loc=350,scale=50,moments="mvsk")
mean, np.sqrt(variance),skewness,kurtosis
--->(array(350.), 50.0, array(0.), array(0.))
### Skewness is 0. The kurtosis returned by norm.stats is excess kurtosis, which is also 0 for a normal distribution.
##Within 1 standard deviation###
len(np.where((r1>=mean-1*stdev) & (r1<=mean+1*stdev))[0])/len(r1)*100
--->68.3155
##Within 2 standard deviation###
len(np.where((r1>=mean-2*stdev) & (r1<=mean+2*stdev))[0])/len(r1)*100
--->95.4419
##Within 3 standard deviation###
len(np.where((r1>=mean-3*stdev) & (r1<=mean+3*stdev))[0])/len(r1)*100
---> 99.7363
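The percentages above come from random samples, so they wobble a little from run to run. As a cross-check, the same quantities can be computed without sampling from the cumulative distribution function, norm.cdf. This is a small sketch, with dist and coverage as illustrative names.

```python
from scipy.stats import norm

dist = norm(loc=350, scale=50)

coverage = {}
for k in (1, 2, 3):
    lower, upper = 350 - k * 50, 350 + k * 50
    # Probability mass between lower and upper, expressed as a percentage.
    coverage[k] = (dist.cdf(upper) - dist.cdf(lower)) * 100
    print(k, round(coverage[k], 2))
```

The results should come out at roughly 68.27, 95.45 and 99.73, in line with the sample-based figures.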
You can verify the same for r2 and r3. We have covered most of the key points about the normal distribution in this post. We will talk about the standard normal distribution in a later post. I have linked a Colab notebook for anyone who wants to play with this.