Standard Normal Distribution with examples using Python

In our previous post, we talked about the Normal Distribution and its properties. In this post, we extend those ideas and discuss the Standard Normal Distribution in detail.

What is a Standard Normal Distribution?

A Normal Distribution with mean 0 and standard deviation 1 is called a Standard Normal Distribution. Mathematically, it is given as below.

Fig 1: Standard Normal Probability Distribution Function

For comparison, have a look at the Normal Probability Distribution Function. If you substitute 0 for the mean and 1 for the standard deviation, you get the standard normal probability distribution function.

Fig 2: Normal Probability Distribution Function
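
For reference, the two functions in Fig 1 and Fig 2 can be written out in their usual textbook form. The symbols $f$ and $\varphi$ are notation assumed here, not taken from the figures. Setting $\mu = 0$ and $\sigma = 1$ in the first line (and writing z for x) gives the second.

\begin{equation} \begin{aligned} f(x) &= \dfrac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \\[10pt] \varphi(z) &= \dfrac{1}{\sqrt{2\pi}} \, e^{-\frac{z^2}{2}} \end{aligned} \end{equation}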

Need for a standard normal probability distribution function

To extract probability information about the events we are interested in, we first convert any normal random variable x into a standard normal random variable z. Mathematically, this is given by the equation below. \begin{equation} z = \dfrac{x-\mu}{\sigma} \end{equation}

How to interpret z values in a standard normal distribution function?

A z value tells us how many standard deviations $\sigma$ the random variable x lies above (positive z) or below (negative z) the mean $\mu$. For a random variable x which is one standard deviation $\sigma$ above the mean $\mu$: \begin{equation} \begin{aligned} z &= \dfrac{x-\mu}{\sigma} \\[10pt] x &= \mu+\sigma \\[10pt] z &= \dfrac{\mu+\sigma -\mu}{\sigma} \\[10pt] z &= 1 \end{aligned} \end{equation} Similarly, for a normal random variable x equal to the mean $\mu$, we get z=0. In other words, x is 0 standard deviations away from the mean.
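
The standardization step can be sketched in a few lines of Python. This is only an illustration: the helper name `z_score` and the parameters `mu = 50`, `sigma = 10` are hypothetical and not part of the dataset used later. It also checks that a probability computed on the original scale matches the one computed from the z value on the standard normal.

```python
from scipy.stats import norm


def z_score(x, mu, sigma):
    """Return how many standard deviations x lies from mu (hypothetical helper)."""
    return (x - mu) / sigma


# Hypothetical parameters, chosen only for illustration
mu, sigma = 50.0, 10.0

z_above = z_score(mu + sigma, mu, sigma)  # one standard deviation above the mean
z_at_mean = z_score(mu, mu, sigma)        # exactly at the mean
print(z_above, z_at_mean)

# P(x <= 60) under a normal with mean mu and standard deviation sigma ...
p_original = norm.cdf(60.0, loc=mu, scale=sigma)
# ... should equal P(z <= z_score(60)) under the standard normal (loc=0, scale=1)
p_standard = norm.cdf(z_score(60.0, mu, sigma))
print(p_original, p_standard)
```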

Calculate Cumulative Normal Distribution Function in Python

How do you calculate the probabilities of events for your data if you assume the data to be normally distributed? We will work through a dataset to understand this better. The dataset is linked here and is taken from Kaggle. It gives the medical costs incurred by patients along with demographic and other information such as BMI, age and charges incurred. Go ahead and download the dataset to follow this section.

Let's import all libraries and load the dataset.


import pandas as pd
import numpy as np
import scipy as sp
from matplotlib import pyplot as plt
from scipy import stats
from scipy.stats import norm

df = pd.read_csv("..path_to_dataset/insurance.csv")
df.head()
Fig 3: Medical Cost dataset

Let's look at how the BMI values are distributed.


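fig, ax = plt.subplots()
s = df["bmi"]
# Overlay a kernel density estimate on a normalized histogram
s.plot.kde(ax=ax, legend=False, title="Distribution of BMI")
s.plot.hist(density=True, ax=ax)
ax.set_xlabel('BMI')
ax.set_ylabel('Density')  # with density=True the y-axis is a density, not a probability
ax.grid(axis='y')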
Fig 4: Distribution of BMI values

# Mean and standard deviation of BMI (np.std uses ddof=0 by default)
np.mean(df["bmi"]), np.std(df["bmi"])
--->(30.66339686098655, 6.0959076415894256)

What is the probability of finding a BMI less than or equal to 20? The corresponding z value for this event can be calculated as below.

\begin{equation} z = \dfrac{x-\mu}{\sigma} = \dfrac{20-30.66}{6.10} \approx -1.75 \end{equation} z turns out to be about -1.75. So the question can be reformulated as: what is the probability of finding a standard normal random variable less than or equal to -1.75, that is, at least 1.75 standard deviations below the mean?

# Deriving the z value

x = 20
mean = np.mean(df["bmi"])
stdev = np.std(df["bmi"])
z = (x - mean) / stdev
print(z)
--->-1.7492713945065959

# Probability (in %) of finding a normal random variable less than or equal to 20
norm.cdf(20, loc=mean, scale=stdev) * 100
--->4.012205908022238

So, there is a 4% probability of finding someone with a BMI of 20 or less.

What is the probability of finding someone with a BMI between 25 and 35? The corresponding z values are -0.93 and 0.71, so we can reformulate the problem as: what is the probability of finding a standard normal random variable between 0.93 standard deviations below the mean and 0.71 standard deviations above it? Note that since we are using norm.cdf, we do not need to derive the z values ourselves. We just have to feed the random variable x, the mean and the standard deviation to the norm.cdf function.


(norm.cdf(35,loc=mean, scale= stdev)- norm.cdf(25,loc=mean, scale= stdev))*100
--->58.51486537490964

So, there is a 58.5% probability of finding someone with a BMI between 25 and 35.
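
As a cross-check, the same probability can be computed from the z values on the standard normal. The sketch below is self-contained and assumes the mean and standard deviation printed earlier in this post, hard-coded so that it runs without the dataset.

```python
from scipy.stats import norm

# Mean and standard deviation of BMI as reported earlier in this post
mean = 30.66339686098655
stdev = 6.0959076415894256

# Standardize both endpoints of the interval
z_low = (25 - mean) / stdev
z_high = (35 - mean) / stdev

# Probability from the standard normal (loc=0 and scale=1 are the defaults)
p_from_z = norm.cdf(z_high) - norm.cdf(z_low)

# Probability on the original BMI scale, as computed above
p_from_x = norm.cdf(35, loc=mean, scale=stdev) - norm.cdf(25, loc=mean, scale=stdev)

print(round(z_low, 2), round(z_high, 2))
print(p_from_z * 100, p_from_x * 100)
```

Both routes should agree up to floating-point error, which is why passing loc and scale to norm.cdf saves the manual standardization step.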

Finally, what is the probability of finding someone with a BMI of 45 or more?

The z value turns out to be 2.35. We can now reformulate the problem as: what is the probability of finding a standard normal random variable which is 2.35 standard deviations or more above the mean?

(1- norm.cdf(45,loc=mean,scale=stdev))*100
--->0.9340388925257126

So, there is a 0.93% probability of finding someone with a BMI of 45 or more. A very rare event indeed.
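
An upper-tail probability like this one can also be obtained with the survival function norm.sf, which computes 1 - cdf directly. A minimal sketch, again assuming the mean and standard deviation printed earlier in this post:

```python
from scipy.stats import norm

# Mean and standard deviation of BMI as reported earlier in this post
mean = 30.66339686098655
stdev = 6.0959076415894256

# Upper-tail probability P(BMI >= 45), computed two ways
tail_via_cdf = 1 - norm.cdf(45, loc=mean, scale=stdev)
tail_via_sf = norm.sf(45, loc=mean, scale=stdev)

print(tail_via_cdf * 100, tail_via_sf * 100)
```

For tails this far out the two agree closely. For much more extreme tails, norm.sf is generally the safer choice because it avoids subtracting a number very close to 1 from 1.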

Calculate inverse of normal cumulative distribution function in Python

Often we are given a probability and are expected to come up with the corresponding range of the normal random variable. An example helps here.

What is the range of BMI values for people in the lowest 10% BMI category? Can I say anyone with a BMI less than or equal to 35 belongs to the lowest 10% category? Or is it less than or equal to 30? Or 15? Let's find out.


norm.ppf(0.1,loc=mean,scale=stdev)
--->22.85117687949233

Roughly people with BMI of 23 or less belong to the lowest 10% category. This can be visualized in the figure below


x = np.arange(np.min(df['bmi']), np.max(df['bmi']), 0.05)
y = norm.pdf(x, loc=mean, scale=stdev)
plt.plot(x, y)
# Shade the area to the left of the 10th percentile cutoff
plt.fill_between(x, y, where=(x <= 22.85))
plt.title("Bottom 10% BMI")
plt.show()
Fig 5: Bottom 10% of BMI
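
The 10% cutoff can also be reached through the standard normal: find the z value for the 10th percentile, then convert it back to the BMI scale with x = mean + z * stdev, which is the standardization formula solved for x. A short sketch, assuming the mean and standard deviation printed earlier in this post:

```python
from scipy.stats import norm

# Mean and standard deviation of BMI as reported earlier in this post
mean = 30.66339686098655
stdev = 6.0959076415894256

# 10th percentile of the standard normal (a negative z value)
z_10 = norm.ppf(0.1)

# Convert the z value back to the BMI scale
cutoff_from_z = mean + z_10 * stdev

# Same cutoff computed directly on the BMI scale
cutoff_direct = norm.ppf(0.1, loc=mean, scale=stdev)

print(z_10, cutoff_from_z, cutoff_direct)
```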

What is the range of BMI values for people in highest 5% BMI category? Can I say anyone with BMI equal to 35 or more belongs to highest 5% category? Or is it 32 or more? Or 45 or more? Let's find out.


norm.isf(0.05,loc=mean,scale=stdev)
--->40.69027265481611

Alternatively, we can do this as well


norm.ppf(1-0.05,loc=mean,scale=stdev)
--->40.69027265481611

Roughly, people with a BMI of 40.7 or above belong to the highest 5% category. This can be visualized in the figure below.


x = np.arange(np.min(df['bmi']), np.max(df['bmi']), 0.05)
y = norm.pdf(x,loc=mean,scale=stdev)
plt.plot(x,y)
plt.fill_between(x,y,where=(x>=40.69))
plt.title("Top 5% BMI")
plt.show()
Fig 6: Top 5% of BMI
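
A quick way to convince yourself that norm.isf really inverts the upper-tail probability is a round trip: feed the cutoff back into norm.sf and check that 5% comes out again. The sketch below assumes the mean and standard deviation printed earlier in this post.

```python
from scipy.stats import norm

# Mean and standard deviation of BMI as reported earlier in this post
mean = 30.66339686098655
stdev = 6.0959076415894256

# BMI value above which the top 5% lie
cutoff = norm.isf(0.05, loc=mean, scale=stdev)

# Round trip: the upper-tail probability at the cutoff should be 0.05 again,
# and the cumulative probability at the same point should be 0.95
recovered_tail = norm.sf(cutoff, loc=mean, scale=stdev)
recovered_cdf = norm.cdf(cutoff, loc=mean, scale=stdev)

print(cutoff, recovered_tail, recovered_cdf)
```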

This concludes our discussion on Standard Normal Distribution function. The Colab link is provided for anyone to play with this.
