Finding Variance with Python

This article was originally published on Built In by Eric Kleppen.

Variance is a powerful metric used in data analysis and machine learning. It is one of the four main measures of variability, along with range, interquartile range (IQR) and standard deviation. Understanding variance is important because it gives you insight into the dispersion of your data and can be used to compare differences in sample groups or identify key modeling features. Variance is also used in machine learning to understand changes in model performance as a result of using different samples of training data.

Calculating variance is easy using Python. Before diving into the Python code, I will first explain what variance is and how to calculate it. By the end of this tutorial, you’ll have a better understanding of why variance is an important metric, along with several methods for calculating it using Python.

What is variance?

Variance is a statistic that measures dispersion. A low variance indicates that the values are generally similar and do not deviate much from the mean, while a high variance indicates that the values are further from the mean. You can calculate variance for a sample or for an entire population, since the calculation includes all data points in the given set. While the calculation differs slightly between a sample and a population, in both cases the variance is essentially the mean of the squared differences from the mean.

Since the variance is a squared value, it can be difficult to interpret compared to other measures of variability, such as standard deviation. Either way, it can be helpful to assess the variance, because it can make it easier to decide which statistical tests to use with your data. Depending on the statistical test, unequal variances between samples can skew or bias the results.

One of the popular statistical tests that applies variance is called the analysis of variance (ANOVA) test. An ANOVA test is used to measure whether any of the group means are significantly different from each other when analyzing a categorical independent variable and a quantitative dependent variable. For example, let’s say you want to analyze whether social media use affects the number of hours you sleep. You can divide social media use into different categories, such as low use, medium use, and high use, and then perform an ANOVA test to gauge whether there are statistical differences between the group means. The test can show whether results are explained by group differences or individual differences.

How do you find the variance?

Calculating the variance for a data set can differ based on whether the set is the entire population or a sample of the population.

The formula for calculating the variance of an entire population looks like this:

σ² = Σ (Xᵢ - μ)² / N

An explanation of the formula:

  • σ² = population variance
  • Σ = sum of…
  • Xᵢ = each value
  • μ = population mean
  • N = number of values in the population

Using an example set of numbers, let’s walk through the calculation step by step.

Example sequence of numbers: 8, 6, 12, 3, 13, 9

Find the population mean (μ):

μ = (8 + 6 + 12 + 3 + 13 + 9) / 6 = 51 / 6 = 8.5

Calculate deviations from the mean by subtracting the mean from each value.

8 - 8.5 = -0.5, 6 - 8.5 = -2.5, 12 - 8.5 = 3.5, 3 - 8.5 = -5.5, 13 - 8.5 = 4.5, 9 - 8.5 = 0.5

Square each deviation to get a positive number.

(-0.5)² = 0.25, (-2.5)² = 6.25, 3.5² = 12.25, (-5.5)² = 30.25, 4.5² = 20.25, 0.5² = 0.25

Add the squared values.

0.25 + 6.25 + 12.25 + 30.25 + 20.25 + 0.25 = 69.5

Divide the sum of squares by N or n-1.

Since we’re working with the entire population, we’re dividing by N. If we were working with a sample of the population, we’d be dividing by n-1.

69.5/6 = 11.583

There we have it! The variance of our population is 11.583.

Why use n-1 when calculating the sample variance?

Dividing by n-1 instead of N is called Bessel’s correction, named after Friedrich Bessel. When we work with a sample, we are using it to estimate the variance of the population. Because the deviations are measured from the sample mean rather than the true population mean, dividing by n tends to underestimate the population variance. Dividing by n-1 makes the estimate slightly larger, which corrects that bias.

Let’s recalculate the variance by pretending the values come from a sample:

69.5/5 = 13.9

As we can see, the variance is greater!
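As a quick cross-check of the two hand calculations, here is a minimal sketch using Python's standard-library `statistics` module. This module is not one of the methods the article covers; it is used here only to confirm the arithmetic. `pvariance` divides by N and `variance` divides by n-1.

```python
import statistics

values = [8, 6, 12, 3, 13, 9]

# Population variance: sum of squared deviations divided by N.
population_variance = statistics.pvariance(values)

# Sample variance: sum of squared deviations divided by n - 1 (Bessel's correction).
sample_variance = statistics.variance(values)

print(round(population_variance, 3))  # 11.583
print(round(sample_variance, 3))      # 13.9
```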

Calculating Variance with Python

Now that we’ve done the calculation by hand, we can see that it would be very tedious to complete for a large set of values. Fortunately, Python can easily handle the computation for very large data sets. We will explore two methods with Python:

  • Write our own variance calculation function
  • Use the Pandas built-in function

Writing a variance function

As we begin writing a function to calculate variance, think back to the steps we took when calculating it by hand. We want the function to take two parameters:

  • population: a series of numbers
  • is_sample: a Boolean to change the calculation depending on whether we are working with a sample or population

Start by defining the function that takes the two parameters.

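A minimal sketch of what that definition might look like. The function name `calculate_variance` is an assumption made for illustration, since the original code is not shown; the parameter names come from the list above.

```python
def calculate_variance(population, is_sample=False):
    """Return the variance of the numbers in `population`.

    When is_sample is True, the data is treated as a sample (divide by n - 1).
    When it is False (the default), the data is treated as the whole
    population (divide by N).
    """
    # The body is filled in over the steps that follow.
```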

Then add logic to calculate the population mean.

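One way to sketch this step, shown on the example numbers rather than inside the function so that it runs on its own (the variable names are assumptions):

```python
population = [8, 6, 12, 3, 13, 9]

# Mean = sum of the values divided by how many values there are.
mean = sum(population) / len(population)

print(mean)  # 8.5
```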

After calculating the mean, find the differences from the mean for each value. You can do this in one line using a list comprehension.

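A possible version of that list comprehension, again run on the example numbers (the names `deviations` and `mean` are assumptions):

```python
population = [8, 6, 12, 3, 13, 9]
mean = sum(population) / len(population)

# Subtract the mean from every value in a single list comprehension.
deviations = [x - mean for x in population]

print(deviations)  # [-0.5, -2.5, 3.5, -5.5, 4.5, 0.5]
```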

Then square the differences and add them up.

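A short sketch of squaring and summing, starting from the deviations found in the hand calculation (the name `sum_of_squares` is an assumption):

```python
deviations = [-0.5, -2.5, 3.5, -5.5, 4.5, 0.5]

# Square each deviation, then add the squares together.
squared_deviations = [d ** 2 for d in deviations]
sum_of_squares = sum(squared_deviations)

print(squared_deviations)  # [0.25, 6.25, 12.25, 30.25, 20.25, 0.25]
print(sum_of_squares)      # 69.5
```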

Finally, calculate the variance. Using an if/else statement, we can act on the is_sample parameter. If is_sample is true, calculate the variance with n-1. If false (the default), use N:

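The branching might look something like the sketch below. It starts from the sum of squares (69.5) and the count (6) from the hand calculation so that it runs on its own; inside the function these would come from the earlier steps.

```python
sum_of_squares = 69.5  # from the squared deviations of the example numbers
n = 6                  # number of values
is_sample = False      # default: treat the data as the whole population

if is_sample:
    # Sample: divide by n - 1 (Bessel's correction).
    variance = sum_of_squares / (n - 1)
else:
    # Population: divide by N.
    variance = sum_of_squares / n

print(round(variance, 3))  # 11.583
```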

We can test the calculation using the sequence of numbers we worked through by hand:

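Putting the steps together, a complete version of the function could look like this. It is a sketch under the same assumptions as before (the name `calculate_variance` is illustrative), not necessarily the author's exact code.

```python
def calculate_variance(population, is_sample=False):
    """Variance of `population`; divides by n - 1 when is_sample is True."""
    mean = sum(population) / len(population)
    deviations = [x - mean for x in population]
    sum_of_squares = sum(d ** 2 for d in deviations)
    if is_sample:
        return sum_of_squares / (len(population) - 1)
    else:
        return sum_of_squares / len(population)


numbers = [8, 6, 12, 3, 13, 9]
print(round(calculate_variance(numbers), 3))                  # 11.583
print(round(calculate_variance(numbers, is_sample=True), 3))  # 13.9
```

Both results match the manual calculations. Note that this sketch does no input checking: an empty list, or a single-value list with is_sample=True, raises a ZeroDivisionError.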

Finding Variance with Pandas

While we can write a function to calculate variance in fewer than 10 lines of code, there is an even easier way to find variance. You can do it in one line of code with Pandas. Let’s load up some data and run through a real-life example of finding variance.
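Before moving to real data, here is a small sketch of that one-liner applied to the example numbers from the hand calculation. One detail worth knowing: `Series.var()` uses `ddof=1` by default, so it returns the sample variance (n-1) unless you pass `ddof=0`.

```python
import pandas as pd

values = pd.Series([8, 6, 12, 3, 13, 9])

# Default ddof=1 divides by n - 1, giving the sample variance.
sample_variance = values.var()

# ddof=0 divides by N, giving the population variance.
population_variance = values.var(ddof=0)

print(round(sample_variance, 3))      # 13.9
print(round(population_variance, 3))  # 11.583
```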

Load sample data

The Pandas example uses the BMW Price Challenge dataset from Kaggle, which is free to download. Start by importing the Pandas library and then read the CSV file into a Pandas DataFrame:


We can count the number of rows in the dataset and display the first five rows to make sure everything loads correctly:


Display the first rows with bmw_df.head()
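The loading and checking steps could be sketched as follows. The helper names `load_bmw_data` and `preview` and the filename in the usage comment are assumptions for illustration; use whatever name the Kaggle download has on your machine.

```python
import pandas as pd


def load_bmw_data(csv_path):
    """Read the downloaded Kaggle CSV into a DataFrame."""
    return pd.read_csv(csv_path)


def preview(df, rows=5):
    """Print the row count and the first few rows as a quick sanity check."""
    print(len(df))
    print(df.head(rows))


# Hypothetical usage; replace the filename with the path to your download:
# bmw_df = load_bmw_data("bmw_pricing_challenge.csv")
# preview(bmw_df)
```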