This article was originally published on Built In by Eric Kleppen.
Variance is a powerful metric used in data analysis and machine learning. It is one of the four main measures of variability along with range, interquartile range (IQR) and standard deviation. Understanding variance is important because it gives you insight into the dispersion of your data and can be used to compare differences in sample groups or identify key modeling features. Variance is also used in machine learning to understand changes in model performance as a result of using different samples of training data.
Calculating variance is easy using Python. Before you dive into Python code, I will first explain what variance is and how to calculate it. By the end of this tutorial, you’ll have a better understanding of why variance is an important metric, along with several methods for calculating it using Python.
What is variance?
Variance is a statistic that measures dispersion. A low variance indicates that the values are generally similar and do not deviate much from the mean, while a high variance indicates that the values are further from the mean. You can use variance on a sample set or on the entire population, since the calculation includes all data points in the given set. While the calculation differs slightly when you look at a sample versus population, you can calculate the variance as the mean of the squared differences from the mean.
Since the variance is a squared value, it can be difficult to interpret compared to other measures of variability, such as standard deviation. Even so, assessing the variance is helpful because it can make it easier to decide which statistical tests to use with your data. Depending on the statistical test, unequal variance between samples can skew or bias the results.
One of the popular statistical tests that applies variance is called the analysis of variance (ANOVA) test. An ANOVA test is used to measure whether any of the group means are significantly different from each other when analyzing a categorical independent variable and a quantitative dependent variable. For example, let’s say you want to analyze whether social media use affects the number of hours you sleep. You can divide social media use into different categories, such as low use, medium use, and high use, and then perform an ANOVA test to gauge whether there are statistical differences between the group means. The test can show whether results are explained by group differences or individual differences.
How do you find the variance?
Calculating the variance for a data set can differ based on whether the set is the entire population or a sample of the population.
The formula for calculating the variance of an entire population looks like this:
σ² = Σ (Xᵢ - μ)² / N
An explanation of the formula:
- σ² = population variance
- Σ = sum of…
- Xᵢ = any value
- μ = population mean
- N = number of values in the population
Using an example set of numbers, let’s walk through the calculation step by step.
Example sequence of numbers: 8, 6, 12, 3, 13, 9
Find the population mean (μ): (8 + 6 + 12 + 3 + 13 + 9) / 6 = 51 / 6 = 8.5
Calculate deviations from the mean by subtracting the mean from each value.
Square each deviation to get a positive number.
Add the squared values to get the sum of squares, 69.5.
Divide the sum of squares by N (for a population) or n-1 (for a sample).
Since we’re working with the entire population, we’re dividing by N. If we were working with a sample of the population, we’d be dividing by n-1.
69.5/6 = 11.583
There we have it! The variance of our population is 11.583.
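The five steps above can also be traced in plain Python, which makes the intermediate values visible. This is only a sketch of the hand calculation; the variable names are illustrative choices, not taken from the original article.

```python
# Population values from the worked example
values = [8, 6, 12, 3, 13, 9]

# Step 1: find the population mean
mean = sum(values) / len(values)

# Step 2: subtract the mean from each value
deviations = [v - mean for v in values]

# Step 3: square each deviation
squared = [d ** 2 for d in deviations]

# Step 4: add the squared values
sum_of_squares = sum(squared)

# Step 5: divide by N, because we treat the values as the entire population
population_variance = sum_of_squares / len(values)

print(mean)                           # 8.5
print(deviations)                     # [-0.5, -2.5, 3.5, -5.5, 4.5, 0.5]
print(squared)                        # [0.25, 6.25, 12.25, 30.25, 20.25, 0.25]
print(sum_of_squares)                 # 69.5
print(round(population_variance, 3))  # 11.583
```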
Why use n-1 when calculating the sample variance?
Applying n-1 to the formula is called Bessel’s correction, named after Friedrich Bessel. When we work with a sample, we use it to estimate the variance of the population. Dividing by n instead of n-1 would bias that estimate, tending to underestimate the population variance, because the deviations are measured from the sample mean rather than the true population mean. Dividing by n-1 makes the estimate slightly larger, which corrects for that bias.
Let’s recalculate the variance by pretending the values come from a sample:
69.5/(6-1) = 69.5/5 = 13.9
As we can see, the variance is greater: 13.9 for the sample compared with 11.583 for the population.
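To see why the correction reduces bias, one option is to list every possible sample of size 2 drawn (with replacement) from the six example values and average the two kinds of estimate. The sketch below uses Python’s standard `statistics` module, where `pvariance` divides by the number of values and `variance` divides by n-1. The sample size of 2 and the with-replacement sampling are assumptions made for the illustration.

```python
from itertools import product
from statistics import mean, pvariance, variance

population = [8, 6, 12, 3, 13, 9]
true_variance = pvariance(population)  # divides by N

# Every possible sample of size 2, drawn with replacement (36 in total)
samples = list(product(population, repeat=2))

# Average estimate when each sample's sum of squares is divided by n
avg_divide_by_n = mean(pvariance(s) for s in samples)

# Average estimate when each sample's sum of squares is divided by n-1
avg_divide_by_n_minus_1 = mean(variance(s) for s in samples)

print(round(true_variance, 3))            # 11.583
print(round(avg_divide_by_n, 3))          # 5.792, underestimates the population variance
print(round(avg_divide_by_n_minus_1, 3))  # 11.583, matches the population variance
```

Averaged over all possible samples, dividing by n lands well below the population variance, while dividing by n-1 recovers it.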
Calculating Variance with Python
Now that we’ve done the calculation by hand, we can see that it would be very tedious to repeat for a large set of values. Fortunately, Python can easily handle the computation for very large data sets. We will explore two methods with Python:
- Write our own variance calculation function
- Use Pandas’ built-in function
Writing a variance function
As we begin writing a function to calculate variance, think back to the steps we took when calculating manually. We want the function to take two parameters:
- population: a series of numbers
- is_sample: a Boolean to change the calculation depending on whether we are working with a sample or population
Start by defining the function that takes the two parameters.
Then add logic to calculate the population mean.
After calculating the mean, find the differences from the mean for each value. You can do this in one line using a list comprehension.
Then square the differences and add them up.
Finally, calculate the variance. Using an if/else statement, we can act on the is_sample parameter. If is_sample is true, calculate the variance with (n-1). If false (the default), use N.
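Putting those steps together, the function might look like the minimal sketch below. The function name `calculate_variance` and its two parameters come from the article; the internal variable names are assumptions chosen for readability.

```python
def calculate_variance(population, is_sample=False):
    """Return the variance of a sequence of numbers."""
    # Calculate the mean
    mean = sum(population) / len(population)

    # Find the difference from the mean for each value
    differences = [value - mean for value in population]

    # Square the differences and add them up
    sum_of_squares = sum(d ** 2 for d in differences)

    # Divide by n-1 for a sample, or by N for a population (the default)
    if is_sample:
        variance = sum_of_squares / (len(population) - 1)
    else:
        variance = sum_of_squares / len(population)
    return variance
```

Note that this sketch does not guard against an empty sequence, or a single-value sequence with `is_sample=True`; both would raise a `ZeroDivisionError`.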
We can test the calculation using the sequence of numbers we worked through by hand.
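A quick check might look like the following. The function is restated in compact form so the snippet runs on its own, and the results are cross-checked against `pvariance` and `variance` from Python’s standard `statistics` module. That cross-check is an addition for illustration, not part of the original walkthrough.

```python
from statistics import pvariance, variance

def calculate_variance(population, is_sample=False):
    # Compact restatement of the function described above
    mean = sum(population) / len(population)
    sum_of_squares = sum((value - mean) ** 2 for value in population)
    if is_sample:
        return sum_of_squares / (len(population) - 1)
    return sum_of_squares / len(population)

values = [8, 6, 12, 3, 13, 9]

population_result = calculate_variance(values)
sample_result = calculate_variance(values, is_sample=True)

print(round(population_result, 3))  # 11.583, matching the hand calculation
print(round(sample_result, 3))      # 13.9

# Cross-check against the standard library
print(round(pvariance(values), 3))  # 11.583
print(round(variance(values), 3))   # 13.9
```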
Finding Variance with Pandas
While we can write a function to calculate variance in fewer than 10 lines of code, there is an even easier way to find variance. You can do it in one line of code with Pandas. Let’s load up some data and run through a real-life example of finding variance.
Load sample data
The Pandas example uses the BMW Price Challenge dataset from Kaggle, which is free to download. Start by importing the Pandas library and then read the CSV file into a Pandas dataframe.
We can count the number of rows in the dataset and display the first five rows to make sure everything loaded correctly.
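A sketch of the loading step is shown below. It is wrapped in a small helper, `load_and_preview`, which is a hypothetical name introduced here. The CSV file name in the commented usage line is also an assumption, so adjust it to match the file you downloaded.

```python
import pandas as pd

def load_and_preview(csv_source):
    """Read a CSV into a dataframe, then print its row count and first five rows."""
    df = pd.read_csv(csv_source)
    print(len(df))    # number of rows in the dataset
    print(df.head())  # first five rows
    return df

# Hypothetical usage; the file name is an assumption:
# df = load_and_preview("bmw_pricing_challenge.csv")
```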
Finding the variance for the BMW data
Since the BMW dataset has 4,843 rows, it wouldn’t be fun to calculate the variance by hand. Instead, we can simply pass a column of the dataframe into our calculate_variance function and return the variance. Let’s find the variance for the numerical columns mileage, engine_power, and price.
Using Pandas’ var() function
In case we forget the variance calculation and can’t write our own function, Pandas has a built-in function to calculate variance called var(). By default it treats the data as a sample and uses n-1 in the calculation (ddof=1); however, you can switch to the population calculation by passing the argument ddof=0.
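The effect of `ddof` is easy to see on the six values from the hand calculation. The sketch below uses a one-column dataframe built from those values, not the BMW data, and the column name `value` is an arbitrary choice.

```python
import pandas as pd

# The six example values from the hand calculation, not the BMW dataset
df = pd.DataFrame({"value": [8, 6, 12, 3, 13, 9]})

# Default: ddof=1, so the sum of squares is divided by n-1 (sample variance)
sample_variance = df["value"].var()

# ddof=0 divides by N instead (population variance)
population_variance = df["value"].var(ddof=0)

print(round(sample_variance, 3))      # 13.9
print(round(population_variance, 3))  # 11.583

# Calling var() on the dataframe returns the variance of each numeric column
print(df.var(ddof=0))
```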
As we can see, the var() function matches the values produced by our calculate_variance function, and it’s just one line of code. Looking at the results, we can see that the mileage has a high variance, meaning the values tend to deviate a lot from the mean. That makes sense, because many factors play a role in the distance a person has to drive. In comparison, engine_power has a low variance, which indicates that the values do not deviate much from the mean.
Understanding variance can be an important part of data analytics and machine learning, as you can use it to assess group differences. Variance also affects which statistical tests can help us make data-driven decisions. High variance means that the values deviate widely from the mean, while low variance means that numbers are not widely dispersed from the mean. If we have a small set of values, it is possible to calculate the variance by hand in just five steps. For large data sets, we saw how easy it is to calculate variance using Python and Pandas. The var() function in Pandas calculates the variance for the numeric columns in a data frame in just one line of code, which is quite handy!