In today's digital world, data is everywhere - whether it's website traffic, social media activity, or sales numbers.
And with all that information pouring in, how do you make sense of it all?
Well, that’s where summary statistics come in because they help you cut through the chaos and quickly get to the insights that matter.
In this guide, I’m going to walk you through some basic summary statistics features using Python (one of the most popular programming languages for beginners and experts) to better understand your data and extract powerful insights.
This way, you can get an idea of what's possible with this tool and how it can help you uncover valuable patterns, spot trends in your business, and analyze market shifts. Not only that, but if you've been doing this with other tools, you'll see how simple it is to get the same information (or more) using a few lines of Python code.
Heck, we’ll even recap just what summary statistics are in case you’re a complete beginner and not sure how they can help you. So, are you ready to transform data chaos into actionable strategy?
Well, grab a coffee and a notepad and let’s dive in!
Sidenote: This guide is just a brief introduction to this topic. If you want to learn everything about summary statistics (as well as how to use Python for this), then check out my complete course:
No prior coding or math experience is required, as everything is taught from scratch inside the course. You’ll learn statistics from an industry expert (me) and even have fun!
And because one of the best ways to learn is by doing and applying, you’ll also build 6 different statistics-based projects and solidify your skills with 18 quizzes, practice tests, and challenges.
Plus you'll also learn how to utilize ChatGPT to work with statistics and conduct data analysis efficiently so you’re ahead of the curve!
With that out of the way, let’s get into this guide.
In very simple terms, summary statistics are numbers that help you quickly understand a big set of data. They tell you important things about the data without you having to look at every single number.
Summary statistics are powerful analytical tools that unlock the stories hidden within your data.
Whether you're sifting through vast amounts of commercial data, tuning sensors for cutting-edge technological applications, or balancing risk in volatile financial markets, summary statistics are your gateway to actionable insights.
For example
For data analysts working in retail environments, understanding key metrics like the mean, median, and mode isn't just about processing numbers - it’s about translating these figures into business strategies.
Whereas in the realm of finance, metrics like standard deviation and variance are not just statistical measures - they are essential for assessing market dynamics. These tools help quantify the volatility of asset prices, offering insights into the level of risk involved in investment portfolios.
Understanding these aspects can guide traders and investors in aligning their strategies with their risk tolerance and market outlook, aiding in the crafting of more robust investment portfolios.
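To make this concrete, here's a minimal sketch of how standard deviation captures volatility. The return figures below are entirely made up for illustration; they are not real market data.

```python
import pandas as pd

# Hypothetical daily returns (in %) for two assets -- illustrative values only
returns = pd.DataFrame({
    'StableAsset': [0.1, 0.2, -0.1, 0.15, 0.05],
    'VolatileAsset': [2.5, -3.0, 4.1, -2.2, 3.8],
})

# Standard deviation as a simple volatility measure:
# a larger value means wider swings around the mean return
print(returns.std())
```

The volatile asset's standard deviation comes out far larger, which is exactly the signal a risk-conscious investor would look at.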
By harnessing the power of summary statistics, professionals across multiple industries can enhance their understanding of complex datasets, leading to smarter decisions, optimized operations, and breakthrough innovations.
From my own personal perspective, summary statistics are THE starting point for any type of data analysis.
The good news is that we can use tools like Python to easily get this information for us.
So let’s look at some methods now…
To help explain Python and its features for summary statistics, let’s create a classic marketing dataset so that we can add some context.
This example dataset includes a week of daily records: the date, the temperature, the day's sales, and an average customer satisfaction rating.
Here’s a peek at what your data might look like as Python code.
Input:
import pandas as pd
import numpy as np
# Creating a DataFrame with various types of data
data = {
    'Date': pd.date_range(start='2024-01-01', periods=7),
    'Temperature': [78, 85, 74, 84, 79, 73, 77],
    'Sales': [234, 190, 302, 280, 310, 215, 275],
    'CustomerSatisfaction': [4.5, 3.8, 4.2, 4.0, 5.0, 3.5, 4.1]
}
df = pd.DataFrame(data)
df  # display all 7 rows (df.head() would only show the first 5)
Output:
Date Temperature Sales CustomerSatisfaction
0 2024-01-01 78 234 4.5
1 2024-01-02 85 190 3.8
2 2024-01-03 74 302 4.2
3 2024-01-04 84 280 4.0
4 2024-01-05 79 310 5.0
5 2024-01-06 73 215 3.5
6 2024-01-07 77 275 4.1
So now that we have that data ready to use, let’s break this all down and look at some basic summary statistics.
The good news is that Python offers several methods through the Pandas library to compute these statistics effortlessly:
count(): Provides the number of non-null observations in the series. Why it matters: it helps you understand the volume of data available, which is essential for assessing the reliability of your analysis.
mean(): Calculates the average of the series. Why it matters: the mean gives a central value of the data, helping you gauge overall performance (e.g., average sales or temperature).
std(): Returns the standard deviation of the series. Why it matters: standard deviation measures the amount of variation or dispersion in the data, indicating how spread out the values are around the mean.
min() and max(): Find the minimum and maximum of the series, respectively. Why it matters: knowing the extremes can help in understanding the range and identifying potential outliers.
50%: Also known as the median, it splits the data into two equal halves. Why it matters: the median provides a better central value when your data has outliers or is skewed.
25% and 75%: These are the first and third quartiles, dividing the data into four equal parts. Why it matters: quartiles help in understanding the spread and identifying the middle 50% of the data.
describe(): Provides all of these summary statistics at once. Why it matters: this method offers a quick overview of the data's main characteristics, saving time and providing a comprehensive summary.
For example
We can find all of these basic summary statistics with just a single piece of code.
Input:
# Compute and print basic summary statistics for numerical data
df.describe()
And here’s the output of that code, showing all of the basic summary statistics for our example dataset.
Output:
Temperature Sales CustomerSatisfaction
count 7.000000 7.000000 7.000000
mean 78.571429 258.000000 4.157143
std 4.577377 45.574115 0.485994
min 73.000000 190.000000 3.500000
25% 75.500000 224.500000 3.900000
50% 78.000000 275.000000 4.100000
75% 81.500000 291.000000 4.350000
max 85.000000 310.000000 5.000000
How easy was that!
These statistics provide a quick overview of the dataset, illustrating the central tendencies and variability of temperature, sales, and customer satisfaction over the week.
Now that we have this table, let’s take a look at the information we’re given and how it can be used:
The average temperature over the week was approximately 78.57 degrees. The temperatures ranged from a low of 73 degrees to a high of 85 degrees, with a standard deviation of about 4.58.
How is this relevant?
Well, this information could help businesses that are weather-dependent to plan their inventory and marketing strategies. For instance, higher temperatures might correlate with increased sales of cold beverages.
The average sales figure was $258. The minimum recorded sales were $190, and the maximum was $310, with the middle value (median) being $275. The sales data shows a standard deviation of 45.57.
How is this relevant?
Understanding sales patterns helps in inventory management, forecasting future sales, and identifying peak performance days. If you notice certain days with higher sales, you can investigate what factors contributed to this and replicate the strategy.
On a scale (likely from 1 to 5), the average customer satisfaction rating was 4.16. The ratings ranged from a low of 3.5 to a high of 5.0. The median rating was 4.1, and the quartile range shows that ratings were generally above 3.9, with a standard deviation of 0.49.
How is this relevant?
High customer satisfaction ratings suggest successful customer service practices, while lower ratings can indicate areas needing improvement. Tracking this over time helps in maintaining high service standards and improving customer retention.
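One quick way to put the weather-and-sales hunch from earlier to the test is Pandas' built-in correlation method. Here's a sketch using the same example data (the `corr` variable name is just mine):

```python
import pandas as pd

# Recreate the example dataset from earlier
df = pd.DataFrame({
    'Temperature': [78, 85, 74, 84, 79, 73, 77],
    'Sales': [234, 190, 302, 280, 310, 215, 275],
})

# Pearson correlation between temperature and sales:
# values near +1 or -1 indicate a strong linear relationship, near 0 a weak one
corr = df['Temperature'].corr(df['Sales'])
print(f"Temperature-Sales correlation: {corr:.2f}")
```

For this toy dataset the correlation comes out mildly negative (about -0.22), which lines up with the observation below that sales dipped on the hottest day. With only 7 data points, though, treat any such number as a prompt for further investigation rather than a conclusion.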
Understanding these summary statistics is crucial because they help you make informed decisions based on data.
Here are a few practical ways you can use this information, from planning inventory around weather patterns to investigating days with unusually high or low sales.
We might even want to collect more data to delve deeper into these summaries.
For example
By cross-referencing the initial table with the summaries, we can identify patterns that warrant further investigation. For instance, on the hottest day, sales were down but so was customer satisfaction.
In this case, we would want more information.
By getting the summary, we can drill down for more information. For instance, it could be something as simple as the air conditioning being broken on the hottest day, which then got fixed for the following days, explaining why sales and satisfaction improved.
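Pulling out that hottest day takes just one line with Pandas' idxmax(). Here's a sketch against the same example data (the `hottest` variable name is mine):

```python
import pandas as pd

# Recreate the example dataset from earlier
df = pd.DataFrame({
    'Date': pd.date_range(start='2024-01-01', periods=7),
    'Temperature': [78, 85, 74, 84, 79, 73, 77],
    'Sales': [234, 190, 302, 280, 310, 215, 275],
    'CustomerSatisfaction': [4.5, 3.8, 4.2, 4.0, 5.0, 3.5, 4.1],
})

# idxmax() returns the index label of the maximum value,
# and .loc pulls the full row for that day
hottest = df.loc[df['Temperature'].idxmax()]
print(hottest)
```

The returned row (85 degrees, $190 in sales, a 3.8 satisfaction rating) shows that the hottest day recorded both the week's lowest sales and its lowest satisfaction score, exactly the pattern worth drilling into.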
As you can see, it’s incredibly easy to pull basic summary statistics from large datasets with Python. These summaries provide invaluable insights that help in decision-making, optimizing operations, and improving customer satisfaction.
Dealing with numerical data is a common task in data analysis. Python provides a variety of tools to summarize this data effectively, helping us understand its distribution and central tendencies.
Here are several examples of how you might use summary statistics in a business analytics context:
If we wanted to find the average temperature of the 7 days in our dataset, we could simply use this code.
Input:
# Average temperature
avg_temp = df['Temperature'].mean()
print(f"The average temperature is: {avg_temp:.2f} degrees.")
Output:
The average temperature is: 78.57 degrees.
Knowing the average temperature helps businesses, especially those in retail or hospitality, to plan their operations and inventory more effectively.
If we wanted to calculate the combined total sales, we could use this code.
Input:
# Total sales
total_sales = df['Sales'].sum()
print(f"Total sales for the week: ${total_sales}")
Output:
Total sales for the week: $1806
Total sales figures are crucial for assessing overall performance, setting targets, and making financial forecasts.
To find the median customer satisfaction rating:
Input:
# Median Customer Satisfaction
median_cs = df['CustomerSatisfaction'].median()
print(f"Median customer satisfaction rating: {median_cs}")
Output:
Median customer satisfaction rating: 4.1
The median satisfaction score helps identify the typical customer experience, guiding improvements in service quality.
The .agg() method in Python's Pandas library significantly enhances the flexibility of data analysis. It allows for the simultaneous calculation of multiple statistics across different DataFrame columns, streamlining the process of data summarization.
For example
Input:
# Example of using the .agg() method to apply multiple functions to columns
print("\nDetailed Statistics with .agg():")
detailed_stats = df.agg({
    'Temperature': ['mean', 'min', 'max'],
    'Sales': ['sum', 'mean'],
    'CustomerSatisfaction': ['median', 'std']  # std is standard deviation
})
print(detailed_stats)
Output:
Temperature Sales CustomerSatisfaction
mean 78.571429 258.0 NaN
min 73.000000 NaN NaN
max 85.000000 NaN NaN
sum NaN 1806.0 NaN
median NaN NaN 4.100000
std NaN NaN 0.485994
So what's happening here?
Well, in this example the .agg() method is employed to execute a variety of statistical functions on specific columns of a DataFrame.
For each specified column, a list of desired functions is applied: the mean, min, and max for Temperature; the sum and mean for Sales; and the median and standard deviation for CustomerSatisfaction.
This method is especially valuable when you need comprehensive summaries without running separate functions for each statistic.
It not only simplifies the coding required but also enhances the readability and efficiency of your data analysis workflows.
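As a side note, .agg() isn't limited to built-in names: it also accepts your own functions. Here's a small sketch (the lambda and the `stats` variable name are mine) mixing built-ins with a custom function that computes the sales range:

```python
import pandas as pd

df = pd.DataFrame({'Sales': [234, 190, 302, 280, 310, 215, 275]})

# .agg() accepts custom functions alongside the built-in string names;
# here a lambda computes the sales range (max minus min)
stats = df['Sales'].agg(['mean', 'min', 'max', lambda s: s.max() - s.min()])
print(stats)
```

For this data the mean is 258.0 and the range is 120 ($310 minus $190), all computed in a single call.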
Working with time-series or sequential data often involves understanding the cumulative effects over time. In our current example, this might mean tracking total sales, monitoring temperature trends, or analyzing performance metrics.
This is when we might use cumulative statistics, as it can offer valuable insights into the ongoing progression of data.
The good news is that the Pandas library provides straightforward functions for calculating these statistics, which can help reveal underlying patterns and trends in your data, so let’s take a look at some of them.
cumsum()
This function calculates the cumulative sum, useful for tracking total values over time. It also helps you understand the overall growth or accumulation of a value over a period.
It is particularly useful for financial data, inventory tracking, and sales performance.
Example: Tracking sales growth
Input:
df['Sales'].cumsum()
Output:
0 234
1 424
2 726
3 1006
4 1316
5 1531
6 1806
Name: Sales, dtype: int64
By calculating the cumulative sales like this, you can track how your revenue grows day by day. This is useful for setting sales targets, forecasting future performance, and identifying periods of rapid growth.
cummax()
This function helps in tracking the highest value recorded over a period. It is useful for performance benchmarking and setting targets.
Example: Monitoring the maximum temperature
Input:
df['Temperature'].cummax()
Output:
0 78
1 85
2 85
3 85
4 85
5 85
6 85
Name: Temperature, dtype: int64
Tracking the cumulative maximum temperature helps in understanding the extreme conditions over time.
This can be crucial for climate studies, resource planning, and ensuring that products and services are optimized for peak conditions.
cummin()
This function helps in identifying the lowest value recorded over a period. It is useful for risk management and setting performance floors.
Example: Monitoring the minimum temperature
Input:
df['Temperature'].cummin()
Output:
0 78
1 78
2 74
3 74
4 74
5 73
6 73
Name: Temperature, dtype: int64
Tracking the cumulative minimum temperature helps in understanding the lowest conditions over time.
This can be essential for planning resources, managing risks, and ensuring that systems are prepared for the lowest possible performance thresholds.
cumprod()
This function is useful for understanding compound growth over time, such as in finance where interest or returns compound.
Example: Measuring the compound growth in customer satisfaction
Input:
df['CustomerSatisfaction'].cumprod() # Assuming values are suitable for multiplication
Output:
0 4.50
1 17.10
2 71.82
3 287.28
4 1436.40
5 5027.40
6 20612.34
Name: CustomerSatisfaction, dtype: float64
While this example might not be practical for customer satisfaction, it demonstrates the capability of cumulative functions.
(In finance, cumulative product calculations can help understand compound interest or growth over time, which is essential for investment planning and strategy development).
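To sketch that financial use case, here's a minimal example with made-up daily returns (not real market data): converting each return to a growth factor of 1 + r and compounding with cumprod() gives the cumulative growth at each step.

```python
import pandas as pd

# Hypothetical daily returns of 1%, 2%, -1%, 3% -- illustrative values only
returns = pd.Series([0.01, 0.02, -0.01, 0.03])

# Convert each return to a growth factor (1 + r), then compound with cumprod();
# the final entry is the total growth over the whole period
growth = (1 + returns).cumprod()
print(growth)
```

The last value works out to roughly 1.0505, i.e. about a 5.05% total gain over the four periods, which is slightly more than the 5% you'd get by simply adding the returns, and that difference is the effect of compounding.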
Let's apply these functions to a more complex dataset that includes temperature, sales, and customer satisfaction data to see how these statistics evolve over time.
Input:
# Apply cumulative functions
df['Cumulative Sales'] = df['Sales'].cumsum()
df['Cumulative Max Temp'] = df['Temperature'].cummax()
df['Cumulative Min Temp'] = df['Temperature'].cummin()
df['Cumulative Product Satisfaction'] = df['CustomerSatisfaction'].cumprod()
# Display the DataFrame with cumulative statistics
print("\nDataFrame with Cumulative Statistics:")
print(df[['Sales', 'Cumulative Sales', 'Temperature', 'Cumulative Max Temp', 'Cumulative Min Temp', 'CustomerSatisfaction', 'Cumulative Product Satisfaction']])
Output:
Sales Cumulative Sales Temperature Cumulative Max Temp \
0 234 234 78 78
1 190 424 85 85
2 302 726 74 85
3 280 1006 84 85
4 310 1316 79 85
5 215 1531 73 85
6 275 1806 77 85
Cumulative Min Temp CustomerSatisfaction Cumulative Product Satisfaction
0 78 4.5 4.50
1 78 3.8 17.10
2 74 4.2 71.82
3 74 4.0 287.28
4 74 5.0 1436.40
5 73 3.5 5027.40
6 73 4.1 20612.34
These cumulative measures provide a dynamic view of the data, showing how totals, extremes, and compounded values evolve day by day.
As you can see, learning to use summary statistics with Python can be incredibly beneficial if you’re looking to start analyzing your data and form insights from it.
I know we’ve only scratched the surface here, but you should now have a much better idea of what's possible and how easy it can be to use.
And don’t worry if you can't follow everything. Remember, the path to becoming a proficient data analyst or data scientist is a marathon, not a sprint. By continually applying these techniques and expanding your knowledge, you’ll become more adept at uncovering the stories your data has to tell.
Stick with it and you’ll understand your data and your business far better now and in the future!
Remember, if you want to learn everything about summary statistics (as well as how to use Python), then check out my complete course!
No prior coding or math experience is required, as everything is taught from scratch inside the course. You’ll learn statistics from an industry expert (me) and even have fun!
And because one of the best ways to learn is by doing and applying, you’ll also build 6 different statistics-based projects and solidify your skills with 18 quizzes, practice tests, and challenges. Plus you'll also learn how to utilize ChatGPT to work with statistics and conduct data analysis efficiently so you’re ahead of the curve!
And as an added bonus?
When you take this course and join the Zero To Mastery Academy, you’ll also have access to every data analytics course that we cover, as well as access to our private Discord server.
Here you can ask questions of me directly (or any teacher) as well as fellow students and working professionals!