An Introduction To Summary Statistics In Python (With Code Examples)

In today's digital world, data is everywhere - whether it's website traffic, social media activity, or sales numbers.

And with all that information pouring in, how do you make sense of it all?

What does it mean?
What's relevant?
What’s important?
And how can we use this data?

Well, that’s where summary statistics come in. They help you cut through the chaos and quickly get to the insights that matter.

In this guide, I’m going to walk you through some basic summary statistics features using Python (one of the most popular programming languages for beginners and experts) to better understand your data and extract powerful insights.

This way you can get an idea of what's possible with this tool, and how it can help you uncover valuable patterns and spot trends in your business, such as market shifts. Not only that, but if you’ve been doing this with other tools, then you’ll see how simple it is to get the same information (or more) using a few lines of Python code.

Heck, we’ll even recap just what summary statistics are, in case you’re a complete beginner and not sure how they can help you. So, are you ready to transform data chaos into actionable strategy?

Well, grab a coffee and a notepad and let’s dive in!

Sidenote: This guide is just a brief introduction to this topic. If you want to learn everything about summary statistics (as well as how to use Python for this), then check out my complete course.

No prior coding or math experience is required, as everything is taught from scratch inside the course. You’ll learn statistics from an industry expert (me) and even have fun!

And because one of the best ways to learn is by doing and applying, you also build 6 different statistics-based projects, as well as solidify your skills with 18 quizzes, practice tests, and challenges.

Plus you'll also learn how to utilize ChatGPT to work with statistics and conduct data analysis efficiently so you’re ahead of the curve!

With that out of the way, let’s get into this guide.

What are summary statistics?

In very simple terms, summary statistics are numbers that help you quickly understand a big set of data. They tell you important things about the data without you having to look at every single number.

They make data simpler: They shrink a lot of numbers down to just a few important ones, making it easier to see what's going on.
They help you to spot patterns: They help you see general trends, like what the average is, how spread out the numbers are, and if there are any unusual numbers.
They help you to more easily compare: They let you compare different groups of data quickly. For example, you can compare the average test scores of two different classes.
They help you make business decisions: They give you useful information to help make choices based on the data. For instance, a company might use summary statistics to decide what products to sell more of.
They help you to more easily explain results: They make it easier to explain the data to others, especially if they don't want to see all the details.
They help you to find outliers: They help you find numbers that are much higher or lower than the rest, which might need more attention.

Summary statistics are powerful analytical tools that unlock the stories hidden within your data.

Whether you're sifting through vast amounts of commercial data, tuning sensors for cutting-edge technological applications, or balancing risk in volatile financial markets, summary statistics are your gateway to actionable insights.

For example

For data analysts working in retail environments, understanding key metrics like the mean, median, and mode isn't just about processing numbers - it’s about translating these figures into business strategies.

These statistics could reveal the average spend per customer, helping you tailor marketing campaigns more effectively
You could identify the average spend, then find the type of customer with the highest average spend, and then look for the common trends behind those purchases so that you can lean into them
It can be even simpler than that though. The median purchase can show typical consumer behavior, while identifying the mode can help stock the most popular products more efficiently, ensuring you meet customer demand without overstocking

In finance, meanwhile, metrics like standard deviation and variance are not just statistical measures - they are essential for assessing market dynamics. These tools help quantify the volatility of asset prices, offering insights into the level of risk involved in investment portfolios.

Understanding these aspects can guide traders and investors in aligning their strategies with their risk tolerance and market outlook, aiding in the crafting of more robust investment portfolios.

TL;DR

By harnessing the power of summary statistics, professionals across multiple industries can enhance their understanding of complex datasets, leading to smarter decisions, optimized operations, and breakthrough innovations.

From my own personal perspective, summary statistics is THE starting point for any type of data analysis:

You get the data
You compute summary statistics to get to know the basics of that data
Then you use data visualization for a complete 360 assessment of the data
And then take action from that data

The good news is that we can use tools like Python to easily get this information for us.

So let’s look at some methods now…

Introduction to summary statistics and their methods in Python

To help explain Python and its features for summary statistics, let’s create a classic marketing dataset so that we can add some context.

This example dataset includes:

Dates
Sales figures for each day
The temperatures on each day, and
Customer satisfaction scores on those days

Here’s a peek at what your data might look like as Python code.

Input:

import pandas as pd
import numpy as np

# Creating a DataFrame with various types of data
data = {
    'Date': pd.date_range(start='2024-01-01', periods=7),
    'Temperature': [78, 85, 74, 84, 79, 73, 77],
    'Sales': [234, 190, 302, 280, 310, 215, 275],
    'CustomerSatisfaction': [4.5, 3.8, 4.2, 4.0, 5.0, 3.5, 4.1]
}

df = pd.DataFrame(data)
df.head()

Output:

         Date  Temperature  Sales  CustomerSatisfaction
0  2024-01-01           78    234                   4.5
1  2024-01-02           85    190                   3.8
2  2024-01-03           74    302                   4.2
3  2024-01-04           84    280                   4.0
4  2024-01-05           79    310                   5.0
5  2024-01-06           73    215                   3.5
6  2024-01-07           77    275                   4.1

So now that we have that data ready to use, let’s break this all down and look at some basic summary statistics.

The good news is that Python offers several methods through the Pandas library to compute these statistics effortlessly:

count(): This provides the number of non-null observations in the series. Why it matters: It helps you understand the volume of data available, which is essential for assessing the reliability of your analysis
mean(): It calculates the average of the series. Why it matters: The mean gives a central value of the data, helping you gauge overall performance (e.g., average sales or temperature)
std(): This method returns the standard deviation of the series. Why it matters: Standard deviation measures the amount of variation or dispersion in the data, indicating how spread out the values are around the mean
min() and max(): These methods are used to find the minimum and maximum of the series respectively. Why it matters: Knowing the extremes can help in understanding the range and identifying potential outliers
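50%: This row of the describe() output is the 50th percentile, also known as the median, which splits the sorted data into two equal halves. You can compute it directly with median() or quantile(0.5). Why it matters: The median provides a better central value when your data has outliers or is skewed
25% and 75%: These rows of the describe() output are the first and third quartiles, which you can compute directly with quantile(0.25) and quantile(0.75). Together with the median, they divide the sorted data into four equal parts. Why it matters: Quartiles help in understanding the spread and identifying the middle 50% of the data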
describe(): It provides the summary statistics all at once. Why it matters: This method offers a quick overview of the data's main characteristics, saving time and providing a comprehensive summary
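
As a rough sketch, each of the statistics listed above can also be computed on its own. The snippet below rebuilds the example sales column so that it runs independently, and the variable names are purely illustrative. One detail worth knowing: std() in pandas returns the sample standard deviation (ddof=1) by default, so the ddof=0 line is included only for comparison.

```python
import pandas as pd

# Same sales figures as the example dataset
sales = pd.Series([234, 190, 302, 280, 310, 215, 275], name='Sales')

n_obs = sales.count()                # number of non-null observations
average = sales.mean()               # arithmetic mean
sample_std = sales.std()             # sample standard deviation (ddof=1, the pandas default)
population_std = sales.std(ddof=0)   # population standard deviation, for comparison
lowest, highest = sales.min(), sales.max()

# The 25%, 50%, and 75% rows of describe() come from quantile()
q1, median, q3 = sales.quantile([0.25, 0.5, 0.75])

print(n_obs, average, round(sample_std, 2), lowest, highest, q1, median, q3)
```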

For example

We can find all of these basic summary statistics with just a single line of code.

Input:

# Compute and print basic summary statistics for numerical data
df.describe()

And here’s the output of that code, showing all of the basic summary statistics for our example dataset.

Output:

              Temperature       Sales  CustomerSatisfaction
count           7.000000    7.000000              7.000000
mean           78.571429  258.000000              4.157143
std             4.577377   45.574115              0.485994
min            73.000000  190.000000              3.500000
25%            75.500000  224.500000              3.900000
50%            78.000000  275.000000              4.100000
75%            81.500000  291.000000              4.350000
max            85.000000  310.000000              5.000000

How easy was that!

These statistics provide a quick overview of the dataset, illustrating the central tendencies and variability of temperature, sales, and customer satisfaction over the week.
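
One caveat: depending on your pandas version, describe() may also summarize the Date column alongside the numeric ones, because recent releases treat datetime columns as numeric by default. A minimal sketch, assuming you only want the three numeric columns, is to restrict the summary explicitly with the include parameter:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Date': pd.date_range(start='2024-01-01', periods=7),
    'Temperature': [78, 85, 74, 84, 79, 73, 77],
    'Sales': [234, 190, 302, 280, 310, 215, 275],
    'CustomerSatisfaction': [4.5, 3.8, 4.2, 4.0, 5.0, 3.5, 4.1],
})

# Summarize only the numeric columns, whatever the pandas version
numeric_summary = df.describe(include=[np.number])
print(numeric_summary.round(2))
```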

Getting insights from our summary statistics

Now that we have this table, let’s take a look at the information we’re given and how it can be used:

Temperature

The average temperature over the week was approximately 78.57 degrees. The temperatures ranged from a low of 73 degrees to a high of 85 degrees, with a standard deviation of about 4.58.

How is this relevant?

Well, this information could help businesses that are weather-dependent to plan their inventory and marketing strategies. For instance, higher temperatures might correlate with increased sales of cold beverages.

Sales

The average sales figure was $258. The minimum recorded sales were $190, and the maximum was $310, with the middle value (median) being $275. Because the median sits above the mean, a few low-sales days are pulling the average down. The sales data shows a standard deviation of about $45.57.

How is this relevant?

Understanding sales patterns helps in inventory management, forecasting future sales, and identifying peak performance days. If you notice certain days with higher sales, you can investigate what factors contributed to this and replicate the strategy.

Customer Satisfaction

On a scale (likely from 1 to 5), the average customer satisfaction rating was 4.16. The ratings ranged from a low of 3.5 to a high of 5.0. The median rating was 4.1, and the first quartile (3.9) shows that roughly three quarters of the ratings were at or above 3.9. The standard deviation was 0.49.

How is this relevant?

High customer satisfaction ratings suggest successful customer service practices, while lower ratings can indicate areas needing improvement. Tracking this over time helps in maintaining high service standards and improving customer retention.

Understanding these summary statistics is crucial because they help you make informed decisions based on data.
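
To put numbers on the spread described in this section, here is a small sketch that derives the range (max minus min) and the interquartile range (75% minus 25%) from the describe() table. The variable names are illustrative, and the DataFrame is rebuilt without the Date column so that the snippet runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    'Temperature': [78, 85, 74, 84, 79, 73, 77],
    'Sales': [234, 190, 302, 280, 310, 215, 275],
    'CustomerSatisfaction': [4.5, 3.8, 4.2, 4.0, 5.0, 3.5, 4.1],
})

summary = df.describe()

# Range: distance between the smallest and largest value in each column
value_range = summary.loc['max'] - summary.loc['min']

# Interquartile range: width of the middle 50% of the data
iqr = summary.loc['75%'] - summary.loc['25%']

print(pd.DataFrame({'range': value_range, 'iqr': iqr}))
```

A small interquartile range relative to the full range, as with the satisfaction scores here, suggests that most values sit close together and only a few days stretch the extremes.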

Applying the initial insights from our summary statistics

Here are a few practical ways you can use this information:

Investigate Lower Sales: If the average sales are lower on certain days, you might want to investigate why and take corrective actions, such as offering promotional deals or improving customer service on those days
Maintain Customer Satisfaction: Consistent customer satisfaction scores indicate a stable customer experience. However, if you notice a decline, you can immediately address the issues causing dissatisfaction, whether it's related to product quality, service efficiency, or environmental factors
Plan for Seasonal Changes: Knowing temperature trends helps businesses plan for seasonal changes. For instance, a café might stock more iced beverages during hotter days and adjust marketing campaigns accordingly to boost sales

Example scenario and further investigation

We might even want to collect more data to delve deeper into these summaries.

For example

If we look at the initial table again and cross-reference it with the summaries, we can see that on the hottest day (85 degrees), sales were at their lowest ($190) and customer satisfaction (3.8) was below the weekly average.

In this case, we would want more information:

Why were sales down? Was there less foot traffic through the door that day, or was it the same as normal?
What's the product? Is it a café selling hot drinks, and thus seeing less demand on the hottest day? Can we introduce iced coffees?
Why was customer satisfaction down? Was the service quality lower than usual? Were staff off sick?

By getting the summary, we can drill down for more information. For instance, it could be something as simple as the air conditioning being broken on the hottest day, which then got fixed for the following days, explaining why sales and satisfaction improved.
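
The cross-referencing described here can be sketched in a few lines. This is one possible approach rather than the only one: it uses idxmax() to locate the hottest day and then compares that day's figures with the weekly averages. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Date': pd.date_range(start='2024-01-01', periods=7),
    'Temperature': [78, 85, 74, 84, 79, 73, 77],
    'Sales': [234, 190, 302, 280, 310, 215, 275],
    'CustomerSatisfaction': [4.5, 3.8, 4.2, 4.0, 5.0, 3.5, 4.1],
})

# Row label of the day with the highest temperature
hottest_idx = df['Temperature'].idxmax()
hottest_day = df.loc[hottest_idx]

# How far that day's figures sit from the weekly averages
sales_gap = hottest_day['Sales'] - df['Sales'].mean()
satisfaction_gap = hottest_day['CustomerSatisfaction'] - df['CustomerSatisfaction'].mean()

print(hottest_day['Date'].date(), sales_gap, round(satisfaction_gap, 2))
```

Negative gaps for both columns confirm that the hottest day underperformed the weekly average, which is the cue to go and collect more context.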

TL;DR

As you can see, it’s incredibly easy to pull basic summary statistics from large datasets with Python. These summaries provide invaluable insights that help in decision-making, optimizing operations, and improving customer satisfaction.

How to summarize numerical data in Python

Dealing with numerical data is a common task in data analysis. Python provides a variety of tools to summarize this data effectively, helping us understand its distribution and central tendencies.

Here are several examples of how you might use summary statistics in a business analytics context:

Example 1: Finding the average temperature over the dataset

If we wanted to find the average temperature of the 7 days in our dataset, we could simply use this code.

Input:

# Average temperature
avg_temp = df['Temperature'].mean()
print(f"The average temperature is: {avg_temp:.2f} degrees.")

Output:

The average temperature is: 78.57 degrees.

Knowing the average temperature helps businesses, especially those in retail or hospitality, to plan their operations and inventory more effectively.

Example 2: Finding the total sales over the week

If we wanted to calculate the combined total sales, we could use this code.

Input:

# Total sales
total_sales = df['Sales'].sum()
print(f"Total sales for the week: ${total_sales}")

Output:

Total sales for the week: $1806

Total sales figures are crucial for assessing overall performance, setting targets, and making financial forecasts.

Example 3: Finding the median customer satisfaction rating for the same week

To find the median customer satisfaction rating:

Input:

# Median Customer Satisfaction
median_cs = df['CustomerSatisfaction'].median()
print(f"Median customer satisfaction rating: {median_cs}")

Output:

Median customer satisfaction rating: 4.1

The median satisfaction score helps identify the typical customer experience, guiding improvements in service quality.
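
The introduction also mentioned the mode, which the example dataset has no natural column for. As a sketch only, the snippet below uses a made-up list of products sold (hypothetical data, not part of the dataset above) to show how mode() finds the most frequent value.

```python
import pandas as pd

# Hypothetical product sales, invented purely for illustration
products_sold = pd.Series(['latte', 'espresso', 'latte', 'tea', 'latte', 'espresso'])

# mode() returns a Series because several values can tie; take the first entry
most_popular = products_sold.mode().iloc[0]

# value_counts() shows how often each product appears
counts = products_sold.value_counts()

print(f"Most popular product: {most_popular}")
print(counts)
```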

Introduction to the .agg() method

The .agg() method in Python's pandas library significantly enhances the flexibility of data analysis.

It allows for the simultaneous calculation of multiple statistics across different DataFrame columns, streamlining the process of data summarization.

For example

Input:

# Example of using the .agg() method to apply multiple functions to columns
print("\nDetailed Statistics with .agg():")
detailed_stats = df.agg({
    'Temperature': ['mean', 'min', 'max'],
    'Sales': ['sum', 'mean'],
    'CustomerSatisfaction': ['median', 'std']  # std is standard deviation
})
print(detailed_stats)

Output:

        Temperature   Sales  CustomerSatisfaction
mean      78.571429   258.0                   NaN
min       73.000000     NaN                   NaN
max       85.000000     NaN                   NaN
sum             NaN  1806.0                   NaN
median          NaN     NaN              4.100000
std             NaN     NaN              0.485994

So what's happening here?

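Well, in this example the .agg() method applies a different set of statistical functions to each column of the DataFrame. The result has one row per function, and a cell shows NaN wherever that function was not requested for that column.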

For each specified column, a list of desired functions is applied:

Temperature: Here, the method is calculating the mean, minimum, and maximum, providing a quick snapshot of temperature variations
Sales: Here, it’s computing both the total (sum) and average (mean) sales, offering insights into overall performance as well as average outcomes
Customer Satisfaction: Here, it’s determining the median to understand the central tendency and the standard deviation to gauge the spread of customer feedback

This method is especially valuable when you need comprehensive summaries without running separate functions for each statistic.

It not only simplifies the coding required but also enhances the readability and efficiency of your data analysis workflows.
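
If the NaN cells in the output above are distracting, one option is to call .agg() on a single column, which returns only the statistics requested for that column. A minimal sketch using the same sales figures:

```python
import pandas as pd

sales = pd.Series([234, 190, 302, 280, 310, 215, 275], name='Sales')

# Passing a list of function names to Series.agg returns one value per function
sales_stats = sales.agg(['sum', 'mean', 'max'])
print(sales_stats)
```

The result is a Series indexed by function name, so individual values can be read with, for example, sales_stats['sum'].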

Understanding Cumulative Sum, Maximum, Minimum, and Product

Working with time-series or sequential data often involves understanding the cumulative effects over time. In our current example, this might mean tracking total sales, monitoring temperature trends, or analyzing performance metrics.

This is when we might use cumulative statistics, as they can offer valuable insights into the ongoing progression of data.

The good news is that the Pandas library provides straightforward functions for calculating these statistics, which can help reveal underlying patterns and trends in your data, so let’s take a look at some of them.

Key cumulative statistics functions in Python, and their applications

`cumsum()`

This function calculates the cumulative sum, useful for tracking total values over time. It also helps you understand the overall growth or accumulation of a value over a period.

It is particularly useful for financial data, inventory tracking, and sales performance.

Example: Tracking sales growth

Input:

df['Sales'].cumsum()

Output:

0     234
1     424
2     726
3    1006
4    1316
5    1531
6    1806
Name: Sales, dtype: int64

By calculating the cumulative sales like this, you can track how your revenue grows day by day. This is useful for setting sales targets, forecasting future performance, and identifying periods of rapid growth.
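
Building on the cumulative sum, here is a short sketch of how you might track progress toward a target. It assumes the target is simply the weekly total, which is an illustrative choice rather than anything prescribed by the dataset.

```python
import pandas as pd

sales = pd.Series([234, 190, 302, 280, 310, 215, 275], name='Sales')

cumulative_sales = sales.cumsum()

# Fraction of the weekly total reached by the end of each day
share_of_week = cumulative_sales / sales.sum()

# First row label on which at least half of the weekly total was reached
halfway_day = (share_of_week >= 0.5).idxmax()

print(share_of_week.round(3))
print(f"Half of the weekly sales were reached on day index {halfway_day}")
```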

`cummax()`

This function returns the cumulative maximum: at each point, the highest value recorded so far. It is useful for performance benchmarking and setting targets.

Example: Monitoring the maximum temperature

Input:

df['Temperature'].cummax()

Output:

0    78
1    85
2    85
3    85
4    85
5    85
6    85
Name: Temperature, dtype: int64

Tracking the cumulative maximum temperature helps in understanding the extreme conditions over time.

This can be crucial for climate studies, resource planning, and ensuring that products and services are optimized for peak conditions.

`cummin()`

This function returns the cumulative minimum: at each point, the lowest value recorded so far. It is useful for risk management and setting performance floors.

Example: Monitoring the minimum temperature

Input:

df['Temperature'].cummin()

Output:

0    78
1    78
2    74
3    74
4    74
5    73
6    73
Name: Temperature, dtype: int64

Tracking the cumulative minimum temperature helps in understanding the lowest conditions over time.

This can be essential for planning resources, managing risks, and ensuring that systems are prepared for the lowest possible performance thresholds.
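
The running maximum and minimum become more useful when combined. The sketch below, with illustrative variable names, computes the width of the temperature band seen so far and flags the days on which the temperature matched the running maximum.

```python
import pandas as pd

temperature = pd.Series([78, 85, 74, 84, 79, 73, 77], name='Temperature')

running_high = temperature.cummax()
running_low = temperature.cummin()

# Width of the temperature band observed up to and including each day
running_range = running_high - running_low

# Days on which the temperature equalled the running maximum (a new or tied high)
new_high_days = temperature.index[temperature == running_high].tolist()

print(running_range.tolist())
print(new_high_days)
```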

`cumprod()`

This function calculates the cumulative product, multiplying each value by all of the values before it. It is useful for understanding compound growth over time, such as in finance where interest or returns compound.

Example: Measuring the compound growth in customer satisfaction

Input:

df['CustomerSatisfaction'].cumprod()  # Assuming values are suitable for multiplication

Output:

0        4.50
1       17.10
2       71.82
3      287.28
4     1436.40
5     5027.40
6    20612.34
Name: CustomerSatisfaction, dtype: float64

While this example might not be practical for customer satisfaction, it demonstrates the capability of cumulative functions.

(In finance, cumulative product calculations can help understand compound interest or growth over time, which is essential for investment planning and strategy development).
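
To make the finance case concrete, here is a minimal sketch of compounding with cumprod(). The daily returns and the $1,000 starting amount are hypothetical values invented for illustration, and they are not part of the example dataset.

```python
import pandas as pd

# Hypothetical daily returns, invented for illustration: +10%, -5%, +2%
daily_returns = pd.Series([0.10, -0.05, 0.02])

# Convert each return to a growth factor, then compound them with cumprod()
growth = (1 + daily_returns).cumprod()

# Value of a hypothetical $1,000 starting investment after each day
portfolio_value = 1000 * growth

print(portfolio_value.round(2).tolist())
```

Returns are converted to growth factors (1 + return) first because compounding multiplies factors rather than adding percentages.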

A practical example of cumulative functions using our multi-dimensional DataFrame

Let's apply all of these functions to our full DataFrame, which includes temperature, sales, and customer satisfaction data, to see how these statistics evolve over time.

Input:

# Apply cumulative functions
df['Cumulative Sales'] = df['Sales'].cumsum()
df['Cumulative Max Temp'] = df['Temperature'].cummax()
df['Cumulative Min Temp'] = df['Temperature'].cummin()
df['Cumulative Product Satisfaction'] = df['CustomerSatisfaction'].cumprod()

# Display the DataFrame with cumulative statistics
print("\nDataFrame with Cumulative Statistics:")
print(df[['Sales', 'Cumulative Sales', 'Temperature', 'Cumulative Max Temp', 'Cumulative Min Temp', 'CustomerSatisfaction', 'Cumulative Product Satisfaction']])

Output:

   Sales  Cumulative Sales  Temperature  Cumulative Max Temp  
0    234               234           78                   78   
1    190               424           85                   85   
2    302               726           74                   85   
3    280              1006           84                   85   
4    310              1316           79                   85   
5    215              1531           73                   85   
6    275              1806           77                   85   

   Cumulative Min Temp  CustomerSatisfaction  Cumulative Product Satisfaction  
0                   78                   4.5                             4.50  
1                   78                   3.8                            17.10  
2                   74                   4.2                            71.82  
3                   74                   4.0                           287.28  
4                   74                   5.0                          1436.40  
5                   73                   3.5                          5027.40  
6                   73                   4.1                         20612.34

These cumulative measures provide a dynamic view of the data:

Cumulative Sales: Helps in understanding the overall revenue growth over the period. This data can help businesses track growth, set future sales targets, and make informed decisions about inventory and marketing
Cumulative Max and Min Temperatures: Show the extremes that have been reached throughout the dataset, which can be crucial for climate studies or resource planning. Setting performance benchmarks based on these values allows businesses to measure current performance against historical highs and manage risks accordingly
Cumulative Product of Satisfaction: This column is included only to demonstrate cumprod(), since multiplying satisfaction ratings together has no practical meaning. The function becomes meaningful with growth factors, such as in finance, where understanding compound growth over time is essential for investment planning and strategy development

So what’s next?

As you can see, learning to use summary statistics with Python can be incredibly beneficial if you’re looking to start analyzing your data and form insights from it.

I know we’ve only scratched the surface here, but you now have a much better idea of what's possible, and how easy it can be to use.

And don’t worry if you can't follow everything. Remember, the path to becoming a proficient data analyst or data scientist is a marathon, not a sprint. By continually applying these techniques and expanding your knowledge, you’ll become more adept at uncovering the stories your data has to tell.

Stick with it and you’ll understand your data and your business far better now and in the future!

P.S.

Remember, if you want to learn everything about summary statistics (as well as how to use Python), then check out my complete course!

And as an added bonus?

When you take this course and join the Zero To Mastery Academy, you’ll also have access to every data analytics course that we cover, as well as access to our private Discord server.