In partnership with

Imagine you’re managing a department budget and you want to see how much money is earned per hour each week, grouped and compared across other important features like gender or education.

Most people think you need multiple Pandas functions like groupby; this is a viable solution, but you’ll need several other operations to get it formatted in the way you want it to.

But, there’s a better way: leveraging pivot_table

A pivot_table lets you take flat row-based data and reshape it into a summarized, structured table.

For example, looking at this data frame:

>>> data = {
    'age': [39, 50, 38, 53, 28,31],
    'education': ['Bachelors', 'Bachelors', 'HS-grad', 'Doctorate', 'Bachelors','Masters'],
    'gender': ['Male', 'Male', 'Male', 'Male', 'Female','Female'],
    'hours-per-week': [40, 13, 40, 40, 40, 30]
}

# Create the DataFrame
>>> df = pd.DataFrame(data)

   age  education  gender  hours-per-week
0   39  Bachelors    Male              40
1   50  Bachelors    Male              13
2   38    HS-grad    Male              40
3   53  Doctorate    Male              40
4   28  Bachelors  Female              40
5   31    Masters  Female              30

Let's suppose that we want to calculate the average hours worked per week based off of gender and age.

Without pivot_table, we have to perform the following steps:

>>> df["age"] = pd.cut(df["age"],[0,17,29,64,79])

>>> df.groupby(["gender","age"])["hours-per-week"].mean().unstack("age")

age     (0, 17]  (17, 29]  (29, 64]  (64, 79]
gender                                         
Female      NaN      40.0     30.00       NaN
Male        NaN       NaN     33.25       NaN

>>> df.groupby(["gender","age"])["hours-per-week"].mean().unstack("gender")

gender       Female   Male
age                    
(0, 17]      NaN    NaN
(17, 29]    40.0    NaN
(29, 64]    30.0  33.25
(64, 79]     NaN    NaN

Breaking this down operation-by-operation:

  • cut: This Pandas function divides the ages into different bins or age brackets to help us summarize the data. These bins are 0-17, 17-29, 29-64, and 64-79 years old.

  • groupby: This groups the data by age and gender bins based upon the average (mean) hours per week worked.

  • unstack: It helps us with visualization by converting a dimension into a column. If we leave off unstack, we’ll see the following after calling groupby:

>>> df.groupby(["gender","age"])["hours-per-week"].mean()
Female  (0, 17]       NaN
        (17, 29]    40.00
        (29, 64]    30.00
        (64, 79]      NaN
Male    (0, 17]       NaN
        (17, 29]      NaN
        (29, 64]    33.25
        (64, 79]      NaN
Name: hours-per-week, dtype: float64

But by using unstack, it’s much easier to read and interpret based off of gender:

>>> df.groupby(["gender","age"])["hours-per-week"].mean().unstack("gender")
gender    Female   Male
age                    
(0, 17]      NaN    NaN
(17, 29]    40.0    NaN
(29, 64]    30.0  33.25
(64, 79]     NaN    NaN

Looking at this syntax, it may not be so obvious and clear as to what we're doing and what the end result would look like. We can consolidate all of this code using pivot_table instead:

# Creating pivot table with 'gender' as index
>>> df.pivot_table(
    index='gender',
    columns='age',
    values='hours-per-week',
    aggfunc='mean'
)

age     (17, 29]  (29, 64]
gender                       
Female      40.0     30.00
Male         NaN     33.25

# Creating a pivot table with 'age' as index 
>>> gender = df.pivot_table(
    index='age',
    columns='gender',
    values='hours-per-week',
    aggfunc='mean'
)

gender       Female   Male
age                    
(17, 29]    40.0    NaN
(29, 64]    30.0  33.25

By consolidating the grouping, restructuring, and aggregation into a single, clean function call, your code instantly becomes more readable without having to use methods like groupby and unstack.

Breaking down the parameters used here:

  1. index: The columns from the original DataFrame you want to show as the primary labels down the rows of the new pivot table.

  2. columns: The columns’ unique values you want to show as the column headers in your new pivot table

  3. values: Numeric data you want to perform the aggregations on, hours-per-week in this case.

  4. aggfunc: The aggregation function you want to apply on the numeric data, which is the mean function for this example.

For a full breakdown of all of the parameters for pivot_table, visit the pandas documentation.

Protip: Use fill_value=0 as a pivot_table parameter to replace NaN with 0 for cleaner reports and better calculations.

Happy coding!

📧 Join the Python Snacks Newsletter! 🐍

Want even more Python-related content that’s useful? Here’s 3 reasons why you should subscribe the Python Snacks newsletter:

  1. Get Ahead in Python with bite-sized Python tips and tricks delivered straight to your inbox, like the one above.

  2. Exclusive Subscriber Perks: Receive a curated selection of up to 6 high-impact Python resources, tips, and exclusive insights with each email.

  3. Get Smarter with Python in under 5 minutes. Your next Python breakthrough could just an email away.

You can unsubscribe at any time.

Interested in starting a newsletter or a blog?

Do you have a wealth of knowledge and insights to share with the world? Starting your own newsletter or blog is an excellent way to establish yourself as an authority in your field, connect with a like-minded community, and open up new opportunities.

If TikTok, Twitter, Facebook, or other social media platforms were to get banned, you’d lose all your followers. This is why you should start a newsletter: you own your audience.

This article may contain affiliate links. Affiliate links come at no cost to you and support the costs of this blog. Should you purchase a product/service from an affiliate link, it will come at no additional cost to you.

Reply

or to participate

Keep Reading

No posts found