Groupby in Pandas | Lecture 5
Groupby in Pandas
The groupby operation in Pandas is a powerful tool for
analyzing and summarizing data. It allows you to split your data into groups
based on a column or index, apply a function to each group, and combine the
results.
1. Basic Syntax
grouped = df.groupby("column_name")
- groupby() groups the data based on
the values in the specified column(s).
- It returns a DataFrameGroupBy object, which you can
aggregate, transform, or iterate over.
2. Common Operations with groupby
a) Aggregation
Aggregation
functions like sum(), mean(), count(), etc.,
can be applied to grouped data.
Example:
import pandas as pd
# Creating a sample DataFrame
data = {
"Department": ["HR", "IT", "HR",
"IT", "Sales", "HR", "Sales"],
"Employee": ["Alice", "Bob",
"Charlie", "David", "Eva", "Frank",
"Grace"],
"Salary": [50000, 60000, 45000, 70000, 40000, 52000, 39000]
}
df = pd.DataFrame(data)
# Group by Department and calculate the total
Salary
grouped = df.groupby("Department")["Salary"].sum()
print(grouped)
Output:
Department
HR
147000
IT
130000
Sales
79000
Name: Salary, dtype: int64
b) Multiple Aggregations
Use agg() to apply multiple functions at once.
Example:
# Apply multiple aggregations
grouped = df.groupby("Department")["Salary"].agg(["mean",
"sum", "max"])
print(grouped)
Output:
mean sum max
Department
HR
49000.000 147000 52000
IT
65000.000 130000 70000
Sales
39500.000 79000 40000
c) Filtering Groups
Use filter() to keep groups based on a condition.
Example:
# Keep groups with total salary > 100000
filtered =
df.groupby("Department").filter(lambda x: x["Salary"].sum()
> 100000)
print(filtered)
d) Transforming Groups
Use transform() to apply a function to each group and return a
Series with the same index as the original DataFrame.
Example:
# Calculate the percentage of each salary within
its department
df["Salary_Percent"] = df.groupby("Department")["Salary"].transform(lambda
x: x / x.sum())
print(df)
e) Iterating Over Groups
You can
iterate through each group using a for loop.
Example:
# Iterate through groups
for dept, group in
df.groupby("Department"):
print(f"Department: {dept}")
print(group)
3. Grouping by Multiple Columns
You can
group by more than one column.
Example:
# Group by multiple columns
grouped = df.groupby(["Department",
"Employee"])["Salary"].sum()
print(grouped)
4. Grouping with Custom Functions
You can
group using a custom function applied to the index or column values.
Example:
# Group by first letter of Employee names
grouped = df.groupby(lambda x: df.loc[x,
"Employee"][0])["Salary"].sum()
print(grouped)
5. Resetting Index After Grouping
Use reset_index() to convert the result back to a DataFrame.
Example:
# Convert grouped data back to DataFrame
grouped =
df.groupby("Department")["Salary"].sum().reset_index()
print(grouped)
Summary Table of Common Functions
|
Function |
Purpose |
|
groupby() |
Group
data based on a column or index |
|
agg() |
Apply
multiple aggregation functions |
|
filter() |
Filter
groups based on a condition |
|
transform() |
Transform
data while retaining original structure |
|
reset_index() |
Reset
the index of grouped data |
groupby is a flexible tool for
aggregating, analyzing, and transforming data. Let me know if you'd like to
dive deeper into any of these examples!
Comments
Post a Comment