DataFrames in Python with examples | Lecture 3
Here’s a detailed guide to DataFrames in Python with examples:
What is a DataFrame?
- A DataFrame is a two-dimensional, labeled data structure provided by the pandas library.
- It can be thought of as a table, similar to a spreadsheet, a SQL table, or a dictionary of Series objects.
- A DataFrame is flexible: it can be built from dictionaries, lists, Series, NumPy arrays, and files such as CSV.
Key Features
- Labeled axes: Rows and columns have labels (index and column names).
- Heterogeneous data: Can contain different types of data (integers, floats, strings, etc.).
- Size mutable: Rows and columns can be added or deleted.
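The three features above can be seen in a short sketch. The sample values and the custom row labels 'a' and 'b' are made up for illustration only.

```python
import pandas as pd

# Hypothetical sample data, used only to illustrate the three features
people = pd.DataFrame(
    {"Name": ["Alice", "Bob"], "Age": [25, 30], "Height": [1.65, 1.80]},
    index=["a", "b"],  # custom row labels
)

# Labeled axes: both rows and columns carry labels
print(people.index.tolist())    # ['a', 'b']
print(people.columns.tolist())  # ['Name', 'Age', 'Height']

# Heterogeneous data: each column has its own dtype
print(people.dtypes)

# Size mutable: add a column, then remove a row
people["City"] = ["New York", "Chicago"]
people = people.drop(index="b")
print(people.shape)  # (1, 4)
```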
How to Create a DataFrame?
1. From a Dictionary
import pandas as pd
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
print(df)
Output:
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
2. From a List of Lists
data = [
    ['Alice', 25, 'New York'],
    ['Bob', 30, 'Los Angeles'],
    ['Charlie', 35, 'Chicago']
]
df = pd.DataFrame(data, columns=['Name', 'Age', 'City'])
print(df)
3. From a Dictionary of Series
data = {
    'Name': pd.Series(['Alice', 'Bob', 'Charlie']),
    'Age': pd.Series([25, 30, 35]),
    'City': pd.Series(['New York', 'Los Angeles', 'Chicago'])
}
df = pd.DataFrame(data)
print(df)
4. From a CSV File
df = pd.read_csv('data.csv') # Replace 'data.csv' with your file path
print(df)
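If no CSV file is at hand, the same call can be tried on in-memory text. This is a minimal sketch: the CSV content below is invented, and io.StringIO stands in for a real file path.

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice this would live in a file on disk
csv_text = """Name,Age,City
Alice,25,New York
Bob,30,Los Angeles
Charlie,35,Chicago
"""

# read_csv accepts a file path or any file-like object
df_csv = pd.read_csv(io.StringIO(csv_text))
print(df_csv)
print(df_csv.dtypes)  # 'Age' is parsed as an integer column
```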
5. From a NumPy Array
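import numpy as np
# A NumPy array holds a single dtype, so mixing strings and numbers stores every value as a string
data = np.array([
    ['Alice', 25, 'New York'],
    ['Bob', 30, 'Los Angeles'],
    ['Charlie', 35, 'Chicago']
])
df = pd.DataFrame(data, columns=['Name', 'Age', 'City'])
df['Age'] = df['Age'].astype(int)  # Convert 'Age' back to integers for numeric operations
print(df)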
Accessing Data in a DataFrame
1. Access Columns
print(df['Name'])  # Access the 'Name' column
2. Access Rows by Index
print(df.loc[0])  # Access the row with index label 0
print(df.iloc[1])  # Access the row at position 1 (the second row)
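With the default index, labels and positions happen to coincide, which hides the difference between loc and iloc. The sketch below uses made-up index labels (10, 20, 30) to separate the two.

```python
import pandas as pd

people = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie"], "Age": [25, 30, 35]},
    index=[10, 20, 30],  # hypothetical non-default row labels
)

by_label = people.loc[20]     # row whose index label is 20
by_position = people.iloc[0]  # first row, regardless of its label

print(by_label["Name"])     # Bob
print(by_position["Name"])  # Alice
```

Here people.loc[0] would raise a KeyError, because no row carries the label 0.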
3. Access Specific Elements
print(df.loc[0, 'Name']) # Access specific element by label
print(df.iloc[0, 0]) # Access specific element by position
4. Slicing
print(df[0:2])  # Access the first two rows
print(df[['Name', 'City']]) # Access multiple columns
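One detail worth checking when slicing: position-based slices exclude the end point, while label-based slices with loc include it. A small sketch, with sample data assumed for illustration:

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "City": ["New York", "Los Angeles", "Chicago"],
})

by_position = people.iloc[0:2]  # positions 0 and 1; the end point is excluded
by_label = people.loc[0:2]      # labels 0, 1 and 2; the end point is included

print(len(by_position))  # 2
print(len(by_label))     # 3
```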
Operations on DataFrames
1. Basic Statistics
print(df.describe()) # Summary statistics for numerical columns
2. Sorting
df_sorted = df.sort_values(by='Age', ascending=False)
print(df_sorted)
3. Filtering
df_filtered = df[df['Age'] > 30]
print(df_filtered)
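Filters can also be combined. The sketch below assumes the same sample people as earlier; each condition needs its own parentheses, and the operators are & (and), | (or) and ~ (not) rather than Python's and/or/not.

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["New York", "Los Angeles", "Chicago"],
})

# Aged 30 or older AND not living in Chicago
mask = (people["Age"] >= 30) & (people["City"] != "Chicago")
older_outside_chicago = people[mask]
print(older_outside_chicago)

# isin() tests membership in a list of allowed values
in_cities = people[people["City"].isin(["New York", "Chicago"])]
print(in_cities)
```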
4. Adding New Columns
df['Salary'] = [50000, 60000, 70000] # Add a new column
print(df)
5. Deleting Columns
df = df.drop('City', axis=1) # Drop the 'City' column
print(df)
6. Renaming Columns
df = df.rename(columns={'Age': 'Years'})
print(df)
7. Group By
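data = {
    'Name': ['Alice', 'Bob', 'Alice', 'Bob'],
    'Subject': ['Math', 'Math', 'Science', 'Science'],
    'Score': [90, 85, 95, 80]
}
df = pd.DataFrame(data)
# Select the numeric column first; averaging the text column 'Subject' raises a TypeError in pandas 2.x
grouped = df.groupby('Name')['Score'].mean()
print(grouped)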
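A group can be summarized with several statistics at once through agg(). This is a sketch on the same scores data; the choice of mean, max and count is only an example.

```python
import pandas as pd

scores = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "Bob"],
    "Subject": ["Math", "Math", "Science", "Science"],
    "Score": [90, 85, 95, 80],
})

# One row per name, one column per aggregate
summary = scores.groupby("Name")["Score"].agg(["mean", "max", "count"])
print(summary)
```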
Handling Missing Data
1. Check for Missing Data
print(df.isnull())  # Check for null values
print(df.isnull().sum()) # Count null values per column
2. Fill Missing Data
# Assumes df has a numeric 'Age' column, as in the earlier examples
df['Age'] = df['Age'].fillna(df['Age'].mean())  # Fill missing values with the column mean
3. Drop Rows/Columns with Missing Data
df = df.dropna()  # Drop rows with missing data
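The calls in this section only have a visible effect when the data actually contains gaps. Here is a self-contained sketch with deliberately missing values; the sample data is invented.

```python
import numpy as np
import pandas as pd

people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, np.nan, 35],
    "City": ["New York", None, "Chicago"],
})

missing_per_column = people.isnull().sum()
print(missing_per_column)  # one missing value each in 'Age' and 'City'

# Fill the missing age with the mean of the known ages (25 and 35)
filled = people.copy()
filled["Age"] = filled["Age"].fillna(filled["Age"].mean())

rows_dropped = people.dropna()        # keeps only fully populated rows
cols_dropped = people.dropna(axis=1)  # keeps only fully populated columns
print(rows_dropped)
print(cols_dropped)
```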
Merging, Joining, and Concatenating
1. Merging
df1 = pd.DataFrame({'ID': [1, 2], 'Name': ['Alice', 'Bob']})
df2 = pd.DataFrame({'ID': [1, 2], 'Age': [25, 30]})
merged = pd.merge(df1, df2, on='ID')
print(merged)
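When the two tables do not share exactly the same keys, the how parameter of pd.merge decides which rows survive. A sketch with made-up IDs that only partly overlap:

```python
import pandas as pd

left = pd.DataFrame({"ID": [1, 2, 3], "Name": ["Alice", "Bob", "Charlie"]})
right = pd.DataFrame({"ID": [1, 2, 4], "Age": [25, 30, 40]})

# Default how="inner": only IDs present in both tables
inner = pd.merge(left, right, on="ID")

# how="left": every row of `left`; 'Age' is NaN where no match exists
left_join = pd.merge(left, right, on="ID", how="left")

# how="outer": all IDs from both tables; indicator=True adds a '_merge' column
outer = pd.merge(left, right, on="ID", how="outer", indicator=True)

print(inner)
print(left_join)
print(outer)
```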
2. Concatenating
df1 = pd.DataFrame({'Name': ['Alice', 'Bob']})
df2 = pd.DataFrame({'Name': ['Charlie', 'David']})
concatenated = pd.concat([df1, df2])
print(concatenated)
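By default pd.concat keeps each frame's original row labels, so the combined index can repeat. Passing ignore_index=True renumbers the rows, as this sketch shows (frame names are illustrative):

```python
import pandas as pd

first = pd.DataFrame({"Name": ["Alice", "Bob"]})
second = pd.DataFrame({"Name": ["Charlie", "David"]})

stacked = pd.concat([first, second])
print(stacked.index.tolist())  # [0, 1, 0, 1]

renumbered = pd.concat([first, second], ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```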
Key Methods
| Method | Description | Example |
| head(n) | First n rows | df.head(3) |
| tail(n) | Last n rows | df.tail(3) |
| info() | Summary of the DataFrame (column dtypes, non-null counts, memory usage) | df.info() |
| shape | Shape of the DataFrame as (rows, columns); an attribute, not a method | df.shape |
| columns | Column labels; an attribute, not a method | df.columns |
| value_counts() | Count of each unique value | df['Column'].value_counts() |
| apply() | Apply a function to each value | df['Column'].apply(lambda x: x**2) |
| pivot_table() | Create a pivot table | df.pivot_table(index='Column1') |
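Several entries from the table can be tried together on one small frame. The sales data and its column names below are hypothetical, chosen only to give each call something to work on.

```python
import pandas as pd

sales = pd.DataFrame({
    "Region": ["East", "West", "East", "West"],
    "Product": ["A", "A", "B", "B"],
    "Units": [10, 20, 30, 40],
})

print(sales.head(2))  # first two rows
print(sales.shape)    # (4, 3)

counts = sales["Region"].value_counts()           # rows per region
squared = sales["Units"].apply(lambda x: x ** 2)  # square each value

# Regions as rows, products as columns, summed units in the cells
pivot = sales.pivot_table(
    index="Region", columns="Product", values="Units", aggfunc="sum"
)
print(counts)
print(squared)
print(pivot)
```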
Practical Example
Employee Salary Analysis
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Department': ['HR', 'IT', 'IT', 'HR'],
    'Salary': [50000, 60000, 70000, 55000]
}
df = pd.DataFrame(data)
# Average salary by department
avg_salary = df.groupby('Department')['Salary'].mean()
print("Average Salary by Department:\n", avg_salary)
# Employees earning above 55,000
high_earners = df[df['Salary'] > 55000]
print("\nHigh Earners:\n", high_earners)
Output:
Average Salary by Department:
 Department
HR    52500.0
IT    65000.0
Name: Salary, dtype: float64

High Earners:
       Name Department  Salary
1      Bob         IT   60000
2  Charlie         IT   70000
This provides a comprehensive guide to working with DataFrames in Python. Let me know if you’d like to explore more use cases!