DataFrames in Python with examples | Lecture 3

 

This lecture covers pandas DataFrames in Python: what they are, how to create them, and how to access, transform, clean, and combine their data.


What is a DataFrame?

  • A DataFrame is a two-dimensional, labeled data structure provided by the pandas library.
  • It can be thought of as a table, similar to a spreadsheet, SQL table, or a dictionary of Series objects.
  • A DataFrame can be built from many sources, including dictionaries, lists, NumPy arrays, and CSV files.

Key Features

  1. Labeled axes: Rows and columns have labels (index and column names).
  2. Heterogeneous data: Each column has its own data type, so one DataFrame can mix integers, floats, strings, etc.
  3. Size mutable: Rows and columns can be added or deleted.
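
The three features above can be checked directly on a small frame. This is a minimal sketch; the variable names, row labels, and values are made up for illustration.

```python
import pandas as pd

# Illustrative data; names and values are assumptions, not a real dataset.
features_df = pd.DataFrame(
    {"Name": ["Alice", "Bob"], "Age": [25, 30], "Height": [1.65, 1.80]},
    index=["a", "b"],  # custom row labels
)

# 1. Labeled axes: rows and columns both carry labels.
row_labels = list(features_df.index)
col_labels = list(features_df.columns)

# 2. Heterogeneous data: each column keeps its own dtype.
print(features_df.dtypes)

# 3. Size mutable: add a column, then drop it again.
features_df["City"] = ["New York", "Chicago"]
shape_after_add = features_df.shape
features_df = features_df.drop(columns="City")
shape_after_drop = features_df.shape
```

Here `shape_after_add` is `(2, 4)` and `shape_after_drop` is `(2, 3)`, which shows the frame growing and shrinking in place.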

How to Create a DataFrame?

1. From a Dictionary

import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
print(df)

Output:

      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago

2. From a List of Lists

data = [
    ['Alice', 25, 'New York'],
    ['Bob', 30, 'Los Angeles'],
    ['Charlie', 35, 'Chicago']
]
df = pd.DataFrame(data, columns=['Name', 'Age', 'City'])
print(df)

3. From a Dictionary of Series

data = {
    'Name': pd.Series(['Alice', 'Bob', 'Charlie']),
    'Age': pd.Series([25, 30, 35]),
    'City': pd.Series(['New York', 'Los Angeles', 'Chicago'])
}
df = pd.DataFrame(data)
print(df)

4. From a CSV File

df = pd.read_csv('data.csv')  # Replace 'data.csv' with your file path

print(df)
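
To try `read_csv` without a file on disk, the same call also accepts a file-like object. The sketch below feeds it an in-memory string; the CSV text and the names `csv_text` and `csv_df` are illustrative assumptions.

```python
import io

import pandas as pd

# In-memory stand-in for a CSV file, so the example runs without 'data.csv'.
csv_text = """Name,Age,City
Alice,25,New York
Bob,30,Los Angeles
Charlie,35,Chicago
"""

csv_df = pd.read_csv(io.StringIO(csv_text))
print(csv_df)
print(csv_df.dtypes)  # 'Age' is parsed as an integer column
```

The first line of the text becomes the column names, and each column's type is inferred from its values.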

5. From a NumPy Array

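import numpy as np

# A NumPy array has a single dtype, so mixing text and numbers stores every value as a string
data = np.array([
    ['Alice', 25, 'New York'],
    ['Bob', 30, 'Los Angeles'],
    ['Charlie', 35, 'Chicago']
])
df = pd.DataFrame(data, columns=['Name', 'Age', 'City'])
df['Age'] = df['Age'].astype(int)  # Convert 'Age' back to integers for the numeric operations below
print(df)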


Accessing Data in a DataFrame

1. Access Columns

print(df['Name'])  # Access 'Name' column

2. Access Rows by Index

print(df.loc[0])  # Access row by label (the row whose index label is 0)

print(df.iloc[1])  # Access row by position (second row; positions start at 0)

3. Access Specific Elements

print(df.loc[0, 'Name'])  # Access specific element by label

print(df.iloc[0, 0])      # Access specific element by position
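
With the default index (0, 1, 2, ...), labels and positions coincide, which hides the difference between `loc` and `iloc`. The sketch below uses a frame with row labels 10, 20, 30 to separate the two; the frame and variable names are assumptions for illustration.

```python
import pandas as pd

# Illustrative frame with non-default row labels.
people = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie"], "Age": [25, 30, 35]},
    index=[10, 20, 30],
)

by_label = people.loc[10]               # row whose label is 10
by_position = people.iloc[1]            # second row, regardless of its label
cell_by_label = people.loc[30, "Name"]  # label 30, column 'Name'
cell_by_position = people.iloc[0, 1]    # first row, second column

print(by_label["Name"], by_position["Name"], cell_by_label, cell_by_position)
```

Note that `people.loc[1]` would raise a `KeyError` here, because no row carries the label 1.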

4. Slicing

print(df[0:2])  # Access first two rows

print(df[['Name', 'City']])  # Access multiple columns


Operations on DataFrames

1. Basic Statistics

print(df.describe())  # Summary statistics for numerical columns

2. Sorting

df_sorted = df.sort_values(by='Age', ascending=False)

print(df_sorted)

3. Filtering

df_filtered = df[df['Age'] > 30]

print(df_filtered)
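
Filters can also combine several conditions. A minimal sketch, assuming the same illustrative people data as earlier (rebuilt here so the block runs on its own):

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["New York", "Los Angeles", "Chicago"],
})

# Wrap each condition in parentheses; combine with & (and), | (or), ~ (not).
over_25_not_chicago = people[(people["Age"] > 25) & (people["City"] != "Chicago")]
young_or_chicago = people[(people["Age"] < 30) | (people["City"] == "Chicago")]

# isin() tests membership in a list of allowed values.
in_selected_cities = people[people["City"].isin(["New York", "Chicago"])]

print(over_25_not_chicago)
```

Python's `and` and `or` keywords do not work element-wise on Series, which is why the `&` and `|` operators are used instead.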

4. Adding New Columns

df['Salary'] = [50000, 60000, 70000]  # Add a new column

print(df)

5. Deleting Columns

df = df.drop('City', axis=1)  # Drop the 'City' column

print(df)

6. Renaming Columns

df = df.rename(columns={'Age': 'Years'})

print(df)

7. Group By

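data = {
    'Name': ['Alice', 'Bob', 'Alice', 'Bob'],
    'Subject': ['Math', 'Math', 'Science', 'Science'],
    'Score': [90, 85, 95, 80]
}
df = pd.DataFrame(data)
# Select the numeric column first; recent pandas versions raise an error when averaging the text column 'Subject'
grouped = df.groupby('Name')['Score'].mean()
print(grouped)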
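
A group-by can also compute several statistics at once with `agg`. The sketch below reuses the scores data from this section under the assumed names `scores` and `summary`.

```python
import pandas as pd

scores = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "Bob"],
    "Subject": ["Math", "Math", "Science", "Science"],
    "Score": [90, 85, 95, 80],
})

# One row per name, one column per aggregation.
summary = scores.groupby("Name")["Score"].agg(["mean", "min", "max"])
print(summary)
```

The result is itself a DataFrame indexed by `Name`, so a single value can be read with, for example, `summary.loc['Alice', 'mean']`.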


Handling Missing Data

1. Check for Missing Data

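df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, None, 35]})  # Sample data with one missing 'Age'

print(df.isnull())  # Check for null values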

print(df.isnull().sum())  # Count null values per column

2. Fill Missing Data

df['Age'] = df['Age'].fillna(df['Age'].mean())  # Fill missing values with mean

3. Drop Rows/Columns with Missing Data

df = df.dropna()  # Drop rows with missing data (use axis=1 to drop columns instead)
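
The three steps above can be seen side by side on one small frame. This is a sketch with made-up records; the placeholder `'Unknown'` and all variable names are assumptions.

```python
import numpy as np
import pandas as pd

records = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, np.nan, 35],
    "City": ["New York", None, "Chicago"],
})

missing_per_column = records.isnull().sum()  # count of missing values per column
rows_dropped = records.dropna()              # keep only fully populated rows
cols_dropped = records.dropna(axis=1)        # keep only fully populated columns

# Fill each column with its own replacement value.
filled = records.fillna({"Age": records["Age"].mean(), "City": "Unknown"})
print(filled)
```

None of these calls modify `records`; each returns a new DataFrame, which is why the lecture code assigns the result back to `df`.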


Merging, Joining, and Concatenating

1. Merging

df1 = pd.DataFrame({'ID': [1, 2], 'Name': ['Alice', 'Bob']})

df2 = pd.DataFrame({'ID': [1, 2], 'Age': [25, 30]})

merged = pd.merge(df1, df2, on='ID')

print(merged)
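
In the example above both frames contain the same IDs. When they do not, the `how` parameter of `pd.merge` decides which rows survive. A minimal sketch with illustrative frames (`names` and `ages` are assumed names):

```python
import pandas as pd

names = pd.DataFrame({"ID": [1, 2, 3], "Name": ["Alice", "Bob", "Charlie"]})
ages = pd.DataFrame({"ID": [1, 2, 4], "Age": [25, 30, 40]})

inner = pd.merge(names, ages, on="ID")               # default how='inner': IDs in both frames
left = pd.merge(names, ages, on="ID", how="left")    # every ID from the left frame
outer = pd.merge(names, ages, on="ID", how="outer")  # every ID from either frame

print(left)
```

Rows without a match on the other side get missing values (NaN) in the columns that come from that side.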

2. Concatenating

df1 = pd.DataFrame({'Name': ['Alice', 'Bob']})

df2 = pd.DataFrame({'Name': ['Charlie', 'David']})

concatenated = pd.concat([df1, df2])  # Original row labels are kept (0, 1, 0, 1); pass ignore_index=True to renumber

print(concatenated)
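
The effect of `ignore_index` on the row labels can be sketched as follows, with the two frames rebuilt under the assumed names `first` and `second`:

```python
import pandas as pd

first = pd.DataFrame({"Name": ["Alice", "Bob"]})
second = pd.DataFrame({"Name": ["Charlie", "David"]})

stacked = pd.concat([first, second])                        # keeps each frame's own labels
renumbered = pd.concat([first, second], ignore_index=True)  # fresh labels 0..3

print(list(stacked.index))
print(list(renumbered.index))
```

Duplicate labels are legal but make `loc` return several rows for one label, so renumbering is often the safer choice when the original labels carry no meaning.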


Key Methods

Method           Description                                        Example
head(n)          First n rows                                       df.head(3)
tail(n)          Last n rows                                        df.tail(3)
info()           Column dtypes, non-null counts, and memory usage   df.info()
shape            Attribute: (rows, columns) tuple                   df.shape
columns          Attribute: Index of column names                   df.columns
value_counts()   Count of each unique value                         df['Column'].value_counts()
apply()          Apply a function to each value                     df['Column'].apply(lambda x: x**2)
pivot_table()    Create a pivot table                               df.pivot_table(index='Column1')
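
The last three entries are easier to follow on concrete data. A minimal sketch, assuming a small illustrative `staff` frame (the names and values are made up):

```python
import pandas as pd

staff = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Department": ["HR", "IT", "IT", "HR"],
    "Salary": [50000, 60000, 70000, 55000],
})

# How many rows fall into each department.
dept_counts = staff["Department"].value_counts()

# Apply a function to every value in a column.
salary_in_thousands = staff["Salary"].apply(lambda x: x / 1000)

# Mean salary per department, returned as a DataFrame.
pivot = staff.pivot_table(index="Department", values="Salary", aggfunc="mean")

print(dept_counts)
print(pivot)
```

For simple arithmetic like this, `staff['Salary'] / 1000` gives the same result as `apply` and is usually faster; `apply` is most useful when the function cannot be written as a column expression.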


Practical Example

Employee Salary Analysis

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Department': ['HR', 'IT', 'IT', 'HR'],
    'Salary': [50000, 60000, 70000, 55000]
}
df = pd.DataFrame(data)

# Average salary by department
avg_salary = df.groupby('Department')['Salary'].mean()
print("Average Salary by Department:\n", avg_salary)

# Employees earning above 55,000
high_earners = df[df['Salary'] > 55000]
print("\nHigh Earners:\n", high_earners)

Output:

Average Salary by Department:
 Department
HR    52500.0
IT    65000.0
Name: Salary, dtype: float64

High Earners:
       Name Department  Salary
1      Bob        IT   60000
2  Charlie        IT   70000


This lecture covered creating DataFrames, accessing and transforming their data, handling missing values, and combining DataFrames with merge and concat.

 
