Python Pandas Notes (Beginner to Advanced)

@amitmund June 03, 2026

What is Pandas?

Pandas is a powerful Python library used for:

Data Analysis
Data Cleaning
Data Manipulation
Data Transformation
Data Science
Machine Learning Data Preparation

Installation

pip install pandas

Import Pandas

import pandas as pd

1. Pandas Data Structures

Pandas provides two primary data structures:

Structure	Description
Series	One-dimensional labeled array
DataFrame	Two-dimensional table

2. Pandas Series

Creating a Series

import pandas as pd

s = pd.Series([10, 20, 30, 40])

print(s)

Output:

0    10
1    20
2    30
3    40
dtype: int64

Series with Custom Index

s = pd.Series(
    [10, 20, 30],
    index=["A", "B", "C"]
)

print(s)

Access Value:

print(s["B"])

3. DataFrame

A DataFrame is a table of rows and columns.

Create DataFrame from Dictionary

data = {
    "Name": ["John", "Alice", "Bob"],
    "Age": [25, 30, 35]
}

df = pd.DataFrame(data)

print(df)

Output:

    Name  Age
0   John   25
1  Alice   30
2    Bob   35

4. Reading Data Files

CSV

df = pd.read_csv("data.csv")

Excel

df = pd.read_excel("data.xlsx")

Install Excel support:

pip install openpyxl

JSON

df = pd.read_json("data.json")

5. Viewing Data

First Rows

df.head()

df.head(10)

Last Rows

df.tail()

Shape

df.shape

Example:

(100, 5)

Columns

df.columns

Data Types

df.dtypes

Information

df.info()

Statistical Summary

df.describe()

6. Selecting Data

Single Column

df["Name"]

Multiple Columns

df[["Name", "Age"]]

Row Selection using iloc

df.iloc[0]

df.iloc[0:3]

Row Selection using loc

df.loc[0]

7. Filtering Data

Single Condition

df[df["Age"] > 25]

Multiple Conditions (AND)

df[
    (df["Age"] > 25)
    &
    (df["Age"] < 40)
]

OR Condition

df[
    (df["Age"] < 25)
    |
    (df["Age"] > 50)
]

8. Adding Columns

Add New Column

df["Salary"] = [50000, 60000, 70000]

Derived Column

df["Bonus"] = df["Salary"] * 0.10

9. Updating Data

Update Single Value

df.loc[0, "Age"] = 28

Update Multiple Rows

df.loc[
    df["Age"] > 30,
    "Category"
] = "Senior"

10. Deleting Data

Delete Column

df.drop("Salary", axis=1)

Permanent:

df.drop(
    "Salary",
    axis=1,
    inplace=True
)

Delete Row

df.drop(0)

11. Handling Missing Values

Detect Missing Values

df.isnull()

Count Missing Values

df.isnull().sum()

Remove Missing Rows

df.dropna()

Fill Missing Values

df.fillna(0)

Fill with Mean:

df["Age"].fillna(
    df["Age"].mean()
)

12. Sorting Data

Ascending

df.sort_values("Age")

Descending

df.sort_values(
    "Age",
    ascending=False
)

Multiple Columns

df.sort_values(
    ["Age", "Salary"]
)

13. Renaming Columns

df.rename(
    columns={
        "Age": "Employee_Age"
    }
)

Permanent:

df.rename(
    columns={
        "Age": "Employee_Age"
    },
    inplace=True
)

14. String Operations

Uppercase

df["Name"].str.upper()

Lowercase

df["Name"].str.lower()

Contains

df["Name"].str.contains("John")

Replace

df["Name"].str.replace(
    "John",
    "Johnny"
)

15. Apply Function

def grade(age):
    if age > 30:
        return "Senior"
    return "Junior"

df["Category"] = df["Age"].apply(grade)

Using Lambda:

df["Double"] = df["Age"].apply(
    lambda x: x * 2
)

16. GroupBy

Mean Salary by Department

df.groupby(
    "Department"
)["Salary"].mean()

Multiple Aggregations

df.groupby(
    "Department"
).agg({
    "Salary": [
        "sum",
        "mean",
        "max"
    ],
    "Age": "mean"
})

17. Pivot Table

pd.pivot_table(
    df,
    values="Salary",
    index="Department",
    aggfunc="mean"
)

18. Merge DataFrames

Inner Join

pd.merge(
    df1,
    df2,
    on="ID"
)

Left Join

pd.merge(
    df1,
    df2,
    on="ID",
    how="left"
)

Right Join

pd.merge(
    df1,
    df2,
    on="ID",
    how="right"
)

Outer Join

pd.merge(
    df1,
    df2,
    on="ID",
    how="outer"
)

19. Concatenation

Vertical Concatenation

pd.concat([df1, df2])

Horizontal Concatenation

pd.concat(
    [df1, df2],
    axis=1
)

20. Date and Time Operations

Convert to Datetime:

df["Date"] = pd.to_datetime(
    df["Date"]
)

Extract Year:

df["Date"].dt.year

Extract Month:

df["Date"].dt.month

Extract Day:

df["Date"].dt.day

21. Index Operations

Set Index

df.set_index("ID")

Reset Index

df.reset_index()

22. MultiIndex

df.set_index(
    [
        "Department",
        "Employee"
    ]
)

Access:

df.loc["IT"]

23. Advanced Aggregation

df.groupby(
    "Department"
).agg(
    total_salary=(
        "Salary",
        "sum"
    ),
    avg_salary=(
        "Salary",
        "mean"
    ),
    max_salary=(
        "Salary",
        "max"
    )
)

24. Window Functions

Rolling Average

df["RollingMean"] = (
    df["Sales"]
    .rolling(3)
    .mean()
)

Rolling Sum

df["RollingSum"] = (
    df["Sales"]
    .rolling(3)
    .sum()
)

25. Cumulative Functions

Cumulative Sum

df["CumSum"] = (
    df["Sales"]
    .cumsum()
)

Cumulative Maximum

df["CumMax"] = (
    df["Sales"]
    .cummax()
)

Cumulative Minimum

df["CumMin"] = (
    df["Sales"]
    .cummin()
)

26. Explode Lists

df = pd.DataFrame({
    "Name": ["John", "Alice"],
    "Skills": [
        ["Python", "SQL"],
        ["Java"]
    ]
})

df.explode("Skills")

27. Performance Optimization

Convert to Category

df["Department"] = (
    df["Department"]
    .astype("category")
)

Memory Usage

df.memory_usage(
    deep=True
)

28. Method Chaining

result = (
    df
    .dropna()
    .query("Age > 25")
    .sort_values("Salary")
)

29. Export Data

CSV

df.to_csv(
    "output.csv",
    index=False
)

Excel

df.to_excel(
    "output.xlsx",
    index=False
)

JSON

df.to_json(
    "output.json"
)

30. Real-World Workflow Example

import pandas as pd

df = pd.read_csv(
    "employees.csv"
)

df.dropna(
    inplace=True
)

df = df[
    df["Salary"] > 50000
]

df["Bonus"] = (
    df["Salary"] * 0.10
)

summary = (
    df.groupby("Department")
    .agg({
        "Salary": "mean",
        "Bonus": "sum"
    })
)

summary.to_excel(
    "report.xlsx"
)

Pandas Interview Questions

Difference Between loc and iloc

loc	iloc
Label Based	Position Based
Uses Names	Uses Integer Index

Difference Between merge and concat

merge	concat
SQL-style Join	Stacking Data
Requires Key	No Key Required

Difference Between Series and DataFrame

Series	DataFrame
1-Dimensional	2-Dimensional
Single Column	Multiple Columns

Missing Value Functions

dropna()
fillna()
interpolate()

Pandas Cheat Sheet

pd.read_csv()

pd.read_excel()

df.head()

df.info()

df.describe()

df.shape

df.columns

df.loc[]

df.iloc[]

df.groupby()

df.merge()

df.concat()

df.pivot_table()

df.dropna()

df.fillna()

df.sort_values()

df.to_csv()

df.to_excel()

Recommended Learning Roadmap

Series
DataFrame
Reading Files
Selection & Filtering
Missing Values
GroupBy
Merge & Join
Pivot Tables
DateTime Operations
Window Functions
MultiIndex
Performance Optimization
Real-world Data Analysis Projects
Interview Preparation

Pandas Simple Examples and Real-World Use Cases

What Problem Does Pandas Solve?

Imagine you have thousands of rows of data:

Name	Age	Salary
John	25	50000
Alice	30	60000
Bob	35	70000

If you use normal Python lists, finding information becomes difficult as the data grows.

Pandas allows you to:

Read data from CSV/Excel files
Filter data
Clean missing values
Calculate statistics
Generate reports
Prepare data for Machine Learning

Example 1: Employee Salary Report

Real-World Scenario

An HR department stores employee information in an Excel file.

They want to find:

Employees earning more than ₹50,000
Average salary
Highest salary

Data

import pandas as pd

data = {
    "Name": ["John", "Alice", "Bob"],
    "Salary": [50000, 60000, 70000]
}

df = pd.DataFrame(data)

print(df)

Output:

    Name  Salary
0   John   50000
1  Alice   60000
2    Bob   70000

Employees earning more than 50,000

print(df[df["Salary"] > 50000])

Output:

    Name  Salary
1  Alice   60000
2    Bob   70000

Average Salary

print(df["Salary"].mean())

Output:

Maximum Salary

print(df["Salary"].max())

Output:

Example 2: Student Marks Analysis

Real-World Scenario

A teacher wants to:

Find the highest mark
Calculate average marks
Identify students who passed

Data

import pandas as pd

data = {
    "Student": ["Rahul", "Priya", "Amit"],
    "Marks": [85, 40, 65]
}

df = pd.DataFrame(data)

print(df)

Output:

  Student  Marks
0   Rahul     85
1   Priya     40
2    Amit     65

Average Marks

print(df["Marks"].mean())

Output:

63.33

Students Who Passed

passed = df[df["Marks"] >= 50]

print(passed)

Output:

  Student  Marks
0   Rahul     85
2    Amit     65

Example 3: Online Shop Sales Report

Real-World Scenario

An e-commerce company wants to know:

Total sales
Best-selling day

Data

import pandas as pd

sales = {
    "Day": ["Mon", "Tue", "Wed"],
    "Amount": [1000, 1500, 2000]
}

df = pd.DataFrame(sales)

print(df)

Total Sales

print(df["Amount"].sum())

Output:

Highest Sale

print(df["Amount"].max())

Output:

Example 4: Reading Data from CSV File

employees.csv

Name,Age,Salary
John,25,50000
Alice,30,60000
Bob,35,70000

Read CSV

import pandas as pd

df = pd.read_csv("employees.csv")

print(df)

Output:

    Name  Age  Salary
0   John   25   50000
1  Alice   30   60000
2    Bob   35   70000

Why This Is Useful

Instead of manually reading each line from a file, Pandas loads the entire file into a table instantly.

Example 5: Handling Missing Data

Real-World Scenario

A customer database contains missing ages.

import pandas as pd

data = {
    "Name": ["John", "Alice", "Bob"],
    "Age": [25, None, 35]
}

df = pd.DataFrame(data)

print(df)

Output:

    Name   Age
0   John  25.0
1  Alice   NaN
2    Bob  35.0

Count Missing Values

print(df.isnull().sum())

Output:

Name    0
Age     1
dtype: int64

Fill Missing Value

df["Age"] = df["Age"].fillna(30)

print(df)

Output:

    Name   Age
0   John  25.0
1  Alice  30.0
2    Bob  35.0

Example 6: Grouping Data

Real-World Scenario

A company wants average salary by department.

import pandas as pd

data = {
    "Department": ["IT", "IT", "HR", "HR"],
    "Salary": [50000, 60000, 40000, 45000]
}

df = pd.DataFrame(data)

result = df.groupby("Department")["Salary"].mean()

print(result)

Output:

Department
HR    42500
IT    55000
Name: Salary, dtype: float64

Why Companies Use This

Managers can quickly see:

Average salary
Total salary expense
Department performance

Example 7: Creating a New Column

Real-World Scenario

An organization gives a 10% bonus.

import pandas as pd

data = {
    "Name": ["John", "Alice"],
    "Salary": [50000, 60000]
}

df = pd.DataFrame(data)

df["Bonus"] = df["Salary"] * 0.10

print(df)

Output:

    Name  Salary   Bonus
0   John   50000  5000.0
1  Alice   60000  6000.0

Where Pandas Is Used in Industry

Banking

Customer transaction analysis
Fraud detection
Loan data analysis

Example:

transactions.groupby("Customer")["Amount"].sum()

E-Commerce

Sales reports
Customer behavior analysis
Product performance tracking

Example:

orders.groupby("Product")["Quantity"].sum()

Healthcare

Patient records
Medical research
Hospital reporting

Example:

patients["Age"].mean()

Education

Student marks analysis
Attendance tracking
Exam reports

Example:

students["Marks"].mean()

Data Science & Machine Learning

Pandas is usually the first library used before machine learning.

Typical workflow:

import pandas as pd

# Load data
df = pd.read_csv("data.csv")

# Clean data
df.dropna(inplace=True)

# Analyze data
print(df.describe())

# Send to ML model

Quick Summary

Task	Pandas Function
Read CSV	`read_csv()`
Read Excel	`read_excel()`
View Data	`head()`
Statistics	`describe()`
Filter Rows	`df[df["col"] > value]`
Group Data	`groupby()`
Remove Missing Data	`dropna()`
Fill Missing Data	`fillna()`
Sort Data	`sort_values()`
Export CSV	`to_csv()`
Export Excel	`to_excel()`

Remember

Pandas is essentially a super-powered spreadsheet inside Python.

If Excel can do it manually, Pandas can usually do it automatically and much faster on thousands or millions of rows of data.

Example: Working with Excel Data Using Pandas

Scenario

Suppose you work in an office and receive an Excel file every month named:

employees.xlsx

Employee ID	Name	Department	Salary
101	John	IT	50000
102	Alice	HR	60000
103	Bob	IT	55000
104	Emma	Finance	70000

Management asks:

Show all IT employees
Calculate average salary
Find the highest-paid employee
Generate a department-wise salary report

Step 1: Read Excel File

import pandas as pd

df = pd.read_excel("employees.xlsx")

print(df)

Output

   Employee ID   Name Department  Salary
0          101   John         IT   50000
1          102  Alice         HR   60000
2          103    Bob         IT   55000
3          104   Emma    Finance   70000

Step 2: Show IT Employees

it_employees = df[df["Department"] == "IT"]

print(it_employees)

Output

   Employee ID  Name Department  Salary
0          101  John         IT   50000
2          103   Bob         IT   55000

Step 3: Calculate Average Salary

avg_salary = df["Salary"].mean()

print(avg_salary)

Output

58750.0

Step 4: Find Highest Salary

highest_salary = df["Salary"].max()

print(highest_salary)

Output

Step 5: Find Highest Paid Employee

highest_paid = df[df["Salary"] == df["Salary"].max()]

print(highest_paid)

Output

   Employee ID  Name Department  Salary
3          104  Emma    Finance   70000

Step 6: Department-Wise Salary Report

report = df.groupby("Department")["Salary"].mean()

print(report)

Output

Department
Finance    70000
HR         60000
IT         52500
Name: Salary, dtype: float64

Step 7: Add Bonus Column

Company gives a 10% bonus.

df["Bonus"] = df["Salary"] * 0.10

print(df)

Output

   Employee ID   Name Department  Salary   Bonus
0          101   John         IT   50000  5000.0
1          102  Alice         HR   60000  6000.0
2          103    Bob         IT   55000  5500.0
3          104   Emma    Finance   70000  7000.0

Step 8: Save Updated Data to Excel

df.to_excel(
    "employee_report.xlsx",
    index=False
)

This creates a new Excel file:

employee_report.xlsx

with the Bonus column included.

Real Office Use Cases

HR Department

# Employees older than 30
df[df["Age"] > 30]

Used for promotion analysis.

Sales Team

# Total monthly sales
df["Sales"].sum()

Used for monthly reporting.

Finance Team

# Total salary expense
df["Salary"].sum()

Used for budgeting.

Inventory Team

# Products with low stock
df[df["Stock"] < 10]

Used for reordering items.

Typical Daily Workflow

import pandas as pd

# Load Excel file
df = pd.read_excel("employees.xlsx")

# Analyze
print(df.head())

# Filter
it_staff = df[df["Department"] == "IT"]

# Create bonus
df["Bonus"] = df["Salary"] * 0.10

# Save report
df.to_excel(
    "monthly_report.xlsx",
    index=False
)

What Happened?

Read data from Excel.
Filtered required records.
Calculated new values.
Generated a report.
Saved the report back to Excel.

This is one of the most common ways Pandas is used in real office jobs, data analysis roles, HR reporting, finance reporting, and business automation.

Key Takeaway

Pandas acts like an automated Excel worker.

Instead of manually filtering, calculating, and creating reports in Excel, you write a few lines of Python and let Pandas do it automatically—even for files with hundreds of thousands of rows.

Pandas acts like an automated Excel worker. Instead of manually filtering, calculating, and creating reports in Excel, you write a few lines of Python and let Pandas do it automatically, even for files with hundreds of thousands of rows.

NumPy vs Pandas

Overview

Both NumPy and Pandas are essential Python libraries for data processing.

NumPy is designed for fast numerical and mathematical computations.
Pandas is designed for data analysis and manipulation of tabular data.

Think of it this way:

NumPy  → Numbers and Mathematics
Pandas → Tables and Data Analysis

Real-World Analogy

Suppose you have an Excel sheet:

ID	Name	Salary
101	John	50000
102	Alice	60000

NumPy View

[
 [101, "John", 50000],
 [102, "Alice", 60000]
]

NumPy only sees rows and columns.

You must remember:

Column 0 = ID
Column 1 = Name
Column 2 = Salary

Pandas View

ID   Name   Salary
101  John   50000
102  Alice  60000

Pandas understands:

Column names
Row indexes
Missing values
Data types
Spreadsheet-like operations

Comparison Table

Feature	NumPy	Pandas
Main Object	ndarray	DataFrame, Series
Purpose	Numerical Computing	Data Analysis
Column Names	No	Yes
Row Labels	No	Yes
CSV Support	No	Yes
Excel Support	No	Yes
Missing Data Handling	Basic	Advanced
Grouping Data	No	Yes
Performance	Faster	Slightly Slower
Memory Usage	Lower	Higher

Example 1: Creating Data

NumPy

import numpy as np

data = np.array([
    [101, 50000],
    [102, 60000],
    [103, 70000]
])

print(data)

Output:

[[  101 50000]
 [  102 60000]
 [  103 70000]]

Pandas

import pandas as pd

df = pd.DataFrame({
    "Employee_ID": [101, 102, 103],
    "Salary": [50000, 60000, 70000]
})

print(df)

Output:

   Employee_ID  Salary
0          101   50000
1          102   60000
2          103   70000

Example 2: Average Salary

NumPy

import numpy as np

salaries = np.array([
    50000,
    60000,
    70000
])

print(np.mean(salaries))

Output:

60000.0

Pandas

import pandas as pd

df = pd.DataFrame({
    "Salary": [50000, 60000, 70000]
})

print(df["Salary"].mean())

Output:

60000.0

Example 3: Filter Employees

Find employees earning more than 60,000.

NumPy

import numpy as np

data = np.array([
    [101, 50000],
    [102, 60000],
    [103, 70000]
])

result = data[data[:,1] > 60000]

print(result)

Output:

[[  103 70000]]

Notice:

data[:,1]

You must know that column index 1 contains salary.

Pandas

import pandas as pd

df = pd.DataFrame({
    "Employee_ID": [101, 102, 103],
    "Salary": [50000, 60000, 70000]
})

result = df[df["Salary"] > 60000]

print(result)

Output:

   Employee_ID  Salary
2          103   70000

Example 4: Reading Excel Files

NumPy

NumPy does not directly support Excel files.

Additional libraries are required.

Pandas

import pandas as pd

df = pd.read_excel(
    "employees.xlsx"
)

One line of code.

Example 5: Department-Wise Salary Report

Data:

Name	Department	Salary
John	IT	50000
Alice	HR	60000
Bob	IT	70000

NumPy

Requires custom loops and logic.

Not straightforward.

Pandas

df.groupby(
    "Department"
)["Salary"].mean()

Output:

Department
HR    60000
IT    60000

Very simple.

Performance Comparison

NumPy

import numpy as np

arr = np.arange(1000000)

arr * 2

Extremely fast.

Optimized for numerical calculations.

Pandas

import pandas as pd

s = pd.Series(
    range(1000000)
)

s * 2

Fast, but slightly slower because Pandas provides additional functionality.

When to Use NumPy

Use NumPy when working with:

Mathematical calculations
Linear algebra
Matrix operations
Statistics
Scientific computing
Machine Learning internals

Example:

import numpy as np

matrix = np.array([
    [1, 2],
    [3, 4]
])

print(matrix.dot(matrix))

Output:

[[ 7 10]
 [15 22]]

When to Use Pandas

Use Pandas when working with:

Excel files
CSV files
Business reports
Data cleaning
Missing values
Data analysis
Data Science projects

Example:

import pandas as pd

df = pd.read_csv(
    "sales.csv"
)

report = (
    df.groupby("Region")
      ["Sales"]
      .sum()
)

print(report)

Relationship Between NumPy and Pandas

Pandas is built on top of NumPy.

Python
│
├── NumPy
│
└── Pandas

Internally, Pandas uses NumPy arrays.

Example:

import pandas as pd

df = pd.DataFrame({
    "Salary": [50000, 60000]
})

print(
    type(
        df["Salary"].values
    )
)

Output:

<class 'numpy.ndarray'>

Real-World Use Cases

NumPy

Physics simulations
Engineering calculations
Machine Learning algorithms
Scientific research
Mathematical modeling

Example:

velocity = np.array(
    [10, 20, 30]
)

distance = velocity * 5

Pandas

HR reports
Sales reports
Financial analysis
Customer analytics
Data cleaning
Business dashboards

Example:

sales.groupby(
    "Product"
)["Quantity"].sum()

Interview Answer

What is the difference between NumPy and Pandas?

NumPy is a library for fast numerical and mathematical computations using multidimensional arrays. Pandas is a library for data analysis and manipulation that provides DataFrame and Series objects. Pandas is built on top of NumPy and adds support for column names, indexing, grouping, file handling, and missing-value management.

Quick Summary

Situation	Recommended Library
Mathematical Calculations	NumPy
Matrix Operations	NumPy
Scientific Computing	NumPy
Excel Analysis	Pandas
CSV Processing	Pandas
Data Cleaning	Pandas
Business Reporting	Pandas
Data Science Projects	Both

Rule of Thumb

Need mathematics?
→ Use NumPy

Need Excel/CSV analysis?
→ Use Pandas

Need Data Science?
→ Use Both