Python Pandas Notes (Beginner to Advanced)

amitmund June 03, 2026


What is Pandas?

Pandas is a powerful Python library used for:

  • Data Analysis
  • Data Cleaning
  • Data Manipulation
  • Data Transformation
  • Data Science
  • Machine Learning Data Preparation

Installation

pip install pandas

Import Pandas

import pandas as pd

1. Pandas Data Structures

Pandas provides two primary data structures:

Structure Description
Series One-dimensional labeled array
DataFrame Two-dimensional table

2. Pandas Series

Creating a Series

import pandas as pd

s = pd.Series([10, 20, 30, 40])

print(s)

Output:

0    10
1    20
2    30
3    40
dtype: int64

Series with Custom Index

s = pd.Series(
    [10, 20, 30],
    index=["A", "B", "C"]
)

print(s)

Access Value:

print(s["B"])

3. DataFrame

A DataFrame is a table of rows and columns.

Create DataFrame from Dictionary

data = {
    "Name": ["John", "Alice", "Bob"],
    "Age": [25, 30, 35]
}

df = pd.DataFrame(data)

print(df)

Output:

    Name  Age
0   John   25
1  Alice   30
2    Bob   35

4. Reading Data Files

CSV

df = pd.read_csv("data.csv")

Excel

df = pd.read_excel("data.xlsx")

Install Excel support:

pip install openpyxl

JSON

df = pd.read_json("data.json")

5. Viewing Data

First Rows

df.head()
df.head(10)

Last Rows

df.tail()

Shape

df.shape

Example:

(100, 5)

Columns

df.columns

Data Types

df.dtypes

Information

df.info()

Statistical Summary

df.describe()

6. Selecting Data

Single Column

df["Name"]

Multiple Columns

df[["Name", "Age"]]

Row Selection using iloc

df.iloc[0]
df.iloc[0:3]

Row Selection using loc

df.loc[0]

7. Filtering Data

Single Condition

df[df["Age"] > 25]

Multiple Conditions (AND)

df[
    (df["Age"] > 25)
    &
    (df["Age"] < 40)
]

OR Condition

df[
    (df["Age"] < 25)
    |
    (df["Age"] > 50)
]

8. Adding Columns

Add New Column

df["Salary"] = [50000, 60000, 70000]

Derived Column

df["Bonus"] = df["Salary"] * 0.10

9. Updating Data

Update Single Value

df.loc[0, "Age"] = 28

Update Multiple Rows

df.loc[
    df["Age"] > 30,
    "Category"
] = "Senior"

10. Deleting Data

Delete Column

df.drop("Salary", axis=1)

Permanent:

df.drop(
    "Salary",
    axis=1,
    inplace=True
)

Delete Row

df.drop(0)

11. Handling Missing Values

Detect Missing Values

df.isnull()

Count Missing Values

df.isnull().sum()

Remove Missing Rows

df.dropna()

Fill Missing Values

df.fillna(0)

Fill with Mean:

df["Age"].fillna(
    df["Age"].mean()
)

12. Sorting Data

Ascending

df.sort_values("Age")

Descending

df.sort_values(
    "Age",
    ascending=False
)

Multiple Columns

df.sort_values(
    ["Age", "Salary"]
)

13. Renaming Columns

df.rename(
    columns={
        "Age": "Employee_Age"
    }
)

Permanent:

df.rename(
    columns={
        "Age": "Employee_Age"
    },
    inplace=True
)

14. String Operations

Uppercase

df["Name"].str.upper()

Lowercase

df["Name"].str.lower()

Contains

df["Name"].str.contains("John")

Replace

df["Name"].str.replace(
    "John",
    "Johnny"
)

15. Apply Function

def grade(age):
    if age > 30:
        return "Senior"
    return "Junior"

df["Category"] = df["Age"].apply(grade)

Using Lambda:

df["Double"] = df["Age"].apply(
    lambda x: x * 2
)

16. GroupBy

Mean Salary by Department

df.groupby(
    "Department"
)["Salary"].mean()

Multiple Aggregations

df.groupby(
    "Department"
).agg({
    "Salary": [
        "sum",
        "mean",
        "max"
    ],
    "Age": "mean"
})

17. Pivot Table

pd.pivot_table(
    df,
    values="Salary",
    index="Department",
    aggfunc="mean"
)

18. Merge DataFrames

Inner Join

pd.merge(
    df1,
    df2,
    on="ID"
)

Left Join

pd.merge(
    df1,
    df2,
    on="ID",
    how="left"
)

Right Join

pd.merge(
    df1,
    df2,
    on="ID",
    how="right"
)

Outer Join

pd.merge(
    df1,
    df2,
    on="ID",
    how="outer"
)

19. Concatenation

Vertical Concatenation

pd.concat([df1, df2])

Horizontal Concatenation

pd.concat(
    [df1, df2],
    axis=1
)

20. Date and Time Operations

Convert to Datetime:

df["Date"] = pd.to_datetime(
    df["Date"]
)

Extract Year:

df["Date"].dt.year

Extract Month:

df["Date"].dt.month

Extract Day:

df["Date"].dt.day

21. Index Operations

Set Index

df.set_index("ID")

Reset Index

df.reset_index()

22. MultiIndex

df.set_index(
    [
        "Department",
        "Employee"
    ]
)

Access:

df.loc["IT"]

23. Advanced Aggregation

df.groupby(
    "Department"
).agg(
    total_salary=(
        "Salary",
        "sum"
    ),
    avg_salary=(
        "Salary",
        "mean"
    ),
    max_salary=(
        "Salary",
        "max"
    )
)

24. Window Functions

Rolling Average

df["RollingMean"] = (
    df["Sales"]
    .rolling(3)
    .mean()
)

Rolling Sum

df["RollingSum"] = (
    df["Sales"]
    .rolling(3)
    .sum()
)

25. Cumulative Functions

Cumulative Sum

df["CumSum"] = (
    df["Sales"]
    .cumsum()
)

Cumulative Maximum

df["CumMax"] = (
    df["Sales"]
    .cummax()
)

Cumulative Minimum

df["CumMin"] = (
    df["Sales"]
    .cummin()
)

26. Explode Lists

df = pd.DataFrame({
    "Name": ["John", "Alice"],
    "Skills": [
        ["Python", "SQL"],
        ["Java"]
    ]
})

df.explode("Skills")

27. Performance Optimization

Convert to Category

df["Department"] = (
    df["Department"]
    .astype("category")
)

Memory Usage

df.memory_usage(
    deep=True
)

28. Method Chaining

result = (
    df
    .dropna()
    .query("Age > 25")
    .sort_values("Salary")
)

29. Export Data

CSV

df.to_csv(
    "output.csv",
    index=False
)

Excel

df.to_excel(
    "output.xlsx",
    index=False
)

JSON

df.to_json(
    "output.json"
)

30. Real-World Workflow Example

import pandas as pd

df = pd.read_csv(
    "employees.csv"
)

df.dropna(
    inplace=True
)

df = df[
    df["Salary"] > 50000
]

df["Bonus"] = (
    df["Salary"] * 0.10
)

summary = (
    df.groupby("Department")
    .agg({
        "Salary": "mean",
        "Bonus": "sum"
    })
)

summary.to_excel(
    "report.xlsx"
)

Pandas Interview Questions

Difference Between loc and iloc

loc iloc
Label Based Position Based
Uses Names Uses Integer Index

Difference Between merge and concat

merge concat
SQL-style Join Stacking Data
Requires Key No Key Required

Difference Between Series and DataFrame

Series DataFrame
1-Dimensional 2-Dimensional
Single Column Multiple Columns

Missing Value Functions

dropna()
fillna()
interpolate()

Pandas Cheat Sheet

pd.read_csv()

pd.read_excel()

df.head()

df.info()

df.describe()

df.shape

df.columns

df.loc[]

df.iloc[]

df.groupby()

df.merge()

df.concat()

df.pivot_table()

df.dropna()

df.fillna()

df.sort_values()

df.to_csv()

df.to_excel()

Recommended Learning Roadmap

  1. Series
  2. DataFrame
  3. Reading Files
  4. Selection & Filtering
  5. Missing Values
  6. GroupBy
  7. Merge & Join
  8. Pivot Tables
  9. DateTime Operations
  10. Window Functions
  11. MultiIndex
  12. Performance Optimization
  13. Real-world Data Analysis Projects
  14. Interview Preparation


Pandas Simple Examples and Real-World Use Cases

What Problem Does Pandas Solve?

Imagine you have thousands of rows of data:

Name Age Salary
John 25 50000
Alice 30 60000
Bob 35 70000

If you use normal Python lists, finding information becomes difficult as the data grows.

Pandas allows you to:

  • Read data from CSV/Excel files
  • Filter data
  • Clean missing values
  • Calculate statistics
  • Generate reports
  • Prepare data for Machine Learning

Example 1: Employee Salary Report

Real-World Scenario

An HR department stores employee information in an Excel file.

They want to find:

  • Employees earning more than ₹50,000
  • Average salary
  • Highest salary

Data

import pandas as pd

data = {
    "Name": ["John", "Alice", "Bob"],
    "Salary": [50000, 60000, 70000]
}

df = pd.DataFrame(data)

print(df)

Output:

    Name  Salary
0   John   50000
1  Alice   60000
2    Bob   70000

Employees earning more than 50,000

print(df[df["Salary"] > 50000])

Output:

    Name  Salary
1  Alice   60000
2    Bob   70000

Average Salary

print(df["Salary"].mean())

Output:

60000

Maximum Salary

print(df["Salary"].max())

Output:

70000

Example 2: Student Marks Analysis

Real-World Scenario

A teacher wants to:

  • Find the highest mark
  • Calculate average marks
  • Identify students who passed

Data

import pandas as pd

data = {
    "Student": ["Rahul", "Priya", "Amit"],
    "Marks": [85, 40, 65]
}

df = pd.DataFrame(data)

print(df)

Output:

  Student  Marks
0   Rahul     85
1   Priya     40
2    Amit     65

Average Marks

print(df["Marks"].mean())

Output:

63.33

Students Who Passed

passed = df[df["Marks"] >= 50]

print(passed)

Output:

  Student  Marks
0   Rahul     85
2    Amit     65

Example 3: Online Shop Sales Report

Real-World Scenario

An e-commerce company wants to know:

  • Total sales
  • Best-selling day

Data

import pandas as pd

sales = {
    "Day": ["Mon", "Tue", "Wed"],
    "Amount": [1000, 1500, 2000]
}

df = pd.DataFrame(sales)

print(df)

Total Sales

print(df["Amount"].sum())

Output:

4500

Highest Sale

print(df["Amount"].max())

Output:

2000

Example 4: Reading Data from CSV File

employees.csv

Name,Age,Salary
John,25,50000
Alice,30,60000
Bob,35,70000

Read CSV

import pandas as pd

df = pd.read_csv("employees.csv")

print(df)

Output:

    Name  Age  Salary
0   John   25   50000
1  Alice   30   60000
2    Bob   35   70000

Why This Is Useful

Instead of manually reading each line from a file, Pandas loads the entire file into a table instantly.


Example 5: Handling Missing Data

Real-World Scenario

A customer database contains missing ages.

import pandas as pd

data = {
    "Name": ["John", "Alice", "Bob"],
    "Age": [25, None, 35]
}

df = pd.DataFrame(data)

print(df)

Output:

    Name   Age
0   John  25.0
1  Alice   NaN
2    Bob  35.0

Count Missing Values

print(df.isnull().sum())

Output:

Name    0
Age     1
dtype: int64

Fill Missing Value

df["Age"] = df["Age"].fillna(30)

print(df)

Output:

    Name   Age
0   John  25.0
1  Alice  30.0
2    Bob  35.0

Example 6: Grouping Data

Real-World Scenario

A company wants average salary by department.

import pandas as pd

data = {
    "Department": ["IT", "IT", "HR", "HR"],
    "Salary": [50000, 60000, 40000, 45000]
}

df = pd.DataFrame(data)

result = df.groupby("Department")["Salary"].mean()

print(result)

Output:

Department
HR    42500
IT    55000
Name: Salary, dtype: float64

Why Companies Use This

Managers can quickly see:

  • Average salary
  • Total salary expense
  • Department performance

Example 7: Creating a New Column

Real-World Scenario

An organization gives a 10% bonus.

import pandas as pd

data = {
    "Name": ["John", "Alice"],
    "Salary": [50000, 60000]
}

df = pd.DataFrame(data)

df["Bonus"] = df["Salary"] * 0.10

print(df)

Output:

    Name  Salary   Bonus
0   John   50000  5000.0
1  Alice   60000  6000.0

Where Pandas Is Used in Industry

Banking

  • Customer transaction analysis
  • Fraud detection
  • Loan data analysis

Example:

transactions.groupby("Customer")["Amount"].sum()

E-Commerce

  • Sales reports
  • Customer behavior analysis
  • Product performance tracking

Example:

orders.groupby("Product")["Quantity"].sum()

Healthcare

  • Patient records
  • Medical research
  • Hospital reporting

Example:

patients["Age"].mean()

Education

  • Student marks analysis
  • Attendance tracking
  • Exam reports

Example:

students["Marks"].mean()

Data Science & Machine Learning

Pandas is usually the first library used before machine learning.

Typical workflow:

import pandas as pd

# Load data
df = pd.read_csv("data.csv")

# Clean data
df.dropna(inplace=True)

# Analyze data
print(df.describe())

# Send to ML model

Quick Summary

Task Pandas Function
Read CSV read_csv()
Read Excel read_excel()
View Data head()
Statistics describe()
Filter Rows df[df["col"] > value]
Group Data groupby()
Remove Missing Data dropna()
Fill Missing Data fillna()
Sort Data sort_values()
Export CSV to_csv()
Export Excel to_excel()

Remember

Pandas is essentially a super-powered spreadsheet inside Python.

If Excel can do it manually, Pandas can usually do it automatically and much faster on thousands or millions of rows of data.




Example: Working with Excel Data Using Pandas

Scenario

Suppose you work in an office and receive an Excel file every month named:

employees.xlsx

Employee ID Name Department Salary
101 John IT 50000
102 Alice HR 60000
103 Bob IT 55000
104 Emma Finance 70000

Management asks:

  • Show all IT employees
  • Calculate average salary
  • Find the highest-paid employee
  • Generate a department-wise salary report

Step 1: Read Excel File

import pandas as pd

df = pd.read_excel("employees.xlsx")

print(df)

Output

   Employee ID   Name Department  Salary
0          101   John         IT   50000
1          102  Alice         HR   60000
2          103    Bob         IT   55000
3          104   Emma    Finance   70000

Step 2: Show IT Employees

it_employees = df[df["Department"] == "IT"]

print(it_employees)

Output

   Employee ID  Name Department  Salary
0          101  John         IT   50000
2          103   Bob         IT   55000

Step 3: Calculate Average Salary

avg_salary = df["Salary"].mean()

print(avg_salary)

Output

58750.0

Step 4: Find Highest Salary

highest_salary = df["Salary"].max()

print(highest_salary)

Output

70000

Step 5: Find Highest Paid Employee

highest_paid = df[df["Salary"] == df["Salary"].max()]

print(highest_paid)

Output

   Employee ID  Name Department  Salary
3          104  Emma    Finance   70000

Step 6: Department-Wise Salary Report

report = df.groupby("Department")["Salary"].mean()

print(report)

Output

Department
Finance    70000
HR         60000
IT         52500
Name: Salary, dtype: float64

Step 7: Add Bonus Column

Company gives a 10% bonus.

df["Bonus"] = df["Salary"] * 0.10

print(df)

Output

   Employee ID   Name Department  Salary   Bonus
0          101   John         IT   50000  5000.0
1          102  Alice         HR   60000  6000.0
2          103    Bob         IT   55000  5500.0
3          104   Emma    Finance   70000  7000.0

Step 8: Save Updated Data to Excel

df.to_excel(
    "employee_report.xlsx",
    index=False
)

This creates a new Excel file:

employee_report.xlsx

with the Bonus column included.


Real Office Use Cases

HR Department

# Employees older than 30
df[df["Age"] > 30]

Used for promotion analysis.


Sales Team

# Total monthly sales
df["Sales"].sum()

Used for monthly reporting.


Finance Team

# Total salary expense
df["Salary"].sum()

Used for budgeting.


Inventory Team

# Products with low stock
df[df["Stock"] < 10]

Used for reordering items.


Typical Daily Workflow

import pandas as pd

# Load Excel file
df = pd.read_excel("employees.xlsx")

# Analyze
print(df.head())

# Filter
it_staff = df[df["Department"] == "IT"]

# Create bonus
df["Bonus"] = df["Salary"] * 0.10

# Save report
df.to_excel(
    "monthly_report.xlsx",
    index=False
)

What Happened?

  1. Read data from Excel.
  2. Filtered required records.
  3. Calculated new values.
  4. Generated a report.
  5. Saved the report back to Excel.

This is one of the most common ways Pandas is used in real office jobs, data analysis roles, HR reporting, finance reporting, and business automation.


Key Takeaway

Pandas acts like an automated Excel worker.

Instead of manually filtering, calculating, and creating reports in Excel, you write a few lines of Python and let Pandas do it automatically—even for files with hundreds of thousands of rows.




Pandas acts like an automated Excel worker. Instead of manually filtering, calculating, and creating reports in Excel, you write a few lines of Python and let Pandas do it automatically, even for files with hundreds of thousands of rows.



NumPy vs Pandas

Overview

Both NumPy and Pandas are essential Python libraries for data processing.

  • NumPy is designed for fast numerical and mathematical computations.
  • Pandas is designed for data analysis and manipulation of tabular data.

Think of it this way:

NumPy  → Numbers and Mathematics
Pandas → Tables and Data Analysis

Real-World Analogy

Suppose you have an Excel sheet:

ID Name Salary
101 John 50000
102 Alice 60000

NumPy View

[
 [101, "John", 50000],
 [102, "Alice", 60000]
]

NumPy only sees rows and columns.

You must remember:

Column 0 = ID
Column 1 = Name
Column 2 = Salary

Pandas View

ID   Name   Salary
101  John   50000
102  Alice  60000

Pandas understands:

  • Column names
  • Row indexes
  • Missing values
  • Data types
  • Spreadsheet-like operations

Comparison Table

Feature NumPy Pandas
Main Object ndarray DataFrame, Series
Purpose Numerical Computing Data Analysis
Column Names No Yes
Row Labels No Yes
CSV Support No Yes
Excel Support No Yes
Missing Data Handling Basic Advanced
Grouping Data No Yes
Performance Faster Slightly Slower
Memory Usage Lower Higher

Example 1: Creating Data

NumPy

import numpy as np

data = np.array([
    [101, 50000],
    [102, 60000],
    [103, 70000]
])

print(data)

Output:

[[  101 50000]
 [  102 60000]
 [  103 70000]]

Pandas

import pandas as pd

df = pd.DataFrame({
    "Employee_ID": [101, 102, 103],
    "Salary": [50000, 60000, 70000]
})

print(df)

Output:

   Employee_ID  Salary
0          101   50000
1          102   60000
2          103   70000

Example 2: Average Salary

NumPy

import numpy as np

salaries = np.array([
    50000,
    60000,
    70000
])

print(np.mean(salaries))

Output:

60000.0

Pandas

import pandas as pd

df = pd.DataFrame({
    "Salary": [50000, 60000, 70000]
})

print(df["Salary"].mean())

Output:

60000.0

Example 3: Filter Employees

Find employees earning more than 60,000.

NumPy

import numpy as np

data = np.array([
    [101, 50000],
    [102, 60000],
    [103, 70000]
])

result = data[data[:,1] > 60000]

print(result)

Output:

[[  103 70000]]

Notice:

data[:,1]

You must know that column index 1 contains salary.


Pandas

import pandas as pd

df = pd.DataFrame({
    "Employee_ID": [101, 102, 103],
    "Salary": [50000, 60000, 70000]
})

result = df[df["Salary"] > 60000]

print(result)

Output:

   Employee_ID  Salary
2          103   70000

More readable and easier to maintain.


Example 4: Reading Excel Files

NumPy

NumPy does not directly support Excel files.

Additional libraries are required.


Pandas

import pandas as pd

df = pd.read_excel(
    "employees.xlsx"
)

One line of code.


Example 5: Department-Wise Salary Report

Data:

Name Department Salary
John IT 50000
Alice HR 60000
Bob IT 70000

NumPy

Requires custom loops and logic.

Not straightforward.


Pandas

df.groupby(
    "Department"
)["Salary"].mean()

Output:

Department
HR    60000
IT    60000

Very simple.


Performance Comparison

NumPy

import numpy as np

arr = np.arange(1000000)

arr * 2

Extremely fast.

Optimized for numerical calculations.


Pandas

import pandas as pd

s = pd.Series(
    range(1000000)
)

s * 2

Fast, but slightly slower because Pandas provides additional functionality.


When to Use NumPy

Use NumPy when working with:

  • Mathematical calculations
  • Linear algebra
  • Matrix operations
  • Statistics
  • Scientific computing
  • Machine Learning internals

Example:

import numpy as np

matrix = np.array([
    [1, 2],
    [3, 4]
])

print(matrix.dot(matrix))

Output:

[[ 7 10]
 [15 22]]

When to Use Pandas

Use Pandas when working with:

  • Excel files
  • CSV files
  • Business reports
  • Data cleaning
  • Missing values
  • Data analysis
  • Data Science projects

Example:

import pandas as pd

df = pd.read_csv(
    "sales.csv"
)

report = (
    df.groupby("Region")
      ["Sales"]
      .sum()
)

print(report)

Relationship Between NumPy and Pandas

Pandas is built on top of NumPy.

Python
│
├── NumPy
│
└── Pandas

Internally, Pandas uses NumPy arrays.

Example:

import pandas as pd

df = pd.DataFrame({
    "Salary": [50000, 60000]
})

print(
    type(
        df["Salary"].values
    )
)

Output:

<class 'numpy.ndarray'>

Real-World Use Cases

NumPy

  • Physics simulations
  • Engineering calculations
  • Machine Learning algorithms
  • Scientific research
  • Mathematical modeling

Example:

velocity = np.array(
    [10, 20, 30]
)

distance = velocity * 5

Pandas

  • HR reports
  • Sales reports
  • Financial analysis
  • Customer analytics
  • Data cleaning
  • Business dashboards

Example:

sales.groupby(
    "Product"
)["Quantity"].sum()

Interview Answer

What is the difference between NumPy and Pandas?

NumPy is a library for fast numerical and mathematical computations using multidimensional arrays. Pandas is a library for data analysis and manipulation that provides DataFrame and Series objects. Pandas is built on top of NumPy and adds support for column names, indexing, grouping, file handling, and missing-value management.


Quick Summary

Situation Recommended Library
Mathematical Calculations NumPy
Matrix Operations NumPy
Scientific Computing NumPy
Excel Analysis Pandas
CSV Processing Pandas
Data Cleaning Pandas
Business Reporting Pandas
Data Science Projects Both

Rule of Thumb

Need mathematics?
→ Use NumPy

Need Excel/CSV analysis?
→ Use Pandas

Need Data Science?
→ Use Both


0 Likes
12 Views

Filters

No filters available for this view.

Reset All