Python Pandas Notes (Beginner to Advanced)
What is Pandas?
Pandas is a powerful Python library used for:
- Data Analysis
- Data Cleaning
- Data Manipulation
- Data Transformation
- Data Science
- Machine Learning Data Preparation
Installation
pip install pandas
Import Pandas
import pandas as pd
1. Pandas Data Structures
Pandas provides two primary data structures:
| Structure | Description |
|---|---|
| Series | One-dimensional labeled array |
| DataFrame | Two-dimensional table |
2. Pandas Series
Creating a Series
import pandas as pd
s = pd.Series([10, 20, 30, 40])
print(s)
Output:
0 10
1 20
2 30
3 40
dtype: int64
Series with Custom Index
s = pd.Series(
[10, 20, 30],
index=["A", "B", "C"]
)
print(s)
Access Value:
print(s["B"])
3. DataFrame
A DataFrame is a table of rows and columns.
Create DataFrame from Dictionary
data = {
"Name": ["John", "Alice", "Bob"],
"Age": [25, 30, 35]
}
df = pd.DataFrame(data)
print(df)
Output:
Name Age
0 John 25
1 Alice 30
2 Bob 35
4. Reading Data Files
CSV
df = pd.read_csv("data.csv")
Excel
df = pd.read_excel("data.xlsx")
Install Excel support:
pip install openpyxl
JSON
df = pd.read_json("data.json")
5. Viewing Data
First Rows
df.head()
df.head(10)
Last Rows
df.tail()
Shape
df.shape
Example:
(100, 5)
Columns
df.columns
Data Types
df.dtypes
Information
df.info()
Statistical Summary
df.describe()
6. Selecting Data
Single Column
df["Name"]
Multiple Columns
df[["Name", "Age"]]
Row Selection using iloc
df.iloc[0]
df.iloc[0:3]
Row Selection using loc
df.loc[0]
7. Filtering Data
Single Condition
df[df["Age"] > 25]
Multiple Conditions (AND)
df[
(df["Age"] > 25)
&
(df["Age"] < 40)
]
OR Condition
df[
(df["Age"] < 25)
|
(df["Age"] > 50)
]
8. Adding Columns
Add New Column
df["Salary"] = [50000, 60000, 70000]
Derived Column
df["Bonus"] = df["Salary"] * 0.10
9. Updating Data
Update Single Value
df.loc[0, "Age"] = 28
Update Multiple Rows
df.loc[
df["Age"] > 30,
"Category"
] = "Senior"
10. Deleting Data
Delete Column
df.drop("Salary", axis=1)
Permanent:
df.drop(
"Salary",
axis=1,
inplace=True
)
Delete Row
df.drop(0)
11. Handling Missing Values
Detect Missing Values
df.isnull()
Count Missing Values
df.isnull().sum()
Remove Missing Rows
df.dropna()
Fill Missing Values
df.fillna(0)
Fill with Mean:
df["Age"].fillna(
df["Age"].mean()
)
12. Sorting Data
Ascending
df.sort_values("Age")
Descending
df.sort_values(
"Age",
ascending=False
)
Multiple Columns
df.sort_values(
["Age", "Salary"]
)
13. Renaming Columns
df.rename(
columns={
"Age": "Employee_Age"
}
)
Permanent:
df.rename(
columns={
"Age": "Employee_Age"
},
inplace=True
)
14. String Operations
Uppercase
df["Name"].str.upper()
Lowercase
df["Name"].str.lower()
Contains
df["Name"].str.contains("John")
Replace
df["Name"].str.replace(
"John",
"Johnny"
)
15. Apply Function
def grade(age):
if age > 30:
return "Senior"
return "Junior"
df["Category"] = df["Age"].apply(grade)
Using Lambda:
df["Double"] = df["Age"].apply(
lambda x: x * 2
)
16. GroupBy
Mean Salary by Department
df.groupby(
"Department"
)["Salary"].mean()
Multiple Aggregations
df.groupby(
"Department"
).agg({
"Salary": [
"sum",
"mean",
"max"
],
"Age": "mean"
})
17. Pivot Table
pd.pivot_table(
df,
values="Salary",
index="Department",
aggfunc="mean"
)
18. Merge DataFrames
Inner Join
pd.merge(
df1,
df2,
on="ID"
)
Left Join
pd.merge(
df1,
df2,
on="ID",
how="left"
)
Right Join
pd.merge(
df1,
df2,
on="ID",
how="right"
)
Outer Join
pd.merge(
df1,
df2,
on="ID",
how="outer"
)
19. Concatenation
Vertical Concatenation
pd.concat([df1, df2])
Horizontal Concatenation
pd.concat(
[df1, df2],
axis=1
)
20. Date and Time Operations
Convert to Datetime:
df["Date"] = pd.to_datetime(
df["Date"]
)
Extract Year:
df["Date"].dt.year
Extract Month:
df["Date"].dt.month
Extract Day:
df["Date"].dt.day
21. Index Operations
Set Index
df.set_index("ID")
Reset Index
df.reset_index()
22. MultiIndex
df.set_index(
[
"Department",
"Employee"
]
)
Access:
df.loc["IT"]
23. Advanced Aggregation
df.groupby(
"Department"
).agg(
total_salary=(
"Salary",
"sum"
),
avg_salary=(
"Salary",
"mean"
),
max_salary=(
"Salary",
"max"
)
)
24. Window Functions
Rolling Average
df["RollingMean"] = (
df["Sales"]
.rolling(3)
.mean()
)
Rolling Sum
df["RollingSum"] = (
df["Sales"]
.rolling(3)
.sum()
)
25. Cumulative Functions
Cumulative Sum
df["CumSum"] = (
df["Sales"]
.cumsum()
)
Cumulative Maximum
df["CumMax"] = (
df["Sales"]
.cummax()
)
Cumulative Minimum
df["CumMin"] = (
df["Sales"]
.cummin()
)
26. Explode Lists
df = pd.DataFrame({
"Name": ["John", "Alice"],
"Skills": [
["Python", "SQL"],
["Java"]
]
})
df.explode("Skills")
27. Performance Optimization
Convert to Category
df["Department"] = (
df["Department"]
.astype("category")
)
Memory Usage
df.memory_usage(
deep=True
)
28. Method Chaining
result = (
df
.dropna()
.query("Age > 25")
.sort_values("Salary")
)
29. Export Data
CSV
df.to_csv(
"output.csv",
index=False
)
Excel
df.to_excel(
"output.xlsx",
index=False
)
JSON
df.to_json(
"output.json"
)
30. Real-World Workflow Example
import pandas as pd
df = pd.read_csv(
"employees.csv"
)
df.dropna(
inplace=True
)
df = df[
df["Salary"] > 50000
]
df["Bonus"] = (
df["Salary"] * 0.10
)
summary = (
df.groupby("Department")
.agg({
"Salary": "mean",
"Bonus": "sum"
})
)
summary.to_excel(
"report.xlsx"
)
Pandas Interview Questions
Difference Between loc and iloc
| loc | iloc |
|---|---|
| Label Based | Position Based |
| Uses Names | Uses Integer Index |
Difference Between merge and concat
| merge | concat |
|---|---|
| SQL-style Join | Stacking Data |
| Requires Key | No Key Required |
Difference Between Series and DataFrame
| Series | DataFrame |
|---|---|
| 1-Dimensional | 2-Dimensional |
| Single Column | Multiple Columns |
Missing Value Functions
dropna()
fillna()
interpolate()
Pandas Cheat Sheet
pd.read_csv()
pd.read_excel()
df.head()
df.info()
df.describe()
df.shape
df.columns
df.loc[]
df.iloc[]
df.groupby()
df.merge()
df.concat()
df.pivot_table()
df.dropna()
df.fillna()
df.sort_values()
df.to_csv()
df.to_excel()
Recommended Learning Roadmap
- Series
- DataFrame
- Reading Files
- Selection & Filtering
- Missing Values
- GroupBy
- Merge & Join
- Pivot Tables
- DateTime Operations
- Window Functions
- MultiIndex
- Performance Optimization
- Real-world Data Analysis Projects
- Interview Preparation
Pandas Simple Examples and Real-World Use Cases
What Problem Does Pandas Solve?
Imagine you have thousands of rows of data:
| Name | Age | Salary |
|---|---|---|
| John | 25 | 50000 |
| Alice | 30 | 60000 |
| Bob | 35 | 70000 |
If you use normal Python lists, finding information becomes difficult as the data grows.
Pandas allows you to:
- Read data from CSV/Excel files
- Filter data
- Clean missing values
- Calculate statistics
- Generate reports
- Prepare data for Machine Learning
Example 1: Employee Salary Report
Real-World Scenario
An HR department stores employee information in an Excel file.
They want to find:
- Employees earning more than ₹50,000
- Average salary
- Highest salary
Data
import pandas as pd
data = {
"Name": ["John", "Alice", "Bob"],
"Salary": [50000, 60000, 70000]
}
df = pd.DataFrame(data)
print(df)
Output:
Name Salary
0 John 50000
1 Alice 60000
2 Bob 70000
Employees earning more than 50,000
print(df[df["Salary"] > 50000])
Output:
Name Salary
1 Alice 60000
2 Bob 70000
Average Salary
print(df["Salary"].mean())
Output:
60000
Maximum Salary
print(df["Salary"].max())
Output:
70000
Example 2: Student Marks Analysis
Real-World Scenario
A teacher wants to:
- Find the highest mark
- Calculate average marks
- Identify students who passed
Data
import pandas as pd
data = {
"Student": ["Rahul", "Priya", "Amit"],
"Marks": [85, 40, 65]
}
df = pd.DataFrame(data)
print(df)
Output:
Student Marks
0 Rahul 85
1 Priya 40
2 Amit 65
Average Marks
print(df["Marks"].mean())
Output:
63.33
Students Who Passed
passed = df[df["Marks"] >= 50]
print(passed)
Output:
Student Marks
0 Rahul 85
2 Amit 65
Example 3: Online Shop Sales Report
Real-World Scenario
An e-commerce company wants to know:
- Total sales
- Best-selling day
Data
import pandas as pd
sales = {
"Day": ["Mon", "Tue", "Wed"],
"Amount": [1000, 1500, 2000]
}
df = pd.DataFrame(sales)
print(df)
Total Sales
print(df["Amount"].sum())
Output:
4500
Highest Sale
print(df["Amount"].max())
Output:
2000
Example 4: Reading Data from CSV File
employees.csv
Name,Age,Salary
John,25,50000
Alice,30,60000
Bob,35,70000
Read CSV
import pandas as pd
df = pd.read_csv("employees.csv")
print(df)
Output:
Name Age Salary
0 John 25 50000
1 Alice 30 60000
2 Bob 35 70000
Why This Is Useful
Instead of manually reading each line from a file, Pandas loads the entire file into a table instantly.
Example 5: Handling Missing Data
Real-World Scenario
A customer database contains missing ages.
import pandas as pd
data = {
"Name": ["John", "Alice", "Bob"],
"Age": [25, None, 35]
}
df = pd.DataFrame(data)
print(df)
Output:
Name Age
0 John 25.0
1 Alice NaN
2 Bob 35.0
Count Missing Values
print(df.isnull().sum())
Output:
Name 0
Age 1
dtype: int64
Fill Missing Value
df["Age"] = df["Age"].fillna(30)
print(df)
Output:
Name Age
0 John 25.0
1 Alice 30.0
2 Bob 35.0
Example 6: Grouping Data
Real-World Scenario
A company wants average salary by department.
import pandas as pd
data = {
"Department": ["IT", "IT", "HR", "HR"],
"Salary": [50000, 60000, 40000, 45000]
}
df = pd.DataFrame(data)
result = df.groupby("Department")["Salary"].mean()
print(result)
Output:
Department
HR 42500
IT 55000
Name: Salary, dtype: float64
Why Companies Use This
Managers can quickly see:
- Average salary
- Total salary expense
- Department performance
Example 7: Creating a New Column
Real-World Scenario
An organization gives a 10% bonus.
import pandas as pd
data = {
"Name": ["John", "Alice"],
"Salary": [50000, 60000]
}
df = pd.DataFrame(data)
df["Bonus"] = df["Salary"] * 0.10
print(df)
Output:
Name Salary Bonus
0 John 50000 5000.0
1 Alice 60000 6000.0
Where Pandas Is Used in Industry
Banking
- Customer transaction analysis
- Fraud detection
- Loan data analysis
Example:
transactions.groupby("Customer")["Amount"].sum()
E-Commerce
- Sales reports
- Customer behavior analysis
- Product performance tracking
Example:
orders.groupby("Product")["Quantity"].sum()
Healthcare
- Patient records
- Medical research
- Hospital reporting
Example:
patients["Age"].mean()
Education
- Student marks analysis
- Attendance tracking
- Exam reports
Example:
students["Marks"].mean()
Data Science & Machine Learning
Pandas is usually the first library used before machine learning.
Typical workflow:
import pandas as pd
# Load data
df = pd.read_csv("data.csv")
# Clean data
df.dropna(inplace=True)
# Analyze data
print(df.describe())
# Send to ML model
Quick Summary
| Task | Pandas Function |
|---|---|
| Read CSV | read_csv() |
| Read Excel | read_excel() |
| View Data | head() |
| Statistics | describe() |
| Filter Rows | df[df["col"] > value] |
| Group Data | groupby() |
| Remove Missing Data | dropna() |
| Fill Missing Data | fillna() |
| Sort Data | sort_values() |
| Export CSV | to_csv() |
| Export Excel | to_excel() |
Remember
Pandas is essentially a super-powered spreadsheet inside Python.
If Excel can do it manually, Pandas can usually do it automatically and much faster on thousands or millions of rows of data.
Example: Working with Excel Data Using Pandas
Scenario
Suppose you work in an office and receive an Excel file every month named:
employees.xlsx
| Employee ID | Name | Department | Salary |
|---|---|---|---|
| 101 | John | IT | 50000 |
| 102 | Alice | HR | 60000 |
| 103 | Bob | IT | 55000 |
| 104 | Emma | Finance | 70000 |
Management asks:
- Show all IT employees
- Calculate average salary
- Find the highest-paid employee
- Generate a department-wise salary report
Step 1: Read Excel File
import pandas as pd
df = pd.read_excel("employees.xlsx")
print(df)
Output
Employee ID Name Department Salary
0 101 John IT 50000
1 102 Alice HR 60000
2 103 Bob IT 55000
3 104 Emma Finance 70000
Step 2: Show IT Employees
it_employees = df[df["Department"] == "IT"]
print(it_employees)
Output
Employee ID Name Department Salary
0 101 John IT 50000
2 103 Bob IT 55000
Step 3: Calculate Average Salary
avg_salary = df["Salary"].mean()
print(avg_salary)
Output
58750.0
Step 4: Find Highest Salary
highest_salary = df["Salary"].max()
print(highest_salary)
Output
70000
Step 5: Find Highest Paid Employee
highest_paid = df[df["Salary"] == df["Salary"].max()]
print(highest_paid)
Output
Employee ID Name Department Salary
3 104 Emma Finance 70000
Step 6: Department-Wise Salary Report
report = df.groupby("Department")["Salary"].mean()
print(report)
Output
Department
Finance 70000
HR 60000
IT 52500
Name: Salary, dtype: float64
Step 7: Add Bonus Column
Company gives a 10% bonus.
df["Bonus"] = df["Salary"] * 0.10
print(df)
Output
Employee ID Name Department Salary Bonus
0 101 John IT 50000 5000.0
1 102 Alice HR 60000 6000.0
2 103 Bob IT 55000 5500.0
3 104 Emma Finance 70000 7000.0
Step 8: Save Updated Data to Excel
df.to_excel(
"employee_report.xlsx",
index=False
)
This creates a new Excel file:
employee_report.xlsx
with the Bonus column included.
Real Office Use Cases
HR Department
# Employees older than 30
df[df["Age"] > 30]
Used for promotion analysis.
Sales Team
# Total monthly sales
df["Sales"].sum()
Used for monthly reporting.
Finance Team
# Total salary expense
df["Salary"].sum()
Used for budgeting.
Inventory Team
# Products with low stock
df[df["Stock"] < 10]
Used for reordering items.
Typical Daily Workflow
import pandas as pd
# Load Excel file
df = pd.read_excel("employees.xlsx")
# Analyze
print(df.head())
# Filter
it_staff = df[df["Department"] == "IT"]
# Create bonus
df["Bonus"] = df["Salary"] * 0.10
# Save report
df.to_excel(
"monthly_report.xlsx",
index=False
)
What Happened?
- Read data from Excel.
- Filtered required records.
- Calculated new values.
- Generated a report.
- Saved the report back to Excel.
This is one of the most common ways Pandas is used in real office jobs, data analysis roles, HR reporting, finance reporting, and business automation.
Key Takeaway
Pandas acts like an automated Excel worker.
Instead of manually filtering, calculating, and creating reports in Excel, you write a few lines of Python and let Pandas do it automatically—even for files with hundreds of thousands of rows.
Pandas acts like an automated Excel worker. Instead of manually filtering, calculating, and creating reports in Excel, you write a few lines of Python and let Pandas do it automatically, even for files with hundreds of thousands of rows.
NumPy vs Pandas
Overview
Both NumPy and Pandas are essential Python libraries for data processing.
- NumPy is designed for fast numerical and mathematical computations.
- Pandas is designed for data analysis and manipulation of tabular data.
Think of it this way:
NumPy → Numbers and Mathematics
Pandas → Tables and Data Analysis
Real-World Analogy
Suppose you have an Excel sheet:
| ID | Name | Salary |
|---|---|---|
| 101 | John | 50000 |
| 102 | Alice | 60000 |
NumPy View
[
[101, "John", 50000],
[102, "Alice", 60000]
]
NumPy only sees rows and columns.
You must remember:
Column 0 = ID
Column 1 = Name
Column 2 = Salary
Pandas View
ID Name Salary
101 John 50000
102 Alice 60000
Pandas understands:
- Column names
- Row indexes
- Missing values
- Data types
- Spreadsheet-like operations
Comparison Table
| Feature | NumPy | Pandas |
|---|---|---|
| Main Object | ndarray | DataFrame, Series |
| Purpose | Numerical Computing | Data Analysis |
| Column Names | No | Yes |
| Row Labels | No | Yes |
| CSV Support | No | Yes |
| Excel Support | No | Yes |
| Missing Data Handling | Basic | Advanced |
| Grouping Data | No | Yes |
| Performance | Faster | Slightly Slower |
| Memory Usage | Lower | Higher |
Example 1: Creating Data
NumPy
import numpy as np
data = np.array([
[101, 50000],
[102, 60000],
[103, 70000]
])
print(data)
Output:
[[ 101 50000]
[ 102 60000]
[ 103 70000]]
Pandas
import pandas as pd
df = pd.DataFrame({
"Employee_ID": [101, 102, 103],
"Salary": [50000, 60000, 70000]
})
print(df)
Output:
Employee_ID Salary
0 101 50000
1 102 60000
2 103 70000
Example 2: Average Salary
NumPy
import numpy as np
salaries = np.array([
50000,
60000,
70000
])
print(np.mean(salaries))
Output:
60000.0
Pandas
import pandas as pd
df = pd.DataFrame({
"Salary": [50000, 60000, 70000]
})
print(df["Salary"].mean())
Output:
60000.0
Example 3: Filter Employees
Find employees earning more than 60,000.
NumPy
import numpy as np
data = np.array([
[101, 50000],
[102, 60000],
[103, 70000]
])
result = data[data[:,1] > 60000]
print(result)
Output:
[[ 103 70000]]
Notice:
data[:,1]
You must know that column index 1 contains salary.
Pandas
import pandas as pd
df = pd.DataFrame({
"Employee_ID": [101, 102, 103],
"Salary": [50000, 60000, 70000]
})
result = df[df["Salary"] > 60000]
print(result)
Output:
Employee_ID Salary
2 103 70000
More readable and easier to maintain.
Example 4: Reading Excel Files
NumPy
NumPy does not directly support Excel files.
Additional libraries are required.
Pandas
import pandas as pd
df = pd.read_excel(
"employees.xlsx"
)
One line of code.
Example 5: Department-Wise Salary Report
Data:
| Name | Department | Salary |
|---|---|---|
| John | IT | 50000 |
| Alice | HR | 60000 |
| Bob | IT | 70000 |
NumPy
Requires custom loops and logic.
Not straightforward.
Pandas
df.groupby(
"Department"
)["Salary"].mean()
Output:
Department
HR 60000
IT 60000
Very simple.
Performance Comparison
NumPy
import numpy as np
arr = np.arange(1000000)
arr * 2
Extremely fast.
Optimized for numerical calculations.
Pandas
import pandas as pd
s = pd.Series(
range(1000000)
)
s * 2
Fast, but slightly slower because Pandas provides additional functionality.
When to Use NumPy
Use NumPy when working with:
- Mathematical calculations
- Linear algebra
- Matrix operations
- Statistics
- Scientific computing
- Machine Learning internals
Example:
import numpy as np
matrix = np.array([
[1, 2],
[3, 4]
])
print(matrix.dot(matrix))
Output:
[[ 7 10]
[15 22]]
When to Use Pandas
Use Pandas when working with:
- Excel files
- CSV files
- Business reports
- Data cleaning
- Missing values
- Data analysis
- Data Science projects
Example:
import pandas as pd
df = pd.read_csv(
"sales.csv"
)
report = (
df.groupby("Region")
["Sales"]
.sum()
)
print(report)
Relationship Between NumPy and Pandas
Pandas is built on top of NumPy.
Python
│
├── NumPy
│
└── Pandas
Internally, Pandas uses NumPy arrays.
Example:
import pandas as pd
df = pd.DataFrame({
"Salary": [50000, 60000]
})
print(
type(
df["Salary"].values
)
)
Output:
<class 'numpy.ndarray'>
Real-World Use Cases
NumPy
- Physics simulations
- Engineering calculations
- Machine Learning algorithms
- Scientific research
- Mathematical modeling
Example:
velocity = np.array(
[10, 20, 30]
)
distance = velocity * 5
Pandas
- HR reports
- Sales reports
- Financial analysis
- Customer analytics
- Data cleaning
- Business dashboards
Example:
sales.groupby(
"Product"
)["Quantity"].sum()
Interview Answer
What is the difference between NumPy and Pandas?
NumPy is a library for fast numerical and mathematical computations using multidimensional arrays. Pandas is a library for data analysis and manipulation that provides DataFrame and Series objects. Pandas is built on top of NumPy and adds support for column names, indexing, grouping, file handling, and missing-value management.
Quick Summary
| Situation | Recommended Library |
|---|---|
| Mathematical Calculations | NumPy |
| Matrix Operations | NumPy |
| Scientific Computing | NumPy |
| Excel Analysis | Pandas |
| CSV Processing | Pandas |
| Data Cleaning | Pandas |
| Business Reporting | Pandas |
| Data Science Projects | Both |
Rule of Thumb
Need mathematics?
→ Use NumPy
Need Excel/CSV analysis?
→ Use Pandas
Need Data Science?
→ Use Both