Data Manipulation with pandas

Learning Outcomes

  • Load tabular data into pandas with clear, reliable labels.
  • Select and filter data safely using .loc and .iloc.
  • Calculate statistics with pandas.DataFrame.mean and pandas.DataFrame.std.
  • Create plots using pandas.DataFrame.plot.
  • Handle missing values.
  • Create new columns using vectorised operations and simple functions.
  • Combine many CSVs into one table with concat.
  • Summarise data with groupby/agg.
  • Export cleaned results for downstream analysis.

Questions

  1. Why use pandas instead of plain lists for table-like data?
  2. How do I quickly filter and transform rows and columns?
  3. How can I summarise data by groups for analysis and plotting?

Structure & Agenda

  1. DataFrame fundamentals (~20 min)
  2. Selecting and filtering data (~25 min)
  3. Transforming and cleaning (~25 min)
  4. Grouped summaries and export (~20 min)

🔧 Activities spaced throughout the session

Introduction to pandas

  • pandas is a widely used Python library for data analysis, particularly of tabular data.
  • pandas provides the DataFrame, a labelled, table-like data structure.
  • pandas builds on NumPy and Matplotlib, adding labels, missing-data handling, reshaping, grouping, and other functionality that simplifies data manipulation and analysis.

See the pandas docs for more information.

From Arrays to DataFrames

We pass header=None to read_csv because there is no header row in our data; otherwise pandas would treat the first data row as column names.

Set DataFrame.columns to assign human-readable column names.
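A minimal sketch of both steps, using a few synthetic placeholder rows in place of the lesson's inflammation CSV:

```python
import io

import pandas as pd

# The inflammation CSVs have no header row, so pandas would otherwise
# treat the first data row as column names.
csv_text = "0,0,1,3\n0,1,2,1\n0,1,1,2\n"  # placeholder rows
df = pd.read_csv(io.StringIO(csv_text), header=None)

# Assign human-readable labels: one "day_N" column per day.
df.columns = [f"day_{i}" for i in range(df.shape[1])]
print(df.columns.tolist())  # ['day_0', 'day_1', 'day_2', 'day_3']
```

The list comprehension scales to any number of days because it reads the width from df.shape[1] rather than hard-coding it.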

List Comprehensions

List comprehensions are a concise way to create lists. They replace longer for loops and can make code more readable. A list comprehension has the basic form [expression for item in iterable].

An equivalent for loop for the list comprehension used above would be:
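Assuming the comprehension in question is the day_N column-naming one, the spelled-out loop looks like this:

```python
# Equivalent for loop for: [f"day_{i}" for i in range(n_days)]
n_days = 4  # placeholder; in the lesson this would be df.shape[1]
column_names = []
for i in range(n_days):
    column_names.append(f"day_{i}")
print(column_names)  # ['day_0', 'day_1', 'day_2', 'day_3']
```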

F Strings

Python f-strings format strings by embedding expressions inside {} placeholders; the expressions are evaluated and their results inserted into the string.
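A small sketch with placeholder values:

```python
patient = 10
max_reading = 18.0

# Expressions inside {} are evaluated; :.1f formats to one decimal place.
message = f"Patient {patient} peaked at {max_reading:.1f}"
print(message)  # Patient 10 peaked at 18.0
```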

DataFrame.info() quickly shows wrong dtypes or unexpected missing values.

Use DataFrame.describe() to get summary statistics about data.
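A quick inspection sketch on a tiny stand-in table with one missing value:

```python
import pandas as pd

# Placeholder table; in the lesson this would come from inflammation-01.csv.
df = pd.DataFrame({"day_0": [0, 0, 0], "day_1": [0.0, 1.0, None]})

df.info()             # dtypes and non-null counts per column
print(df.describe())  # count, mean, std, min, quartiles, max per column
```

Here info() shows that day_1 has only two non-null values, which describe() confirms in its count row.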

Tip

Clear, human-readable column names help avoid off-by-one indexing mistakes.

Challenge: Inspection

How many patients and days are in inflammation-01.csv?

Selecting Data

Use single brackets (df["col"]) to get one column as a Series; pass a list of names in brackets (df[["a", "b"]]) to get a DataFrame of several columns.

Use .iloc (by position) and .loc (by label).

Use boolean masks to filter rows.
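The three selection styles above, sketched on placeholder data (the patient labels p0–p2 are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame(
    {"day_0": [0, 1, 2], "day_1": [3, 2, 1]},
    index=["p0", "p1", "p2"],
)

series = df["day_0"]               # single brackets -> Series
subframe = df[["day_0", "day_1"]]  # list in brackets -> DataFrame

by_pos = df.iloc[0, 1]             # row 0, column 1 by integer position
by_label = df.loc["p0", "day_1"]   # the same cell, by label

mask = df["day_0"] > 0             # boolean Series, one flag per row
filtered = df[mask]                # keeps only rows where mask is True
print(filtered.index.tolist())     # ['p1', 'p2']
```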

Tip

  • Read: use attribute access or [].
  • Write: use .loc[...] = value.
  • axis=0 → down rows; axis=1 → across columns.
  • Avoid chained indexing such as df[df["day_0"]>0]["day_1"]=...; it may write to a temporary copy and trigger SettingWithCopyWarning.
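A minimal sketch of the safe write pattern, on placeholder data:

```python
import pandas as pd

df = pd.DataFrame({"day_0": [0, 2], "day_1": [5, 5]})

# Correct: one .loc call that selects the rows and the target column together,
# so the assignment hits the original DataFrame, not a copy.
df.loc[df["day_0"] > 0, "day_1"] = 0
print(df["day_1"].tolist())  # [5, 0]
```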

Challenge: Selection

Get patient 10’s readings for days 10–19 using both .iloc and .loc.

Stats & Plots with pandas

Compute per-day mean/min/max, then plot.
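A sketch of that workflow on a tiny placeholder table (the Agg backend line is only there so the example runs without a display):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; safe to omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder data standing in for an inflammation table.
df = pd.DataFrame({"day_0": [0, 0, 0], "day_1": [1, 2, 3], "day_2": [2, 2, 5]})

# axis=0 aggregates down the rows, giving one value per day (column).
summary = pd.DataFrame(
    {"mean": df.mean(axis=0), "min": df.min(axis=0), "max": df.max(axis=0)}
)

ax = summary.plot()  # thin wrapper around Matplotlib
ax.set_xlabel("Day")
ax.set_ylabel("Inflammation")
plt.savefig("per_day_summary.png")
print(summary["mean"].tolist())  # [0.0, 2.0, 3.0]
```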

Tip

  • Methods mirror NumPy: mean, std, idxmax, etc.
  • DataFrame.plot() is a thin wrapper around Matplotlib; it is fast for exploration.

Challenge: Plotting

Recreate the mean/min/max per-day plot with .plot(), adding axis labels.

Creating New Columns

Challenge: Create

Make day_0_diff = day_0 - day_1 and week1_std as sample SD of days 0–6.

We can also use assign to create new columns. This is slightly more verbose, but has the advantage of leaving df unmodified until you rebind the result, which makes chained, declarative code easier to write.
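A sketch of assign on placeholder data; note that the original df is untouched:

```python
import pandas as pd

df = pd.DataFrame({"day_0": [5, 3], "day_1": [2, 4]})

# assign returns a new DataFrame; df itself is left unmodified, which
# makes it easy to chain several steps declaratively.
result = df.assign(day_0_diff=lambda d: d["day_0"] - d["day_1"])
print(result["day_0_diff"].tolist())  # [3, -1]
print("day_0_diff" in df.columns)     # False
```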

User-Defined Functions

Vectorised where possible:
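A sketch with an invented rescale function applied to a whole column at once:

```python
import pandas as pd

df = pd.DataFrame({"day_0": [0, 4], "day_1": [2, 6]})

def rescale(x):
    # Works on a whole Series at once thanks to NumPy-style broadcasting.
    return x * 2 + 1

df["day_0_scaled"] = rescale(df["day_0"])
print(df["day_0_scaled"].tolist())  # [1, 9]
```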

Row-wise if necessary:
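A sketch using apply with axis=1, with an invented per-row classification rule:

```python
import pandas as pd

df = pd.DataFrame({"day_0": [0, 4], "day_1": [2, 6]})

def classify(row):
    # Branching logic like this is awkward to vectorise, so apply it per row.
    return "high" if row.max() > 5 else "low"

df["level"] = df.apply(classify, axis=1)
print(df["level"].tolist())  # ['low', 'high']
```

Row-wise apply is much slower than vectorised operations, so reserve it for logic that genuinely needs it.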

Handling Missing Values
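A sketch of the core missing-value tools (isna, dropna, fillna) on a small placeholder table:

```python
import pandas as pd

df = pd.DataFrame({"day_0": [0.0, None, 2.0], "day_1": [1.0, 3.0, None]})

print(df.isna().sum().tolist())  # missing values per column: [1, 1]

dropped = df.dropna()     # keep only complete rows
filled = df.fillna(0.0)   # or substitute a chosen value

print(len(dropped), int(filled.isna().sum().sum()))  # 1 0
```

Whether to drop or fill depends on the analysis: dropping discards information, while filling can bias statistics, so record whichever choice you make.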

Grouped Summaries

Tip

groupby(...).agg(...) is the standard summarisation pattern.
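The pattern in miniature, with an invented source grouping column:

```python
import pandas as pd

df = pd.DataFrame({"source": ["a", "a", "b"], "day_0": [1.0, 3.0, 5.0]})

# One row per group, one column per requested statistic.
summary = df.groupby("source")["day_0"].agg(["mean", "max"])
print(summary.loc["a", "mean"], summary.loc["b", "max"])  # 2.0 5.0
```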

Sorting

Notice the row labels (patient IDs) are preserved and sorted along with the data.
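A sketch showing that sort_values reorders whole rows, labels included (the week1_mean column is a stand-in for any computed statistic):

```python
import pandas as pd

df = pd.DataFrame({"week1_mean": [2.0, 0.5, 1.5]}, index=[0, 1, 2])
df.index.name = "patient"

ordered = df.sort_values("week1_mean", ascending=False)
# The patient labels travel with their rows.
print(ordered.index.tolist())  # [0, 2, 1]
```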

Exporting Results
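A minimal export sketch; the filename is a placeholder:

```python
import pandas as pd

# Placeholder cleaned table.
df = pd.DataFrame({"patient": [0, 1], "week1_mean": [1.5, 2.0]})

# index=False keeps the positional row labels out of the file.
df.to_csv("cleaned_results.csv", index=False)

round_trip = pd.read_csv("cleaned_results.csv")
print(round_trip.shape)  # (2, 2)
```

Reading the file back is a cheap sanity check that the export contains what you expect.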

Putting It All Together: Combine → Summarise → Export

This workflow loads multiple files, labels them, combines, summarises, and exports. Here it is written deliberately step by step, with generous comments, so each stage is clear.
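A self-contained sketch of the whole pipeline; it writes two tiny stand-in files first, where the lesson would use the real inflammation-*.csv downloads:

```python
import glob

import pandas as pd

# Stand-in files so the sketch runs anywhere; normally these already exist.
pd.DataFrame([[0, 1], [2, 3]]).to_csv(
    "inflammation-demo-1.csv", header=False, index=False)
pd.DataFrame([[4, 5], [6, 7]]).to_csv(
    "inflammation-demo-2.csv", header=False, index=False)

frames = []
for path in sorted(glob.glob("inflammation-demo-*.csv")):
    frame = pd.read_csv(path, header=None)
    frame.columns = [f"day_{i}" for i in range(frame.shape[1])]
    # reset_index turns the row number into an explicit patient column.
    frame = frame.reset_index().rename(columns={"index": "patient"})
    frame["source"] = path  # record which file each row came from
    frames.append(frame)

combined = pd.concat(frames, ignore_index=True)

day_cols = [c for c in combined.columns if c.startswith("day_")]
summary = combined.groupby("source")[day_cols].agg(["mean", "min", "max"])

# Flatten the (day, stat) column MultiIndex for a friendlier CSV header.
summary.columns = [f"{day}_{stat}" for day, stat in summary.columns]
summary.to_csv("per_day_summary.csv")
print(summary["day_0_mean"].tolist())  # [1.0, 5.0]
```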

Why this pattern?

  • reset_index ensures each patient has an explicit identifier.
  • source preserves provenance across files.
  • Using groupby("source")[day_cols].agg(...) lets you compute many stats in one call.
  • Flattening the column MultiIndex is optional but makes the CSV easier to read.
  • Exporting with to_csv makes results reusable in other tools.

Challenge: Tiny CLI Wrapper

To run the workflow from the shell without using a notebook, save the below as scripts/summarise_inflammation.py:
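A hedged sketch of such a script: the argument names (pattern, --out) mirror the invocation shown in the lesson, and the summary logic follows the combine → summarise → export pattern above.

```python
"""Summarise per-day inflammation statistics across many CSV files."""
import argparse
import glob

import pandas as pd


def main(argv=None):
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("pattern", nargs="?", default="inflammation-*.csv",
                        help="glob pattern, e.g. 'inflammation-*.csv'")
    parser.add_argument("--out", default="per_day_summary.csv",
                        help="path for the summary CSV")
    args = parser.parse_args(argv)

    frames = []
    for path in sorted(glob.glob(args.pattern)):
        frame = pd.read_csv(path, header=None)
        frame.columns = [f"day_{i}" for i in range(frame.shape[1])]
        frame["source"] = path  # keep provenance across files
        frames.append(frame)
    if not frames:
        print(f"No files matched {args.pattern!r}; nothing to do.")
        return

    combined = pd.concat(frames, ignore_index=True)
    day_cols = [c for c in combined.columns if c.startswith("day_")]
    summary = combined.groupby("source")[day_cols].agg(["mean", "min", "max"])
    # Flatten the (day, stat) MultiIndex so the CSV header is plain text.
    summary.columns = [f"{day}_{stat}" for day, stat in summary.columns]
    summary.to_csv(args.out)
    print(f"Wrote {args.out}")


if __name__ == "__main__":
    main()
```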

To run from the command line, call the script with a glob pattern and optional output path:

python scripts/summarise_inflammation.py "inflammation-*.csv" --out out/per_day_summary.csv

Further Information

📚 Keypoints

  • pandas DataFrames are efficient, labelled table structures for analysis.
  • Boolean filtering and column selection are core day-to-day operations.
  • Clean data before summarising (isna, dropna, type checks).
  • groupby plus agg is the main pattern for grouped statistics.
  • Keep outputs reproducible by saving cleaned tables and summaries.

🔦 Hints

  • Inspect data early with head(), info(), and summary statistics.
  • Apply filters in small steps so each transformation is easy to verify.
  • Save cleaned outputs to support reproducibility and peer review.

Module Summary

This module introduces pandas for tabular data workflows. Learners practice selecting columns, filtering rows, handling missing values, grouping data, and exporting clean summaries for downstream analysis.

Additional Learning

The concepts in this module connect directly to practical data handling and exploration in Python.

  • DataFrame Basics (pandas DataFrame): DataFrames are the core structure for real-world tables.
  • Filtering and Selection (Indexing and selecting data): precise selection keeps analyses targeted and correct.
  • Grouped Summaries (groupby): group-wise statistics are essential for comparison tasks.

Attribution

This lesson is derived from materials developed by the Software Carpentry project.

The original content is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license: https://github.com/swcarpentry/python-novice-inflammation/blob/main/LICENSE.md