Data Manipulation with pandas

Learning Outcomes

  • Load tabular data into pandas with clear, reliable labels.
  • Select and filter data safely using .loc and .iloc.
  • Calculate statistics with pandas.DataFrame.mean and pandas.DataFrame.std.
  • Create plots using pandas.DataFrame.plot.
  • Handle missing values.
  • Create new columns using vectorised operations and simple functions.
  • Combine many CSVs into one table with concat.
  • Summarise data with groupby/agg.
  • Export cleaned results for downstream analysis.

Questions

  1. Why use pandas instead of plain lists for table-like data?
  2. How do I quickly filter and transform rows and columns?
  3. How can I summarise data by groups for analysis and plotting?

Structure & Agenda

  1. DataFrame fundamentals (~20 min)
  2. Selecting and filtering data (~25 min)
  3. Transforming and cleaning (~25 min)
  4. Grouped summaries and export (~20 min)

🔧 Activities spaced throughout the session

Introduction to pandas

  • pandas is a widely used Python library for data analysis, particularly of tabular data.
  • pandas provides the DataFrame, a labelled, table-like data structure.
  • pandas builds on NumPy and Matplotlib, adding labels, missing-data handling, reshaping, grouping, and other functionality that simplifies data manipulation and analysis.

See the pandas docs for more information.

From Arrays to DataFrames

We pass header=None to read_csv because there is no header row in our data; otherwise pandas would treat the first data row as column names.

Set DataFrame.columns to assign human-readable column names.
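A minimal sketch of both steps, using a few synthetic placeholder rows in place of the lesson's inflammation CSV:

```python
import io

import pandas as pd

# The inflammation CSVs have no header row, so pandas would otherwise
# treat the first data row as column names.
csv_text = "0,0,1,3\n0,1,2,1\n0,1,1,2\n"  # placeholder rows
df = pd.read_csv(io.StringIO(csv_text), header=None)

# Assign human-readable labels: one "day_N" column per day.
df.columns = [f"day_{i}" for i in range(df.shape[1])]
print(df.columns.tolist())  # ['day_0', 'day_1', 'day_2', 'day_3']
```

The list comprehension scales to any number of days because it reads the width from df.shape[1] rather than hard-coding it.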

List Comprehensions

List comprehensions are a concise way to create lists. They replace longer for loops and can make code more readable. A list comprehension has the basic form [expression for item in iterable].

An equivalent for loop for the list comprehension used above would be:
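Assuming the comprehension in question is the day_N column-naming one, the spelled-out loop looks like this:

```python
# Equivalent for loop for: [f"day_{i}" for i in range(n_days)]
n_days = 4  # placeholder; in the lesson this would be df.shape[1]
column_names = []
for i in range(n_days):
    column_names.append(f"day_{i}")
print(column_names)  # ['day_0', 'day_1', 'day_2', 'day_3']
```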

F Strings

Python f-strings format strings by embedding expressions inside {} placeholders; the expressions are evaluated and their results inserted into the string.
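A small sketch with placeholder values:

```python
patient = 10
max_reading = 18.0

# Expressions inside {} are evaluated; :.1f formats to one decimal place.
message = f"Patient {patient} peaked at {max_reading:.1f}"
print(message)  # Patient 10 peaked at 18.0
```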

DataFrame.info() quickly shows wrong dtypes or unexpected missing values.

Use DataFrame.describe() to get summary statistics about data.
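A quick inspection sketch on a tiny stand-in table with one missing value:

```python
import pandas as pd

# Placeholder table; in the lesson this would come from inflammation-01.csv.
df = pd.DataFrame({"day_0": [0, 0, 0], "day_1": [0.0, 1.0, None]})

df.info()             # dtypes and non-null counts per column
print(df.describe())  # count, mean, std, min, quartiles, max per column
```

Here info() shows that day_1 has only two non-null values, which describe() confirms in its count row.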

Tip

Clear, human-readable column names help avoid off-by-one indexing mistakes.

Challenge: Inspection

How many patients and days are in inflammation-01.csv?

Selecting Data

Use single brackets (df["col"]) to get one column as a Series; pass a list of names in brackets (df[["a", "b"]]) to get a DataFrame of several columns.

Use .iloc (by position) and .loc (by label).

Use boolean masks to filter rows.
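The three selection styles above, sketched on placeholder data (the patient labels p0–p2 are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame(
    {"day_0": [0, 1, 2], "day_1": [3, 2, 1]},
    index=["p0", "p1", "p2"],
)

series = df["day_0"]               # single brackets -> Series
subframe = df[["day_0", "day_1"]]  # list in brackets -> DataFrame

by_pos = df.iloc[0, 1]             # row 0, column 1 by integer position
by_label = df.loc["p0", "day_1"]   # the same cell, by label

mask = df["day_0"] > 0             # boolean Series, one flag per row
filtered = df[mask]                # keeps only rows where mask is True
print(filtered.index.tolist())     # ['p1', 'p2']
```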

Tip

  • Read: use attribute access or [].
  • Write: use .loc[...] = value.
  • axis=0 → down rows; axis=1 → across columns.
  • Avoid chained indexing such as df[df["day_0"]>0]["day_1"]=...; it may write to a temporary copy and trigger SettingWithCopyWarning.
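A minimal sketch of the safe write pattern, on placeholder data:

```python
import pandas as pd

df = pd.DataFrame({"day_0": [0, 2], "day_1": [5, 5]})

# Correct: one .loc call that selects the rows and the target column together,
# so the assignment hits the original DataFrame, not a copy.
df.loc[df["day_0"] > 0, "day_1"] = 0
print(df["day_1"].tolist())  # [5, 0]
```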

Challenge: Selection

Get patient 10’s readings for days 10–19 using both .iloc and .loc.

Stats & Plots with pandas

Compute per-day mean/min/max, then plot.
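A sketch of that workflow on a tiny placeholder table (the Agg backend line is only there so the example runs without a display):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; safe to omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder data standing in for an inflammation table.
df = pd.DataFrame({"day_0": [0, 0, 0], "day_1": [1, 2, 3], "day_2": [2, 2, 5]})

# axis=0 aggregates down the rows, giving one value per day (column).
summary = pd.DataFrame(
    {"mean": df.mean(axis=0), "min": df.min(axis=0), "max": df.max(axis=0)}
)

ax = summary.plot()  # thin wrapper around Matplotlib
ax.set_xlabel("Day")
ax.set_ylabel("Inflammation")
plt.savefig("per_day_summary.png")
print(summary["mean"].tolist())  # [0.0, 2.0, 3.0]
```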

Tip

  • Methods mirror NumPy: mean, std, idxmax, etc.
  • DataFrame.plot() is a thin wrapper around Matplotlib; it is fast for exploration.

Challenge: Plotting

Recreate the mean/min/max per-day plot with .plot(), adding axis labels.

Creating New Columns

Challenge: Create

Make day_0_diff = day_0 - day_1 and week1_std as sample SD of days 0–6.

We can also use assign to create new columns. This is slightly more verbose, but has the advantage of leaving df unmodified until you rebind the result, which makes chained, declarative code easier to write.
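A sketch of assign on placeholder data; note that the original df is untouched:

```python
import pandas as pd

df = pd.DataFrame({"day_0": [5, 3], "day_1": [2, 4]})

# assign returns a new DataFrame; df itself is left unmodified, which
# makes it easy to chain several steps declaratively.
result = df.assign(day_0_diff=lambda d: d["day_0"] - d["day_1"])
print(result["day_0_diff"].tolist())  # [3, -1]
print("day_0_diff" in df.columns)     # False
```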

User-Defined Functions

Vectorised where possible:
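A sketch with an invented rescale function applied to a whole column at once:

```python
import pandas as pd

df = pd.DataFrame({"day_0": [0, 4], "day_1": [2, 6]})

def rescale(x):
    # Works on a whole Series at once thanks to NumPy-style broadcasting.
    return x * 2 + 1

df["day_0_scaled"] = rescale(df["day_0"])
print(df["day_0_scaled"].tolist())  # [1, 9]
```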

Row-wise if necessary:
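A sketch using apply with axis=1, with an invented per-row classification rule:

```python
import pandas as pd

df = pd.DataFrame({"day_0": [0, 4], "day_1": [2, 6]})

def classify(row):
    # Branching logic like this is awkward to vectorise, so apply it per row.
    return "high" if row.max() > 5 else "low"

df["level"] = df.apply(classify, axis=1)
print(df["level"].tolist())  # ['low', 'high']
```

Row-wise apply is much slower than vectorised operations, so reserve it for logic that genuinely needs it.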

Handling Missing Values
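A sketch of the core missing-value tools (isna, dropna, fillna) on a small placeholder table:

```python
import pandas as pd

df = pd.DataFrame({"day_0": [0.0, None, 2.0], "day_1": [1.0, 3.0, None]})

print(df.isna().sum().tolist())  # missing values per column: [1, 1]

dropped = df.dropna()     # keep only complete rows
filled = df.fillna(0.0)   # or substitute a chosen value

print(len(dropped), int(filled.isna().sum().sum()))  # 1 0
```

Whether to drop or fill depends on the analysis: dropping discards information, while filling can bias statistics, so record whichever choice you make.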

Grouped Summaries

Tip

groupby(...).agg(...) is the standard summarisation pattern.
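The pattern in miniature, with an invented source grouping column:

```python
import pandas as pd

df = pd.DataFrame({"source": ["a", "a", "b"], "day_0": [1.0, 3.0, 5.0]})

# One row per group, one column per requested statistic.
summary = df.groupby("source")["day_0"].agg(["mean", "max"])
print(summary.loc["a", "mean"], summary.loc["b", "max"])  # 2.0 5.0
```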

Sorting

Notice the row labels (patient IDs) are preserved and sorted along with the data.
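A sketch showing that sort_values reorders whole rows, labels included (the week1_mean column is a stand-in for any computed statistic):

```python
import pandas as pd

df = pd.DataFrame({"week1_mean": [2.0, 0.5, 1.5]}, index=[0, 1, 2])
df.index.name = "patient"

ordered = df.sort_values("week1_mean", ascending=False)
# The patient labels travel with their rows.
print(ordered.index.tolist())  # [0, 2, 1]
```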

Exporting Results
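A minimal export sketch; the filename is a placeholder:

```python
import pandas as pd

# Placeholder cleaned table.
df = pd.DataFrame({"patient": [0, 1], "week1_mean": [1.5, 2.0]})

# index=False keeps the positional row labels out of the file.
df.to_csv("cleaned_results.csv", index=False)

round_trip = pd.read_csv("cleaned_results.csv")
print(round_trip.shape)  # (2, 2)
```

Reading the file back is a cheap sanity check that the export contains what you expect.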

Putting It All Together: Combine → Summarise → Export

This workflow loads multiple files, labels them, combines, summarises, and exports. Here it is written deliberately step by step, with generous comments, so each stage is clear.
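A self-contained sketch of the whole pipeline; it writes two tiny stand-in files first, where the lesson would use the real inflammation-*.csv downloads:

```python
import glob

import pandas as pd

# Stand-in files so the sketch runs anywhere; normally these already exist.
pd.DataFrame([[0, 1], [2, 3]]).to_csv(
    "inflammation-demo-1.csv", header=False, index=False)
pd.DataFrame([[4, 5], [6, 7]]).to_csv(
    "inflammation-demo-2.csv", header=False, index=False)

frames = []
for path in sorted(glob.glob("inflammation-demo-*.csv")):
    frame = pd.read_csv(path, header=None)
    frame.columns = [f"day_{i}" for i in range(frame.shape[1])]
    # reset_index turns the row number into an explicit patient column.
    frame = frame.reset_index().rename(columns={"index": "patient"})
    frame["source"] = path  # record which file each row came from
    frames.append(frame)

combined = pd.concat(frames, ignore_index=True)

day_cols = [c for c in combined.columns if c.startswith("day_")]
summary = combined.groupby("source")[day_cols].agg(["mean", "min", "max"])

# Flatten the (day, stat) column MultiIndex for a friendlier CSV header.
summary.columns = [f"{day}_{stat}" for day, stat in summary.columns]
summary.to_csv("per_day_summary.csv")
print(summary["day_0_mean"].tolist())  # [1.0, 5.0]
```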

Why this pattern?

  • reset_index ensures each patient has an explicit identifier.
  • source preserves provenance across files.
  • Using groupby("source")[day_cols].agg(...) lets you compute many stats in one call.
  • Flattening the column MultiIndex is optional but makes the CSV easier to read.
  • Exporting with to_csv makes results reusable in other tools.

Challenge: Tiny CLI Wrapper

To run the workflow from the shell without using a notebook, save the below as scripts/summarise_inflammation.py:
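A hedged sketch of such a script: the argument names (pattern, --out) mirror the invocation shown in the lesson, and the summary logic follows the combine → summarise → export pattern above.

```python
"""Summarise per-day inflammation statistics across many CSV files."""
import argparse
import glob

import pandas as pd


def main(argv=None):
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("pattern", nargs="?", default="inflammation-*.csv",
                        help="glob pattern, e.g. 'inflammation-*.csv'")
    parser.add_argument("--out", default="per_day_summary.csv",
                        help="path for the summary CSV")
    args = parser.parse_args(argv)

    frames = []
    for path in sorted(glob.glob(args.pattern)):
        frame = pd.read_csv(path, header=None)
        frame.columns = [f"day_{i}" for i in range(frame.shape[1])]
        frame["source"] = path  # keep provenance across files
        frames.append(frame)
    if not frames:
        print(f"No files matched {args.pattern!r}; nothing to do.")
        return

    combined = pd.concat(frames, ignore_index=True)
    day_cols = [c for c in combined.columns if c.startswith("day_")]
    summary = combined.groupby("source")[day_cols].agg(["mean", "min", "max"])
    # Flatten the (day, stat) MultiIndex so the CSV header is plain text.
    summary.columns = [f"{day}_{stat}" for day, stat in summary.columns]
    summary.to_csv(args.out)
    print(f"Wrote {args.out}")


if __name__ == "__main__":
    main()
```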

To run from the command line, call the script with a glob pattern and optional output path:

python scripts/summarise_inflammation.py "inflammation-*.csv" --out out/per_day_summary.csv

Further Information

📚 Keypoints

  • pandas DataFrames are efficient, labelled table structures for analysis.
  • Boolean filtering and column selection are core day-to-day operations.
  • Clean data before summarising (isna, dropna, type checks).
  • groupby plus agg is the main pattern for grouped statistics.
  • Keep outputs reproducible by saving cleaned tables and summaries.

🔦 Hints

  • Inspect data early with head(), info(), and summary statistics.
  • Apply filters in small steps so each transformation is easy to verify.
  • Save cleaned outputs to support reproducibility and peer review.

Module Summary

This module introduces pandas for tabular data workflows. Learners practice selecting columns, filtering rows, handling missing values, grouping data, and exporting clean summaries for downstream analysis.

Additional Learning

The concepts in this module connect directly to practical data handling and exploration in Python.

  • DataFrame Basics (pandas DataFrame): DataFrames are the core structure for real-world tables.
  • Filtering and Selection (Indexing and selecting data): precise selection keeps analyses targeted and correct.
  • Grouped Summaries (groupby): group-wise statistics are essential for comparison tasks.

Attribution

This lesson is derived from materials developed by the Software Carpentry project.

The original content is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license: https://github.com/swcarpentry/python-novice-inflammation/blob/main/LICENSE.md