Data Manipulation with pandas
Learning Outcomes
- Load tabular data into pandas with clear, reliable labels.
- Select and filter data safely using `.loc` and `.iloc`.
- Calculate statistics with `DataFrame.mean` and `DataFrame.std`.
- Create plots using `DataFrame.plot`.
- Handle missing values.
- Create new columns using vectorised operations and simple functions.
- Combine many CSVs into one table with `concat`.
- Summarise data with `groupby`/`agg`.
- Export cleaned results for downstream analysis.
Questions
- Why use pandas instead of plain lists for table-like data?
- How do I quickly filter and transform rows and columns?
- How can I summarize data by groups for analysis and plotting?
Structure & Agenda
- DataFrame fundamentals (~20 min)
- Selecting and filtering data (~25 min)
- Transforming and cleaning (~25 min)
- Grouped summaries and export (~20 min)
Activities spaced throughout the session
Introduction to pandas
- pandas is a widely used Python library for statistics, particularly on tabular data.
- pandas provides the `DataFrame`, a labelled, table-like data structure.
- pandas builds on `numpy` and `matplotlib`, adding labels, missing-data handling, reshaping, grouping, and other functionality that simplifies data manipulation and analysis.

See the pandas docs for more information.
From Arrays to DataFrames
We have used `header=None` above because there is no header row in our data.
Set `DataFrame.columns` to assign human-readable column names.
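A minimal sketch of this loading pattern. The lesson uses `inflammation-01.csv`; here an in-memory `StringIO` stands in for the file so the snippet is self-contained:

```python
import io

import pandas as pd

# Stand-in for inflammation-01.csv, which has no header row.
raw = io.StringIO("0,1,2\n0,2,1\n0,1,3\n")

# header=None tells pandas not to treat the first row as column names.
df = pd.read_csv(raw, header=None)

# Assign human-readable column names after loading.
df.columns = [f"day_{i}" for i in range(df.shape[1])]
print(df.columns.tolist())  # ['day_0', 'day_1', 'day_2']
```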
List comprehensions are a concise way to create lists. They replace longer for loops and can make code more readable. A list comprehension has the basic form [expression for item in iterable].
An equivalent for loop for the list comprehension used above would be:
Python f-strings are a way to format strings by embedding variables using `{}` placeholders.
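The two forms side by side, using the day-label example from above (the `range(40)` width is an assumption matching the lesson data):

```python
# List comprehension: build labels "day_0" ... "day_39" in one line.
labels = [f"day_{i}" for i in range(40)]

# The equivalent explicit for loop:
labels_loop = []
for i in range(40):
    labels_loop.append(f"day_{i}")

print(labels[:3])  # ['day_0', 'day_1', 'day_2']
```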
DataFrame.info() quickly shows wrong dtypes or unexpected missing values.
Use DataFrame.describe() to get summary statistics about data.
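A quick sketch of both inspection calls on a hypothetical small table (the column names and values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical small table standing in for the inflammation data.
df = pd.DataFrame({"day_0": [0, 0, 0], "day_1": [1.0, 2.0, np.nan]})

df.info()  # dtypes and non-null counts: spots wrong types and missing values

summary = df.describe()  # count, mean, std, min, quartiles, max per column
print(summary.loc["count", "day_1"])  # 2.0, revealing one missing value
```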
Clear, human-readable names avoid off-by-one mistakes.
Challenge: Inspection
How many patients and days are in inflammation-01.csv?
Selecting Data
Use single brackets for one column, double brackets for multiple columns.
Use .iloc (by position) and .loc (by label).
Use boolean masks to filter rows.
- Read: use attribute access or `[]`.
- Write: use `.loc[...] = value`.
- `axis=0` → down rows; `axis=1` → across columns.
- Avoid chained indexing like `df[df["day_0"] > 0]["day_1"] = ...`.
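These rules in one sketch, on a toy frame with assumed patient/day labels:

```python
import pandas as pd

df = pd.DataFrame(
    {"day_0": [0, 1, 2], "day_1": [3, 4, 5]},
    index=["patient_0", "patient_1", "patient_2"],
)

one_col = df["day_0"]               # single brackets -> Series
two_cols = df[["day_0", "day_1"]]   # double brackets -> DataFrame

by_pos = df.iloc[0, 1]                    # row 0, column 1 by position -> 3
by_label = df.loc["patient_1", "day_1"]   # by row/column label -> 4

mask = df["day_0"] > 0   # boolean mask
filtered = df[mask]      # keeps patient_1 and patient_2

# Write safely with .loc, never via chained indexing:
df.loc[mask, "day_1"] = 0
```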
Challenge: Selection
Get patient 10's readings for days 10-19 using both .iloc and .loc.
Stats & Plots with pandas
Compute per-day mean/min/max, then plot.
- Methods mirror NumPy: `mean`, `std`, `idxmax`, etc.
- `DataFrame.plot()` is a thin wrapper around Matplotlib: fast for exploration.
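A sketch of the per-day statistics and a quick plot, on invented toy data (the non-interactive backend line is only there so the snippet also runs headless):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; harmless in a notebook too
import pandas as pd

# Toy table: rows are patients, columns are days.
df = pd.DataFrame({"day_0": [0, 0, 0], "day_1": [1, 2, 3], "day_2": [2, 4, 6]})

# axis=0 (the default) aggregates down the rows, i.e. one value per day.
per_day_mean = df.mean()
per_day_min = df.min()
per_day_max = df.max()
print(per_day_mean["day_1"])  # 2.0

# One line of Matplotlib via pandas; label the axes afterwards.
ax = per_day_mean.plot(title="Mean inflammation per day")
ax.set_xlabel("day")
ax.set_ylabel("mean inflammation")
```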
Challenge: Plotting
Recreate the mean/min/max per-day plot with .plot(), adding axis labels.
Creating New Columns
Challenge: Create
Make `day_0_diff = day_0 - day_1` and `week1_std` as the sample SD of days 0-6.
We can also use `assign` to create new columns. This is slightly more verbose, but has the advantage of not modifying `df` until the end, which makes for more declarative code.
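A side-by-side sketch of both styles, on invented data:

```python
import pandas as pd

df = pd.DataFrame({"day_0": [1, 4], "day_1": [2, 2]})

# Direct, vectorised column creation (modifies df in place):
df["day_0_diff"] = df["day_0"] - df["day_1"]

# assign builds a new frame, leaving the original untouched:
df2 = df.assign(total=lambda d: d["day_0"] + d["day_1"])

print(df["day_0_diff"].tolist())  # [-1, 2]
```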
User-Defined Functions
Vectorised where possible:
Row-wise if necessary:
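One hedged sketch of both approaches; the `flag_high` helper and its threshold are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({"day_0": [0, 5, 10], "day_1": [1, 2, 3]})

# Vectorised where possible: whole-column arithmetic, fast.
df["mean_01"] = (df["day_0"] + df["day_1"]) / 2

# Row-wise only if necessary: apply a Python function per row (slower).
def flag_high(row, threshold=4):
    """Return 'high' if any reading in the row exceeds the threshold."""
    return "high" if row.max() > threshold else "ok"

df["flag"] = df[["day_0", "day_1"]].apply(flag_high, axis=1)
print(df["flag"].tolist())  # ['ok', 'high', 'high']
```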
Handling Missing Values
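The usual toolkit is `isna`, `dropna`, and `fillna`; a minimal sketch on invented data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"day_0": [0.0, np.nan, 2.0],
                   "day_1": [1.0, 2.0, np.nan]})

n_missing = df.isna().sum()   # missing count per column
dropped = df.dropna()         # drop rows containing any missing value
filled = df.fillna(0.0)       # or substitute a value instead

print(n_missing.tolist())  # [1, 1]
print(len(dropped))        # 1
```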
Grouped Summaries
groupby(...).agg(...) is the standard summarization pattern.
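The pattern in its smallest form, with an assumed `source` key column:

```python
import pandas as pd

df = pd.DataFrame({"source": ["a", "a", "b", "b"],
                   "day_0": [0, 2, 1, 3]})

# Group rows by a key column, then aggregate each group in one call.
summary = df.groupby("source")["day_0"].agg(["mean", "max"])
print(summary.loc["a", "mean"])  # 1.0
```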
Sorting
Notice the row labels (patient IDs) are preserved and sorted along with the data.
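A sketch of label-preserving sorting with `sort_values`, on invented patient labels:

```python
import pandas as pd

df = pd.DataFrame({"day_0": [3, 1, 2]}, index=["p0", "p1", "p2"])

# sort_values returns a new frame; the row labels move with their rows.
ordered = df.sort_values("day_0")
print(ordered.index.tolist())  # ['p1', 'p2', 'p0']
```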
Exporting Results
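The basic export call is `to_csv`; a sketch that writes to a temporary directory (in practice you would choose a real output path):

```python
import pathlib
import tempfile

import pandas as pd

df = pd.DataFrame({"day_0": [0, 1]}, index=["p0", "p1"])

out = pathlib.Path(tempfile.mkdtemp()) / "cleaned.csv"

# index_label names the index column in the CSV so labels round-trip.
df.to_csv(out, index_label="patient")

back = pd.read_csv(out, index_col="patient")
print(back.index.tolist())  # ['p0', 'p1']
```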
Putting It All Together: Combine β Summarise β Export
This workflow loads multiple files, labels them, combines, summarises, and exports. Here it's written step by step with generous comments so each stage is clear.
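A compact sketch of the whole pipeline. Real code would read each file with `pd.read_csv(path, header=None)`; here two tiny in-memory frames stand in for the CSVs so the example is self-contained:

```python
import pandas as pd

frames = []
for name, rows in [("inflammation-01.csv", [[0, 1], [0, 3]]),
                   ("inflammation-02.csv", [[1, 2], [1, 4]])]:
    df = pd.DataFrame(rows, columns=["day_0", "day_1"])
    df["source"] = name                      # preserve provenance
    frames.append(df)

# Stack the tables; reset_index turns the per-file row number into an
# explicit patient identifier column.
combined = pd.concat(frames).reset_index().rename(columns={"index": "patient"})

day_cols = ["day_0", "day_1"]
summary = combined.groupby("source")[day_cols].agg(["mean", "max"])

# Flatten the column MultiIndex: ("day_0", "mean") -> "day_0_mean".
summary.columns = [f"{col}_{stat}" for col, stat in summary.columns]

print(summary["day_1_mean"].tolist())  # [2.0, 3.0]
# summary.to_csv("per_day_summary.csv")  # export for reuse elsewhere
```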
Why this pattern?
- `reset_index` ensures each patient has an explicit identifier.
- `source` preserves provenance across files.
- Using `groupby("source")[day_cols].agg(...)` lets you compute many stats in one call.
- Flattening the column MultiIndex is optional but makes the CSV easier to read.
- Exporting with `to_csv` makes results reusable in other tools.
Challenge: Tiny CLI Wrapper
To run the workflow from the shell without using a notebook, save the below as scripts/summarise_inflammation.py:
To run from the command line, call the script with a glob pattern and optional output path:
`python scripts/summarise_inflammation.py "inflammation-*.csv" --out out/per_day_summary.csv`

Further Information
Keypoints
- pandas DataFrames are efficient, labeled table structures for analysis.
- Boolean filtering and column selection are core day-to-day operations.
- Clean data before summarizing (`isna`, `dropna`, type checks).
- `groupby` plus `agg` is the main pattern for grouped statistics.
- Keep outputs reproducible by saving cleaned tables and summaries.
Hints
- Inspect data early with `head()`, `info()`, and summary statistics.
- Apply filters in small steps so each transformation is easy to verify.
- Save cleaned outputs to support reproducibility and peer review.
- Save cleaned outputs to support reproducibility and peer review.
Module Summary
This module introduces pandas for tabular data workflows. Learners practice selecting columns, filtering rows, handling missing values, grouping data, and exporting clean summaries for downstream analysis.
Additional Learning
The concepts in this module connect directly to practical data handling and exploration in Python.
| Submodule | Python Connection | Why It Matters |
|---|---|---|
| DataFrame Basics | pandas DataFrame | DataFrames are the core structure for real-world tables. |
| Filtering and Selection | Indexing and selecting data | Precise selection keeps analyses targeted and correct. |
| Grouped Summaries | groupby | Group-wise statistics are essential for comparison tasks. |