Regression Analysis and Modelling
Learning Outcomes
- Understand what `statsmodels` is and when to use it.
- Fit a minimal Ordinary Least Squares (OLS) regression with the formula API.
- Read and interpret the key parts of a model summary.
- Make predictions with confidence intervals.
- Know where to go next: GLMs (e.g., logistic regression).
- Train a linear regression model on a dataset.
- Test the model, and plot the results.
Questions
- What is the `statsmodels` package?
- What is the `scikit-learn` package?
- How can I train a Linear Regression model in Python?
Structure & Agenda
- Run a complete linear regression workflow in `scikit-learn` (~20 min)
- Run a linear regression example with `statsmodels` (~20 min)
Activities spaced throughout the session
This lesson doesn't depend on any previous state in your notebook.
What is Scikit-Learn?
Scikit-Learn is a Python package that provides access to well-known machine learning algorithms through a clean application programming interface (API). It has been built by hundreds of contributors from around the world and is used across industry and academia.
Scikit-Learn is built upon Python's NumPy (Numerical Python) and SciPy (Scientific Python) libraries, which enable efficient in-core numerical and scientific computation. As such, Scikit-Learn is not specifically designed for extremely large datasets, though there is some work in this area.
We will use Scikit-Learn to build a full linear regression example in stages following the example in the documentation.
# Authors: The scikit-learn developers
# SPDX-License-Identifier: BSD-3-Clause
Import and Prepare Data
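The code for this stage is not included above; a minimal sketch following the scikit-learn documentation's diabetes example (one feature, BMI, with the last 20 samples held out for testing) might look like this:

```python
import numpy as np
from sklearn import datasets

# Load the diabetes dataset (442 samples, 10 standardised features)
diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)

# Use only one feature (column 2, BMI) so the fit can be plotted in 2D
diabetes_X = diabetes_X[:, np.newaxis, 2]

# Hold out the last 20 samples for testing
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]
diabetes_y_train = diabetes_y[:-20]
diabetes_y_test = diabetes_y[-20:]

print(diabetes_X_train.shape, diabetes_X_test.shape)  # (422, 1) (20, 1)
```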
Model Training
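A sketch of the training stage (the data preparation is repeated so the snippet runs on its own):

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LinearRegression

diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)
diabetes_X = diabetes_X[:, np.newaxis, 2]  # single feature (BMI)

# Fit ordinary least squares on the training portion
regr = LinearRegression()
regr.fit(diabetes_X[:-20], diabetes_y[:-20])

print("Coefficients:", regr.coef_)
print("Intercept:", regr.intercept_)
```

The fitted `coef_` and `intercept_` attributes define the straight line the model has learned.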
Prediction and Evaluation
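The evaluation stage might be sketched as follows, using mean squared error and the coefficient of determination (R²) as in the scikit-learn example:

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)
diabetes_X = diabetes_X[:, np.newaxis, 2]
X_train, X_test = diabetes_X[:-20], diabetes_X[-20:]
y_train, y_test = diabetes_y[:-20], diabetes_y[-20:]

regr = LinearRegression().fit(X_train, y_train)
y_pred = regr.predict(X_test)

print("Mean squared error:", mean_squared_error(y_test, y_pred))
print("R^2:", r2_score(y_test, y_pred))
```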
Plotting
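A minimal plotting sketch (the `Agg` backend is an assumption, chosen so the script also runs without a display):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LinearRegression

diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)
diabetes_X = diabetes_X[:, np.newaxis, 2]
X_train, X_test = diabetes_X[:-20], diabetes_X[-20:]
y_train, y_test = diabetes_y[:-20], diabetes_y[-20:]

regr = LinearRegression().fit(X_train, y_train)
y_pred = regr.predict(X_test)

# Observed test points and the fitted regression line
plt.scatter(X_test, y_test, color="black", label="observed")
plt.plot(X_test, y_pred, color="blue", linewidth=3, label="fitted line")
plt.legend()
plt.savefig("linear_regression_fit.png")
```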
This nicely demonstrates how linear regression fits the straight line that minimizes the residual sum of squares between the observed responses in the dataset and the responses predicted by the linear approximation.
What is statsmodels?
statsmodels is a Python library for classical statistical modelling. It complements pandas/NumPy by providing well-tested implementations of linear models (OLS), generalised linear models (GLM), time series (ARIMA/ETS), and statistical tests, with rich summaries (standard errors, p-values, confidence intervals, diagnostic metrics).
Use statsmodels when you need interpretability, inference, and statistical diagnostics beyond what scikit-learn typically exposes.
See the statsmodels documentation for full reference material and worked examples.
Prerequisites & Setup
- Comfort with Python, pandas, and basic plotting (Matplotlib or pandas `.plot`).
- Install: `pip install statsmodels patsy`
- We'll use a built-in dataset to avoid external downloads.
Minimal OLS with the Formula API
The formula API (statsmodels.formula.api or smf) uses R-like model formulas via patsy. It integrates smoothly with pandas DataFrames.
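The worked example itself is not shown above; a minimal sketch using synthetic data (an assumption, since the lesson's built-in dataset is not specified here) might look like this:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data with a known relationship: y = 2 + 1.5*x1 - 0.5*x2 + noise
rng = np.random.default_rng(42)
df = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
df["y"] = 2.0 + 1.5 * df["x1"] - 0.5 * df["x2"] + rng.normal(scale=0.3, size=100)

# Fit OLS via an R-style formula; column names map directly to formula terms
model = smf.ols("y ~ x1 + x2", data=df).fit()
print(model.summary())
```

Because the data were generated with known coefficients, the fitted `params` should land close to 2.0, 1.5, and -0.5.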
Quick Tips
Formulas:

- `y ~ x1 + x2` → main effects only (additive)
- `y ~ x1 * x2` → includes x1, x2, and their interaction (x1:x2)
- `y ~ x1:x2` → interaction only, no main effects
- `y ~ x1 + x2 + x1:x2` → explicit version of `x1 * x2`
- `C(cat)` → treat a column as categorical

`ols(...).fit()` returns a results object (`.params`, `.bse`, `.pvalues`, `.conf_int()`). Use clean column names (snake_case, no spaces) to avoid quoting hassles in formulas.
Reading the Summary (Essentials)
Key fields from model.summary():
- coef: estimated effect per unit change in predictor, holding others fixed.
- std err: standard error of the estimate.
- t, P>|t|: t-statistic and p-value for the null hypothesis that the coefficient is zero.
- [0.025, 0.975]: 95% confidence interval.
- R-squared / Adj. R-squared: variance explained (adjusted penalises extra predictors).
- F-statistic: global test that at least one predictor is nonzero.
- Durbin-Watson: autocorrelation check (mainly for time-series-like residuals).
Extract the essentials programmatically:
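The extraction code is not shown above; a sketch on the same synthetic model (an assumption standing in for the lesson's dataset):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
df = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
df["y"] = 2.0 + 1.5 * df["x1"] - 0.5 * df["x2"] + rng.normal(scale=0.3, size=100)
model = smf.ols("y ~ x1 + x2", data=df).fit()

# Each summary field is also available as an attribute or method
print(model.params)       # coefficients
print(model.bse)          # standard errors
print(model.pvalues)      # p-values
print(model.conf_int())   # 95% confidence intervals (lower, upper)
print(model.rsquared, model.rsquared_adj)
```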
Quick Tips
- Large absolute t-statistics and small p-values suggest evidence against a zero coefficient (context matters).
- Significance is not importance: check effect sizes and domain relevance.
Prediction with Confidence Intervals
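A sketch of prediction with intervals, again on the synthetic model (an assumption in place of the lesson's dataset), using `get_prediction` and `summary_frame`:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
df = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
df["y"] = 2.0 + 1.5 * df["x1"] - 0.5 * df["x2"] + rng.normal(scale=0.3, size=100)
model = smf.ols("y ~ x1 + x2", data=df).fit()

# Predict at new predictor values; summary_frame gives one row per observation
new = pd.DataFrame({"x1": [0.0, 1.0], "x2": [0.0, 0.5]})
pred = model.get_prediction(new)
frame = pred.summary_frame(alpha=0.05)  # 95% intervals
print(frame)
```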
- `mean_ci_lower` / `mean_ci_upper`: confidence interval for the expected mean of `y` at these predictor values.
- `obs_ci_lower` / `obs_ci_upper`: prediction interval for a single new observation (wider).
Quick Tips
- For reporting, prefer tabular outputs (`summary_frame`) you can save.
Very Brief Diagnostics
Check linear model assumptions quickly: linearity, homoscedasticity, normal-ish residuals, independence.
Quick Tips
- Patterns in residuals vs fitted suggest model misspecification or heteroscedasticity.
- Consider transformations or alternative models (e.g., GLM) if assumptions are violated.
Where Next? (Signposts)
Generalised Linear Models (GLM)
For non-Gaussian outcomes (binary, counts, rates):

- Common families: `Binomial` (logit/probit links), `Poisson`, `Gamma` (each with an appropriate link function).
Practical Mini-Exercises
Exercise 1 β Add an Interaction
Fit mpg ~ hp * wt (which expands to hp + wt + hp:wt). Does the interaction appear significant?
Solution:
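The solution code is not shown above; a sketch using a synthetic stand-in for mtcars (an assumption: in class you would load the real dataset, e.g. via `sm.datasets.get_rdataset("mtcars", "datasets").data`, which requires a download):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic cars with an hp-by-wt interaction built in (a stand-in for mtcars)
rng = np.random.default_rng(1)
n = 200
df = pd.DataFrame({
    "hp": rng.uniform(50, 300, n),   # horsepower
    "wt": rng.uniform(1.5, 5.5, n),  # weight (1000 lbs)
})
df["mpg"] = (35 - 0.03 * df["hp"] - 3.0 * df["wt"]
             + 0.004 * df["hp"] * df["wt"]
             + rng.normal(scale=1.0, size=n))

# hp * wt expands to hp + wt + hp:wt
model = smf.ols("mpg ~ hp * wt", data=df).fit()
print(model.summary().tables[1])
print("Interaction p-value:", model.pvalues["hp:wt"])
```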
Explanation: The hp:wt row tests whether the effect of horsepower on mpg depends on weight. A small p-value (e.g., < 0.05) suggests meaningfully different slopes across weights; otherwise, prefer the simpler additive model.
Exercise 2 β Categorical Predictor
Create a binary indicator heavy = (wt > wt.median()) and fit mpg ~ hp + C(heavy). Interpret the C(heavy)[T.True] coefficient.
Solution:
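A sketch on the same synthetic mtcars stand-in (an assumption; swap in the real mtcars data in class):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 200
df = pd.DataFrame({
    "hp": rng.uniform(50, 300, n),
    "wt": rng.uniform(1.5, 5.5, n),
})
df["mpg"] = (35 - 0.03 * df["hp"] - 3.0 * df["wt"]
             + 0.004 * df["hp"] * df["wt"]
             + rng.normal(scale=1.0, size=n))

# Binary indicator: heavier than the median car
df["heavy"] = df["wt"] > df["wt"].median()

# C(heavy) treats the boolean as categorical; False is the baseline level
model = smf.ols("mpg ~ hp + C(heavy)", data=df).fit()
print(model.params)
```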
Explanation: C(heavy)[T.True] is the average difference in mpg between heavy and light cars at the same horsepower. A negative, significant coefficient indicates heavier cars get fewer mpg, controlling for hp.
Exercise 3 β Prediction Table
Make predictions for hp={90, 120, 150} at wt=3.0. Save the full prediction frame to CSV.
Solution:
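A sketch on the same synthetic stand-in (an assumption; the file name `predictions.csv` is also illustrative):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 200
df = pd.DataFrame({
    "hp": rng.uniform(50, 300, n),
    "wt": rng.uniform(1.5, 5.5, n),
})
df["mpg"] = (35 - 0.03 * df["hp"] - 3.0 * df["wt"]
             + 0.004 * df["hp"] * df["wt"]
             + rng.normal(scale=1.0, size=n))

model = smf.ols("mpg ~ hp + wt", data=df).fit()

# Predictions at hp = 90, 120, 150, all at wt = 3.0
new = pd.DataFrame({"hp": [90, 120, 150], "wt": [3.0, 3.0, 3.0]})
frame = model.get_prediction(new).summary_frame(alpha=0.05)
frame.to_csv("predictions.csv")
print(frame)
```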
Explanation: The table includes fitted means and intervals. Mean CI reflects uncertainty in the expected mpg at those settings; obs CI is wider, reflecting variability for a single new car.
Further Information
Keypoints
- Use `scikit-learn` to fit and evaluate a basic linear regression model.
- `statsmodels` focuses on inference and diagnostics for classical models.
- Use the formula API (`smf.ols("y ~ x1 + x2", data=...)`) for concise, readable models.
- Read summaries for coefficients, uncertainty (SE/CI), and goodness-of-fit.
- Use `get_prediction(...).summary_frame()` for CIs/PIs you can report.
- For non-linear means or non-Gaussian outcomes, step up to GLM; for temporal dependence, explore ARIMA/SARIMAX.
Hints
- Separate data preparation, model fitting, and evaluation into distinct steps.
- Plot residual or comparison views to check model behavior, not just one score.
- Keep feature and target naming explicit so formulas remain interpretable.
Module Summary
This module combines exploratory plotting with introductory regression modelling. Learners fit a linear model, evaluate outputs, and connect numerical metrics to visual evidence from the underlying data.
Additional Learning
The concepts in this module connect directly to practical data handling and exploration in Python.
| Submodule | Python Connection | Why It Matters |
|---|---|---|
| Linear Regression | `LinearRegression` | Provides a baseline model for trend estimation. |
| Model Evaluation | Regression metrics | Metrics quantify prediction quality and model fit. |
| Comparative Plotting | Matplotlib subplots | Side-by-side plots improve analytical judgment. |