Regression Analysis and Modelling
Learning Outcomes
- Understand what `statsmodels` is and when to use it.
- Fit a minimal Ordinary Least Squares (OLS) regression with the formula API.
- Read and interpret the key parts of a model summary.
- Make predictions with confidence intervals.
- Know where to go next: GLMs (e.g., logistic regression).
- Train a linear regression model on a dataset.
- Test the model, and plot the results.
Questions
- What is the `statsmodels` package?
- What is the `scikit-learn` package?
- How can I train a Linear Regression model in Python?
Structure & Agenda
- Run a complete linear regression workflow in `scikit-learn` (~20 min)
- Run a linear regression example with `statsmodels` (~20 min)
Activities spaced throughout the session
This lesson doesn't depend on any previous state in your notebook.
What is Scikit-Learn?
Scikit-Learn is a Python package that provides access to well-known machine learning algorithms through a clean application programming interface (API). It has been built by hundreds of contributors from around the world and is used across industry and academia.
Scikit-Learn is built upon Python's NumPy (Numerical Python) and SciPy (Scientific Python) libraries, which enable efficient in-core numerical and scientific computation. As such, Scikit-Learn is not specifically designed for extremely large datasets, though there is some work in this area.
We will use Scikit-Learn to build a full linear regression example in stages following the example in the documentation.
# Authors: The scikit-learn developers
# SPDX-License-Identifier: BSD-3-Clause
Import and Prepare Data
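The code for this stage is not included above; a minimal sketch following the scikit-learn documentation's diabetes example (one feature, BMI, with the last 20 samples held out for testing) might look like this:

```python
import numpy as np
from sklearn import datasets

# Load the diabetes dataset (442 samples, 10 standardised features)
diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)

# Use only one feature (column 2, BMI) so the fit can be plotted in 2D
diabetes_X = diabetes_X[:, np.newaxis, 2]

# Hold out the last 20 samples for testing
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]
diabetes_y_train = diabetes_y[:-20]
diabetes_y_test = diabetes_y[-20:]

print(diabetes_X_train.shape, diabetes_X_test.shape)  # (422, 1) (20, 1)
```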
Model Training
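A sketch of the training stage (the data preparation is repeated so the snippet runs on its own):

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LinearRegression

diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)
diabetes_X = diabetes_X[:, np.newaxis, 2]  # single feature (BMI)

# Fit ordinary least squares on the training portion
regr = LinearRegression()
regr.fit(diabetes_X[:-20], diabetes_y[:-20])

print("Coefficients:", regr.coef_)
print("Intercept:", regr.intercept_)
```

The fitted `coef_` and `intercept_` attributes define the straight line the model has learned.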
Prediction and Evaluation
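The evaluation stage might be sketched as follows, using mean squared error and the coefficient of determination (R²) as in the scikit-learn example:

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)
diabetes_X = diabetes_X[:, np.newaxis, 2]
X_train, X_test = diabetes_X[:-20], diabetes_X[-20:]
y_train, y_test = diabetes_y[:-20], diabetes_y[-20:]

regr = LinearRegression().fit(X_train, y_train)
y_pred = regr.predict(X_test)

print("Mean squared error:", mean_squared_error(y_test, y_pred))
print("R^2:", r2_score(y_test, y_pred))
```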
Plotting
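A minimal plotting sketch (the `Agg` backend is an assumption, chosen so the script also runs without a display):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LinearRegression

diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)
diabetes_X = diabetes_X[:, np.newaxis, 2]
X_train, X_test = diabetes_X[:-20], diabetes_X[-20:]
y_train, y_test = diabetes_y[:-20], diabetes_y[-20:]

regr = LinearRegression().fit(X_train, y_train)
y_pred = regr.predict(X_test)

# Observed test points and the fitted regression line
plt.scatter(X_test, y_test, color="black", label="observed")
plt.plot(X_test, y_pred, color="blue", linewidth=3, label="fitted line")
plt.legend()
plt.savefig("linear_regression_fit.png")
```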
This nicely demonstrates how linear regression fits the straight line that minimizes the residual sum of squares between the observed responses in the dataset and the responses predicted by the linear approximation.
What is statsmodels?
statsmodels is a Python library for classical statistical modelling. It complements pandas/NumPy by providing well-tested implementations of linear models (OLS), generalised linear models (GLM), time series (ARIMA/ETS), and statistical tests, with rich summaries (standard errors, p-values, confidence intervals, diagnostic metrics).
Use statsmodels when you need interpretability, inference, and statistical diagnostics beyond what scikit-learn typically exposes.
See the statsmodels documentation for full reference material and worked examples.
Prerequisites & Setup
- Comfort with Python, pandas, and basic plotting (Matplotlib or pandas `.plot`).
- Install: `pip install statsmodels patsy`
- We'll use a built-in dataset to avoid external downloads.
Minimal OLS with the Formula API
The formula API (statsmodels.formula.api or smf) uses R-like model formulas via patsy. It integrates smoothly with pandas DataFrames.
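The worked example itself is not shown above; a minimal sketch using synthetic data (an assumption, since the lesson's built-in dataset is not specified here) might look like this:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data with a known relationship: y = 2 + 1.5*x1 - 0.5*x2 + noise
rng = np.random.default_rng(42)
df = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
df["y"] = 2.0 + 1.5 * df["x1"] - 0.5 * df["x2"] + rng.normal(scale=0.3, size=100)

# Fit OLS via an R-style formula; column names map directly to formula terms
model = smf.ols("y ~ x1 + x2", data=df).fit()
print(model.summary())
```

Because the data were generated with known coefficients, the fitted `params` should land close to 2.0, 1.5, and -0.5.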
Quick Tips
Formulas:

- `y ~ x1 + x2` → main effects only (additive)
- `y ~ x1 * x2` → includes x1, x2, and their interaction (x1:x2)
- `y ~ x1:x2` → interaction only, no main effects
- `y ~ x1 + x2 + x1:x2` → explicit version of `x1 * x2`
- `C(cat)` → treat a column as categorical

`ols(...).fit()` returns a results object (`.params`, `.bse`, `.pvalues`, `.conf_int()`). Use clean column names (snake_case, no spaces) to avoid quoting hassles in formulas.
Reading the Summary (Essentials)
Key fields from model.summary():
- coef: estimated effect per unit change in predictor, holding others fixed.
- std err: standard error of the estimate.
- t, P>|t|: t-statistic and p-value for the null hypothesis that the coefficient is zero.
- [0.025, 0.975]: 95% confidence interval.
- R-squared / Adj. R-squared: variance explained (adjusted penalises extra predictors).
- F-statistic: global test that at least one predictor is nonzero.
- Durbin-Watson: autocorrelation check (mainly for time-series-like residuals).
Extract the essentials programmatically:
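The extraction code is not shown above; a sketch on the same synthetic model (an assumption standing in for the lesson's dataset):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
df = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
df["y"] = 2.0 + 1.5 * df["x1"] - 0.5 * df["x2"] + rng.normal(scale=0.3, size=100)
model = smf.ols("y ~ x1 + x2", data=df).fit()

# Each summary field is also available as an attribute or method
print(model.params)       # coefficients
print(model.bse)          # standard errors
print(model.pvalues)      # p-values
print(model.conf_int())   # 95% confidence intervals (lower, upper)
print(model.rsquared, model.rsquared_adj)
```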
Quick Tips
- Large absolute t-statistics and small p-values suggest evidence against a zero coefficient (context matters).
- Significance is not importance: check effect sizes and domain relevance.
Prediction with Confidence Intervals
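A sketch of prediction with intervals, again on the synthetic model (an assumption in place of the lesson's dataset), using `get_prediction` and `summary_frame`:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
df = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
df["y"] = 2.0 + 1.5 * df["x1"] - 0.5 * df["x2"] + rng.normal(scale=0.3, size=100)
model = smf.ols("y ~ x1 + x2", data=df).fit()

# Predict at new predictor values; summary_frame gives one row per observation
new = pd.DataFrame({"x1": [0.0, 1.0], "x2": [0.0, 0.5]})
pred = model.get_prediction(new)
frame = pred.summary_frame(alpha=0.05)  # 95% intervals
print(frame)
```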
- `mean_ci_lower` / `mean_ci_upper`: confidence interval for the expected mean of `y` at these predictor values.
- `obs_ci_lower` / `obs_ci_upper`: prediction interval for a single new observation (wider).
Quick Tips
- For reporting, prefer tabular outputs (`summary_frame`) you can save.
Very Brief Diagnostics
Check linear model assumptions quickly: linearity, homoscedasticity, normal-ish residuals, independence.
Quick Tips
- Patterns in residuals vs fitted suggest model misspecification or heteroscedasticity.
- Consider transformations or alternative models (e.g., GLM) if assumptions are violated.
Where Next? (Signposts)
Generalised Linear Models (GLM)
For non-Gaussian outcomes (binary, counts, rates):

- Common families: `Binomial` (logit/probit links), `Poisson`, `Gamma` (each with an appropriate link function).
Practical Mini-Exercises
Exercise 1 β Add an Interaction
Fit mpg ~ hp * wt (which expands to hp + wt + hp:wt). Does the interaction appear significant?
Solution:
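The solution code is not shown above; a sketch using a synthetic stand-in for mtcars (an assumption: in class you would load the real dataset, e.g. via `sm.datasets.get_rdataset("mtcars", "datasets").data`, which requires a download):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic cars with an hp-by-wt interaction built in (a stand-in for mtcars)
rng = np.random.default_rng(1)
n = 200
df = pd.DataFrame({
    "hp": rng.uniform(50, 300, n),   # horsepower
    "wt": rng.uniform(1.5, 5.5, n),  # weight (1000 lbs)
})
df["mpg"] = (35 - 0.03 * df["hp"] - 3.0 * df["wt"]
             + 0.004 * df["hp"] * df["wt"]
             + rng.normal(scale=1.0, size=n))

# hp * wt expands to hp + wt + hp:wt
model = smf.ols("mpg ~ hp * wt", data=df).fit()
print(model.summary().tables[1])
print("Interaction p-value:", model.pvalues["hp:wt"])
```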
Explanation: The hp:wt row tests whether the effect of horsepower on mpg depends on weight. A small p-value (e.g., < 0.05) suggests meaningfully different slopes across weights; otherwise, prefer the simpler additive model.
Exercise 2 β Categorical Predictor
Create a binary indicator heavy = (wt > wt.median()) and fit mpg ~ hp + C(heavy). Interpret the C(heavy)[T.True] coefficient.
Solution:
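A sketch on the same synthetic mtcars stand-in (an assumption; swap in the real mtcars data in class):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 200
df = pd.DataFrame({
    "hp": rng.uniform(50, 300, n),
    "wt": rng.uniform(1.5, 5.5, n),
})
df["mpg"] = (35 - 0.03 * df["hp"] - 3.0 * df["wt"]
             + 0.004 * df["hp"] * df["wt"]
             + rng.normal(scale=1.0, size=n))

# Binary indicator: heavier than the median car
df["heavy"] = df["wt"] > df["wt"].median()

# C(heavy) treats the boolean as categorical; False is the baseline level
model = smf.ols("mpg ~ hp + C(heavy)", data=df).fit()
print(model.params)
```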
Explanation: C(heavy)[T.True] is the average difference in mpg between heavy and light cars at the same horsepower. A negative, significant coefficient indicates heavier cars get fewer mpg, controlling for hp.
Exercise 3 β Prediction Table
Make predictions for hp={90, 120, 150} at wt=3.0. Save the full prediction frame to CSV.
Solution:
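A sketch on the same synthetic stand-in (an assumption; the file name `predictions.csv` is also illustrative):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 200
df = pd.DataFrame({
    "hp": rng.uniform(50, 300, n),
    "wt": rng.uniform(1.5, 5.5, n),
})
df["mpg"] = (35 - 0.03 * df["hp"] - 3.0 * df["wt"]
             + 0.004 * df["hp"] * df["wt"]
             + rng.normal(scale=1.0, size=n))

model = smf.ols("mpg ~ hp + wt", data=df).fit()

# Predictions at hp = 90, 120, 150, all at wt = 3.0
new = pd.DataFrame({"hp": [90, 120, 150], "wt": [3.0, 3.0, 3.0]})
frame = model.get_prediction(new).summary_frame(alpha=0.05)
frame.to_csv("predictions.csv")
print(frame)
```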
Explanation: The table includes fitted means and intervals. Mean CI reflects uncertainty in the expected mpg at those settings; obs CI is wider, reflecting variability for a single new car.
Further Information
Keypoints
- Use `scikit-learn` to fit and evaluate a basic linear regression model.
- `statsmodels` focuses on inference and diagnostics for classical models.
- Use the formula API (`smf.ols("y ~ x1 + x2", data=...)`) for concise, readable models.
- Read summaries for coefficients, uncertainty (SE/CI), and goodness-of-fit.
- Use `get_prediction(...).summary_frame()` for CIs/PIs you can report.
- For non-linear means or non-Gaussian outcomes, step up to GLM; for temporal dependence, explore ARIMA/SARIMAX.
Hints
- Separate data preparation, model fitting, and evaluation into distinct steps.
- Plot residual or comparison views to check model behavior, not just one score.
- Keep feature and target naming explicit so formulas remain interpretable.
Module Summary
This module combines exploratory plotting with introductory regression modelling. Learners fit a linear model, evaluate outputs, and connect numerical metrics to visual evidence from the underlying data.
Additional Learning
The concepts in this module connect directly to practical data handling and exploration in Python.
| Submodule | Python Connection | Why It Matters |
|---|---|---|
| Linear Regression | `LinearRegression` | Provides a baseline model for trend estimation. |
| Model Evaluation | Regression metrics | Metrics quantify prediction quality and model fit. |
| Comparative Plotting | Matplotlib subplots | Side-by-side plots improve analytical judgment. |