This lesson covers what the scikit-learn and statsmodels packages are and when to use each.

Outline: scikit-learn (~20 min), statsmodels (~20 min). 🔧 Activities spaced throughout the session.
Important
This lesson doesn’t depend on any previous state in your notebook.
Scikit-Learn is a Python package that provides access to well-known machine learning algorithms through a clean application programming interface (API). It has been built by hundreds of contributors from around the world and is used across industry and academia.
Scikit-Learn is built upon Python’s NumPy (Numerical Python) and SciPy (Scientific Python) libraries, which enable efficient in-core numerical and scientific computation within Python. As such, Scikit-Learn is not specifically designed for extremely large datasets, though there is some work in this area.
We will use Scikit-Learn to build a full linear regression example in stages, following the example in the documentation.
This nicely demonstrates how linear regression draws the straight line that minimizes the residual sum of squares between the observed responses in the dataset and the responses predicted by the linear approximation.
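As a preview of that workflow (not the documentation's exact code — the data here are simulated, and the variable names are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Simulated data: y ≈ 2.5 * x + 1 plus Gaussian noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(50, 1))                      # single feature, 2D as sklearn expects
y = 2.5 * x.ravel() + 1.0 + rng.normal(0, 0.5, size=50)

model = LinearRegression()
model.fit(x, y)                                           # ordinary least-squares fit

print(model.coef_[0], model.intercept_)                   # slope and intercept estimates
print(model.score(x, y))                                  # R² on the training data
```

With low noise the recovered slope and intercept should land close to the true values of 2.5 and 1.0.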
statsmodels is a Python library for classical statistical modeling. It complements pandas/NumPy by providing well-tested implementations of linear models (OLS), generalised linear models (GLM), time series (ARIMA/ETS), and statistical tests, with rich summaries (standard errors, p-values, confidence intervals, diagnostic metrics).
Use statsmodels when you need interpretability, inference, and statistical diagnostics beyond what scikit-learn typically exposes.
Link to docs: Docs
Worked Example in docs:
The formula API (statsmodels.formula.api, conventionally imported as smf) uses R-style model formulas via patsy. It integrates smoothly with pandas DataFrames.
Formulas: y ~ x1 + x2 (additive), x1:x2 (interaction), C(cat) (treat a column as categorical). Common variants:
y ~ x1 + x2 → main effects only
y ~ x1 * x2 → includes x1, x2, and their interaction (x1:x2)
y ~ x1:x2 → interaction only, no main effects
y ~ x1 + x2 + x1:x2 → explicit version of x1 * x2
ols(...).fit() returns a results object (.params, .bse, .pvalues, .conf_int()).
Use clean column names (snake_case, no spaces) to avoid quoting hassles in formulas.
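A minimal, self-contained sketch of a formula-API fit (synthetic data; column names are illustrative):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data: y = 1 + 2*x1 - 0.5*x2 plus small noise
rng = np.random.default_rng(42)
df = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
df["y"] = 1.0 + 2.0 * df.x1 - 0.5 * df.x2 + rng.normal(scale=0.1, size=100)

res = smf.ols("y ~ x1 + x2", data=df).fit()  # returns a results object
print(res.params)      # Intercept, x1, x2 estimates
print(res.pvalues)     # per-coefficient p-values
print(res.conf_int())  # 95% confidence intervals
```

Note the intercept is included automatically; you never list it in the formula.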
Key fields from model.summary():
Extract the essentials programmatically:
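One way such extraction might look, given a fitted results object res (built here on a small synthetic dataset so the snippet runs standalone):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Small synthetic fit so `res` exists; in practice use your own model
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(size=80)})
df["y"] = 3.0 * df.x + rng.normal(scale=0.2, size=80)
res = smf.ols("y ~ x", data=df).fit()

coefs = res.params                                  # point estimates
ses = res.bse                                       # standard errors
pvals = res.pvalues                                 # p-values
ci = res.conf_int().rename(columns={0: "lo", 1: "hi"})

# One tidy table with the essentials per coefficient
table = pd.concat([coefs.rename("coef"), ses.rename("se"),
                   pvals.rename("p"), ci], axis=1)
print(table)
```

Because everything comes back as pandas objects, the table can be filtered, rounded, or written to disk like any other DataFrame.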
From get_prediction(...).summary_frame(), which returns a DataFrame you can save:

- mean_ci_lower/upper: confidence interval (CI) for the expected mean of y at these predictor values.
- obs_ci_lower/upper: prediction interval (PI) for a new observation (wider).

Check linear model assumptions quickly: linearity, homoscedasticity, roughly normal residuals, independence.
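A sketch of pulling those interval columns and a quick residual sanity check (synthetic data; names illustrative):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data: y = 2x + 5 plus noise
rng = np.random.default_rng(1)
df = pd.DataFrame({"x": rng.uniform(0, 10, 60)})
df["y"] = 2.0 * df.x + 5.0 + rng.normal(scale=1.0, size=60)
res = smf.ols("y ~ x", data=df).fit()

# Intervals at three new x values
new = pd.DataFrame({"x": [2.0, 5.0, 8.0]})
frame = res.get_prediction(new).summary_frame(alpha=0.05)
print(frame[["mean", "mean_ci_lower", "mean_ci_upper",
             "obs_ci_lower", "obs_ci_upper"]])

# Quick assumption check: OLS residuals (with an intercept) average to ~0,
# and plotting them against fitted values should show no pattern
print(res.resid.mean())
```

The observation intervals are always wider than the mean intervals, because they add the irreducible noise of a single new case on top of the uncertainty in the fitted mean.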
For non-Gaussian outcomes (binary, counts, rates), use GLM families: Binomial (logit/probit), Poisson, Gamma (with appropriate links).

Exercise: Fit mpg ~ hp * wt (which expands to hp + wt + hp:wt). Does the interaction appear significant?
Solution:
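A possible solution, sketched on simulated data with the same mpg, hp, wt columns (as in the mtcars dataset) so the snippet runs standalone — swap in the real cars DataFrame:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated stand-in for the cars data, with a built-in hp:wt interaction
rng = np.random.default_rng(0)
n = 200
cars = pd.DataFrame({"hp": rng.uniform(50, 250, n),
                     "wt": rng.uniform(1.5, 5.5, n)})
cars["mpg"] = (50 - 0.03 * cars.hp - 3.0 * cars.wt
               - 0.01 * cars.hp * cars.wt
               + rng.normal(scale=1.0, size=n))

res = smf.ols("mpg ~ hp * wt", data=cars).fit()  # hp + wt + hp:wt
print(res.params["hp:wt"], res.pvalues["hp:wt"])  # interaction estimate and p-value
```

On real data the p-value could of course go either way; the simulation here bakes in a genuine interaction, so it comes out small.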
Explanation: The hp:wt row tests whether the effect of horsepower on mpg depends on weight. A small p-value (e.g., < 0.05) suggests meaningfully different slopes across weights; otherwise, prefer the simpler additive model.
Exercise: Create a binary indicator heavy = (wt > wt.median()) and fit mpg ~ hp + C(heavy). Interpret the C(heavy)[T.True] coefficient.
Solution:
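A possible solution, again on simulated mpg/hp/wt data so it runs standalone (replace with the real cars DataFrame):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated cars: heavier cars get fewer mpg, controlling for hp
rng = np.random.default_rng(0)
n = 200
cars = pd.DataFrame({"hp": rng.uniform(50, 250, n),
                     "wt": rng.uniform(1.5, 5.5, n)})
cars["mpg"] = (50 - 0.04 * cars.hp - 3.5 * cars.wt
               + rng.normal(scale=1.0, size=n))

cars["heavy"] = cars.wt > cars.wt.median()        # binary indicator
res = smf.ols("mpg ~ hp + C(heavy)", data=cars).fit()
print(res.params["C(heavy)[T.True]"])             # heavy-vs-light mpg gap at fixed hp
print(res.pvalues["C(heavy)[T.True]"])
```

patsy treats the boolean column as categorical with False as the reference level, which is why the coefficient is labelled [T.True].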
Explanation: C(heavy)[T.True] is the average difference in mpg between heavy and light cars at the same horsepower. A negative, significant coefficient indicates heavier cars get fewer mpg, controlling for hp.
Exercise: Make predictions for hp = {90, 120, 150} at wt = 3.0. Save the full prediction frame to CSV.
Solution:
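A possible solution, sketched on simulated mpg/hp/wt data (the CSV filename is illustrative; substitute the real cars DataFrame and fitted model):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated cars data and the interaction model from the earlier exercise
rng = np.random.default_rng(0)
n = 200
cars = pd.DataFrame({"hp": rng.uniform(50, 250, n),
                     "wt": rng.uniform(1.5, 5.5, n)})
cars["mpg"] = (50 - 0.03 * cars.hp - 3.0 * cars.wt
               - 0.01 * cars.hp * cars.wt
               + rng.normal(scale=1.0, size=n))
res = smf.ols("mpg ~ hp * wt", data=cars).fit()

# Predictions at the requested settings, with CIs and PIs
new = pd.DataFrame({"hp": [90, 120, 150], "wt": [3.0, 3.0, 3.0]})
frame = res.get_prediction(new).summary_frame(alpha=0.05)
frame.to_csv("mpg_predictions.csv", index=False)  # full table: mean, CI, PI columns
print(frame)
```

Writing summary_frame() to CSV preserves every interval column, so the saved file can be reported or plotted without refitting the model.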
Explanation: The table includes fitted means and intervals. Mean CI reflects uncertainty in the expected mpg at those settings; obs CI is wider, reflecting variability for a single new car.
Key points:

- Use scikit-learn to fit and evaluate a basic linear regression model.
- statsmodels focuses on inference and diagnostics for classical models.
- Use the formula API (smf.ols("y ~ x1 + x2", data=...)) for concise, readable models.
- Use get_prediction(...).summary_frame() for CIs/PIs you can report.

This module combines exploratory plotting with introductory regression modelling. Learners fit a linear model, evaluate outputs, and connect numerical metrics to visual evidence from the underlying data.
The concepts in this module connect directly to practical data handling and exploration in Python.
| Submodule | Python Connection | Why It Matters |
|---|---|---|
| Linear Regression | LinearRegression | Provides a baseline model for trend estimation. |
| Model Evaluation | Regression metrics | Metrics quantify prediction quality and model fit. |
| Comparative Plotting | Matplotlib subplots | Side-by-side plots improve analytical judgment. |
Attribution
This lesson is derived from materials developed by the Software Carpentry project.
The original content is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license: https://github.com/swcarpentry/python-novice-inflammation/blob/main/LICENSE.md