💡 Learning Outcomes
By the end of this module, learners will be able to:
❓ Key Questions
Data rarely arrives clean. Messy data is any dataset that violates the rules required for analysis.
💬 It may include missing values, inconsistent labels, extra spaces, or mixed types.
| Name | Age | Snack | Height | Home Country |
|---|---|---|---|---|
| Sarah | 21 | Chips | 1.63m | UK |
| sarah | 21 | crisps | 1.63 | United Kingdom |
| | NA | Cookies | 1.7 | UK |
🕵️ Think of this stage as being a data detective — scanning the scene, not solving the case yet.
| Problem | Example | Impact on Analysis |
|---|---|---|
| Inconsistent Entries | Sarah vs. sarah | Impossible to count unique names |
| Mixed Types/Formats | 1.63m vs 163 vs 5’4” | Blocks averages |
| Missing Values | NA or blanks | Misleading averages and errors |
| Synonyms/Variations | Chips vs. crisps | Hides true popularity |
⚠️ If data is inconsistent, your results can be misleading or invalid.
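The effect of inconsistent entries and synonyms can be sketched in a few lines of Python. The rows below are hypothetical, loosely modelled on the sample table, and the `SNACK_SYNONYMS` mapping is an assumption (it treats "chips" and "crisps" as the same snack).

```python
from collections import Counter

# Hypothetical rows, loosely modelled on the sample table.
rows = [
    {"name": "Sarah", "snack": "Chips"},
    {"name": "sarah", "snack": "crisps"},
    {"name": "Sarah ", "snack": "Cookies"},  # note the trailing space
]

# Assumed mapping: treat both labels as one snack.
SNACK_SYNONYMS = {"chips": "crisps"}


def normalise(text):
    """Trim surrounding spaces and lower-case the text."""
    return text.strip().lower()


# Unique names before and after normalising.
raw_names = {row["name"] for row in rows}
clean_names = {normalise(row["name"]) for row in rows}

# Snack counts before and after merging synonyms.
raw_snacks = Counter(row["snack"] for row in rows)
clean_snacks = Counter(
    SNACK_SYNONYMS.get(normalise(row["snack"]), normalise(row["snack"]))
    for row in rows
)

print(len(raw_names), len(clean_names))  # 3 1
print(clean_snacks.most_common(1))       # [('crisps', 2)]
```

In the raw data every snack appears once, so no snack looks more popular than another; after merging labels, crisps clearly leads.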
Organising data helps you:
🧩 Clean data = reliable analysis
| Technique | What It Does | Example |
|---|---|---|
| Standardisation | Convert inconsistent formats | ‘22 yrs’, ‘Twenty-Four’ → 22, 24 |
| Validation | Check values make sense | ‘Age: 200’ |
| Deduplication | Remove duplicates | Repeated “Alex C.” |
| Transformation | Convert data types | Height text → numeric |
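A minimal sketch of the four cleaning techniques applied together, using only the Python standard library. The records, the `WORD_AGES` lookup, and the 0 to 120 age range are assumptions made for illustration; "Priya" and "Tom" are invented names.

```python
# Hypothetical raw records; the values echo the examples above.
raw = [
    {"name": "Alex C.", "age": "22 yrs"},
    {"name": "Alex C.", "age": "22 yrs"},   # exact duplicate
    {"name": "Priya", "age": "Twenty-Four"},
    {"name": "Tom", "age": "200"},          # implausible age
]

# Assumed lookup for ages written as words; extend as needed.
WORD_AGES = {"twenty-four": 24}


def standardise_age(text):
    """Standardisation + transformation: age text -> int, or None."""
    cleaned = text.strip().lower().replace("yrs", "").strip()
    if cleaned in WORD_AGES:
        return WORD_AGES[cleaned]
    return int(cleaned) if cleaned.isdigit() else None


def is_valid_age(age):
    """Validation: the 0-120 range is an assumed plausibility check."""
    return age is not None and 0 <= age <= 120


seen = set()
clean = []
for record in raw:
    age = standardise_age(record["age"])
    key = (record["name"], age)
    if key in seen:        # Deduplication: skip rows already seen
        continue
    seen.add(key)
    if is_valid_age(age):  # keep only rows that pass validation
        clean.append({"name": record["name"], "age": age})

print(clean)
# [{'name': 'Alex C.', 'age': 22}, {'name': 'Priya', 'age': 24}]
```

This sketch drops the invalid row for brevity. In a real dataset it is often safer to flag such rows for review than to delete them silently.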
❌ Clean data isn’t perfect — just ready for reliable calculation.
| Type | Description | Examples | Operations Allowed |
|---|---|---|---|
| Categorical | Qualities, groups, names | Gender, Country, Snack | Counting, grouping |
| Numerical | Quantities that can be measured | Age, Steps, Commute | Mean, median, sum |
⚙️ Numerical values stored in mixed formats (e.g., 1.63m vs 163) must be converted to one numeric type and unit before calculating averages.
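One way such a conversion might look in Python is sketched below. The parsing rules are assumptions, not a general solution: a trailing "m" is stripped, bare numbers above 3 are treated as centimetres, and anything unparseable (such as "NA") becomes `None`.

```python
from statistics import mean


def height_to_metres(text):
    """Parse a height string into metres; return None if it cannot be parsed.

    Assumed rules: a trailing 'm' is stripped, and bare numbers above 3
    are treated as centimetres.
    """
    cleaned = text.strip().lower()
    if cleaned.endswith("m"):
        cleaned = cleaned[:-1].strip()
    try:
        value = float(cleaned)
    except ValueError:
        return None
    return value / 100 if value > 3 else value


raw_heights = ["1.63m", "163", "1.7", "NA"]
heights = [height_to_metres(h) for h in raw_heights]

# Average only the values that could be converted.
known = [h for h in heights if h is not None]

print(heights)                 # [1.63, 1.63, 1.7, None]
print(round(mean(known), 2))   # 1.65
```

Formats such as 5’4” are not handled here; they would need their own rule before being included in an average.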
Dataset:
Setup
Questions
Exploration is the first look at a dataset:
💡 Exploration helps catch cleaning mistakes and shapes further questions.
| Technique | Tells You | Example Question |
|---|---|---|
| Mean | Typical value | Average height? |
| Median | Middle value | Median commute time? |
| Mode | Most frequent | Most common snack? |
| Min/Max | Range | Youngest and oldest? |
📈 These require clean data.
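The four techniques in the table map directly onto Python's built-in `statistics` module and the `min`/`max` functions. The values below are hypothetical and assumed to be already cleaned.

```python
from statistics import mean, median, mode

# Hypothetical, already-cleaned values.
ages = [19, 21, 21, 22, 25]
snacks = ["crisps", "fruit", "crisps", "cookies", "crisps"]

print(mean(ages))            # 21.6  (typical value)
print(median(ages))          # 21    (middle value)
print(mode(snacks))          # crisps (most frequent)
print(min(ages), max(ages))  # 19 25 (range)
```

Note that the mode works on categories as well as numbers, while the mean and median only make sense for numerical columns.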
A simple bar chart quickly shows common categories (e.g., transport type).
| Statistic | Meaning | Best Use Case |
|---|---|---|
| Mean | Average | Symmetric data (e.g., height) |
| Median | Middle value | Skewed data (e.g., commute time) |
💬 When data is skewed, the median often represents the “typical” experience better than the mean.
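A short sketch of why this happens, using hypothetical commute times in which one long commute skews the data:

```python
from statistics import mean, median

# Hypothetical commute times in minutes; one long commute skews the data.
commute_minutes = [10, 12, 15, 15, 20, 90]

print(f"mean:   {mean(commute_minutes):.1f}")    # mean:   27.0
print(f"median: {median(commute_minutes):.1f}")  # median: 15.0
```

Five of the six commutes are 20 minutes or less, so the median of 15 describes most people better than the mean of 27, which is pulled upward by the single 90-minute value.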
Dataset: {insert link}
| Statistic | Question | Answer |
|---|---|---|
| Mode | Most common transport method | Bus |
| Mean | Average hours slept | 6.44 |
| Min/Max | Min/max age | 19 / 25 |
Example:
💡 Patterns show correlations, not causation.
| Technique | Example | Shows |
|---|---|---|
| Frequency counts | Most common snack | Popularity |
| Group comparison | Average age by snack type | Demographic patterns |
| Cross-tabulation | Snack by home country | Relationships |
| Sorting/filtering | Top 3 snacks | Quick insights |
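The techniques in the table can be sketched with the Python standard library. The survey rows are invented for illustration and are assumed to be cleaned already.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical cleaned survey rows (values invented for illustration).
rows = [
    {"snack": "crisps", "age": 20, "country": "UK"},
    {"snack": "crisps", "age": 21, "country": "UK"},
    {"snack": "cookies", "age": 22, "country": "Ireland"},
    {"snack": "fruit", "age": 24, "country": "UK"},
    {"snack": "cookies", "age": 20, "country": "UK"},
]

# Frequency counts, then sorting/filtering for the top 2 snacks.
snack_counts = Counter(row["snack"] for row in rows)
top_two = snack_counts.most_common(2)

# Group comparison: average age per snack.
ages_by_snack = defaultdict(list)
for row in rows:
    ages_by_snack[row["snack"]].append(row["age"])
avg_age = {snack: mean(ages) for snack, ages in ages_by_snack.items()}

# Cross-tabulation: count each (snack, country) pair.
crosstab = Counter((row["snack"], row["country"]) for row in rows)

print(top_two)
print(avg_age)
print(crosstab[("crisps", "UK")])  # 2
```

With only five rows these numbers say little on their own; the point is the shape of each operation, not the result.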
Help you:
Example table:
| Snack | Avg Age | Avg Steps |
|---|---|---|
| Crisps | 20.1 | 4,000 |
| Cookies | 21.3 | 5,500 |
| Fruit | 23.2 | 7,000 |
Examples:
Dataset: {insert link}
Find 3 facts in 5 minutes.
Example insights:
Prediction = using patterns to estimate unknown values.
🧠 Predictions rely on similarity: past behaviour → future expectation.
Example:
Prediction is informed guessing.
| Approach | Example | Concept |
|---|---|---|
| Rule-based | Age > 20 → savoury | Heuristics |
| Trend extrapolation | Snack preference over time | Trend extension |
| Probabilistic | 60% chance of chips | Likelihood |
📌 Weather forecasts are predictions.
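Two of the approaches in the table, rule-based and probabilistic, can be sketched as follows. The past choices are hypothetical, and the "sweet" fallback in the rule is an assumption, since the table only gives the savoury half of the rule.

```python
from collections import Counter


def rule_based_snack(age):
    """Rule from the table: age > 20 -> savoury.

    The 'sweet' fallback is an assumption for illustration.
    """
    return "savoury" if age > 20 else "sweet"


# Hypothetical record of 10 past snack choices.
past_choices = ["chips"] * 6 + ["cookies"] * 3 + ["fruit"]

# Probabilistic approach: estimate likelihood from past frequency.
counts = Counter(past_choices)
total = sum(counts.values())
probabilities = {snack: n / total for snack, n in counts.items()}

# The most likely snack is the one with the highest estimated probability.
prediction = max(probabilities, key=probabilities.get)

print(rule_based_snack(23))                  # savoury
print(prediction, probabilities[prediction]) # chips 0.6
```

The probabilistic estimate does not say the next choice will be chips, only that chips were chosen in 60% of the recorded cases.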
Groups predict Estimated Daily Energy Level (High/Low).
Benchmark model:
Low Energy if (Hours Slept ≤ 6.0) OR (Steps < 5000)
Result: 6 / 10 students = 60% Low Energy
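The benchmark rule can be written as a small Python function. The four students below are hypothetical and are not the class dataset, so the share of Low Energy predictions differs from the 60% reported above.

```python
def predict_energy(hours_slept, steps):
    """Benchmark rule from the text: Low if sleep <= 6.0 h OR steps < 5000."""
    return "Low" if hours_slept <= 6.0 or steps < 5000 else "High"


# Hypothetical students (not the class dataset).
students = [
    {"hours_slept": 5.5, "steps": 8000},  # Low: short sleep
    {"hours_slept": 7.5, "steps": 3000},  # Low: few steps
    {"hours_slept": 6.0, "steps": 6000},  # Low: exactly 6.0 h counts as Low
    {"hours_slept": 8.0, "steps": 9000},  # High: neither condition met
]

predictions = [predict_energy(**student) for student in students]
low_share = predictions.count("Low") / len(predictions)

print(predictions)         # ['Low', 'Low', 'Low', 'High']
print(f"{low_share:.0%}")  # 75%
```

Because the rule uses OR, a student needs both enough sleep and enough steps to be predicted High. The boundary cases matter: 6.0 hours counts as Low, while exactly 5000 steps does not.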
Key takeaways:
Additional learning:
| Submodule | Python Connection | Why It Matters |
|---|---|---|
| Organising Messy Data | pandas cleaning tools | Foundation of real analysis |
| Exploring Data | .describe(), .info(), .value_counts() | Quick understanding |
| Finding Patterns | groupby, pivot_table, clustering | Reveals relationships |
| Predicting Outcomes | simple models, sklearn basics | Intro to ML thinking |
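As a first taste of the pandas tools named in the table, here is a minimal sketch. It assumes pandas is installed, and the small DataFrame is hypothetical.

```python
import pandas as pd

# Hypothetical, already-cleaned survey data.
df = pd.DataFrame({
    "snack": ["crisps", "cookies", "crisps", "fruit"],
    "age": [20, 22, 21, 24],
})

df.info()                          # column names, dtypes, non-null counts
print(df["age"].describe())        # count, mean, std, min, quartiles, max

snack_counts = df["snack"].value_counts()          # frequency per category
avg_age_by_snack = df.groupby("snack")["age"].mean()  # average age per snack

print(snack_counts)
print(avg_age_by_snack)
```

Each call corresponds to a submodule above: `.info()` and `.describe()` for exploring, `.value_counts()` for frequency counts, and `groupby` for group comparisons.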