M6 – Organising and Exploring Data
💡 Learning Outcomes
By the end of this module, learners will be able to:
- Recognise why raw data needs cleaning and organisation before analysis.
- Apply basic techniques to tidy, structure, and validate datasets.
- Explore data to identify common features, patterns, and relationships.
- Move from descriptive observations to forming hypotheses and simple predictions.
❓ Key Questions
- Why is real-world data often messy or inconsistent?
- How can cleaning and organising data improve accuracy?
- What can we learn through basic data exploration and summary?
- How do insights and hypotheses support later prediction and analysis?
Structure & Agenda
- Organising Messy Data (~20 min)
- Exploring Data (~20 min)
- Finding Patterns (~20 min)
- Predicting Outcomes (~20 min)
Submodule 1: Organising Messy Data
What is “Messy Data”?
Data rarely arrives clean. Messy data is any dataset whose values are too inconsistent or incomplete to analyse directly.
💬 It may include missing values, inconsistent labels, extra spaces, or mixed types.
Example Messy Dataset
| Name | Age | Snack | Height | Home Country |
|---|---|---|---|---|
| Sarah | 21 | Chips | 1.63m | UK |
| sarah | 21 | crisps | 1.63 | United Kingdom |
|  | NA | Cookies | 1.7 | UK |
🕵️ Think of this stage as being a data detective — scanning the scene, not solving the case yet.
Problems Noticed
| Problem | Example | Impact on Analysis |
|---|---|---|
| Inconsistent Entries | Sarah vs. sarah | Impossible to count unique names |
| Mixed Types/Formats | 1.63m vs 163 vs 5’4” | Averages cannot be calculated |
| Missing Values | NA or blanks | Misleading averages and errors |
| Synonyms/Variations | Chips vs. crisps | Hides true popularity |
⚠️ If data is inconsistent, your results can be misleading or invalid.
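The detective stage can be sketched in a few lines of Python. This is a minimal illustration rather than a prescribed method: the rows below are hypothetical, modelled on the example dataset (with the blank cell assumed to be the name), and the helper names `count_missing` and `case_variants` are invented for this sketch. It only reports problems; it does not fix them yet.

```python
# Hypothetical rows modelled on the example dataset; None marks a blank cell.
rows = [
    {"name": "Sarah", "age": "21", "snack": "Chips", "height": "1.63m", "country": "UK"},
    {"name": "sarah", "age": "21", "snack": "crisps", "height": "1.63", "country": "United Kingdom"},
    {"name": None, "age": "NA", "snack": "Cookies", "height": "1.7", "country": "UK"},
]

# Assumption: blanks and the text "NA" both count as missing.
MISSING = {None, "", "NA"}

def count_missing(rows):
    """Count missing cells per column."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            if value in MISSING:
                counts[column] = counts.get(column, 0) + 1
    return counts

def case_variants(rows, column):
    """Find values that differ only by capitalisation or surrounding spaces."""
    groups = {}
    for row in rows:
        value = row[column]
        if value in MISSING:
            continue
        groups.setdefault(value.strip().lower(), set()).add(value)
    return {key: sorted(found) for key, found in groups.items() if len(found) > 1}

missing_report = count_missing(rows)
name_variants = case_variants(rows, "name")
print(missing_report)  # {'name': 1, 'age': 1}
print(name_variants)   # {'sarah': ['Sarah', 'sarah']}
```

Synonyms such as Chips and crisps are not caught by a case check; they need a lookup table that a person writes by hand.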
Why Organising Matters
Organising data helps you:
- Ensure accuracy and consistency
- Remove duplicates
- Handle missing values
- Format data for analysis
🧩 Clean data = reliable analysis
The Four Pillars of Data Tidy-Up
- Standardisation: convert inconsistent formats, e.g., ‘22 yrs’, ‘Twenty-Four’ → 22, 24
- Validation: check that values make sense, e.g., ‘Age: 200’
- Deduplication: remove duplicates, e.g., repeated “Alex C.”
- Transformation: convert data types, e.g., height text → numeric
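The four tidy-up steps can each be sketched as a small Python function. Everything here is an assumption made for illustration: the function names, the one-entry word lookup, and the 0 to 120 age bounds are not part of the module, and real data would need broader rules.

```python
import re

# Tiny lookup for spelled-out numbers; a real one would need many more entries.
WORD_NUMBERS = {"twenty-four": 24}

def standardise_age(raw):
    """Standardisation: turn '22 yrs' or 'Twenty-Four' into an integer, else None."""
    text = str(raw).strip().lower()
    if text in WORD_NUMBERS:
        return WORD_NUMBERS[text]
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None

def is_valid_age(age, low=0, high=120):
    """Validation: check the value falls in a plausible range (bounds are assumptions)."""
    return age is not None and low <= age <= high

def deduplicate(names):
    """Deduplication: keep the first occurrence, ignoring case and surrounding spaces."""
    seen, unique = set(), []
    for name in names:
        key = name.strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(name.strip())
    return unique

def height_to_metres(raw):
    """Transformation: convert height text such as '1.63m' into a float."""
    return float(str(raw).strip().lower().rstrip("m"))

ages = [standardise_age(a) for a in ["22 yrs", "Twenty-Four", "200"]]
print(ages)                                         # [22, 24, 200]
print([is_valid_age(a) for a in ages])              # [True, True, False]
print(deduplicate(["Alex C.", "alex c. ", "Sam"]))  # ['Alex C.', 'Sam']
print(height_to_metres("1.63m"))                    # 1.63
```

Validation only flags the suspicious age of 200; deciding whether to correct it, drop it, or ask the source is a separate judgement.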
❌ Clean data isn’t perfect — just ready for reliable calculation.
Categorical vs Numerical Data
| Type | Description | Examples | Operations Allowed |
|---|---|---|---|
| Categorical | Qualities, groups, names | Gender, Country, Snack | Counting, grouping |
| Numerical | Quantities that can be measured | Age, Steps, Commute | Mean, median, sum |
⚙️ Mixed numerical values must be converted before calculating averages.
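As a rough sketch of that conversion, the snippet below turns the mixed height formats seen earlier into metres before averaging. The function name `to_metres` is hypothetical, and the rule that a bare number above 3 must be centimetres is an assumption that would not hold for every dataset.

```python
import re
from statistics import mean

def to_metres(raw):
    """Convert a height string to metres; return None if the format is not recognised."""
    # Normalise curly quotes and drop the inch mark so 5’4” becomes 5'4.
    text = raw.strip().lower().replace("’", "'").replace("”", "").replace('"', "")
    feet_inches = re.fullmatch(r"(\d+)'\s*(\d+)", text)
    if feet_inches:
        feet, inches = map(int, feet_inches.groups())
        return round((feet * 12 + inches) * 0.0254, 2)
    try:
        value = float(text.rstrip("m").strip())
    except ValueError:
        return None
    # Assumption: a bare number above 3 is in centimetres, not metres.
    return round(value / 100, 2) if value > 3 else value

heights = ["1.63m", "163", "5’4”", "NA"]
converted = [to_metres(h) for h in heights]
usable = [h for h in converted if h is not None]

print(converted)               # [1.63, 1.63, 1.63, None]
print(round(mean(usable), 2))  # 1.63
```

The missing value is left out of the average rather than treated as zero, which would drag the result down.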
Activity: The Data Detective Game
Dataset:
Setup
- Present dataset
- Identify errors (missing values, inconsistencies, duplicates)
Questions
- Unique Students
  - Remove duplicates
  - Answer: 9
- Most Frequent Home Country
  - Standardise country names
  - Answer: United States (3 occurrences)
- Average Age of Chips Lovers
  - Standardise chips/crisps
  - Convert ages
  - Answer: 21.5
Submodule 2: Exploring Data
What Does It Mean to Explore Data?
Exploration is the first look at a dataset:
- What variables exist?
- What types are they?
- What common values appear?
💡 Exploration helps catch cleaning mistakes and shapes further questions.
Basic Exploration Techniques
| Technique | Tells You | Example Question |
|---|---|---|
| Mean | Typical value | Average height? |
| Median | Middle value | Median commute time? |
| Mode | Most frequent | Most common snack? |
| Min/Max | Range | Youngest and oldest? |
📈 These require clean data.
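The four techniques in the table map directly onto Python's built-in `statistics` module. The values below are hypothetical and already cleaned; they are not the module's dataset.

```python
from statistics import mean, median, mode

# Hypothetical, already-cleaned values.
ages = [19, 21, 21, 22, 25]
snacks = ["Chips", "Cookies", "Chips", "Fruit", "Chips"]

summary = {
    "mean_age": mean(ages),               # typical value
    "median_age": median(ages),           # middle value
    "mode_snack": mode(snacks),           # most frequent value
    "age_range": (min(ages), max(ages)),  # youngest and oldest
}
print(summary)
```

If the ages were still stored as text such as "21 yrs", `mean` would raise an error, which is one practical reason these summaries require clean data.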
Visualising Distributions
Plotting a distribution reveals its:
- Symmetry
- Skew
- Outliers
A simple bar chart quickly shows common categories (e.g., transport type).
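A bar chart is just category counts drawn as bars. The sketch below prints a text version using only the standard library; in practice a plotting library would draw it properly. The transport values are hypothetical.

```python
from collections import Counter

# Hypothetical transport answers from six students.
transport = ["Bus", "Walk", "Bus", "Bike", "Bus", "Walk"]

counts = Counter(transport)

# One row per category, most common first, with one '#' per student.
for category, count in counts.most_common():
    print(f"{category:<5} {'#' * count} ({count})")
# Bus   ### (3)
# Walk  ## (2)
# Bike  # (1)
```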
The Importance of the Median
| Statistic | Meaning | Best Use Case |
|---|---|---|
| Mean | Average | Symmetric data (e.g., height) |
| Median | Middle value | Skewed data (e.g., commute time) |
💬 The median often represents the “typical” experience better, because a few extreme values cannot pull it up or down.
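A small worked example makes the difference visible. The commute times below are hypothetical: six short commutes and one very long one.

```python
from statistics import mean, median

# Hypothetical commute times in minutes; one student travels two hours.
commute_minutes = [10, 12, 15, 15, 20, 25, 120]

average = mean(commute_minutes)
middle = median(commute_minutes)

print(f"mean:   {average:.1f} min")  # mean:   31.0 min
print(f"median: {middle:.1f} min")   # median: 15.0 min
```

Six of the seven students commute for 25 minutes or less, so the median of 15 describes them far better than the mean of 31.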
Summary Game
Dataset: {insert link}
| Statistic | Question | Answer |
|---|---|---|
| Mode | Most common transport method | Bus |
| Mean | Average hours slept | 6.44 |
| Min/Max | Min/max age | 19 / 25 |
Submodule 3: Finding Patterns
From Observation to Patterns
Example:
- Observation: average steps = 7,000
- Pattern: walkers average 10,000 vs bus users 4,500
💡 Patterns show correlations, not causation.
Common Pattern Techniques
| Technique | Example | Shows |
|---|---|---|
| Frequency counts | Most common snack | Popularity |
| Group comparison | Average age by snack type | Demographic patterns |
| Cross-tabulation | Snack by home country | Relationships |
| Sorting/filtering | Top 3 snacks | Quick insights |
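Each technique in the table can be sketched with the standard library. The student records are hypothetical; in pandas the same ideas correspond to tools such as `value_counts`, `groupby`, and `pivot_table`.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical, already-cleaned student records.
students = [
    {"snack": "Crisps", "age": 20, "country": "UK"},
    {"snack": "Crisps", "age": 21, "country": "US"},
    {"snack": "Cookies", "age": 22, "country": "UK"},
    {"snack": "Fruit", "age": 24, "country": "US"},
    {"snack": "Crisps", "age": 19, "country": "UK"},
    {"snack": "Cookies", "age": 21, "country": "US"},
]

# Frequency counts: how often each snack appears.
snack_counts = Counter(s["snack"] for s in students)

# Group comparison: average age within each snack group.
ages_by_snack = defaultdict(list)
for s in students:
    ages_by_snack[s["snack"]].append(s["age"])
avg_age_by_snack = {snack: mean(ages) for snack, ages in ages_by_snack.items()}

# Cross-tabulation: count each (snack, country) combination.
crosstab = Counter((s["snack"], s["country"]) for s in students)

# Sorting/filtering: the top 3 snacks by count.
top_three = snack_counts.most_common(3)

print(top_three)                  # [('Crisps', 3), ('Cookies', 2), ('Fruit', 1)]
print(crosstab[("Crisps", "UK")])  # 2
```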
Why Patterns Matter
Patterns help you:
- Form hypotheses
- Understand relationships
- Decide what to analyse next
Building Hypotheses
Example table:
| Snack | Avg Age | Avg Steps |
|---|---|---|
| Crisps | 20.1 | 4,000 |
| Cookies | 21.3 | 5,500 |
| Fruit | 23.2 | 7,000 |
Examples:
- Fruit eaters may walk more
- Snack choice not strongly age-related
Quick Insights Race
Dataset: {insert link}
Find 3 facts in 5 minutes.
Example insights:
- Chips/Crisps most popular
- 22.22% have ≥ 1 hr commute
- 22.22% don’t use library
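Percentage insights like these come from a simple count divided by the total. The sketch below assumes nine students, two of whom meet the condition, which would give 22.22%; the commute values themselves are hypothetical and the helper name `percentage` is invented for this example.

```python
def percentage(part, total):
    """Return part/total as a percentage rounded to two decimal places."""
    if total == 0:
        raise ValueError("total must be greater than zero")
    return round(100 * part / total, 2)

# Hypothetical commute times in minutes for nine students.
commute_minutes = [15, 20, 60, 25, 30, 75, 10, 45, 40]
long_commutes = sum(1 for minutes in commute_minutes if minutes >= 60)

print(long_commutes, "of", len(commute_minutes))        # 2 of 9
print(percentage(long_commutes, len(commute_minutes)))  # 22.22
```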
Submodule 4: Predicting Outcomes
What Does Prediction Mean?
Prediction = using patterns to estimate unknown values.
🧠 Predictions rely on similarity: past behaviour → future expectation.
Hypothesis → Prediction
Example:
- Hypothesis: age affects snack preference
- Prediction: 25+ choose savoury snacks
Simple Predictive Thinking
- Identify question
- Choose variables
- Examine patterns
- Apply pattern to new situation
Prediction is informed guessing.
Forms of Prediction
| Approach | Example | Concept |
|---|---|---|
| Rule-based | Age > 20 → savoury | Heuristics |
| Trend extrapolation | Snack preference over time | Trend extension |
| Probabilistic | 60% chance chips | Likelihood |
📌 Weather forecasts are predictions.
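The three approaches in the table can be sketched as tiny functions. All data here is hypothetical, the function names are invented for illustration, and the "sweet" fallback in the rule is an assumption, since the table only states what happens when age is above 20.

```python
from collections import Counter

def rule_based(age):
    """Rule-based: Age > 20 -> savoury. The 'sweet' fallback is an assumption."""
    return "savoury" if age > 20 else "sweet"

def extrapolate_next(values):
    """Trend extrapolation: extend the most recent change by one more step."""
    return values[-1] + (values[-1] - values[-2])

def estimate_probabilities(choices):
    """Probabilistic: treat past frequencies as likelihoods."""
    counts = Counter(choices)
    return {item: count / len(choices) for item, count in counts.items()}

# Hypothetical share of students (in %) choosing chips over three terms.
chips_share_by_term = [40, 45, 50]

# Hypothetical record of ten past snack choices.
past_choices = ["chips"] * 6 + ["cookies"] * 3 + ["fruit"]
probabilities = estimate_probabilities(past_choices)

print(rule_based(23))                         # savoury
print(extrapolate_next(chips_share_by_term))  # 55
print(probabilities["chips"])                 # 0.6
```

The probabilistic version returns a likelihood for every option instead of a single answer, so the uncertainty stays visible.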
Uncertainty Matters
- Predictions ≠ guarantees
- High probability ≠ certainty
- Aim to be less wrong
Prediction Game
Groups predict Estimated Daily Energy Level (High/Low).
Steps
- Form hypothesis
- Select variables
- Define rule
- Calculate % Low Energy
Benchmark model:
Low Energy if (Hours Slept ≤ 6.0) OR (Steps < 5000)
Result: 6 / 10 students = 60% Low Energy
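The benchmark rule translates directly into code. The four students below are hypothetical and are not the class dataset, so the percentage differs from the result above; the function name `is_low_energy` is invented for this sketch.

```python
def is_low_energy(hours_slept, steps):
    """Benchmark rule: low energy if sleep is 6.0 hours or less, or steps are under 5000."""
    return hours_slept <= 6.0 or steps < 5000

# Hypothetical (hours slept, daily steps) pairs.
students = [(5.5, 8000), (7.0, 4000), (8.0, 9000), (7.5, 10000)]

low_count = sum(is_low_energy(hours, steps) for hours, steps in students)
percent_low = 100 * low_count / len(students)

print(f"{low_count} / {len(students)} students = {percent_low:.0f}% Low Energy")
# 2 / 4 students = 50% Low Energy
```

Because the rule uses OR, a student who sleeps well but walks little is still classed as Low Energy; swapping in AND would make the rule much stricter.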
📚 Keypoints
Key takeaways:
- Organising data ensures accuracy
- Exploration builds understanding
- Patterns inspire hypotheses
- Predictions extend insights
- Uncertainty is inherent
🔦 Hints
Additional learning:
| Submodule | Python Connection | Why It Matters |
|---|---|---|
| Organising Messy Data | pandas cleaning tools | Foundation of real analysis |
| Exploring Data | .describe(), .info(), .value_counts() | Quick understanding |
| Finding Patterns | groupby, pivot_table, clustering | Reveals relationships |
| Predicting Outcomes | simple models, sklearn basics | Intro to ML thinking |