
Introduction
Understanding where data comes from, how it is collected, and why certain choices matter is fundamental to good research practice.
This module introduces the principles and practicalities of collecting high-quality, ethical, and relevant data to support effective analysis.
Learning Outcomes
By the end of this module, you will be able to:
- distinguish between primary and secondary data sources
- plan data collection in line with a clear analytical purpose
- select appropriate formats and tools for structured data capture
- identify sources of bias and assess fairness in datasets
- relate these principles to equivalent workflows in Python
Key Terms
Primary data — data collected first-hand for your research purpose
Secondary data — pre-existing data collected by others
Bias — systematic error that distorts representation
Relevance — the degree to which data addresses the research question
Completeness — presence or absence of required variables
Consistency — uniform formats, coding and measurement standards
1. What Is Data and Why Does It Matter?
Data refers to raw facts, observations, or measurements describing phenomena in the world.
From these, we derive information, insights, and evidence.
Data enables:
- identifying patterns
- supporting evidence-based decisions
- optimising processes
- predicting future trends
- driving innovation in research and practice
2. Sources of Data
2.1 Primary Data
Data that you collect directly.
Examples
- Surveying classmates about favourite snacks
- Measuring heights in a room
- Counting steps using your phone
- Observing activity at a campus gate
Strengths: high control, tailored, customisable
Limitations: slower, resource-intensive
2.2 Secondary Data
Data collected by someone else, usually for a different purpose.
Examples
- National statistics portals
- Open weather datasets
- Published research data
- University demographic reports
Strengths: fast, low cost, broad coverage
Limitations: relevance constraints; fixed variables
2.3 Comparison
| Aspect | Primary data | Secondary data |
| --- | --- | --- |
| Control | High | Low |
| Cost | Higher | Lower |
| Speed | Slow | Fast |
| Relevance | Tailored | Variable |
| Richness | Flexible | Fixed |
Activity: Data Hunt
Task 1 — Collect primary data
Gather from 3 peers:
- Name
- Height
- Favourite snack
Task 2 — Compare to secondary data
Use the provided dataset containing:
- Gender, Age, Home Country
- Steps Walked (yesterday)
- Library Usage (per week)
- Commute Time + Transport Method
- IMD Score
- Hours Slept
- Student Status
- Club/Society Participation
Task 3 — Reflect
How does your primary data differ from the secondary dataset in completeness, relevance, and consistency?
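The Task 3 reflection can also be approached in Python. The sketch below is only an illustration: the two small tables and all of their values are invented, and they stand in for your own peer data and for the provided dataset.

```python
import pandas as pd

# Hypothetical primary data from 3 peers (values invented for illustration).
primary = pd.DataFrame({
    "Name": ["Peer A", "Peer B", "Peer C"],
    "Height": ["170 cm", "5ft 6", None],        # mixed units and one gap
    "Favourite Snack": ["crisps", "Crisps", "apple"],
})

# Hypothetical extract of a secondary dataset (values invented for illustration).
secondary = pd.DataFrame({
    "Gender": ["F", "M", "F"],
    "Age": [19, 22, 20],
    "Hours Slept": [7.0, 6.5, 8.0],
})

# Completeness: share of missing values in each column.
primary_missing = primary.isna().mean()
secondary_missing = secondary.isna().mean()

# Relevance: which variables do the two sources have in common?
shared_columns = sorted(set(primary.columns) & set(secondary.columns))

# Consistency: do snack labels collapse once capitalisation is normalised?
raw_snacks = primary["Favourite Snack"].nunique()
clean_snacks = primary["Favourite Snack"].str.lower().nunique()

print(primary_missing)
print(secondary_missing)
print("Shared columns:", shared_columns)
print("Distinct snacks before/after lowercasing:", raw_snacks, clean_snacks)
```

In this invented example the hand-collected table has a gap and inconsistent labels, while the secondary extract is complete but shares no variables with it.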
3. Planning for Good Data
3.1 Why Planning Matters
Good data demonstrates:
- Relevance — aligned with the research question
- Accuracy — faithfully measured
- Completeness — minimal gaps
- Consistency — uniform formats and coding
Collecting unnecessary or inconsistent variables leads to poor-quality insights and higher cleaning burden.
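The four qualities above can each be turned into a simple check. The following is a minimal sketch in pandas, assuming invented survey responses and an assumed plausible commute range of 0 to 180 minutes; neither comes from the module's dataset.

```python
import pandas as pd

# Hypothetical survey responses (values invented for illustration).
responses = pd.DataFrame({
    "Transport Method": ["Bus", "bus ", "Cycle", "cycle", None],
    "Commute Time": [25, 30, 15, 600, 20],      # minutes; 600 looks implausible
    "Favourite Snack": ["crisps", "apple", "nuts", "crisps", "fruit"],
})

# Relevance: keep only the variables the research question needs.
needed = responses[["Transport Method", "Commute Time"]].copy()

# Consistency: trim whitespace and unify capitalisation of category labels.
needed["Transport Method"] = needed["Transport Method"].str.strip().str.lower()

# Accuracy: flag values outside an assumed plausible range (0 to 180 minutes).
needed["implausible"] = ~needed["Commute Time"].between(0, 180)

# Completeness: count missing values per column.
missing_counts = needed.isna().sum()

print(needed)
print(missing_counts)
```

Flagging implausible values, rather than deleting them, keeps the decision about how to handle them visible and documented.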
3.2 From Vague Ideas to Measurable Questions
Vague idea:
“Improve campus transport.”
Analysis purpose:
Examine how student demographics relate to commute burden.
Measurable questions:
- What % of students cycle to campus?
- Do PG students have longer commute times than UG students?
Measurable questions tell you which variables are needed and how to structure the data.
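To show how a measurable question maps onto specific variables, here is a small sketch that answers both example questions with pandas. The student records are invented for illustration, and the column names simply mirror the variable list used in this module.

```python
import pandas as pd

# Hypothetical student records (values invented for illustration).
students = pd.DataFrame({
    "Student Status": ["UG", "UG", "PG", "PG", "UG", "PG"],
    "Transport Method": ["Cycle", "Bus", "Walk", "Cycle", "Bus", "Train"],
    "Commute Time": [15, 30, 10, 20, 40, 60],   # minutes
})

# Question 1: what % of students cycle to campus?
pct_cycle = (students["Transport Method"] == "Cycle").mean() * 100

# Question 2: do PG students have longer commute times than UG students?
mean_commute = students.groupby("Student Status")["Commute Time"].mean()

print(f"{pct_cycle:.1f}% of students cycle")
print(mean_commute)
```

Each question needs only two or three columns, which is why settling the questions first keeps data collection focused.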
Activity: Question Shuffle
Instructions
- Choose a project idea:
  - Improve student wellbeing
  - Optimise campus resources
- Identify:
  - analysis purpose
  - refined research question
- Select variables from the list:
  Name, Gender, Age, Favourite Snack, Height, Home Country, Steps Walked, Library Usage, Commute Time, Transport Method, Hours Slept, Student Status, Club Participation, Healthy Meals Frequency, Postcode, IMD Score.
Example Solutions
Student Wellbeing
- Purpose: link sleep to social engagement
- Question: Do club-active students sleep more on average?
- Variables: Hours Slept; Club Participation
Resource Equity
- Purpose: assess usage by deprivation
- Question: Do students from more deprived areas, as measured by IMD Score, visit the library less often?
- Variables: IMD Score; Library Usage
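As an illustration of the Resource Equity example, the sketch below selects the two chosen variables and computes their correlation. The records are invented, and the code assumes that a higher IMD Score means a more deprived area; check the documentation of your dataset before relying on that direction.

```python
import pandas as pd

# Hypothetical records (values invented for illustration).
# Assumption: a higher IMD Score indicates a more deprived area.
records = pd.DataFrame({
    "IMD Score": [5.0, 12.0, 20.0, 33.0, 41.0],
    "Library Usage": [5, 4, 4, 2, 1],           # visits per week
    "Favourite Snack": ["crisps", "apple", "nuts", "fruit", "crisps"],
})

# Keep only the variables the research question calls for.
selected = records[["IMD Score", "Library Usage"]]

# Pearson correlation (the pandas default) between the two variables.
correlation = selected["IMD Score"].corr(selected["Library Usage"])

print(f"Correlation between IMD Score and Library Usage: {correlation:.2f}")
```

A negative value would be consistent with lower library use in more deprived areas, but a correlation on its own does not explain why the pattern occurs.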
Activity: Sorting Game
Categorise data cards into:
- Primary data (you created it)
- Secondary data (already existed)
- Ambiguous cases (depends on context)
Primary examples:
Surveys you collected; height measurements; observed traffic.
Secondary examples:
Government statistics; downloaded weather data; published research.
Ambiguous:
A friend’s snack preference; a classmate’s dataset; Fitbit challenge uploads.
5. Trust, Bias, and Fairness
5.1 Common Bias Types
| Bias type | Example | Effect |
| --- | --- | --- |
| Sampling bias | Only surveying morning students | Misrepresents population |
| Measurement bias | Using an uncalibrated scale | Inaccurate estimates |
| Selection bias | Surveying only volunteers | Inflated engagement measures |
| Confirmation bias | Leading questions in satisfaction surveys | Skewed responses |
5.2 Principles for Fair Data
- Include diverse groups
- Use clear, neutral questions
- Document data collection procedures
- Reflect on missingness and under-representation
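Reflecting on missingness and under-representation can start with two summary tables. This is a minimal sketch using invented survey data; in practice you would compare the group shares against known population figures.

```python
import pandas as pd

# Hypothetical survey responses (values invented for illustration).
survey = pd.DataFrame({
    "Gender": ["F", "F", "F", "M", "M", "F", "F", "M"],
    "Hours Slept": [7.0, None, 8.0, None, None, 6.5, 7.5, 6.0],
})

# Representation: share of each group in the sample.
group_share = survey["Gender"].value_counts(normalize=True)

# Missingness: share of missing answers within each group.
missing_by_group = survey["Hours Slept"].isna().groupby(survey["Gender"]).mean()

print(group_share)
print(missing_by_group)
```

If one group is both smaller and more likely to skip a question, averages for that group rest on very few answers and should be reported with that caveat.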
Activity: Biased Sampling Game
Each group receives coloured counters representing a student population:
- Pink = female
- Blue = male
Tasks
- Estimate % female
- Compute average age of females
- Compare with known full-population statistics
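The same game can be simulated in code. The sketch below is illustrative only: the population of 100 counters, the morning/afternoon split, and the `colour` and `session` column names are all invented. It compares a biased sample (morning students only) with a simple random sample of the same size.

```python
import pandas as pd

# Hypothetical population of 100 counters (split invented for illustration):
# 40 morning students (30 pink, 10 blue) and 60 afternoon students (20 pink, 40 blue).
population = pd.DataFrame({
    "colour": ["pink"] * 30 + ["blue"] * 10 + ["pink"] * 20 + ["blue"] * 40,
    "session": ["morning"] * 40 + ["afternoon"] * 60,
})

# Known full-population statistic.
true_share = (population["colour"] == "pink").mean()

# Biased sample: only morning students are surveyed.
morning_only = population[population["session"] == "morning"]
biased_share = (morning_only["colour"] == "pink").mean()

# Simple random sample of the same size, seeded so the run is repeatable.
random_sample = population.sample(n=40, random_state=0)
random_share = (random_sample["colour"] == "pink").mean()

print(f"True share female:    {true_share:.2f}")
print(f"Morning-only sample:  {biased_share:.2f}")
print(f"Random sample (n=40): {random_share:.2f}")
```

The morning-only estimate is off because of who was sampled, not because of sample size; drawing more morning students would not fix it.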
Discussion prompts
- How does sampling error distort estimates?
- Which biases can be reduced, and which are unavoidable?
- How can we report uncertainty transparently?
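One common way to report uncertainty for an estimated percentage is an approximate 95% interval based on the standard error of a proportion. The sketch below uses the normal approximation, which is rough for small samples; the function name `proportion_interval` and the example counts are hypothetical.

```python
import math


def proportion_interval(successes, n, z=1.96):
    """Return (estimate, lower, upper) using the normal approximation."""
    p = successes / n
    standard_error = math.sqrt(p * (1 - p) / n)
    lower = max(0.0, p - z * standard_error)
    upper = min(1.0, p + z * standard_error)
    return p, lower, upper


# Hypothetical draw: 12 pink counters in a sample of 20.
estimate, lower, upper = proportion_interval(successes=12, n=20)
print(f"Estimated share female: {estimate:.2f} (approx. 95% interval {lower:.2f} to {upper:.2f})")
```

An interval like this describes random sampling error only. It does not account for sampling or selection bias, so the collection method should be reported alongside it.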
Summary
Key Takeaways
- Know your data source — it affects trust and interpretability.
- Good questions drive efficient, relevant data collection.
- Tools and formats shape data quality and analytical ease.
- Bias is unavoidable but manageable with planning.
- Ethical, transparent data collection strengthens research validity.
Python Link
- Primary & Secondary Data: use pandas to import CSV, Excel, or JSON files, or data retrieved from an API.
- Planning for Good Data: filter, select, and restructure DataFrames.
- Tools & Formats: read and write CSV, JSON, and XLSX files.
- Trust & Fairness: identify missingness, imbalance, and bias using summary statistics.
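A brief sketch of the import, filter, and export steps listed above follows. To keep it self-contained, it reads from in-memory text rather than files; the data values are invented, and with real files you would pass a file path to `pd.read_csv` or `pd.read_json` instead.

```python
import io

import pandas as pd

# Hypothetical file contents (values invented for illustration).
csv_text = "Name,Height,Favourite Snack\nPeer A,170,crisps\nPeer B,165,apple\n"
json_text = '[{"Age": 19, "Hours Slept": 7.0}, {"Age": 22, "Hours Slept": 6.5}]'

# Import: the same calls accept file paths in place of StringIO objects.
from_csv = pd.read_csv(io.StringIO(csv_text))
from_json = pd.read_json(io.StringIO(json_text))

# Filter rows and select columns.
tall = from_csv.loc[from_csv["Height"] > 166, ["Name", "Height"]]

# Export: with no path given, to_csv and to_json return text.
csv_out = tall.to_csv(index=False)
json_out = from_json.to_json(orient="records")

print(csv_out)
print(json_out)
```

Excel files follow the same pattern through `pd.read_excel` and `DataFrame.to_excel`, which need an additional engine package such as openpyxl to be installed.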