Lecture 4 - Bivariate Data Exploration
Contingency Table
- Purpose: Show the joint distribution of two categorical variables.
- Structure: One variable in rows, another in columns.
- Use: Study association between variables.
- Entries: Count or percentage.
- Alias: Cross tabulation.
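
A cross tabulation is, at heart, a tally of (row category, column category) pairs. The following is a minimal sketch of that idea using only the Python standard library; the helper name `cross_tabulate` and the six records are hypothetical and are not taken from the lecture data.

```python
from collections import Counter

def cross_tabulate(records):
    """Tally (row_category, column_category) pairs into a nested dict of counts."""
    counts = Counter(records)
    rows = sorted({r for r, _ in records})
    cols = sorted({c for _, c in records})
    # Fill every cell, using 0 where a combination never occurs.
    return {r: {c: counts[(r, c)] for c in cols} for r in rows}

# Hypothetical records: (living arrangement, age). Not the lecture's data.
records = [
    ("Parents' home", 19), ("Parents' home", 19), ("Own place", 19),
    ("Parents' home", 20), ("Own place", 20), ("Own place", 20),
]
table = cross_tabulate(records)
for row, cells in table.items():
    print(row, cells, "row total:", sum(cells.values()))
```

Each inner dictionary is one row of the table, and summing its values gives the row total.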
Example: Young Adults by Age and Living Arrangement
| Living Arrangement | 19 | 20 | 21 | 22 | TOTAL |
|---|---|---|---|---|---|
| Parents’ home | 324 | 378 | 337 | 318 | 1357 |
| Another person’s home | 37 | 47 | 40 | 38 | 162 |
| Your own place | 116 | 279 | 372 | 487 | 1254 |
| Group quarters | 58 | 60 | 49 | 25 | 192 |
| Other | 5 | 2 | 3 | 9 | 19 |
| Total | 540 | 766 | 801 | 877 | 2984 |
Marginal vs. Conditional Distributions
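- Marginal: Distribution of one variable on its own, read from the row or column totals (as counts or as percentages of the grand total).
- Conditional: Distribution of one variable within a single level of the other (one row or one column), expressed as percentages of that row or column total so that subgroups of different sizes can be compared fairly.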
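
As a rough illustration of the marginal distribution, the sketch below converts the row and column totals of the young-adults table into percentages of the grand total. The totals are copied from that table; the function name `marginal_percentages` is an assumption made for this example.

```python
# Row and column totals copied from the contingency table above.
row_totals = {
    "Parents' home": 1357,
    "Another person's home": 162,
    "Your own place": 1254,
    "Group quarters": 192,
    "Other": 19,
}
col_totals = {19: 540, 20: 766, 21: 801, 22: 877}
grand_total = 2984

def marginal_percentages(totals, grand_total):
    """Express each total as a percentage of the grand total, to 1 decimal."""
    return {k: round(100 * v / grand_total, 1) for k, v in totals.items()}

living_marginal = marginal_percentages(row_totals, grand_total)
age_marginal = marginal_percentages(col_totals, grand_total)
print(living_marginal)
print(age_marginal)
```

The first dictionary is the marginal distribution of living arrangement and the second is the marginal distribution of age.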
Example: Conditional Distribution of Living Arrangement Given Age (Column Percentages)
| Living Arrangement | 19 | 20 | 21 | 22 |
|---|---|---|---|---|
| Parents’ home | 60.0 | 49.3 | 42.1 | 36.3 |
| Another person’s home | 6.9 | 6.1 | 5.0 | 4.3 |
| Your own place | 21.5 | 36.4 | 46.4 | 55.5 |
| Group quarters | 10.7 | 7.8 | 6.1 | 2.9 |
| Other | 0.9 | 0.3 | 0.4 | 1.0 |
| Total | 100.0 | 99.9 | 100.0 | 100.0 |
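
The column percentages above can be reproduced from the raw counts. This is a sketch under the assumption that each percentage is the cell count divided by its age-column total, rounded to one decimal place; the variable names are chosen for illustration.

```python
ages = [19, 20, 21, 22]
counts = {
    "Parents' home": [324, 378, 337, 318],
    "Another person's home": [37, 47, 40, 38],
    "Your own place": [116, 279, 372, 487],
    "Group quarters": [58, 60, 49, 25],
    "Other": [5, 2, 3, 9],
}

# Column totals: number of respondents at each age.
col_totals = [sum(row[j] for row in counts.values()) for j in range(len(ages))]

# Conditional distribution of living arrangement given age (column percentages).
conditional = {
    name: [round(100 * row[j] / col_totals[j], 1) for j in range(len(ages))]
    for name, row in counts.items()
}
for name, pcts in conditional.items():
    print(f"{name:<24}", pcts)
```

Because every entry is rounded separately, a column can sum to slightly less or more than 100, which is why the age-20 column totals 99.9.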
Simpson’s Paradox
- Confounding variable: A lurking variable that, when ignored, can reverse or mask the trend seen within subgroups.
- Example: Medical Helicopters vs. Road Transport
  - Aggregated data: Helicopters show the higher survival rate.
  - Subgroup data (by injury severity): Road transport has the higher survival rate in both subgroups.
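
To see how such a reversal can arise arithmetically, the sketch below uses counts that are entirely HYPOTHETICAL, invented only to reproduce the pattern described in these notes; they are not the lecture's figures or real survival data. The severity labels and function names are also assumptions.

```python
# HYPOTHETICAL counts, invented only to reproduce the pattern described above.
# Format: data[transport][severity] = (survived, total)
data = {
    "helicopter": {"less serious": (900, 1000), "serious": (40, 100)},
    "road":       {"less serious": (95, 100),   "serious": (500, 1000)},
}

def survival_rate(survived, total):
    """Proportion of patients who survived."""
    return survived / total

def aggregated_rate(groups):
    """Pool all severity groups and compute the overall survival rate."""
    survived = sum(s for s, _ in groups.values())
    total = sum(t for _, t in groups.values())
    return survived / total

overall = {mode: aggregated_rate(groups) for mode, groups in data.items()}
by_group = {
    severity: {mode: survival_rate(*data[mode][severity]) for mode in data}
    for severity in ["less serious", "serious"]
}
print("Aggregated:", overall)
print("By severity:", by_group)
```

In this made-up data the two transport modes carry very different mixes of injury severity, and that imbalance is what lets the aggregated comparison point the opposite way from both subgroup comparisons.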
Visual Summaries for Bivariate Data
| Variable 1 | Variable 2 | Plot Type |
|---|---|---|
| Categorical | Categorical | Stacked/Grouped Bar Chart |
| Categorical | Continuous | Comparative Boxplot |
| Continuous | Continuous | Scatter Plot |
Example: Machine Breakdowns by Shift
| Shift | Machine A | Machine B | Machine C | Machine D | Total |
|---|---|---|---|---|---|
| 1 | 41 | 20 | 12 | 16 | 89 |
| 2 | 31 | 11 | 9 | 14 | 65 |
| 3 | 15 | 17 | 16 | 10 | 58 |
| Totals | 87 | 48 | 37 | 40 | 212 |
Plots:
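- Stacked Bar Chart: One bar per machine, divided into segments by shift, showing each machine's total breakdowns and how they split across shifts.
- Grouped Bar Chart: Bars for each shift placed side by side, allowing direct comparison across shifts.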
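
One possible way to draw the stacked chart is sketched below. The counts come from the breakdown table; the helper names, the output filename, and the use of matplotlib are assumptions for illustration. The plotting function is defined but not called, so the data preparation runs without matplotlib installed.

```python
machines = ["A", "B", "C", "D"]
breakdowns = {  # counts from the table above, one list per shift
    "Shift 1": [41, 20, 12, 16],
    "Shift 2": [31, 11, 9, 14],
    "Shift 3": [15, 17, 16, 10],
}

def stacked_offsets(series):
    """Return, for each series, the height at which its bar segment starts."""
    running = [0] * len(next(iter(series.values())))
    offsets = {}
    for name, heights in series.items():
        offsets[name] = list(running)
        running = [r + h for r, h in zip(running, heights)]
    return offsets, running  # running now holds the full bar heights

offsets, bar_totals = stacked_offsets(breakdowns)
print(bar_totals)

def plot_stacked(path="breakdowns_stacked.png"):
    """Draw the stacked bar chart; requires matplotlib (assumed installed)."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file without needing a display
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    for shift, heights in breakdowns.items():
        ax.bar(machines, heights, bottom=offsets[shift], label=shift)
    ax.set_xlabel("Machine")
    ax.set_ylabel("Number of breakdowns")
    ax.legend()
    fig.savefig(path)
```

A grouped version would place the three shift bars next to each other for every machine, using shifted x positions instead of the `bottom` offsets.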
Comparative Boxplot
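- Purpose: Compare the distribution of a continuous variable across the levels of a categorical variable.
- Elements:
  - Location: Compare the medians.
  - Spread: Compare the IQRs (box lengths).
  - Shape: The position of the median within the box indicates skewness.
  - Outliers: Points more than 1.5×IQR below Q1 or above Q3.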
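
The quantities a boxplot displays can be computed directly. The sketch below assumes the "inclusive" quartile method of Python's `statistics.quantiles`; other quartile conventions exist and may give slightly different values. The sample is hypothetical and is not the stem-weight data, and `boxplot_summary` is a name chosen for this example.

```python
import statistics

def boxplot_summary(values):
    """Quartiles, IQR, 1.5*IQR fences, and the points flagged as outliers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {"q1": q1, "median": median, "q3": q3, "iqr": iqr,
            "fences": (lower_fence, upper_fence), "outliers": outliers}

# Hypothetical sample, not the stem-weight data from the lecture.
sample = [2, 4, 4, 5, 6, 7, 8, 9, 20]
summary = boxplot_summary(sample)
print(summary)
```

For this sample the quartiles are 4, 6 and 8, so the fences sit at -2 and 14 and only the value 20 is flagged as an outlier.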
Example: Stem Weights with/without Nitrogen
- No Nitrogen: 0.21, 0.53, …, 0.83
- Nitrogen: 0.26, 0.43, …, 0.46
Scatter Plot
- Purpose: Assess association between two continuous variables.
- Patterns:
  - Positive: Values increase together.
  - Negative: One decreases as the other increases.
- Example: Pulse Rate vs. Years of Schooling
  - Data: (12,73), (16,67), …, (14,71).
  - Plot: Negative trend observed.
Correlation Coefficient (r)
- Range: -1 (perfect negative) to 1 (perfect positive).
- Interpretation:
  - |r| ≥ 0.7: Strong correlation.
  - |r| ≤ 0.3: Weak correlation.
- Formula:
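\[ r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}, \quad \text{where } S_{xy} = \sum x_i y_i - \frac{\left(\sum x_i\right)\left(\sum y_i\right)}{n} \]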
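
The formula for r translates into a short function. This sketch assumes that S_xx and S_yy follow the same computational pattern as S_xy (the standard definitions, with x or y paired with itself). The function name and the toy data are hypothetical.

```python
import math

def correlation(xs, ys):
    """Pearson r via the computational formulas S_xy, S_xx, S_yy."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    sum_x, sum_y = sum(xs), sum(ys)
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n  # assumed: same pattern as S_xy
    s_yy = sum(y * y for y in ys) - sum_y ** 2 / n  # assumed: same pattern as S_xy
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical toy data, chosen so the arithmetic is easy to check by hand:
# S_xy = 28 - 100/4 = 3, S_xx = S_yy = 30 - 25 = 5, so r = 3/5.
xs = [1, 2, 3, 4]
ys = [2, 1, 4, 3]
r = correlation(xs, ys)
print(round(r, 4))
```

If either variable is constant, S_xx or S_yy is zero and r is undefined; the sketch does not guard against that case.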
Example: Study Hours vs. Final Marks
- Data: (5,49), (8,60), …, (15,85).
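- Calculations:
  - \(S_{xy} = 274\), \(S_{xx} = 67.5\), \(S_{yy} = 1246\).
  - \(r = 0.9448\) (strong positive correlation).
  - \(r^2 = 0.8926\) (about 89.26% of the variation in final marks is explained by the linear relationship with study hours).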
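
The quoted values of r and its square can be checked from the three summary quantities alone. This is a quick arithmetic check, assuming r is S_xy divided by the square root of S_xx times S_yy; the variable names are chosen for illustration.

```python
import math

s_xy, s_xx, s_yy = 274, 67.5, 1246  # summary values quoted above

r = s_xy / math.sqrt(s_xx * s_yy)
r_squared = r ** 2
print(f"r = {r:.4f}, r^2 = {r_squared:.4f}")
```

The full data set is not needed for this step, since r depends on the data only through these three sums.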
Cautions
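- Outliers: A single extreme point can inflate or deflate r considerably.
- Grouping: Combining distinct subgroups can create, hide, or reverse a trend.
- Linearity: r measures only linear association; datasets with very different patterns can share almost the same r (e.g., Anscombe’s Quartet).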
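
Two of these cautions are easy to demonstrate numerically. The sketch below uses small hypothetical data sets (not from the lecture) and a helper `correlation` written for this example, which assumes the usual computational formulas for S_xy, S_xx and S_yy.

```python
import math

def correlation(xs, ys):
    """Pearson r from the computational sums of squares and products."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sx * sy / n
    s_xx = sum(x * x for x in xs) - sx ** 2 / n
    s_yy = sum(y * y for y in ys) - sy ** 2 / n
    return s_xy / math.sqrt(s_xx * s_yy)

# All data below are hypothetical.
# 1) Outlier: five points on a perfect line, then one discordant point added.
r_line = correlation([1, 2, 3, 4, 5], [1, 2, 3, 4, 5])
r_with_outlier = correlation([1, 2, 3, 4, 5, 6], [1, 2, 3, 4, 5, 0])

# 2) Non-linearity: y = x^2 is an exact relationship, yet it is not linear.
r_curve = correlation([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])

print(r_line, round(r_with_outlier, 3), r_curve)
```

One added point drops r from 1 to about 0.143, and the exact quadratic relationship gives r = 0, so a scatter plot should always accompany the coefficient.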
Exercise: Carry Marks vs. Final Exam Marks
- Data: (40,24), (31,25), …, (38,23).
- Tasks:
  a) Identify the variables.
  b) Plot a scatter diagram.
  c) Calculate \(r\) and interpret it.
  d) Compute \(r^2\) and explain it.
  e) Comment on general performance.