Lecture 4 - Bivariate Data Exploration

Contingency Table

  • Purpose: Show distribution of two categorical variables.
  • Structure: One variable in rows, another in columns.
  • Use: Study association between variables.
  • Entries: Count or percentage.
  • Alias: Cross tabulation.

Example: Young Adults by Age and Living Arrangement

Living Arrangement        19    20    21    22   TOTAL
Parents’ home            324   378   337   318    1357
Another person’s home     37    47    40    38     162
Your own place           116   279   372   487    1254
Group quarters            58    60    49    25     192
Other                      5     2     3     9      19
Total                    540   766   801   877    2984

Marginal vs. Conditional Distributions

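  • Marginal: Distribution of one variable on its own, taken from the row or column totals (usually as percentages of the grand total).
  • Conditional: Distribution of one variable within a single row or column (one level of the other variable), so that subgroups of different sizes can be compared without misleading raw counts.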
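As a minimal sketch, the marginal distributions can be computed from the counts in the young-adults table above. The variable names (`counts`, `row_marginal`, `col_marginal`) are illustrative choices, not part of the lecture.

```python
# Counts from the young-adults table: rows are living arrangements,
# columns are ages 19, 20, 21, 22.
ages = [19, 20, 21, 22]
counts = {
    "Parents' home":         [324, 378, 337, 318],
    "Another person's home": [37, 47, 40, 38],
    "Your own place":        [116, 279, 372, 487],
    "Group quarters":        [58, 60, 49, 25],
    "Other":                 [5, 2, 3, 9],
}

grand_total = sum(sum(row) for row in counts.values())

# Marginal distribution of living arrangement: row total / grand total.
row_marginal = {name: 100 * sum(row) / grand_total for name, row in counts.items()}

# Marginal distribution of age: column total / grand total.
col_totals = [sum(col) for col in zip(*counts.values())]
col_marginal = {age: 100 * t / grand_total for age, t in zip(ages, col_totals)}

for name, pct in row_marginal.items():
    print(f"{name:<22} {pct:5.1f}%")
for age, pct in col_marginal.items():
    print(f"age {age}: {pct:5.1f}%")
```

Each marginal distribution ignores the other variable entirely, which is why it cannot reveal an association on its own.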

Example: Conditional Distribution of Living Arrangement Given Age (Column Percentages)

Living Arrangement         19     20     21     22
Parents’ home            60.0   49.3   42.1   36.3
Another person’s home     6.9    6.1    5.0    4.3
Your own place           21.5   36.4   46.4   55.5
Group quarters           10.7    7.8    6.1    2.9
Other                     0.9    0.3    0.4    1.0
Total                   100.0   99.9  100.0  100.0
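The percentages in this table can be reproduced from the counts in the earlier contingency table by dividing each cell by its column (age) total. The sketch below assumes that is how the table was built; the names `counts` and `conditional` are illustrative.

```python
# Counts from the young-adults contingency table (columns: ages 19-22).
ages = [19, 20, 21, 22]
counts = {
    "Parents' home":         [324, 378, 337, 318],
    "Another person's home": [37, 47, 40, 38],
    "Your own place":        [116, 279, 372, 487],
    "Group quarters":        [58, 60, 49, 25],
    "Other":                 [5, 2, 3, 9],
}

# Column totals: the number of respondents of each age.
col_totals = [sum(col) for col in zip(*counts.values())]

# Conditional distribution of living arrangement given age:
# each cell as a percentage of its own column total, to one decimal place.
conditional = {
    name: [round(100 * n / total, 1) for n, total in zip(row, col_totals)]
    for name, row in counts.items()
}

print("Living arrangement      " + "".join(f"{a:>7}" for a in ages))
for name, pcts in conditional.items():
    print(f"{name:<24}" + "".join(f"{p:7.1f}" for p in pcts))
```

Because every cell is rounded separately, a column can sum to 99.9 rather than exactly 100, as the age-20 column does.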

Simpson’s Paradox

  • Confounding Variable: A lurking variable, related to both variables under study, that can reverse or mask a trend when the subgroups it defines are ignored.
  • Example: Medical Helicopters vs. Road Transport
    • Aggregated Data: Lower survival rate for helicopters.
    • Subgroup Data (by injury severity): Helicopters perform better in both subgroups, because helicopters are sent mostly to the more serious accidents.
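The reversal can be shown with a small numeric sketch. The counts below are hypothetical, invented only to produce the effect; they are not the lecture's data, and the labels `A` and `B` stand for any two transport modes.

```python
# Hypothetical (survived, total) counts for two transport modes, A and B,
# split by injury severity. Invented for illustration only.
data = {
    "A": {"serious": (80, 200), "less serious": (45, 50)},
    "B": {"serious": (15, 50), "less serious": (170, 200)},
}

def rate(survived, total):
    """Survival rate within one subgroup."""
    return survived / total

def overall_rate(groups):
    """Survival rate after pooling all subgroups of one mode."""
    survived = sum(s for s, _ in groups.values())
    total = sum(t for _, t in groups.values())
    return survived / total

for stratum in ("serious", "less serious"):
    a = rate(*data["A"][stratum])
    b = rate(*data["B"][stratum])
    print(f"{stratum:<13} A: {a:.0%}  B: {b:.0%}")

overall_a = overall_rate(data["A"])
overall_b = overall_rate(data["B"])
print(f"{'aggregated':<13} A: {overall_a:.0%}  B: {overall_b:.0%}")
```

Mode A wins inside each severity group but loses after pooling, because most of A's cases fall in the serious group, where survival is low for both modes.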

Visual Summaries for Bivariate Data

Variable 1     Variable 2     Plot Type
Categorical    Categorical    Stacked/Grouped Bar Chart
Categorical    Continuous     Comparative Boxplot
Continuous     Continuous     Scatter Plot

Example: Machine Breakdowns by Shift

          Machine
Shift       A     B     C     D   Totals
1          41    20    12    16       89
2          31    11     9    14       65
3          15    17    16    10       58
Totals     87    48    37    40      212

Plots:

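  • Stacked Bar Chart: One bar per machine, divided into segments by shift, showing each machine's total and how it splits across shifts.
  • Grouped Bar Chart: Side-by-side bars for each shift within each machine, for direct comparison across shifts.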
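The numbers behind both charts can be prepared from the breakdown table. This is a sketch using a text rendering instead of a plotting library; the layout (one group of bars per machine) is an assumption about how the lecture's charts were drawn.

```python
# Breakdown counts from the table: one row per shift, columns are machines A-D.
machines = ["A", "B", "C", "D"]
breakdowns = {
    1: [41, 20, 12, 16],
    2: [31, 11, 9, 14],
    3: [15, 17, 16, 10],
}

# Stacked view: for each machine, the percentage of its breakdowns in each shift.
machine_totals = [sum(col) for col in zip(*breakdowns.values())]
shares = {
    m: [round(100 * breakdowns[s][j] / machine_totals[j], 1) for s in breakdowns]
    for j, m in enumerate(machines)
}
for m, pcts in shares.items():
    print(f"Machine {m}: " + ", ".join(f"shift {s} {p}%" for s, p in zip(breakdowns, pcts)))

# Grouped view: a crude text bar chart, one group of bars per machine.
for j, m in enumerate(machines):
    print(f"Machine {m}")
    for s, row in breakdowns.items():
        print(f"  shift {s} | {'#' * row[j]} {row[j]}")
```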

Comparative Boxplot

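  • Purpose: Compare the distribution of a continuous variable across categorical subgroups.
  • Elements:
    • Location: Compare the medians.
    • Spread: Compare the IQRs (box lengths).
    • Shape: The position of the median within the box (and the relative whisker lengths) indicates skewness.
    • Outliers: Points more than 1.5×IQR below Q1 or above Q3.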
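The quantities a boxplot displays can be computed directly. The sketch below uses two hypothetical groups (not the lecture's data) and Python's `statistics.quantiles` with `method="inclusive"`; quartile conventions differ between textbooks and software, so hand-computed quartiles may differ slightly.

```python
import statistics

def box_summary(values):
    """Median, quartiles, IQR, and 1.5*IQR outliers for one group."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {"q1": q1, "median": median, "q3": q3, "iqr": iqr, "outliers": outliers}

# Hypothetical measurements for two subgroups, for illustration only.
group_1 = [1, 2, 3, 4, 5, 6, 7, 8, 30]
group_2 = [4, 5, 6, 7, 8, 9, 10, 11, 12]

for label, values in (("group 1", group_1), ("group 2", group_2)):
    print(label, box_summary(values))
```

Comparing the two summaries side by side is the numerical counterpart of reading a comparative boxplot: medians for location, IQRs for spread, and the outlier list for points drawn individually.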

Example: Stem Weights with/without Nitrogen

  • No Nitrogen: 0.21, 0.53, …, 0.83
  • Nitrogen: 0.26, 0.43, …, 0.46

Scatter Plot

  • Purpose: Assess association between two continuous variables.
  • Patterns:
    • Positive: Values increase together.
    • Negative: One decreases as the other increases.
  • Example: Pulse Rate vs. Years of Schooling
    • Data: (12,73), (16,67), …, (14,71).
    • Plot: Negative trend observed.

Correlation Coefficient (r)

  • Range: -1 (perfect negative) to 1 (perfect positive).
  • Interpretation:
    • |r| ≥ 0.7: Strong correlation.
    • 0.3 < |r| < 0.7: Moderate correlation.
    • |r| ≤ 0.3: Weak correlation.
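  • Formula:

    $$ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}} $$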

Example: Study Hours vs. Final Marks

  • Data: (5,49), (8,60), …, (15,85).
  • Calculations:
    • Intermediate values: …
    • r ≈ 0.945 (strong positive).
    • r² = 0.8926 (89.26% of the variation in final marks is explained by study hours).
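A minimal sketch of the calculation, computing Pearson's r from sums of deviations about the means. The `hours` and `marks` values are hypothetical, because the lecture's full data set is not reproduced here, so the result will not match the figures above.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient from sums of deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical study hours and final marks, for illustration only.
hours = [2, 4, 6, 8, 10]
marks = [50, 55, 65, 70, 80]

r = pearson_r(hours, marks)
r_squared = r ** 2  # proportion of variation in marks explained by hours
print(f"r = {r:.4f}, r^2 = {r_squared:.4f}")
```

Squaring r gives the coefficient of determination, which is how the "variance explained" percentage in the example is obtained.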

Cautions

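  1. Outliers: A single extreme point can inflate or deflate r substantially.
  2. Grouping: Combining subgroups can create, hide, or reverse the trend seen within each subgroup.
  3. Linearity: r measures only linear association, so always check the scatter plot (e.g., Anscombe’s Quartet: four data sets with nearly identical r but very different shapes).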
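Two of these cautions can be checked numerically. The sketch below uses small hypothetical data sets: a perfect quadratic relationship for which r is zero, and a pattern-free set whose r jumps once a single outlier is added. The helper `pearson_r` is defined here only to keep the example self-contained.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient from sums of deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

# Linearity: y is completely determined by x, yet there is no linear trend.
x_curve = [-3, -2, -1, 0, 1, 2, 3]
y_curve = [x ** 2 for x in x_curve]
r_curve = pearson_r(x_curve, y_curve)

# Outliers: five points with no linear trend, then one extreme point added.
x_clean = [1, 2, 3, 4, 5]
y_clean = [2, 1, 3, 1, 2]
r_clean = pearson_r(x_clean, y_clean)
r_outlier = pearson_r(x_clean + [20], y_clean + [20])

print(f"quadratic relationship: r = {r_curve:.3f}")
print(f"without outlier:        r = {r_clean:.3f}")
print(f"with one outlier:       r = {r_outlier:.3f}")
```

In both cases a scatter plot would show the problem immediately, which is why the plot should be inspected before r is interpreted.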

Exercise: Carry Marks vs. Final Exam Marks

  • Data: (40,24), (31,25), …, (38,23).
  • Tasks:
    a) Identify variables.
    b) Plot scatter diagram.
    c) Calculate r and interpret it.
    d) Compute r² and explain its meaning.
    e) Comment on the students' general performance.