Outline

  • Exploratory Data Analysis (EDA)
    • Data Types (Numeric, Categorical)
    • Statistical Description of Data
      • Measures of Central Tendency
      • Measures of Dispersion
      • Outliers
      • Graphical Representation
    • Relationship between Attributes
      • Correlation Analysis
      • Simpson’s Paradox (Correlation vs. Causality)

Data Science Process

  1. Ask an interesting question
    • Scientific goal? Prediction target?
  2. Get the data
    • Sampling, relevance, privacy issues.
  3. Explore the data
    • Plot data, identify anomalies/patterns.
  4. Model the data
    • Build, fit, validate models.
  5. Communicate results
    • Visualize, interpret, and tell a story.

Source: Harvard Data Science Course


4 Types of Analytics

Type          Question Answered
Descriptive   What is happening?
Diagnostic    Why did it happen?
Predictive    What is likely to happen?
Prescriptive  What should I do?

Source: Gartner


Data Types

Numerical Data

  • Discrete: Countable (e.g., number of staff).
  • Continuous: Measurable (e.g., height, weight).

Categorical Data

  • Nominal: No order (e.g., gender).
  • Ordinal: Ordered ranks (e.g., satisfaction levels).
  • Binary: Two states (e.g., Yes/No).

Statistical Description

Measures of Central Tendency

Measure  Description                   Sensitivity to Outliers
Mean     Average value                 High
Median   Middle value in ordered data  Low
Mode     Most frequent value           None
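To make the sensitivity column concrete, here is a minimal sketch using Python's standard `statistics` module. The salary figures are hypothetical and chosen only for illustration.

```python
from statistics import mean, median, mode

# Hypothetical sample: five typical salaries (in thousands).
salaries = [40, 42, 42, 45, 48]
# The same sample with one extreme value appended.
with_outlier = salaries + [400]

def summarize(values):
    """Return (mean, median, mode) for a list of numbers."""
    return mean(values), median(values), mode(values)

base = summarize(salaries)
shifted = summarize(with_outlier)

print("without outlier:", base)
print("with outlier:   ", shifted)
```

With this particular sample, the single extreme value more than doubles the mean, moves the median only slightly, and leaves the mode unchanged.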

Measures of Dispersion

  • Range: Max - Min.
  • Variance/Standard Deviation: Spread around the mean (the standard deviation is the square root of the variance).
  • IQR: Q3 - Q1 (robust to outliers).
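The three measures above can be sketched with the standard library as follows. The data is hypothetical, and the quartile convention (`method="inclusive"`) is an assumption: other conventions give slightly different Q1 and Q3 values.

```python
from statistics import pvariance, pstdev, quantiles

# Hypothetical sample, chosen only for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(data) - min(data)
variance = pvariance(data)   # population variance: mean squared deviation
std_dev = pstdev(data)       # population standard deviation: sqrt(variance)

# quantiles(..., n=4) returns the three quartile cut points Q1, Q2, Q3.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print("range:", data_range)
print("variance:", variance, "std dev:", std_dev)
print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
```

Use `variance` and `stdev` from the same module instead if the data is a sample and the n - 1 denominator is wanted.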

Graphical Representations

  • Boxplots: Show median, quartiles, outliers.
  • Histograms: Distribution of continuous data.
  • Scatterplots: Bivariate relationships.
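As a rough sketch of what sits behind the first two chart types, the code below computes the five numbers a boxplot draws and the bin counts a histogram draws. The height values, the bin width, and the helper names `five_number_summary` and `histogram_counts` are all hypothetical. A plotting library would normally do this work internally.

```python
from statistics import quantiles

# Hypothetical continuous measurements, for illustration only.
heights = [150, 152, 155, 158, 160, 161, 163, 165, 170, 172, 175, 190]

def five_number_summary(values):
    """Min, Q1, median, Q3, max: the numbers a boxplot is built from."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)

def histogram_counts(values, bin_width, start):
    """Count values per bin [left_edge, left_edge + bin_width)."""
    counts = {}
    for v in values:
        left_edge = start + ((v - start) // bin_width) * bin_width
        counts[left_edge] = counts.get(left_edge, 0) + 1
    return dict(sorted(counts.items()))

summary = five_number_summary(heights)
bins = histogram_counts(heights, bin_width=10, start=150)

# Text rendering of the histogram: one '#' per observation.
# Empty bins are simply not listed.
for left_edge, count in bins.items():
    print(f"{left_edge}-{left_edge + 10}: {'#' * count}")
```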

Outlier Detection

  • Inner Fences: Q1 - 1.5*IQR and Q3 + 1.5*IQR. Values beyond an inner fence but within the outer fences are mild outliers.
  • Outer Fences: Q1 - 3*IQR and Q3 + 3*IQR. Values beyond an outer fence are extreme outliers.

Example:

  • Dataset: [30, 171, ..., 1441]. Here 1441 is a mild outlier.
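The fence rule can be sketched as below. The sample is hypothetical and is not the dataset from the example above, whose middle values are elided. The function name `classify_outliers` and the quartile convention (`method="inclusive"`) are also assumptions.

```python
from statistics import quantiles

def classify_outliers(values):
    """Label each distinct value 'normal', 'mild', or 'extreme' via IQR fences."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outer_low, outer_high = q1 - 3 * iqr, q3 + 3 * iqr

    labels = {}
    for v in values:
        if v < outer_low or v > outer_high:
            labels[v] = "extreme"   # beyond an outer fence
        elif v < inner_low or v > inner_high:
            labels[v] = "mild"      # between an inner and an outer fence
        else:
            labels[v] = "normal"    # inside the inner fences
    return labels

# Hypothetical data, for illustration only.
sample = [10, 12, 13, 14, 15, 16, 17, 18, 20, 30, 45]
labels = classify_outliers(sample)

for value, label in labels.items():
    if label != "normal":
        print(value, "is a", label, "outlier")
```

For this sample, Q1 = 13.5 and Q3 = 19, so the inner fences are 5.25 and 27.25 and the outer fences are -3 and 35.5. The value 30 is labeled mild and 45 extreme.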

Skewness & Kurtosis

  • Skewness: Asymmetry of distribution.
    • Rule of thumb: skewness between -0.5 and 0.5 indicates a fairly symmetrical distribution.
  • Kurtosis: Tailedness.
    • A normal distribution has kurtosis 3 (excess kurtosis 0).
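A minimal sketch of both statistics, assuming the moment-based population definitions (skewness = m3 / m2^1.5, kurtosis = m4 / m2^2, where mk is the k-th central moment). Statistical packages often apply sample-size corrections or report excess kurtosis (kurtosis minus 3), so their numbers may differ. The two datasets are hypothetical.

```python
from statistics import fmean

def central_moment(values, k):
    """k-th central moment: the mean of (x - mean)^k."""
    mu = fmean(values)
    return fmean([(x - mu) ** k for x in values])

def skewness(values):
    """Population skewness: m3 / m2^1.5 (0 for symmetric data)."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def kurtosis(values):
    """Population (non-excess) kurtosis: m4 / m2^2."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2

# Hypothetical samples, for illustration only.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]

print("symmetric skewness:   ", skewness(symmetric))
print("right-skewed skewness:", skewness(right_skewed))
print("symmetric kurtosis:   ", kurtosis(symmetric))
```

The symmetric sample has skewness 0, while the long right tail of the second sample pushes its skewness well above the 0.5 threshold.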

Correlation Analysis

  • Formula:
    \[ r_{xy} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}} \]
  • Interpretation:
    • +1: Perfect positive correlation.
    • -1: Perfect negative correlation.
    • 0: No linear correlation (a nonlinear relationship may still exist).

Example: Ice cream sales ↑ with temperature ↑ (r ≈ 0.96).
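The correlation formula translates almost term by term into code. The sketch below is one way to write it. The temperature and sales figures are hypothetical and are not the data behind the r ≈ 0.96 quoted above.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    # Numerator: sum of products of deviations from the means.
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator: sqrt of the product of the summed squared deviations.
    sum_sq_x = sum((x - x_bar) ** 2 for x in xs)
    sum_sq_y = sum((y - y_bar) ** 2 for y in ys)
    return numerator / sqrt(sum_sq_x * sum_sq_y)

# Hypothetical daily temperature (°C) and ice cream sales.
temperature_c = [14, 16, 20, 23, 25, 30]
ice_cream_sales = [215, 325, 410, 540, 520, 610]

r = pearson_r(temperature_c, ice_cream_sales)
print("r =", round(r, 2))
```

The function raises `ZeroDivisionError` if either sequence is constant, since the correlation is undefined in that case.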


Simpson’s Paradox

  • Confounding Factor: A trend seen in the aggregated data can weaken or reverse once the data is split by a hidden (confounding) variable.
    • Example: West Coast vs. East Coast friendliness → PhD status explains the trend reversal.
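The reversal can be reproduced with a few weighted averages. All counts and friendliness scores below are hypothetical, picked only so that the paradox appears. They are not the figures from the original example.

```python
# Hypothetical (member count, mean friendliness score) per coast and degree status.
groups = {
    "West": {"PhD": (30, 3.0), "no PhD": (70, 10.0)},
    "East": {"PhD": (70, 3.5), "no PhD": (30, 12.0)},
}

def overall_mean(strata):
    """Count-weighted mean across all strata of one coast."""
    total = sum(n * m for n, m in strata.values())
    count = sum(n for n, _ in strata.values())
    return total / count

west_all = overall_mean(groups["West"])
east_all = overall_mean(groups["East"])

print("Overall:  West", west_all, "vs East", east_all)
for status in ("PhD", "no PhD"):
    west_mean = groups["West"][status][1]
    east_mean = groups["East"][status][1]
    print(f"{status}: West {west_mean} vs East {east_mean}")
```

In aggregate the West scores higher, yet within each degree group the East scores higher. The reversal comes from the group mix: in this made-up data most East Coast members hold a PhD, and PhD holders score lower on both coasts.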

Correlation ≠ Causation

  • Correlation: Statistical relationship.
  • Causation: Direct cause-and-effect (establishing it typically requires controlled experiments).

Reading Materials

  1. Howard Seltman, Exploratory Data Analysis (PDF), 2018.
  2. Yassien Shaalan, EDA using Python, 2019.
  3. [YouTube] Prof. Patrick Meyer, Exploratory Data Analysis, 2015.