Outline
- Exploratory Data Analysis (EDA)
- Data Types (Numeric, Categorical)
- Statistical Description of Data
- Measures of Central Tendency
- Measures of Dispersion
- Outliers
- Graphical Representation
- Relationship between Attributes
- Correlation Analysis
- Simpson’s Paradox (Correlation vs. Causality)
Data Science Process
- Ask an interesting question
  - Scientific goal? Prediction target?
- Get the data
  - Sampling, relevance, privacy issues.
- Explore the data
  - Plot data, identify anomalies/patterns.
- Model the data
  - Build, fit, validate models.
- Communicate results
  - Visualize, interpret, and tell a story.
Source: Harvard Data Science Course
4 Types of Analytics
| Type | Question Answered |
|---|---|
| Descriptive | What is happening? |
| Diagnostic | Why did it happen? |
| Predictive | What is likely to happen? |
| Prescriptive | What should I do? |
Source: Gartner
Data Types
Numerical Data
- Discrete: Countable (e.g., number of staff).
- Continuous: Measurable (e.g., height, weight).
Categorical Data
- Nominal: No order (e.g., gender).
- Ordinal: Ordered ranks (e.g., satisfaction levels).
- Binary: Two states (e.g., Yes/No).
Statistical Description
Measures of Central Tendency
| Measure | Description | Sensitivity to Outliers |
|---|---|---|
| Mean | Average value | High |
| Median | Middle value in ordered data | Low |
| Mode | Most frequent value | None |
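As a minimal sketch of the sensitivity column above, the snippet below compares the three measures on a small list with and without one extreme value. The salary figures and the helper name `summarize` are made up for illustration and are not part of the course material.

```python
import statistics

# Hypothetical salaries (in thousands); 400 is an artificial outlier.
salaries = [30, 32, 32, 35, 38, 40]
with_outlier = salaries + [400]

def summarize(values):
    """Return (mean, median, mode) for a list of numbers."""
    return (
        statistics.mean(values),
        statistics.median(values),
        statistics.mode(values),
    )

mean_a, median_a, mode_a = summarize(salaries)
mean_b, median_b, mode_b = summarize(with_outlier)

print("without outlier:", mean_a, median_a, mode_a)
print("with outlier:   ", mean_b, median_b, mode_b)
```

In this toy data the mean more than doubles once the outlier is added, while the median moves only slightly and the mode stays the same.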
Measures of Dispersion
- Range: Max - Min.
- Variance/Standard Deviation: Spread around mean.
- IQR: Q3 - Q1 (robust to outliers).
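The dispersion measures listed above can be sketched with Python's standard `statistics` module. The data are hypothetical, and two choices here are assumptions rather than anything the notes prescribe: the population variance (`pvariance`) and the "inclusive" quartile method. Other quartile conventions give slightly different Q1 and Q3.

```python
import statistics

# Hypothetical measurements.
data = [2, 4, 4, 5, 7, 9, 11, 14]

data_range = max(data) - min(data)

# Population variance and standard deviation (divide by n).
# Use statistics.variance / statistics.stdev for the sample versions (n - 1).
variance = statistics.pvariance(data)
std_dev = statistics.pstdev(data)

# Quartiles; the "inclusive" method is one of several common conventions.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print("range:", data_range)
print("variance:", variance, "std dev:", round(std_dev, 3))
print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
```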
Graphical Representations
- Boxplots: Show median, quartiles, outliers.
- Histograms: Distribution of continuous data.
- Scatterplots: Bivariate relationships.
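To make the histogram idea concrete, here is a rough sketch of the binning step that sits behind a histogram: each value is assigned to an equal-width bin and the bins are counted. The heights, the bin width, and the function name `histogram_counts` are illustrative assumptions. A plotting library would normally do this binning and draw the bars for you.

```python
def histogram_counts(values, bin_width, start):
    """Count how many values fall into each equal-width bin.

    Bins are [start, start + bin_width), [start + bin_width, ...), and so on.
    Returns a dict mapping each bin's left edge to its count.
    """
    counts = {}
    for v in values:
        left_edge = start + ((v - start) // bin_width) * bin_width
        counts[left_edge] = counts.get(left_edge, 0) + 1
    return dict(sorted(counts.items()))

# Hypothetical heights in centimetres.
heights = [150, 152, 158, 161, 163, 164, 167, 171, 175, 182]
counts = histogram_counts(heights, bin_width=10, start=150)

# Crude text rendering: one '#' per observation in the bin.
for left_edge, count in counts.items():
    print(f"{left_edge}-{left_edge + 9}: {'#' * count}")
```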
Outlier Detection
- Inner Fences: Q1 - 1.5*IQR to Q3 + 1.5*IQR (values outside are mild outliers).
- Outer Fences: Q1 - 3*IQR to Q3 + 3*IQR (values outside are extreme outliers).
Example:
- Dataset: [30, 171, ..., 1441] → 1441 is a mild outlier.
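The fence rules can be sketched as a small classifier. This is only an illustration under stated assumptions: the dataset below is hypothetical (it is not the elided dataset from the example), the function name `classify_outliers` is invented here, and quartiles use the "inclusive" method, so other quartile conventions may place the fences slightly differently.

```python
import statistics

def classify_outliers(values):
    """Label each value 'normal', 'mild', or 'extreme' using IQR fences."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outer_low, outer_high = q1 - 3 * iqr, q3 + 3 * iqr

    labelled = []
    for v in values:
        if v < outer_low or v > outer_high:
            label = "extreme"   # beyond the outer fences
        elif v < inner_low or v > inner_high:
            label = "mild"      # between the inner and outer fences
        else:
            label = "normal"    # inside the inner fences
        labelled.append((v, label))
    return labelled

# Hypothetical data: Q1 = 13, Q3 = 18, IQR = 5,
# so the inner fences are 5.5 and 25.5 and the outer fences are -2 and 33.
sample = [10, 12, 13, 14, 15, 16, 18, 27, 40]
for value, label in classify_outliers(sample):
    if label != "normal":
        print(value, label)
```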
Skewness & Kurtosis
- Skewness: Asymmetry of distribution.
  - Rule: -0.5 to 0.5 (fairly symmetrical).
- Kurtosis: Tailedness.
  - A normal distribution has kurtosis 3 (excess kurtosis 0).
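Both quantities can be sketched from central moments. The version below assumes the simple moment-based (population) definitions, skewness = m3 / m2^1.5 and kurtosis = m4 / m2^2, which is the convention under which a normal distribution has kurtosis 3. Statistical packages often apply small-sample corrections or report excess kurtosis instead, so their numbers may differ. The data and function names are illustrative.

```python
def central_moment(values, k):
    """k-th central moment: the mean of (x - mean)^k."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** k for v in values) / len(values)

def skewness(values):
    """Moment-based skewness: m3 / m2^1.5 (0 for symmetric data)."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def kurtosis(values):
    """Moment-based (non-excess) kurtosis: m4 / m2^2."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2

# Hypothetical samples.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]

print("symmetric skewness:", skewness(symmetric))
print("right-skewed skewness:", round(skewness(right_skewed), 2))
print("symmetric kurtosis:", round(kurtosis(symmetric), 2))
```

The symmetric sample has skewness 0 and falls inside the -0.5 to 0.5 band, while the sample with one long right tail falls well outside it.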
Correlation Analysis
- Formula:
  \[ r_{xy} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}} \]
- Interpretation:
  - +1: Perfect positive correlation.
  - -1: Perfect negative correlation.
  - 0: No linear correlation.
Example: Ice cream sales ↑ with temperature ↑ (r ≈ 0.96).
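The Pearson coefficient can be computed directly from its definition, as in the sketch below. The function name `pearson_r` and the temperature and sales figures are hypothetical; they are not the data behind the r ≈ 0.96 example.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)

    # Numerator: sum of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator: square root of the product of the sums of squared deviations.
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / math.sqrt(ss_x * ss_y)

# Hypothetical daily temperature (°C) and ice cream sales (units).
temperature = [14, 16, 19, 22, 25, 28]
sales = [210, 250, 300, 330, 420, 450]
print("r =", round(pearson_r(temperature, sales), 3))
```

Note that a value near 0 only rules out a linear relationship: for example, `pearson_r([-1, 0, 1], [1, 0, 1])` is 0 even though y is fully determined by x.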
Simpson’s Paradox
- Confounding Factor: A trend seen in aggregated data can reverse or disappear once a hidden (confounding) variable is taken into account.
- Example: West Coast vs. East Coast friendliness → PhD status explains the trend reversal.
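The reversal can be reproduced with a few numbers. The sketch below uses entirely hypothetical member counts and average friend counts (friend count standing in for "friendliness" is an assumption made here) to show how one coast can look friendlier overall while the other is friendlier within each PhD group.

```python
# Hypothetical numbers: (coast, has_phd) -> (member count, average friend count).
groups = {
    ("West", True): (30, 3.0),
    ("East", True): (70, 3.5),
    ("West", False): (70, 11.0),
    ("East", False): (30, 12.0),
}

def overall_average(coast):
    """Member-weighted average friend count for one coast, ignoring PhD status."""
    members = 0
    friends = 0.0
    for (group_coast, _has_phd), (count, average) in groups.items():
        if group_coast == coast:
            members += count
            friends += count * average
    return friends / members

west_all = overall_average("West")
east_all = overall_average("East")

print("overall:", "West", round(west_all, 2), "East", round(east_all, 2))
for has_phd in (True, False):
    west_avg = groups[("West", has_phd)][1]
    east_avg = groups[("East", has_phd)][1]
    print("PhD" if has_phd else "no PhD", "West", west_avg, "East", east_avg)
```

In this made-up data the West looks friendlier overall only because most of its members are in the non-PhD group, which has more friends on both coasts. Within each group the East average is higher.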
Correlation ≠ Causation
- Correlation: Statistical relationship.
- Causation: Direct cause-and-effect; establishing it generally requires controlled experiments.
Reading Materials
- Howard Seltman, Exploratory Data Analysis (PDF), 2018.
- Yassien Shaalan, EDA using Python, 2019.
- [YouTube] Prof. Patrick Meyer, Exploratory Data Analysis, 2015.