Lecture 6: Predictive Modeling
Outline
Prediction
- Phases: Building & Applying
- Classification & Regression
- Decision Trees
- Linear Regression
Evaluation Metrics
- Overfitting & Underfitting
- Cross-validation
Data Science Process
- Ask an interesting question:
- Scientific goal, predictions, or estimates.
- Get the data:
- Sampling, relevance, privacy.
- Explore the data:
- Plotting, anomaly detection, pattern identification.
- Model the data:
- Build, fit, validate models.
- Communicate results:
- Insights, visualization, storytelling.
Source: Joe Blitzstein and Hanspeter Pfister, Harvard Data Science Course
Prediction
- Purpose: Estimate/forecast (e.g., sales, weather).
- Model Input/Output:
- Descriptors (input variables).
- Response (output variable).
- Example: Predict car fuel efficiency (MPG) using Cylinders, Displacement, etc.
Usage of Predictive Models
- Prioritization:
- E.g., credit card campaigns, experiment selection.
- Decision Support:
- E.g., weather forecasts triggering emergency alerts.
- Understanding:
- Identify key variables and their relationships.
Phases: Building & Applying
Building
- Training Set: Build model.
- Test/Validation Set: Assess model quality.
Applying
- Apply the model to new data for which the response variable is unknown.
Data Partitioning
| Partition | Purpose |
|---|---|
| Training | Model building |
| Validation | Tuning (e.g., hyperparameter selection) |
| Test | Final evaluation |
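A minimal sketch of producing such a three-way split by applying scikit-learn's `train_test_split` twice; the iris dataset and the 60/20/20 proportions are illustrative assumptions, not from the lecture:

```python
# A minimal sketch: split a dataset into training, validation,
# and test partitions by applying train_test_split twice.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 20% as the final test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
# Carve a validation set out of the remainder (25% of 80% = 20% overall).
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 90 30 30
```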
Algorithms for Prediction
| Classification | Regression |
|---|---|
| Classification Trees | Regression Trees |
| k-Nearest Neighbors | k-Nearest Neighbors |
| Logistic Regression | Linear Regression |
| Naïve Bayes | Neural Networks |
Decision Trees
- Descriptors are the inputs used to build a decision tree.
- Two nodes that are joined together stand in a parent-child relationship.
- The larger node that is being divided is the parent node.
- A child node with no further children is a leaf node.
- Structure:
- Nodes: Decision points (parent/child/leaf).
- Splits: Based on descriptors (e.g., humidity, outlook).
- Example: Play Golf? (Yes/No) using weather attributes; see the sketch below.
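A minimal sketch of the Play Golf example with scikit-learn's `DecisionTreeClassifier`; the tiny weather table below is a hand-coded stand-in for the classic dataset, not the lecture's actual data:

```python
# A minimal sketch: fit a decision tree on a toy weather table
# and print the learned splits.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

data = pd.DataFrame({
    "outlook":  ["sunny", "sunny", "overcast", "rainy", "rainy", "overcast"],
    "humidity": ["high",  "normal", "high",    "high",  "normal", "normal"],
    "play":     ["no",    "yes",    "yes",     "no",    "yes",    "yes"],
})

# One-hot encode the categorical descriptors (inputs).
X = pd.get_dummies(data[["outlook", "humidity"]])
y = data["play"]  # response (output)

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))
```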
Pros & Cons
- Pros: Interpretable, handles categorical/continuous data.
- Cons: Computationally expensive, prone to overfitting.
Random Forest
- Ensemble of decision trees (improves accuracy, reduces variance).
- Process: Bagging (each tree is trained on a bootstrap sample) + majority voting across trees; see the sketch below.
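A minimal sketch of a random forest with scikit-learn; the iris dataset is assumed purely for illustration:

```python
# A minimal sketch: an ensemble of 100 bagged trees whose class
# predictions are combined by majority vote.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))
```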
Linear Regression
- Simple Linear Regression:
- Equation: \( Y = a + bX \).
- Example: Predict sales from income.
- Multiple Linear Regression:
- Equation: \( Y = a + b_1X_1 + b_2X_2 + \dots \).
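A minimal sketch of both variants with scikit-learn's `LinearRegression`; the toy arrays are made-up illustration data:

```python
# A minimal sketch: fit simple and multiple linear regression
# and read off the learned intercept and coefficients.
import numpy as np
from sklearn.linear_model import LinearRegression

# Simple: one descriptor X -> response Y (Y = a + bX).
X = np.array([[1.0], [2.0], [3.0], [4.0]])
Y = np.array([2.1, 4.0, 6.2, 7.9])
model = LinearRegression().fit(X, Y)
print("intercept a:", model.intercept_, "slope b:", model.coef_[0])

# Multiple: two descriptors X1, X2 -> Y = a + b1*X1 + b2*X2.
X_multi = np.array([[1.0, 0.5], [2.0, 1.0], [3.0, 1.5], [4.0, 3.0]])
model_multi = LinearRegression().fit(X_multi, Y)
print("coefficients b1, b2:", model_multi.coef_)
```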
Correlation vs. Regression
- Correlation: Measures the strength and direction of the relationship between two variables.
- Regression: Fits an equation to predict Y from X; see the sketch below.
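A minimal sketch of the distinction on the same toy data, assuming NumPy: `np.corrcoef` reports the strength of the association, while `np.polyfit` produces a prediction rule:

```python
# A minimal sketch: correlation vs. regression on toy data.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

r = np.corrcoef(x, y)[0, 1]      # strength of linear association
b, a = np.polyfit(x, y, deg=1)   # slope b and intercept a of the fit
print(f"correlation r = {r:.3f}")
print(f"regression: Y = {a:.2f} + {b:.2f} X")
```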
Evaluation Metrics
Classification
- Accuracy: Proportion of predictions that are correct.
- Confusion Matrix: TP, FP, TN, FN.
- ROC Curve: TPR vs. FPR (AUC ranges from 0.5 for random guessing to 1.0 for a perfect classifier).
- Precision/Recall/F1:
- Precision = \( \frac{TP}{TP+FP} \).
- Recall = \( \frac{TP}{TP+FN} \).
- F1 = Harmonic mean of precision and recall.
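A minimal sketch computing these metrics with scikit-learn; the true labels, predicted labels, and scores below are hard-coded purely for illustration:

```python
# A minimal sketch: the classification metrics above on toy labels.
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true  = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred  = [1, 0, 1, 0, 0, 1, 1, 0]
y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.6, 0.7, 0.3]  # predicted probabilities

print("accuracy :", accuracy_score(y_true, y_pred))
print("confusion:\n", confusion_matrix(y_true, y_pred))  # [[TN FP] [FN TP]]
print("precision:", precision_score(y_true, y_pred))     # TP / (TP + FP)
print("recall   :", recall_score(y_true, y_pred))        # TP / (TP + FN)
print("F1       :", f1_score(y_true, y_pred))
print("ROC AUC  :", roc_auc_score(y_true, y_score))
```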
Regression
- MAE: Mean absolute error.
- MSE: Mean squared error.
- R²: Goodness of fit (0–1).
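A minimal sketch of the regression metrics, again on hard-coded illustrative values:

```python
# A minimal sketch: MAE, MSE, and R² on toy predictions.
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.8, 5.4, 2.9, 6.5]

print("MAE:", mean_absolute_error(y_true, y_pred))
print("MSE:", mean_squared_error(y_true, y_pred))
print("R² :", r2_score(y_true, y_pred))
```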
Overfitting & Underfitting
- Underfitting: Too simple (high bias).
- Overfitting: Too complex (high variance).
- Remedies:
- Adjust model complexity, features, or the amount of training data.
- Use cross-validation to estimate generalization error.
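A minimal sketch of the bias/variance trade-off: polynomials of increasing degree are fit to noisy toy data (generated in the snippet), and training vs. test error is compared. The low-degree fit underfits; the high-degree fit overfits:

```python
# A minimal sketch: under- vs. overfitting with polynomial degree.
import numpy as np
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 40)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)
# Alternate points between training and test sets.
x_tr, y_tr = x[::2], y[::2]
x_te, y_te = x[1::2], y[1::2]

for degree in (1, 4, 15):  # too simple, about right, too complex
    coeffs = np.polyfit(x_tr, y_tr, degree)
    train_mse = mean_squared_error(y_tr, np.polyval(coeffs, x_tr))
    test_mse = mean_squared_error(y_te, np.polyval(coeffs, x_te))
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```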
Cross-Validation
- k-Fold: Split the data into k equal folds; each fold serves once as the test set while the remaining k-1 folds train the model, and the k scores are averaged (sketched below).
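A minimal sketch of 5-fold cross-validation with scikit-learn's `cross_val_score`; the iris dataset and the decision tree model are illustrative assumptions:

```python
# A minimal sketch: 5-fold cross-validation of a decision tree.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print("per-fold accuracy:", scores)
print("mean accuracy    :", scores.mean())
```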
Feature Engineering
- Dimensionality Reduction: PCA, ICA.
- Feature Selection: Discriminative features.
- Feature Extraction: Composite features.
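A minimal sketch of dimensionality reduction with PCA, projecting the four iris descriptors onto two principal components; the dataset choice is an illustrative assumption:

```python
# A minimal sketch: reduce 4 descriptors to 2 principal components.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
X_2d = PCA(n_components=2).fit_transform(X)
print("original shape:", X.shape)    # (150, 4)
print("reduced shape :", X_2d.shape)  # (150, 2)
```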
Resources
- Scikit-learn Tutorials.
- Jason Brownlee’s Machine Learning Mastery.
- Books:
- “Python Machine Learning” by Sebastian Raschka.
- “Introduction to Machine Learning with Python” by Andreas C. Müller and Sarah Guido.