Lecture 6: Predictive Modeling
Outline
Prediction
- Phases: Building & Applying
- Classification & Regression
- Decision Trees
- Linear Regression
Evaluation Metrics
- Overfitting & Underfitting
- Cross-validation
Data Science Process
- Ask an interesting question:
- Scientific goal, predictions, or estimates.
- Get the data:
- Sampling, relevance, privacy.
- Explore the data:
- Plotting, anomaly detection, pattern identification.
- Model the data:
- Build, fit, validate models.
- Communicate results:
- Insights, visualization, storytelling.
Source: Joe Blitzstein and Hanspeter Pfister, Harvard Data Science Course
Prediction
- Purpose: Estimate/forecast (e.g., sales, weather).
- Model Input/Output:
- Descriptors (input variables).
- Response (output variable).
- Example: Predict car fuel efficiency (MPG) using Cylinders, Displacement, etc.
Usage of Predictive Models
- Prioritization:
- E.g., credit card campaigns, experiment selection.
- Decision Support:
- E.g., weather forecasts triggering emergency alerts.
- Understanding:
- Identify key variables and their relationships.
Phases: Building & Applying
Building
- Training Set: Build model.
- Test/Validation Set: Assess model quality.
Applying
- Apply the model to new data for which the response variable is unknown.
Data Partitioning
| Partition | Purpose |
|---|---|
| Training | Model building |
| Validation | Tuning (e.g., hyperparameter selection) |
| Test | Final evaluation |
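A minimal sketch of producing such a three-way split by applying scikit-learn's `train_test_split` twice; the iris dataset and the 60/20/20 proportions are illustrative assumptions, not from the lecture:

```python
# A minimal sketch: split a dataset into training, validation,
# and test partitions by applying train_test_split twice.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 20% as the final test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
# Carve a validation set out of the remainder (25% of 80% = 20% overall).
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 90 30 30
```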
Algorithms for Prediction
| Classification | Regression |
|---|---|
| Classification Trees | Regression Trees |
| k-Nearest Neighbors | k-Nearest Neighbors |
| Logistic Regression | Linear Regression |
| Naïve Bayes | Neural Networks |
Decision Trees
- Descriptors are the inputs used to build a decision tree.
- Two nodes that are joined together stand in a parent-child relationship.
- The larger node that is being divided is the parent node.
- A child node with no further children is a leaf node.
- Structure:
- Nodes: Decision points (parent/child/leaf).
- Splits: Based on descriptors (e.g., humidity, outlook).
- Example: Play Golf? (Yes/No) using weather attributes; see the sketch below.
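A minimal sketch of the Play Golf example with scikit-learn's `DecisionTreeClassifier`; the tiny weather table below is a hand-coded stand-in for the classic dataset, not the lecture's actual data:

```python
# A minimal sketch: fit a decision tree on a toy weather table
# and print the learned splits.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

data = pd.DataFrame({
    "outlook":  ["sunny", "sunny", "overcast", "rainy", "rainy", "overcast"],
    "humidity": ["high",  "normal", "high",    "high",  "normal", "normal"],
    "play":     ["no",    "yes",    "yes",     "no",    "yes",    "yes"],
})

# One-hot encode the categorical descriptors (inputs).
X = pd.get_dummies(data[["outlook", "humidity"]])
y = data["play"]  # response (output)

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))
```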
Pros & Cons
- Pros: Interpretable, handles categorical/continuous data.
- Cons: Computationally expensive, prone to overfitting.
Random Forest
- Ensemble of decision trees (improves accuracy, reduces variance).
- Process: Bagging (each tree is trained on a bootstrap sample) + majority voting across trees; see the sketch below.
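A minimal sketch of a random forest with scikit-learn; the iris dataset is assumed purely for illustration:

```python
# A minimal sketch: an ensemble of 100 bagged trees whose class
# predictions are combined by majority vote.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))
```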
Linear Regression
- Simple Linear Regression:
- Equation: \( Y = a + bX \).
- Example: Predict sales from income.
- Multiple Linear Regression:
- Equation: \( Y = a + b_1X_1 + b_2X_2 + \dots \).
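A minimal sketch of both variants with scikit-learn's `LinearRegression`; the toy arrays are made-up illustration data:

```python
# A minimal sketch: fit simple and multiple linear regression
# and read off the learned intercept and coefficients.
import numpy as np
from sklearn.linear_model import LinearRegression

# Simple: one descriptor X -> response Y (Y = a + bX).
X = np.array([[1.0], [2.0], [3.0], [4.0]])
Y = np.array([2.1, 4.0, 6.2, 7.9])
model = LinearRegression().fit(X, Y)
print("intercept a:", model.intercept_, "slope b:", model.coef_[0])

# Multiple: two descriptors X1, X2 -> Y = a + b1*X1 + b2*X2.
X_multi = np.array([[1.0, 0.5], [2.0, 1.0], [3.0, 1.5], [4.0, 3.0]])
model_multi = LinearRegression().fit(X_multi, Y)
print("coefficients b1, b2:", model_multi.coef_)
```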
Correlation vs. Regression
- Correlation: Measures the strength and direction of the relationship between two variables.
- Regression: Fits an equation to predict Y from X; see the sketch below.
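A minimal sketch of the distinction on the same toy data, assuming NumPy: `np.corrcoef` reports the strength of the association, while `np.polyfit` produces a prediction rule:

```python
# A minimal sketch: correlation vs. regression on toy data.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

r = np.corrcoef(x, y)[0, 1]      # strength of linear association
b, a = np.polyfit(x, y, deg=1)   # slope b and intercept a of the fit
print(f"correlation r = {r:.3f}")
print(f"regression: Y = {a:.2f} + {b:.2f} X")
```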
Evaluation Metrics
Classification
- Accuracy: Proportion of predictions that are correct.
- Confusion Matrix: TP, FP, TN, FN.
- ROC Curve: TPR vs. FPR (AUC ranges from 0.5 for random guessing to 1.0 for a perfect classifier).
- Precision/Recall/F1:
- Precision = \( \frac{TP}{TP+FP} \).
- Recall = \( \frac{TP}{TP+FN} \).
- F1 = Harmonic mean of precision and recall.
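A minimal sketch computing these metrics with scikit-learn; the true labels, predicted labels, and scores below are hard-coded purely for illustration:

```python
# A minimal sketch: the classification metrics above on toy labels.
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true  = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred  = [1, 0, 1, 0, 0, 1, 1, 0]
y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.6, 0.7, 0.3]  # predicted probabilities

print("accuracy :", accuracy_score(y_true, y_pred))
print("confusion:\n", confusion_matrix(y_true, y_pred))  # [[TN FP] [FN TP]]
print("precision:", precision_score(y_true, y_pred))     # TP / (TP + FP)
print("recall   :", recall_score(y_true, y_pred))        # TP / (TP + FN)
print("F1       :", f1_score(y_true, y_pred))
print("ROC AUC  :", roc_auc_score(y_true, y_score))
```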
Regression
- MAE: Mean absolute error.
- MSE: Mean squared error.
- R²: Goodness of fit (0–1).
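A minimal sketch of the regression metrics, again on hard-coded illustrative values:

```python
# A minimal sketch: MAE, MSE, and R² on toy predictions.
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.8, 5.4, 2.9, 6.5]

print("MAE:", mean_absolute_error(y_true, y_pred))
print("MSE:", mean_squared_error(y_true, y_pred))
print("R² :", r2_score(y_true, y_pred))
```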
Overfitting & Underfitting
- Underfitting: Too simple (high bias).
- Overfitting: Too complex (high variance).
- Remedies:
- Adjust model complexity, features, or the amount of training data.
- Use cross-validation to estimate generalization error.
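A minimal sketch of the bias/variance trade-off: polynomials of increasing degree are fit to noisy toy data (generated in the snippet), and training vs. test error is compared. The low-degree fit underfits; the high-degree fit overfits:

```python
# A minimal sketch: under- vs. overfitting with polynomial degree.
import numpy as np
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 40)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)
# Alternate points between training and test sets.
x_tr, y_tr = x[::2], y[::2]
x_te, y_te = x[1::2], y[1::2]

for degree in (1, 4, 15):  # too simple, about right, too complex
    coeffs = np.polyfit(x_tr, y_tr, degree)
    train_mse = mean_squared_error(y_tr, np.polyval(coeffs, x_tr))
    test_mse = mean_squared_error(y_te, np.polyval(coeffs, x_te))
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```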
Cross-Validation
- k-Fold: Split the data into k equal folds; each fold serves once as the test set while the remaining k-1 folds train the model, and the k scores are averaged (sketched below).
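A minimal sketch of 5-fold cross-validation with scikit-learn's `cross_val_score`; the iris dataset and the decision tree model are illustrative assumptions:

```python
# A minimal sketch: 5-fold cross-validation of a decision tree.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print("per-fold accuracy:", scores)
print("mean accuracy    :", scores.mean())
```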
Feature Engineering
- Dimensionality Reduction: PCA, ICA.
- Feature Selection: Discriminative features.
- Feature Extraction: Composite features.
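A minimal sketch of dimensionality reduction with PCA, projecting the four iris descriptors onto two principal components; the dataset choice is an illustrative assumption:

```python
# A minimal sketch: reduce 4 descriptors to 2 principal components.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
X_2d = PCA(n_components=2).fit_transform(X)
print("original shape:", X.shape)    # (150, 4)
print("reduced shape :", X_2d.shape)  # (150, 2)
```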
Resources
- Scikit-learn Tutorials.
- Jason Brownlee’s Machine Learning Mastery.
- Books:
- “Python Machine Learning” by Sebastian Raschka.
- “Introduction to Machine Learning with Python” by Andreas C. Müller and Sarah Guido.