Lecture 6: Predictive Modeling


Outline

Prediction

  • Phases: Building & Applying
  • Classification & Regression
    • Decision Trees
    • Linear Regression

Classification Metrics

  • Overfitting & Underfitting
  • Cross-validation

Data Science Process

  1. Ask an interesting question:
    • Scientific goal, predictions, or estimates.
  2. Get the data:
    • Sampling, relevance, privacy.
  3. Explore the data:
    • Plotting, anomaly detection, pattern identification.
  4. Model the data:
    • Build, fit, validate models.
  5. Communicate results:
    • Insights, visualization, storytelling.

Source: Joe Blitzstein and Hanspeter Pfister, Harvard Data Science Course


Prediction

  • Purpose: Estimate/forecast (e.g., sales, weather).
  • Model Input/Output:
    • Descriptors (input variables).
    • Response (output variable).
  • Example: Predict car fuel efficiency (MPG) using Cylinders, Displacement, etc.
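
A minimal sketch of the descriptor/response split for the MPG example, using pandas and invented numbers:

    import pandas as pd

    # Hypothetical MPG data: each row is one car (values are made up).
    cars = pd.DataFrame({
        "Cylinders":    [4, 6, 8, 4],
        "Displacement": [97.0, 199.0, 318.0, 121.0],
        "MPG":          [27.0, 18.0, 15.0, 25.0],
    })

    X = cars[["Cylinders", "Displacement"]]  # descriptors (input variables)
    y = cars["MPG"]                          # response (output variable)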

Usage of Predictive Models

  1. Prioritization:
    • E.g., credit card campaigns, experiment selection.
  2. Decision Support:
    • E.g., weather forecasts triggering emergency alerts.
  3. Understanding:
    • Identify key variables and their relationships.

Phases: Building & Applying

Building

  • Training Set: Build model.
  • Test/Validation Set: Assess model quality.

Applying

  • Use the model on new data, for which the response is not yet known (sketched below).
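
A minimal end-to-end sketch of both phases with scikit-learn, using synthetic data in place of real descriptors and responses:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    # Synthetic data standing in for real descriptors (X) and responses (y).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Building phase: fit on the training set, assess quality on the test set.
    model = LinearRegression().fit(X_train, y_train)
    print("Test R^2:", model.score(X_test, y_test))

    # Applying phase: predict responses for new data whose response is unknown.
    X_new = rng.normal(size=(5, 3))
    print("Predictions:", model.predict(X_new))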

Data Partitioning

Partition     Purpose
Training      Model building
Validation    Tuning
Test          Final evaluation
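
One way to produce the three partitions is two successive splits with scikit-learn's train_test_split (a sketch with toy arrays):

    import numpy as np
    from sklearn.model_selection import train_test_split

    X = np.arange(40).reshape(20, 2)  # toy descriptors
    y = np.arange(20)                 # toy response

    # Hold out 20% as the final test set first...
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)
    # ...then split the remainder into training and validation partitions.
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=0.25, random_state=0)

    print(len(X_train), len(X_val), len(X_test))  # 12 4 4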

Algorithms for Prediction

Classification          Regression
Classification Trees    Regression Trees
k-Nearest Neighbors     k-Nearest Neighbors
Logistic Regression     Linear Regression
Naïve Bayes             Neural Networks

Decision Trees

  • Descriptors are the inputs used to build a decision tree.

  • Two nodes that are joined together have a parent-child relationship.

  • The larger node that is being divided is the parent node.

  • A child node with no further children is a leaf node.

  • Structure:

    • Nodes: Decision points (parent/child/leaf).
    • Splits: Based on descriptors (e.g., humidity, outlook).
  • Example: Play Golf? (Yes/No) using weather attributes, as sketched below.
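
A sketch of the Play Golf example with scikit-learn's DecisionTreeClassifier (the weather records below are invented for illustration):

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Tiny made-up "Play Golf?" dataset with categorical weather descriptors.
    weather = pd.DataFrame({
        "Outlook":  ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Overcast"],
        "Humidity": ["High", "Normal", "High", "High", "Normal", "Normal"],
        "Play":     ["No", "Yes", "Yes", "No", "Yes", "Yes"],
    })

    # scikit-learn trees need numeric inputs, so one-hot encode the descriptors.
    X = pd.get_dummies(weather[["Outlook", "Humidity"]])
    y = weather["Play"]

    tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
    print(export_text(tree, feature_names=list(X.columns)))  # splits as text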

Pros & Cons

  • Pros: Interpretable, handles categorical/continuous data.
  • Cons: Computationally expensive, prone to overfitting.

Random Forest

  • Ensemble of decision trees (improves accuracy).
  • Process: Bagging + majority voting.
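
A minimal sketch with scikit-learn's RandomForestClassifier on synthetic data; each tree is trained on a bootstrap sample (bagging) and the forest predicts by majority vote:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    # Synthetic classification data.
    X, y = make_classification(n_samples=500, n_features=10, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # 100 bagged trees; predictions are aggregated by majority vote.
    forest = RandomForestClassifier(n_estimators=100, random_state=0)
    forest.fit(X_train, y_train)
    print("Test accuracy:", forest.score(X_test, y_test))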

Linear Regression

  • Simple Linear Regression:
    • Equation: ( Y = a + bX ).
    • Example: Predict sales from income.
  • Multiple Linear Regression:
    • Equation: ( Y = a + b_1X_1 + b_2X_2 + \dots ).
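
A sketch of the sales-from-income example with scikit-learn (the five data points are invented):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    income = np.array([[20], [40], [60], [80], [100]])  # X, e.g., in $1000s
    sales = np.array([52, 95, 140, 183, 230])           # Y

    reg = LinearRegression().fit(income, sales)
    print("a (intercept):", reg.intercept_)
    print("b (slope):", reg.coef_[0])
    print("Predicted sales at income 70:", reg.predict([[70]])[0])

For multiple linear regression, X simply gains more columns and reg.coef_ holds one slope ( b_i ) per descriptor.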

Correlation vs. Regression

  • Correlation: Measures relationship strength.
  • Regression: Predicts Y from X.
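
The contrast in code (numpy only, invented points):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

    # Correlation: strength of the linear relationship, unitless, in [-1, 1].
    r = np.corrcoef(x, y)[0, 1]

    # Regression: a fitted line used to predict y from x.
    b, a = np.polyfit(x, y, 1)  # slope b, intercept a
    print(f"r = {r:.3f};  Y = {a:.2f} + {b:.2f} X")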

Evaluation Metrics

Classification

  • Accuracy: Fraction of predictions that are correct.
  • Confusion Matrix: Counts of TP, FP, TN, FN.
  • ROC Curve: TPR vs. FPR; AUC ranges from 0.5 (random guessing) to 1.0 (perfect).
  • Precision/Recall/F1:
    • Precision = ( \frac{TP}{TP+FP} ).
    • Recall = ( \frac{TP}{TP+FN} ).
    • F1 = Harmonic mean of precision and recall.
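
All of these metrics are available in sklearn.metrics; a sketch with made-up labels and scores:

    from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                                 precision_score, recall_score, roc_auc_score)

    y_true  = [1, 0, 1, 1, 0, 1, 0, 0]                   # actual labels
    y_pred  = [1, 0, 1, 0, 0, 1, 1, 0]                   # hard predictions
    y_score = [0.9, 0.2, 0.8, 0.4, 0.3, 0.7, 0.6, 0.1]   # predicted probabilities

    print("Accuracy :", accuracy_score(y_true, y_pred))
    print("Confusion:", confusion_matrix(y_true, y_pred))  # [[TN, FP], [FN, TP]]
    print("Precision:", precision_score(y_true, y_pred))   # TP / (TP + FP)
    print("Recall   :", recall_score(y_true, y_pred))      # TP / (TP + FN)
    print("F1       :", f1_score(y_true, y_pred))
    print("ROC AUC  :", roc_auc_score(y_true, y_score))    # area under ROC curve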

Regression

  • MAE: Mean absolute error.
  • MSE: Mean squared error.
  • R²: Goodness of fit (0–1).
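
Each regression metric is one call in sklearn.metrics (values below are made up):

    from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

    y_true = [3.0, 5.0, 2.5, 7.0]
    y_pred = [2.5, 5.0, 3.0, 8.0]

    print("MAE:", mean_absolute_error(y_true, y_pred))
    print("MSE:", mean_squared_error(y_true, y_pred))
    print("R^2:", r2_score(y_true, y_pred))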

Overfitting & Underfitting

  • Underfitting: Too simple (high bias).
  • Overfitting: Too complex (high variance).
  • Remedies:
    • Adjust features/data.
    • Use cross-validation.
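
One way to see both failure modes is to sweep a complexity knob such as tree depth and compare training vs. test accuracy (a sketch on synthetic data):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=600, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Shallow trees underfit (both scores low); unbounded trees overfit
    # (training accuracy near 1.0, test accuracy lagging behind).
    for depth in (1, 5, None):
        tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
        tree.fit(X_train, y_train)
        print(f"depth={depth}: train={tree.score(X_train, y_train):.2f} "
              f"test={tree.score(X_test, y_test):.2f}")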

Cross-Validation

  • k-Fold: Split the data into k subsets (folds); each fold is held out once for testing while the remaining k-1 folds are used for training.
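
With scikit-learn, cross_val_score handles the rotation automatically (a sketch on synthetic data):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=300, random_state=0)

    # 5-fold CV: each fold serves once as the test set, the rest as training.
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
    print("Fold accuracies:", scores)
    print("Mean accuracy  :", scores.mean())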

Feature Engineering

  • Dimensionality Reduction: PCA, ICA.
  • Feature Selection: Discriminative features.
  • Feature Extraction: Composite features.
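
As a dimensionality-reduction sketch, PCA in scikit-learn projects the four Iris features onto two composite features (principal components):

    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA

    X, y = load_iris(return_X_y=True)

    pca = PCA(n_components=2)
    X_reduced = pca.fit_transform(X)        # 150 x 4  ->  150 x 2
    print("Variance explained:", pca.explained_variance_ratio_)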

Resources

  • Scikit-learn Tutorials.
  • Jason Brownlee’s Machine Learning Mastery.
  • Books:
    • “Python Machine Learning” by Sebastian Raschka.
    • “Introduction to Machine Learning with Python” by Andreas Mueller.