Lecture 3: Data Wrangling


Data Science Process

  1. Ask an interesting question
    • Define scientific goals, predictions, or estimates.
  2. Get the data
    • Address sampling, relevance, and privacy.
  3. Explore the data
    • Plot data, identify anomalies/patterns.
  4. Model the data
    • Build, fit, and validate models.
  5. Communicate results
    • Visualize insights and tell a story.

Source: Harvard Data Science Course (Blitzstein & Pfister)


Data Quality Issues

  • Annual cost of bad data (US): $3.1 trillion (IBM).
  • Business impacts: 20% revenue loss, 54% cite data quality as top challenge.

Dimensions of Data Quality

  1. Completeness
    • Example: 98% completion rate for emergency contact numbers in a school database.
  2. Timeliness
    • Example: 3-day delay in updating a student’s contact info breaches SLA.
  3. Consistency
    • Example: Uniform date formats across school registers and databases.
  4. Validity
    • Example: Class identifiers (e.g., “AAA99”) must follow defined rules.
  5. Accuracy
    • Example: Incorrect date format (MM/DD vs DD/MM) leads to wrong age calculation.
  6. Uniqueness
    • Example: 96.2% uniqueness in student records due to duplicates (“Fred Smith” vs “Freddy Smith”).

For financial data, the highest-priority dimensions are typically:

  • Accuracy
  • Timeliness

Data Quality Assessment

  1. Identify which data items need to be assessed.
  2. Assess the relevant data quality dimensions (completeness, accuracy, etc.).
  3. Define values or ranges representing good- and bad-quality data (a minimal completeness check is sketched after this list).
  4. Review the results.
  5. Take corrective actions.
  6. Repeat the above on a periodic basis.
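
As a minimal illustration of steps 1-4, the sketch below (Python with pandas, both assumed rather than prescribed by the lecture) measures completeness against a 98% threshold for a hypothetical emergency-contact column; the threshold and column names are illustrative only.

  import pandas as pd

  # Hypothetical student table; column names and values are illustrative only.
  students = pd.DataFrame({
      "student_id": [1, 2, 3, 4, 5],
      "emergency_contact": ["555-0101", None, "555-0103", "555-0104", None],
  })

  # Steps 2-3: assess completeness and define what counts as "good" quality.
  completeness = students.notna().mean()   # share of non-null values per column
  THRESHOLD = 0.98                         # assumed acceptance threshold

  # Steps 4-5: review the results and flag columns needing corrective action.
  failing = completeness[completeness < THRESHOLD]
  print(failing)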

Data Cleaning/Wrangling

Definition: Transforming raw data into a usable format for analysis.


Common Problems & Solutions

Problem           | Causes                             | Solutions
Incomplete Data   | Missing values, truncated fields   | Ignore tuples, fill with mean/median, or predict values (sketched below).
Noisy Data        | Faulty instruments, entry errors   | Binning, outlier analysis, regression.
Inconsistent Data | Naming conventions, units, formats | Standardize formats, resolve conflicts.
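
A minimal sketch of the "fill with mean/median" fix for incomplete data, assuming Python with pandas and an invented numeric column:

  import numpy as np
  import pandas as pd

  df = pd.DataFrame({"score": [67, np.nan, 80, 91, np.nan, 74]})  # illustrative data

  # Option 1: ignore (drop) tuples with missing values.
  dropped = df.dropna()

  # Option 2: fill missing values with the column median (robust to outliers).
  df["score_filled"] = df["score"].fillna(df["score"].median())
  print(df)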

Binning Example

Original Data: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
After Binning (equal-frequency; values replaced by bin labels 1-3): 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
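
A sketch reproducing the example above with equal-frequency binning, assuming Python with pandas (pd.qcut); the last line shows smoothing by bin means, one common way to use the bins against noise.

  import pandas as pd

  data = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

  # Equal-frequency binning into three bins, labelled 1-3 as above.
  labels = pd.qcut(data, q=3, labels=[1, 2, 3])
  print(list(labels))   # [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]

  # Smoothing by bin means: replace each value with the mean of its bin.
  smoothed = data.groupby(labels).transform("mean")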

Outliers

  • Not always errors; detect via scatter plots.
  • Example: Temperature vs. cans sold plot shows outliers (65°F, 95°F).
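
The scatter-plot check can be backed by a simple numeric rule. The sketch below (Python with pandas; the data values are invented to mimic the temperature vs. cans-sold example) flags points outside 1.5 × IQR; a purely bivariate outlier such as the 95°F point is still easiest to spot on the scatter plot itself.

  import pandas as pd

  # Invented temperature vs. cans-sold data; not the lecture's actual figures.
  sales = pd.DataFrame({
      "temp_f":    [60, 62, 65, 70, 72, 75, 78, 80, 82, 95],
      "cans_sold": [20, 22, 80, 30, 33, 36, 40, 43, 45,  5],
  })

  # Flag cans_sold values outside 1.5 * IQR as candidate outliers.
  q1, q3 = sales["cans_sold"].quantile([0.25, 0.75])
  iqr = q3 - q1
  mask = (sales["cans_sold"] < q1 - 1.5 * iqr) | (sales["cans_sold"] > q3 + 1.5 * iqr)
  print(sales[mask])   # inspect before deciding whether these points are errors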

Advanced Wrangling Tasks

  1. Data Integration
    • Merge heterogeneous sources (e.g., schema, value, semantic differences).
  2. Entity Resolution
    • Link records representing the same entity (e.g., edit distance, clustering).
  3. Data Reduction
    • Dimensionality reduction: Remove irrelevant attributes.
    • Numerosity reduction: Use smaller data representations.
    • Data compression: Reduce size while preserving info.

Attribute Subset Selection

  • Methods: Forward selection, backward elimination, decision trees.

Data Transformation

  1. Normalization
    • Min-max: Scale to [0, 1] (e.g., $73,600 → 0.716).
    • Z-score: Shift by mean/std dev (e.g., $73,600 → 1.225).
  2. Discretization
    • Convert continuous data to intervals (e.g., binning, entropy-based splits).
  3. Concept Hierarchy
    • Organize attributes hierarchically (e.g., street < city < state < country).

Data Integration

  • Merging data from multiple sources, typically to provide a single unified view over all of them.
  • Integration can be physical or virtual:
    • Physical: copy the data into a warehouse.
    • Virtual: keep the data only at the sources and query it in place.
  • Issues: naming inconsistencies, aggregated attributes, redundant data.

The Need for Data Integration

Challenge: Heterogeneity across data sources complicates merging.

Types of Heterogeneity

  1. Source Type
    • Different storage systems (e.g., SQL vs. NoSQL).
  2. Communication
    • Varied access methods (APIs, query languages, web interfaces).
  3. Schema
    • Structural differences in tables (e.g., column names/order).
  4. Data Type
    • Same data stored differently (e.g., phone numbers as string vs. integer).
  5. Value
    • Logical values represented inconsistently (e.g., “Prof.” vs. “Professor”).
  6. Semantic
    • Same column names with different meanings (e.g., “title” as job title vs. person title).

Example:

  • Integrating student records from a school database (DD/MM/YYYY dates) and a US application form (MM/DD/YYYY dates).
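
A sketch of reconciling the two date conventions before merging, assuming Python with pandas; table and column names are hypothetical.

  import pandas as pd

  # School database uses DD/MM/YYYY; the US application form uses MM/DD/YYYY.
  school  = pd.DataFrame({"student": ["Ana"], "dob": ["03/04/2007"]})  # 3 April 2007
  us_form = pd.DataFrame({"student": ["Ben"], "dob": ["03/04/2007"]})  # 4 March 2007

  # Parse each source with its own explicit format before combining.
  school["dob"]  = pd.to_datetime(school["dob"],  format="%d/%m/%Y")
  us_form["dob"] = pd.to_datetime(us_form["dob"], format="%m/%d/%Y")

  combined = pd.concat([school, us_form], ignore_index=True)
  print(combined)   # both dates now share one unambiguous representation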

Entity Resolution

Goal: Identify records representing the same real-world entity across sources.

Methods

  1. Edit Distance
    • Compare strings (e.g., “Catrina Trewin” vs. “C. Trewin”).
  2. Normalization & Ontology
    • Replace abbreviations with standardized terms (e.g., “NY” → “New York”).
  3. Clustering & Partitioning
    • Group similar records (e.g., cluster misspelled names like “Fred” and “Freddy”).

Example:

  • Merging duplicate student records:

    ID | Name           | Address
    1  | Catrina Trewin | 123 Sample Road
    2  | C. Trewin      | 123 Sample Rd.

Solution: Use address similarity and name variants to resolve as one entity.
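
A minimal resolution sketch combining normalization with a string-similarity score, assuming Python; difflib.SequenceMatcher stands in for a dedicated edit-distance library, and the abbreviation table and thresholds are illustrative.

  from difflib import SequenceMatcher

  def similarity(a: str, b: str) -> float:
      """Similarity ratio in [0, 1]; 1.0 means identical strings."""
      return SequenceMatcher(None, a.lower(), b.lower()).ratio()

  # Tiny illustrative "ontology" of address abbreviations.
  ABBREV = {"rd.": "road", "st.": "street"}

  def normalize(addr: str) -> str:
      return " ".join(ABBREV.get(tok, tok) for tok in addr.lower().split())

  rec1 = {"name": "Catrina Trewin", "addr": "123 Sample Road"}
  rec2 = {"name": "C. Trewin",      "addr": "123 Sample Rd."}

  name_sim = similarity(rec1["name"], rec2["name"])
  addr_sim = similarity(normalize(rec1["addr"]), normalize(rec2["addr"]))

  # Assumed decision rule: same entity if the address matches and the name is close.
  if addr_sim > 0.9 and name_sim > 0.5:
      print("Likely the same student:", round(name_sim, 2), round(addr_sim, 2))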


Data Reduction

Why? Handle large datasets efficiently without losing analytical value.

Strategies

  1. Dimensionality Reduction
    • Remove irrelevant attributes (e.g., drop “Student_ID” for a gender-based analysis).
  2. Numerosity Reduction
    • Replace data with smaller representations (e.g., histograms, sampling).
  3. Data Compression
    • Use algorithms (e.g., ZIP) or lossless/lossy techniques (e.g., PCA for images).

Example:

  • Reducing a dataset of 10,000 student records to 1,000 via random sampling.
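
A sketch of that numerosity reduction step, assuming Python with pandas and synthetic records:

  import numpy as np
  import pandas as pd

  # Synthetic dataset of 10,000 student records.
  rng = np.random.default_rng(42)
  students = pd.DataFrame({
      "student_id": np.arange(10_000),
      "gpa": rng.uniform(0.0, 4.0, 10_000),
  })

  # Numerosity reduction: keep a 1,000-record simple random sample.
  sample = students.sample(n=1_000, random_state=42)
  print(len(sample))   # 1000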

Methods for Attribute Subset Selection

Objective: Find the minimal set of attributes preserving data utility.

Approaches

  1. Stepwise Forward Selection
    • Start empty, iteratively add the best attribute (e.g., add “GPA” first for grade prediction).
  2. Stepwise Backward Elimination
    • Start with all attributes, remove the worst iteratively (e.g., drop “Middle_Name” if irrelevant).
  3. Decision Tree Induction
    • Use tree splits to identify key attributes (e.g., “Income” and “Age” for loan approval).

Example:

  • Initial Attributes: {Age, GPA, Attendance, Extracurriculars}
  • Reduced Set (via Decision Tree): {GPA, Attendance}
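
A sketch of stepwise forward selection on that attribute set, assuming Python with scikit-learn (SequentialFeatureSelector) and synthetic data in which only GPA and Attendance drive the outcome; setting direction="backward" gives stepwise backward elimination instead.

  import numpy as np
  import pandas as pd
  from sklearn.feature_selection import SequentialFeatureSelector
  from sklearn.linear_model import LogisticRegression

  rng = np.random.default_rng(0)
  n = 200

  # Synthetic student data; only GPA and Attendance actually determine the label.
  X = pd.DataFrame({
      "Age": rng.integers(16, 19, n),
      "GPA": rng.uniform(0.0, 4.0, n),
      "Attendance": rng.uniform(0.5, 1.0, n),
      "Extracurriculars": rng.integers(0, 5, n),
  })
  y = ((X["GPA"] > 2.5) & (X["Attendance"] > 0.8)).astype(int)

  # Forward selection: start empty, add the most useful attribute each round.
  selector = SequentialFeatureSelector(
      LogisticRegression(max_iter=1000),
      n_features_to_select=2,
      direction="forward",
  )
  selector.fit(X, y)
  print(list(X.columns[selector.get_support()]))  # likely ['GPA', 'Attendance']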

Key Tools for Subset Selection

Method               | Use Case
Forward Selection    | Small datasets, clear top attributes
Backward Elimination | Large datasets, many weak predictors
Decision Trees       | Non-linear relationships, hierarchies

Note: Hybrid methods (e.g., forward + backward) balance precision and efficiency.


Data Transformation

Goal: Convert data into forms suitable for analysis while preserving meaning.

1. Normalization

Purpose: Equalize attribute scales to prevent bias in algorithms.

Methods

Type            | Formula                                        | Example
Min-Max         | v' = (v - min) / (max - min), scaled to [0, 1] | $73,600 → 0.716 (min $12,000, max $98,000)
Z-Score         | v' = (v - mean) / std_dev                      | $73,600 → 1.225 (mean $54,000, std dev $16,000)
Decimal Scaling | v' = v / 10^j, smallest j with max(|v'|) < 1   | 986 → 0.986

Use Cases:

  • Machine learning (e.g., neural networks, k-NN).
  • Distance-based clustering.
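
A sketch of both normalizations on the income example, assuming Python with pandas; the min/max and mean/std-dev values are assumed so that $73,600 maps to 0.716 and 1.225 as quoted above.

  import pandas as pd

  income = pd.Series([12_000, 54_000, 73_600, 98_000])  # illustrative values

  # Min-max normalization to [0, 1]: v' = (v - min) / (max - min)
  min_max = (income - income.min()) / (income.max() - income.min())
  print(round(min_max[2], 3))   # 0.716

  # Z-score normalization: v' = (v - mean) / std_dev,
  # assuming mean = $54,000 and std dev = $16,000 as in the example.
  z_score = (income - 54_000) / 16_000
  print(round(z_score[2], 3))   # 1.225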

2. Discretization

Purpose: Convert continuous data into intervals (categorical bins).

Techniques

  • Binning:
    • Equal-width: Divide range into fixed intervals (e.g., age groups 0–20, 21–40).
    • Equal-frequency: Each bin has same number of data points.
  • Entropy-Based:
    • Split data where information gain is maximized (used in decision trees).

Example (equal-frequency, three bins of four values each; a sketch follows below):

  • Original: [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
  • Binned: [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
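
A sketch contrasting equal-width and equal-frequency binning on the same values, assuming Python with pandas (pd.cut vs. pd.qcut):

  import pandas as pd

  values = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

  # Equal-width: three intervals of identical width over the range 4-34,
  # so the bins hold different numbers of values (3 / 4 / 5 here).
  equal_width = pd.cut(values, bins=3, labels=[1, 2, 3])

  # Equal-frequency: three bins with four values each, as in the example above.
  equal_freq = pd.qcut(values, q=3, labels=[1, 2, 3])

  print(list(equal_width))  # [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3]
  print(list(equal_freq))   # [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]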

3. Concept Hierarchy Generation

Purpose: Organize attributes into hierarchical levels for multi-level analysis.

Methods

  • Schema-Based: Pre-defined hierarchies (e.g., street < city < state < country).
  • Data-Driven: Auto-generate based on distinct values:
    country (15 values)  
    └── province (365 values)  
        └── city (3,567 values)  
            └── street (674,339 values)  
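
A sketch of using such a hierarchy to roll detailed records up to a coarser level, assuming Python with pandas and invented street/city/country data:

  import pandas as pd

  # Invented records at street level, with the hierarchy street < city < country.
  sales = pd.DataFrame({
      "street":  ["High St", "Main St", "Rue A",  "Rue B"],
      "city":    ["Leeds",   "Leeds",   "Lyon",   "Lyon"],
      "country": ["UK",      "UK",      "France", "France"],
      "amount":  [100, 150, 80, 120],
  })

  # Roll up one level of the hierarchy: street-level rows -> city-level totals.
  by_city = sales.groupby(["country", "city"], as_index=False)["amount"].sum()

  # Roll up further: city -> country.
  by_country = sales.groupby("country", as_index=False)["amount"].sum()
  print(by_country)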

Integration → Entity Resolution → Reduction → Subset Selection → Transformation: together these form a pipeline for clean, analysis-ready data.

Key Takeaways

  • Data wrangling is iterative and critical for accurate analysis.
  • Quality dimensions (completeness, accuracy, etc.) guide cleaning efforts.
  • Tools: Binning, normalization, entity resolution, and reduction techniques.

Further Reading:

  • Data Wrangling: Making Data Useful Again (Endel & Piringer).
  • Kaggle (www.kaggle.com) for hands-on practice.