Lecture 3: Data Wrangling


Data Science Process

  1. Ask an interesting question
    • Define scientific goals, predictions, or estimates.
  2. Get the data
    • Address sampling, relevance, and privacy.
  3. Explore the data
    • Plot data, identify anomalies/patterns.
  4. Model the data
    • Build, fit, and validate models.
  5. Communicate results
    • Visualize insights and tell a story.

Source: Harvard Data Science Course (Blitzstein & Pfister)


Data Quality Issues

  • Annual cost of bad data (US): $3.1 trillion (IBM).
  • Business impacts: 20% revenue loss, 54% cite data quality as top challenge.

Dimensions of Data Quality

  1. Completeness
    • Example: 98% completion rate for emergency contact numbers in a school database.
  2. Timeliness
    • Example: 3-day delay in updating a student’s contact info breaches SLA.
  3. Consistency
    • Example: Uniform date formats across school registers and databases.
  4. Validity
    • Example: Class identifiers (e.g., “AAA99”) must follow defined rules.
  5. Accuracy
    • Example: Incorrect date format (MM/DD vs DD/MM) leads to wrong age calculation.
  6. Uniqueness
    • Example: 96.2% uniqueness in student records due to duplicates (“Fred Smith” vs “Freddy Smith”).

For financial data, the highest-priority dimensions are typically:

  • Accuracy
  • Timeliness

Data Quality Assessment

  1. Identify which data items need to be assessed.
  2. Assess the relevant data quality dimensions (completeness, accuracy, etc.).
  3. Define values or ranges representing good- and bad-quality data (a minimal completeness check is sketched after this list).
  4. Review the results.
  5. Take corrective actions.
  6. Repeat the above on a periodic basis.
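
As a minimal illustration of steps 1-4, the sketch below (Python with pandas, both assumed rather than prescribed by the lecture) measures completeness against a 98% threshold for a hypothetical emergency-contact column; the threshold and column names are illustrative only.

  import pandas as pd

  # Hypothetical student table; column names and values are illustrative only.
  students = pd.DataFrame({
      "student_id": [1, 2, 3, 4, 5],
      "emergency_contact": ["555-0101", None, "555-0103", "555-0104", None],
  })

  # Steps 2-3: assess completeness and define what counts as "good" quality.
  completeness = students.notna().mean()   # share of non-null values per column
  THRESHOLD = 0.98                         # assumed acceptance threshold

  # Steps 4-5: review the results and flag columns needing corrective action.
  failing = completeness[completeness < THRESHOLD]
  print(failing)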

Data Cleaning/Wrangling

Definition: Transforming raw data into a usable format for analysis.


Common Problems & Solutions

Problem           | Causes                             | Solutions
Incomplete Data   | Missing values, truncated fields   | Ignore tuples, fill with mean/median, or predict values (sketched below).
Noisy Data        | Faulty instruments, entry errors   | Binning, outlier analysis, regression.
Inconsistent Data | Naming conventions, units, formats | Standardize formats, resolve conflicts.
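
A minimal sketch of the "fill with mean/median" fix for incomplete data, assuming Python with pandas and an invented numeric column:

  import numpy as np
  import pandas as pd

  df = pd.DataFrame({"score": [67, np.nan, 80, 91, np.nan, 74]})  # illustrative data

  # Option 1: ignore (drop) tuples with missing values.
  dropped = df.dropna()

  # Option 2: fill missing values with the column median (robust to outliers).
  df["score_filled"] = df["score"].fillna(df["score"].median())
  print(df)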

Binning Example

Original Data: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
After Binning (equal-frequency; values replaced by bin labels 1-3): 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
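
A sketch reproducing the example above with equal-frequency binning, assuming Python with pandas (pd.qcut); the last line shows smoothing by bin means, one common way to use the bins against noise.

  import pandas as pd

  data = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

  # Equal-frequency binning into three bins, labelled 1-3 as above.
  labels = pd.qcut(data, q=3, labels=[1, 2, 3])
  print(list(labels))   # [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]

  # Smoothing by bin means: replace each value with the mean of its bin.
  smoothed = data.groupby(labels).transform("mean")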

Outliers

  • Not always errors; detect via scatter plots.
  • Example: Temperature vs. cans sold plot shows outliers (65°F, 95°F).
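
The scatter-plot check can be backed by a simple numeric rule. The sketch below (Python with pandas; the data values are invented to mimic the temperature vs. cans-sold example) flags points outside 1.5 × IQR; a purely bivariate outlier such as the 95°F point is still easiest to spot on the scatter plot itself.

  import pandas as pd

  # Invented temperature vs. cans-sold data; not the lecture's actual figures.
  sales = pd.DataFrame({
      "temp_f":    [60, 62, 65, 70, 72, 75, 78, 80, 82, 95],
      "cans_sold": [20, 22, 80, 30, 33, 36, 40, 43, 45,  5],
  })

  # Flag cans_sold values outside 1.5 * IQR as candidate outliers.
  q1, q3 = sales["cans_sold"].quantile([0.25, 0.75])
  iqr = q3 - q1
  mask = (sales["cans_sold"] < q1 - 1.5 * iqr) | (sales["cans_sold"] > q3 + 1.5 * iqr)
  print(sales[mask])   # inspect before deciding whether these points are errors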

Advanced Wrangling Tasks

  1. Data Integration
    • Merge heterogeneous sources (e.g., schema, value, semantic differences).
  2. Entity Resolution
    • Link records representing the same entity (e.g., edit distance, clustering).
  3. Data Reduction
    • Dimensionality reduction: Remove irrelevant attributes.
    • Numerosity reduction: Use smaller data representations.
    • Data compression: Reduce size while preserving info.

Attribute Subset Selection

  • Methods: Forward selection, backward elimination, decision trees.

Data Transformation

  1. Normalization
    • Min-max: Scale to [0, 1] (e.g., $73,600 → 0.716).
    • Z-score: Shift by mean/std dev (e.g., $73,600 → 1.225).
  2. Discretization
    • Convert continuous data to intervals (e.g., binning, entropy-based splits).
  3. Concept Hierarchy
    • Organize attributes hierarchically (e.g., street < city < state < country).

Data Integration

  • Merging data from multiple sources, typically to provide a single unified view over all of them.
  • Integration can be physical or virtual:
    • Physical: copy the data into a warehouse.
    • Virtual: keep the data only at the sources and query it in place.
  • Issues: naming inconsistencies, aggregated attributes, redundant data.

The Need for Data Integration

Challenge: Heterogeneity across data sources complicates merging.

Types of Heterogeneity

  1. Source Type
    • Different storage systems (e.g., SQL vs. NoSQL).
  2. Communication
    • Varied access methods (APIs, query languages, web interfaces).
  3. Schema
    • Structural differences in tables (e.g., column names/order).
  4. Data Type
    • Same data stored differently (e.g., phone numbers as string vs. integer).
  5. Value
    • Logical values represented inconsistently (e.g., “Prof.” vs. “Professor”).
  6. Semantic
    • Same column names with different meanings (e.g., “title” as job title vs. person title).

Example:

  • Integrating student records from a school database (DD/MM/YYYY dates) and a US application form (MM/DD/YYYY dates).
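
A sketch of reconciling the two date conventions before merging, assuming Python with pandas; table and column names are hypothetical.

  import pandas as pd

  # School database uses DD/MM/YYYY; the US application form uses MM/DD/YYYY.
  school  = pd.DataFrame({"student": ["Ana"], "dob": ["03/04/2007"]})  # 3 April 2007
  us_form = pd.DataFrame({"student": ["Ben"], "dob": ["03/04/2007"]})  # 4 March 2007

  # Parse each source with its own explicit format before combining.
  school["dob"]  = pd.to_datetime(school["dob"],  format="%d/%m/%Y")
  us_form["dob"] = pd.to_datetime(us_form["dob"], format="%m/%d/%Y")

  combined = pd.concat([school, us_form], ignore_index=True)
  print(combined)   # both dates now share one unambiguous representation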

Entity Resolution

Goal: Identify records representing the same real-world entity across sources.

Methods

  1. Edit Distance
    • Compare strings (e.g., “Catrina Trewin” vs. “C. Trewin”).
  2. Normalization & Ontology
    • Replace abbreviations with standardized terms (e.g., “NY” → “New York”).
  3. Clustering & Partitioning
    • Group similar records (e.g., cluster misspelled names like “Fred” and “Freddy”).

Example:

  • Merging duplicate student records:

    ID | Name           | Address
    1  | Catrina Trewin | 123 Sample Road
    2  | C. Trewin      | 123 Sample Rd.

Solution: Use address similarity and name variants to resolve as one entity.
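
A minimal resolution sketch combining normalization with a string-similarity score, assuming Python; difflib.SequenceMatcher stands in for a dedicated edit-distance library, and the abbreviation table and thresholds are illustrative.

  from difflib import SequenceMatcher

  def similarity(a: str, b: str) -> float:
      """Similarity ratio in [0, 1]; 1.0 means identical strings."""
      return SequenceMatcher(None, a.lower(), b.lower()).ratio()

  # Tiny illustrative "ontology" of address abbreviations.
  ABBREV = {"rd.": "road", "st.": "street"}

  def normalize(addr: str) -> str:
      return " ".join(ABBREV.get(tok, tok) for tok in addr.lower().split())

  rec1 = {"name": "Catrina Trewin", "addr": "123 Sample Road"}
  rec2 = {"name": "C. Trewin",      "addr": "123 Sample Rd."}

  name_sim = similarity(rec1["name"], rec2["name"])
  addr_sim = similarity(normalize(rec1["addr"]), normalize(rec2["addr"]))

  # Assumed decision rule: same entity if the address matches and the name is close.
  if addr_sim > 0.9 and name_sim > 0.5:
      print("Likely the same student:", round(name_sim, 2), round(addr_sim, 2))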


Data Reduction

Why? Handle large datasets efficiently without losing analytical value.

Strategies

  1. Dimensionality Reduction
    • Remove irrelevant attributes (e.g., drop “Student_ID” for a gender-based analysis).
  2. Numerosity Reduction
    • Replace data with smaller representations (e.g., histograms, sampling).
  3. Data Compression
    • Use algorithms (e.g., ZIP) or lossless/lossy techniques (e.g., PCA for images).

Example:

  • Reducing a dataset of 10,000 student records to 1,000 via random sampling.
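
A sketch of that numerosity reduction step, assuming Python with pandas and synthetic records:

  import numpy as np
  import pandas as pd

  # Synthetic dataset of 10,000 student records.
  rng = np.random.default_rng(42)
  students = pd.DataFrame({
      "student_id": np.arange(10_000),
      "gpa": rng.uniform(0.0, 4.0, 10_000),
  })

  # Numerosity reduction: keep a 1,000-record simple random sample.
  sample = students.sample(n=1_000, random_state=42)
  print(len(sample))   # 1000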

Methods for Attribute Subset Selection

Objective: Find the minimal set of attributes preserving data utility.

Approaches

  1. Stepwise Forward Selection
    • Start empty, iteratively add the best attribute (e.g., add “GPA” first for grade prediction).
  2. Stepwise Backward Elimination
    • Start with all attributes, remove the worst iteratively (e.g., drop “Middle_Name” if irrelevant).
  3. Decision Tree Induction
    • Use tree splits to identify key attributes (e.g., “Income” and “Age” for loan approval).

Example:

  • Initial Attributes: {Age, GPA, Attendance, Extracurriculars}
  • Reduced Set (via Decision Tree): {GPA, Attendance}
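
A sketch of stepwise forward selection on that attribute set, assuming Python with scikit-learn (SequentialFeatureSelector) and synthetic data in which only GPA and Attendance drive the outcome; setting direction="backward" gives stepwise backward elimination instead.

  import numpy as np
  import pandas as pd
  from sklearn.feature_selection import SequentialFeatureSelector
  from sklearn.linear_model import LogisticRegression

  rng = np.random.default_rng(0)
  n = 200

  # Synthetic student data; only GPA and Attendance actually determine the label.
  X = pd.DataFrame({
      "Age": rng.integers(16, 19, n),
      "GPA": rng.uniform(0.0, 4.0, n),
      "Attendance": rng.uniform(0.5, 1.0, n),
      "Extracurriculars": rng.integers(0, 5, n),
  })
  y = ((X["GPA"] > 2.5) & (X["Attendance"] > 0.8)).astype(int)

  # Forward selection: start empty, add the most useful attribute each round.
  selector = SequentialFeatureSelector(
      LogisticRegression(max_iter=1000),
      n_features_to_select=2,
      direction="forward",
  )
  selector.fit(X, y)
  print(list(X.columns[selector.get_support()]))  # likely ['GPA', 'Attendance']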

Key Tools for Subset Selection

Method               | Use Case
Forward Selection    | Small datasets, clear top attributes
Backward Elimination | Large datasets, many weak predictors
Decision Trees       | Non-linear relationships, hierarchies

Note: Hybrid methods (e.g., forward + backward) balance precision and efficiency.


Data Transformation

Goal: Convert data into forms suitable for analysis while preserving meaning.

1. Normalization

Purpose: Equalize attribute scales to prevent bias in algorithms.

Methods

Type            | Formula                                        | Example
Min-Max         | v' = (v - min) / (max - min), scaled to [0, 1] | $73,600 → 0.716 (min $12,000, max $98,000)
Z-Score         | v' = (v - mean) / std_dev                      | $73,600 → 1.225 (mean $54,000, std dev $16,000)
Decimal Scaling | v' = v / 10^j, smallest j with max(|v'|) < 1   | 986 → 0.986

Use Cases:

  • Machine learning (e.g., neural networks, k-NN).
  • Distance-based clustering.
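
A sketch of both normalizations on the income example, assuming Python with pandas; the min/max and mean/std-dev values are assumed so that $73,600 maps to 0.716 and 1.225 as quoted above.

  import pandas as pd

  income = pd.Series([12_000, 54_000, 73_600, 98_000])  # illustrative values

  # Min-max normalization to [0, 1]: v' = (v - min) / (max - min)
  min_max = (income - income.min()) / (income.max() - income.min())
  print(round(min_max[2], 3))   # 0.716

  # Z-score normalization: v' = (v - mean) / std_dev,
  # assuming mean = $54,000 and std dev = $16,000 as in the example.
  z_score = (income - 54_000) / 16_000
  print(round(z_score[2], 3))   # 1.225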

2. Discretization

Purpose: Convert continuous data into intervals (categorical bins).

Techniques

  • Binning:
    • Equal-width: Divide range into fixed intervals (e.g., age groups 0–20, 21–40).
    • Equal-frequency: Each bin has same number of data points.
  • Entropy-Based:
    • Split data where information gain is maximized (used in decision trees).

Example (equal-frequency, three bins of four values each; a sketch follows below):

  • Original: [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
  • Binned: [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
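
A sketch contrasting equal-width and equal-frequency binning on the same values, assuming Python with pandas (pd.cut vs. pd.qcut):

  import pandas as pd

  values = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

  # Equal-width: three intervals of identical width over the range 4-34,
  # so the bins hold different numbers of values (3 / 4 / 5 here).
  equal_width = pd.cut(values, bins=3, labels=[1, 2, 3])

  # Equal-frequency: three bins with four values each, as in the example above.
  equal_freq = pd.qcut(values, q=3, labels=[1, 2, 3])

  print(list(equal_width))  # [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3]
  print(list(equal_freq))   # [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]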

3. Concept Hierarchy Generation

Purpose: Organize attributes into hierarchical levels for multi-level analysis.

Methods

  • Schema-Based: Pre-defined hierarchies (e.g., street < city < state < country).
  • Data-Driven: Auto-generate based on distinct values:
    country (15 values)  
    └── province (365 values)  
        └── city (3,567 values)  
            └── street (674,339 values)  
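
A sketch of using such a hierarchy to roll detailed records up to a coarser level, assuming Python with pandas and invented street/city/country data:

  import pandas as pd

  # Invented records at street level, with the hierarchy street < city < country.
  sales = pd.DataFrame({
      "street":  ["High St", "Main St", "Rue A",  "Rue B"],
      "city":    ["Leeds",   "Leeds",   "Lyon",   "Lyon"],
      "country": ["UK",      "UK",      "France", "France"],
      "amount":  [100, 150, 80, 120],
  })

  # Roll up one level of the hierarchy: street-level rows -> city-level totals.
  by_city = sales.groupby(["country", "city"], as_index=False)["amount"].sum()

  # Roll up further: city -> country.
  by_country = sales.groupby("country", as_index=False)["amount"].sum()
  print(by_country)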

Integration → Entity Resolution → Reduction → Subset Selection → Transformation: together these form a pipeline for clean, analysis-ready data.

Key Takeaways

  • Data wrangling is iterative and critical for accurate analysis.
  • Quality dimensions (completeness, accuracy, etc.) guide cleaning efforts.
  • Tools: Binning, normalization, entity resolution, and reduction techniques.

Further Reading:

  • Data Wrangling: Making Data Useful Again (Endel & Piringer).
  • Kaggle (www.kaggle.com) for hands-on practice.