Lecture 3: Data Wrangling
Data Science Process
- Ask an interesting question
- Define scientific goals, predictions, or estimates.
- Get the data
- Address sampling, relevance, and privacy.
- Explore the data
- Plot data, identify anomalies/patterns.
- Model the data
- Build, fit, and validate models.
- Communicate results
- Visualize insights and tell a story.
Source: Harvard Data Science Course (Blitzstein & Pfister)
Data Quality Issues
- Annual cost of bad data (US): $3.1 trillion (IBM).
- Business impacts: 20% revenue loss, 54% cite data quality as top challenge.
Dimensions of Data Quality
- Completeness
- Example: 98% completion rate for emergency contact numbers in a school database.
- Timeliness
- Example: 3-day delay in updating a student’s contact info breaches SLA.
- Consistency
- Example: Uniform date formats across school registers and databases.
- Validity
- Example: Class identifiers (e.g., “AAA99”) must follow defined rules.
- Accuracy
- Example: Incorrect date format (MM/DD vs DD/MM) leads to wrong age calculation.
- Uniqueness
- Example: 96.2% uniqueness in student records due to duplicates (“Fred Smith” vs “Freddy Smith”).
For financial data, the priority dimensions are typically:
- Accuracy
- Timeliness
Data Quality Assessment
- Identify which data items need to be assessed.
- Decide which data quality dimensions to assess them against (a metrics sketch follows this list).
- Define values or ranges that represent good- and bad-quality data.
- Review the results.
- Take corrective action.
- Repeat the above on a periodic basis.
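The assessment steps above can be partly automated. Below is a minimal sketch, assuming a small hypothetical `students` table in pandas; the columns, rules, and thresholds are placeholders for illustration, not prescribed values.

```python
import pandas as pd

# Hypothetical student records, used only for illustration.
students = pd.DataFrame({
    "student_id": [1, 2, 2, 4],
    "name": ["Fred Smith", "Freddy Smith", "Freddy Smith", "Ana Lee"],
    "emergency_contact": ["555-0101", None, None, "555-0199"],
})

# Completeness: share of non-missing emergency contact numbers.
completeness = students["emergency_contact"].notna().mean()

# Uniqueness: share of rows that are not duplicates of an earlier row.
uniqueness = 1 - students.duplicated(subset=["student_id", "name"]).mean()

# Validity: do identifiers follow a defined rule (here: positive integers)?
validity = (students["student_id"] > 0).mean()

print(f"completeness={completeness:.1%} uniqueness={uniqueness:.1%} validity={validity:.1%}")
```

Reviewing these metrics against agreed thresholds (e.g., "completeness must exceed 98%") is what turns the numbers into corrective actions.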
Data Cleaning/Wrangling
Definition: Transforming raw data into a usable format for analysis.
Common Problems & Solutions
| Problem | Causes | Solutions |
|---|---|---|
| Incomplete Data | Missing values, truncated fields | Ignore tuples, fill with mean/median, or predict values. |
| Noisy Data | Faulty instruments, entry errors | Binning, outlier analysis, regression. |
| Inconsistent Data | Naming conventions, units, formats | Standardize formats, resolve conflicts. |
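A short sketch of the incomplete-data options from the table, using pandas on a hypothetical `score` column; names and values are invented for illustration.

```python
import pandas as pd
import numpy as np

# Hypothetical grades table with one missing score.
df = pd.DataFrame({"student": ["A", "B", "C", "D"],
                   "score": [78.0, np.nan, 91.0, 64.0]})

# Option 1: ignore (drop) tuples with missing values.
dropped = df.dropna(subset=["score"])

# Option 2: fill the gap with the column mean (median is the robust alternative).
df["score_filled"] = df["score"].fillna(df["score"].mean())
print(df)
```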
Binning Example
Original Data: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
After Binning (bin labels, equal-frequency bins of 4 values): 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
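Assuming the equal-frequency reading of the labels above, `pandas.qcut` reproduces them directly:

```python
import pandas as pd

values = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

# Equal-frequency binning into 3 bins of 4 values each, labelled 1..3.
bins = pd.qcut(values, q=3, labels=[1, 2, 3])
print(list(bins))  # [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
```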
Outliers
- Not always errors; detect via scatter plots.
- Example: Temperature vs. cans sold plot shows outliers (65°F, 95°F).
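The slide detects outliers visually; as a complementary numeric check (an assumption, not the slide's method), the IQR rule flags the same kind of points in a hypothetical temperature vs. cans-sold table.

```python
import pandas as pd

# Hypothetical temperature vs. cans-sold observations.
df = pd.DataFrame({"temp_f": [70, 72, 75, 78, 80, 65, 95],
                   "cans_sold": [200, 210, 230, 250, 260, 400, 30]})

# IQR rule on the sales column: points far outside the interquartile range
# are candidates for manual review, not automatic deletion.
q1, q3 = df["cans_sold"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["cans_sold"] < q1 - 1.5 * iqr) | (df["cans_sold"] > q3 + 1.5 * iqr)]
print(outliers)  # flags the 65°F and 95°F rows
```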
Advanced Wrangling Tasks
- Data Integration
- Merge heterogeneous sources (e.g., schema, value, semantic differences).
- Entity Resolution
- Link records representing the same entity (e.g., edit distance, clustering).
- Data Reduction
- Dimensionality reduction: Remove irrelevant attributes.
- Numerosity reduction: Use smaller data representations.
- Data compression: Reduce size while preserving info.
Attribute Subset Selection
- Methods: Forward selection, backward elimination, decision trees.
Data Transformation
- Normalization
- Min-max: Scale to [0, 1] (e.g., $73,600 → 0.716).
- Z-score: Shift by mean/std dev (e.g., $73,600 → 1.225).
- Discretization
- Convert continuous data to intervals (e.g., binning, entropy-based splits).
- Concept Hierarchy
- Organize attributes hierarchically (e.g., street < city < state < country).
Data Integration
- Merging data from multiple sources, usually to provide a single view over all of them.
- Integration can be physical or virtual
- Physical: Copy the data into a warehouse.
- Virtual: Keep the data only at the sources.
- Issues: Naming inconsistencies, aggregated attributes, redundant data.
The Need for Data Integration
Challenge: Heterogeneity across data sources complicates merging.
Types of Heterogeneity
- Source Type
- Different storage systems (e.g., SQL vs. NoSQL).
- Communication
- Varied access methods (APIs, query languages, web interfaces).
- Schema
- Structural differences in tables (e.g., column names/order).
- Data Type
- Same data stored differently (e.g., phone numbers as string vs. integer).
- Value
- Logical values represented inconsistently (e.g., “Prof.” vs. “Professor”).
- Semantic
- Same column names with different meanings (e.g., “title” as job title vs. person title).
Example:
- Integrating student records from a school database (DD/MM/YYYY dates) and a US application form (MM/DD/YYYY dates).
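A sketch of harmonizing the two date formats before combining the sources, assuming each extract has a `dob` column; the names and values are hypothetical.

```python
import pandas as pd

# Hypothetical extracts from the two sources.
school = pd.DataFrame({"student": ["C. Trewin"], "dob": ["03/04/2007"]})   # DD/MM/YYYY
us_app = pd.DataFrame({"student": ["F. Smith"], "dob": ["04/03/2007"]})    # MM/DD/YYYY

# Parse each source with its own format, producing one canonical datetime column.
school["dob"] = pd.to_datetime(school["dob"], format="%d/%m/%Y")
us_app["dob"] = pd.to_datetime(us_app["dob"], format="%m/%d/%Y")

combined = pd.concat([school, us_app], ignore_index=True)
print(combined)
```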
Entity Resolution
Goal: Identify records representing the same real-world entity across sources.
Methods
- Edit Distance
- Compare strings (e.g., “Catrina Trewin” vs. “C. Trewin”).
- Normalization & Ontology
- Replace abbreviations with standardized terms (e.g., “NY” → “New York”).
- Clustering & Partitioning
- Group similar records (e.g., cluster misspelled names like “Fred” and “Freddy”).
Example:
- Merging duplicate student records:

| ID | Name | Address |
|---|---|---|
| 1 | Catrina Trewin | 123 Sample Road |
| 2 | C. Trewin | 123 Sample Rd. |

Solution: Use address similarity and name variants to resolve these records as one entity.
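A rough sketch of that resolution logic, using the standard library's `difflib` similarity as a stand-in for edit distance and a tiny hand-made abbreviation map as the "ontology"; the threshold and helper functions are illustrative assumptions.

```python
from difflib import SequenceMatcher

def similar(a: str, b: str, threshold: float = 0.6) -> bool:
    """Rough string similarity; a stand-in for a true edit-distance measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

ABBREVIATIONS = {"rd.": "road", "st.": "street"}  # tiny ontology for illustration

def normalise(addr: str) -> str:
    """Lower-case the address and expand known abbreviations."""
    return " ".join(ABBREVIATIONS.get(w.lower(), w.lower()) for w in addr.split())

# Records from the duplicate-student example above.
rec1 = {"name": "Catrina Trewin", "address": "123 Sample Road"}
rec2 = {"name": "C. Trewin", "address": "123 Sample Rd."}

same_entity = (similar(rec1["name"], rec2["name"])
               and normalise(rec1["address"]) == normalise(rec2["address"]))
print(same_entity)  # True: names are similar and addresses match after normalisation
```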
Data Reduction
Why? Handle large datasets efficiently without losing analytical value.
Strategies
- Dimensionality Reduction
- Remove irrelevant attributes (e.g., drop “Student_ID” for a gender-based analysis).
- Numerosity Reduction
- Replace data with smaller representations (e.g., histograms, sampling).
- Data Compression
- Apply lossless compression (e.g., ZIP) or lossy techniques (e.g., PCA for images).
Example:
- Reducing a dataset of 10,000 student records to 1,000 via random sampling.
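A one-line sketch of that numerosity reduction with `pandas.DataFrame.sample`, on synthetic records:

```python
import pandas as pd
import numpy as np

# Hypothetical dataset of 10,000 student records.
students = pd.DataFrame({"student_id": np.arange(10_000),
                         "gpa": np.random.default_rng(0).uniform(1.0, 4.0, 10_000)})

# Numerosity reduction by simple random sampling down to 1,000 rows.
sample = students.sample(n=1_000, random_state=0)
print(len(sample))  # 1000
```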
Methods for Attribute Subset Selection
Objective: Find the minimal set of attributes preserving data utility.
Approaches
- Stepwise Forward Selection
- Start empty, iteratively add the best attribute (e.g., add “GPA” first for grade prediction).
- Stepwise Backward Elimination
- Start with all attributes, remove the worst iteratively (e.g., drop “Middle_Name” if irrelevant).
- Decision Tree Induction
- Use tree splits to identify key attributes (e.g., “Income” and “Age” for loan approval).
Example:
- Initial Attributes: {Age, GPA, Attendance, Extracurriculars}
- Reduced Set (via Decision Tree): {GPA, Attendance}
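A sketch of decision-tree-based selection for the example above, using scikit-learn on synthetic data constructed so that only GPA and Attendance drive the label; the dataset, label rule, and tree depth are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 500

# Synthetic student data: only GPA and Attendance actually determine the outcome.
X = pd.DataFrame({
    "Age": rng.integers(17, 25, n),
    "GPA": rng.uniform(1.0, 4.0, n),
    "Attendance": rng.uniform(0.5, 1.0, n),
    "Extracurriculars": rng.integers(0, 5, n),
})
y = ((X["GPA"] > 2.5) & (X["Attendance"] > 0.75)).astype(int)  # hypothetical "passes" label

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Rank attributes by how much each contributes to the tree's splits.
importance = pd.Series(tree.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importance)  # GPA and Attendance should dominate; Age/Extracurriculars near zero
```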
Key Tools for Subset Selection
| Method | Use Case |
|---|---|
| Forward Selection | Small datasets, clear top attributes |
| Backward Elimination | Large datasets, many weak predictors |
| Decision Trees | Non-linear relationships, hierarchies |
Note: Hybrid methods (e.g., forward + backward) balance precision and efficiency.
Data Transformation
Goal: Convert data into forms suitable for analysis while preserving meaning.
1. Normalization
Purpose: Equalize attribute scales to prevent bias in algorithms.
Methods
| Type | Formula | Example |
|---|---|---|
| Min-Max | v' = (v − min_A) / (max_A − min_A), for a [0, 1] target range | $73,600 → 0.716 |
| Z-Score | v' = (v − mean_A) / std_A | $73,600 → 1.225 |
| Decimal Scaling | v' = v / 10^j, with the smallest j such that max(\|v'\|) < 1 | 986 → 0.986 |
Use Cases:
- Machine learning (e.g., neural networks, k-NN).
- Distance-based clustering.
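A small sketch reproducing the income example; the attribute minimum, maximum, mean, and standard deviation below are assumed values, chosen so the results match the quoted 0.716 and 1.225.

```python
value = 73_600

# Min-max normalisation to [0, 1]: (v - min) / (max - min)
min_a, max_a = 12_000, 98_000           # assumed attribute range
min_max = (value - min_a) / (max_a - min_a)

# Z-score normalisation: (v - mean) / std_dev
mean_a, std_a = 54_000, 16_000          # assumed mean and standard deviation
z_score = (value - mean_a) / std_a

# Decimal scaling: divide by 10^j so that |v'| < 1
j = 5
decimal_scaled = value / 10 ** j

print(f"min-max={min_max:.3f}  z-score={z_score:.3f}  decimal={decimal_scaled}")
# min-max=0.716  z-score=1.225  decimal=0.736
```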
2. Discretization
Purpose: Convert continuous data into intervals (categorical bins).
Techniques
- Binning:
- Equal-width: Divide range into fixed intervals (e.g., age groups 0–20, 21–40).
- Equal-frequency: Each bin has same number of data points.
- Entropy-Based:
- Split data where information gain is maximized (used in decision trees).
Example:
- Original: [4, 8, 9, 15, 21, 24, 25, 26, 28, 29, 34]
- Binned: [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3]
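A minimal sketch of an entropy-based split: given attribute values and (invented) class labels, choose the cut point that maximizes information gain, as a decision tree would.

```python
import numpy as np

def entropy(labels: np.ndarray) -> float:
    """Shannon entropy of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def best_split(values: np.ndarray, labels: np.ndarray) -> float:
    """Return the cut point that maximises information gain (entropy reduction)."""
    order = np.argsort(values)
    values, labels = values[order], labels[order]
    base = entropy(labels)
    best_gain, best_cut = -1.0, values[0]
    for i in range(1, len(values)):
        cut = (values[i - 1] + values[i]) / 2
        left, right = labels[:i], labels[i:]
        cond = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        gain = base - cond
        if gain > best_gain:
            best_gain, best_cut = gain, cut
    return best_cut

# Toy data: low values mostly class 0, high values mostly class 1.
ages = np.array([4, 8, 9, 15, 21, 24, 25, 26, 28, 29, 34])
cls  = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
print(best_split(ages, cls))  # cut between 21 and 24, i.e. 22.5
```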
3. Concept Hierarchy Generation
Purpose: Organize attributes into hierarchical levels for multi-level analysis.
Methods
- Schema-Based: Pre-defined hierarchies (e.g., street < city < state < country).
- Data-Driven: Auto-generate based on the number of distinct values per attribute:
  - country (15 distinct values)
  - province (365 distinct values)
  - city (3,567 distinct values)
  - street (674,339 distinct values)
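A data-driven hierarchy can be approximated by ordering attributes by their number of distinct values; here is a sketch on a tiny hypothetical `locations` table.

```python
import pandas as pd

# Hypothetical location table; a data-driven hierarchy orders columns
# from fewest distinct values (top level) to most (bottom level).
locations = pd.DataFrame({
    "country":  ["US", "US", "US", "UK", "UK"],
    "province": ["CA", "CA", "NY", "Kent", "Kent"],
    "city":     ["LA", "SF", "NYC", "Dover", "Dover"],
    "street":   ["1st Ave", "2nd Ave", "3rd Ave", "High St", "Low St"],
})

hierarchy = locations.nunique().sort_values()
print(list(hierarchy.index))  # ['country', 'province', 'city', 'street']
```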
Integration → Resolution → Reduction → Subset Selection forms a pipeline for clean, analysis-ready data.
Key Takeaways
- Data wrangling is iterative and critical for accurate analysis.
- Quality dimensions (completeness, accuracy, etc.) guide cleaning efforts.
- Tools: Binning, normalization, entity resolution, and reduction techniques.
Further Reading:
- Data Wrangling: Making Data Useful Again (Endel & Piringer).
- Kaggle (www.kaggle.com) for hands-on practice.