Lecture 1: Introduction and Overview
Data Science Fundamentals
Course Code: CDS6214
Big Data
Big Data is used for real-world insights:
- Big Data in Retail: How video analytics improve customer experience.
- Big Data in Telecommunications: Dynamic insights from Telefónica.
- Big Data in Healthcare: Transforming the healthcare industry.
- Big Data in Hospitality: Enhancing customer service and operations.
Big Data is defined as data whose scale, distribution, diversity, and/or timeliness require new technical architectures and analytics to unlock new sources of business value.
Types of Data Structures
- Structured Data: Data with a defined type, format, and structure (e.g., OLAP cubes, RDBMS tables, CSV files, spreadsheets).
- Semi-Structured Data: Text with a discernible pattern that enables parsing (e.g., XML files).
- Quasi-Structured Data: Text in inconsistent formats that can be structured only with extra processing (e.g., web clickstream data).
- Unstructured Data: No inherent structure (e.g., PDFs, images, videos).
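As a rough illustration of the first three categories, the sketch below reads one tiny sample of each using only the Python standard library. The sample records, field names, and log-line layout are invented for this example and are not taken from any real system.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET

# Structured: a fixed schema, so every row has the same named columns.
csv_text = "customer_id,amount\n1,19.90\n2,5.00\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
total = sum(float(r["amount"]) for r in rows)

# Semi-structured: tags describe the data, but the contents can vary per record.
xml_text = "<orders><order id='1'><item>book</item></order><order id='2'/></orders>"
root = ET.fromstring(xml_text)
items_per_order = {o.get("id"): [i.text for i in o.findall("item")] for o in root}

# Quasi-structured: a pattern exists but must be recovered, here with a regex.
# Lines that do not fit the assumed layout are simply skipped.
click_lines = [
    "2024-01-05 10:00:01 GET /products?id=42",
    "garbled line without a request",
    "2024-01-05 10:00:07 GET /cart",
]
pattern = re.compile(r"^(\S+ \S+) GET (\S+)$")
paths = [m.group(2) for line in click_lines if (m := pattern.match(line))]

print(total, items_per_order, paths)
```

Unstructured data (images, video, free-form PDFs) has no comparable one-step parse; it usually needs a dedicated extraction step before any analysis of this kind.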
Growth of Big Data
Key Enablers:
- Increase in storage capacities
- Increase in processing power
- Availability of data
The volume of data worldwide is expected to grow exponentially from 2021 to 2025.
5 Vs of Big Data
- Volume – Large amounts of data generated every second.
- Velocity – The speed at which data is created and processed.
- Variety – Different types of data (structured, semi-structured, quasi-structured, unstructured).
- Veracity – Trustworthiness and quality of data.
- Value – How data creates actionable insights.
Insights from Big Data
Big Data has been leveraged in various industries:
- Facebook: Location-based analytics for friend suggestions and migration patterns.
- Target: Predictive modeling to determine customer behavior (e.g., pregnancy predictions).
- Tesco: Refrigerator data analytics for proactive maintenance and energy savings.
- Macy’s: Real-time price adjustments based on demand and inventory.
- Siemens: Sensor-based analytics for predictive maintenance in trains.
- Google Flu Trends: Using search data to predict flu outbreaks.
How to Extract Insights from Big Data
- Machine learning and AI play a crucial role in uncovering patterns.
- The data science process involves exploration, prediction, and inference.
What is Data Science?
Data Science is the process of drawing useful conclusions from large and diverse datasets using:
- Exploration – Identifying patterns via visualizations and descriptive statistics.
- Prediction – Making informed guesses about unknown values using machine learning.
- Inference – Quantifying certainty through statistical models.
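The three activities can be sketched on one toy dataset. The numbers below are invented for illustration (a hypothetical spend-versus-sales series). A least-squares line and a bootstrap percentile interval stand in for "machine learning" and "statistical models"; they are just one simple choice among many.

```python
import random
import statistics

# Hypothetical data: advertising spend (x) and sales (y), invented for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.1]

# Exploration: descriptive statistics summarize what the data looks like.
mean_y = statistics.mean(y)
stdev_y = statistics.stdev(y)

def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((a - mx) ** 2 for a in xs)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

# Prediction: fit on known data, then estimate y for an unseen x.
slope, intercept = fit_line(x, y)
predicted = slope * 7.0 + intercept

# Inference: bootstrap resampling gives a rough 95% interval for the slope.
rng = random.Random(0)  # fixed seed so the run is repeatable
slopes = []
for _ in range(1000):
    idx = [rng.randrange(len(x)) for _ in x]
    xs = [x[i] for i in idx]
    if len(set(xs)) < 2:
        continue  # degenerate resample: slope is undefined
    slopes.append(fit_line(xs, [y[i] for i in idx])[0])
slopes.sort()
low = slopes[int(0.025 * len(slopes))]
high = slopes[int(0.975 * len(slopes))]

print(f"mean={mean_y:.2f} stdev={stdev_y:.2f}")
print(f"predicted y at x=7: {predicted:.2f}")
print(f"slope={slope:.3f}, approx. 95% interval=({low:.3f}, {high:.3f})")
```

The interval is what separates inference from prediction here: it says how much the fitted slope could plausibly vary, not just what its single best estimate is.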
Difference Between BI and Data Science
- Business Intelligence (BI): Focuses on past performance (reporting and dashboards).
- Data Science: Focuses on the future (predictive analytics and machine learning to support decision-making).
Data Science in Academia vs. Industry
- Academic research focuses on new methodologies.
- Industry applies existing methods for business applications.
Roles in Data Science
Who is a Data Scientist?
“A data scientist is someone who is better at statistics than any software engineer and better at software engineering than any statistician.” – Josh Wills
Technical Skills Required:
- Mathematics: Linear algebra, calculus, probability
- Statistics: Hypothesis testing, summary statistics
- Machine Learning: k-NN, random forests and other ensemble methods
- Software Engineering: Distributed computing, algorithms, data structures
- Big Data Technologies: Spark, Hadoop, Hive, Pig
- Programming Languages: Python, R, SQL
- Data Visualization: Effective communication of insights
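To give a feel for the machine learning item in the list above, here is a minimal k-nearest-neighbours (k-NN) classifier in plain Python. The training points, labels, and the function name `knn_predict` are made up for illustration; a real project would more likely use an established library.

```python
import math
from collections import Counter

# Hypothetical labelled points: (feature vector, label), invented for illustration.
training = [
    ((1.0, 1.0), "low"),
    ((1.5, 2.0), "low"),
    ((2.0, 1.0), "low"),
    ((6.0, 6.5), "high"),
    ((7.0, 6.0), "high"),
    ((6.5, 7.5), "high"),
]

def knn_predict(point, data, k=3):
    """Return the majority label among the k training points closest to `point`."""
    nearest = sorted(data, key=lambda item: math.dist(point, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_predict((1.2, 1.4), training))  # prints "low"
print(knn_predict((6.8, 6.9), training))  # prints "high"
```

The sketch touches several of the listed skills at once: distance computation (mathematics), majority voting (statistics), and sorting plus counting (algorithms and data structures).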
Business Skills:
- Analytical problem-solving
- Effective communication
- Intellectual curiosity
- Industry knowledge
Responsibilities of a Data Scientist:
- Conduct research and frame open-ended industry questions.
- Extract and clean large volumes of data.
- Identify trends and hidden insights.
- Develop predictive models and automation tools.
- Communicate findings to stakeholders.
Data Science Roles Comparison
| Role | Focus |
|---|---|
| Data Analyst | Reporting, SQL queries, dashboards |
| Data Engineer | Building data pipelines, managing infrastructure |
| Data Scientist | Advanced analytics, machine learning, modeling |
Challenges in Data Science
- Validity of assumptions
- Overgeneralization of models
- Effective communication of insights
- Transitioning from prototypes to production-ready systems
- Ensuring data integrity in pipelines
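As one small example of the last challenge, a pipeline stage can check each record before passing it on. The sketch below assumes a hypothetical schema (an integer `id` and a non-negative `amount`); the field names, rules, and the `validate` function are illustrative only.

```python
# Hypothetical schema: each record needs a unique integer id and a non-negative amount.
def validate(records):
    """Split records into (valid, rejected); rejected holds (record, reason) pairs."""
    valid, rejected, seen_ids = [], [], set()
    for rec in records:
        rec_id, amount = rec.get("id"), rec.get("amount")
        if not isinstance(rec_id, int):
            rejected.append((rec, "missing or non-integer id"))
        elif rec_id in seen_ids:
            rejected.append((rec, "duplicate id"))
        elif not isinstance(amount, (int, float)) or amount < 0:
            rejected.append((rec, "missing or negative amount"))
        else:
            seen_ids.add(rec_id)
            valid.append(rec)
    return valid, rejected

batch = [
    {"id": 1, "amount": 10.0},
    {"id": 1, "amount": 12.5},   # duplicate id
    {"id": 2, "amount": -3.0},   # negative amount
    {"id": 3},                   # amount missing
    {"id": 4, "amount": 7.25},
]
valid, rejected = validate(batch)
print([r["id"] for r in valid])
print([reason for _, reason in rejected])
```

Keeping the rejected records together with a reason, rather than silently dropping them, makes it easier to trace integrity problems back to their source.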