Lecture 1: Introduction and Overview

Data Science Fundamentals

Course Code: CDS6214

Big Data

Big Data is used to generate real-world insights across industries:

  • Big Data in Retail: How video analytics improve customer experience.

  • Big Data in Telecommunications: Dynamic insights from Telefónica.

  • Big Data in Healthcare: Transforming the healthcare industry.

  • Big Data in Hospitality: Enhancing customer service and operations.

Big Data is defined as data whose scale, distribution, diversity, and/or timeliness require new technical architectures and analytics to unlock new sources of business value.


Types of Data Structures

  • Structured Data: Defined format (e.g., OLAP cubes, RDBMS, CSV files, spreadsheets).

  • Semi-Structured Data: Textual data with a discernible pattern that enables parsing (e.g., self-describing XML files).

  • Quasi-Structured Data: Inconsistent formats that need processing (e.g., web clickstream data).

  • Unstructured Data: No inherent structure (e.g., PDFs, images, videos).
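
The practical difference between these categories shows up in how much work is needed before the data can be analysed. The following is a minimal sketch using only the Python standard library; the sample records, field names, and the clickstream line layouts are invented for illustration and do not come from any real system.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET

# Structured: a CSV with a fixed header maps directly onto records.
csv_text = "customer_id,amount\n101,25.50\n102,40.00\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
total = sum(float(r["amount"]) for r in rows)

# Semi-structured: XML tags describe the data, but fields may vary per record.
xml_text = "<orders><order id='1'><item>book</item></order><order id='2'/></orders>"
root = ET.fromstring(xml_text)
order_items = {
    order.get("id"): [item.text for item in order.findall("item")]
    for order in root.findall("order")
}

# Quasi-structured: clickstream lines arrive in inconsistent layouts,
# so a pattern has to be applied before anything can be counted.
clicks = [
    "2024-01-05 10:02:11 GET /products?id=7",
    "10:02:15|/cart|user=42",  # same kind of event, different layout
]
path_pattern = re.compile(r"(/[A-Za-z0-9_/]*)")
paths = [m.group(1) for line in clicks if (m := path_pattern.search(line))]

print(total)        # sum of the amount column
print(order_items)  # items per order id
print(paths)        # page paths recovered from both layouts
```

Unstructured data such as images or video has no counterpart here because it typically needs specialised processing before any fields can be extracted at all.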


Growth of Big Data

Key Enablers:

  • Increase in storage capacities

  • Increase in processing power

  • Availability of data

The volume of data worldwide is expected to grow exponentially from 2021 to 2025.


5 Vs of Big Data

  1. Volume – Large amounts of data generated every second.

  2. Velocity – The speed at which data is created and processed.

  3. Variety – Different types of data (structured, semi-structured, quasi-structured, unstructured).

  4. Veracity – Trustworthiness and quality of data.

  5. Value – How data creates actionable insights.


Insights from Big Data

Big Data has been leveraged in various industries:

  • Facebook: Location-based analytics for friend suggestions and migration patterns.

  • Target: Predictive modeling of customer behavior from purchase history (e.g., pregnancy predictions).

  • Tesco: Refrigerator data analytics for proactive maintenance and energy savings.

  • Macy’s: Real-time price adjustments based on demand and inventory.

  • Siemens: Sensor-based analytics for predictive maintenance in trains.

  • Google Flu Trends: Using search data to predict flu outbreaks.

How to Extract Insights from Big Data

  • Machine Learning & AI play a crucial role in uncovering patterns.

  • Data Science Process involves exploration, prediction, and inference.


What is Data Science?

Data Science is the process of drawing useful conclusions from large and diverse datasets using:

  • Exploration – Identifying patterns via visualizations and descriptive statistics.

  • Prediction – Making informed guesses about unknown values using machine learning.

  • Inference – Quantifying certainty through statistical models.
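
The three activities above can be sketched on a toy dataset. This is an illustrative example only: the numbers are made up, a least-squares line stands in for "machine learning", and a bootstrap percentile interval is one of several possible ways to quantify certainty.

```python
import random
import statistics

# Hypothetical data: weekly ad spend (thousands) and units sold.
spend = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
sales = [12.0, 15.0, 21.0, 24.0, 30.0, 33.0]

# Exploration: descriptive statistics summarise what the data looks like.
mean_sales = statistics.mean(sales)
stdev_sales = statistics.stdev(sales)

# Prediction: fit a least-squares line, then estimate sales at an unseen spend level.
mean_spend = statistics.mean(spend)
slope = (
    sum((x - mean_spend) * (y - mean_sales) for x, y in zip(spend, sales))
    / sum((x - mean_spend) ** 2 for x in spend)
)
intercept = mean_sales - slope * mean_spend
predicted = intercept + slope * 7.0

# Inference: a bootstrap percentile interval for the mean of sales.
rng = random.Random(0)  # fixed seed so the run is repeatable
n_boot = 2000
boot_means = sorted(
    statistics.mean(rng.choices(sales, k=len(sales))) for _ in range(n_boot)
)
lower = boot_means[n_boot * 25 // 1000]        # 2.5th percentile
upper = boot_means[n_boot * 975 // 1000 - 1]   # 97.5th percentile

print(f"mean={mean_sales:.1f}, stdev={stdev_sales:.2f}")
print(f"predicted sales at spend 7.0: {predicted:.1f}")
print(f"approx. 95% interval for the mean: [{lower:.1f}, {upper:.1f}]")
```

With only six observations the interval is wide, which is the point of the inference step: it reports how much confidence the data can support rather than a single number.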

Difference Between BI and Data Science

  • Business Intelligence (BI): Focuses on past performance (reporting and dashboards).

  • Data Science: Predictive analytics and machine learning for future decision-making.

Data Science in Academia vs. Industry

  • Academic research focuses on new methodologies.

  • Industry applies existing methods for business applications.


Roles in Data Science

Who is a Data Scientist?

“A data scientist is someone who is better at statistics than any software engineer and better at software engineering than any statistician.” – Josh Wills

Technical Skills Required:

  • Mathematics: Linear algebra, calculus, probability

  • Statistics: Hypothesis testing, summary statistics

  • Machine Learning: k-NN, random forests, ensemble methods

  • Software Engineering: Distributed computing, algorithms, data structures

  • Big Data Technologies: Spark, Hadoop, Hive, Pig

  • Programming Languages: Python, R, SQL

  • Data Visualization: Effective communication of insights
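
Of the machine learning methods listed above, k-NN is simple enough to sketch in a few lines. The version below is a minimal illustration, assuming Euclidean distance and majority voting; the function name `knn_predict`, the training points, and the labels are hypothetical, and a real project would more likely use an established library.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (features, label) pairs; distance is Euclidean.
    """
    neighbours = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical training data: two well-separated groups of 2-D points.
train = [
    ((1.0, 1.0), "low"), ((1.5, 2.0), "low"), ((2.0, 1.0), "low"),
    ((6.0, 6.0), "high"), ((7.0, 6.5), "high"), ((6.5, 7.0), "high"),
]

print(knn_predict(train, (1.2, 1.1)))
print(knn_predict(train, (6.4, 6.4)))
```

The sketch touches several of the listed skills at once: distance computation (mathematics), sorting and counting (algorithms and data structures), and the learning method itself.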

Business Skills:

  • Analytical problem-solving

  • Effective communication

  • Intellectual curiosity

  • Industry knowledge

Responsibilities of a Data Scientist:

  • Conduct research and frame open-ended industry questions.

  • Extract and clean large volumes of data.

  • Identify trends and hidden insights.

  • Develop predictive models and automation tools.

  • Communicate findings to stakeholders.


Data Science Roles Comparison

Role            | Focus
Data Analyst    | Reporting, SQL queries, dashboards
Data Engineer   | Building data pipelines, managing infrastructure
Data Scientist  | Advanced analytics, machine learning, modeling

Challenges in Data Science

  • Validity of assumptions

  • Overgeneralization of models

  • Effective communication of insights

  • Transitioning from prototypes to production-ready systems

  • Ensuring data integrity in pipelines
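
As one illustration of the last challenge, a pipeline can check records before passing them downstream. The following is a minimal sketch, not a complete validation framework; the function name `validate_records`, the `key_field` parameter, and the sample records are all hypothetical.

```python
def validate_records(records, required_fields, key_field="id"):
    """Split records into valid rows and (index, reason) error entries.

    A record is rejected if a required field is missing or empty,
    or if its key has already been seen earlier in the batch.
    """
    valid, errors = [], []
    seen_keys = set()
    for index, record in enumerate(records):
        missing = [f for f in required_fields if record.get(f) in (None, "")]
        if missing:
            errors.append((index, f"missing fields: {missing}"))
        elif record.get(key_field) in seen_keys:
            errors.append((index, "duplicate id"))
        else:
            seen_keys.add(record.get(key_field))
            valid.append(record)
    return valid, errors


# Hypothetical batch with one duplicate and one incomplete record.
records = [
    {"id": 1, "amount": 10.0},
    {"id": 1, "amount": 12.0},
    {"id": 2, "amount": None},
]
valid, errors = validate_records(records, ["id", "amount"])
print(valid)
print(errors)
```

Returning the rejected rows with a reason, rather than silently dropping them, makes it easier to trace integrity problems back to their source.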