Introduction
What is Statistics?
- Statistics: The science of conducting studies to collect, organize, summarize, analyze, present, interpret, and draw conclusions from data.
- Data: Any value, either observation or measurement, that has been collected.
- Variable: A characteristic or attribute that can assume different values.
- Variables whose values are determined by chance are called random variables.
Some Terminologies
- Individuals/Units/Instances/Records: The objects described by a set of data.
- Variable/Attribute/Feature:
  - Any characteristic of an individual.
  - Can take different values for different individuals.
  - For a randomly selected individual, the value is determined by chance, so the variable is a random variable.
Why Do We Need Statistics?
- Describing the Relationship Between Variables
  - Example: A university admission director studies the relationship between SPM results and GPA.
- Making Better Decisions in the Face of Uncertainty
  - Example: A consumer activist uses statistical inference to check whether a hair stylist's claim of 90% customer satisfaction is exaggerated.
Variables
Types of Variables
- Numerical
  - Discrete
  - Continuous
- Categorical
  - Nominal
  - Ordinal
Definitions
- Numerical Variable:
  - Values are numbers obtained by counting (discrete) or measuring (continuous).
  - Arithmetic operations (adding, averaging) make sense.
- Categorical Variable:
  - Places an individual into one of several groups or categories.
  - Nominal: The categories have no natural order.
  - Ordinal: The categories have an evident order.
Types of Data
Qualitative (Categorical/Attribute) Data
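- Data that place individuals into categories. Categories may be recorded as code numbers, but arithmetic on those codes is not meaningful.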
- Nominal Data: No ranking. (e.g., Gender, race, nationality)
- Ordinal Data: Can be ranked. (e.g., Likert scale, color intensity)
Quantitative (Numerical) Data
- Can be counted or measured.
- Discrete Data: Countable values, usually obtained by counting. (e.g., Number of students, number of defects)
- Continuous Data: Can take any value within an interval; recorded values are rounded to the precision of the measurement. (e.g., Weight, age, salary)
Levels of Measurement
| Level | Description | Examples |
|---|---|---|
| Nominal | Categories, no ranking | Gender, Religion, Zip Code |
| Ordinal | Categories with ranking | Grades (A, B, C), Ratings (Good, Excellent) |
| Interval | Ranked with equal intervals, no true zero | IQ test, Temperature (°C or °F), Shoe Size |
| Ratio | Equal intervals and a true zero, so ratios are meaningful | Height, Weight, Time, Salary |
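The level of measurement determines which summaries are meaningful. The sketch below illustrates this with one variable per level; all data values and variable names are hypothetical, invented only for illustration.

```python
from statistics import mean, mode

# Hypothetical data, invented for illustration only.
religion = ["Islam", "Buddhism", "Islam", "Hinduism", "Islam"]  # nominal
grades = ["B", "A", "C", "B", "A"]                              # ordinal
temp_c = [30.0, 32.0, 31.0, 29.0, 33.0]                         # interval
weight_kg = [50.0, 62.0, 75.0, 58.0, 80.0]                      # ratio

# Nominal: categories can only be counted, so the mode is a sensible summary.
religion_mode = mode(religion)

# Ordinal: categories can be ranked, so a median category is meaningful.
grade_order = ["C", "B", "A"]  # assumed ranking, lowest to highest
ranks = sorted(grade_order.index(g) for g in grades)
grade_median = grade_order[ranks[len(ranks) // 2]]  # middle of an odd-length list

# Interval: differences are meaningful, so a mean can be computed.
temp_mean = mean(temp_c)

# Ratio: a true zero exists, so "how many times larger" is meaningful.
weight_ratio = max(weight_kg) / min(weight_kg)

print("Mode of religion:", religion_mode)
print("Median grade:", grade_median)
print("Mean temperature (°C):", temp_mean)
print("Heaviest / lightest:", weight_ratio)
```

Note that a ratio such as "33 °C is 1.1 times as hot as 30 °C" would not be meaningful, because 0 °C is not a true zero.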
Exercise 1
Analyze the job-related injuries data by classifying each variable:
- Qualitative or Quantitative?
- Discrete or Continuous?
- Nominal or Ordinal?
- Measurement Level?
Dimensionality of Dataset
| Dimension | Number of Variables | Purpose | Example | Common Techniques |
|---|---|---|---|---|
| Univariate | 1 | Analyze distribution, central tendency, dispersion | Height of students | Mean, Median, Boxplot |
| Bivariate | 2 | Relationship & association | Height vs. Weight | Correlation, Scatter Plot |
| Multivariate | >2 | Complex relationships | Height, Weight, Age | Regression, Clustering |
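As a rough illustration of the first two rows of the table, the sketch below computes univariate summaries for height and a bivariate correlation between height and weight. The data are hypothetical, and `pearson_r` is a helper written for this example, not a library function.

```python
from math import sqrt
from statistics import mean, median

# Hypothetical measurements for five students (illustration only).
heights_cm = [150.0, 160.0, 165.0, 170.0, 180.0]
weights_kg = [45.0, 55.0, 60.0, 68.0, 77.0]


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Univariate: describe one variable on its own.
height_mean = mean(heights_cm)
height_median = median(heights_cm)

# Bivariate: describe how two variables move together.
r = pearson_r(heights_cm, weights_kg)

print("Mean height:", height_mean)
print("Median height:", height_median)
print("Correlation (height, weight):", round(r, 3))
```

Multivariate techniques such as regression or clustering usually rely on a dedicated library and are left out of this sketch.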
Statistical Data Analysis
- Asking the right question(s).
- Collecting useful data or searching for secondary data.
- Exploring and analyzing data to answer questions.
- Making decisions & inferences about a population.
- Turning data into knowledge.
Population and Sample
- Population: The complete collection of measurements, outcomes, or individuals under study. Its size is denoted N.
- Sample: A subset of the population. Its size is denoted n.
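One common way to obtain a sample is simple random sampling without replacement. The sketch below assumes a hypothetical population of 500 student ID numbers and a sample size of 20; both numbers are made up for illustration.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: ID numbers of N = 500 students.
population = list(range(1, 501))
N = len(population)

# Simple random sample of n = 20 students, drawn without replacement.
sample = random.sample(population, 20)
n = len(sample)

print(f"N = {N}, n = {n}")
print("Sampled IDs:", sorted(sample))
```

Every student has the same chance of being selected, and no student can appear in the sample twice.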
Types of Populations
- Tangible: Finite, fixed subjects (e.g., all students in a university).
- Intangible (Conceptual): Unlimited possible observations (e.g., simulated data).
Parameter vs. Statistic
- Parameter: A numerical value representing a population characteristic.
  - Example: The average height of all students in a university.
- Statistic: A numerical value representing a sample characteristic.
  - Example: The average height of a sample of students selected from that university.
Notation
| Measurement | Parameter (Population) | Statistic (Sample) |
|---|---|---|
| Mean | μ | x̄ |
| Variance | σ² | s² |
| Standard Deviation | σ | s |
| Proportion | π | p |
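The notation above can be made concrete with Python's `statistics` module, which provides separate functions for population and sample quantities. The sketch below uses a small hypothetical population and a hypothetical sample taken from it; the "proportion" is taken to be the share of values greater than 5, a threshold chosen only for this example.

```python
from statistics import mean, pstdev, pvariance, stdev, variance

# Hypothetical data (illustration only).
population = [4.0, 8.0, 6.0, 5.0, 3.0, 7.0, 9.0, 6.0]  # the whole population
sample = [4.0, 6.0, 3.0, 9.0]                          # four values drawn from it

# Parameters: computed from every member of the population.
mu = mean(population)
sigma2 = pvariance(population)  # divides by N
sigma = pstdev(population)
pi_pop = sum(x > 5 for x in population) / len(population)

# Statistics: computed from the sample only.
x_bar = mean(sample)
s2 = variance(sample)           # divides by n - 1
s = stdev(sample)
p = sum(x > 5 for x in sample) / len(sample)

print(f"mu = {mu}, sigma^2 = {sigma2}, sigma = {sigma:.3f}, pi = {pi_pop}")
print(f"x_bar = {x_bar}, s^2 = {s2}, s = {s:.3f}, p = {p}")
```

The sample values differ from the population values, which is expected: a statistic estimates a parameter and changes from sample to sample.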
Example 1
A travel agent claims that large hotels in Pahang have an average of 500 rooms (σ = 165). A sample of 7 hotels in Genting Highlands shows an average of 435 rooms (s = 15).
- Population: All large hotels in Pahang.
- Sample: 7 large hotels in Genting Highlands.
- Variable: Number of rooms.
- Parameter: μ = 500 (the claimed value), σ = 165.
- Statistic: x̄ = 435, s = 15.
Exercise 2
A hostel has 317 first-year students. A dean collects IQ pre-test scores for 27 students and estimates the mean IQ for all students. Answer:
- What is the population?
- Is it tangible or conceptual?
- What is the sample?
- What is the variable?
- Which number describes a parameter?
- Which number describes a statistic?
Descriptive vs. Inferential Statistics
- Descriptive Statistics: Organizing, summarizing, and presenting data.
  - Example: “10,000 parents in Malaysia chose Takaful Insurance.”
- Inferential Statistics: Generalizing from a sample to a population and making predictions.
  - Example: “The lung cancer rate is 10 times higher in smokers.”
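The difference between the two can be sketched with a hypothetical satisfaction survey. The counts below (168 satisfied customers out of 200) are invented, and the 95% confidence interval uses the normal approximation for a proportion, which is one common choice rather than the only one.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical survey results (illustration only).
n = 200          # customers surveyed
satisfied = 168  # customers who reported being satisfied

# Descriptive: summarize the sample itself.
p_hat = satisfied / n

# Inferential: estimate the population proportion with a 95% confidence
# interval, using the normal approximation (an assumption of this sketch).
z = NormalDist().inv_cdf(0.975)
margin = z * sqrt(p_hat * (1 - p_hat) / n)
lower, upper = p_hat - margin, p_hat + margin

print(f"Sample proportion: {p_hat:.2f}")
print(f"95% CI for the population proportion: ({lower:.3f}, {upper:.3f})")
```

With these made-up numbers the interval lies below 0.90, so a claim of 90% satisfaction would look doubtful. The first printed line only describes the 200 customers surveyed; the second makes a statement about all customers.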
Exercise 3
Classify these as Descriptive or Inferential Statistics:
- The average cost of a wedding is RM10,000.
- The median salary for a bachelor’s degree holder in Malaysia is RM30,000 for men and RM29,000 for women.
- An estimated 500,000 children under 15 have Type 1 diabetes.
- A new drug is claimed to reduce heart attacks in men over 70.