Data Mining and Warehousing Notes
All Topics (21)
- 1. What is Data Warehouse ?
- 2. Key Characteristics of a Data Warehouse
- 3. Need, Importance & Components of Data Warehousing
- 4. Data Warehouse vs OLTP
- 5. Architecture, Advantages & Applications of a Data Warehouse
- 6. Data Warehouse (DW) Delivery Process
- 7. Data Warehouse Architecture
- 8. Data Cleaning
- 9. Data Integration
- 10. Data Transformation
- 11. Data Reduction
- 12. Data Warehouse Schema Design
- 13. Partitioning Strategy in Data Warehouse Implementation
- 14. Data Marts
- 15. Metadata
- 16. Example of a Multidimensional Data Model
- 17. Pattern Warehousing
- 18. OLAP Systems
- 19. OLAP Queries
- 20. OLAP Servers
- 21. OLAP Operations
11. Data Reduction
Data Reduction is the process of reducing the volume of data while maintaining its integrity and analytical value.
Why Data Reduction is Needed
In data warehousing, datasets are very large. Data reduction helps to:
- Improve query performance
- Reduce storage cost
- Speed up data mining
- Make analysis faster and easier
Goal
To obtain a reduced representation of data that:
- Is smaller in size
- Produces almost the same analytical results
Types of Data Reduction Techniques
1. Data Cube Aggregation
Data is stored at detailed (low-level) form and then aggregated into higher-level summaries.
Examples:
- Daily sales → Monthly sales
- Monthly → Yearly
- City → State → Country
Used in:
- OLAP Cubes
- Roll-up operations
- Summary tables
2. Dimensionality Reduction
Reduces the number of attributes (features) in the dataset.
Techniques:
- PCA (Principal Component Analysis)
- Feature Selection:
- Correlation
- Chi-square
- RFE (Recursive Feature Elimination)
- Wavelet Transform
Benefits:
- Reduces model complexity
- Removes irrelevant/redundant data
- Improves accuracy and speed
3. Data Compression
Stores data in a compact format.
Types:
- Lossless Compression
- No data loss
- Examples: ZIP, Huffman Coding
- Lossy Compression
- Some data loss allowed
- Used in images, audio, video
- Other Methods
- Run-Length Encoding
- String Compression
4. Numerosity Reduction
Replaces the original dataset with a smaller representation.
(a) Parametric Methods
Assume a model and store only its parameters.
Examples:
- Regression Models
- Log-linear Models
Store model instead of full data.
(b) Non-Parametric Methods
No model assumption.
Examples:
- Histograms
- Clustering (e.g., K-means)
- Sampling
5. Sampling
Selecting a small but representative subset of data.
Types:
- Simple Random Sampling
- Stratified Sampling
- Systematic Sampling
Use:
- When full dataset is too large to process
6. Discretization and Binning
Reduces the number of distinct values by grouping.
Examples:
- Age (1–100) → (0–10, 11–20, …)
- Income → Low, Medium, High
Benefits:
- Reduces data size
- Simplifies analysis
- Improves model performance
12. Data Warehouse Schema Design
A schema in a data warehouse defines how data is organized and how tables are logically connected.
It is based on Dimensional Modeling, which divides data into:
- Fact Tables → Store numeric measures (Sales, Revenue, Quantity)
- Dimension Tables → Store descriptive attributes (Time, Product, Customer)
Schema design determines how fact and dimension tables are arranged.
Types of Data Warehouse Schemas
1. Star Schema
Definition:
The simplest and most commonly used schema.
Structure:
- One central Fact Table
- Multiple surrounding Dimension Tables
- All dimensions directly connected to the fact table
Diagram (Text):
|
Customer — Fact Table — Time
|
Store Dimension
Features:
- Easy to understand
- Fast query performance
- Denormalized dimension tables (redundancy allowed)
Use Cases:
- Retail sales
- Finance dashboards
- OLAP queries
2. Snowflake Schema
Definition:
A normalized version of the star schema.
Structure:
- One central Fact Table
- Dimension tables further divided into sub-dimensions (normalized)
Diagram (Text):
|
Product — Fact Table — Time
|
Customer → Customer City
Features:
- Reduced data redundancy
- More complex joins
- Saves storage space
Use Cases:
- Large and complex data warehouses
- Systems with many attributes
3. Galaxy Schema (Fact Constellation Schema)
Definition:
A schema with multiple fact tables sharing common dimension tables.
Structure:
- Multiple fact tables (e.g., Sales Fact, Inventory Fact)
- Shared dimension tables (Product, Time, Store)
Diagram (Text):
/ \
Sales Fact Inventory Fact
\ /
Time Dimension
Features:
- Handles complex business scenarios
- Supports multiple data marts
- Flexible but complex
Use Cases:
- Enterprise-level data warehouses
- Large organizations
Components of a Data Warehouse Schema
1. Fact Table
- Stores numerical/measurable data
- Contains foreign keys to dimension tables
Examples:
- Sales Amount
- Quantity
- Profit
2. Dimension Table
- Stores descriptive attributes
Examples:
- Time Dimension
- Product Dimension
- Customer Dimension
- Store / Location Dimension
3. Measures and Metrics
- Calculated values used for analysis
Examples:
- SUM(Sales)
- AVG(Profit)
4. Hierarchies
Used for Drill-Down and Roll-Up operations in OLAP.
Examples:
- Day → Month → Quarter → Year
- City → State → Country
Comparison of Schemas
| Feature | Star Schema | Snowflake Schema | Galaxy Schema |
|---|---|---|---|
| Normalization | Low | High | Medium |
| Query Speed | Fast | Slower | Moderate |
| Complexity | Simple | Complex | High |
| Storage | Large | Smaller | Medium |
| Fact Tables | One | One | Many |
| Dimensions | Denormalized | Normalized | Shared |
13. Partitioning Strategy in Data Warehouse Implementation
Partitioning is a physical design technique used in data warehouses to divide large tables (especially fact tables) into smaller, manageable pieces called partitions.
Why Partitioning is Needed
Data warehouses store massive volumes of historical data (often billions of rows). Partitioning helps in:
- Faster Query Processing
Only relevant partitions are scanned - Faster ETL & Data Loading
Only new/recent partitions are updated - Easy Data Archiving / Purging
Old partitions can be dropped without affecting others - Better Indexing & Storage Management
Smaller, more efficient indexes
Types of Partitioning Strategies
1. Range Partitioning (Most Common)
Data is divided based on a continuous range of values.
Example (Date-based):
- Partition 1 → Jan 2024
- Partition 2 → Feb 2024
- Partition 3 → Mar 2024
Advantages:
- Best for time-based data
- Easy archiving and deletion
- Efficient for range queries (e.g., last 3 months)
2. List Partitioning
Data is divided based on a predefined list of values.
Example (Region-based):
- Partition A → USA, Canada
- Partition B → India, Nepal, Sri Lanka
- Partition C → UK, Germany, France
Advantages:
- Good for categorical data
- Faster queries on specific categories
3. Hash Partitioning
Data is distributed using a hash function.
Example:
Advantages:
- Uniform data distribution
- Avoids data skew
- Useful when no natural partition key exists
4. Composite Partitioning (Hybrid)
Combination of two or more partitioning techniques.
Example (Range + Hash):
- First partition by Month
- Then apply hash on Customer_ID
Advantages:
- Suitable for very large datasets
- Balances performance and distribution
5. Column-based Partitioning (Columnar Storage)
Used in modern data warehouses.
Features:
- Stores data column-wise instead of row-wise
- High compression
- Very fast analytical queries
Examples of systems:
- BigQuery
- Snowflake
- Amazon Redshift
How Partitioning Works (Steps)
Step 1: Identify Large Tables
- Usually fact tables with billions of rows
Step 2: Select Partition Key
Common Keys:
- Date
- Product Category
- Region
- Department
- Customer ID
Step 3: Choose Partition Type
- Range
- List
- Hash
- Composite
Step 4: Create Partitions
- Table is divided into multiple segments on disk
Step 5: Manage Partitions
- Add new partitions (e.g., new month/year)
- Drop old partitions (archival)
- Merge or split partitions
Partition Pruning
Definition:
Query engines automatically skip irrelevant partitions.
Example Query:
WHERE sale_date BETWEEN '2024-01-01' AND '2024-01-31';
Only January partition is scanned, not the full table.
Result:
Faster query performance
Benefits of Partitioning
| Benefit | Explanation |
|---|---|
| Faster Query Performance | Only required partitions are scanned |
| Better ETL Speed | Incremental data loading |
| Easy Archival | Drop old partitions easily |
| Improved Index Efficiency | Smaller indexes per partition |
| High Availability | Operations at partition level |
| Manageability | Easier maintenance |
Example Scenario
Table: Sales_Fact
Partition Key: Sale_Date
Partition Type: Range Partitioning
| Partition Name | Date Range |
|---|---|
| SALES_2023_Q1 | Jan – Mar 2023 |
| SALES_2023_Q2 | Apr – Jun 2023 |
| SALES_2023_Q3 | Jul – Sep 2023 |
| SALES_2023_Q4 | Oct – Dec 2023 |
Operations You Can Perform
- Archive old partition (e.g., SALES_2021)
- Add new partition (e.g., SALES_2025)
- Query only recent data (last 6 months)
14. Data Marts
A Data Mart is a subset of a data warehouse that focuses on a specific business area such as Sales, Marketing, Finance, HR, or Inventory.
It is smaller, faster, and easier to manage than a full data warehouse.
Formal Definition
A Data Mart is a subject-oriented, integrated, and optimized collection of data designed to serve the needs of a particular department or business function.
Why Use Data Marts
- Provide quick access to relevant data
- Support department-specific reporting
- Reduce load on the main Data Warehouse
- Improve performance and simplify analytics
- Lower storage and management cost
Types of Data Marts
1. Dependent Data Mart
Definition:
Created from the Enterprise Data Warehouse (EDW).
Structure:
Features:
- Uses centralized and clean data
- Highly consistent and reliable
Example:
Sales Data Mart extracted from EDW
2. Independent Data Mart
Definition:
Created directly from operational systems without a data warehouse.
Structure:
Features:
- Faster to build
- May cause data inconsistency
Example:
Marketing Data Mart built from CRM system
3. Hybrid Data Mart
Definition:
Combination of both dependent and independent approaches.
Features:
- Uses both EDW and operational systems
- Flexible and faster implementation
Example:
Finance Data Mart using EDW + external banking data
Data Mart Architecture
A Data Mart consists of 3 layers:
1. Source Layer
- Operational databases
- Data warehouse
- External data sources
2. ETL Layer (Extract – Transform – Load)
- Data cleaning
- Data integration
- Data transformation
- Loading into data mart
3. Data Mart Storage / Presentation Layer
- Fact tables
- Dimension tables
- OLAP cubes
- Reports and dashboards
Schema Used in Data Marts
Data marts typically use Dimensional Modeling:
- Star Schema
- Snowflake Schema
Data Mart Components
1. Fact Tables
- Store transactional data
- Contain measures
Examples:
- Sales amount
- Revenue
- Profit
2. Dimension Tables
- Store descriptive attributes
Examples:
- Time
- Product
- Customer
- Region
Examples of Data Marts
1. Sales Data Mart
- Measures: Sales amount, Quantity
- Dimensions: Time, Product, Store
2. Finance Data Mart
- Measures: Revenue, Expenses, Profit
- Dimensions: Time, Department, Account
3. HR Data Mart
- Measures: Employee count, Salary
- Dimensions: Department, Location, Job Role
Advantages of Data Marts
| Advantage | Explanation |
|---|---|
| Faster Access | Smaller dataset → quick queries |
| Better Performance | Optimized for specific tasks |
| Lower Cost | Less storage and processing |
| Easy to Manage | Can be handled by small teams |
| Department-Focused | Tailored for specific needs |
| Reduces DW Load | Offloads queries from main warehouse |
Disadvantages of Data Marts
| Disadvantage | Explanation |
|---|---|
| Data Inconsistency | Occurs in independent marts |
| Data Redundancy | Same data stored multiple times |
| Limited Scope | Only covers one department |
| Integration Issues | Hard to combine multiple marts |
15. Metadata
Metadata means “data about data.”
It describes, explains, and gives meaning to the data stored in a data warehouse.
Formal Definition
Metadata is descriptive information that defines the structure, operations, rules, and contents of data in a data warehouse, enabling users and systems to understand and manage the data.
What Metadata Helps Users Understand
- What the data is
- Where it came from
- How it is structured
- How it should be used
Why Metadata is Important
- Helps users understand data definitions
- Helps ETL developers track data movement
- Provides source-to-target mapping
- Supports data governance
- Improves data quality and consistency
- Helps analysts generate accurate reports
Types of Metadata
1. Technical Metadata
Used by: IT teams, DBAs, ETL developers
Includes:
- Table names and column names
- Data types
- Primary keys and foreign keys
- Indexes and constraints
- Schema and table structure
- File paths and storage details
Example:
- Customer_ID → Integer (Primary Key)
- Date → Format: YYYY-MM-DD
2. Business Metadata
Used by: Business users, analysts
Includes:
- Business definitions
- KPIs (Key Performance Indicators)
- Data ownership
- Business rules
- Meaning of data fields
Example:
- Sales_Amount = Total value of products sold (excluding tax)
3. Operational Metadata
Describes: ETL and system operations
Includes:
- Data load time
- Record counts (before/after load)
- Error logs
- Job execution logs
- Data source details
Example:
- ETL Load Time: 02:00 AM
- Rows Loaded: 2,500,000
- Failed Rows: 1,200
Metadata in ETL Process
Metadata stores information about:
- Source extraction rules
- Transformation logic
- Mapping between source and target
- Load schedules
- Job dependencies
Metadata Repository (Metadata Catalog)
Definition:
A Metadata Repository is a central storage location where all metadata is stored.
Used By:
- ETL Tools
- BI Tools
- Data Warehouse Administrators
- Analysts
Examples of Metadata Tools:
- Informatica Repository
- SSIS Catalog
- AWS Glue Data Catalog
- Hive Metastore
Examples of Metadata
1. Technical Metadata
- Table: sales_fact
- Column: sale_date (DATE)
- Column: amount (NUMBER, NOT NULL)
2. Business Metadata
- Revenue = Total sales after discount
3. Operational Metadata
- Last ETL Run: 1 March 2025, 3:00 AM
- Status: Success
Advantages of Metadata
| Benefit | Explanation |
|---|---|
| Improves Data Understanding | Clear meaning of data |
| Better Data Quality | Detects errors and inconsistencies |
| Easier ETL Development | Provides clear documentation |
| Faster Data Retrieval | Helps optimize queries |
| Supports Governance | Tracks data origin and lineage |