What Is ETL Testing? (Definition + Real Example)
ETL testing validates data as it moves from source systems to the target data warehouse (DW) through the Extract, Transform, and Load process. The goal is to ensure accuracy, completeness, consistency, performance, and business-rule compliance.
Real-world example:
A banking DW pulls transactions from OLTP systems, converts currencies, deduplicates customers, applies SCD2 on dimensions, and loads fact tables for reporting. ETL testing ensures row counts match, transformations are correct, audit fields are populated, and reports reflect true balances.
Data Warehouse (DW) Flow: Source → Staging → Transform → Load → Reporting
- Source: OLTP DBs, files, APIs
- Staging: Raw landing, minimal rules
- Transform: Business logic, SCDs, aggregations
- Load: Facts & dimensions
- Reporting: BI tools consume curated data
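The flow above can be exercised end-to-end on a tiny dataset. Below is a minimal sketch using Python's sqlite3 with an in-memory database; the table and column names (stg_orders, fact_orders) are invented for illustration, not taken from any specific warehouse.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Source extract lands raw in staging with minimal rules applied.
cur.execute("CREATE TABLE stg_orders (order_id INT, amount TEXT)")
cur.executemany("INSERT INTO stg_orders VALUES (?, ?)",
                [(1, "10.50"), (2, "20.00"), (2, "20.00")])  # note the duplicate

# Transform: cast types and deduplicate; Load: into the curated fact table.
cur.execute("CREATE TABLE fact_orders (order_id INT PRIMARY KEY, amount REAL)")
cur.execute("""
    INSERT INTO fact_orders
    SELECT DISTINCT order_id, CAST(amount AS REAL) FROM stg_orders
""")

# Reporting layer consumes the curated table.
total = cur.execute("SELECT SUM(amount) FROM fact_orders").fetchone()[0]
print(total)  # 30.5 (the duplicate staged row was removed)
```

The staging table deliberately keeps loose types (TEXT amounts) and duplicates, mirroring the "raw landing, minimal rules" role; the transform step is where typing and dedup happen.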
Interview Questions + Best Answers (Basic → Advanced)
A. Fundamentals (1–10)
- What is ETL testing?
Validation of extracted, transformed, and loaded data against business rules and mappings.
- ETL vs DW testing?
ETL testing focuses on pipelines; DW testing also covers schemas, facts/dimensions, and reporting accuracy.
- Why staging?
It isolates raw data, supports reprocessing, and simplifies debugging.
- What is S2T?
The Source-to-Target mapping document defining transformations.
- Fact vs Dimension?
Facts store measures; dimensions store descriptive attributes.
- Types of ETL testing?
Source validation, transformation validation, target validation, reconciliation, performance.
- What are audit fields?
Columns such as load_dt, batch_id, created_by, updated_dt.
- What is data reconciliation?
Matching counts and sums between source and target.
- What is data lineage?
Traceability of data from source to report.
- Common ETL defects?
Truncation, duplicates, null propagation, wrong joins.
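The reconciliation check described above (matching counts and sums between source and target) is straightforward to automate. A minimal sketch using sqlite3; the src_orders/fact_sales shapes follow this article's sample tables, and the data values are invented.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE src_orders (order_id INT, amount REAL)")
cur.execute("CREATE TABLE fact_sales (order_id INT, amount REAL)")
cur.executemany("INSERT INTO src_orders VALUES (?, ?)", [(1, 100.0), (2, 250.0)])
cur.executemany("INSERT INTO fact_sales VALUES (?, ?)", [(1, 100.0), (2, 250.0)])

# Reconciliation: row counts and measure totals must match source vs target.
src_cnt, src_sum = cur.execute(
    "SELECT COUNT(*), SUM(amount) FROM src_orders").fetchone()
tgt_cnt, tgt_sum = cur.execute(
    "SELECT COUNT(*), SUM(amount) FROM fact_sales").fetchone()

assert src_cnt == tgt_cnt, f"count mismatch: {src_cnt} vs {tgt_cnt}"
assert src_sum == tgt_sum, f"sum mismatch: {src_sum} vs {tgt_sum}"
print("reconciliation passed")
```

In practice the same pattern runs against two real connections (source DB and DW) rather than one in-memory database, with the comparison filtered by batch_id or load_dt.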
B. Mapping & Transformations (11–25)
- Mapping validation steps?
Verify joins, expressions, filters, and lookups against the S2T document.
- Handling nulls?
Default values, COALESCE, or rejecting rows with reason codes.
- SCD Type 1 vs Type 2?
Type 1 overwrites; Type 2 preserves history with effective dates.
- Surrogate vs natural key?
A surrogate key is DW-generated; a natural key comes from the source.
- Deduplication strategy?
Hashing on business keys plus ROW_NUMBER.
- What is CDC?
Change Data Capture, used for incremental loads.
- Lookup vs Join?
Lookups are cached reference checks; joins combine datasets.
- Late-arriving dimensions?
Insert placeholder rows, update them later.
- Reject handling?
Route invalid rows to error tables with reason codes.
- Soft deletes?
Flag records instead of physically deleting them.
- What is data masking?
Obfuscation of sensitive fields.
- What is hashing used for?
Change detection and deduplication.
- Slowly changing facts?
Facts rarely change; adjustments are handled via correction records.
- Effective dating?
valid_from/valid_to columns that manage history.
- Conformed dimensions?
Dimensions shared across data marts.
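Hashing for change detection, mentioned above, typically concatenates a row's non-key attributes and hashes the result; a differing hash flags an SCD2 change. A minimal sketch in Python; the column names and separator choice are illustrative assumptions.

```python
import hashlib

def row_hash(*attrs):
    """Hash the non-key attributes of a row for change detection."""
    joined = "|".join("" if a is None else str(a) for a in attrs)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

# Current dimension version vs the freshly extracted source row.
dim_row = {"cust_id": "C1", "name": "Alice", "city": "Pune"}
src_row = {"cust_id": "C1", "name": "Alice", "city": "Mumbai"}

old_h = row_hash(dim_row["name"], dim_row["city"])
new_h = row_hash(src_row["name"], src_row["city"])

# SCD2: a changed hash means expire the old version and insert a new one.
changed = old_h != new_h
print(changed)  # True, because city changed
```

Comparing one hash per row is cheaper than comparing every attribute column-by-column, which is why the same hash is also usable as a dedup key on business keys.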
C. SQL-Driven Validation (26–40)
- Row count check?
Compare source vs target counts after filters are applied.
- Sum reconciliation?
Validate measure totals after transformations.
- Duplicate detection?
GROUP BY ... HAVING COUNT(*) > 1.
- Referential integrity?
Every fact key must match a dimension key.
- Window function usage?
Ranking, deduplication, SCD logic.
- Performance checks?
Index usage, partition pruning.
- Data type mismatch?
Validate casting rules.
- Incremental load validation?
Confirm only delta rows are processed.
- Reject count validation?
Expected vs actual reject counts.
- Data freshness?
MAX(load_dt) within SLA.
- Aggregation correctness?
GROUP BY validations against source detail.
- Join correctness?
Validate cardinality and join keys.
- Null propagation risk?
Validate mandatory fields.
- Audit reconciliation?
Batch totals must match.
- Restartability?
Loads must be idempotent on rerun.
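Two of the checks above, duplicate detection and referential integrity, can be run against a small dataset to see the failure signatures. A sketch using sqlite3 with this article's sample table shapes; the seeded bad rows (a duplicated fact and an orphaned key) are deliberate.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE dim_customer (cust_sk INT, cust_id TEXT)")
cur.execute("CREATE TABLE fact_sales (order_id INT, cust_sk INT, amount REAL)")
cur.executemany("INSERT INTO dim_customer VALUES (?, ?)", [(1, "C1"), (2, "C2")])
cur.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                [(10, 1, 50.0), (10, 1, 50.0), (11, 99, 75.0)])  # dup + orphan

# Duplicate detection: GROUP BY ... HAVING COUNT(*) > 1
dups = cur.execute("""
    SELECT order_id FROM fact_sales
    GROUP BY order_id HAVING COUNT(*) > 1
""").fetchall()

# Referential integrity: fact keys with no matching dimension row.
orphans = cur.execute("""
    SELECT f.order_id FROM fact_sales f
    LEFT JOIN dim_customer d ON f.cust_sk = d.cust_sk
    WHERE d.cust_sk IS NULL
""").fetchall()

print(dups, orphans)  # [(10,)] [(11,)]
```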
D. Advanced & Real-Time (41–55)
- How to test SCD2 end-to-end?
Validate versioning, effective dates, and current_flag.
- Handling late data?
Backdated inserts with re-aggregation.
- Large file testing?
Chunking, parallelism, checksums.
- Data skew?
Identify unevenly sized partitions.
- Error reprocessing?
Fix the source, then re-run from staging.
- Cross-system reconciliation?
Align currencies and time zones before comparing.
- PII compliance?
Masking plus access controls.
- Time-zone issues?
Normalize timestamps to UTC.
- End-to-end BI validation?
Report totals vs DW tables.
- SLA breaches?
Root-cause via job statistics.
- Schema drift?
Detect added or removed columns.
- Data quality rules?
Threshold-based alerts.
- Rollback strategy?
Partition swaps, backups.
- Parallel load risks?
Locking and duplicates.
- Cloud DW nuances?
Cost/performance trade-offs.
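Schema drift, one of the advanced scenarios above, is usually caught by comparing the columns that actually arrived against the agreed S2T contract. A sketch using sqlite3's PRAGMA table_info; the table name and the "contract" set are illustrative assumptions.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# The table as it arrives today; 'discount' is a newly added column.
con.execute("CREATE TABLE src_orders (order_id INT, amount REAL, discount REAL)")

expected = {"order_id", "amount"}  # the agreed S2T contract
# PRAGMA table_info returns (cid, name, type, notnull, dflt_value, pk) per column.
actual = {row[1] for row in con.execute("PRAGMA table_info(src_orders)")}

added = actual - expected
removed = expected - actual
print(sorted(added), sorted(removed))  # ['discount'] []
```

The same set-difference idea works against any catalog view (e.g. information_schema.columns); a non-empty "removed" set is usually a hard failure, while "added" may only warrant an alert.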
Real SQL Query Examples (with Sample Datasets)
Sample tables
src_orders(order_id, cust_id, amount, order_dt)
dim_customer(cust_sk, cust_id, current_flag)
fact_sales(order_id, cust_sk, amount, load_dt)
1) JOIN Validation
SELECT COUNT(*) AS missing_dim
FROM fact_sales f
LEFT JOIN dim_customer d
ON f.cust_sk = d.cust_sk
WHERE d.cust_sk IS NULL;
2) Aggregate (SUM) Reconciliation
SELECT SUM(amount) AS src_sum FROM src_orders;
SELECT SUM(amount) AS tgt_sum FROM fact_sales;
3) Window Function (Dedup)
SELECT *
FROM (
SELECT order_id, cust_id, order_dt,
ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY order_dt DESC) rn
FROM src_orders
) t
WHERE rn = 1;
4) SCD2 Check
SELECT cust_id, COUNT(*) versions
FROM dim_customer
GROUP BY cust_id
HAVING COUNT(*) > 1;
5) Performance Tuning Hint
-- Ensure partition pruning and indexed joins (the optimizer hint below is Oracle-style)
EXPLAIN SELECT /*+ USE_HASH(f d) */ *
FROM fact_sales f JOIN dim_customer d
ON f.cust_sk = d.cust_sk;
Scenario-Based ETL Testing Questions (Real Time)
- Mismatch counts after load: Check filters, rejects, CDC window.
- Unexpected nulls: Validate COALESCE/defaults and source constraints.
- Duplicate facts: Verify business keys and dedupe logic.
- Slow jobs: Analyze joins, partitions, parallelism, indexes.
- Late data impacts aggregates: Recompute affected partitions.
ETL Tools You Must Know
- Informatica – Enterprise mappings, workflows
- Microsoft SQL Server Integration Services (SSIS) – SQL-centric ETL
- Ab Initio – High-performance graphs
- Pentaho – Open-source analytics
- Talend – Cloud & open-source ETL
ETL Defect Examples + Test Case Samples
Defect: Amount doubled in fact table
- Cause: Many-to-many join
- Fix: Correct join keys, add dedupe
Test Case (Sample):
- Objective: Validate SCD2 customer history
- Steps: Load change → verify new row with updated dates
- Expected: Old row expired, new row current_flag=Y
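The sample SCD2 test case above can be scripted directly. A minimal sketch with sqlite3 that applies a change, expires the old version, and asserts the expected state; the schema, dates, and Y/N flag values are illustrative assumptions.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("""CREATE TABLE dim_customer
               (cust_sk INT, cust_id TEXT, city TEXT,
                valid_from TEXT, valid_to TEXT, current_flag TEXT)""")
cur.execute("""INSERT INTO dim_customer
               VALUES (1, 'C1', 'Pune', '2024-01-01', '9999-12-31', 'Y')""")

# Apply an SCD2 change: expire the old row, insert the new current version.
cur.execute("""UPDATE dim_customer
               SET valid_to = '2024-06-30', current_flag = 'N'
               WHERE cust_id = 'C1' AND current_flag = 'Y'""")
cur.execute("""INSERT INTO dim_customer
               VALUES (2, 'C1', 'Mumbai', '2024-07-01', '9999-12-31', 'Y')""")

# Expected: exactly one current row, two versions total for the customer.
current = cur.execute("""SELECT COUNT(*) FROM dim_customer
                         WHERE cust_id = 'C1' AND current_flag = 'Y'""").fetchone()[0]
versions = cur.execute("""SELECT COUNT(*) FROM dim_customer
                          WHERE cust_id = 'C1'""").fetchone()[0]
print(current, versions)  # 1 2
```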
Quick Revision Cheat Sheet
- Validate counts, sums, nulls, duplicates
- Check S2T, SCD1/SCD2, audit fields
- Use window functions for dedupe
- Monitor performance & SLAs
FAQs (Snippet-Ready)
Q: What are real-time ETL testing scenarios?
A: Mismatches, late data, CDC failures, performance issues.
Q: Best SQL for ETL validation?
A: JOINs, GROUP BY, window functions, EXPLAIN plans.
Q: How to test SCD2?
A: Validate versioning, effective dates, current flag.
