1. What is ETL Testing? (Definition + Example)
ETL Testing is the process of validating data during Extract, Transform, and Load operations to ensure accuracy, completeness, consistency, and performance when data moves from source systems into a data warehouse (DW).
Real-Time Example (ETL QA Role)
In an e-commerce analytics project:
- Source: Order DB, customer CRM, product files
- Target: Enterprise Data Warehouse
- ETL QA validates:
- Source vs target record counts
- Transformation rules (discounts, tax, currency conversion)
- SCD Type 1 & Type 2 for customer/product dimensions
- Audit fields (batch_id, load_date, checksum)
- Performance SLA for daily loads
- Source vs target record counts
ETL QA testing focuses on data correctness and business logic, not UI behavior.
2. DW Flow – Source → Staging → Transform → Load → Reporting
- Source Layer
OLTP databases, APIs, flat files (CSV, JSON, XML) - Staging Layer
Raw extracted data with minimal validation - Transformation Layer
Business rules, joins, aggregations, lookups, SCD handling - Load Layer
Fact tables and dimension tables - Reporting Layer
BI dashboards, analytics, regulatory reports
QA focus: S2T mapping validation, reconciliation, referential integrity, aggregates, and performance.
3. ETL Architecture – QA Tester’s Perspective
- ETL Tools Layer: Informatica, SSIS, Ab Initio
- Mapping & Metadata Layer: S2T documents, business rules
- Control Tables: Job status, row counts, error counts
- Audit Framework: Batch tracking, hashing, rejects
- Performance Layer: Partitioning, parallel processing
ETL QA ensures data quality, restartability, and SLA compliance.
4. ETL QA Testing Interview Questions & Answers (Basic → Advanced)
🔹 Basic ETL QA Interview Questions (1–15)
- What is ETL QA testing?
Validation of data extraction, transformation, and loading into a DW. - Why is ETL QA important?
Incorrect data leads to wrong business decisions. - What is a data warehouse?
Centralized repository for analytics and reporting. - What is a staging area?
Temporary storage for raw extracted data. - What is S2T mapping?
Source-to-Target document defining column mappings and rules. - What is a fact table?
Stores measurable business metrics. - What is a dimension table?
Stores descriptive attributes. - What are audit fields?
batch_id, load_date, source_system, checksum. - What is data reconciliation?
Comparing source and target data. - What is full load?
Reloading entire target table. - What is incremental load?
Loading only new or changed records. - What is primary key validation?
Ensuring uniqueness and non-null values. - What is reject data?
Records failing validation rules. - What is data profiling?
Analyzing source data patterns. - What is null validation?
Verifying null handling per business rules.
🔹 Intermediate ETL QA Questions (16–35)
- Explain SCD Type 1.
Overwrites old dimension data without history. - Explain SCD Type 2.
Maintains historical records using effective dates and flags. - How do you test SCD2?
Validate new row insertion, old row expiry, current_flag. - What is a surrogate key?
System-generated unique identifier. - How do you validate surrogate keys?
Check uniqueness and non-null constraints. - What is CDC (Change Data Capture)?
Processing only changed or new records. - How do you test CDC logic?
Compare before/after snapshots. - What is referential integrity testing?
Fact FK must exist in dimension PK. - What is aggregation testing?
Validating SUM, COUNT, AVG logic. - What is lookup testing?
Verifying reference data mappings. - What is deduplication?
Removing duplicate business keys. - How do you test dedup logic?
Using GROUP BY or window functions. - What is late-arriving dimension?
Fact arrives before dimension. - What is data skew?
Uneven data distribution impacting performance. - What is restartability testing?
Ensuring ETL resumes correctly after failure. - What is metadata testing?
Validating column names, types, lengths. - What is data lineage?
Tracking data from source to report. - What is threshold testing?
Job fails if reject count exceeds limit. - What is hashing in ETL?
Detecting data changes efficiently. - What is SLA testing?
Ensuring ETL jobs complete on time.
🔹 Advanced & Scenario-Based Questions (36–60)
- How do you handle record count mismatch?
Check filters, joins, rejects, CDC logic. - How do you validate null handling?
Defaults vs rejects vs allowed nulls. - How do you test ETL performance?
Partitioning, indexing, parallelism. - How do you test incremental loads?
Validate watermark logic. - How do you test multi-source joins?
Validate join keys and cardinality. - How do you test aggregation failures?
Recalculate totals using SQL. - How do you test timezone conversions?
Validate timestamps across zones. - How do you test re-runs?
Ensure no duplicate data. - How do you test historical data loads?
Validate backdated inserts. - How do you validate audit tables?
Compare source_count vs target_count. - What causes ETL performance issues?
Data skew, large joins, missing indexes. - How do you test file-based ETL?
Header, footer, delimiter, encoding. - How do you test schema changes?
Backward compatibility validation. - How do you validate reporting data?
BI totals vs DW aggregates. - Explain a critical ETL QA defect you found.
Example: SCD2 failure, incorrect aggregation.
5. Real SQL Query Examples for ETL QA Validation
Sample Tables
src_orders(order_id, cust_id, amount, order_date)
fact_sales(order_sk, cust_sk, sales_amt, order_date, batch_id)
JOIN Validation – Missing Records
SELECT COUNT(*)
FROM src_orders s
LEFT JOIN fact_sales f
ON s.order_id = f.order_sk
WHERE f.order_sk IS NULL;
GROUP BY – Aggregation Validation
SELECT order_date, SUM(amount)
FROM src_orders
GROUP BY order_date;
SELECT order_date, SUM(sales_amt)
FROM fact_sales
GROUP BY order_date;
Window Function – De-duplication
SELECT *
FROM (
SELECT *,
ROW_NUMBER() OVER
(PARTITION BY order_id ORDER BY load_date DESC) rn
FROM stage_orders
) t
WHERE rn = 1;
Performance Tuning Validation
EXPLAIN ANALYZE
SELECT cust_sk, SUM(sales_amt)
FROM fact_sales
GROUP BY cust_sk;
6. Scenario-Based ETL QA Testing Examples
| Scenario | QA Validation |
| Record mismatch | Source vs target counts |
| Null handling | Default or reject logic |
| Duplicate data | Window function checks |
| Late-arriving data | SCD2 backdated insert |
| Slow job | Partitioning & indexing |
7. ETL Tools Commonly Used by QA Teams
- Informatica
- Microsoft SSIS
- Ab Initio
- Pentaho
- Talend
ETL QA concepts remain same across tools.
8. ETL QA Defect Examples + Sample Test Case
Defect: SCD2 history not maintained
- Expected: Old row expired, new row inserted
- Actual: Two active records
- Severity: High
Sample Test Case:
- Update dimension attribute
- Validate effective_start_date, effective_end_date, current_flag
9. ETL QA Testing – Quick Revision Sheet
- ETL flow & architecture
- S2T mapping validation
- SCD1 vs SCD2
- SQL JOIN, GROUP BY, window functions
- Hashing, CDC, audit fields
- Performance & reconciliation
10. FAQs – Featured Snippet Ready
Q1. Is SQL mandatory for ETL QA testing roles?
Yes. SQL is essential for data validation.
Q2. Are ETL QA interviews tool-specific?
Mostly no. Concepts matter more than tools.
Q3. Can ETL QA testing be automated?
Yes, especially for regression and large datasets.
