What Is ETL Testing? (Definition + PwC-Style Example)
ETL Testing is the process of validating data during Extract, Transform, and Load operations to ensure correctness, completeness, consistency, auditability, and performance in a Data Warehouse (DW).
PwC real-world project example
In consulting and assurance programs delivered by PwC, ETL testing commonly supports:
- Financial & regulatory reporting
- Risk and compliance analytics
- Multi-source enterprise data platforms
An ETL tester ensures that source data, transformation logic, and final reports are fully traceable, auditable, and compliant with business and regulatory rules.
Data Warehouse Flow: Source → Staging → Transform → Load → Reporting
- Source – OLTP databases, ERP/CRM, flat files, APIs
- Staging – Raw data landing area (no business rules)
- Transformation – Cleansing, joins, aggregations, SCD logic
- Load – Fact and Dimension tables
- Reporting – BI dashboards, MIS, regulatory reports
PwC ETL Testing Interview Questions & Answers
(Basic → Advanced | Scenario-Based & Real-Time)
🔹 Basic ETL Testing Questions (1–15)
- What is ETL testing?
Validating data accuracy, completeness, and transformations during ETL processing. - Why is ETL testing critical in PwC projects?
Because data is used for financial, audit, and regulatory decision-making. - What is Source-to-Target (S2T) mapping?
A document defining how each source column maps to target columns with transformations. - What are the types of ETL testing?
Source testing, staging testing, transformation testing, target testing, reconciliation, performance testing. - What is staging area testing?
Verifying staging data exactly matches source data. - What are audit fields in ETL?
load_date, batch_id, created_by, updated_date. - How do you validate record counts?
Compare source vs target counts after filters and transformations. - What is data reconciliation?
Matching totals, counts, and metrics between source and target. - Difference between fact and dimension tables?
Facts store measures; dimensions store descriptive attributes. - What is data warehouse testing?
Testing data models, ETL logic, and reports. - What is full load vs incremental load?
Full loads reload all data; incremental loads only deltas. - What is reject data testing?
Validating rejected records and error reasons. - What is data quality testing?
Accuracy, completeness, consistency checks. - What is mapping validation?
Ensuring transformations follow S2T exactly. - What is end-to-end ETL testing?
Source → DW → Report validation.
🔹 Scenario-Based Intermediate Questions (16–35)
- Scenario: Source has 1M records, target has 960K. What do you check?
Filters, rejected rows, duplicate removal, join conditions. - How do you test NULL handling?
Validate mandatory columns and default value logic. - What is SCD Type 1?
Overwrites old data without maintaining history. - What is SCD Type 2?
Maintains history using effective_from, effective_to, and current_flag. - Scenario: Customer address changes. What do you test?
SCD2 history row creation and expiry of old record. - What is surrogate key testing?
Ensure surrogate keys are unique and non-null. - What is a late-arriving dimension?
Fact arrives before dimension; validate dummy/unknown keys. - Scenario: Duplicate rows appear in target but not in source. Why?
Many-to-many joins or missing dedup logic. - How do you test aggregation logic?
Compare aggregated source data with target totals. - What is hashing used for in ETL?
Change detection and deduplication. - How do you test CDC (Change Data Capture)?
Validate only changed records are processed. - Scenario: Negative revenue appears in report. What do you test?
Business rule validation and transformation logic. - What is referential integrity testing?
Fact foreign keys must exist in dimension tables. - Scenario: Data type mismatch error occurs. What to validate?
Casting, truncation, and precision rules. - How do you test date transformations?
Time zone, format, and business calendar validation. - What is soft delete testing?
Validate delete_flag instead of physical deletion. - Scenario: ETL rerun creates duplicates. What failed?
Restartability and idempotency logic. - What is schema drift?
Source schema changes impacting ETL. - How do you validate BI reports?
Reconcile report numbers with DW tables. - What is data masking testing?
Ensure PII fields are obfuscated in target.
🔹 Advanced & PwC Real-Time Scenarios (36–55)
- Scenario: ETL job misses SLA. What do you analyze?
Query plans, indexing, partitioning, parallelism. - How do you test ETL performance tuning?
Validate join strategies and execution plans. - Scenario: Many-to-many join inflates data. What test?
Join cardinality validation. - How do you validate window functions?
Verify ranking, deduplication, and SCD logic. - Scenario: Late data affects aggregates. What do you do?
Recalculate impacted partitions. - How do you test audit and control tables?
Validate batch status, row counts, timestamps. - Scenario: Regulatory report mismatch. First check?
S2T mapping and aggregation logic. - How do you test ETL rollback?
Ensure partial loads are reverted. - Scenario: Parallel jobs cause locking. What test?
Concurrency and isolation testing. - How do you test historical reloads?
Check duplication and SCD logic. - Scenario: Source file arrives late. What test?
Dependency handling and rerun logic. - How do you validate checksum/hash totals?
Compare source vs target hashes. - Scenario: Currency conversion mismatch. What to validate?
Exchange rate tables and application logic. - How do you test cloud DW ETL pipelines?
Scalability, cost, and performance. - Scenario: NULLs after LEFT JOIN. Why?
Missing dimension records. - How do you test archival logic?
Validate data movement to history tables. - Scenario: Incorrect SCD expiry date. Root cause?
Effective date logic issue. - How do you validate metadata tables?
Batch IDs, counts, load timestamps. - Scenario: Unexpected data spike. What test?
Source anomaly and business rule validation. - How do you perform end-to-end ETL testing?
Source → DW → Report reconciliation.
Real SQL Query Examples for ETL Validation
Sample Tables
- src_orders(order_id, cust_id, amount, order_dt)
- dim_customer(cust_sk, cust_id, current_flag)
- fact_sales(order_id, cust_sk, amount, load_dt)
1️⃣ Record Count Validation
SELECT COUNT(*) FROM src_orders;
SELECT COUNT(*) FROM fact_sales;
2️⃣ JOIN Validation
SELECT COUNT(*) AS missing_dim
FROM fact_sales f
LEFT JOIN dim_customer d
ON f.cust_sk = d.cust_sk
WHERE d.cust_sk IS NULL;
3️⃣ GROUP BY Aggregation
SELECT cust_id, SUM(amount)
FROM src_orders
GROUP BY cust_id;
4️⃣ Window Function – Deduplication
SELECT *
FROM (
SELECT order_id,
ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY order_dt DESC) rn
FROM src_orders
) t
WHERE rn = 1;
5️⃣ Performance Tuning Validation
EXPLAIN
SELECT *
FROM fact_sales
WHERE load_dt >= CURRENT_DATE – 1;
ETL Tools Commonly Asked in PwC Interviews
- Informatica
- Microsoft SQL Server Integration Services
- Ab Initio
- Pentaho
- Talend
ETL Defect Examples + Test Case Sample
Defect Example
Issue: Duplicate records in fact table
- Root Cause: Incorrect join logic
- Fix: Correct join keys and add deduplication
Sample Test Case
- Test Case: Validate SCD2 update
- Expected Result:
- Old record expired
- New record inserted with current_flag = ‘Y’
- Old record expired
PwC ETL Testing Quick Revision Sheet
- Validate counts, sums, NULLs, duplicates
- Check S2T mappings thoroughly
- Test SCD1, SCD2, audit fields, hashing
- Use JOIN, GROUP BY, window functions
- Always reconcile source vs target vs reports
FAQs (SEO & Snippet Friendly)
Q1. What are PwC ETL testing interview questions?
They focus on real-time ETL scenarios, reconciliation, and regulatory-grade data accuracy.
Q2. Which SQL is important for PwC ETL roles?
JOINs, GROUP BY, window functions, and performance tuning queries.
Q3. How do you test SCD2 in real projects?
By validating history rows, effective dates, and current flags.
