1) What is ETL Testing? (Definition + Example)
ETL Testing validates Extract → Transform → Load processes to ensure data is accurately moved from source systems into a data warehouse (DW) with correct business rules, mappings, and performance.
Example:
A banking DW loads daily transactions from OLTP tables into a fact table. ETL testing verifies:
- Row counts match (after filters)
- Transformations (currency conversion, de-duplication)
- Slowly Changing Dimensions (SCD1/SCD2)
- Audit fields (batch_id, load_ts)
- Performance within SLA
2) DW Flow – Source → Staging → Transform → Load → Reporting
- Source: OLTP, files, APIs
- Staging: Raw landing, minimal transforms
- Transform: Business rules, lookups, aggregations
- Load: Facts/dimensions with constraints
- Reporting: BI tools consume curated data
Key validations: source-to-target (S2T), referential integrity, aggregates, reconciliation, restartability.
3) Ab Initio ETL Architecture (Tester’s View)
- GDE graphs orchestrate flows
- Transforms: Join, Rollup, Normalize, Denormalize
- Metadata: Record formats, layouts
- Control/Audit: Row counts, checksums, reject handling
- Parallelism: Partitioning for performance
4) Interview Questions + Best Answers (Basic → Advanced)
Basics (1–15)
- What is Ab Initio?
A high-performance ETL tool using parallel graphs for large-scale data processing. - What is ETL testing?
Validation of data extraction, transformation logic, and loading accuracy. - Difference between ETL and ELT?
ETL transforms before load; ELT loads first, transforms in DW. - What is S2T mapping?
Document mapping source fields to target fields with rules. - What is staging?
Temporary area for raw data before transformations. - What is a fact table?
Stores measurable metrics (e.g., sales_amount). - What is a dimension?
Descriptive attributes (customer, product). - What are audit fields?
batch_id, load_date, record_source, checksum. - What is reject handling?
Capturing invalid rows with reasons. - What is data reconciliation?
Matching counts/totals between source and target. - What is a lookup?
Reference data used to enrich records. - What is normalization?
Splitting repeating groups into rows. - What is denormalization?
Combining tables for query performance. - What is partitioning?
Dividing data for parallel processing. - What is checksum/hashing?
Detects data changes efficiently.
Intermediate (16–35)
- Explain SCD Type 1 vs Type 2.
SCD1 overwrites; SCD2 preserves history with effective dates. - How do you validate SCD2?
Check new row insert, old row expiry, current_flag. - How to validate joins in Ab Initio?
Verify join keys, cardinality, null handling. - What is rollup?
Aggregates rows (SUM, COUNT) by keys. - How to test aggregations?
Recompute aggregates via SQL and compare. - What is CDC?
Change Data Capture—only deltas are processed. - How do you test CDC?
Compare before/after images, counts by operation. - What is surrogate key?
System-generated unique key for dimensions. - How to validate surrogate keys?
Uniqueness, non-null, sequence integrity. - What is late-arriving dimension?
Fact arrives before dimension; handle via placeholders. - How to test null handling?
Validate defaults, rejects, or pass-through rules. - What is data skew?
Uneven distribution causing performance issues. - How to handle data skew?
Re-partition, salting, skew hints. - What is restartability?
ETL resumes from last successful checkpoint. - How to test restartability?
Fail mid-run; rerun and validate idempotency. - What is metadata testing?
Validate record formats, data types, lengths. - How do you test file formats?
Delimiter, header/footer, encoding. - What is referential integrity?
Fact foreign keys exist in dimensions. - How to validate RI?
Anti-join facts vs dimensions. - What is SLA testing?
Ensure load completes within time limits.
Advanced (36–60)
- How do you test performance tuning?
Partition strategy, parallelism, I/O metrics. - What is window function usage in ETL?
Running totals, dedup with ROW_NUMBER(). - How to test deduplication?
Business key uniqueness post-transform. - What is hashing in SCD2?
Detect attribute change using hash diff. - How to validate hash logic?
Recompute hash in SQL and compare. - What is data lineage?
Trace data from source to report. - How to test lineage?
Verify S2T and report calculations. - What is bad record threshold?
Max allowed rejects before job fails. - How to test thresholds?
Inject bad rows and observe behavior. - What is incremental vs full load?
Delta vs reload all data. - How to test incremental loads?
Counts by date watermark. - How do you validate timestamps/time zones?
Check conversions, DST handling. - What is multi-file dependency?
Jobs relying on multiple inputs. - How to test dependencies?
Control tables, arrival checks. - What is reconciliation at report level?
BI totals match DW aggregates.
5) Real SQL Query Examples (Validation)
Sample Data
Source Orders (src_orders)
(order_id, cust_id, amount, order_dt)
Target Fact (fact_sales)
(order_sk, cust_sk, sales_amt, order_dt, batch_id)
JOIN Validation
SELECT COUNT(*)
FROM src_orders s
LEFT JOIN fact_sales f
ON s.order_id = f.order_sk
WHERE f.order_sk IS NULL;
GROUP BY Aggregation
SELECT order_dt, SUM(amount) src_sum
FROM src_orders
GROUP BY order_dt;
SELECT order_dt, SUM(sales_amt) tgt_sum
FROM fact_sales
GROUP BY order_dt;
Window Function (De-dup)
SELECT *
FROM (
SELECT *, ROW_NUMBER() OVER
(PARTITION BY order_id ORDER BY load_ts DESC) rn
FROM stage_orders
) t
WHERE rn = 1;
Performance Tuning Check
EXPLAIN ANALYZE
SELECT cust_sk, SUM(sales_amt)
FROM fact_sales
GROUP BY cust_sk;
6) Scenario-Based ETL Testing Questions
- Mismatch counts: Validate filters, joins, rejects
- Nulls in mandatory fields: Defaults vs rejects
- Duplicate records: Window functions, business keys
- Late arriving data: Backdated updates
- Slow job: Partitioning, indexes, parallelism
7) Tools Commonly Used by ETL QA
- Informatica
- Microsoft SSIS
- Ab Initio
- Pentaho
- Talend
(Testers validate logic consistently across tools.)
8) ETL Defect Examples + Test Cases
Defect: SCD2 not expiring old row
- Expected: current_flag=N, end_date populated
- Actual: old row still current
- Severity: High
Test Case:
- Input change in dimension attribute
- Validate two rows exist with correct flags
9) Quick Revision Sheet (Cheat Notes)
- S2T, SCD1/SCD2, CDC
- Counts, aggregates, RI
- Window functions for dedup
- Hashing for change detection
- Performance: partition, parallelism
10) FAQs (Snippet-Ready)
Q: Is SQL mandatory for Ab Initio ETL testing?
Yes—SQL is essential for validation and reconciliation.
Q: How many rounds focus on ETL QA?
Usually 1–2 deep technical rounds.
Q: Can Informatica experience help?
Absolutely—concepts are transferable.
