Ab Initio ETL Testing Interview Questions – Real-Time QA Guide (40–150 Q&A)

1) What is ETL Testing? (Definition + Example)

ETL Testing validates Extract → Transform → Load processes to ensure data is accurately moved from source systems into a data warehouse (DW) with correct business rules, mappings, and performance.

Example:
A banking DW loads daily transactions from OLTP tables into a fact table. ETL testing verifies:

  • Row counts match (after filters)
  • Transformations (currency conversion, de-duplication)
  • Slowly Changing Dimensions (SCD1/SCD2)
  • Audit fields (batch_id, load_ts)
  • Performance within SLA

2) DW Flow – Source → Staging → Transform → Load → Reporting

  1. Source: OLTP, files, APIs
  2. Staging: Raw landing, minimal transforms
  3. Transform: Business rules, lookups, aggregations
  4. Load: Facts/dimensions with constraints
  5. Reporting: BI tools consume curated data

Key validations: source-to-target (S2T), referential integrity, aggregates, reconciliation, restartability.


3) Ab Initio ETL Architecture (Tester’s View)

  • GDE graphs orchestrate flows
  • Transforms: Join, Rollup, Normalize, Denormalize
  • Metadata: Record formats, layouts
  • Control/Audit: Row counts, checksums, reject handling
  • Parallelism: Partitioning for performance

4) Interview Questions + Best Answers (Basic → Advanced)

Basics (1–15)

  1. What is Ab Initio?
    A high-performance ETL tool using parallel graphs for large-scale data processing.
  2. What is ETL testing?
    Validation of data extraction, transformation logic, and loading accuracy.
  3. Difference between ETL and ELT?
    ETL transforms before load; ELT loads first, transforms in DW.
  4. What is S2T mapping?
    Document mapping source fields to target fields with rules.
  5. What is staging?
    Temporary area for raw data before transformations.
  6. What is a fact table?
    Stores measurable metrics (e.g., sales_amount).
  7. What is a dimension?
    Descriptive attributes (customer, product).
  8. What are audit fields?
    batch_id, load_date, record_source, checksum.
  9. What is reject handling?
    Capturing invalid rows with reasons.
  10. What is data reconciliation?
    Matching counts/totals between source and target.
  11. What is a lookup?
    Reference data used to enrich records.
  12. What is normalization?
    Splitting repeating groups into rows.
  13. What is denormalization?
    Combining tables for query performance.
  14. What is partitioning?
    Dividing data for parallel processing.
  15. What is checksum/hashing?
    Detects data changes efficiently.

Intermediate (16–35)

  1. Explain SCD Type 1 vs Type 2.
    SCD1 overwrites; SCD2 preserves history with effective dates.
  2. How do you validate SCD2?
    Check new row insert, old row expiry, current_flag.
  3. How to validate joins in Ab Initio?
    Verify join keys, cardinality, null handling.
  4. What is rollup?
    Aggregates rows (SUM, COUNT) by keys.
  5. How to test aggregations?
    Recompute aggregates via SQL and compare.
  6. What is CDC?
    Change Data Capture—only deltas are processed.
  7. How do you test CDC?
    Compare before/after images, counts by operation.
  8. What is surrogate key?
    System-generated unique key for dimensions.
  9. How to validate surrogate keys?
    Uniqueness, non-null, sequence integrity.
  10. What is late-arriving dimension?
    Fact arrives before dimension; handle via placeholders.
  11. How to test null handling?
    Validate defaults, rejects, or pass-through rules.
  12. What is data skew?
    Uneven distribution causing performance issues.
  13. How to handle data skew?
    Re-partition, salting, skew hints.
  14. What is restartability?
    ETL resumes from last successful checkpoint.
  15. How to test restartability?
    Fail mid-run; rerun and validate idempotency.
  16. What is metadata testing?
    Validate record formats, data types, lengths.
  17. How do you test file formats?
    Delimiter, header/footer, encoding.
  18. What is referential integrity?
    Fact foreign keys exist in dimensions.
  19. How to validate RI?
    Anti-join facts vs dimensions.
  20. What is SLA testing?
    Ensure load completes within time limits.

Advanced (36–60)

  1. How do you test performance tuning?
    Partition strategy, parallelism, I/O metrics.
  2. What is window function usage in ETL?
    Running totals, dedup with ROW_NUMBER().
  3. How to test deduplication?
    Business key uniqueness post-transform.
  4. What is hashing in SCD2?
    Detect attribute change using hash diff.
  5. How to validate hash logic?
    Recompute hash in SQL and compare.
  6. What is data lineage?
    Trace data from source to report.
  7. How to test lineage?
    Verify S2T and report calculations.
  8. What is bad record threshold?
    Max allowed rejects before job fails.
  9. How to test thresholds?
    Inject bad rows and observe behavior.
  10. What is incremental vs full load?
    Delta vs reload all data.
  11. How to test incremental loads?
    Counts by date watermark.
  12. How do you validate timestamps/time zones?
    Check conversions, DST handling.
  13. What is multi-file dependency?
    Jobs relying on multiple inputs.
  14. How to test dependencies?
    Control tables, arrival checks.
  15. What is reconciliation at report level?
    BI totals match DW aggregates.

5) Real SQL Query Examples (Validation)

Sample Data

Source Orders (src_orders)
(order_id, cust_id, amount, order_dt)

Target Fact (fact_sales)
(order_sk, cust_sk, sales_amt, order_dt, batch_id)

JOIN Validation

SELECT COUNT(*) 

FROM src_orders s

LEFT JOIN fact_sales f

  ON s.order_id = f.order_sk

WHERE f.order_sk IS NULL;

GROUP BY Aggregation

SELECT order_dt, SUM(amount) src_sum

FROM src_orders

GROUP BY order_dt;

SELECT order_dt, SUM(sales_amt) tgt_sum

FROM fact_sales

GROUP BY order_dt;

Window Function (De-dup)

SELECT *

FROM (

  SELECT *, ROW_NUMBER() OVER

    (PARTITION BY order_id ORDER BY load_ts DESC) rn

  FROM stage_orders

) t

WHERE rn = 1;

Performance Tuning Check

EXPLAIN ANALYZE

SELECT cust_sk, SUM(sales_amt)

FROM fact_sales

GROUP BY cust_sk;


6) Scenario-Based ETL Testing Questions

  • Mismatch counts: Validate filters, joins, rejects
  • Nulls in mandatory fields: Defaults vs rejects
  • Duplicate records: Window functions, business keys
  • Late arriving data: Backdated updates
  • Slow job: Partitioning, indexes, parallelism

7) Tools Commonly Used by ETL QA

  • Informatica
  • Microsoft SSIS
  • Ab Initio
  • Pentaho
  • Talend

(Testers validate logic consistently across tools.)


8) ETL Defect Examples + Test Cases

Defect: SCD2 not expiring old row

  • Expected: current_flag=N, end_date populated
  • Actual: old row still current
  • Severity: High

Test Case:

  • Input change in dimension attribute
  • Validate two rows exist with correct flags

9) Quick Revision Sheet (Cheat Notes)

  • S2T, SCD1/SCD2, CDC
  • Counts, aggregates, RI
  • Window functions for dedup
  • Hashing for change detection
  • Performance: partition, parallelism

10) FAQs (Snippet-Ready)

Q: Is SQL mandatory for Ab Initio ETL testing?
Yes—SQL is essential for validation and reconciliation.

Q: How many rounds focus on ETL QA?
Usually 1–2 deep technical rounds.

Q: Can Informatica experience help?
Absolutely—concepts are transferable.

Leave a Comment

Your email address will not be published. Required fields are marked *