Slowly Changing Dimensions (SCD) is a data engineering technique to track changes in a record over time. They're useful when changes are unpredictable, infrequent, and a historical vew of the data is needed.
SQL is ubiquitous in data engineering, and therefore it is the perfect tool to implement SCD with.
Here's a quick five minute video demonstrating how to apply SCD to a small fake data set. The video is rough and ready, but it gets the point across. I recommend watching in high-def if you cannot read the text - I will figure out how to make the text bigger and sharper for next time!
SCD uses extra columns to track changes. There are multiple types of SCD, and Wikipedia has a good description. The demo above uses the following code, which has an extra seven columns to track changes on a single table.
CREATE TABLE customers ( -- our data id varchar, name varchar, address varchar, -- our SCD columns record_id integer primary key, record_insert_ts timestamp, record_update_ts timestamp, record_start_date date, record_end_date date, record_active int, record_hash uint64 );
record_id is a surrogate key.
record_insert_ts is the timestamp when the record was inserted.
record_update_ts tracks when the record was updated. This is standard database administration.
record_end_date are the start and end dates of the record. This is the core of SCD. The start date is when the record was inserted, and the end date is when the record was updated. The end date is set to a far future date, and when the record is updated, the end date is set to the day before the update, and a new record is inserted with the start date set to the day of the update.
record_active is a handy flag to allow quick queries to only return the latest record. It effectively duplicates the functionality of querying for
record_end_date = '2999-12-31' or null, or another date far in the future. It's convenient syntax.
record_hash is also a convenient column. It's a hash of all the columns we want to track changes in. If the hash of the current record is different from the hash of the new record, then the record has changed.
Once those fields are setup, we can use a simple pipeline of SQL to insert and track changes.
Sourcing the data means getting the data into the system.
I'm using DuckDB to showcase the SQL. If you're not familiar with it, it's like SQLite but for analytics. It also provides very handy functions for reading data from files in different formats.
To load data from a CSV, that contains headers, the following SQL can be used:
SELECT * FROM read_csv_auto('data_part_0.csv', header=True) limit 5;
Next, we stage the data. This is an intermediate step to make the data easier to work with.
Here we use the hash, and some Common Table Expressions (CTEs), to construct a dataset that flags new records and changed records with a
drop table if exists customers_stg; create table customers_stg as with stg as ( select id as stg_id, name as stg_name, address as stg_address, hash(id, name, address) as stg_record_hash, effective_date as stg_effective_date, from customers_src ), active_customers as ( select * from customers where record_active = 1 ) select s.*, case when c.id is null then 1 else 0 end new_ind, case when c.id is not null and s.stg_record_hash <> c.record_hash then 1 else 0 end track_ind from stg s left join active_customers c on s.stg_id = c.id ;
After constructing the intermediate staged data set, we can apply the SCD logic to our customers table.
This SQL has two parts, wrapped in a transaction to maintain consistency. One statement to close the old record, and one statement to insert the new record.
We use the date '2999-12-31' to indicate the record hasn't ended yet. Other implementations are possible, such as using a null value.
begin transaction; update customers c set record_end_date = stg_effective_date - interval '1 day', record_active = 0, record_update_ts = current_timestamp from customers_stg s where c.id = s.stg_id and record_end_date = '2999-12-31' and track_ind = 1; insert into customers( id, name, address, --scd fields: record_id, record_insert_ts, record_update_ts, record_start_date, record_end_date, record_active, record_hash ) select stg_id, stg_name, stg_address, --scd fields: nextval('seq_customer_record_id'), current_timestamp as record_insert_ts, current_timestamp as record_update_ts, stg_effective_date, '2999-12-31' as record_end_date, 1 as record_active, stg_record_hash from customers_stg where track_ind = 1 or new_ind = 1; commit transaction;