Monday 27 August 2018

One, Two, Three Phase ETL. Ah ah ah!

A little bit late, but I hope to catch up. My wife and I welcomed the newest junior DBA into our household with the birth of our first son, which has taken a bit of my attention in the last couple weeks. As promised though, I am here to discuss ETL processes, specifically, why it is I choose to embrace the three phase ETL process, what I gain from it, and what I lose along the way.

The way we extract, transform, and load data is different from use case to use case, and the amount of change required to the data from the source system can vary from case to case. That being said, there are some generalizations on the type of transformations required to build a robust business intelligence program. In order of operations (generally speaking):

1. Extract data from source system
2. Ensure data is complete for each entity
3. Ensure data is within the domain constraints
4. Ensure data is clean of errors
5. Remodel data into new schema design
6. Aggregate measures for reporting
7. Mask or encrypt personal information
8. Change coded values to plain English
9. Load data for reporting usage

In a single or one phase ETL process, all of this would occur programmatically within a single process, whether it be in SSIS or in TSQL or utilizing additional scripting tools in Python, R, VB, or C# (as of SQL Server 2017). There is nothing wrong with that being the case, if that works within your existing process of development.

This doesn’t work for me and for the team I work with, as it adds a large amount of complexity to a single process, and creates a roadblock for rapid development. Single phase ETL also creates cumbersome documentation requirements and longer validation phases that increase risks for human error, or missing information, and cuts down on sustainability. The upside of single phase ETL, is it reduces system complexity, as well as keeps the entire process in a single location, whether a job, stored procedure or SSIS package or Project, cutting down on the number of pieces that could potentially fail.

In a two-phase process, operations 1 through 5 are generally handled by the first phase, moving data into a staging area and transforming it into a model. The model is then loaded into the warehouse where the remaining operations take place. There are some cases as well where I have seen operations 1 through 5 loaded into the warehouse, and the remaining operations done at the reporting layer. In our Microsoft shop we would complete the primary stages using the same tools we would for a single phase, and then we would finalize the operations within the data model itself in SSAS, creating our aggregates, masking and changing coded values in the model load. The reporting objects (either .rdl or .pbix) are then connected to the data model with the transformations.

The two-phase style is one I have used before, and to great effect. By separating the primary and secondary processes, we can create rapid prototypes on the first ETL, and allow our users to become involved in the testing process. The documentation is split between the two phases and connected in the overall project or initiative level. The primary process contains the bulk of the workload, with the transformations and mastering happening at in the first phase, and can still be a stumbling block for the tight timelines I like to achieve to bring value to the organization, and begin the process of back end engagement.

A three phase ETL process breaks it down further. Stages 1 through 4 are completed int the first phase, pulling (always pull, never push) data from the source, and loading it into a staging area database. This area allows us to transform the data and move it into a “clean zone” within the staging area. This clean zone will, in future, allow us to open up pre-cleaned data from all the domains we have complete to the organizational citizen data scientists, researchers, and other business users that have the ability and will to create highly complex analysis. Stages 5 through 7 are completed in the second phase. This allows us to build the mini-marts (as discussed previously) and build onto them within the warehouse environment. At this point the data remains consistent with values in the transactional system in that coded values remain coded, although any personally identifiably information has been cleansed, masked, or encrypted, allowing us to control and limit access to the data as necessary. In the third phase, we load the report object and complete stages 8 and 9 providing the final reporting. At any point along the way, we are able to create a prototype report to engage the business user, either direct from staging, from the model, or with a finished product.

By breaking down the stages of transformations into the three phases, we keep each individual phase lighter and allow us to rapidly prototype data transformations. Each phase can be tackled in an iterative cycle and allow us to begin validations early in the process with our engaged business clients. We also cut down on the overhead for each phase by splitting out the documentation requirements to each phase individually.

Each phase of our three phase ETL can be developed independent of the others, which allows us to tackle multiple models within a single domain concurrently. If, for example, we are building a new reporting structure for our HR department, we can engage with our HR users, and determine the items they would like to report on, or suggest the ones that we know from our research would also fit within our organization. We then work with them on the design of the report, while at the same time, we begin the work of the first phase ETL. Once the first phase is done, we can prototype the report with their designs, and the base data that has been cleaned, but not yet transformed. At this point, we have the ability to get immediate feedback with a working product with our end user. As we iterate, we work on the second and third phases of the ETL with subsequent versions, adding in the final pieces for a polished and distributable report, built in concert with our department, and able to hand it off to them, with our deferred ownership model.

This method has served me well, and comes from multiple different iterations before I landed on the three phase ETL as the process method of choice for AGILE BI implementations.

Cheers!
SQL Doch

No comments:

Post a Comment

Reflections from the Summit

This past week I attended PASS Summit 2019 in Seattle. I had an amazing time and it was great to catch up with friends and colleagues, b...