Monday 27 August 2018

One, Two, Three Phase ETL. Ah ah ah!

A little bit late, but I hope to catch up. My wife and I welcomed the newest junior DBA into our household with the birth of our first son, which has taken a bit of my attention over the last couple of weeks. As promised, though, I am here to discuss ETL processes: specifically, why I choose to embrace the three-phase ETL process, what I gain from it, and what I lose along the way.

The way we extract, transform, and load data differs from use case to use case, and the amount of change the source data requires can vary just as much. That said, some generalizations can be made about the types of transformations required to build a robust business intelligence program. In order of operations (generally speaking):

1. Extract data from source system
2. Ensure data is complete for each entity
3. Ensure data is within the domain constraints
4. Ensure data is clean of errors
5. Remodel data into new schema design
6. Aggregate measures for reporting
7. Mask or encrypt personal information
8. Change coded values to plain English
9. Load data for reporting usage
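To make the middle stages a little more concrete, here is a minimal Python sketch of stages 2 through 4 (completeness, domain constraints, and error cleaning). All field names and rules below are hypothetical examples, not the actual checks from any real system:

```python
# Minimal sketch of stages 2-4: completeness, domain, and error checks.
# The required fields and valid codes here are hypothetical examples.

REQUIRED_FIELDS = {"employee_id", "hire_date", "dept_code"}
VALID_DEPT_CODES = {"HR", "FIN", "OPS"}

def validate(row):
    """Return a list of problems found in one extracted row."""
    problems = []
    # Stage 4 first in practice: basic cleaning, e.g. stripping stray whitespace
    for key, value in row.items():
        if isinstance(value, str):
            row[key] = value.strip()
    # Stage 2: completeness -- every required field must be present and non-empty
    for field in REQUIRED_FIELDS:
        if not row.get(field):
            problems.append(f"missing {field}")
    # Stage 3: domain constraints -- coded values must come from the known set
    if row.get("dept_code") and row["dept_code"] not in VALID_DEPT_CODES:
        problems.append(f"dept_code {row['dept_code']!r} outside domain")
    return problems

clean_rows, rejects = [], []
for row in [{"employee_id": "E1", "hire_date": "2018-01-15", "dept_code": "HR"},
            {"employee_id": "E2", "hire_date": "", "dept_code": "XX"}]:
    issues = validate(row)
    (rejects if issues else clean_rows).append((row, issues))
```

In a real pipeline the rejected rows would land in an error table for review rather than silently disappearing, so nothing falls out of the process unnoticed.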

In a single-phase ETL process, all of this occurs programmatically within one process, whether in SSIS or T-SQL, or using additional scripting tools in Python, R, VB, or C# (as of SQL Server 2017). There is nothing wrong with that, if it works within your existing development process.

This doesn’t work for me or for the team I work with, as it adds a large amount of complexity to a single process and creates a roadblock for rapid development. Single-phase ETL also creates cumbersome documentation requirements and longer validation phases, which increase the risk of human error or missing information and cut down on sustainability. The upside of single-phase ETL is that it reduces system complexity and keeps the entire process in a single location (a job, stored procedure, SSIS package, or project), cutting down on the number of pieces that could potentially fail.

In a two-phase process, operations 1 through 5 are generally handled by the first phase, moving data into a staging area and transforming it into a model. The model is then loaded into the warehouse, where the remaining operations take place. I have also seen cases where operations 1 through 5 are loaded into the warehouse and the remaining operations are done at the reporting layer. In our Microsoft shop we complete the primary stages using the same tools we would for a single phase, then finalize the operations within the data model itself in SSAS, creating our aggregates, masking data, and changing coded values in the model load. The reporting objects (either .rdl or .pbix) are then connected to the transformed data model.

The two-phase style is one I have used before, and to great effect. By separating the primary and secondary processes, we can create rapid prototypes on the first ETL and allow our users to become involved in the testing process. The documentation is split between the two phases and connected at the overall project or initiative level. The primary process still contains the bulk of the workload, however, with the transformations and mastering happening in the first phase, and it can be a stumbling block for the tight timelines I like to achieve to bring value to the organization and begin the process of back-end engagement.

A three-phase ETL process breaks it down further. Stages 1 through 4 are completed in the first phase, pulling (always pull, never push) data from the source and loading it into a staging area database. This area allows us to transform the data and move it into a “clean zone” within the staging area. This clean zone will, in future, allow us to open up pre-cleaned data from all the domains we have completed to the organization's citizen data scientists, researchers, and other business users who have the ability and the will to create highly complex analysis. Stages 5 through 7 are completed in the second phase. This allows us to build the mini-marts (as discussed previously) and build onto them within the warehouse environment. At this point the data remains consistent with values in the transactional system, in that coded values remain coded, although any personally identifiable information has been cleansed, masked, or encrypted, allowing us to control and limit access to the data as necessary. In the third phase, we load the report object and complete stages 8 and 9, providing the final reporting. At any point along the way, we are able to create a prototype report to engage the business user, whether direct from staging, from the model, or with a finished product.
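As one illustration of the stage 7 masking, here is a hedged Python sketch of deterministic masking: hashing an identifier with a salt so the masked value still works as a join key across tables, while the raw value never reaches the warehouse. The salt handling and field names are assumptions for illustration, not a description of any production setup:

```python
import hashlib

# Hypothetical salt -- in practice this would come from secured
# configuration, never from source control.
SALT = b"example-salt"

def mask_id(value: str) -> str:
    """Deterministically mask an identifier: same input, same token,
    so masked values can still be joined across tables."""
    return hashlib.sha256(SALT + value.encode("utf-8")).hexdigest()[:16]

row = {"employee_id": "E12345", "name": "Jane Doe", "salary": 75000}
# Mask the identifier, drop the name outright, keep the measures.
masked = {**row, "employee_id": mask_id(row["employee_id"]), "name": None}
```

Deterministic hashing preserves referential integrity between fact and dimension rows; where even that is too revealing, the field is dropped or encrypted instead.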

By breaking the transformations down into three phases, we keep each individual phase lighter and can rapidly prototype data transformations. Each phase can be tackled in an iterative cycle, letting us begin validations early in the process with our engaged business clients. We also cut down on the overhead of each phase by splitting out the documentation requirements to each phase individually.

Each phase of our three-phase ETL can be developed independently of the others, which allows us to tackle multiple models within a single domain concurrently. If, for example, we are building a new reporting structure for our HR department, we can engage with our HR users to determine the items they would like to report on, or suggest ones that we know from our research would also fit our organization. We then work with them on the design of the report while we begin the work of the first-phase ETL. Once the first phase is done, we can prototype the report with their designs and the base data that has been cleaned but not yet transformed. At this point, we can get immediate feedback on a working product from our end user. As we iterate, we work on the second and third phases of the ETL with subsequent versions, adding the final pieces for a polished, distributable report built in concert with our department, which we can then hand off to them under our deferred ownership model.

This method has served me well, and it comes from multiple different iterations before I landed on the three-phase ETL as my process of choice for Agile BI implementations.

SQL Doch

Monday 6 August 2018

Data Mini-Marts! How to build and deploy rapid value in the MS BI stack.

Let’s talk about mini-marts. When we build out BI, traditionally we work towards launching the data mart: a complete and cohesively built star or snowflake schema that allows us to create reporting and self-service areas to launch to a complete business area.

This is a wonderful endeavour, and I need to ensure everyone knows it is not something I am against at all. What I have found, though, in my experience, is that when working with a small team, or when starting a BI program, it creates a far-off target that must be reached before you can show value to your consumers: in this case, the internal or external clients who consume your data from the mart.

I also need to add a caveat right near the top here: if you and your team are not documenting what you are doing, even with something as simple as a text document that highlights the purpose of every action along the way, you are setting yourselves up for failure. Full transparency within your team is a must.

To build value rapidly, as we do in Agile and in DevOps, and to achieve the shortest turnarounds for our clients, we deploy mini-marts. We concentrate on the areas the business users identify as most important to them: the areas they already use to describe their business. We usually target areas that are being manually compiled by a clerk, team lead, or manager, who collects data from pre-built queries and then cleanses it through a manual process, usually in Excel.

We identify all the relevant areas and fields from the transactional system, as well as any adjacent low-hanging fruit, and we concentrate on building our mini-mart based on this: the customer-identified value, plus a few extras. Using the 80/20 principle, we begin the process of designing our mart from the front end backward.

Power BI On-premises is our current reporting server of choice in our organization, so we usually begin with a mock-up, consisting of a bunch of fake data on a single spreadsheet that feeds the visuals. This is something that can be achieved in half a day at most, and it provides our users with something they can see and interact with to guide the design. It allows them to make rapid choices about what really matters to them, long before we put any effort into creating the ETL for the initiative.

Once they are happy and have given us their feedback, we have a much more solid design from our users that we can then use to design the mart. This technique is guided by the mock-up and a conversational meeting led by the seven-question technique outlined in the model-storming process from “Agile Data Warehouse Design” by Lawrence Corr. This model-storming technique gives us the measures and dimensions, and we can then design our star schema model, usually in a visual manner, using software from Quest or just in Visio.

Now we know what needs to be pulled, transformed, and prepared, and the real building work begins. Using SSIS, we pull the data into our landing zone in the staging instance of SQL Server. This landing zone is a partition specifically designed for us to pull raw data into and, if we need to, to add static tables such as translation values, grouping values, or targets that are updated yearly. (Some of these we then go on to automate through SharePoint forms for ease of data entry.)

We use T-SQL, Python, and R to effect our second-phase ETL, transforming and aligning the base tables for our model. This is the point where we create the star schema, which lives in the clean staging area of the data warehouse. This area also doubles as our future sandbox area, for when we are ready to engage with our organization’s citizen data scientists.
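As a toy sketch of what that second-phase reshaping amounts to (in practice this would run in T-SQL over the staging tables, and all column names here are invented for illustration): derive a dimension table with surrogate keys from the cleaned rows, then rewrite the facts to reference it.

```python
# Toy sketch of a second-phase transform: derive a department dimension
# with surrogate keys, then point the fact rows at it. Column names are
# invented for illustration.

clean_rows = [
    {"employee_id": "E1", "dept_code": "HR",  "hours": 37.5},
    {"employee_id": "E2", "dept_code": "FIN", "hours": 40.0},
    {"employee_id": "E3", "dept_code": "HR",  "hours": 20.0},
]

# Build the dimension: one surrogate key per distinct business key.
dim_department = {}
for row in clean_rows:
    code = row["dept_code"]
    if code not in dim_department:
        dim_department[code] = {"dept_key": len(dim_department) + 1,
                                "dept_code": code}

# Build the fact table: replace the natural key with the surrogate key.
fact_hours = [{"dept_key": dim_department[r["dept_code"]]["dept_key"],
               "employee_id": r["employee_id"],
               "hours": r["hours"]} for r in clean_rows]
```

The same split, dimensions keyed on business identifiers and facts carrying only surrogate keys and measures, is what the fact tables in the clean area are built to support.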

This clean staging area contains mostly transformed data; we do not concentrate on ensuring all the final measures are there, but rather that our fact tables are well built and designed for the final model. We then pull the tables for our mini-mart into an SSAS tabular model, creating the final measures, building all our joins, changing titles to plain language, and ensuring that the mini-mart encompasses the areas we need for the value-add reporting initiative for our business user.
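The plain-language step can be as simple as a lookup against one of the static translation tables loaded earlier. Here is a minimal Python sketch; the codes and labels are made up for illustration:

```python
# Made-up translation table: coded values -> plain-English labels.
status_labels = {"A": "Active", "T": "Terminated", "L": "On Leave"}

rows = [{"employee_id": "E1", "status": "A"},
        {"employee_id": "E2", "status": "L"}]

# Decode each row, falling back to "Unknown" for codes we have not mapped,
# so an unmapped code surfaces in the report instead of dropping the row.
decoded = [{**r, "status": status_labels.get(r["status"], "Unknown")}
           for r in rows]
```

Keeping the code-to-label mapping in a static table (rather than hard-coding it in each report) means one update point when the business renames a category.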

We deploy the mini-mart to our warehouse, connect it to the mock-up report to display real data, and then launch in alpha to a limited user group, which allows us to get them involved in the final data validation of all the business logic built into all three stages of the ETL process. (Next week we will discuss three-phase ETL, and why it is my choice for ETL design.)

As I outlined in the previous post, we now go through the iterative processes, and we have created value for our users in a very short amount of time. Additionally, we have created what I would refer to as the embryo of the greater data mart: the seed that will allow us to build outward in an organic and client-driven way (much like a fractal).

When we get the next project, initiative, or request from the same business area, we already have some idea of where we are starting. We may already have twenty, forty, or even eighty percent of the data we need for the next project in the existing mart. Now we build on it with the next initiative, adding to our dimension tables, perhaps reworking them slightly to conform to the new project, but it is not ground-up work again, and we are still building value for our consumers.

This is the basic concept of the mini-mart, and why I use them for rapid development and immediate value-add for our users. Instead of spending one to three years building a larger data mart, during the first year we launched four primary mini-marts. I then went on to create a second generation of two of them, providing over 15 value-adding reporting initiatives to my internal clients. This process allowed our minuscule team to create labour-time savings, from automating manual compilation and business rules into these mini-marts, that exceeded the time we spent building them, hour for hour, by over fifteen percent!


SQL Doch
