ETL Best Practices
Overview
This article explains some good practices to follow when designing and building ETL data queries.
Push transformations as close to the source as possible
Wherever possible, do any data shaping as close as possible to the data source.
There are many ways that you can shape your data (in the Microsoft BI stack).
You can always write SQL code and paste it into your ETL tool to extract the data. The main problem with this approach is that you effectively hard-code a solution for a single data set. If you want to build another data set in the future, the work needs to be done again (either copied or rewritten).
The data shaping tools in Power Query are designed to allow you to do whatever you need without having to rely on a third party – but only use these tools downstream if you need to.
If you have a common need for data in a particular shape and you can get support (e.g. from IT) to shape your data at the source, then there is definitely value in doing that.
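The difference between shaping at the source and shaping downstream can be sketched in miniature. The snippet below is a generic illustration (not Power Query), using Python's built-in sqlite3 module and a made-up `sales` table: the same total is computed once by the database (filter and aggregate pushed to the source, one row returned) and once in the client (every row pulled out first).

```python
import sqlite3

# Hypothetical sales table, used only to illustrate the idea.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("North", 100.0), ("South", 50.0), ("North", 25.0)])

# Pushed to the source: the database filters and aggregates,
# so only one small row crosses the wire.
total_north = conn.execute(
    "SELECT SUM(amount) FROM sales WHERE region = 'North'"
).fetchone()[0]

# Shaped downstream: every row is pulled into the client first,
# then filtered and summed there.
rows = conn.execute("SELECT region, amount FROM sales").fetchall()
total_north_client = sum(a for r, a in rows if r == "North")

assert total_north == total_north_client == 125.0
```

Both give the same answer; the difference is how much data moves and where the work (and the reusable logic) lives.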
Shape in Power Query, Model in Power Pivot
Power Query and Power Pivot (a data modelling tool) were built to do 2 completely different tasks.
Power Query is built for cleaning and transforming, while Power Pivot is built for modelling and reporting. It is possible to shape your data in Power Pivot (e.g. you can write calculated columns and, in newer versions, add calculated tables), but the best practice is to use Power Query to shape your data before/during load, and then use Power Pivot for measures and reporting.
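The division of labour can be illustrated with a small, tool-agnostic sketch (plain Python, hypothetical data): the shaping step fixes text and data types once at load, and the modelling step is then just a measure over an already-clean table.

```python
# Raw extract with messy values -- the kind of thing the shaping step fixes.
raw = [
    {"Product": "  widget ", "Amount": "10.5"},
    {"Product": "gadget",    "Amount": "4"},
    {"Product": "  widget ", "Amount": "2.5"},
]

# Shaping (Power Query's job): trim text and fix data types, once, at load.
clean = [{"Product": r["Product"].strip(), "Amount": float(r["Amount"])}
         for r in raw]

# Modelling (Power Pivot's job): a measure over the already-clean table.
def total_sales(rows, product):
    return sum(r["Amount"] for r in rows if r["Product"] == product)

assert total_sales(clean, "widget") == 13.0
```

If the cleaning leaked into the measure, every new measure would have to repeat it; done at load, it is done once.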
Use a Calendar table
If you want to do any sort of time calculations, get a Calendar table
Although it is possible to analyse all your data in a single flat table without any lookup/dimension tables, if you need to do any time-related calculations, the bottom line is that you need a Calendar table – a special type of lookup/dimension table built for time intelligence functions.
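A Calendar table is just one row per day carrying the attributes that time calculations group and filter by. A minimal sketch in plain Python (the column names here are illustrative, not a fixed standard):

```python
from datetime import date, timedelta

def build_calendar(start, end):
    """One row per day with the attributes time calculations rely on."""
    rows = []
    d = start
    while d <= end:
        rows.append({
            "Date": d,
            "Year": d.year,
            "Quarter": (d.month - 1) // 3 + 1,
            "Month": d.month,
            "MonthName": d.strftime("%B"),
            "DayOfWeek": d.isoweekday(),  # 1 = Monday
        })
        d += timedelta(days=1)
    return rows

calendar = build_calendar(date(2024, 1, 1), date(2024, 12, 31))
assert len(calendar) == 366          # 2024 is a leap year
assert calendar[0]["Quarter"] == 1
```

With such a table related to your fact table on the date, "same period last year" or "month to date" become filters on calendar attributes rather than ad-hoc date arithmetic in every measure.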
A Star schema is optimal
Power Pivot is optimised to use a Star Schema table structure
This is not to say that a star schema is the only layout that will work, or that other designs will always be slow, but designing towards a star schema will put you well on the way to success in general data modelling and in thinking about how to design and build your ETLs.
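The essence of a star schema is a narrow fact table of keys and numeric values, surrounded by dimension tables that carry the descriptive attributes. A toy sketch in plain Python (illustrative table and column names) shows why it works well for filtering: a filter applied to a dimension flows down to the facts related to it.

```python
# Dimension tables: one row per member, keyed by an ID.
dim_product = {1: {"Product": "Widget", "Category": "Tools"},
               2: {"Product": "Gadget", "Category": "Toys"}}
dim_customer = {10: {"Customer": "Acme"}, 11: {"Customer": "Beta"}}

# Fact table: long and narrow -- keys plus numeric values only.
fact_sales = [
    {"ProductID": 1, "CustomerID": 10, "Amount": 100.0},
    {"ProductID": 2, "CustomerID": 10, "Amount": 40.0},
    {"ProductID": 1, "CustomerID": 11, "Amount": 60.0},
]

# Filtering a dimension attribute selects the matching facts.
def sales_for_category(category):
    ids = {k for k, v in dim_product.items() if v["Category"] == category}
    return sum(f["Amount"] for f in fact_sales if f["ProductID"] in ids)

assert sales_for_category("Tools") == 160.0
```

Descriptive text lives once in the small dimension tables instead of being repeated on every fact row, which is also what makes the fact table compress so well.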
Use Long and Narrow tables
Short wide tables are generally bad for fact tables (i.e. event tables, such as sales transactions or purchase transactions); long narrow tables are better.
There are 2 reasons why loading data this way is a good idea.
- Power Pivot is a column store database. It uses advanced compression techniques to store the data efficiently so it takes up less space and so it is fast to access the data when needed. Long narrow tables compress better than short wide tables.
- Power Pivot is designed to quickly and easily filter your data. It is much easier/better to write one formula to add up a single column and then filter on an attribute column, than it is to write many different measures to add up each column separately.
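The second point is easiest to see with a small unpivot sketch (plain Python, made-up data): the wide table needs a separate sum per month column, while the long table needs one measure over a single value column, with the month as a filterable attribute.

```python
# Short wide table: one column per month.
wide = [
    {"Product": "Widget", "Jan": 10, "Feb": 12, "Mar": 9},
    {"Product": "Gadget", "Jan": 4,  "Feb": 6,  "Mar": 5},
]

# Unpivot into a long narrow table: one value column plus attribute columns.
long_rows = [
    {"Product": row["Product"], "Month": month, "Qty": row[month]}
    for row in wide
    for month in ("Jan", "Feb", "Mar")
]

# One measure sums the single value column; filters pick the slice.
def total_qty(rows, **filters):
    return sum(r["Qty"] for r in rows
               if all(r[k] == v for k, v in filters.items()))

assert total_qty(long_rows, Month="Feb") == 18
assert total_qty(long_rows, Product="Widget") == 31
```

In Power Query this reshaping is the built-in Unpivot Columns transformation; the point here is the resulting shape, not the tool.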
Only load the data you need
Load all the data you need, and nothing you don’t need.
If you have data (particularly in extra columns) that you do not need, then don't load it. Loading data you do not need will make your workbooks bigger and slower than they need to be.
In the old world of Excel you might have asked IT to “give you everything” because it was too hard to go back and add the missing columns of data later.
This is no longer the case – bring in all of what you need and nothing you don’t. If you need something else later, then go and get it later. Focus mainly on your large data tables – the dimension tables tend to be smaller and hence are generally less of an issue.
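Column pruning at load time is simple wherever you do it. A generic sketch (plain Python over a hypothetical CSV extract): keep only the columns the model uses and drop the rest before they ever reach the workbook.

```python
import csv
import io

# A hypothetical extract with more columns than the model needs.
extract = io.StringIO(
    "OrderID,Date,Amount,InternalNote,LegacyCode\n"
    "1,2024-01-05,100.0,ignore,X1\n"
    "2,2024-01-06,40.0,ignore,X2\n"
)

NEEDED = ("OrderID", "Date", "Amount")

# Keep only the columns the model uses; discard the rest at load time.
rows = [{k: r[k] for k in NEEDED} for r in csv.DictReader(extract)]

assert set(rows[0]) == set(NEEDED)
assert "InternalNote" not in rows[0]
```

If a dropped column turns out to be needed later, adding it back is a one-line change to the load step – much cheaper than carrying it around "just in case".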
See also
More tips on common pitfalls to avoid when designing ETL processes can be found here:
- Dealing with slowly changing datasets: a handful of tips on how to build ETLs that can better deal with slowly changing datasets (datasets whose structure slowly changes over time)