Overview
Introducing ETL
Often data needs to be transformed and cleaned before any meaningful analysis can be done on it. But the process of transforming data into a form that can easily be processed is sadly, often a very difficult task:
- Repeated rows
- Misspelt fields
- Empty cells
- Invalid dates
- Data stored as the wrong data type
- Cells with errors in them
- Whole numbers used instead of percentages
- Problems with long and complex formulas
- Errors that creep from repeating the procedures over and over again
- the list goes on…
The good news is that ETL (or Extract, Transform, and Load in long-form) is a technology designed to help users overcome many of these challenges - so more time can be spent on accurate analysis and decision making, and less time on preparing poorly shaped or formatted data.
There are now several ETL tools on the market, but one that has gained a lot of traction, and favour amongst data engineers and technologists is: Power Query - an ETL tool that can re-shape data from a variety of sources, in a very robust, but also slick and user-friendly way!
This guide gives a very gentle introduction to what Power Query is, and how it can be used to perform advanced extract, transform, and load (ETL) processes, as an aid to carrying out more robust forms of data analysis.
Power Query, & Self-Service Business transformation
Power Query is an Excel add-in that was developed by Microsoft, to make it easier to load data into Excel (or Power BI) from external data sources. It is one of a suite of tools that can be used to undertake what is now, in the trade, collectively called: “self-service business intelligence”.
The term “self-service” is used because people who need to use this data, the analysts, accountants, general managers, executives, bankers, engineers, scientists, etc., are also the people building the reports. In other words, they no longer have to rely on help from IT departments, like they used to.
In terms of enterprise-level ETL tool-development, Power Query was probably also the first tool built specifically for business users - and as many of those in data & technology communities would advocate, it does its job exceptionally well!
What can Power Query be used for?
Power Query can be used for:
- Extract Data: Directly connecting to a data source and pulling it into one place so you can work with it. A live connection is retained to the source so when the data is updated you can “refresh” the same query without having to extract the data again.
- Transform Data: Cleaning and reshaping the data so that when you are ready to use it, it is transformed into the format you need, instead of the format you are given. Such, transformation steps only need to be programmed once, even when the source data changes or is updated later.
- Load Data: Loading data directly to places where it can be more easily used for future intepretation or analysis. The end location for the data (data destination) can be on the cloud, in a Data Warehouse, or inside a file document, such as a Excel, or PBIX file (i.e., Power Pivot, or Power BI).
Who can benefit from Power Query?
Users that are likely to benefit from this tool are those trying to solve the following kinds of problems:
- Manipulating & Re-shaping: You cannot get your IT department to shape the data before you use it, and because of this you need to spend time manually “reshaping” and “manipulating” the data before it can be used.
- Cleaning: You have to manually “cleanse” your data every month, removing data and errors that you already know about, which you cannot permanently solve back at the data’s source.
- Collating: You need to spend hours and hours of repetitive work combining data from multiple places into a single workbook or file, before you can start working with it.
- Joining: You write lots of XLOOKUP/VLOOKUP formulas over multiple tables of data as you need to “join” the data together. You currently do this using XLOOKUPsVLOOKUPs, dragging formula up and down a long table, but always end up under or overshooting the last row!
- Scraping: You need to get data from various places on the web, into a single offline file where it can be looked at all together. Currently, you cut and paste the data, and spend hours reformatting it to remove any noise from the data.
- Auditing: You need to compare lists of data to find similarities and differences – this is a task you currently do manually, dragging formula up and down, and performing multiple INDEX-MATCHES, and/or H/X/VLOOKUPS.
What makes Power Query so great?
-
User Interface: Power Query has a very simple User Interface that users can easily learn.
-
Powerful Language: The user interface is built on top of a very powerful formula language, ‘M’, that has very powerful capabilities, well beyond what most users will ever need. (Sadly, it bears no resemblance to standard Excel formula language or VBA, so there is a bit of a learning curve!).
-
Automate Process Steps: Power Query records repeatable “Process Steps” that can be reused over and over again without destroying the underlying data. Process Steps can be easily navigated through, so the data can be previewed before and after any transformations.
This means a series of steps can be built for one set of data and then “reapplied” to a different set of data (provided both sets of data are the same “shape”). This ability makes it a great tool for updating weekly or monthly data, for example.
- Query Folding: Power Query will always try to push processing back to the data source. For example, if you are using a SQL Server data source and your query contains several steps, Power Query will translate all of the steps in the query into a single SQL SELECT statement, (where it can)
This is very helpful, if, for example, the source table is very large, as it is more performant for filtering and joining to happen in the SQL Server, than it is to download the entire dataset, to Excel, and perform the filtering / joining operations locally.
What Power Query is not!
Within the Microsoft eco-system, you may have also heard of the (Excel) Data Model. This an in-memory database engine that runs inside Excel (and Power BI), that allows you to load very large amounts of data - much more than could ever be fit on a single Excel worksheet, which is capped at: 1,048,576.
The Excel Data Model provides a much better way of slicing and dicing data, than traditional VLOOKUPs or Pivot Tables.
- Hidden worksheets do not have to be used to store data: once data is in the Data Model - you can create PivotTables, or combined Pivot Tables, directly from it.
- Cube Formulas can also be used to import individual values into worksheet cells, if, for example, you need to build very customised page layouts than is possible with a standard Pivot Table.
Other advantages include:
- Multiple tables: The ability to load several tables, which you can then create relationships between. Traditional PivotTables are limited to just one source table.
- Complex calculations: Advanced calculations can be written elegantly in DAX, Microsoft’s Data Modelling language. These include time series calculations, percentage shares, and many other types of calculations that are difficult to implement using standard Excel formulas.
- The Excel Data Model compresses data very efficiently so loading data into it, instead of into the worksheet, can drastically reduce the size of your workbooks.
- You can add features such as data hierarchies, which allow you to drill down through data, via a predefined path.
But the Microsoft Data Model (Power Pivot) is not *Power Query!* Rather, only after data has been cleaned and transformed, can it be sent to Power Pivot, for it to be fully modelled.
This is what Power Query is for - the cleaning and transforming - before the data is modelled (e.g. in Power Pivot!)
The rest of this guide provides some practical, real-world examples of how Power Query and the language that underlies it, ‘M’, can be used to solve a variety of different business data problems, using ETLs.
Feedback
Submit and view feedback