Changing Data Types
Overview
Data types are a big deal in computer science and computer programming! You may have heard of things such as: type casting, type conversion, or strongly typed - these all refer to the type of a particular variable (i.e. whether an integer, decimal, date, string, etc.)
Power Query has a useful feature, where you can set the data type of each field or column. Whenever you launch a new query, for example, you will often see a step called “Changed Type” appear, under the Applied Steps section. This is Power Query using its own algorithms to try and set the data type of each column.
Whilst it can be helpful for it to automatically do this for you, it can get quite annoying, especially as the “Changed Type” step should be kept as far down the ETL chain as possible (as column data types are changed by reference to their hard-coded name, rather than position).
As a rule of thumb, the “Changed Type” step, should appear:
- Only once, in any given query
- As far down the query chain as possible
Whilst, it might seem like a good idea to get rid of the step altogether, we wouldn’t recommend it. The rest of this article explains why, but, in short, setting data types is very helpful in detecting inadvertent errors in your data (e.g. data of the wrong data type), such as: invalid dates, invalid numbers, and so on. This automatic data checking can be really helpful, and would be very hard, almost impossible to do otherwise!
Without Data Type checking
Suppose you had a line of data given to you in the following form:
Although it’s a little hard to see in the above image (click button to open a larger sized image), the line contains two errors:
- November only has 30 days, so the date 31/11/2020 is invalid
- The figure in April has been typed in as 54,50
Unfortunately, both of these mistakes are very easy to miss, and Excel, itself, does not give any obvious warnings. Whatsmore, in a standard manual data-check, it’s likely that the most any person would do is check that the formula in the Total column, covers the correct cell range, by seeing which cells are highlighted, as shown below:
It looks ok at first glance, but the Total figure returned is actually wrong, and under-estimates total operating costs by 5,450 (or circa. 15%!)
With Data Type checking
A better way is to pass all the data through an ETL, and allow it to perform its automatic data type checks.
This can be done as follows:
1. Load the data
- Load the data into the query editor, so it looks like this:
2. Pivot the data
- Then unpivot all columns apart from the Heading column. The data should look like this.
You can see immediately that the April figure is in a slighly different form.
3. Filter out Total
- Next, filter out the Total row, so the data now looks like this:
4. Set Data Types
- Now, set the data types for each column, by selecting on the left hand icon, on each column.
You should see a selection of data types to choose from.
Once set for each column, the query editor should look like this:
You can see immediately that the invalid date has been picked out, and also that the query has tried to convert the figure 54,50 into a number - in this case, successfully. (otherwise, here, too it would flag it as an error)
5. Close & Load
- Finally, click ‘Close & Load’ to load the results to Excel:
You can add a Total row to the table, from the worksheet itself, so you now have this:
Notice, how the Total now no longer underestimates Operating Costs. It is clearer too that something is wrong with November’s date.
Final Comments
Data type checking is an invaluable feature that comes with very little cost, but a lot of great benefit! If data accuracy is important to you, we highly recommend using this feature!
Feedback
Submit and view feedback