Dealing with Slowly Changing Datasets

Overview

Trying to maintain a reporting system that outlasts its ever-changing data can sometimes feel like a never-ending battle!

This article provides some tips to help you think differently about ETL when it comes to ‘slowly changing’ datasets (those whose structure slowly changes over time).

Tip 1: Define a Data Contract

Perhaps one of the best things to do with a slowly changing dataset is to define a data contract with the original data owner, so that whenever the structure of the dataset source changes (names of columns, new columns, etc.), you are notified in good time.

Being a true “Business Partner”, and demonstrating a genuine willingness to collaborate with your work colleagues, can be a very good feeling!

Tip 2: Filtering columns

Be particularly careful when using filters, ensuring that any filtering logic you create behaves as expected.

So, watch out for when a generated filter favours the positive condition (using an equals sign on the items you have selected) versus when it favours the negative condition (using a not-equals sign on the items you haven’t selected).

If the items listed in the filtered column change, the two types of filtering logic can give different results. For example, in a column that has just two items {red, yellow}, a filter that does item = "red" and a filter that does item <> "yellow" will yield the same results now, but they will not when or if more colours appear in the column.
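To make this concrete, here is a minimal sketch of the two filter styles, assuming a previous step called Source and a text column called Item (both names are illustrative):

M-Code
// Positive filter: keeps only the items explicitly selected.
// A new item (e.g. "blue") appearing later will be excluded.
#"Kept Red" = Table.SelectRows(Source, each [Item] = "red"),

// Negative filter: keeps everything except the deselected items.
// A new item (e.g. "blue") appearing later will be kept.
#"Dropped Yellow" = Table.SelectRows(Source, each [Item] <> "yellow")

Both steps return the same rows while the column contains only {red, yellow}, but they diverge as soon as a third item arrives, so choose the form whose default behaviour for unseen values you actually want.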

So, think about what values might appear or disappear in the filtered column in future. How can you make your filtering logic more robust to handle this?

Tip 3: Re-ordering columns

Often, in ETLs, you are not concerned about the exact order of columns but would like to apply a specific order to a small subset of columns.

By default, in Power Query, when you reorder columns, the Table.ReorderColumns function is called with a reference to every column in the table. Referencing all columns can weaken an ETL and increase the chances of future refresh failures when or if columns are renamed or removed from the source.
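For illustration, this is the shape of step the Power Query UI typically generates when you drag a column around (the column names here are hypothetical); if any one of these hard-coded names is later renamed or removed at the source, the step fails:

M-Code
#"Reordered Columns" = Table.ReorderColumns(
	Source,
	{"Customers", "Employees", "Region", "OrderDate", "Amount"}
)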

The advice here is to re-order only the columns you want to keep. So, avoid loading hundreds of columns, re-ordering all of them, and then keeping only a few. This non-targeted approach involves a lot of wasteful operations and wastes valuable processing power.

Also, if re-ordering is needed, it is better to apply it after you have kept only the columns needed in the report.

Re-order columns dynamically (with code!)

For example:

M-Code
Table.ReorderColumns(
	#"TableStep",
	List.InsertRange(
		List.Difference(
			Table.ColumnNames(#"TableStep"),
			{"Customers", "Employees"}
		),
		1,
		{"Customers", "Employees"}
	)
)

The above code will dynamically take any table as input (e.g. #"TableStep") and re-order the columns called "Customers" and "Employees" into the 2nd and 3rd positions, regardless of how many other columns there are in the table and what those columns are named.

With the above completed, you can then take things a step further by turning the code into a repeatable function, like so:

M-Code
fnReorderSubsetOfColumns =
(tbl as table, reorderedColumns as list, offset as number) as table =>
Table.ReorderColumns(
	tbl,
	List.InsertRange(
		List.Difference(
			Table.ColumnNames(tbl),
			reorderedColumns
		),
		offset,
		reorderedColumns
	)
)

Once the function has been created, you can call it on a different table, like so:

M-Code
fnReorderSubsetOfColumns(#"TableStepNew", {"Suppliers", "Stores"}, 1)

Tip 4: Removing columns

Generally speaking, when selecting columns, it is better to focus on the columns you want to keep rather than on the ones you want to remove; this reduces the risk of an ETL refresh failure.

For example, if the data source owner decides to remove certain columns themselves, and your ETL also has steps for removing those columns, the ETL will likely fail on its next refresh. An ETL that keeps the columns you want tends to avoid these problems.

It can also be better to remove (or select) columns based on their position rather than name.

The following functions can help you remove and/or select columns in a more dynamic way, and avoid hard-coded name references where possible.

List.FirstN - based on column position:

Removes the first two columns from a table

M-Code
    Table.RemoveColumns(Source, List.FirstN(Table.ColumnNames(Source), 2))

List.Last - based on column position:

Removes the last column from a table

M-Code
    Table.RemoveColumns(Source, List.Last(Table.ColumnNames(Source)))

List.Range - based on column position:

Keeps the second and third columns only

M-Code
    Table.SelectColumns(Source, List.Range(Table.ColumnNames(Source), 1, 2))

Custom Code - based on column position:

Selects 2 non-adjacent columns (the 3rd and 6th columns)

M-Code
    Table.SelectColumns(Source, {Table.ColumnNames(Source){2},  Table.ColumnNames(Source){5}})

List.Select - based on column name:

The example below removes columns based on their names rather than their positions, removing only those columns whose header contains the substring “Random”.

M-Code
    = Table.RemoveColumns(#"Source", List.Select(Table.ColumnNames(#"Source"), each Text.Contains(_, "Random")))

You can get the same result using Table.SelectColumns, with the logic reversed:

M-Code
    = Table.SelectColumns(#"Source", List.Select(Table.ColumnNames(#"Source"), each not Text.Contains(_, "Random")))

Tip 5: Renaming columns

Renaming columns is a common data preparation step and can improve the user experience. It does, however, need to be done carefully, as it can result in ETL refresh failures if, for example, the column names in the source data change.

What not to do!

M-Code

    Table.RenameColumns(#"Source", {{"Random Column 1", "New Name 1"}, {"Random Column 2", "New Name 2"}})

If these columns disappear from the source, the ETL will fail.

What to do!

Using functions can be a better way of renaming columns:

Create a Renaming function

M-Code
fnRenameColumnsByIndices =

let
Source = (Source as table, ColumnNamesNew as list, Indices as list) =>
let
    ColumnNamesOld = List.Transform(Indices, each Table.ColumnNames(Source){_} ), 
    ZippedList = List.Zip( { ColumnNamesOld, ColumnNamesNew } ),
    #"Renamed Columns" = Table.RenameColumns(Source, ZippedList)
in
    #"Renamed Columns"
in
    Source

The above function takes three arguments: Source, the table whose columns you want to rename; ColumnNamesNew, the list of new column names; and Indices, the list of index positions of the columns in the source table that you wish to rename.

Invoke the Renaming function

Even when invoking a function, you can use list-generating functions like List.Transform to create dynamic input lists.

For example, when invoking the earlier created function, you can write the following:

M-Code
(#"TableStep", List.Transform({1..7}, each "New Column Name " & Text.From(_)), {6..12})

This is effectively shorthand for:

M-Code
fnRenameColumnsByIndices(
	#"TableStep",
	{"New Column Name 1", "New Column Name 2", "New Column Name 3", "New Column Name 4", "New Column Name 5", "New Column Name 6", "New Column Name 7"},
	{6..12}
)

Other variants of this can also be used!

Tip 6: Splitting columns

Incorrectly splitting columns can sometimes lead to lost data. This can happen, for example, when you apply “Split Column by Delimiter” to columns of delimiter-separated values that have a varying number of elements.

So, for example, if row 1 had two elements, {“green”, “yellow”}, and row 2 had three elements, {“red”, “blue”, “green”}, and the column was split into only two columns, you would lose the element “green” from row 2.

A better way can be to split the column into rows, rather than into multiple columns.

So, instead of using code like this:

M-Code

#"Split Column by Delimiter" = Table.SplitColumn(#"PreviousStep", "Products", Splitter.SplitTextByDelimiter(",", QuoteStyle.Csv), {"Product.1", "Product.2"}),

(the column being split is called “Products”)

You should use the following kind of code:

M-Code

#"Split Column by Delimiter" = Table.ExpandListColumn(Table.TransformColumns(#"PreviousStep", {{"Products", Splitter.SplitTextByDelimiter(", ", QuoteStyle.Csv), let itemType = (type nullable text) meta [Serialized.Text = true] in type {itemType}}}), "Products"),

This will split all the elements into new rows, and importantly, ensure none of the information is lost.

Tip 7: Merging columns

As with Splitting Columns, you can apply similar logic and techniques when you have to Merge columns.

Such code can scale your query to merge a given list of columns without referencing any hard-coded column names, as shown in the example below, which merges columns based on an input list of column names:

M-Code
#"Merged Columns" = Table.CombineColumns(
	Table.TransformColumnTypes(
		Source,
		List.Transform(
			ColumnsToMerge,
			each {_, type text}
		),
		"en-US"
	),
	ColumnsToMerge,
	Combiner.CombineTextByDelimiter(":", QuoteStyle.None),
	"Merged"
)

(the list of column names to merge is contained in the list called “ColumnsToMerge”)
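The input list itself can also be built dynamically. As a minimal sketch, assuming the columns you want to merge share a common substring in their headers (here “Name”, purely for illustration), you could derive ColumnsToMerge from the table rather than hard-coding it:

M-Code
// Build the list of columns to merge dynamically: every column
// whose header contains the substring "Name".
ColumnsToMerge = List.Select(
	Table.ColumnNames(Source),
	each Text.Contains(_, "Name")
),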

Tip 8: Changing data types

Data types are a big deal in computer science and programming! You may have heard of terms such as type casting, type conversion, or strongly typed; these all refer to the type of a particular variable (i.e. whether it is an integer, decimal, date, string, etc.).

Power Query has a useful feature where you can set the data type of each field or column. Whenever you create a new query, for example, you will often see a step called “Changed Type” appear under the Applied Steps section. This is Power Query using its own algorithms to try to set the data type of each column.

Whilst it can be helpful for Power Query to do this automatically, the generated step can become a liability: it changes column data types by reference to their hard-coded names rather than their positions, which is why the “Changed Type” step should be kept as far down the ETL chain as possible.

As a rule of thumb, the “Changed Type” step, should appear:

  • Only once, in any given query
  • As far down the query chain as possible

Whilst it might seem like a good idea to get rid of the step altogether, we wouldn’t recommend it. Setting data types can be very helpful in detecting inadvertent errors in your data (e.g. values of the wrong data type), such as invalid dates, invalid numbers, and so on. This kind of checking would be very hard, almost impossible, to do otherwise!
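As an illustrative sketch (the step and column names here are hypothetical), a single “Changed Type” step applied once, at the end of the query and after only the required columns have been kept, might look like this:

M-Code
// One Changed Type step, as far down the chain as possible,
// after the columns have been reduced, re-ordered, and renamed.
#"Changed Type" = Table.TransformColumnTypes(
	#"Kept Columns",
	{{"OrderDate", type date}, {"Amount", type number}, {"Customer", type text}}
)

Because this step still references column names, applying it after the column set has been reduced and stabilised minimises the number of names it can break on.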

See also

  • ETL Best Practices: a selection of good practices to follow with ETL when designing and building data queries » Read more
  • Transforming Data: a selection of articles on how to perform basic ETL transformations, such as: renaming, re-ordering, removing, splitting, and merging of columns » Read more