Cleaning data

Overview

This article covers a few techniques that show how you can build a robust approach for cleaning data.

Techniques employed include:

  • Joining tables (Merge Query)
  • Self-Referencing tables
  • Adding Custom Columns

Steps

1. Sample data

The sample data below will be used. The Country data is the data that needs to be cleansed so there are consistent country values in the Country column.

  • Load this data into Power Query and set Close & Load to as a “Connection”.

2. Create a table of all entered countries

  • The next step is to produce a list of the unique entries in the source data. To do this, reference the above table by right clicking, then Reference.
  • In the referenced query, remove the Name column, and then remove the duplicates in the remaining Country column
  • Call the query Country
  • Then sort the column before loading the data back to Excel as shown below.

3. Create the substitutes with a self-referencing table

Manually add a new column to the green table above and entered the substitutions as needed. Where there is no substitute needed, leave the substitute blank.

As it currently stands above, if the source table is sorted and the green table refreshed, the substitute column will fall out of alignment with the source table.

  • To correct for this, load the green table above (both columns this time) back into Power Query, as a new query, setting it only to create a connection.
  • Call the query, ‘Substitutes’
  • Now, edit the original query that created the original green table (with only the “Country” column) and join it to the Substitutes table.
  • In edit mode, select Merge Queries (shown below), then select the Substitutes table, then create the join on the Country columns, with a left outer join.
  • Then expand the new column to extract the Substitutes field.
  • Close & Load the query and delete the old duplicate Substitutes column, so you are left with this:

This now gives a self-referencing table that will refresh while keeping the countries and substitutes correctly aligned regardless of what happens to the source table.

4. Update the new data

Now when new data is added to the source table, it will appear in the substitute table after refresh as follows.

What would be much better is to be clear which are the correct entries that have previously been checked, so it is more obvious which are the new entries.

  • So, enter the correct country name for each entry as follows in the green table.

It is now much clearer that there are 3 new entries that need to be maintained.

5. Use the substitute table

Now that with the substitute table fully mapped, you need to create a new copy of the source table that uses the “substitute” country rather than the “original” country entered.

  • Create a new query referencing the original source data.
  • Merge the queries as before, joining on the original country column.
  • Expand the table to retrieve the substitute column
  • Then delete the original country column and rename the ‘Substitute’ column to ‘Country’

As you can see below, the substitute table is correctly remapping the data as needed.

When the new data is added to the source table, the substitute table will have blank entries as shown below, that should be updated.

6. Add a custom column

As a nice touch, go back into the final query and add a new custom column as follows:

Then use this new column as the country column. Now the data is refreshed, it is clearer that some maintenance on the substitute table is needed.