Importing tabular data from a pdf - new

Overview

This article explains how to import tabular data from a PDF document into, using the newer Import from PDF data connection.

Steps

  • Import the single pdf into Power Query
  • Go through some cleansing and transformation steps, as you would do in Power Query

1. Worked Through Example

The following PDF contains a list of US States with their abbreviations

Example PDF file

The pdf should look like this (only the top half of the page is shown):

  • Save a copy PDF to a known location on your PC.

2. Import into Power Query

  • Create a new query that connects to the pdf

You will be prompted to provide the location of the PDF file you want to use. Once you provide the file location and the PDF file loads, a Navigator window appears and displays the list of tables and pages in the document that you can import the data from.

You can browse through the PDF document data and select one or multiple elements to import into Excel. When you are ready to import, select the Load button to bring the data into Excel, or Transform Data to clean your data and prepare it for analysis with Power Query Editor.

3. Clean the Data

If you click on Transform data, it should look something like this in Power Query.

  • Rename the step, which is likely to be called “Change Type” to “All4Columns
  • Select the first 2 columns, the columns you want to keep, by right clicking, and then clicking remove other columns, as shown below:
  • After this, promote the first row as Headers
  • Then, filter out any Blanks, and the row containing ‘DIRECTIONAL ABBREVIATIONS’ from the STATE column, using the following M-Code:
M-Code
   = Table.SelectRows(#"Promoted Headers", each ([STATE] <> "" and [STATE] <> "DIRECTIONAL#(lf)ABBREVIATIONS"))
  • Rename this step to First2Columns
  • Now, click the fx button to insert a new step with the following M-Code:
M-Code
   = Table.SelectColumns(All4Columns,{"Column3", "Column4"})
  • Promote the first row as Headers
  • Filter out any Blanks, and the row containing ‘SECONDARY ADDRESS INDICATORS’ from the STATE column, using the following M-Code:
M-Code
   = Table.SelectRows(#"Promoted Headers1", each ([STATE] <> "" and [STATE] <> "SECONDARY ADDRESS#(lf)UNIT INDICATORS"))
  • Rename this step to Second2Columns

4. Append the 2 tables into a single table

  • Finally, click the fx button again, and write the following code, to append the 2 tables together:
M-Code
   = Table.Combine({First2Columns, Second2Columns})
  • Rename the step to AllData

The data from the pdf should now look like this when loaded to Excel:

5. Final M-Code

  • The M-code to the entire query should look something like this (the filepath to the pdf in the first step will likely be different for you):
M-Code

let
    Source = Pdf.Tables(File.Contents("C:\Users\...Abbreviation-List.pdf"), [Implementation="1.2"]),
    Table001 = Source{[Id="Table001"]}[Data],
    All4Columns = Table.TransformColumnTypes(Table001,{{"Column1", type text}, {"Column2", type text}, {"Column3", type text}, {"Column4", type text}}),
    #"Select Columns" = Table.SelectColumns(All4Columns,{"Column1", "Column2"}),
    #"Promoted Headers" = Table.PromoteHeaders(#"Select Columns", [PromoteAllScalars=true]),
    First2Columns = Table.SelectRows(#"Promoted Headers", each ([STATE] <> "" and [STATE] <> "DIRECTIONAL#(lf)ABBREVIATIONS")),
    #"Select Columns1" = Table.SelectColumns(All4Columns,{"Column3", "Column4"}),
    #"Promoted Headers1" = Table.PromoteHeaders(#"Select Columns1", [PromoteAllScalars=true]),
    Second2Columns = Table.SelectRows(#"Promoted Headers1", each ([STATE] <> "" and [STATE] <> "SECONDARY ADDRESS#(lf)UNIT INDICATORS")),
    AllData = Table.Combine({First2Columns, Second2Columns})
in
    AllData

6. Advanced scenarios

In some cases, you may want to import a range of pages from a PDF document at once. For this, you can specify the Start page and End Page as optional parameters for your PDF connection in the underlying M formula from the Power Query Editor:

M-Code
    Pdf.Tables(File.Contents("C:\Sample.pdf"), [StartPage=5, EndPage=10])

For more information, refer to the Pdf.Tables M function documentation