Importing tabular data from a pdf

Overview

This article explains how to import tabular data from a PDF document into Excel.

Steps

  • Open the PDF in Microsoft Word.
  • Save the file as a single file web page.
  • Import the single file web page into Power Query as HTML.
  • Go through some cleansing and transformation steps, as you would do in Power Query

1. Worked Through Example

The following PDF contains a list of US States with their abbreviations

Example PDF file

The pdf should look like this (only the top half of the page is shown):

  • Save a copy PDF to a known location on your PC.

2. Convert the PDF to a Web Page File

  • Open Microsoft Word and then open the PDF file from within Word.

You will get the warning message below.

  • Click “don’t show this message again”, and click OK.
  • Now, select File\Save As and save the file as a Single File Web Page (shown below):

3. Import into Power Query

  • Create a new query that connects to a text file
  • In the dialog, change the file filter so it can find all file types (as shown below), then browse until you find your file.
  • Then select the MHT file created earlier and click “import”
  • Power Query will not correctly identify the file type, so right click on the file (shown below) and the select HTML.

Power Query now correctly identifies the table in the document.

In your own PDF documents, you may need to use some trial and error to find the right table.

  • Expand the one called as “3DTableGrid”, by double-click on the Table highlighted green above

4. Clean the Data

The data is starting to look good at this stage, but does have a bit of “noise”.

  • First, remove the first row (Remove Top Rows, 1).
  • Next copy the text <= /span> by right clicking on one of the cells containing this text, and selecting “copy”.
  • Then select all 4 columns and do a Transform\Replace Values and replace <= /span> with nothing.
  • You do not need the bottom 11 rows, so Remove Rows ➔ Remove Bottom Rows ➔ 11

5. Transform from 4 to 2 columns

At this point, the data is in 4 columns, but it should be in just 2 columns.

  • Rename this step to: ‘All4Columns’ – this will make it easier to find the step later.
  • Remove the last 2 columns, selecting the columns to remove, right clicking, and then remove columns.
  • Rename the first 2 columns to be called ‘State’ and ‘Abbreviation’
  • Rename the step to be called ‘First2Columns’ using the same renaming approach as before.
  • Click the fx button, and type in the name of the following step
M-Code
   = All4Columns

This brings all 4 columns back. Go through the following steps.

  • Remove the first 2 columns
  • Rename the columns to be ‘State’ and ‘Abbreviation’ as before
  • Rename the step to be called ‘Second2Columns’

This then leaves you with 2 non-sequential steps (First2Columns, Second2Columns) that each contain half of the data.

6. Append the 2 tables into a single table

Click the fx button again, and write the following code, to append the 2 tables together

M-Code
   = Table.Combine({First2Columns, Second2Columns})

7. VBA for converting multiple pdfs

If you copy this VBA code into Microsoft Word, and run the code to launch a dialog, you will be able to multi-select several PDFs, and the code will convert them all for you.

VBA
Sub ConvertToMHT()
Dim f_dialog As FileDialog
Dim FileChosen As Integer
Dim FileName As String
Dim i As Integer
Set f_dialog = Application.FileDialog(msoFileDialogFilePicker)
f_dialog.InitialView = msoFileDialogViewList
f_dialog.AllowMultiSelect = True

FileChosen = f_dialog.Show
    If FileChosen = -1 Then
         For i = 1 To f_dialog.SelectedItems.Count
              Documents.Open FileName:=f_dialog.SelectedItems(i)
              ActiveDocument.SaveAs2 FileName:=Mid(f_dialog.SelectedItems(i), 1, Len(f_dialog.SelectedItems(i)) - 4) & ".mht", FileFormat:=wdFormatWebArchive
              ActiveWindow.Close
         Next i
    End If
End Sub