Importing tabular data from a pdf
Overview
This article explains how to import tabular data from a PDF document into Excel.
![](images/importing-data-pdf-1.png)
Steps
- Open the PDF in Microsoft Word.
- Save the file as a single file web page.
- Import the single file web page into Power Query as HTML.
- Go through some cleansing and transformation steps, as you would do in Power Query
1. Worked Through Example
The following PDF contains a list of US States with their abbreviations
The pdf should look like this (only the top half of the page is shown):
![](images/importing-data-pdf-2.png)
- Save a copy PDF to a known location on your PC.
2. Convert the PDF to a Web Page File
- Open Microsoft Word and then open the PDF file from within Word.
You will get the warning message below.
- Click “don’t show this message again”, and click OK.
![](images/importing-data-pdf-3.png)
- Now, select File\Save As and save the file as a Single File Web Page (shown below):
![](images/importing-data-pdf-4.png)
3. Import into Power Query
- Create a new query that connects to a text file
![](images/importing-data-pdf-5.png)
- In the dialog, change the file filter so it can find all file types (as shown below), then browse until you find your file.
![](images/importing-data-pdf-6.png)
- Then select the MHT file created earlier and click “import”
![](images/importing-data-pdf-7.png)
- Power Query will not correctly identify the file type, so right click on the file (shown below) and the select HTML.
![](images/importing-data-pdf-8.png)
Power Query now correctly identifies the table in the document.
![](images/importing-data-pdf-9.png)
In your own PDF documents, you may need to use some trial and error to find the right table.
- Expand the one called as “3DTableGrid”, by double-click on the Table highlighted green above
4. Clean the Data
The data is starting to look good at this stage, but does have a bit of “noise”.
![](images/importing-data-pdf-10.png)
- First, remove the first row (Remove Top Rows, 1).
- Next copy the text <= /span> by right clicking on one of the cells containing this text, and selecting “copy”.
- Then select all 4 columns and do a Transform\Replace Values and replace <= /span> with nothing.
- You do not need the bottom 11 rows, so Remove Rows ➔ Remove Bottom Rows ➔ 11
![](images/importing-data-pdf-11.png)
5. Transform from 4 to 2 columns
At this point, the data is in 4 columns, but it should be in just 2 columns.
- Rename this step to: ‘All4Columns’ – this will make it easier to find the step later.
![](images/importing-data-pdf-12.png)
- Remove the last 2 columns, selecting the columns to remove, right clicking, and then remove columns.
![](images/importing-data-pdf-13.png)
- Rename the first 2 columns to be called ‘State’ and ‘Abbreviation’
- Rename the step to be called ‘First2Columns’ using the same renaming approach as before.
![](images/importing-data-pdf-14.png)
- Click the fx button, and type in the name of the following step
= All4Columns
This brings all 4 columns back. Go through the following steps.
- Remove the first 2 columns
- Rename the columns to be ‘State’ and ‘Abbreviation’ as before
- Rename the step to be called ‘Second2Columns’
This then leaves you with 2 non-sequential steps (First2Columns, Second2Columns) that each contain half of the data.
6. Append the 2 tables into a single table
Click the fx button again, and write the following code, to append the 2 tables together
= Table.Combine({First2Columns, Second2Columns})
7. VBA for converting multiple pdfs
If you copy this VBA code into Microsoft Word, and run the code to launch a dialog, you will be able to multi-select several PDFs, and the code will convert them all for you.
Sub ConvertToMHT()
Dim f_dialog As FileDialog
Dim FileChosen As Integer
Dim FileName As String
Dim i As Integer
Set f_dialog = Application.FileDialog(msoFileDialogFilePicker)
f_dialog.InitialView = msoFileDialogViewList
f_dialog.AllowMultiSelect = True
FileChosen = f_dialog.Show
If FileChosen = -1 Then
For i = 1 To f_dialog.SelectedItems.Count
Documents.Open FileName:=f_dialog.SelectedItems(i)
ActiveDocument.SaveAs2 FileName:=Mid(f_dialog.SelectedItems(i), 1, Len(f_dialog.SelectedItems(i)) - 4) & ".mht", FileFormat:=wdFormatWebArchive
ActiveWindow.Close
Next i
End If
End Sub
Feedback
Submit and view feedback