Splitting Large PDFs - S.D. Payslips

Guides • Python • 30/01/2022

Overview

This example shows how a medium-length Python script, together with the PyPDF2 and pathlib libraries, can be used to split a large PDF file you have received into smaller individual files.

The example shown below works on payslip PDFs, such as those you might get from Sargent Disc, a small to medium-sized payroll vendor that operates in the U.K. Film & Entertainment space. These are the kind of data files I used to work with a lot in a previous industry role on an Apple TV+ production.

As with all scripts, the one below can be modified to suit your needs and to work with any particular system (that’s one advantage of scripts - they are flexible!).

Python Script

The full script is shown below.

Python

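import os
import re
from pathlib import Path

# Note: these class and method names are from PyPDF2 1.x/2.x.
# PyPDF2 3.0 removed them in favour of PdfReader and PdfWriter.
from PyPDF2 import PdfFileReader, PdfFileWriter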

# Specify Input and Output Locations
pdf_file_path = 'Payslips.pdf'
file_base_name = pdf_file_path.replace('.pdf', '')
working_directory = Path.cwd()
output_folder = Path('Output')
output_folder_path = working_directory / output_folder
output_folder_path.mkdir(exist_ok=True)  # Create the Output folder if it is missing
pdf = PdfFileReader(pdf_file_path)

# Split Files
count = 0
for page_num in range(pdf.numPages):

    # Skip pages already added to the previous payslip by the inner loop
    if count > 0:
        count -= 1
        continue
         
    # Setup Objects & Classes
    pdfWriter = PdfFileWriter()
    pageObj = pdf.getPage(page_num)
    pdfWriter.addPage(pageObj)

    # Search on Current Page
    Text = pageObj.extractText()
    MatchedNameArray = re.findall(r"Name:[^0-9]+?\s", Text)
    if not MatchedNameArray:
        # No 'Name:' field found, so skip this page instead of raising an IndexError
        continue
    MatchedName = (MatchedNameArray[0].replace('Name:', '')).replace('\n', '')
    MatchedEmpNoArray = re.findall(r"Emp\.No:\w*", Text)
    if MatchedEmpNoArray:
        MatchedEmpNo = (MatchedEmpNoArray[0].replace('Emp.No:', '')).replace('\n', '')
    else:
        MatchedEmpNo = 'empty'

    # Search on following Pages
    i = page_num + 1
    while i < pdf.numPages:
        pageObjNext = pdf.getPage(i)
        TextNext = pageObjNext.extractText()
        MatchedNameArrayNext = re.findall(r"Name:[^0-9]+?\s", TextNext)
        if not MatchedNameArrayNext:
            # No 'Name:' field on this page, so treat it as the end of the current payslip
            break
        MatchedNameNext = (MatchedNameArrayNext[0].replace('Name:', '')).replace('\n', '')

        if MatchedName == MatchedNameNext:
            i += 1
            count += 1
            pdfWriter.addPage(pageObjNext)
        else:
            break

    # Split MatchedText on UpperCase
    res_pos = [j for j, e in enumerate(MatchedName+'A') if e.isupper()]
    res_list = [MatchedName[res_pos[k]:res_pos[k + 1]] for k in range(len(res_pos)-1)]

    # Extract Firstname (the first segment is skipped, the second is taken as the first name)
    firstname = res_list[1]

    # Extract Surname: join the remaining segments with spaces,
    # except after a hyphen or apostrophe (e.g. "Smith-" + "Jones", "O'" + "Brien")
    surname_parts = res_list[2:]
    surname = surname_parts[0] if surname_parts else ''
    for previous_part, part in zip(surname_parts, surname_parts[1:]):
        if previous_part[-1] in "-'":
            surname += part
        else:
            surname += " " + part
 
    # Write PDF File (the 'with' block closes the file automatically)
    output_name = f"{surname.upper()}, {firstname.upper()} ({MatchedEmpNo.upper()})"
    with open(output_folder_path / output_name, 'wb') as f:
        pdfWriter.write(f)

# Rename Files in Output Directory (applies to every file found in the folder)
files = os.listdir(output_folder_path)
for file in files:
    os.rename(output_folder_path / file,
              output_folder_path / ('PAYSLIP - WE 25JAN 2022 - ' + file + '.pdf'))

How to Use it

The input file is called Payslips.pdf. This should be stored in the same folder as this script.
The script can be given any meaningful name, but should have the file extension .py to tell the operating system on your computer that it is a Python script.
The split PDFs are saved in a folder called Output. This (i) Output folder should sit alongside the (ii) input PDF file and (iii) Python script, within the same working directory.
The name given to each new PDF file can be modified by editing the last line of the script above.
The script can be run easily from any Python environment on a Mac or Windows computer (Visual Studio Code is recommended!).
The script takes less than 5 seconds to process 500 payslips!

Important:
Re-running the script can overwrite files already in the Output folder that have the same name, and the final rename step is applied to every file it finds there, including files left over from an earlier run. So don’t forget to clear out the folder, and carefully check what filename you give to any newly generated PDFs, before running the script.
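
One way to guard against this is to check the Output folder before anything is written. The sketch below is not part of the original script; the helper name prepare_output_folder is made up for illustration, and it assumes you would rather stop with an error than mix old and new files.

```python
from pathlib import Path


def prepare_output_folder(folder: Path) -> Path:
    """Create the folder if needed and refuse to continue if it already holds files."""
    folder.mkdir(parents=True, exist_ok=True)
    leftovers = sorted(p.name for p in folder.iterdir())
    if leftovers:
        raise FileExistsError(
            f"{folder} is not empty ({len(leftovers)} item(s)); clear it before re-running"
        )
    return folder
```

It could be called near the top of the script, for example as output_folder_path = prepare_output_folder(Path.cwd() / 'Output').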

How it Works

The script works by:

Scanning through each page of the input PDF file.
Finding out when the name written after the ‘Name:’ field changes from one page to the next.
When it does, extracting all the relevant pages up to that point into a PDF object.
Writing the PDF object out as a new PDF file, saved into the folder called ‘Output’.

For example, if a name repeats across several pages in the input file, the Script will keep looping through each page, and add each page to the same PDF object, until the name on the page changes. When it does detect a name change, it will then write out and save the PDF object as a new PDF file. The process then repeats.
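
The grouping idea can be tried out without any PDF at all. The sketch below swaps in itertools.groupby in place of the script's page counter, and works on a plain list of made-up names standing in for the ‘Name:’ field on each page; the function name group_pages_by_name is an assumption for illustration.

```python
from itertools import groupby


def group_pages_by_name(page_names):
    """Return (name, [page indexes]) for each run of consecutive pages sharing a name."""
    groups = []
    for name, run in groupby(enumerate(page_names), key=lambda pair: pair[1]):
        groups.append((name, [index for index, _ in run]))
    return groups


# Made-up names, one per page of an imaginary input file
pages = ["MrJohnSmith", "MrJohnSmith", "MrJohnSmith", "MsAnnLee", "MrJohnSmith"]
print(group_pages_by_name(pages))
# [('MrJohnSmith', [0, 1, 2]), ('MsAnnLee', [3]), ('MrJohnSmith', [4])]
```

Note that only consecutive pages are grouped: the same name appearing again later in the file starts a new group, just as it would start a new PDF in the script.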

Before each PDF file is saved, the firstname and surname are extracted from the detected name by splitting it on capital letters, converted to upper case, and included in the filename of the newly generated PDF.
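
The name handling can be sketched on its own as follows. This is a simplified stand-in for the list comprehensions in the script, not the script's exact code: it uses a regular expression, assumes plain A-Z capitals, and the sample names are made up. As in the script, the first capitalised segment is skipped and the second is taken as the firstname.

```python
import re


def split_name(matched_name):
    """Split a run-together name on capital letters and return (firstname, surname)."""
    # Each segment starts at a capital letter and runs up to the next one
    parts = re.findall(r"[A-Z][^A-Z]*", matched_name)
    firstname = parts[1]  # parts[0] is skipped, as in the script
    surname = ""
    for part in parts[2:]:
        # No space after a hyphen or apostrophe, so "O'" + "Brien" stays "O'Brien"
        if surname and not surname.endswith(("-", "'")):
            surname += " "
        surname += part
    return firstname.upper(), surname.upper()


print(split_name("MrJohnSmith"))       # ('JOHN', 'SMITH')
print(split_name("MsAnnSmith-Jones"))  # ('ANN', 'SMITH-JONES')
print(split_name("MrSeanO'Brien"))     # ('SEAN', "O'BRIEN")
print(split_name("MrJanVanDerBerg"))   # ('JAN', 'VAN DER BERG')
```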

Bonus Feature!

The script also extracts the Employee No. from each page (along with the firstname and surname). This is useful as on some productions, Crew Members can have the same name (it does happen on large Crews!), and thus including the Employee No. in the filename allows for better identification.

So each file has the person’s firstname, surname, and Employee No. - which is just what you want!

Note:
Unfortunately, not all payslips have an Employee No. on them (a bug Sargent Disc, presumably, are working to resolve…?). On pages where there is no Employee No., the word ‘empty’ is used instead as a text placeholder.
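
As a small illustration of why the Employee No. helps, the sketch below pulls it from some made-up page text and builds a filename in the same pattern as the script. The function names extract_emp_no and build_filename are assumptions for this example, and the sample text is not taken from a real payslip.

```python
import re


def extract_emp_no(page_text):
    """Return the text after 'Emp.No:' or the placeholder 'empty' when it is missing."""
    match = re.search(r"Emp\.No:(\w*)", page_text)
    return match.group(1) if match else "empty"


def build_filename(surname, firstname, emp_no):
    """Mirror the script's naming pattern: SURNAME, FIRSTNAME (EMPNO)."""
    return f"{surname.upper()}, {firstname.upper()} ({emp_no.upper()})"


with_number = "Name:MrJohnSmith\nEmp.No:A123\n"
without_number = "Name:MrJohnSmith\n"

print(build_filename("Smith", "John", extract_emp_no(with_number)))     # SMITH, JOHN (A123)
print(build_filename("Smith", "John", extract_emp_no(without_number)))  # SMITH, JOHN (EMPTY)
```

Because the filename is converted to upper case, the placeholder shows up as EMPTY in the final name. Two crew members with the same name and no Employee No. would still end up with the same filename.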

That’s really all there is to it!

Extensions

The basic methods and concepts illustrated here can be extended to any text-readable PDF (e.g. V.A.T. Invoices, Digital Timesheets, Purchase Orders, Contracts, Start Forms, etc.).

Other fields from the Payslip could also be extracted, such as WEEK No., etc.
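
A minimal sketch of how that might look is given below, using one helper that takes the field label as a parameter. The helper name extract_field and the label ‘Week No:’ are assumptions: the actual wording on a payslip would need to be checked against the extracted text first.

```python
import re


def extract_field(page_text, label, default="empty"):
    """Return the run of word characters that follows `label`, or `default` if absent."""
    # [ \t]* allows spaces or tabs after the label but does not run on to the next line
    match = re.search(re.escape(label) + r"[ \t]*(\w+)", page_text)
    return match.group(1) if match else default


# Made-up page text; 'Week No:' is a hypothetical label
sample = "Name:MrJohnSmith\nEmp.No:A123\nWeek No: 43\n"

print(extract_field(sample, "Week No:"))   # 43
print(extract_field(sample, "Emp.No:"))    # A123
print(extract_field(sample, "Tax Code:"))  # empty
```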

Furthermore, Python also has libraries that can work with applications like MS Excel, and various Email Clients, so the above workflows can, with some effort, be extended out even further!
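
As one small step in that direction, the sketch below writes a list of the generated PDFs to a CSV file, which MS Excel can open. It deliberately uses only the standard-library csv module instead of a dedicated Excel library, and the function name write_index and the single ‘filename’ column are assumptions for illustration.

```python
import csv
from pathlib import Path


def write_index(output_folder, index_path):
    """List every PDF in output_folder in a CSV file and return the number of rows."""
    pdf_names = sorted(p.name for p in Path(output_folder).glob("*.pdf"))
    with open(index_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["filename"])
        for name in pdf_names:
            writer.writerow([name])
    return len(pdf_names)
```

For example, write_index('Output', 'payslip_index.csv') could be run after the main script has finished.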

The sky is the limit!