Splitting Large PDFs - S.D. Payslips
Overview
This example shows how a medium-length Python script, together with the PyPDF2 and pathlib libraries, can be used to split a large PDF file you have received into smaller individual files.
The example shown below works on payslip PDFs, such as those you might get from Sargent Disc, a small to medium-sized payroll vendor that operates in the U.K. Film & Entertainment space. These are the kind of data files I used to work with a lot in a previous industry role, on an Apple TV+ production.
As with all scripts, the one below can be modified to work with any particular system (that’s one advantage of scripts: they are flexible!).
Python Script
The full script is shown below:
import os
import re
from pathlib import Path

# Note: PdfFileReader, PdfFileWriter, numPages, getPage, addPage and extractText
# are the PyPDF2 1.x/2.x names. PyPDF2 3.0 removed them in favour of PdfReader,
# PdfWriter, pages, add_page and extract_text.
from PyPDF2 import PdfFileReader, PdfFileWriter

# Specify Input and Output Locations
pdf_file_path = 'Payslips.pdf'
file_base_name = pdf_file_path.replace('.pdf', '')
working_directory = Path.cwd()
output_folder = Path('Output')
output_folder_path = working_directory / output_folder
output_folder_path.mkdir(exist_ok=True)  # create the Output folder if it is missing
pdf = PdfFileReader(pdf_file_path)

# Split Files
count = 0  # number of upcoming pages already added to the previous payslip
for page_num in range(pdf.numPages):
    # Skip Parent Loop if needed: this page already belongs to the previous payslip.
    # Decrement by one per skipped page, so that payslips of three or more pages
    # are skipped in full.
    if count > 0:
        count -= 1
        continue

    # Setup Objects & Classes
    pdfWriter = PdfFileWriter()
    pageObj = pdf.getPage(page_num)
    pdfWriter.addPage(pageObj)

    # Search on Current Page
    Text = pageObj.extractText()
    MatchedNameArray = re.findall(r"Name:[^0-9]+?\s", Text)
    MatchedName = (MatchedNameArray[0].replace('Name:', '')).replace('\n', '')
    MatchedEmpNoArray = re.findall(r"Emp.No:\w*", Text)
    if MatchedEmpNoArray:
        MatchedEmpNo = (MatchedEmpNoArray[0].replace('Emp.No:', '')).replace('\n', '')
    else:
        MatchedEmpNo = 'empty'

    # Search on following Pages: keep adding pages while the name stays the same
    i = page_num + 1
    while i < pdf.numPages:
        pageObjNext = pdf.getPage(i)
        TextNext = pageObjNext.extractText()
        MatchedNameArrayNext = re.findall(r"Name:[^0-9]+?\s", TextNext)
        MatchedNameNext = (MatchedNameArrayNext[0].replace('Name:', '')).replace('\n', '')
        if MatchedName == MatchedNameNext:
            i += 1
            count += 1
            pdfWriter.addPage(pageObjNext)
        else:
            break

    # Split MatchedText on UpperCase (the appended 'A' marks the end of the last piece)
    res_pos = [j for j, e in enumerate(MatchedName + 'A') if e.isupper()]
    res_list = [MatchedName[res_pos[k]:res_pos[k + 1]] for k in range(len(res_pos) - 1)]

    # Extract Firstname (the second capitalised piece; the first piece is skipped)
    firstname = res_list[1]

    # Extract Surname: join the remaining pieces with spaces, except after a
    # hyphen or an apostrophe
    del res_list[0:2]
    surname = res_list[0]
    for n in range(1, len(res_list)):
        if res_list[n - 1][-1] in ("-", "'"):
            surname = surname + res_list[n]
        else:
            surname = surname + " " + res_list[n]

    # Write PDF File (the with block closes the file automatically)
    output_name = f"{surname.upper()}, {firstname.upper()} ({MatchedEmpNo.upper()})"
    with open(output_folder_path / output_name, 'wb') as f:
        pdfWriter.write(f)

# Rename Files in Output Directory (this applies to every file in the folder)
files = os.listdir(str(output_folder_path))
for file in files:
    os.rename(os.path.join(str(output_folder_path), file),
              os.path.join(str(output_folder_path), 'PAYSLIP - WE 25JAN 2022 - ' + file + '.pdf'))
How to Use it
- The input file is called Payslips.pdf. This should be stored in the same folder as this script.
- The script can be given any meaningful name, but should have the file extension .py so that your computer recognises it as a Python script.
- The split PDFs are saved in a folder called Output. This Output folder should sit alongside the input PDF file and the Python script, within the same working directory.
- The name given to each new PDF file can be modified by editing the last line of the script above.
- The script can be run easily from any Python interpreter or editor on a Mac or Windows computer (Visual Studio Code is recommended!).
- The script takes less than 5 seconds to process 500 payslips!
Important:
Re-running the script will overwrite any files in the Output folder that have the same name, and the final renaming step applies to every file in that folder, including any left over from an earlier run. So don’t forget to clear out the folder, and carefully check what filename you give to any newly generated PDFs, before running the script.
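One way to avoid forgetting this step is to let Python check the folder before anything is written. The sketch below is not part of the script above; the helper name ensure_empty_output is made up for illustration, and it simply stops with an error if the Output folder already contains files.

```python
from pathlib import Path


def ensure_empty_output(folder):
    """Create `folder` if needed and refuse to continue if it holds any files.

    Hypothetical helper for illustration; it is not part of the original script.
    """
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    leftovers = sorted(p.name for p in folder.iterdir())
    if leftovers:
        raise FileExistsError(
            f"{folder} is not empty ({len(leftovers)} item(s)); clear it before running"
        )
    return folder


# Example usage, before the splitting loop starts:
# output_folder_path = ensure_empty_output(Path.cwd() / "Output")
```

Stopping early is a deliberate choice here: it is safer than silently overwriting or renaming payslips left over from a previous run.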
How it Works
The script works by:
- Scanning through each page of the input PDF file
- Then finding out when the name written after the ‘Name:’ label changes from one page to the next.
- When it does, it extracts all the relevant pages up to that point into a PDF object
- The PDF object is then written out as a new PDF file and saved into the folder called ‘Output’.
For example, if a name repeats across several pages in the input file, the Script will keep looping through each page, and add each page to the same PDF object, until the name on the page changes. When it does detect a name change, it will then write out and save the PDF object as a new PDF file. The process then repeats.
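The grouping idea can be tried out without any PDF at all. The following is a minimal sketch that works on a plain list of names, one per page; it uses itertools.groupby rather than the look-ahead loop in the script, and the function name and sample names are made up for illustration.

```python
from itertools import groupby


def group_consecutive_pages(names_per_page):
    """Return (name, [page indices]) for each run of consecutive identical names."""
    groups = []
    page_index = 0
    for name, run in groupby(names_per_page):
        run_length = len(list(run))
        groups.append((name, list(range(page_index, page_index + run_length))))
        page_index += run_length
    return groups


# Sample data: the name detected on each page, in page order
pages = ["MsAnnLee", "MsAnnLee", "MsAnnLee", "MrBobRay"]
for name, page_numbers in group_consecutive_pages(pages):
    print(name, page_numbers)
```

Each group would correspond to one output PDF. Note that only consecutive pages are grouped, so the same name appearing again later in the file would start a new group.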
- Before each PDF file is saved, the first name and surname are extracted from the detected name by splitting it at each capital letter, converted to upper case, and included in the filename of the newly generated PDF.
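The name handling can be sketched on its own as follows. This version uses a regular expression instead of the list comprehensions in the script, and the function name split_name is made up for illustration. The script skips the first capitalised piece of the detected name; the sketch assumes that piece is a title such as Mr or Ms, which is a guess on my part about the extracted text.

```python
import re


def split_name(raw_name):
    """Split a run-together name such as 'MrJohnSmith' into (firstname, surname).

    Assumption: the first capitalised piece is a title and is discarded.
    """
    pieces = re.findall(r"[A-Z][^A-Z]*", raw_name.strip())
    if len(pieces) < 3:
        raise ValueError(f"expected a title, first name and surname in {raw_name!r}")
    firstname = pieces[1]
    surname = pieces[2]
    # Join the remaining pieces with spaces, except after a hyphen or apostrophe
    for previous, piece in zip(pieces[2:], pieces[3:]):
        joiner = "" if previous[-1] in "-'" else " "
        surname += joiner + piece
    return firstname, surname


print(split_name("MrJohnSmith"))
print(split_name("MsMarySmith-Jones"))
```

One limitation to be aware of: because the split happens at every capital letter, a surname like McDonald would come out as "Mc Donald".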
Bonus Feature!
The script also extracts the Employee No. from each page (along with the firstname and surname). This is useful as on some productions, Crew Members can have the same name (it does happen on large Crews!), and thus including the Employee No. in the filename allows for better identification.
So each file has the person’s firstname, surname, and Employee No. - which is just what you want!
Note:
Unfortunately, not all Payslips have an Employee No. on them (a bug Sargent Disc, presumably, are working to resolve…?). On pages where there is no Employee No., the word ‘empty’ is used instead, as a text placeholder.
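The placeholder logic can be tested on plain text, as in the short sketch below. It uses the same ‘Emp.No:’ pattern as the script, but the function name and the sample page text are made up for illustration.

```python
import re


def extract_emp_no(page_text, placeholder="empty"):
    """Return the Employee No. found after 'Emp.No:', or the placeholder if absent."""
    match = re.search(r"Emp.No:(\w*)", page_text)
    return match.group(1) if match else placeholder


# Sample page text (invented), with and without an Employee No.
print(extract_emp_no("Name:MrJohnSmith\nEmp.No:A123\n"))
print(extract_emp_no("Name:MrJohnSmith\n"))
```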
That’s really all there is to it!
Extensions
The basic methods and concepts illustrated here can be extended to any text-readable PDF (e.g. V.A.T. Invoices, Digital Timesheets, Purchase Orders, Contracts, Start Forms, etc.)
Other fields from the Payslip could also be extracted, such as WEEK No., etc.
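As a rough sketch of how that might look, the helper below pulls the first word after any given label. The function name, the ‘Week No:’ label and the sample text are all assumptions for illustration; the real label on a payslip may well be written differently, so check the extracted text first.

```python
import re


def extract_field(page_text, label, placeholder="empty"):
    """Return the first word following `label` in the text, or the placeholder."""
    pattern = re.escape(label) + r"\s*(\w+)"
    match = re.search(pattern, page_text)
    return match.group(1) if match else placeholder


# 'Week No:' is a hypothetical label, not taken from a real payslip
sample_text = "Name:MrJohnSmith\nWeek No: 43\n"
print(extract_field(sample_text, "Week No:"))
```

Using re.escape on the label means characters such as full stops in the label are matched literally rather than as regular expression wildcards.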
Furthermore, Python also has libraries that can work with applications like MS Excel, and various Email Clients, so the above workflows can, with some effort, be extended out even further!
The sky is the limit!