How to use pdfplumber to improve your data reporting

June 21, 2024 | Categories: Python

The Problem

Many organizations rely on software tools that generate reports automatically, such as a content management system, a homeless management information system, or a student information system. For data engineering, the best-case scenario is for these tools to offer a reporting replica of the database, strong database controls, or an API that gives auditors and evaluators direct access to the live data. The administrator would grant read-only access to the evaluator, control which tables and columns are available for viewing, and provide a way for the engineer to connect to that database instance securely. Unfortunately, many software solutions do not come with this functionality out of the box. Instead, many programs let a user select a date range, click a "Generate Report" button, and output a PDF with tables that may look nice but cannot be read directly by a Python library like pandas or Polars, much less by Power BI or Tableau. This is where a library like pdfplumber becomes invaluable.

The Solution

pdfplumber lets you extract tables from a PDF file, and more generally walk through a PDF and pull out other content in an automated fashion. This gives you an opportunity to analyze the structure of a PDF file and build an extract, transform, and load (ETL) process to get the data into whatever format you need. To get started, open a terminal and run the following command:

pip install pdfplumber
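

Before extracting anything, it can help to confirm that pdfplumber can open your report and to see how many pages you are working with. A minimal sketch, using the same file path as the examples below:

import pdfplumber

pdf_path = r"C:\Users\Documents\Report.pdf"
with pdfplumber.open(pdf_path) as pdf:
    # Confirm the file opens and preview the first page's raw text
    print(f"Pages: {len(pdf.pages)}")
    print((pdf.pages[0].extract_text() or "")[:500])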


Below is an example of a first pass using the pdfplumber library:


import pdfplumber
import pandas as pd

def extract_tables_with_pdfplumber(pdf_path):
    """Collect every table in the PDF as a pandas DataFrame."""
    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # extract_tables() returns each table as a list of rows
            for table in page.extract_tables():
                tables.append(pd.DataFrame(table))
    return tables

pdf_path = r"C:\Users\Documents\Report.pdf"
extracted_tables = extract_tables_with_pdfplumber(pdf_path)


This iterates through the pages of your PDF and appends each table to a list as a DataFrame. By comparing the PDF file to your list of DataFrames, you can find where in the list the data you need lives.

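One quick way to make that comparison is to print a short summary of each extracted table. A minimal sketch using the list built above:

for i, df in enumerate(extracted_tables):
    # Show each table's position in the list, its dimensions, and its first rows
    print(f"Table {i}: {df.shape[0]} rows x {df.shape[1]} columns")
    print(df.head(2), "\n")
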
For example, let's say the table you need is the ninth item in the list. Because Python lists are zero-indexed, you can reference it with the following line of code:


df_2 = extracted_tables[8]
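
Extracted tables often arrive with the column header embedded as the first row, along with stray empty cells. A small cleanup pass with pandas can tidy this up; the sketch below assumes the header sits in row 0 of your table, which may not match your report:

# Promote the first extracted row to the column header,
# then drop any rows or columns that are entirely empty.
df_2.columns = df_2.iloc[0]
df_2 = df_2.iloc[1:].reset_index(drop=True)
df_2 = df_2.dropna(how="all").dropna(axis=1, how="all")
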
PDFs use a format that ensures a page renders the same regardless of what device it is viewed on, so the file records where each element begins and ends, measured in points (72 points per inch). This means you can use XY coordinates to locate certain items, such as tables, in the PDF file, which can make extraction easier. For example, if you ran the code above and ended up with multiple empty rows and columns, it may help to identify where the table element in the PDF begins and ends using a mouse coordinate tracker. You can then change your code to the following:


def extract_table_part(pdf_path, page_num, x0, x1, top, bottom):
    """Extract the table that falls inside a bounding box on one page."""
    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_num]
        # Crop the page to the bounding box before extracting, so only
        # the cells inside the rectangle are returned. pdfplumber bounding
        # boxes are (x0, top, x1, bottom), measured in points.
        cropped = page.crop((x0, top, x1, bottom))
        table = cropped.extract_table()
        return table or []  # Return an empty list if no table is found

# Coordinates of the desired area, converted from inches to points
x0, x1 = (0.32 * 72), (8 * 72)
top, bottom = (0.4 * 72), (2.5 * 72)
pdf_path = r"C:\Users\Documents\Report.pdf"

# Gather the rows from every page into a single list
receiving_list = []
with pdfplumber.open(pdf_path) as pdf:
    page_count = len(pdf.pages)
for i in range(page_count):
    extracted_part = extract_table_part(pdf_path, i, x0, x1, top, bottom)
    receiving_list.extend(extracted_part)

# In this example, row 6 holds the header and rows 7 through 13 hold the data
dataframe = pd.DataFrame(receiving_list[7:14], columns=receiving_list[6])

With this new example, you identify the rectangular area where the table begins and ends on each page, iterate through the pages, and then assemble the final dataset you need. This can produce a much cleaner final product.
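
If you are unsure of the right coordinates, pdfplumber can also render a page to an image and draw a candidate bounding box on it, which lets you check the rectangle visually before extracting. A minimal sketch, reusing the coordinates from above (the output filename is arbitrary):

with pdfplumber.open(pdf_path) as pdf:
    page = pdf.pages[1]
    # Render the page and draw the candidate bounding box on top of it
    im = page.to_image(resolution=150)
    im.draw_rect((x0, top, x1, bottom))
    im.save("bbox_check.png")

Once the coordinates line up, a final dataframe.to_csv("report.csv", index=False) gives you a flat file that pandas, Power BI, or Tableau can read directly.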