Scraping Over 2000 PDF Pages!

Author

Havisha Khurana

Published

Monday, March 17, 2025

One of the data sources I worked on for the project were public documentation from the Oregon Department of Education (ODE) that detailed state funds allocation to school districts by student types. This document came in the form of structured .pdfs. The primary ‘Data Sciency’ task I undertook during the capstone was to scrape 14 pdfs of 233 pages each.

In this section, I explain the logic of the text-scraping code. All these functions are found in the code/ folder.

Let’s start by looking at one pdf page.

The first step is the read the pdf as a dataframe.

As parameter, I pass the relative file path where the pdf is saved.

I use the pdf_text() function from the pdftools package which converts each pdf page into an object of type string.

Then, I apply a series of cleaning process across each page-character object using the map2_dfr() function from the purr package.

The cleaning steps include breaking the string into different lines separating at next line delimiter, removing white spaces, and multiple spaces in each line, and changing the hyphen symbol.

Now each line of each page is a different character object. I collapse all of them into a single dataframe.

In the next function, I find lines corresponding to district enteries so that I can leverage the structure.

Then, I split the dataframe by district, so that each district has all its lines in a separate object.

This function shows the steps to extract information from lines related to a single district.

The function parameters is a district-specific dataframe, created above.

I classify all the lines in accordance with the structure.

Then, I use regular expressions to extract key details based on information_type

Another instance where I used a regular expressions to extract key details

Lastly, I organize this information as a dataframe.

scrape_pdf_data <- function(file_path) {

  file_text_lines <- pdf_text(file_path) %>%
    map2_dfr(., 1:length(.), ~ {
      text <- strsplit(.x, "\n")[[1]]
      text <- text[!grepl("^\\s*$", text)]
      text <- gsub("^\\s\\s+", "", text)
      text <- gsub("\\s+", " ", text)
      text <- gsub("‐", "-", text)

      page_df <- data.frame(text = text, page = .y)

      return(page_df)
    })

  [...]
}

scrape_pdf_data <- function(file_path) {
  [...]
  
  get_district_lines <- file_text_lines %>%
    mutate(district_id_found = as.integer(grepl("District ID:", text)))
    select(page, district_id_found) %>%
    distinct() %>%
    filter(district_id_found == 1) %>%
    mutate(district_index = row_number()) %>%
    select(-district_id_found) %>%
    full_join(tibble(page = 1:max(file_text_lines$page)), by = "page") %>%
    arrange(page) %>%
    fill(district_index, .direction = "down")

  district_year_list <- file_text_lines %>%
    left_join(get_district_lines) %>%
    split(., .$district_index)

  return(district_year_list)
}

one_district_info <- function(district_df) {
  district_df <- district_df %>%
    mutate(
      information_type = case_when(
        grepl("District ID", text) ~ "district_info",
        # adding as files from 2014 onwards have this in the entry name
        grepl("for funding calculation|information only", text) ~ "entry_name",
        grepl(":", text) ~ "category_info",
        grepl("ADMw", text) ~ "total",
        TRUE ~ "entry_name"
      )
    )  %>%
    mutate(entry_index = cumsum(information_type == "entry_name"))
  
  # grab district level info
  district_details <- district_df %>%
    filter(information_type == "district_info") %>%
    select(text) %>%
    unlist() %>%
    str_match(., "^(.*?),\\s*(.*?)\\s*District ID:\\s*(\\d+)$")
   
  [...]

# grab each entry-level information
one_district_info <- function(district_df) {
  [...]
  entry_category_information <- district_df %>%
    filter(entry_index != 0) %>%
    split(., .$entry_index) %>%
    map_dfr(., ~ {
      # pattern different for 2024, no = sign
      category_pattern <-  "(.*):\\s*(-?[\\d,]+\\.\\d+)?\\s*X\\s*(-?[\\d.]+)
      \\s*=?\\s*(-?[\\d,]+\\.\\d+)?\\s*(-?[\\d,]+\\.\\d+)?\\s*X\\s*(-?[\\d.]+)
      \\s*=?\\s*(-?[\\d,]+\\.\\d+)?"
      
      
      .x %>%
        filter(information_type == "category_info") %>%
        select(text) %>%
        unlist() %>%
        str_match(., category_pattern) %>%
        as.data.frame() %>%
        select(-V7) %>%
        rename(
          "org_text" = "V1",
          "category" = "V2",
          "current_year_adm" = "V3",
          "category_weight" = "V4",
          "current_year_admw" = "V5",
          "past_year_adm" = "V6",
          "past_year_admw" = "V8"
        ) 
    )
  
  [...]
}

I go through this process for all pdfs. Now, Let’s walk-through the steps using an example.