Data Science Capstone

Run-Through Example

Author

Havisha Khurana

Published

Monday, March 17, 2025


One of the data sources I worked with for the project was public documentation from the Oregon Department of Education (ODE) detailing how state funds are allocated to school districts by student type. These documents came as structured PDFs. The primary ‘Data Sciency’ task I undertook during the capstone was to scrape 14 PDFs of 233 pages each.


In this section, I will walk through an example of scraping data for one district. All the functions used here are in the code/ folder.

Let’s start by looking at one PDF page.


Let’s see the intermediate steps for transforming the data.



From the PDF, I build a dataframe with one row per line of text.
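As a rough illustration of this step, the sketch below turns the text of one page into a dataframe of lines using base R only. It is not the project's actual code: `page_to_lines` is a hypothetical helper, the placeholder lines are invented, and the page text is assumed to be already extracted as a single string (for instance with something like `pdftools::pdf_text()`, which returns one string per page).

```r
# Stand-in for the text of one PDF page. Only the first line comes from the
# post; the placeholder lines are invented for illustration.
page_text <- paste(
  "Baker County, Baker SD 5J District ID: 1894",
  "",
  "Placeholder line A    100.00",
  "Placeholder line B    200.00",
  sep = "\n"
)

# Hypothetical helper: one row per non-empty line, with its position on the page.
page_to_lines <- function(text, page = 1L) {
  lines <- strsplit(text, "\n", fixed = TRUE)[[1]]
  lines <- trimws(lines)          # drop leading/trailing whitespace
  lines <- lines[nzchar(lines)]   # drop empty lines
  data.frame(
    page = rep(page, length(lines)),
    line_no = seq_along(lines),
    text = lines,
    stringsAsFactors = FALSE
  )
}

lines_df <- page_to_lines(page_text)
print(lines_df)
```

Keeping the page and line number alongside the text makes it easier to trace a scraped value back to its place in the PDF.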

Then, I create a list of 197 dataframes, each holding the information for one Oregon school district.
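One way to do this kind of split, sketched under assumptions: if every district block begins with a header line containing "District ID:", a running count of header lines gives each row a district index, and `split()` turns that into a list of dataframes. The second district and the placeholder lines below are invented, and the header rule is a guess based on the single header line shown in this post.

```r
# Toy dataframe of lines; only the Baker header comes from the post.
lines_df <- data.frame(
  text = c(
    "Baker County, Baker SD 5J District ID: 1894",
    "Placeholder line A",
    "Placeholder line B",
    "Example County, Example SD 1 District ID: 0000",  # invented district
    "Placeholder line C"
  ),
  stringsAsFactors = FALSE
)

# Assumed rule: a line containing "District ID: <digits>" starts a new district.
is_header <- grepl("District ID: [0-9]+", lines_df$text)

# cumsum() increases by one at each header, so all rows in a block share an index.
district_index <- cumsum(is_header)

# One dataframe per district.
district_dfs <- split(lines_df, district_index)
print(length(district_dfs))
```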

I then classify each line by type, leveraging the consistent structure of the PDFs.
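A minimal sketch of rule-based line classification follows. Only the `district_info` label appears in this post; the `value_row` and `other` labels, the rules behind them, and the `classify_line` helper are assumptions made for illustration.

```r
# Hypothetical classifier: label each line using simple pattern rules.
classify_line <- function(text) {
  ifelse(
    grepl("District ID: [0-9]+", text), "district_info",  # district header line
    ifelse(
      grepl("[0-9]$", text), "value_row",                 # assumed: ends in a number
      "other"                                             # everything else
    )
  )
}

# Only the first line comes from the post; the rest are invented.
example_lines <- c(
  "Baker County, Baker SD 5J District ID: 1894",
  "Placeholder category    100.00",
  "Placeholder section heading"
)

line_types <- classify_line(example_lines)
print(data.frame(text = example_lines, type = line_types))
```

The header rule is checked first, so a header line that happens to end in digits is still labeled `district_info`.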

Using the line classified as district_info, I extract the district details: county, district name, and district ID.

Below is the result for one school district after applying the regular-expression rules.

     [,1]                                          [,2]           [,3]         
[1,] "Baker County, Baker SD 5J District ID: 1894" "Baker County" "Baker SD 5J"
     [,4]  
[1,] "1894"
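The matrix above has the shape that `stringr::str_match()` returns: the full match followed by one column per capture group. The sketch below captures the same three groups with base R. The pattern is my reconstruction from this single example line, not necessarily the one used in the project, and the column names are hypothetical.

```r
district_line <- "Baker County, Baker SD 5J District ID: 1894"

# Assumed pattern: "<county> County, <district name> District ID: <digits>"
pattern <- "^(.+ County), (.+) District ID: ([0-9]+)$"

# regexec() locates the full match and each capture group;
# regmatches() extracts them as a character vector.
parts <- regmatches(district_line, regexec(pattern, district_line))[[1]]

district <- data.frame(
  county = parts[2],
  district_name = parts[3],
  district_id = parts[4],
  stringsAsFactors = FALSE
)
print(district)
```

The ID is kept as a character string here so that any leading zeros would survive; converting it with `as.integer()` is a separate choice.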



I then repeat these steps across all districts and all years.
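The repetition could be sketched as nested `lapply()` calls whose results are stacked with `rbind()`. Everything here is illustrative: `parse_header` is a hypothetical helper, the year labels are placeholders rather than the real years, and only the district header line is parsed instead of the full set of rules.

```r
# Hypothetical helper: parse one district header line for a given year.
parse_header <- function(line, year) {
  pattern <- "^(.+ County), (.+) District ID: ([0-9]+)$"
  parts <- regmatches(line, regexec(pattern, line))[[1]]
  data.frame(
    year = year,
    county = parts[2],
    district_name = parts[3],
    district_id = parts[4],
    stringsAsFactors = FALSE
  )
}

# Placeholder input: header lines grouped by year (year labels are invented).
headers_by_year <- list(
  year_1 = c("Baker County, Baker SD 5J District ID: 1894"),
  year_2 = c("Baker County, Baker SD 5J District ID: 1894")
)

# Outer loop over years, inner loop over districts, then stack the rows.
per_year <- lapply(names(headers_by_year), function(year) {
  rows <- lapply(headers_by_year[[year]], parse_header, year = year)
  do.call(rbind, rows)
})
all_districts <- do.call(rbind, per_year)
print(all_districts)
```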