1 Data Acquisition

1.1 Initial Acquisition

As opposed to downloading large csv datasets, we decided to use rscorcard to retrieve only the relevant information. rscorecard is a R wrapper for U.S. Department of Education College Scorecard Data API which allows for the remotly access and download the Department of Education’s College Scorecard data. It is based on the ‘dplyr’ model of piped commands to select and filter data in a single chained function call.

This package can installed in RStudio by running the command install.packages("rscorecard") in the console.

Merely loading the package is not enough to use its functions. In order to communicate with the server which hosts the data, an API key from the U.S. Department of Education is required. This key acts as an authentication tool.

rscorecard::sc_key(APIKey)

Our group was interested in the relationship between tuition cost and earnings after graduation across different academic fields. Although there were fields which recorded the proportion of degrees recieved in specific fields, these fields were not well standardized (i.e. “Engineering” vs. “Engineering or related fields”). Therefore, we decided to narrow our scope to special-focus schools, as defined by the basic classification Carnegie Classification method, instead.

The special-focus designation was based on the concentration of degrees in a single field or set of related fields, at both the undergraduate and graduate levels. Institutions were determined to have a special focus if they met any of the following conditions:

Conferred at least 75% of degrees in just one field (as determined by the first two digits of the CIP Code) other than “Liberal Arts & Sciences, General Studies or Humanities” (CIP2=24)

Conferred 70-74% in one field and conferred degrees in no more than 2 other fields.

Conferred 60-69% in one field and conferred degrees in no more than 1 other field.

— Carnegie Classification of Institutions of Higher Education

Table 1.1 provides a list of the school types that we included in our dataset. For a complete list of all school classifications and corresponding definitions/methodology, see Carnegie Classification of Institutions of Higher Education. The specific earnings and tuition variables we considered are located in Table 1.2.

Table 1.1: School classifications considered in our analysis.
value	label
24	Special Focus Four-Year: Faith-Related Institutions
25	Special Focus Four-Year: Medical Schools & Centers
26	Special Focus Four-Year: Other Health Professions Schools
27	Special Focus Four-Year: Engineering Schools
28	Special Focus Four-Year: Other Technology-Related Schools
29	Special Focus Four-Year: Business & Management Schools
30	Special Focus Four-Year: Arts, Music & Design Schools
31	Special Focus Four-Year: Law Schools
32	Special Focus Four-Year: Other Special Focus Institutions

Data is obtained through a series of piped functions: sc_init(), sc_filter(), sc_select(), sc_year(), and sc_get(). Unfortunately, the sc_year() function can only take a single-year entry. To simplify the process of grabbing the same data over the course of several years, a simple function was created.

get.data <- function(year, vars.vector){
  print(paste("Grabbing data for",year))
  rscorecard::sc_init() %>%
    rscorecard::sc_filter(ccbasic %in% c(24:32)) %>%
    rscorecard::sc_select_(vars.vector) %>%
    rscorecard::sc_year(year) %>%
    rscorecard::sc_get()
}

As mentioned earlier, the data collection began back in the 1900’s. As it turns out, the variables of interest to our group, earnings later in life and tuition, were not recorded until 2000. Therefore, the earlier years are excluded from our analysis.

years <- c(2000:as.numeric(format(Sys.Date(), "%Y")))
csb.data <- dplyr::bind_rows( lapply(years, get.data, vars_to_pull) )

1.2 Continual Aquisition

The server which hosts the college data only allows up to 100 calls at a given time. Due to the use of Github Actions to automatically re-render the report after any changes pushed to the main branch, this access limitation makes it impractical for long-term data sourcing. Therefore, the compiled data was converted into a csv file and stored remotely in Google’s cloud service (Google Drive) to facilitate unrestricted access to the data.

write.csv(csb.data, "CollegeScorecardData20to21.csv", row.names=F)

The data stored on Google Drive can henceforth be obtained through a download using the id of the file and the following Google address to access it.

id <- "1Sfa6V3vhxQU3wDMTKsPUpnkp9DX8gsOK"
csb.data <- read.csv(sprintf("https://docs.google.com/uc?id=%s&export=download", id))