3 Data Preprocessing

The raw data has 18,194 rows and 29 columns. A sample of the data is provided below in Table 3.1. The sample is transposed so that no information is cut off by the edge of the page.

Notice that the columns of the earnings table place the year subsets in the order 10, 6, then 8. To ensure this ordering is not reflected in the later graphs, the columns are renamed and then sorted alphabetically.
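The rename-and-sort step can be sketched as follows. This is a minimal illustration, not the exact code used for this report: the data frame `earnings` and its column names are hypothetical stand-ins. The 10, 6, 8 ordering is consistent with the names having been sorted as text (where "10" comes before "6"), so the sketch assumes the new names zero-pad the year so that alphabetical and numeric order agree.

```r
# Toy data frame; the column names are hypothetical stand-ins
# for the real earnings columns.
earnings <- data.frame(
  earn_p10 = c(52000, 48000),
  earn_p6  = c(41000, 39000),
  earn_p8  = c(46000, 43000)
)

# Rename with zero-padded year labels so that alphabetical order
# matches numeric order (06 < 08 < 10).
names(earnings) <- c("earn_p10", "earn_p06", "earn_p08")

# Reorder the columns alphabetically by name.
earnings <- earnings[, sort(names(earnings))]

names(earnings)
```

After this step the columns appear in the order 6, 8, 10, and any graph that follows column order inherits that ordering.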

Originally, the ccbasic (school type) column is an integer. Although the values are integers, they represent categories, so they must be converted into factors. Factors are R's representation of categorical variables, and they are treated differently from ordinary character strings. On that note, the unitid is also changed so that the ID is treated as a label and not as a discrete numeric value. The original and altered data types of the columns can be found in Table 3.2.
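A minimal sketch of these type conversions is given below. The data frame `schools` and its values are made up for illustration, and converting unitid to character is an assumption; Table 3.2 records the types actually used.

```r
# Toy data; the IDs and category codes are made up for illustration.
schools <- data.frame(
  unitid  = c(1001L, 1002L, 1003L),
  ccbasic = c(18L, 15L, 18L)
)

# Integer codes that stand for categories become a factor.
schools$ccbasic <- factor(schools$ccbasic)

# Assumption: the ID is stored as character so that it is never
# treated as a number in summaries or models.
schools$unitid <- as.character(schools$unitid)

str(schools)
```

Once ccbasic is a factor, modelling and plotting functions treat each code as its own group instead of interpreting the codes as points on a numeric scale.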

As mentioned earlier, the earnings features are all inflation adjusted, but to different years. To minimize the effect of overall economic conditions on our results, all monetary values are inflated or deflated to the target year 2020. All cost-of-attendance values are also adjusted to 2020. Table 3.3 lists the years to which we assume the earnings values were originally adjusted, based on the Technical Documentation of the Scorecard reports.
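The adjustment can be sketched as a ratio of price-index values. This is an illustration under stated assumptions: it assumes a CPI-style index is used, the helper `adjust_to_target` is hypothetical, and the index values in `cpi` are placeholders, not real CPI figures.

```r
# Placeholder index values keyed by year; NOT real CPI figures.
cpi <- c("2018" = 100, "2019" = 102, "2020" = 104)

# Hypothetical helper: rescale a value expressed in base_year dollars
# to target_year dollars using the ratio of the two index values.
adjust_to_target <- function(value, base_year, cpi, target_year = 2020) {
  base_index   <- unname(cpi[as.character(base_year)])
  target_index <- cpi[[as.character(target_year)]]
  value * target_index / base_index
}

# A value in 2018 dollars is inflated; a value already in 2020
# dollars is left unchanged.
adjust_to_target(c(50000, 50000), base_year = c(2018, 2020), cpi = cpi)
```

The same function deflates values whose base year is later than the target, because the index ratio is then below one.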