Exercises for Chapter 6
For the exercises, we need again the datasets used in Chapter 6 of the book. You can import and prepare these datasets using the code presented in the chapter.
The WID:
wid <- read.csv(file.path("ch06", "us-inequality.csv"))
wid <- subset(wid, select = c(year, value))
colnames(wid) <- c("year", "p90p100")
The US presidents data:
presidents <- read.csv(
file.path("ch06", "us-presidents.csv"),
sep = ";"
)
colnames(presidents) <- tolower(colnames(presidents))
presidents <- subset(presidents, select = c(inoffice, president))
presidents$startyear <- as.numeric(substr(presidents$inoffice, 1, 4))
presidents$endyear <- as.numeric(substr(presidents$inoffice, 6, 9)) - 1
presidents$inoffice <- NULL
For the US GDP data, we retain the original data (quarterly estimates)
gdp <- read.csv(file.path("ch06", "us-gdp-pc.csv"))
colnames(gdp) <- c("date", "gdppc")
gdp$date <- as.Date(gdp$date)
and add a separate column that contains the year. For this, we use again the format()
function, but extract the year (%Y
) from the date:
gdp$year <- as.numeric(format(gdp$date, "%Y"))
Exercise 1: Alternative Merge Conditions
As we have seen in the chapter, joining different datasets with merge()
is quite limited due to the fact that the function can only match on the same values of the merge attributes. In the chapter, we have presented a way to get around this by first creating the Cartesian Product of the two tables, and then filtering out the combination of entries that we need. Can you use the same procedure to find out that the level of inequality was in the year before each presidential term? Use the wid
and presidents
datasets introduced in the chapter, and prepare them using the code from the chapter (see below).
Exercise 2: Different Aggregation Functions
In the chapter, we used the us-presidents.csv
dataset. You may have noticed that for some presidents, there are two entries (one for each term), while for other presidents with two terms (such as, for example, Barack Obama), there is only one. The reason is that the dataset also lists the vice presidents. If there was a new vice president in a president’s second term in office, the dataset introduces another row. Let us know standardize this dataset such that there is exactly one row per president, which gives the name, the start year (of the first term) and the end year (of the last term) of the respective president. This can be done with a simple aggregation! Think about what the groups are that you aggregate over, and what suitable aggregation functions can give you the start and end years.
Exercise 3: Create a Lagged Variable with Merging
For many types of statistical analysis in the social sciences, we need so-called lagged dependent variables. These are variables that for a given case and point in time, contain the same variable for the same case, but for the previous time point. For example, lagging the (annual) inequality estimate for the US in 1992 would be the value for the previous year, 1991. It is actually possible to create a lagged dependent variable using merging. Can you do this for the US inequality estimates in the wid
data frame? Hint: you need to merge the table with a copy of itself, where you adjust the year such that it refers to the year for which a given value should be used.
Exercise 4: Determining When the Maximum Occurs
In this exercise, we look at the quarterly US GDP per capita data in the gdp
data frame. Recall that this dataset contains quarterly GDP estimates, for January 1, April 1, July 1 and October 1 of each year. Economic performance typically varies by season, and we would like to find out in which quarter of each year the US GDP is the highest in a given year. We do this with a series of aggregation and merge operations. Hint: First, create an aggregated dataset with the annual maximum values of GDP, and then merge this table with the original one to determine which quarter has the highest GDP in the respective year. For this merge operation, it is convenient to merge on two attributes instead of one, and to retain (some of the) unmatched rows. For the latter, take a close look at the all.x
or all.y
parameters of the merge()
function.