Exercises for Chapter 3
Exercise 1: Data Structure and Content
In this exercise, we think about the idea of a data structure more generally. There is no need to write R code!
Tables are only one type of data structure. From your experience with R, do you know others? What are their properties?
Computers store images as sets of pixels. Can you describe what the data structure of a pixelized image looks like? How do we obtain individual values from this data structure? What is the “content” of this data structure?
Imagine we want to propose a simple data structure, a “queue”. This data structure should store information on customers waiting to be served at a business. What properties should the queue have? What minimal operations for adding and removing customers do we need?
Exercise 2: Table Columns and Types in R
As you know, R stores tabular data in “data frames”. In this exercise, we take a closer look at the columns of data frames and their types. Here, you should use R to experiment!
Use the tdb data frame with two countries and three columns (country, population, capital) that we created in the chapter. What type does the population variable have? Add a third country (Germany) and the following population value: 82 million. What happens to the type of the population variable?
In your dataset, somebody mistakenly changes Germany’s name to a numeric country identifier (265). How is this done in R? What happens to the type of the column? Why?
Now, we drop Germany from the data. We also want to fix the type of the population column. How can we do this?
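As a possible starting point for these experiments, the sketch below recreates tdb (using the values listed in Exercise 3) and shows how column types can be inspected with class() and sapply(). The vector mixed is a hypothetical illustration and not part of the chapter's data; it only demonstrates how one might check what R does when values of different kinds end up in the same vector.

```r
# Recreate the chapter's tdb data frame (values as listed in Exercise 3)
tdb <- data.frame(
  country = c("Switzerland", "Austria"),
  population = c(8.3, 8.7),
  capital = c("Bern", "Vienna")
)

# Type of a single column
class(tdb$population)

# Types of all columns at once
col_types <- sapply(tdb, class)
print(col_types)

# Hypothetical illustration: a number and a piece of text in one vector
mixed <- c(1.5, "two")
class(mixed)
```

Running class() before and after each modification of the data frame is one simple way to observe how the column types change.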
Exercise 3: Data Filtering
- Use the tdb data frame with two countries and three columns (country, population, capital) that we have created:
tdb <- data.frame(
  country = c("Switzerland", "Austria"),
  population = c(8.3, 8.7),
  capital = c("Bern", "Vienna")
)
How can we filter out entries with a population of more than 8.0 million, but less than 8.4 million?
- What exactly does this example do?
tdb[tdb$country == "Switzerland", ]
- (more difficult) Using the grepl() and names() functions, can you subset the tdb data frame to those columns whose names start with the letter “c” (without specifying these column names explicitly)?
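As a hint for the last question, the following minimal sketch shows how grepl() behaves on a plain character vector. The vector example_names is hypothetical and used only for illustration; applying the same idea to the column names of tdb is left as part of the exercise.

```r
# Hypothetical column names, used only for illustration
example_names <- c("city", "age", "country_code", "income")

# The pattern "^c" matches strings that begin with a lowercase "c"
starts_with_c <- grepl("^c", example_names)
print(starts_with_c)

# A logical vector can be used to pick the matching elements
example_names[starts_with_c]
```

grepl() returns one logical value per input string, which is why its result can be used directly for subsetting.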
Exercise 4: Table Design
Imagine that we are conducting a research project that collects data about families. Each family consists of one or more persons living at the same address. The project collects data about each family as a whole (city of residence, total income), but also data about each family member (age, gender). The data should be stored in a tabular database.
Describe two different designs for this database, one using a single table, the second using two tables. What would these two designs look like? Discuss the weaknesses of each approach.
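To make the discussion more concrete, here is one possible sketch of the two layouts in R. All table names, column names, and values are invented for illustration, and the linking column family_id is an assumption of this sketch rather than something prescribed by the exercise.

```r
# Hypothetical example data; all names and values are invented

# Table 1: one row per family
families <- data.frame(
  family_id = c(1, 2),
  city = c("Bern", "Vienna"),
  total_income = c(90000, 72000)
)

# Table 2: one row per person, linked to a family by family_id
persons <- data.frame(
  person_id = c(1, 2, 3),
  family_id = c(1, 1, 2),
  age = c(41, 39, 67),
  gender = c("f", "m", "f")
)

# Joining the two tables yields a single-table layout in which
# family-level values are repeated for every family member
combined <- merge(persons, families, by = "family_id")
print(combined)
```

Comparing combined with the two separate tables may help when thinking about where information is repeated and what each layout requires when a value has to be updated.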