Exercises for Chapter 5

Exercise 1: A Malformed Excel Spreadsheet

The file wdi-population-file1.xlsx contains a modified version of the country-level population dataset obtained from the World Bank’s World Development Indicators. Export the file to CSV and import it in R with read.csv(). Make sure that all columns are imported and have the correct type, and that no unnecessary columns are present. What errors does the Excel contain (there are three of them)? Explain how they lead to the import problems you encountered.

Exercise 2: Multiple Tables in a Single Spreadsheet

One of the dangers of data being kept in spreadsheets is that they do not satisfy a tabular data format. For example, some spreadsheets contain multiple tables in a single sheet, which makes it impossible to easily export to CSV. Take a look at file wdi-population-file2.xlsx, where the first sheet contains multiple tables. We can either fix these cases manually, or use the read_excel() function. How can you import both tables with this function? Consult the documentation of this function to find out.

Exercise 3: Data with Color Formatting

We oftentimes encounter data formatted with colors, there the color actually contains information. As discussed in the chapter, this is generally not a good idea, because formatting elements such as color cannot be processed and exported easily. File wdi-population-file3.xlsx contains an example for this. In this file, entries that refer to regions or groups of countries (and not to individual, independent states), have been marked with a yellow background. We would like to clean up our dataset, such that these entries are removed. Can you find an easy way to do this, without deleting the individual entries? Hint: Excel’s filtering function described in the chapter could be helpful here.

Exercise 4: The True Structure of Excel Files

The file vdem-polyarchy.xlsx in the data repository contains democracy estimates (“polyarchy”) from the V-Dem project for three countries and the year 2020. Take a quick look with a spreadsheet software to see what the content looks like. The data is stored in Excel’s standard file format. As we have discussed in the book, this is a binary file format that cannot be read directly with a text editor. Really? Let’s see.

Rename the file and give it a different file ending: .zip. You can do this with the Explorer or in the macOS finder. If you get a warning that the file type will be changed, just confirm that you want to do this. The file name should now be vdem-polyarchy.zip. Now, let’s uncompress the file. Under Windows, you can just double-click the file to do this, but under macOS, you have to use the Terminal or the excellent Keka archiver (available free of charge from https://www.keka.io/en/). Once the file has been uncompressed, you will get a set of several files and folders, which together contain the data and the formatting of the Excel file. Ech of these files is a text file, so you can open and read it with a standard text editor such as the one built into RStudio. This shows that an .xlsx file simply a compressed collection of text files. Can you find out which one contains the data for the spreadsheet (the list of countries and democracy scores)?