Exercises for Chapter 12
In these exercises, we use again the database spatialdata
with the two datasets we have used in the chapter. You can read these datasets and create the corresponding tables using the code from the chapter:
library(RPostgres)
library(readtext)
library(quanteda)
db <- dbConnect(Postgres(),
dbname = "textdata",
user = "postgres",
password = "pgpasswd"
)
docs <- readtext("ch12/*.txt",
docvarsfrom = "filenames",
docvarnames = c("country", "session", "year")
)
docs$text <- gsub("\\d+\\.\t", "", docs$text)
dbWriteTable(db, "speeches", as.data.frame(docs))
Exercise 1: Finding Calendar Years Using Regex
In this exercise, we use regular expressions to search the UN speeches for references to calendar years. More specifically, we need a regular expression that matches years in the 1900s and the 2000s. First, think about what a search pattern should look like, without worrying about the regex syntax. You already know how to match single digits, but for our purpose it would not be sufficient to simply search for sequences of four digits - this could give us four-digit numbers that are not years. This is why your pattern needs to specify the first two digits explicitly. The OR operator |
may be helpful here. For example (part1 | part2)
matches part1
or part2
. Also, the regex operator {n}
that matches exactly n
occurrences of a preceding pattern may be useful here.
In addition, we would like to exclude cases where a number such as 1966
occurs as part of a longer number, such as 131966
. Therefore, we would like our four-digit number to be preceded and followed by a non-digit character. Can you figure out how to do this in a regular expression?
How many speeches contain calender years? Can you run the regular expression both in R and PostgreSQL?
Exercise 2: Extracting Regex Matches in R
In the chapter and in Exercise 1, we used regular expressions to locate texts that contain a particular pattern. However, so far we were only interested in whether a match occurred in a document, and we did not use those parts of the text where the match was found. This is what we do in this exercise. First, develop a regular expression that matches equality
or equal
. Since we are interested in finding these terms as separate words (and not as part of other words in the text), we need to make sure that the pattern is preceded and followed by a whitespace character (the \s
regex symbol may be helpful here, which matches whitespaces, tabulators, and newlines). Also, rather than specifying the two terms as alternatives with a logical OR, try to make the ity
part optional, such that it can occur 0 or 1 time. The (p){n,m}
regex quantifier is helpful here, which matches the pattern p
at least n
and and most m
times. Next, use the str_extract()
function from the stringr
package to locate the pattern in the documents. Take a look at its output to make sure you understand what it does. Does it find all matches in a given document? Can you use a related function that does?
Exercise 3: Copying Files Directly from and to PostgreSQL
In the book chapter and in these exercises, we imported the speeches from the UNGDC via R to the database. That is, we first loaded all of them into R, and then sent the complete data frame to the PostgreSQL server from R. This strategy will become difficult once the datasets become much larger, which can happen with text data, but also with other datasets. In these cases, it may be useful to import data directly from the files to the PostgreSQL server, without the need to do this via R. This is also possible for exporting data - PostgreSQL can also write entire tables (or subsets thereof) to CSV files. For file-based import and export, however, you must have access to the computer that PostgreSQL is running on. If you are using a PostgreSQL server on your local machine (which is what we set up in Chapter 2 of the book), we can try how this works.
First, in the PostgreSQL documentation, look up the COPY
command and try to figure out how it works. Export the speeches
table to file. Make sure that you specify the CSV format. Also, your export file should have a header with the column names. For the output file, make sure to specifiy a complete file name such as /Users/nils/dmbook
. Can you figure out how to restrict the output to the single column text
? Open the file you created in a text editor and take a look at its structure and content. Next, can you import the CSV file that you created in the previous step to a new table? Here, you need to make sure to create the table structure first.
Exercise 4: Fuzzy String Matching in R
So far, when we searched texts, we used fixed patterns that we wanted to locate within the strings. Regular expressions give you some flexibility in doing so, but also require the specification of a pattern. An alternative way to search texts is by means of fuzzy string matching, where you specify a search term, but allow for a certain tolerance such that the target only has to match the search term only approximately. This method is very useful when searching text that contains different spellings or spelling mistakes, as they often occur, for example, when a text is automatically extracted from a scanned document.
Fuzzy string matching uses one of different distance measures that capture how different two strings are from each other, and a match is returned if the text (or parts of it) are within a certain distance from the search term we are using. The Levenshtein editing distance is a common measure for this. It computes the number of editing operations (inserting, deleting, or substituting one character) that is required to transform one string into the other. If this number is low, the two strings are similar.
In the exercise, we first perform fuzzy string matching in R with the stringdist
package. Familiarize yourself with the Levenshtein editing distance and how it can be computed with the package. What is the editing distance of the two strings inequality
and equal
? You will need the stringdist()
function, but make sure to set the correct distance measure (the Levenshtein distance). Now, use the extract()
function from the package to find fuzzy matches for inequality
in the 49 UN General Debate speeches from the chapter. Start by defining a maximum distance of 2, and then increase this distance step by step. Why is iversality
a match for a maximum distance of 3?