Checking links and link responses with httr and R
This post is Part 2 of a 2-part series about using R to download and examine bookmarks using the httr package and the Pinboard API.
Part 1: Talking to the Pinboard API with R
In the previous post I described how to use the Pinboard API to download your bookmarks with the help of the httr R package.
In this post I’ll talk about some of the munging, aggregation, and exploration I did with the data.
Descriptives
There are 4944 links in the dataset.
The earliest link is dated 2011-12-09 23:47:21 EST.
The last link is dated 2016-06-23 16:14:31 EDT.
Top 10 hosts in my bookmarks, in descending order:
hostname | count | comments |
---|---|---|
twitter.com | 434 | You can tell Pinboard to archive all tweets that you favorite. |
economist.com | 391 | |
leancrew.com | 168 | I’m a big fan of Dr. Drang’s blog |
thedailywtf.com | 76 | I used to read TheDailyWTF a lot. I stopped a long time ago. Should probably delete those bookmarks. |
marco.org | 72 | |
github.com | 70 | |
stackoverflow.com | 68 | |
inessential.com | 66 | I am also a big fan of Brent Simmons’ inessential. |
objc.io | 60 | |
youtube.com | 60 | I pay for Pinboard’s archiving account, so I bookmark YT videos I think might disappear in the future. |
Code
Let’s jump in! I’ll start by loading some libraries, followed by the data.
require(ggplot2)
require(httr) # for getting headers and parsing server responses
require(jsonlite) # convert json to a data frame
require(magrittr) # %>%
require(purrr) # to use map*() functions
require(scales) # to use date_breaks() in scale_x_datetime()
I need to read the .RData file that the code from the last post saved for me, and convert the json to an R dataframe.
# load the pins response object from file
# this was graciously saved there for us by get_pins.R
setwd('~/projects/pins/')
load('pins.RData')
pins <- content(pins_all) %>% fromJSON()
Great, that was easy enough.
Next, I need to do some cleanup. This involves converting two variables (shared and toread) to factors, which I do using %>% as.factor().
I also need to clean up the timestamps Pinboard returns, which are in the UTC/GMT timezone and come in this format: 2016-06-23T20:14:31Z.
I will parse them into POSIXct and convert them to my US/Eastern timezone.
Cleanup
pins$shared <- pins$shared %>% as.factor()
pins$toread <- pins$toread %>% as.factor()
# pinboard exports timestamps in UTC/GMT, and in this format: '2016-06-23T20:14:31Z'
# I want them in POSIX, and in US/Eastern
# convert to POSIX with timezone information as GMT
pins$time <- as.POSIXct(pins$time,
format = '%Y-%m-%dT%H:%M:%SZ',
tz = 'GMT')
# convert to US/Eastern timezone
pins$time <- format(pins$time, tz = 'US/Eastern') %>%
as.POSIXct()
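As a quick sanity check, the last bookmark in the dataset (2016-06-23 16:14:31 EDT) should round-trip cleanly through this conversion. A minimal, self-contained sketch in base R, using that timestamp as Pinboard would return it:

```r
# the last link's timestamp in UTC, as Pinboard would return it
raw <- '2016-06-23T20:14:31Z'

# parse as GMT, then re-express in US/Eastern
t_gmt <- as.POSIXct(raw, format = '%Y-%m-%dT%H:%M:%SZ', tz = 'GMT')
t_est <- format(t_gmt, tz = 'US/Eastern')

t_est  # "2016-06-23 16:14:31" (EDT is UTC-4 in June)
```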
This concludes the cleanup of the data that came out of Pinboard.
Now I need to create some of my own.
Extracting hostnames
I’m interested in seeing how many links I have for each host. I’m also interested in finding out which hosts have the most broken links.
To figure that out, first I need to extract the hostnames out of the urls. I wrote a function get_hostname()
which accepts a url and attempts to use the httr::parse_url()
function to extract the hostname.
If it successfully extracts a hostname, the function will remove “www.” from the beginning of the string if it exists. I do this because some hostnames, like http://leancrew.com, used to be “www.leancrew.com”. I have bookmarked links with both urls, and without this fix, the aggregates and counts will be split between the two hostnames, and therefore, wrong.
I also do this for cleanliness.
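The fix itself is a one-line regular expression. Note the escaped dot: an unescaped '^www.' would also match hostnames like 'www2.example.com'. A quick base-R check of the idea, with made-up example hostnames:

```r
hosts <- c('www.leancrew.com', 'leancrew.com', 'www2.example.com')

# strip only a literal leading 'www.' from each hostname
cleaned <- sub('^www\\.', '', hosts)
cleaned  # "leancrew.com" "leancrew.com" "www2.example.com"
```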
# oooo .
# `888 .o8
# 888 .oo. .oooo.o .o888oo ooo. .oo. ooo. .oo. .oo.
# 888P"Y88b d88( "8 888 `888P"Y88b `888P"Y88bP"Y88b
# 888 888 `"Y88b. 888 888 888 888 888 888
# 888 888 o. )88b 888 . 888 888 888 888 888
# o888o o888o 8""888P' "888" o888o o888o o888o o888o o888o
## hostname
get_hostname <- function(href) {
tryCatch({
parsed_url <- parse_url(href)
if (!parsed_url$hostname %>% is.null()) {
hostname <- parsed_url$hostname %>%
gsub('^www\\.', '', ., perl = TRUE)
return(hostname)
} else {
return('unresolved')
}
}, error = function(e) {
return('unresolved')
})
}
With the function written, now I need to apply it to all the rows in the dataframe.
I could have used one of the lapply family of functions, but I thought I would try the map* family from the purrr package instead, and I have no complaints.
pins$hostname <- map_chr(pins$href, .f = get_hostname)
Counting links by host
Let’s take a look at the top 10 bookmarked sites in my account.
I’ll make a table of the counts of different hosts, convert it into a dataframe, give it column names, and order the dataframe in descending order of hostname frequency.
knitr::kable() will convert the R table into a more readable and blog-friendly Markdown table.
## aggregate hostnames
hostname.freq <- pins$hostname %>%
table() %>%
as.data.frame()
colnames(hostname.freq) <- c('hostname', 'count')
hostname.freq <- hostname.freq[order(-hostname.freq$count),]
rownames(hostname.freq) <- c(1:nrow(hostname.freq))
# knitr::kable makes pretty markdown tables from R output
knitr::kable(hostname.freq[,c('hostname', 'count')] %>%
head(10), row.names = T, align = c('c'))
| hostname | count |
---|---|---|
1 | twitter.com | 434 |
2 | economist.com | 391 |
3 | leancrew.com | 168 |
4 | thedailywtf.com | 76 |
5 | marco.org | 72 |
6 | github.com | 70 |
7 | stackoverflow.com | 68 |
8 | inessential.com | 66 |
9 | objc.io | 60 |
10 | youtube.com | 60 |
Sweet!
Counting links by month
What about number of bookmarks added by month?
pins$month <- strftime(pins$time, format = '%Y-%m')
links.month <- pins$month %>%
table() %>%
as.data.frame()
colnames(links.month) <- c('month', 'count')
links.month$month <- links.month$month %>%
paste0('-01') %>%
as.POSIXct()
ggplot(links.month, aes(x = month, y = count)) +
geom_point() +
geom_line(group = 1) +
theme_bw() +
theme(panel.border = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
plot.title = element_text(vjust = 1.5),
axis.text.x = element_text(angle = 45, hjust = 1)
) +
scale_x_datetime(breaks = date_breaks('2 months'),
labels = date_format('%Y-%m')) +
xlab('') +
ylab('Bookmarks') +
ggtitle('Links added per month')
The scale is ruined by the extreme number in the first month. I imported a bit over 400 bookmarks into Pinboard when I started my account, and that’s what the outlier is about.
Let’s exclude it.
ggplot(links.month[-1, ], aes(x = month, y = count)) +
geom_point() +
geom_line(group = 1) +
theme_bw() +
theme(panel.border = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
plot.title = element_text(vjust = 1.5),
axis.text.x = element_text(angle = 45, hjust = 1)
) +
scale_x_datetime(breaks = date_breaks('2 months'),
labels = date_format('%Y-%m')) +
xlab('') +
ylab('Bookmarks') +
ggtitle('Links added per month')
That’s better.
URL responses
It’s now time for the big show. I want to check the server response for each of the 4944 urls to see which ones are dead or timing out.
To do this, I wrote another function, check_url_response(), which, like get_hostname(), accepts a url.
The function tries to get the url's headers using httr::HEAD(), and then tries to return the $status_code from the response object.
It might fail, for example when the request times out. If that happens, the function returns 0.
For reference, here is a list of all HTTP status codes.
# .ooooo. oooo d8b oooo d8b .ooooo. oooo d8b .oooo.o
# d88' `88b `888""8P `888""8P d88' `88b `888""8P d88( "8
# 888ooo888 888 888 888 888 888 `"Y88b.
# 888 .o 888 888 888 888 888 o. )88b
# `Y8bod8P' d888b d888b `Y8bod8P' d888b 8""888P'
check_url_response <- function(href) {
cat('Checking', href, '...\n')
tryCatch({
check_head <- HEAD(href)
return(check_head$status_code %>% as.integer())
}, error = function(e) {
return(0)
})
}
This is an expensive operation time-wise. I recorded the time it took to check all links in my dataset using system.time():
user system elapsed
641.169 13.808 3893.732
3893 seconds, or about an hour of elapsed time, on my Mac Mini.¹
To avoid accidentally triggering this expensive process every time I run the script, I check whether I've already saved a file titled responses.RData with these response codes. If the file is there, the script loads it and moves on; it only starts checking urls if it can't find the file.²
Then I add the responses as a new variable to the pins dataframe.
if (file.exists('responses.RData')) {
load('responses.RData')
} else {
response_time <- system.time(responses <- map(pins$href, .f = check_url_response))
# save both the responses and the timing so they can be reloaded later
save(responses, response_time, file = 'responses.RData')
}
pins$response <- unlist(responses)
The httr package has another response-related function, http_status(), which translates a response code into a human-readable description.
Let's check it out.
http_status(502)
## $category
## [1] "Server error"
##
## $reason
## [1] "Bad Gateway"
##
## $message
## [1] "Server error: (502) Bad Gateway"
The function returns an object with $category, $reason, and $message.
response.aggs <- pins$response %>%
table() %>%
as.data.frame(stringsAsFactors = F)
colnames(response.aggs) <- c('response', 'count')
response.aggs <- response.aggs[order(response.aggs$count, decreasing = T),]
parse_response <- function(response_code) {
if (response_code == '0' | response_code == '522') { return('Timed out')}
if (response_code == '520') {return('Unknown error: (520)')}
status <- response_code %>%
as.integer() %>%
http_status()
return(status$message)
}
response.aggs$description <- map(response.aggs$response, .f = parse_response) %>%
unlist()
rownames(response.aggs) <- c(1:nrow(response.aggs))
response.aggs %>% knitr::kable(row.names = T)
| response | count | description |
---|---|---|---|
1 | 200 | 4644 | Success: (200) OK |
2 | 404 | 164 | Client error: (404) Not Found |
3 | 0 | 34 | Timed out |
4 | 405 | 28 | Client error: (405) Method Not Allowed |
5 | 401 | 13 | Client error: (401) Unauthorized |
6 | 403 | 13 | Client error: (403) Forbidden |
7 | 429 | 12 | Client error: (429) Too Many Requests (RFC 6585) |
8 | 409 | 6 | Client error: (409) Conflict |
9 | 400 | 5 | Client error: (400) Bad Request |
10 | 500 | 4 | Server error: (500) Internal Server Error |
11 | 406 | 2 | Client error: (406) Not Acceptable |
12 | 503 | 2 | Server error: (503) Service Unavailable |
13 | 520 | 2 | Unknown error: (520) |
14 | 204 | 1 | Success: (204) No Content |
15 | 415 | 1 | Client error: (415) Unsupported Media Type |
16 | 502 | 1 | Server error: (502) Bad Gateway |
17 | 522 | 1 | Timed out |
I expected more links to be dead (404 + the ones that timed out).
I think some urls are getting redirected without me being able to detect it – note the absence of any 3xx (redirection) status codes. httr follows redirects by default, so a redirected link reports the status of its final destination. Also, note that I got 12 “Too Many Requests” errors, which probably would have been 200 (OK) if I hadn't been making so many requests in such a short time.
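One way to make the missing 3xx class easier to spot is to bucket status codes by their leading digit (integer division by 100). This status_class() helper is hypothetical, not part of the original script — just a sketch of the idea:

```r
# classify a numeric HTTP status code; 0 marks our timed-out requests
status_class <- function(code) {
  if (code == 0) return('timeout')
  switch(as.character(code %/% 100),
         '2' = 'success',
         '3' = 'redirection',
         '4' = 'client error',
         '5' = 'server error',
         'other')
}

status_class(200)  # "success"
status_class(301)  # "redirection"
status_class(429)  # "client error"
status_class(0)    # "timeout"
```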
Errors over time
Let’s look into the breakdown of errors over time.
# consider anything other than 200 an error
# an imperfect heuristic, but a satisfactory approximation
pins$error <- ifelse(pins$response == 200, 0, 1)
pin.errors.month <- aggregate(error ~ month, data = pins,
FUN = sum)
pin.errors.month$month <- pin.errors.month$month %>%
paste0('-01') %>%
as.POSIXct()
ggplot(pin.errors.month, aes(x = month, y = error)) +
geom_point(color = 'red') +
geom_line(group = 1, color = 'red') +
theme_bw() +
theme(panel.border = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
plot.title = element_text(vjust = 1.5),
axis.text.x = element_text(angle = 45, hjust = 1)
) +
scale_x_datetime(breaks = date_breaks('2 months'),
labels = date_format('%Y-%m')) +
xlab('') +
ylab('Errors') +
ggtitle('Errors over time')
That’s not a great plot for the same reason the first links per month plot wasn’t a good plot. There’s an extreme value in the first month that ruins the scale.
Let’s exclude that first month.
ggplot(pin.errors.month[-1, ], aes(x = month, y = error)) +
geom_point(color = 'red') +
geom_line(group = 1, color = 'red') +
geom_smooth(se = F) +
theme_bw() +
theme(panel.border = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
plot.title = element_text(vjust = 1.5),
axis.text.x = element_text(angle = 45, hjust = 1)
) +
scale_x_datetime(breaks = date_breaks('2 months'),
labels = date_format('%Y-%m')
) +
xlab('') +
ylab('Errors') +
ggtitle('Errors over time')
That’s a surprise to me. I expected to find the highest number of errors in the oldest links; instead, the highest number of errors comes from links that I added in December of 2014.
As with most things data, this plot might imply something that’s not true: that the number of response errors is at all related to time.
If we check the errors against the link count data, we’ll find that they are highly correlated, which is not a surprise.
cor.test(links.month$count, pin.errors.month$error)
##
## Pearson's product-moment correlation
##
## data: links.month$count and pin.errors.month$error
## t = 13.94, df = 53, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8120584 0.9324242
## sample estimates:
## cor
## 0.8863938
A correlation of .88 is large. Further investigation would help, but it might be that once we partial out the number of links added per month, there is no significant relationship between time and number of broken links.
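To illustrate what "partialling out" would look like, here is a self-contained sketch on simulated data (not the Pinboard data; the variable names are hypothetical stand-ins): two variables driven by a common cause correlate strongly, but after regressing each on the common cause, the correlation of the residuals collapses.

```r
set.seed(42)
n <- 200
volume <- rnorm(n)                      # stand-in for links added per month
errors <- volume + rnorm(n, sd = 0.3)   # errors driven mostly by volume
time   <- volume + rnorm(n, sd = 0.3)   # "time" related to errors only via volume

# raw correlation is inflated by the shared driver
raw <- cor(errors, time)

# partial correlation: correlate the residuals after removing volume
partial <- cor(resid(lm(errors ~ volume)), resid(lm(time ~ volume)))

round(c(raw = raw, partial = partial), 2)
```

With a setup like this, raw comes out large while partial hovers near zero — the pattern I'd expect if monthly link volume explains the apparent errors-over-time trend.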
Let’s look at errors aggregated by host.
Errors by host
hostname.errors <- aggregate(error ~ hostname, data = pins,
FUN = sum)
hostname.errors <- hostname.errors[order(hostname.errors$error),] %>%
tail(20)
hostname.errors$hostname <- hostname.errors$hostname %>%
factor(levels = hostname.errors[, 'hostname'])
ggplot(hostname.errors, aes(x = hostname, y = error)) +
geom_bar(stat = 'identity', fill = '#8FC3Eb') +
coord_flip() +
scale_y_continuous(breaks = seq(0, 30, by = 2)) +
theme_bw() +
theme(panel.border = element_blank()) +
xlab('Hosts') +
ylab('Errors') +
ggtitle('Errors by host')
We can’t conclude a lot from this plot. It should be the beginning of an examination of the data, not the end of it.
For example, https://reddit.com comes fourth with 12 “errors”. But if we dig deeper we find:
pins[, c('hostname', 'response')] %>%
subset(hostname == 'reddit.com' & response != 200) %>%
knitr::kable(row.names = F)
hostname | response |
---|---|
reddit.com | 429 |
reddit.com | 429 |
reddit.com | 429 |
reddit.com | 429 |
reddit.com | 429 |
reddit.com | 429 |
reddit.com | 429 |
reddit.com | 429 |
reddit.com | 429 |
reddit.com | 429 |
reddit.com | 429 |
reddit.com | 429 |
All the errors are Client error: (429) Too Many Requests (RFC 6585). They’re not because reddit screwed up; they’re my fault, for hitting their servers more often than I should.
Full code
require(ggplot2)
require(httr) # for getting headers and parsing server responses
require(jsonlite) # convert json to a data frame
require(magrittr) # %>%
require(purrr) # to use map*() functions
require(scales) # to use date_breaks() in scale_x_datetime()
# load the pins response object from file
# this was graciously saved there for us by get_pins.R
setwd('~/projects/pins/')
load('pins.RData')
pins <- content(pins_all) %>% fromJSON()
pins$shared <- pins$shared %>% as.factor()
pins$toread <- pins$toread %>% as.factor()
# pinboard exports timestamps in UTC/GMT, and in this format: '2016-06-23T20:14:31Z'
# I want them in POSIX, and in US/Eastern
# convert to POSIX with timezone information as GMT
pins$time <- as.POSIXct(pins$time,
format = '%Y-%m-%dT%H:%M:%SZ',
tz = 'GMT')
# convert to US/Eastern timezone
pins$time <- format(pins$time, tz = 'US/Eastern') %>%
as.POSIXct()
# oooo .
# `888 .o8
# 888 .oo. .oooo.o .o888oo ooo. .oo. ooo. .oo. .oo.
# 888P"Y88b d88( "8 888 `888P"Y88b `888P"Y88bP"Y88b
# 888 888 `"Y88b. 888 888 888 888 888 888
# 888 888 o. )88b 888 . 888 888 888 888 888
# o888o o888o 8""888P' "888" o888o o888o o888o o888o o888o
## hostname
get_hostname <- function(href) {
tryCatch({
parsed_url <- parse_url(href)
if (!parsed_url$hostname %>% is.null()) {
hostname <- parsed_url$hostname %>%
gsub('^www\\.', '', ., perl = TRUE)
return(hostname)
} else {
return('unresolved')
}
}, error = function(e) {
return('unresolved')
})
}
## aggregate hostnames
hostname.freq <- pins$hostname %>%
table() %>%
as.data.frame()
colnames(hostname.freq) <- c('hostname', 'count')
hostname.freq <- hostname.freq[order(-hostname.freq$count),]
rownames(hostname.freq) <- c(1:nrow(hostname.freq))
# knitr::kable makes pretty markdown tables from R output
knitr::kable(hostname.freq[,c('hostname', 'count')] %>%
head(10), row.names = T, align = c('c'))
pins$month <- strftime(pins$time, format = '%Y-%m')
links.month <- pins$month %>%
table() %>%
as.data.frame()
colnames(links.month) <- c('month', 'count')
links.month$month <- links.month$month %>%
paste0('-01') %>%
as.POSIXct()
ggplot(links.month, aes(x = month, y = count)) +
geom_point() +
geom_line(group = 1) +
theme_bw() +
theme(panel.border = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
plot.title = element_text(vjust = 1.5),
axis.text.x = element_text(angle = 45, hjust = 1)
) +
scale_x_datetime(breaks = date_breaks('2 months'),
labels = date_format('%Y-%m')) +
xlab('') +
ylab('Bookmarks') +
ggtitle('Links added per month')
ggplot(links.month[-1, ], aes(x = month, y = count)) +
geom_point() +
geom_line(group = 1) +
theme_bw() +
theme(panel.border = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
plot.title = element_text(vjust = 1.5),
axis.text.x = element_text(angle = 45, hjust = 1)
) +
scale_x_datetime(breaks = date_breaks('2 months'),
labels = date_format('%Y-%m')) +
xlab('') +
ylab('Bookmarks') +
ggtitle('Links added per month')
# .ooooo. oooo d8b oooo d8b .ooooo. oooo d8b .oooo.o
# d88' `88b `888""8P `888""8P d88' `88b `888""8P d88( "8
# 888ooo888 888 888 888 888 888 `"Y88b.
# 888 .o 888 888 888 888 888 o. )88b
# `Y8bod8P' d888b d888b `Y8bod8P' d888b 8""888P'
check_url_response <- function(href) {
cat('Checking', href, '...\n')
tryCatch({
check_head <- HEAD(href)
return(check_head$status_code %>% as.integer())
}, error = function(e) {
return(0)
})
}
if (file.exists('responses.RData')) {
load('responses.RData')
} else {
response_time <- system.time(responses <- map(pins$href, .f = check_url_response))
# save both the responses and the timing so they can be reloaded later
save(responses, response_time, file = 'responses.RData')
}
pins$response <- unlist(responses)
http_status(502)
response.aggs <- pins$response %>%
table() %>%
as.data.frame(stringsAsFactors = F)
colnames(response.aggs) <- c('response', 'count')
response.aggs <- response.aggs[order(response.aggs$count, decreasing = T),]
parse_response <- function(response_code) {
if (response_code == '0' | response_code == '522') { return('Timed out')}
if (response_code == '520') {return('Unknown error: (520)')}
status <- response_code %>%
as.integer() %>%
http_status()
return(status$message)
}
response.aggs$description <- map(response.aggs$response, .f = parse_response) %>%
unlist()
rownames(response.aggs) <- c(1:nrow(response.aggs))
response.aggs %>% knitr::kable(row.names = T)
# consider anything other than 200 an error
# an imperfect heuristic, but a satisfactory approximation
pins$error <- ifelse(pins$response == 200, 0, 1)
pin.errors.month <- aggregate(error ~ month, data = pins,
FUN = sum)
pin.errors.month$month <- pin.errors.month$month %>%
paste0('-01') %>%
as.POSIXct()
ggplot(pin.errors.month, aes(x = month, y = error)) +
geom_point(color = 'red') +
geom_line(group = 1, color = 'red') +
theme_bw() +
theme(panel.border = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
plot.title = element_text(vjust = 1.5),
axis.text.x = element_text(angle = 45, hjust = 1)
) +
scale_x_datetime(breaks = date_breaks('2 months'),
labels = date_format('%Y-%m')) +
xlab('') +
ylab('Errors') +
ggtitle('Errors over time')
ggplot(pin.errors.month[-1, ], aes(x = month, y = error)) +
geom_point(color = 'red') +
geom_line(group = 1, color = 'red') +
geom_smooth(se = F) +
theme_bw() +
theme(panel.border = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
plot.title = element_text(vjust = 1.5),
axis.text.x = element_text(angle = 45, hjust = 1)
) +
scale_x_datetime(breaks = date_breaks('2 months'),
labels = date_format('%Y-%m')
) +
xlab('') +
ylab('Errors') +
ggtitle('Errors over time')
cor.test(links.month$count, pin.errors.month$error)
hostname.errors <- aggregate(error ~ hostname, data = pins,
FUN = sum)
hostname.errors <- hostname.errors[order(hostname.errors$error),] %>%
tail(20)
hostname.errors$hostname <- hostname.errors$hostname %>%
factor(levels = hostname.errors[, 'hostname'])
ggplot(hostname.errors, aes(x = hostname, y = error)) +
geom_bar(stat = 'identity', fill = '#8FC3Eb') +
coord_flip() +
scale_y_continuous(breaks = seq(0, 30, by = 2)) +
theme_bw() +
theme(panel.border = element_blank()) +
xlab('Hosts') +
ylab('Errors') +
ggtitle('Errors by host')
pins[, c('hostname', 'response')] %>%
subset(hostname == 'reddit.com' & response != 200) %>%
knitr::kable(row.names = F)
Closing notes
The purpose of this post was learning and exploration rather than definitive investigation. I got some interesting aggregates and plots, but I can’t conclude anything with certainty without digging deeper. With the exception of one correlation test, all I did was data cleaning, aggregation, and plotting. To make any claims about this data, I need to use statistical methods instead of intuition.
It was fun to look around this dataset and learn about the httr and purrr packages in the process.
It was also fun to learn how to write this entire post in R Markdown and make it ready for publishing in Jekyll with minimal touch-ups. That’s a story for another post though.
See also / references
- Converting time zones in R: tips, tricks and pitfalls
- strptime - Date-time Conversion Functions to and from Character
1. Not that the processor or machine should matter much. User and system times are a fraction (~17%) of the total elapsed time. The majority of the waiting is probably on the network. ↩︎
2. There is a problem with this solution: if I download a new list of bookmarks from the Pinboard API, the responses.RData file will be outdated. It will also be outdated in a month or two as more links change or die. When either of those happens, I will manually delete it, causing the script to recheck all links the next time I run it. Avoiding accidentally calling 4944 urls is important enough that I’m willing to skip automating that part of the process for now. ↩︎