I’ve been on a Pinboard API roll lately. In hindsight it’s not surprising since I use Pinboard so much. Today’s post is another one in which I use R and the Pinboard API to fix a wrong in the world.

Problem

Have you ever noticed those Medium post links? Here’s an example:

https://medium.com/@timmywil/sign-your-commits-on-github-with-gpg-566f07762a43#.ncvbvfg3r

See that #.ncvbvfg3r tacked on the end? I noticed it a while ago, and I’m not the only one. That appendage tracks referrals, and I can imagine it allows Medium to build quite the social graph. I don’t like it for two reasons:

  1. Hey buddy? Don’t track me.
  2. It makes it difficult to know if you’ve already bookmarked a post because it’s likely that if you come across the post again, its url is not the same as the one you already saved. When you try to save it to your Pinboard account, it won’t warn you that you already saved it in the past.

You can find a discussion about this on the Pinboard Google Group.

Maciej Cegłowski, creator of Pinboard, was reassuringly himself about the issue:

I think the best thing in this situation is for Medium to die.

Should that happen I will shed few tears. I don’t want Medium to die, but they need to get better. In the meantime, they exist and I have to fix things on my end.

(½) Solution

I wrote a script that downloads all my Pinboard links, and removes that hash appendage before saving them back to my Pinboard account.

This is half a solution because it only solves reason 1, the tracking. Each time I visit or share a sanitized link, a new appendage will be generated, breaking its connection to how I came across the link in the first place.

It doesn’t solve reason 2 – if I had already saved a link to my Pinboard account, and then come across it again and try to save it, having forgotten that I already did so in the past, Pinboard won’t match the urls since the one it has is sanitized. Unless Maciej decides to implement a Medium-specific feature to strip those tracking tokens, there’s not much I can do about that.

First, let’s load some libraries and get our Pinboard links.

library(httr)
library(magrittr)
library(jsonlite)
library(stringr)
library(dplyr)  # for select() and filter() later on
library(purrr)  # for map_chr() later on

# My API token is saved in an environment file
pinsecret <- Sys.getenv('pin_token')

# GET all my links in JSON
pins_all <- GET('https://api.pinboard.in/v1/posts/all',
                query = list(auth_token = pinsecret,
                             format = 'json'))

pins <- pins_all %>% content() %>% fromJSON()

I load my API token from my .Renviron file, use the GET() function from the httr package to send the GET request for all my links in JSON format, and then convert the returned data into a data frame using content() from httr and piping the output to the fromJSON() function from the jsonlite package.
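Before going any further, a quick optional sanity check that the request actually succeeded and that something came back – a minimal sketch using httr’s stop_for_status():

# Error out early if the API returned a 4xx/5xx status
stop_for_status(pins_all)

# How many bookmarks did we get back?
nrow(pins)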

Let’s examine the pins dataframe:

pins %>% 
    select(href, time) %>% 
    head() %>%  
    knitr::kable()

Which gives us:

| href | time |
|------|------|
| https://twitter.com/Samueltadros/status/800208013709688832 | 2016-11-20T14:23:11Z |
| http://gizmodo.com/authorities-just-shut-down-what-cd-the-best-music-torr-1789113647 | 2016-11-19T15:21:06Z |
| http://www.theverge.com/2016/11/17/13669832/what-cd-music-torrent-website-shut-down | 2016-11-19T15:18:33Z |
| http://www.rollingstone.com/music/news/torrent-site-whatcd-shuts-down-destroys-user-data-w451239 | 2016-11-19T15:16:16Z |
| https://twitter.com/whatcd/status/799751019294965760 | 2016-11-18T23:56:23Z |
| https://twitter.com/sheriferson/status/799761561149722624/photo/1 | 2016-11-18T23:49:49Z |

Let me break down that last command:

  • Start with pins dataframe.
  • Pipe that into select(), selecting the “href” and “time” columns.
  • Pipe the output into head(), which keeps the top (latest, in this case) 6 rows.
  • Pipe the output into the kable() function from the knitr package, which converts the dataframe into a Markdown table.

That last part is very handy.

Now that we have all our links, let’s select the Medium ones.

medium <- pins %>%
    filter(str_detect(href, 'medium.com'))

Again, let’s break it down:

  • Store into medium the output of…
  • Piping pins into the filter() function from the dplyr package, which uses str_detect() from the stringr package to search for “medium.com” in the “href” column.

Checking the medium dataframe shows…

| href | time |
|------|------|
| https://medium.com/something-learned/not-imposter-syndrome-621898bdabb2 | 2016-10-25T18:50:36Z |
| https://medium.com/@timmywil/sign-your-commits-on-github-with-gpg-566f07762a43#.ncvbvfg3r | 2016-10-11T06:15:48Z |
| https://medium.com/@ageitgey/machine-learning-is-fun-80ea3ec3c471#.by7z0gq33 | 2016-10-02T01:07:24Z |
| https://medium.com/@schtoeffel/you-don-t-need-more-than-one-cursor-in-vim-2c44117d51db#.nmev5f200 | 2016-09-19T23:35:16Z |
| https://medium.com/@akelleh/a-technical-primer-on-causality-181db2575e41 | 2016-09-07T16:30:57Z |

Now, this looks like it worked, but I’m paranoid. It’s possible that the filtering caught links that have domains that end with “medium.com” but are not Medium links.
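For example – and notmedium.com here is a made-up domain, purely for illustration – a plain substring match will happily flag a host that merely ends in “medium.com”:

# Both of these return TRUE, even though only the second is a Medium link
str_detect('https://notmedium.com/some-post', 'medium.com')
str_detect('https://medium.com/@timmywil/sign-your-commits-on-github-with-gpg-566f07762a43', 'medium.com')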

I want to be more careful, so I’ll use a function that I used before to extract the hostname from links.

get_hostname <- function(href) {
  tryCatch({
    # Parse the url and pull out its hostname, dropping any leading 'www.'
    parsed_url <- parse_url(href)
    if (!parsed_url$hostname %>% is.null()) {
      hostname <- parsed_url$hostname %>% 
        gsub('^www\\.', '', ., perl = T)
      return(hostname)  
    } else {
      return('unresolved')
    }
    
  }, error = function(e) {
    # Anything that can't be parsed gets labeled instead of failing the whole map
    return('unresolved')
  })
}

pins$hostname <- map_chr(pins$href, .f = get_hostname)

medium <- pins %>%
    filter(hostname == 'medium.com')

This is a dataframe of Medium links that I am more confident about.1
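One quick way to check that suspicion is to compare row counts between the two approaches – just a small sketch reusing the objects defined above:

# If the counts match, the simple str_detect() filter didn't catch any impostors
pins %>% filter(str_detect(href, 'medium.com')) %>% nrow()
pins %>% filter(hostname == 'medium.com') %>% nrow()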

Now! Let’s remove that gunk.

medium$cleanhref <- sub("#\\..{9}$", "", medium$href)

That’s all. A quick regex substitution to remove the trailing hash garbage.
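To spot-check the pattern on the example link from the top of the post – keeping in mind that the 9-character token length is just what I’ve seen on my own links, so a looser pattern like "#\\.[a-z0-9]+$" is a reasonable alternative:

# Strips the '#.ncvbvfg3r' appendage and leaves everything else alone
sub("#\\..{9}$", "",
    "https://medium.com/@timmywil/sign-your-commits-on-github-with-gpg-566f07762a43#.ncvbvfg3r")

The before/after pairs for the whole set look like this: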

| Old links | Clean links |
|-----------|-------------|
| https://medium.com/something-learned/not-imposter-syndrome-621898bdabb2 | https://medium.com/something-learned/not-imposter-syndrome-621898bdabb2 |
| https://medium.com/@timmywil/sign-your-commits-on-github-with-gpg-566f07762a43#.ncvbvfg3r | https://medium.com/@timmywil/sign-your-commits-on-github-with-gpg-566f07762a43 |
| https://medium.com/@ageitgey/machine-learning-is-fun-80ea3ec3c471#.by7z0gq33 | https://medium.com/@ageitgey/machine-learning-is-fun-80ea3ec3c471 |
| https://medium.com/@joshuatauberer/civic-techs-act-iii-is-beginning-4df5d1720468 | https://medium.com/@joshuatauberer/civic-techs-act-iii-is-beginning-4df5d1720468 |
| https://medium.com/@schtoeffel/you-don-t-need-more-than-one-cursor-in-vim-2c44117d51db#.nmev5f200 | https://medium.com/@schtoeffel/you-don-t-need-more-than-one-cursor-in-vim-2c44117d51db |
| https://medium.com/@ESAJustinA/ant-to-advance-data-equality-in-america-join-us-were-hiring-developers-and-data-scientists-147f1bfedcb5#.mh8dpuqz9 | https://medium.com/@ESAJustinA/ant-to-advance-data-equality-in-america-join-us-were-hiring-developers-and-data-scientists-147f1bfedcb5 |

Now we need to put this data back into the I N T E R N E T.

As far as I can tell from reading the Pinboard API2, there’s no way to update a bookmark in-place with a new url. The best way to do this is to delete the old bookmarks and add the new ones with the tags, shared status, to-read status, and date-time information of the old ones.

This is the dangerous part, so I want to be as careful as possible. I will store the HTTP response codes for each deletion and addition, and, just so I don’t anger the rate-limiting gods, I will inject a 5-second delay between requests. Five seconds is probably overkill, but this isn’t production code; it’s a personal thing and I don’t mind waiting.

# Columns to hold the HTTP status codes of each request
medium$addition_response <- vector(length = nrow(medium))
medium$deletion_response <- vector(length = nrow(medium))

for (ii in 1:nrow(medium)) {
    # Delete the bookmark with the old, tracked url
    deletion <- GET('https://api.pinboard.in/v1/posts/delete',
                    query = list(auth_token = pinsecret,
                                 url = medium$href[ii]))
    
    medium$deletion_response[ii] <- deletion$status_code
    
    # Re-add it with the clean url, preserving the original metadata
    addition <- GET('https://api.pinboard.in/v1/posts/add',
                    query = list(auth_token = pinsecret,
                                 url = medium$cleanhref[ii],
                                 description = medium$description[ii],
                                 extended = medium$extended[ii],
                                 tags = medium$tags[ii],
                                 dt = medium$time[ii],
                                 shared = medium$shared[ii],
                                 toread = medium$toread[ii]))
    
    medium$addition_response[ii] <- addition$status_code
    
    # Be gentle with the API
    Sys.sleep(5)
}

A quick inspection of the deletion and addition response codes reveals nothing but sweet, sweet 200s. A quick inspection of the Medium links on my Pinboard account reveals clean, shiny, spring-scented urls.
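The response-code inspection can be as simple as tabulating what the loop stored:

# Everything should come back as a 200
table(medium$deletion_response)
table(medium$addition_response)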

The full code is available as a gist here.

  1. The dataframe created using the hostname extraction function has the same number of rows as the one created with a simple grep of “medium.com”, which means it probably wouldn’t have been a problem to stick with the earlier solution. The second solution is still a lot better. ↩︎

  2. … which is a link that must hold the record for the number of times I’ve linked to it from this site. ↩︎