Asynchronous API calls with postlightmercury

In this post I’ll tell you about a new package I’ve built and also take you under the hood to show you a really cool thing that’s going on: asynchronous API calls.

The package: postlightmercury

I created a package called postlightmercury, which is now on CRAN. The package is a wrapper for the Mercury Web Parser by Postlight.

Basically you sign up for free, get an API key, and with that you can send it URLs that it then parses for you. This is pretty clever if you are scraping a lot of different websites and don’t want to write a web parser for each and every one of them.

Here is how the package works

Installation

Since the package is on CRAN, it’s very straightforward to install:

install.packages("postlightmercury")

Load libraries

We’ll need the postlightmercury, dplyr and stringr libraries:

library(postlightmercury)
library(dplyr)
library(stringr)

Get an API key

Before you can use the package you need to get an API key from Postlight. Get yours here: https://mercury.postlight.com/web-parser/. Replace the XXXX’s below with your new API key.
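If you don’t want to hard-code the key in your scripts, you can keep it in an environment variable instead. This is just a convention, not something the package requires – the variable name MERCURY_API_KEY below is purely illustrative:

```r
# Set the key once per session (or put the line in ~/.Renviron so it
# persists across sessions). The variable name is just an example.
Sys.setenv(MERCURY_API_KEY = "XXXXXXXXXXXXXXXXXXXXXXX")

# Read it back wherever you need it:
api_key <- Sys.getenv("MERCURY_API_KEY")
```

That way the key never ends up in a script you might share or commit somewhere public.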

Parse a URL

We will use this extremely sad story from BBC that Gangnam Style is no longer the #1 most viewed video on YouTube. Sad to see a masterpiece like that get dethroned 🙁

# Then run the code below, replacing the X's with your API key:
parsed_url <- web_parser(page_urls = "http://www.bbc.co.uk/news/entertainment-arts-40566816", api_key = "XXXXXXXXXXXXXXXXXXXXXXX")

As you can see below, the result is a tibble (data frame) with 14 different variables:

glimpse(parsed_url)
## Observations: 1
## Variables: 14
## $ title          <chr> "Gangnam Style is no longer the most-played vid...
## $ author         <chr> "Mark Savage BBC Music reporter"
## $ date_published <chr> NA
## $ dek            <chr> NA
## $ lead_image_url <chr> "https://ichef.bbci.co.uk/news/1024/cpsprodpb/9...
## $ content        <chr> "<div><p class=\"byline\"> <span class=\"byline...
## $ next_page_url  <chr> NA
## $ url            <chr> "http://www.bbc.co.uk/news/entertainment-arts-4...
## $ domain         <chr> "www.bbc.co.uk"
## $ excerpt        <chr> "Psy's megahit was the most-played video for fi...
## $ word_count     <int> 685
## $ direction      <chr> "ltr"
## $ total_pages    <int> 1
## $ rendered_pages <int> 1

Parse more than one URL

You can also parse more than one URL. Instead of one, let’s try giving it three URLs: two about Gangnam Style and one about sauerkraut – with all that dancing, proper nutrition is important after all.

urls <- c("http://www.bbc.co.uk/news/entertainment-arts-40566816",
          "http://www.bbc.co.uk/news/world-asia-30288542",
          "https://www.bbcgoodfood.com/howto/guide/health-benefits-sauerkraut")

# Then run the code below, replacing the X's with your API key:
parsed_url <- web_parser(page_urls = urls, api_key = "XXXXXXXXXXXXXXXXXXXXXXX")

Just like before, the result is a tibble (data frame) with 14 different variables – but this time with 3 observations instead of one:

glimpse(parsed_url)
## Observations: 3
## Variables: 14
## $ title          <chr> "Gangnam Style is no longer the most-played vid...
## $ author         <chr> "Mark Savage BBC Music reporter", NA, "Nicola S...
## $ date_published <chr> NA, NA, NA
## $ dek            <chr> NA, NA, NA
## $ lead_image_url <chr> "https://ichef.bbci.co.uk/news/1024/cpsprodpb/9...
## $ content        <chr> "<div><p class=\"byline\"> <span class=\"byline...
## $ next_page_url  <chr> NA, NA, NA
## $ url            <chr> "http://www.bbc.co.uk/news/entertainment-arts-4...
## $ domain         <chr> "www.bbc.co.uk", "www.bbc.co.uk", "www.bbcgoodf...
## $ excerpt        <chr> "Psy's megahit was the most-played video for fi...
## $ word_count     <int> 685, 305, 527
## $ direction      <chr> "ltr", "ltr", "ltr"
## $ total_pages    <int> 1, 1, 1
## $ rendered_pages <int> 1, 1, 1

Clean the content of HTML

The content column retains the HTML of the website:

str_trunc(parsed_url$content[1], 500, "right")
## [1] "<div><p class=\"byline\"> <span class=\"byline__name\">By Mark Savage</span> <span class=\"byline__title\">BBC Music reporter</span> </p><div class=\"story-body__inner\"> <figure class=\"media-landscape has-caption full-width lead\"> <span class=\"image-and-copyright-container\"> <img class=\"js-image-replace\" alt=\"Still image from Gangnam Style\" src=\"https://ichef-1.bbci.co.uk/news/320/cpsprodpb/9C7A/production/_96885004_gangnam.jpg\" width=\"1024\"> <span class=\"off-screen\">Image copyright</span> <span cla..."

We can clean that quite easily:

parsed_url$content <- remove_html(parsed_url$content)

str_trunc(parsed_url$content[1], 500, "right")
## [1] "By Mark Savage BBC Music reporter   Image copyright Schoolboy/Universal Republic Records  Image caption  Gangnam Style had been YouTube's most-watched video for five years  Psy's Gangnam Style is no longer the most-watched video on YouTube.The South Korean megahit had been the site's most-played clip for the last five years.The surreal video became so popular that it \"broke\" YouTube's play counter, exceeding the maximum possible number of views (2,147,483,647), and forcing the company to rewr..."
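If you’re curious what such cleaning boils down to, here is a rough regex-based sketch of the idea. To be clear, this is not the package’s actual remove_html() implementation (and a regex is no substitute for a real HTML parser) – it’s just an illustration:

```r
# Strip HTML tags with a regex - an illustration only, not how the
# package's remove_html() actually works.
strip_html <- function(x) {
  no_tags <- gsub("<[^>]+>", "", x)   # drop anything that looks like a tag
  trimws(gsub("\\s+", " ", no_tags))  # collapse leftover whitespace
}

strip_html("<p>By <b>Mark Savage</b>, BBC Music reporter</p>")
# [1] "By Mark Savage, BBC Music reporter"
```

The package function handles more edge cases than this, but the basic idea is the same: throw away the markup, keep the text.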

And that is basically what the package does! 🙂

Under the hood: asynchronous API calls

Originally I wrote the package using the httr package, which I normally use for my everyday API-calling business.

But after reading on R-bloggers about the crul package and how it can handle asynchronous API calls, I rewrote the web_parser() function to use crul instead.

This means that instead of calling each URL sequentially, it calls them in parallel. That can speed up your analysis significantly when you need to call a lot of URLs.

The web_parser function looks like this under the hood (look for where the magic happens):

web_parser <- function (page_urls, api_key) 
{
  if (missing(page_urls)) 
    stop("One or more urls must be provided")
  if (missing(api_key)) 
    stop("API key must be provided. Get one here: https://mercury.postlight.com/web-parser/")
  
  ### THIS IS WHERE THE MAGIC HAPPENS 
  
  async <- lapply(page_urls, function(page_url) {
    crul::HttpRequest$new(url = "https://mercury.postlight.com/parser", 
      headers = list(`x-api-key` = api_key))$get(query = list(url = page_url))
  })
  res <- crul::AsyncVaried$new(.list = async)
  
  ### END OF MAGIC
  
  output <- res$request()
  api_content <- lapply(output, function(x) x$parse("UTF-8"))
  api_content <- lapply(api_content, jsonlite::fromJSON)
  api_content <- null_to_na(api_content)
  df <- purrr::map_df(api_content, tibble::as_tibble)
  
  return(df)
}

As you can see from the code above, I create a list, async, that holds the three different URL calls. I then add these to the res object. When I call the results from res, it fetches the data in parallel if there is more than one URL. That is pretty smart!

You can use this basic template for your own API calls if you have a function that routinely calls several URLs sequentially.
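As a starting point, here is a minimal, generic sketch of that pattern. The function name fetch_async is made up for illustration, and the snippet assumes you have the crul package installed; since all the requests here share the same setup, it uses crul’s simpler Async class rather than AsyncVaried:

```r
# A generic parallel-GET helper (illustrative sketch, not part of any package).
# Assumes the crul package is installed.
fetch_async <- function(urls) {
  # crul::Async fires all the GET requests in parallel and returns one
  # HttpResponse object per URL, in the same order as the input.
  client <- crul::Async$new(urls = urls)
  responses <- client$get()
  vapply(responses, function(x) x$parse("UTF-8"), character(1))
}

# Usage (needs network access):
# bodies <- fetch_async(c("https://example.com", "https://example.org"))
```

From there you would parse each response body however your API requires – with jsonlite::fromJSON, for instance, as web_parser() does.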

Note: In this case the “surrounding conditions” are all the same. But you can also do asynchronous requests that call different endpoints. Check out the crul package documentation for more on that.