Asynchronous API calls with postlightmercury

In this post I’ll tell you about a new package I’ve built and also take you under the hood to show you a really cool thing that’s going on: asynchronous API calls.

The package: postlightmercury

I created a package called postlightmercury, which is now on CRAN. The package is a wrapper for the Mercury Web Parser by Postlight.

Basically you sign up for free, get an API key, and with that you can send it URLs that it then parses for you. This is actually pretty clever if you are scraping a lot of different websites and you don’t want to write a web parser for each and every one of them.
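For context, the service itself is just a single GET endpoint: you pass the URL you want parsed as a query parameter and your API key in an x-api-key header (the same endpoint and header the package uses internally, as shown further down). A rough sketch of the raw call with httr, with a placeholder key:

library(httr)

resp <- GET(
  "https://mercury.postlight.com/parser",
  query = list(url = "http://www.bbc.co.uk/news/entertainment-arts-40566816"),
  add_headers(`x-api-key` = "XXXXXXXXXXXXXXXXXXXXXXX")  # your Mercury API key
)

content(resp)  # parsed JSON as an R list

The package wraps this call for you, handles multiple URLs and returns everything as a tidy tibble.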

Here is how the package works

Installation

Since the package is on CRAN, it’s very straightforward to install:

install.packages("postlightmercury")

Load libraries

We’ll need the postlightmercury, dplyr and stringr libraries:

library(postlightmercury)
library(dplyr)
library(stringr)

Get an API key

Before you can use the package you need to get an API key from Postlight. Get yours here: https://mercury.postlight.com/web-parser/. Replace the XXXX’s below with your new API key.
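A small optional tip (not required by the package): instead of hard-coding the key in your scripts, you can put a line like MERCURY_API_KEY=yourkey in your ~/.Renviron file, restart R and read it with Sys.getenv(). The variable name here is just a suggestion:

# Read the key from an environment variable instead of pasting it into the script
api_key <- Sys.getenv("MERCURY_API_KEY")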

Parse a URL

We will use this extremely sad story from the BBC about Gangnam Style no longer being the most viewed video on YouTube. Sad to see a masterpiece like that get dethroned 🙁

# Then run the code below, replacing the X's with your API key:
parsed_url <- web_parser(page_urls = "http://www.bbc.co.uk/news/entertainment-arts-40566816", api_key = "XXXXXXXXXXXXXXXXXXXXXXX")

As you can see below the result is a tibble (data frame) with 14 different variables:

glimpse(parsed_url)
## Observations: 1
## Variables: 14
## $ title          <chr> "Gangnam Style is no longer the most-played vid...
## $ author         <chr> "Mark Savage BBC Music reporter"
## $ date_published <chr> NA
## $ dek            <chr> NA
## $ lead_image_url <chr> "https://ichef.bbci.co.uk/news/1024/cpsprodpb/9...
## $ content        <chr> "<div><p class=\"byline\"> <span class=\"byline...
## $ next_page_url  <chr> NA
## $ url            <chr> "http://www.bbc.co.uk/news/entertainment-arts-4...
## $ domain         <chr> "www.bbc.co.uk"
## $ excerpt        <chr> "Psy's megahit was the most-played video for fi...
## $ word_count     <int> 685
## $ direction      <chr> "ltr"
## $ total_pages    <int> 1
## $ rendered_pages <int> 1

Parse more than one URL

You can also parse more than one URL. Instead of one, let’s try giving it three URLs: two about Gangnam Style and one about sauerkraut – with all that dancing, proper nutrition is important after all.

urls <- c("http://www.bbc.co.uk/news/entertainment-arts-40566816",
          "http://www.bbc.co.uk/news/world-asia-30288542",
          "https://www.bbcgoodfood.com/howto/guide/health-benefits-sauerkraut")

# Then run the code below, replacing the X's with your API key:
parsed_url <- web_parser(page_urls = urls, api_key = "XXXXXXXXXXXXXXXXXXXXXXX")

Just like before the result is a tibble (data frame) with 14 different variables – but this time with 3 observations instead of one:

glimpse(parsed_url)
## Observations: 3
## Variables: 14
## $ title          <chr> "Gangnam Style is no longer the most-played vid...
## $ author         <chr> "Mark Savage BBC Music reporter", NA, "Nicola S...
## $ date_published <chr> NA, NA, NA
## $ dek            <chr> NA, NA, NA
## $ lead_image_url <chr> "https://ichef.bbci.co.uk/news/1024/cpsprodpb/9...
## $ content        <chr> "<div><p class=\"byline\"> <span class=\"byline...
## $ next_page_url  <chr> NA, NA, NA
## $ url            <chr> "http://www.bbc.co.uk/news/entertainment-arts-4...
## $ domain         <chr> "www.bbc.co.uk", "www.bbc.co.uk", "www.bbcgoodf...
## $ excerpt        <chr> "Psy's megahit was the most-played video for fi...
## $ word_count     <int> 685, 305, 527
## $ direction      <chr> "ltr", "ltr", "ltr"
## $ total_pages    <int> 1, 1, 1
## $ rendered_pages <int> 1, 1, 1

Clean the HTML from the content

The content column keeps the HTML of the website:

str_trunc(parsed_url$content[1], 500, "right")
## [1] "<div><p class=\"byline\"> <span class=\"byline__name\">By Mark Savage</span> <span class=\"byline__title\">BBC Music reporter</span> </p><div class=\"story-body__inner\"> <figure class=\"media-landscape has-caption full-width lead\"> <span class=\"image-and-copyright-container\"> <img class=\"js-image-replace\" alt=\"Still image from Gangnam Style\" src=\"https://ichef-1.bbci.co.uk/news/320/cpsprodpb/9C7A/production/_96885004_gangnam.jpg\" width=\"1024\"> <span class=\"off-screen\">Image copyright</span> <span cla..."

We can clean that quite easily:

parsed_url$content <- remove_html(parsed_url$content)

str_trunc(parsed_url$content[1], 500, "right")
## [1] "By Mark Savage BBC Music reporter   Image copyright Schoolboy/Universal Republic Records  Image caption  Gangnam Style had been YouTube's most-watched video for five years  Psy's Gangnam Style is no longer the most-watched video on YouTube.The South Korean megahit had been the site's most-played clip for the last five years.The surreal video became so popular that it \"broke\" YouTube's play counter, exceeding the maximum possible number of views (2,147,483,647), and forcing the company to rewr..."

And that is basically what the package does! 🙂

Under the hood: asynchronous API calls

Originally I wrote the package using the httr package, which I normally use for my everyday API-calling business.

But after reading on R-bloggers about the crul package and how it can handle asynchronous API calls, I rewrote the web_parser() function so it uses crul instead.

This means that instead of calling each URL sequentially it calls them in parallel. This makes a major difference if you want to call a lot of URLs and can speed up your analysis significantly.
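To get a feel for the difference, here is a small benchmark sketch. It uses httpbin.org as a stand-in endpoint (not the Mercury API) and compares sequential requests made with crul’s HttpClient against the same requests fired asynchronously with crul’s Async client; actual timings will depend on your connection and the number of URLs:

library(crul)

urls <- rep("https://httpbin.org/get", 5)

# Sequential: one request at a time
t_seq <- system.time(
  lapply(urls, function(u) HttpClient$new(url = u)$get())
)

# Asynchronous: all requests fired at once
t_async <- system.time(
  res <- Async$new(urls = urls)$get()
)

t_seq["elapsed"]
t_async["elapsed"]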

The web_parser function looks like this under the hood (look for where the magic happens):

web_parser <- function (page_urls, api_key) 
{
  if (missing(page_urls)) 
    stop("One or more urls must be provided")
  if (missing(api_key)) 
    stop("API key must be provided. Get one here: https://mercury.postlight.com/web-parser/")
  
  ### THIS IS WHERE THE MAGIC HAPPENS 
  
  async <- lapply(page_urls, function(page_url) {
    crul::HttpRequest$new(url = "https://mercury.postlight.com/parser", 
      headers = list(`x-api-key` = api_key))$get(query = list(url = page_url))
  })
  res <- crul::AsyncVaried$new(.list = async)
  
  ### END OF MAGIC
  
  output <- res$request()
  api_content <- lapply(output, function(x) x$parse("UTF-8"))
  api_content <- lapply(api_content, jsonlite::fromJSON)
  api_content <- null_to_na(api_content)
  df <- purrr::map_df(api_content, tibble::as_tibble)
  
  return(df)
}

As you can see from the code above, I create a list, async, that holds the different URL calls. I then add these to the res object. When I request the results from res, it fetches the data in parallel if there is more than one URL. That is pretty smart!

You can use this basic template for your own API calls if you have a function that routinely calls several URLs sequentially.

Note: In this case the “surrounding conditions” are all the same. But you can also do asynchronous requests that call different endpoints, as sketched below. Check out the crul package documentation for more on that.
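If your calls differ in endpoint, verb or parameters, the same AsyncVaried pattern still applies: build one request object per call and hand the whole list over. A minimal sketch, again using httpbin.org as a placeholder API:

library(crul)

# One request object per call - endpoints, verbs and query/body can differ
reqs <- list(
  HttpRequest$new(url = "https://httpbin.org/get")$get(query = list(a = 1)),
  HttpRequest$new(url = "https://httpbin.org/post")$post(body = list(b = "2")),
  HttpRequest$new(url = "https://httpbin.org/ip")$get()
)

res <- AsyncVaried$new(.list = reqs)
res$request()                  # fire all requests in parallel
res$status_code()              # check that everything came back OK
bodies <- res$parse("UTF-8")   # response bodies as character strings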

Set up an encrypted Rstudio and Shiny dashboard solution in 3 minutes

I have created a repository (https://github.com/56north/encrypted_dashboard) with code that automatically sets up Rstudio Server and Shiny Server behind an Nginx proxy with SSL certificates. The two servers share apps and packages so you can build cool stuff in Rstudio and deploy it immediately to Shiny. Woow!!! 😀

Below is the readme from the repo. I do need help fixing some issues, but I am quite certain you guys can help with that. If we collaborate on this, we will all have a pretty awesome way to quickly set up an encrypted and easy-to-use dashboard environment.

Docker + Nginx + Let’s Encrypt + Rstudio + Shiny


NEED TO HAVE FIXED

SSL for all

It seems that only one of the containers acquires an SSL certificate. Maybe someone with more Nginx skills than me can have a look.

Upload limit

There seems to be an upload limit when we’re running behind Nginx. This can be fixed in the config file. If anyone knows how, please do. Here is a link I found regarding the issue: https://support.rstudio.com/hc/en-us/community/posts/200769376–Unexpected-response-from-server-error-with-file-upload

NICE TO HAVE FIXED

Multiple users in Rstudio Server

It would be nice to be able to create multiple users in Rstudio when firing up the Docker container. Maybe it could be based on some of the information here: https://itsalocke.com/r-training-environment/


This code originated at https://github.com/gilyes/docker-nginx-letsencrypt-sample.

This simple example shows how to set up Rstudio Server and Shiny Server running behind a dockerized Nginx reverse proxy and served via HTTPS using free Let’s Encrypt certificates. New sites can be added on the fly by just modifying docker-compose.yml and then running docker-compose up as the main Nginx config is automatically updated and certificates (if needed) are automatically acquired.

Some of the configuration from the original repo is derived from https://github.com/fatk/docker-letsencrypt-nginx-proxy-companion-examples with some simplifications and updates to work with current nginx.tmpl from nginx-proxy and docker-compose v2 files.

Running the example

Prerequisites

  • docker (>= 1.10)
  • docker-compose (>= 1.8.1)
  • access to (sub)domain(s) pointing to a publicly accessible server (required for TLS)

Preparation

  • Clone the repository on the server pointed to by your domain.
  • In docker-compose.yml:
    • Change the VIRTUAL_HOST and LETSENCRYPT_HOST entries from rstudio.mydomain.com and shiny.mydomain.com to your domains.
    • Change LETSENCRYPT_EMAIL entries to the email address you want to be associated with the certificates.
    • Change USER and PASSWORD entries to the user and password you want for Rstudio.

Running

In the main directory run:

docker-compose up

This will perform the following steps:

  • Download the required images from Docker Hub (nginx, docker-gen, docker-letsencrypt-nginx-proxy-companion).
  • Create containers from them.
  • Build and create containers for Rstudio Server and Shiny Server.
  • Start up the containers.
    • docker-letsencrypt-nginx-proxy-companion inspects containers’ metadata and tries to acquire certificates as needed (if successful then saving them in a volume shared with the host and the Nginx container).
    • docker-gen also inspects containers’ metadata and generates the configuration file for the main Nginx reverse proxy.

If everything went well, you should now be able to access Rstudio and Shiny at the given addresses.

Troubleshooting

  • To view logs run docker-compose logs.
  • To view the generated Nginx configuration run docker exec -ti nginx cat /etc/nginx/conf.d/default.conf

How does it work

The system consists of 4 main parts:

  • Main Nginx reverse proxy container.
  • Container that generates the main Nginx config based on container metadata.
  • Container that automatically handles the acquisition and renewal of Let’s Encrypt TLS certificates.
  • The actual servers living in their own containers. In this example Rstudio and Shiny.

The main Nginx reverse proxy container

This is the only publicly exposed container; it routes traffic to the backend servers and provides TLS termination.

Uses the official nginx Docker image.

It is defined in docker-compose.yml under the nginx service block:

services:
  nginx:
    restart: always
    image: nginx
    container_name: nginx
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - "/etc/nginx/conf.d"
      - "/etc/nginx/vhost.d"
      - "/usr/share/nginx/html"
      - "./volumes/proxy/certs:/etc/nginx/certs:ro"

As you can see it shares a few volumes:

  • Configuration folder: used by the container that generates the configuration file.
  • Default Nginx root folder: used by the Let’s Encrypt container for challenges from the CA.
  • Certificates folder: written to by the Let’s Encrypt container, this is where the TLS certificates are maintained.

The configuration generator container

This container inspects the other running containers and, based on their metadata (like the VIRTUAL_HOST environment variable) and a template file, generates the Nginx configuration file for the main Nginx container. When a new container spins up, this container detects it, generates the appropriate configuration entries and restarts Nginx.

Uses the jwilder/docker-gen Docker image.

It is defined in docker-compose.yml under the nginx-gen service block:

services:
  ...

  nginx-gen:
    restart: always
    image: jwilder/docker-gen
    container_name: nginx-gen
    volumes:
      - "/var/run/docker.sock:/tmp/docker.sock:ro"
      - "./volumes/proxy/templates/nginx.tmpl:/etc/docker-gen/templates/nginx.tmpl:ro"
    volumes_from:
      - nginx
    entrypoint: /usr/local/bin/docker-gen -notify-sighup nginx -watch -wait 5s:30s /etc/docker-gen/templates/nginx.tmpl /etc/nginx/conf.d/default.conf

The container reads the nginx.tmpl template file (source: jwilder/nginx-proxy) via a volume shared with the host.

It also mounts the Docker socket into the container in order to be able to inspect the other containers (the "/var/run/docker.sock:/tmp/docker.sock:ro" line). Security warning: mounting the Docker socket is usually discouraged because the container getting (even read-only) access to it can get root access to the host. In our case, this container is not exposed to the world so if you trust the code running inside it the risks are probably fairly low. But definitely something to take into account. See e.g. The Dangers of Docker.sock for further details.

NOTE: it would be preferable to have docker-gen only handle containers with exposed ports (via the -only-exposed flag in the entrypoint script above), but currently that does not work, see e.g. https://github.com/jwilder/nginx-proxy/issues/438.

The Let’s Encrypt container

This container also inspects the other containers and acquires Let’s Encrypt TLS certificates based on the LETSENCRYPT_HOST and LETSENCRYPT_EMAIL environment variables. At regular intervals it checks and renews certificates as needed.

Uses the jrcs/letsencrypt-nginx-proxy-companion Docker image.

It is defined in docker-compose.yml under the letsencrypt-nginx-proxy-companion service block:

services:
  ...

  letsencrypt-nginx-proxy-companion:
    restart: always
    image: jrcs/letsencrypt-nginx-proxy-companion
    container_name: letsencrypt-nginx-proxy-companion
    volumes_from:
      - nginx
    volumes:
      - "/var/run/docker.sock:/var/run/docker.sock:ro"
      - "./volumes/proxy/certs:/etc/nginx/certs:rw"
    environment:
      - NGINX_DOCKER_GEN_CONTAINER=nginx-gen

The container uses a volume shared with the host and the Nginx container to maintain the certificates.

It also mounts the Docker socket in order to inspect the other containers. See the security warning above in the docker-gen section about the risks of that.

The Rstudio Server and Shiny Server

These two servers are running in their own respective containers. They are defined in docker-compose.yml under the tidyverse and shiny service blocks:

services:
  ...

  tidyverse:
    restart: always
    image: rocker/tidyverse
    container_name: rstudio
    expose:
      - "8787"
    environment:
      - VIRTUAL_HOST=rstudio.mydomain.com
      - VIRTUAL_NETWORK=nginx-proxy
      - VIRTUAL_PORT=80
      - LETSENCRYPT_HOST=rstudio.mydomain.com
      - LETSENCRYPT_EMAIL=me@myemail.com
      - USER=test
      - PASSWORD=test
    volumes:
      - shiny-apps:/home/mikkel/apps
      - r-packages:/usr/local/lib/R/site-library

  shiny:
    restart: always
    image: rocker/shiny
    container_name: shiny
    expose:
      - "3838"
    environment:
      - VIRTUAL_HOST=shiny.mydomain.com
      - VIRTUAL_NETWORK=nginx-proxy
      - VIRTUAL_PORT=80
      - LETSENCRYPT_HOST=shiny.mydomain.com
      - LETSENCRYPT_EMAIL=me@myemail.com
    volumes:
      - shiny-apps:/srv/shiny-server/
      - ./volumes/shiny/logs:/var/log/
      - r-packages:/usr/local/lib/R/site-library

The important parts here are the environment variables and the volumes. The environment variables are used by the config generator and certificate maintainer containers to set up the system.

The data volumes

The volumes are used to ensure that Rstudio and Shiny share apps and packages in order for you to build apps in Rstudio and have them deployed on the Shiny server without too big a hassle.

Conclusion

This can be a fairly simple way to have easy, reproducible deploys for a secure R-based dashboard solution with auto-renewing TLS certificates.


Download product information and reviews from Amazon.com

Rmazon

The goal of Rmazon is to help you download product information and reviews from Amazon.com easily.

Installation

You can install Rmazon from GitHub with:

# install.packages("devtools")
devtools::install_github("56north/Rmazon")

Example – product information

This is a basic example which shows you how to get product information:

# Get product information for 'The Art of R Programming: A Tour of Statistical Software Design'

product_info <- Rmazon::get_product_info("1593273843")

Example – product reviews

This is a basic example which shows you how to get reviews:

# Get reviews for 'The Art of R Programming: A Tour of Statistical Software Design'

reviews <- Rmazon::get_reviews("1593273843")

Building a package automatically

So, I just finished building an R wrapper for FOAAS (F*ck Off As A Service). FOAAS provides a modern, RESTful, scalable solution to the common problem of telling people to f*ck off.

I wanted to share the package with you and also tell you how I automated the build. You can find the package at https://github.com/56north/foaas

I thought the API was quite a fun idea and I wanted to make a wrapper. I was curious to see if I could automate the build. The API has 72 different f*ck off calls, so I built a function to handle the heavy lifting. The reason for this is essentially what Hadley writes in “R for Data Science“:

One of the best ways to improve your reach as a data scientist is to write functions. Functions allow you to automate common tasks in a more powerful and general way than copy-and-pasting. Writing a function has three big advantages over using copy-and-paste:

You can give a function an evocative name that makes your code easier to understand.

As requirements change, you only need to update code in one place, instead of many.

You eliminate the chance of making incidental mistakes when you copy and paste (i.e. updating a variable name in one place, but not in another).

I built a function that got all the different calls from the API. Then I looped through each one and pasted together a string that would build a function for each call. I included some roxygen info and then wrote them all to a functions.R file.

Once the procedure was clear and the code was written it took < 2 seconds to write a package with 72 functions. Not bad! 🙂

Here is the code:

# Get possible functions
url <- "https://www.foaas.com/operations"

foaas_calls <- httr::content(httr::GET(url))

# Write functions for each
functions <- lapply(foaas_calls, function(x){
  name <- unlist(stringr::str_split(x$url, "/"))[2]

  description <- x$name

  parms <- unlist(lapply(x$fields, function(y){ y$field }))

  func_base <- paste0("paste0(\"https://www.foaas.com/", name, "/\",")
  func_end <- paste(parms, collapse = ", \"/\", ")

  pck_name <- make.names(name)

  func <- paste0("foaas_", pck_name, " <- function(", paste(parms, collapse = ", "), "){\n",
                 "url <- ", paste(func_base, func_end, ")"), "\n",
                 "return(jsonlite::fromJSON(url))\n",
                 "}")

  func_ex <- paste0("#' ", description, "\n",
                    "#' @export \n\n",
                    func, "\n\n##############################\n\n")

  return(func_ex)
})

# Write functions to package
readr::write_lines(functions[[1]], "R/fucking_functions.R")

lapply(functions[2:72], function(x){

  readr::write_lines(x, "R/fucking_functions.R", append = TRUE)

})
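To make the output concrete, here is roughly what one of the 72 generated functions ends up looking like in R/fucking_functions.R. This is an illustration based on the template above, assuming an operation with the fields name and from; the exact text depends on what the API returns for each operation:

#' Off
#' @export

foaas_off <- function(name, from){
  url <- paste0("https://www.foaas.com/off/", name, "/", from)
  return(jsonlite::fromJSON(url))
}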


Calculate your nutrients with my new package: NutrientData

I have created a new package: NutrientData

This package contains data sets with the Composition of Foods: Raw, Processed, Prepared. The source of the data is the USDA National Nutrient Database for Standard Reference, Release 28 (2015), along with two functions to search and calculate nutrients.

You can download it from GitHub:
devtools::install_github("56north/NutrientData")

Let’s first have a look at the top 20 most calorie-dense foods:

library(NutrientData)
library(dplyr)

data("ABBREV") # Load the data

ABBREV %>% # Select the data
  arrange(-Energ_Kcal) %>% # Sort by calories per 100 g
  select(Food = Shrt_Desc, Calories = Energ_Kcal) %>% # Select relevant columns
  slice(1:20) # Choose the top 20

If you want to search for a specific ingredient you use the “search_ingredient” function. Let’s search for raw onions:

search_ingredient("onion,raw")

You can also calculate the nutrient composition of several foods, like a simple yet delicious cabbage salad:

ingredients <- c("CABBAGE,RAW", "MAYONNAISE,RED FAT,W/ OLIVE OIL", "ONIONS,RAW")
grams <- c(100, 20, 10)

calculate_nutrients(ingredients, grams) %>%
  select(Food = 1, Calories = 3, Protein = 4,
         Fat = 5, Carbs = 7) # Select only a few variables for looks and rename

Dinner is served. I look forward to your feedback! And if anyone is up for it, this is a package that is just begging for cool visualizations of nutrient composition along with a Shiny overlay!
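As a tiny nudge in that direction, here is a sketch of a macro-nutrient chart for the salad above. It reuses the column positions from the pipe above, so it assumes calculate_nutrients() returns its columns in that order:

library(ggplot2)
library(tidyr)

salad <- calculate_nutrients(ingredients, grams) %>%
  select(Food = 1, Calories = 3, Protein = 4, Fat = 5, Carbs = 7)

salad %>%
  pivot_longer(c(Protein, Fat, Carbs), names_to = "Macro", values_to = "Grams") %>%
  ggplot(aes(Food, Grams, fill = Macro)) +
  geom_col(position = "dodge") +
  coord_flip()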

Map Danish administrative areas with leaflet

What is this?

a center for ants!?

leafletDK is a package that makes it easy to create interactive, dynamic leaflet maps based on Danish administrative areas.

The package is heavily inspired by the amazing mapDK package by Sebastian Barfort, which I recommend if you want to create static high quality maps.

Getting started

First you need to install the package from GitHub. You do this by running:

devtools::install_github("56north/leafletDK") # install devtools if needed

After installation it is really easy to use leafletDK. Simply call the administrative area that you want to map (like a municipality) and give it the data you want the map to be colored by.

Below is an example where we load data from Statistics Denmark and map it using leaflet.

First we load the package and get the most recent population count for Denmark via the API from Statistics Denmark.

library(leafletDK)

folk1 <- read.csv2("http://api.statbank.dk/v1/data/folk1/CSV?OMR%C3%85DE=*",
                   stringsAsFactors = F)

Now we have a data frame with three columns. “OMRÅDE” is the area, “TID” is time/date and “INDHOLD” is the people count. We will use the “OMRÅDE” and “INDHOLD” columns to call the municipalityDK function.

municipalityDK("INDHOLD", "OMRÅDE", data = folk1)

By default leafletDK plots the areas without an underlying map. You can turn it on by supplying the parameter map = T. You can also turn on the legend with legend = T.

municipalityDK("INDHOLD", "OMRÅDE", data = folk1, map = T, legend = T)

This generates a map of Denmark where the 98 municipalities are colored according to the number of people that live in each one. It becomes immediately apparent that a lot of people are living in Copenhagen municipality… a lot of people!

We can also zoom the map in on just a few municipalities by selecting them with the subplot parameter. Let’s take a look at Copenhagen (København), Frederiksberg and Hvidovre municipalities:

municipalityDK("INDHOLD", "OMRÅDE", 
                subplot = c("københavn", "frederiksberg", "hvidovre"), 
                data = folk1)

This generates a map with only our three chosen municipalities. If you click on one of the areas, a little popup appears with the mapped values.

Getting ids

If you are in doubt about which IDs are being used to generate the maps, you can use the getIDs function to see a list. To get a list of the municipalities, we do the following:

getIDs("municipal")

The getIDs function accepts the following areas: “constituency”, “district”, “municipal”, “parish”, “regional”, “rural” or “zip”.

Changing the underlying map

You can change the underlying map by using addProviderTiles() from the leaflet package. You can pipe the mapped areas directly to the function like this:

municipalityDK("INDHOLD", "OMRÅDE", 
                subplot = c("københavn", "frederiksberg", "hvidovre"), 
                data = folk1) %>% 
                addProviderTiles("Stamen.Toner")

You can get a full overview of the available maps (called tiles) on the leaflet-providers preview page.

Classify gender based on Danish first names

In Denmark we have official lists of what people are allowed to have as first names. That means there are lists of government-approved boys’ names, girls’ names and unisex names. There are a total of 18,529 approved girls’ names, 15,052 boys’ names and 813 unisex names.

This means that we can write an R package that can classify a name as either male, female, unisex or indeterminable. And I did just that. Allow me to introduce the “namesDK” package. It is available from GitHub by running devtools::install_github("56north/namesDK").

After that you feed it a string of names. It uses the first name to classify the gender, so if you provide a full name (i.e. Lars Løkke Rasmussen) it will split the string and choose the first name (i.e. Lars).

You can use the package if you have a lot of names that you would like demographic variables, such as gender, attached to. It could be names mined from social media, a customer list, etc.

In order to do this you simply call the “gender” function from the package. Here is a brief example of how it works:

library(namesDK)

gender("Lars Løkke Rasmussen")
#> [[1]]
#> [1] "male"

gender(c("Helle Thorning Smidt", "Lars Løkke Rasmussen", "Traktor Troels"))
#> [[1]]
#> [1] "female"
#>
#> [[2]]
#> [1] "male"
#>
#> [[3]]
#> [1] NA

As you can see, the last string in the call above has “Traktor” as its first name (the machine used in agriculture) and therefore returns NA, since Traktor is not an approved Danish first name.
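If your names live in a data frame – a customer list, say – a small sketch of attaching the classification as a new column could look like this. The customer table is made up for the example, and since gender() returns a list it is unlisted first:

library(dplyr)
library(namesDK)

# Made-up customer list - any character vector of names will do
customers <- tibble::tibble(
  name = c("Helle Thorning Smidt", "Lars Løkke Rasmussen", "Traktor Troels")
)

customers %>%
  mutate(gender = unlist(gender(name)))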

There you go. Sweet and simple. Enjoy.

If your country has the same sort of rules, maybe we should create a package that can classify gender based on first names across multiple languages. Let me know if you are interested 🙂

Are you happy or sad?

UPDATED: This is the first time I tried to include a dataset in a package and it didn’t go well 🙂

If you have problems then that is probably why. Sorry for the inconvenience. Will update everything as soon as possible.

—–
Well, if you wrote about it we might be able to figure it out with my new package: happyorsad

Happyorsad is a sentiment scorer. It uses the approach of Finn Årup Nielsen from Informatics and Mathematical Modelling at the Technical University of Denmark and the AFINN lists hosted in his GitHub repo.

Finn Årup Nielsen has constructed three lists that make it possible to sentiment score in English, in Danish and using emoticons. There are already a few sentiment packages for English, but this is the first one for Danish and, as far as I know, also the first one to sentiment score emoticons.

A big shout out to Finn for the lists.

If you want to try it out then just run the code below. Looking forward to your feedback!

if(!require("devtools")) install.packages("devtools")
devtools::install_github("56north/happyorsad")

# Examples of sentiment scoring

library(happyorsad)

# Score Danish words
string_da <- "Hvis ikke det er det mest afskyelige elendige flueknepperi…"
happyorsad(string_da, "da")

# Score English words
string_en <- "This is utterly excellent!"
happyorsad(string_en, "en")

# Score emoticons
string_emoticon <- "I saw that yesterday :)"
happyorsad(string_emoticon, "emoticon")
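If you have a whole vector of texts, one simple sketch is to apply the scorer to each element with sapply() – this assumes happyorsad() scores one string at a time, as in the examples above:

# Score several texts at once
texts <- c("This is utterly excellent!",
           "This is terrible and sad")

sapply(texts, happyorsad, "en")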


Create your own hexamaps

Hexamaps are gaining in popularity. Most notable have been the versions where the map of the USA is made into a hexamap, but people have also made maps of Europe using hexagons.

The idea is that one unit is one hexagon. So in the case of the US, each state is one hexagon. In the case of Europe, each country is a hexagon.

This means that all units (states, countries, etc.) are the same size. This of course skews the hexamap in relation to the real geographic proportions. But it has the advantage of giving all units equal space for displaying information – for instance a shade or color depending on some underlying value.

I have made a hexamap of the municipalities in Denmark. The capital region is very dense so I had to sort of map that on the side. You can see my efforts here:

[Image: hexamap of the Danish municipalities]

To ease the process I’ve made the hexamapmaker package. It takes a set of points and turns them into hexagons. That means that you can quickly and easily design and produce hexamaps.

Below I’ve included the example code from the package if you want to get started yourself. If you create a map of your own, please share it with me on Twitter (@mikkelkrogsholm). I’d love to see your work!
# Install hexamapmaker
devtools::install_github("56north/hexamapmaker")
library(hexamapmaker)

# Create data frame
# Notice the spacing of the points

x <- c(1, 3, 2, 4, 1, 3, 7, 8)
y <- c(1, 1, 3, 3, 5, 5, 1, 3)
id <- c("test1", "test2", "test3", "test4", "test5", "test6", "test7", "test8")
z <- data.frame(id, x, y)

# Plot points

library(ggplot2)
ggplot(z, aes(x, y, group = id)) +
  geom_point() +
  coord_fixed(ratio = 1) +
  ylim(0, max(y)) + xlim(0, max(x))

# Turn points into hexagons

library(hexamapmaker)

zz <- hexamap(z)

ggplot(zz, aes(x, y, group = id)) +
  geom_polygon(colour = "black", fill = NA) +
  coord_fixed(ratio = 1)
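The point above about shading each unit by an underlying value translates directly into ggplot2’s fill aesthetic. A sketch, using random numbers as a stand-in for real data:

# Attach a (made-up) value to each hexagon and shade by it
vals <- data.frame(id = unique(z$id), value = runif(length(unique(z$id)), 0, 100))
zz_vals <- merge(zz, vals, by = "id")

ggplot(zz_vals, aes(x, y, group = id, fill = value)) +
  geom_polygon(colour = "black") +
  coord_fixed(ratio = 1)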

Open Source Data Science at SXSW

I have pitched a panel for next year’s SXSW in Austin, Texas, along with Karthik Ram from rOpenSci. The panel is called “Open Source Data Science”.

SXSW has a panel picker process where you submit ideas and then other people can vote for them. The popular ones get selected. So I need your help to make this one popular!

Panel description:

“We need more data scientists. According to a McKinsey report highlighting the impending data scientist shortage (23 July 2013), ‘…by 2018 the United States will experience a shortage of 190,000 skilled data scientists, and 1.5 million managers and analysts capable of reaping actionable insights from the big data deluge.’ This panel will dive into how you can get started as a data scientist using open source tools and open data sources. This panel aims to inspire by giving concrete examples, cases and tool tips to would-be data scientists and geeks already working in the field.”

You can read more and help vote it up at: http://panelpicker.sxsw.com/vote/48846