Classify gender based on Danish first names

In Denmark we have official lists of what people are allowed to have as first names. That means there are lists of government-approved boys' names, girls' names and unisex names. There is a total of 18,529 approved girls' names, 15,052 boys' names and 813 unisex names.

This means that we can write an R package that classifies a name as either male, female, unisex or indeterminable. And I did just that. Allow me to introduce the “namesDK” package. It is available from GitHub by running devtools::install_github("56north/namesDK").

After that, you feed it a character vector of names. It uses the first name to classify the gender, so if you provide a full name (e.g. Lars Løkke Rasmussen), it will split the string and use the first name (here, Lars).
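To make the idea concrete, here is a minimal sketch of that kind of lookup. It is not the namesDK implementation: the three tiny name vectors and the helper classify_first_name() are hypothetical stand-ins for the official lists, and the lower-casing is my own assumption about how matching could be done.

```r
# Hypothetical, tiny stand-ins for the official lists (not the real data).
girls  <- c("helle", "anne")
boys   <- c("lars", "troels")
unisex <- c("kim", "robin")

# Illustrative helper only; not the function shipped in namesDK.
classify_first_name <- function(full_names) {
  # Keep the first whitespace-separated token, lower-cased for matching.
  first <- tolower(sub("[[:space:]].*$", "", trimws(full_names)))

  # Start with NA and fill in every name found in one of the lists.
  out <- rep(NA_character_, length(first))
  out[first %in% girls]  <- "female"
  out[first %in% boys]   <- "male"
  out[first %in% unisex] <- "unisex"
  out
}

classify_first_name(c("Lars Løkke Rasmussen", "Kim Larsen", "Traktor Troels"))
```

Names that appear in none of the lists stay NA, which matches how unapproved names are reported by the package.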

You can use the package if you have a lot of names that you would like to attach demographic variables, such as gender, to. The names could be mined from social media, taken from a customer list, etc.

In order to do this you simply call the “gender” function from the package. Here is a brief example of how it works:


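library(namesDK)

# A single full name: only the first name ("Lars") is used
gender("Lars Løkke Rasmussen")
#> [[1]]
#> [1] "male"

# Several names at once: one result per name
gender(c("Helle Thorning Smidt", "Lars Løkke Rasmussen", "Traktor Troels"))
#> [[1]]
#> [1] "female"
#> [[2]]
#> [1] "male"
#> [[3]]
#> [1] NA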
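
Since the result comes back as a list, you will probably want to flatten it before attaching it to your data. Here is a small sketch of one way to do that. The data frame "people" is made up for illustration, and the list "res" is simply copied from the output above so the snippet runs without the package. It assumes there is exactly one result per name.

```r
# Made-up data frame holding the names from the example above.
people <- data.frame(
  name = c("Helle Thorning Smidt", "Lars Løkke Rasmussen", "Traktor Troels"),
  stringsAsFactors = FALSE
)

# Copied from the output above; with the package installed this would
# presumably be: res <- gender(people$name)
res <- list("female", "male", NA)

# Flatten to one character value per name (assumes one element per input).
people$gender <- vapply(res, function(x) as.character(x)[1], character(1))

# Count each class, keeping the unclassified names visible.
table(people$gender, useNA = "ifany")
```

Using vapply() rather than unlist() keeps the result aligned with the rows even if an element is NA.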

As you can see, the last string in the call above has “Traktor” (the machine used in agriculture) as its first name and therefore returns NA, since Traktor is not an approved Danish first name.

There you go. Sweet and simple. Enjoy.

If your country has the same sort of rules, maybe we should create a package that can classify gender based on first names across multiple languages. Let me know if you are interested 🙂