Where do letters occur in words

A while back I came across an interesting graphic showing where letters occur in English words (http://www.prooffreader.com/2014/05/graphing-distribution-of-english.html). The other day I decided to make a similar one for letters in Danish words, and I used R to do it.

I downloaded all abstracts from the Danish Wikipedia and made my own version, as you can see here:

[Figure: Bogstavsplaceringer (letter positions in Danish words)]

Here is how you can do it:

# First you need to load in some text

library(rvest)

# I'll grab an article from FiveThirtyEight.com as a showcase.
# I did my analysis on all the Danish abstracts from Wikipedia (took a while!)
# When you do your final analysis you'll want as much text as possible too.

# We grab the html data (read_html() replaces the older html() function in rvest)
html_data <- read_html("http://fivethirtyeight.com/features/how-to-read-the-mind-of-a-supreme-court-justice/")

# We extract some text
textfile <- html_data %>% html_nodes("p") %>% html_text(trim = TRUE)

# We collapse it into a single string
textfile <- paste(textfile, collapse = " ")

# Then we need to do a little string manipulation

library(stringr)

# We set all text to lower case
textfile <- str_to_lower(textfile)

# We remove all punctuation and all digits
textfile <- str_replace_all(textfile, "[[:punct:]]|[[:digit:]]", "")

# Then we split the string into individual words (on any whitespace) and drop empty strings
words <- unique(unlist(str_split(textfile, "\\s+")))
words <- words[words != ""]

# And we count the letters in each word
word_length <- nchar(words)

# And we split each word into its individual letters
split_words <- str_split(words, "")

# Then we create a loop to find the position of each letter in each word
# If you have national letters like we do in Denmark, you include them like this: for(i in c(letters, "æ", "ø", "å"))

# We create a list to hold the results for all letters
letter_place.data <- list()

for(i in letters){ # We loop through all the letters

  # Create an empty vector to hold the positions for this letter
  letter_place.list <- numeric(0)

  # We find the position of each letter in the words (that we split apart)
  letter_data <- lapply(split_words, function(x) which(x == i))

  # A nested loop calculates the relative position of the letter in each word
  for(y in seq_along(word_length)){

    # We find the relative position
    letter_place <- letter_data[[y]] / word_length[y]

    # We add that position to the vector of positions
    letter_place.list <- c(letter_place.list, letter_place)
  }

  # We store the results from the loop under the name of the letter
  letter_place.data[[i]] <- letter_place.list

}

# Now we have a nested list with the data we need; next we convert it to a long-form data frame

# We create an empty data frame to hold the data
letter_place.data.df <- data.frame()

# Then we create a loop to put the data from each letter list into the data frame
for(z in seq_along(letter_place.data)){ # We loop through each nested list

  tryCatch({ # I add the tryCatch so the loop doesn't break if there is an error (this can occur if a letter is missing from the text)

    # Here we extract the data from the letter list and create a data frame
    loop_data <- data.frame(letter = names(letter_place.data)[z], value = letter_place.data[[z]], stringsAsFactors = FALSE)

    # We then bind all the data frames together
    letter_place.data.df <- rbind(letter_place.data.df, loop_data)

  }, error = function(e){}) # Ends the tryCatch
}

# We check to see if we have all the letters
unique(letter_place.data.df$letter)

# We change the letters back to upper case for aesthetics in the graphic
letter_place.data.df$letter <- str_to_upper(letter_place.data.df$letter)

library(ggplot2)

# We create a density plot with free y scales to show the distribution, we choose a red fill colour and then we facet wrap it to show each individual letter
p <- ggplot(letter_place.data.df, aes(x = value)) + geom_density(aes(fill = "red")) + facet_wrap(~ letter, scales = "free_y")

# We add appropriate text to the title and axes
p <- p + labs(title = "Where do letters typically appear in English words", y = "Appearance", x = "Word length", fill = "")

# We set a deeper red, choose the minimal theme, remove axis markers and grid, and remove the legend
p <- p + scale_fill_brewer(palette = "Set1") + theme_minimal() +
  theme(axis.ticks = element_blank(), axis.text.y = element_blank(), axis.text.x = element_blank(),
        legend.position = "none", panel.grid.major = element_blank(), panel.grid.minor = element_blank())

# Voila! Here it is
p
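
The relative-position step (letter index divided by word length) can also be written without nested loops. What follows is only a minimal sketch using base R, not the code I ran for the Danish analysis: the helper name `relative_positions`, its `alphabet` argument, and the sample words are all made up for illustration.

```r
# Hypothetical helper (an illustration, not part of the original analysis):
# returns a long-form data frame with one row per letter occurrence and
# that letter's relative position within its word.
relative_positions <- function(words, alphabet = letters) {
  # Keep each non-empty word once, as in the script above
  words <- unique(words[nchar(words) > 0])

  # Split each word into its individual characters
  chars <- strsplit(words, "")

  # Relative position = index of the character / number of characters
  pos <- lapply(chars, function(x) seq_along(x) / length(x))

  out <- data.frame(letter = unlist(chars), value = unlist(pos),
                    stringsAsFactors = FALSE)

  # Keep only the characters that belong to the chosen alphabet
  out[out$letter %in% alphabet, , drop = FALSE]
}

# Made-up sample words, just to show the shape of the result
res <- relative_positions(c("hello", "at"))
print(res)
```

The resulting data frame has the same `letter` and `value` columns as `letter_place.data.df`, so it should work with the same ggplot call. For Danish text you would pass `alphabet = c(letters, "æ", "ø", "å")`.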

 

I hope the post inspired you to do one in your own language. If you do, I'd love to see it.

And if you want more inspiration for cool projects, check out: http://www.r-bloggers.com/

Showing 1 comment

  1. ML

    Great idea. What do you mean by “Wikipedia abstract” and how did you get it? I’ve found no abstract part in Wikipedia HTML code.
