Where is the R Activity?

r_activity

R has become one of the world’s most widely used statistics and visualisation software packages, with an ever-growing user community. Thanks to the release of log files containing all hits to the http://cran.rstudio.com/ server, it is possible to make a map showing the parts of the world with the most active R users (specifically those mostly using the RStudio interface). The USA comes top with 3,045,960 requests to the server between October 2012 and June 2013. Japan is in 2nd place with a mere 756,177 requests and Germany is 3rd. In all, 203 countries appear in the server logs. I have scaled the map according to the number of server requests made, and you can clearly see the dominance of Japan, Europe and North America compared with other parts of the world, especially Africa. The map of course isn’t a perfect representation of the number of R users, as you could have one or two people making hundreds of server requests a day versus a large number of people only making a couple. This is why I have entitled the map “Activity” rather than “Users”. Either way, R hasn’t quite achieved global domination but it is getting there…

To create the map I obtained the files following the instructions on the logs download page. I then combined them with the following code (taken from here):
setwd("XXX") #this needs to be the directory with the downloaded files in it.
file_list <- list.files()
for (file in file_list) {
  if (!exists("dataset")) {
    # if the merged dataset doesn't exist, create it from the first file
    dataset <- read.csv(file, header=TRUE)
  } else {
    # if the merged dataset does exist, append the next file to it
    temp_dataset <- read.csv(file, header=TRUE)
    dataset <- rbind(dataset, temp_dataset)
    rm(temp_dataset)
  }
  print(file)
}
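
Growing a data frame inside a loop gets slow as the number of log files increases. A more compact alternative, sketched here on the assumption that every file has the same columns, reads all the files into a list and binds the rows once. The temporary directory and the two toy files are invented purely so the example runs on its own; in practice `log_dir` would be the folder holding the downloaded logs.

```r
# Toy setup (invented data): two small CSV files standing in for daily logs.
log_dir <- file.path(tempdir(), "cran_logs_demo")
dir.create(log_dir, showWarnings = FALSE)
write.csv(data.frame(date = "2013-06-01", country = c("US", "JP")),
          file.path(log_dir, "2013-06-01.csv"), row.names = FALSE)
write.csv(data.frame(date = "2013-06-02", country = c("US", "DE", "US")),
          file.path(log_dir, "2013-06-02.csv"), row.names = FALSE)

# Read every CSV in the directory into a list, then bind the rows once.
file_list <- list.files(log_dir, pattern = "\\.csv$", full.names = TRUE)
dataset <- do.call(rbind, lapply(file_list, read.csv, header = TRUE))

# Number of requests across all files.
nrow(dataset)
```

Using `full.names = TRUE` also removes the need for `setwd()`, since each path is complete.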

It is then possible to aggregate the data to get the number of requests per country.

dataset$flag <- 1
counts <- aggregate(dataset$flag, by=list(dataset$country), sum)
names(counts) <- c("country", "count")
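
As a quick check on the aggregation step, here is a toy version with invented rows. It also shows that `table()` should give the same tallies without the helper `flag` column.

```r
# Toy data (invented): one row per server request.
dataset <- data.frame(country = c("US", "JP", "US", "DE", "US"),
                      stringsAsFactors = FALSE)

# Same approach as above: flag each request, then sum the flags by country.
dataset$flag <- 1
counts <- aggregate(dataset$flag, by = list(dataset$country), sum)
names(counts) <- c("country", "count")

# Equivalent tally using table().
counts_tab <- as.data.frame(table(country = dataset$country),
                            responseName = "count",
                            stringsAsFactors = FALSE)

# Countries ordered from most to fewest requests.
counts[order(-counts$count), ]
```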

The next step was to download a world shapefile (containing the country borders) from Natural Earth. This contains the country codes used in the log file (the dataset object above). We can open this file with the maptools package:

library(maptools)
world<-readShapePoly("yourworldshapefile")

It is then possible to join our counts object to the world object to assign the log counts to each country based on the "iso_a2" and "country" fields respectively. The new shapefile is also saved.

world@data = data.frame(world@data, counts[match(world@data[,"iso_a2"], counts[,"country"]),])
writePolyShape(world, "world_r_use.shp")
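
The join relies on `match()`, which returns, for each country in the shapefile, the row of `counts` holding the same code, or `NA` where the country never appears in the logs. A small sketch with made-up values, where the data frame `attrs` is a stand-in for `world@data`:

```r
# Stand-in for world@data (invented rows); "AQ" has no entry in the logs.
attrs <- data.frame(name = c("Germany", "Japan", "Antarctica"),
                    iso_a2 = c("DE", "JP", "AQ"),
                    stringsAsFactors = FALSE)
counts <- data.frame(country = c("JP", "US", "DE"),
                     count = c(7, 30, 5),
                     stringsAsFactors = FALSE)

# Row of `counts` matching each shapefile country (NA if absent).
idx <- match(attrs[, "iso_a2"], counts[, "country"])

# Attach the matched rows; the row order of `attrs` is preserved, which
# keeps the attribute table aligned with the polygons.
joined <- data.frame(attrs, counts[idx, ])
joined
```

Countries with no requests end up with `NA` in the `count` column, so it may be worth deciding how to treat those before building the cartogram.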

This next bit is a bit of a cheat, as I used the ScapeToad software to create the cartogram. A package exists to do this in R but I find ScapeToad to be more powerful. You can download the shapefile I produced from here. I then reloaded the new shapefile into R and used the basic plot functions to produce the map.

cartogram<-readShapePoly("world_r_carto.shp")

plot(cartogram)
title(main="R Activity Around the World", sub="Based on cran.rstudio.com Activity Logs October 2012-June 2013")

This is my first stab at looking at the data - there is a lot more that can be done with it!

23 Comments

  1. andrew clark

    Fascinating work

    Can you let me know the line(s) of code you used to download the files from the URLs listed on the CRAN page?
    I’m getting errors with download.file

    Tx

    1. andrew clark

      The error was because the Rstudio code had
      ‘http//cran-logs.rstudio.com

      instead of ‘http://cran-logs.rstudio.com

      Hadley has been alerted so should be corrected soon

    1. James Author

      This is because the IP addresses are registered to countries, so Alaska gets distorted in the same way as the rest of the US. It’s not perfect I know, just a fun first visualisation with these data.

  2. Dave Unwin

    Hi James
    Nice map but shouldn’t it be normalised in some way? As it stands it’s really just a choropleth based on absolute numbers. Informative enough but … Note that NZ, home of R, doesn’t seem to get a look in.

    Dave

    1. James Author

      I agree it always makes sense to normalise but I am not sure what a good denominator would be? I think per head of population would be a bit odd – we need a rough estimate of statisticians in each country…

      1. Dave Unwin

        Yes, the denominator problem seems to me to be a very serious one in any area/value mapping and one that most people, including otherwise sensible statisticians, don’t seem to ‘get’. As a first approximation one might argue that #statisticians is proportionate to #population, so this would be vaguely appropriate?

        dave

      1. Dave Unwin

        Many thanks, I see what you mean and in fact know the NZ stats/geog scene quite well having worked in Hamilton and in ChCh. Given the history of R, I suspect that as a proportion of the ‘at risk’ population NZ will have much more R based activity and even noting your point, the map doesn’t show this. Compare it with Danny’s worldmapper cartograms of country population. As James indicates the denominator problem isn’t trivial and I am prepared to argue that deriving a sensible one is probably the key to effective choropleth mapping. Many cartographers will go so far as to argue that non-normalised choropleths should never be drawn. But I guess you know that already. I suspect that adding a second variable (colour?) to these univariate cartograms is probably the best way to use them in geovis work.
        dave
