First of all, this is in no way a statement on the immigration crisis in Europe. I do have opinions. But it is more a reaction or reflection on three maps I saw on this page.
Danish television channel TV2 is illustrating the number of refugees, or perhaps rather immigrants, received in EU member states in the period 2015 to 2017. This is the map showing the number of immigrants to the EU in 2015.
Note Germany. Germany welcomed by far the highest absolute number of immigrants. What piqued my interest, though, is that while this might be a good illustration of the raw numbers, it is not really the relevant comparison. Yes, Germany welcomed more refugees than Denmark did. But Germany is a considerably larger country than Denmark. For a given value of “fair”, it is only fair that Germany takes more refugees than smaller countries.
A more relevant comparison might be the number of refugees compared to population. Or area. Sweden saw (at that time) no problems with welcoming a huge number of migrants, because, as they said, there is a lot of unpopulated space in Sweden, plenty of room for everyone! Or perhaps GDP is a better measure: richer countries should shoulder a larger part of the challenge than poorer countries.
I’m not concerned here with what is fair. What concerns me is that the graphic is misleading. Let’s make an attempt at fixing that. Or at least present a slightly different perspective on the data.
I’ll try to illustrate the number of migrants as a proportion of population in the different countries. The data is “stolen” directly from the news channel. They have it from UNHCR, Eurostat and the European Parliament.
The first step will be to get the data.
url <- "http://nyheder.tv2.dk/udland/2018-06-28-se-kortet-saa-mange-asylansoegere-har-de-forskellige-eu-lande-taget"
data <- readLines(url)
By inspection, I can see that the relevant data is in these three lines:
dat.2015 <- data[460]
dat.2016 <- data[409]
dat.2017 <- data[358]
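Hard-coded line numbers like these stop working as soon as the page changes. A minimal sketch of finding the lines by content instead, assuming that each data line contains the key valueheat (as the sample further down suggests). The page vector is a made-up stand-in for the downloaded page, not the real source:

```r
library(stringr)

# Made-up stand-in for the result of readLines(url)
page <- c(
  "<html>",
  "{\"regions\":{\"Albanien\":{\"valueheat\":\"\"}}}",
  "<p>some text</p>",
  "{\"regions\":{\"Albanien\":{\"valueheat\":\"\"}}}"
)

# Indices of the lines that mention the key, instead of fixed numbers
hits <- which(str_detect(page, fixed("valueheat")))
hits
```

The indices in hits could then be used in place of 460, 409 and 358, after checking which hit belongs to which year.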
There is a small problem. The Danish characters are encoded as escape sequences. Let’s fix that:
library(stringr)
data <- str_replace_all(data,"\\\\u00d8", "Ø")
data <- str_replace_all(data,"\\\\u00e6", "æ")
data <- str_replace_all(data,"\\\\u00f8", "ø")
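Replacing the escapes one letter at a time works, but it only covers the three letters listed. A sketch of a more general approach, assuming the stringi package (which stringr builds on) is installed: stri_unescape_unicode() decodes every \uXXXX sequence in one go. The sample string is typed in by hand:

```r
library(stringi)

# As it appears in the page source: a literal backslash followed by u00d8 etc.
raw <- "\\u00d8strig og Gr\\u00e6kenland"

decoded <- stri_unescape_unicode(raw)
decoded
```

Note that this function also decodes other backslash escapes, so it is probably safest to run it on the extracted country names rather than on the whole page.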
And load it into the separate variables again:
dat.2015 <- data[460]
dat.2016 <- data[409]
dat.2017 <- data[358]
Next, I need to get the names of the countries, and the number of migrants received in each country.
The data I’m after looks like this:
"Bulgarien":{"valueheat":20365,"valuecolored":"none","description":""},
And the regular expression picking that out of the data ought to be:
'"(\p{L}+)":{"valueheat":(\d+|""),'
For some reason that is not working. I probably should try to figure that out, but I’m on vacation, and would rather drink cold white wine than dig too deep into the weirdness that is regular expressions in R.
Instead I’m going to use this simpler pattern:
pattern <- '(\")(\\w+)(.*?)(\\d+|\\\"\\\")'
And then fix problems later.
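For what it’s worth, a likely explanation (not checked against the live page) for why the stricter pattern fails is twofold: inside an R string every backslash has to be doubled, and the regex engine behind stringr wants the literal { escaped. A sketch under those assumptions, run on a sample line typed in by hand:

```r
library(stringr)

# Hand-typed sample in the shape shown above
sample_line <- paste0(
  "\"Bulgarien\":{\"valueheat\":20365,\"valuecolored\":\"none\",\"description\":\"\"},",
  "\"Danmark\":{\"valueheat\":20935,\"valuecolored\":\"none\",\"description\":\"\"},"
)

# Backslashes doubled for R, and the literal { escaped
strict <- "\"(\\p{L}+)\":\\{\"valueheat\":(\\d+|\"\"),"

# Column 1 is the full match; columns 2 and 3 are the two capture groups
m <- str_match_all(sample_line, strict)[[1]]
m[, 2:3]
```

With this, the country name and the number come out as separate columns directly, so less cleaning would be needed afterwards.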
Extracting the data:
dat.2015 <- unlist(str_extract_all(dat.2015, pattern))
dat.2016 <- unlist(str_extract_all(dat.2016, pattern))
dat.2017 <- unlist(str_extract_all(dat.2017, pattern))
Inspecting the data reveals that the interesting parts are elements 23 to 111.
dat.2015 <- dat.2015[23:111]
dat.2016 <- dat.2016[23:111]
dat.2017 <- dat.2017[23:111]
An example of two lines:
dat.2015[11:12]
## [1] "\"Danmark\":{\"valueheat\":20935"
## [2] "\"valuecolored\":\"none\",\"description\":\"\""
First, let’s get rid of the second line. There is one of those for each country.
Secondly, I’ll remove the first character of the first line: the leading quote, which R prints as \".
And thirdly, I’ll split that line on ":{"valueheat":
Using the nice pipes makes that easy:
library(purrr)
dat.2015 <- dat.2015 %>%
discard(str_detect, "valuecolored") %>%
substring(2) %>%
strsplit(split="\":{\"valueheat\":",fixed=TRUE)
That should give me a list where each element is a vector with two elements:
dat.2015[[8]]
## [1] "Finland" "32345"
Neat! Let’s do that with the other years:
dat.2016 <- dat.2016 %>%
discard(str_detect, "valuecolored") %>%
substring(2) %>%
strsplit(split="\":{\"valueheat\":",fixed=TRUE)
dat.2017 <- dat.2017 %>%
discard(str_detect, "valuecolored") %>%
substring(2) %>%
strsplit(split="\":{\"valueheat\":",fixed=TRUE)
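Since the same three steps are repeated for every year, they could also be wrapped in a small function. A sketch, where parse_year is a name I made up for the illustration and the input is the two example elements shown earlier:

```r
library(purrr)
library(stringr)

# parse_year is a hypothetical helper, not part of any package
parse_year <- function(x) {
  x %>%
    discard(~ str_detect(.x, "valuecolored")) %>%          # drop the filler elements
    substring(2) %>%                                       # drop the leading quote
    strsplit(split = "\":{\"valueheat\":", fixed = TRUE)   # split into name and number
}

# The two example elements from the output above
sample_2015 <- c(
  "\"Danmark\":{\"valueheat\":20935",
  "\"valuecolored\":\"none\",\"description\":\"\""
)
parse_year(sample_2015)
```

Each year would then be a single call, for example parse_year(dat.2016).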
Now, let’s get these data into some dataframes. First I’m unlisting the data, then I pour it into a matrix to get the right shape. And then I’m converting the matrices to dataframes:
dat.2017 <- data.frame(matrix(unlist(dat.2017), ncol=2, byrow=TRUE), stringsAsFactors = FALSE)
dat.2016 <- data.frame(matrix(unlist(dat.2016), ncol=2, byrow=TRUE), stringsAsFactors = FALSE)
dat.2015 <- data.frame(matrix(unlist(dat.2015), ncol=2, byrow=TRUE), stringsAsFactors = FALSE)
There is a slight problem. Albania was the first country in the list. And the structure of the raw data was a bit different.
dat.2015[1,1]
## [1] "regions\":{\"Albanien"
Let’s fix that:
dat.2015[1,1] <- "Albanien"
dat.2016[1,1] <- "Albanien"
dat.2017[1,1] <- "Albanien"
And let me just change the column names:
colnames(dat.2015) <- c("Land", "2015")
colnames(dat.2016) <- c("Land", "2016")
colnames(dat.2017) <- c("Land", "2017")
“Land” is Danish for “country”.
I’m going to need just one dataframe. I get that by joining the three dataframes:
library(dplyr)
total <- left_join(dat.2015, dat.2016, by="Land")
total <- left_join(total, dat.2017, by="Land")
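With more than a handful of tables, chaining joins by hand gets tedious. A sketch of the same idea using reduce() from purrr, on toy tables: only the 2015 figures come from the output above, while the 2016 and 2017 figures are invented for the illustration.

```r
library(dplyr)
library(purrr)

# Toy tables; the 2016 and 2017 numbers are invented
y2015 <- data.frame(Land = c("Danmark", "Finland"), `2015` = c("20935", "32345"),
                    check.names = FALSE, stringsAsFactors = FALSE)
y2016 <- data.frame(Land = c("Danmark", "Finland"), `2016` = c("6000", "5000"),
                    check.names = FALSE, stringsAsFactors = FALSE)
y2017 <- data.frame(Land = c("Danmark", "Finland"), `2017` = c("3000", "4000"),
                    check.names = FALSE, stringsAsFactors = FALSE)

# Join the tables one after another, always on the country name
joined <- reduce(list(y2015, y2016, y2017), left_join, by = "Land")
joined
```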
The numbers are saved as characters. Converting them to numeric:
total$`2015` <- as.numeric(total$`2015`)
## Warning: NAs introduced by coercion
total$`2016` <- as.numeric(total$`2016`)
## Warning: NAs introduced by coercion
total$`2017` <- as.numeric(total$`2017`)
## Warning: NAs introduced by coercion
That introduced some NAs. Countries where there are no data.
Inspecting the data, I can see that there are data for all three years for some countries. For other countries, there are no data at all. The function complete.cases() will return true for a row without NAs.
Using that to get rid of countries where we don’t have complete data:
total <- total[complete.cases(total),]
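To see what complete.cases() does, here is a toy example. The 2015 figures are the ones shown earlier; the 2016 figures are invented:

```r
# Toy table with one country lacking data
toy <- data.frame(
  Land   = c("Albanien", "Danmark", "Finland"),
  `2015` = c(NA, 20935, 32345),
  `2016` = c(NA, 6000, 5000),   # invented figures
  check.names = FALSE
)

# TRUE for rows without any NA
keep <- complete.cases(toy)
keep

# Only the rows with data for every year survive
toy[keep, ]
```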
Next is getting some figures for the populations.
The relevant page on Wikipedia is:
url <- 'https://da.wikipedia.org/wiki/Verdens_landes_befolkningsst%C3%B8rrelser'
Getting that:
library(XML)
library(httr)
r <- GET(url)
doc <- readHTMLTable(
doc=content(r, "text"))
tabellen <- doc$'NULL'
colnames(tabellen) <- apply(tabellen[1,],2,as.character)
I’m only interested in the country name, and the population:
tabellen <- tabellen %>%
select(`Land (eller territorium)`, Population)
Renaming the columns:
colnames(tabellen) <- c("Land", "Population")
tabellen$Population <- as.character(tabellen$Population)
tabellen$Population <- as.numeric(str_remove_all(tabellen$Population, fixed(".")))
## Warning: NAs introduced by coercion
And while I’m at it, the second line gets rid of the factors, and the third removes the thousands separators (“.”) and converts the result to numbers.
Now I can join the dataframe containing population figures, with the dataframe containing countries and number of migrants:
total <- left_join(total, tabellen, by="Land")
## Warning: Column `Land` joining character vector and factor, coercing into
## character vector
There are three smaller problems: Cyprus, France and Ireland. The problem is that the country name I get from Wikipedia contains a note, so the join finds no match for these three. I might be able to get rid of that by code. I’m going to do it manually.
total[10,5] <- 4635400
total[7,5] <- 67286000
total[3,5] <- 847000
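Row numbers like 10, 7 and 3 silently point at the wrong country if the row order ever changes. A sketch of two alternatives on toy data: assigning by name, and stripping the notes before joining. The bracketed form of the note (for example Frankrig[b]) is an assumption; I have not checked what it looks like on the page.

```r
library(stringr)

# Alternative 1: fill in the population by country name instead of row number
pop <- data.frame(Land = c("Cypern", "Irland"), Population = c(NA, NA),
                  stringsAsFactors = FALSE)
pop$Population[pop$Land == "Irland"] <- 4635400
pop$Population[pop$Land == "Cypern"] <- 847000

# Alternative 2: strip the notes from the Wikipedia names before joining
# (the [..] note format is assumed)
wiki_names <- c("Cypern[a]", "Danmark", "Frankrig[b]")
clean_names <- str_trim(str_remove_all(wiki_names, "\\[[^\\]]*\\]"))
clean_names
```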
Now I have a nice dataframe with the names of the countries (in Danish), the number of migrants received in 2015, 2016 and 2017, and the population in 2018.
Now it is time to look at some maps.
library(ggplot2)
library(rworldmap)
## ### Welcome to rworldmap ###
## For a short introduction type : vignette('rworldmap')
I am going to match the countries in my dataframe with the countries I get from the map data. That requires that I have the English names for the countries in my dataframe.
This is the translation:
enland <- c("Belgium", "Bulgaria", "Cyprus", "Denmark", "Estonia", "Finland", "France", "Greece", "Netherlands", "Ireland",
"Italy", "Croatia", "Latvia", "Lithuania", "Luxembourg", "Malta", "Poland", "Portugal", "Romania", "Slovakia",
"Slovenia", "Spain", "United Kingdom", "Sweden", "Czech Rep.", "Germany", "Hungary", "Austria" )
Getting that into the data frame:
total$enland <- enland
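Assigning the translations by position only works as long as the rows stay in exactly this order. A sketch of a lookup by name instead, with just a few countries filled in; the land vector stands in for total$Land:

```r
# Danish name -> English name as used in the map data (only a few shown)
lookup <- c(
  Belgien  = "Belgium",
  Danmark  = "Denmark",
  Tyskland = "Germany",
  Tjekkiet = "Czech Rep."
)

# Stand-in for total$Land, deliberately in a different order
land <- c("Danmark", "Belgien", "Tyskland")

# Look up each Danish name; unname() drops the Danish names from the result
enland_by_name <- unname(lookup[land])
enland_by_name
```

A country missing from the lookup comes out as NA, which makes mistakes easier to spot than a shifted vector would.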
Next, let’s get the map:
worldmap <- getMap()
That retrieves data for the entire world. I’m only interested in the EU:
EU <- worldmap[which(worldmap$NAME %in% enland),]
EU <- map_data(EU)
##
## Attaching package: 'maps'
## The following object is masked from 'package:purrr':
##
##     map
The first line extracts the part of the world map that has names in the list of countries that I have data for.
map_data() converts that into a nice structure that is suitable for entering into ggplot.
The next step is calculating the number of migrants received in each country as a proportion of that country’s population:
total <- total %>%
mutate(`2015` = `2015`/Population*100, `2016` = `2016`/Population*100, `2017`=`2017`/Population*100)
I’m mutating the columns 2015-2017 by dividing by population. And multiplying by 100 to get percentages.
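As a small worked example of the same calculation, here is a toy table. The 2015 counts are the ones shown earlier, the 2016 counts are invented, and the populations are rounded figures I am assuming for the illustration. across() is just a shorter way of applying the same formula to several columns:

```r
library(dplyr)

toy <- data.frame(
  Land       = c("Danmark", "Finland"),
  `2015`     = c(20935, 32345),     # from the output shown earlier
  `2016`     = c(6000, 5000),       # invented figures
  Population = c(5.8e6, 5.5e6),     # rounded, assumed populations
  check.names = FALSE
)

# Divide every year column by the population and scale to percent
toy_pct <- toy %>%
  mutate(across(c(`2015`, `2016`), ~ .x / Population * 100))
toy_pct
```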
The almost final step, is to join my migrant-proportions with the map data:
total <- left_join(total,EU, by=c("enland"="region") )
The map data does not call the countries “countries”. Rather, their names are saved in the variable “region”.
And now the final step. I’m going to need the data in tidy form, so I’m loading tidyr.
Then I pass the data frame to select(), where I pick out the variables I need: long(itude), lat(itude), 2015, 2016 and 2017, and group, which identifies the individual polygons.
That is passed to gather(), where I make a new row for each year, with the year and the proportion in the new variables year and prop.
All that is passed to ggplot, with a layer where the polygons describing the countries are plotted. They are filled with a colour matching the proportions, and grouped by “group”. This is important: grouping by country name gives weird results. I’ll get back to that. color = "white" draws the outlines of the polygons in white.
Finally, I facet the data on year.
library(tidyr)
total %>%
select(long,lat,`2015`,`2016`,`2017`, group) %>%
gather(year, prop, `2015`:`2017`)%>%
ggplot() +
geom_polygon(aes(long, lat, fill = prop, group = group), color = "white") +
theme_void() +
facet_wrap(~year, ncol=2)
That’s it!
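A side note on gather(): newer versions of tidyr (1.0 and later) recommend pivot_longer() for the same reshaping. A sketch on a toy table with illustrative values, not the real proportions:

```r
library(tidyr)

toy <- data.frame(
  enland = c("Denmark", "Finland"),
  `2015` = c(0.36, 0.59),   # illustrative values
  `2016` = c(0.10, 0.09),   # illustrative values
  check.names = FALSE
)

# One row per country and year, like gather(year, prop, `2015`:`2016`)
long <- pivot_longer(toy, cols = c(`2015`, `2016`),
                     names_to = "year", values_to = "prop")
long
```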
And now the picture is slightly different. What is interesting is that, relative to population, Germany still takes in more migrants than other countries. But in 2015, it didn’t. That was the year when the German chancellor Angela Merkel said the famous words “Wir schaffen das”, “We’ll manage”. But it was also the year when Hungary and Sweden welcomed migrants in numbers equalling 1.79% and 1.65% of their populations respectively. You can compare that with the fact that Germany the same year received migrants equalling 0.58% of its population.
A cynic might claim that it is no surprise that Sweden and Hungary closed their borders late in 2015.
Anyway, that is a different subject. I just think that these three maps are slightly more informative than what TV2 provided.
Also, I promised to get back to the group thingy.
Making the same plot, but grouping on country names:
total %>%
select(long,lat,`2015`,`2016`,`2017`, group, enland) %>%
gather(year, prop, `2015`:`2017`)%>%
ggplot() +
geom_polygon(aes(long, lat, fill = prop, group = enland), color = "white") +
theme_void()
What happens is that the polygons describing Italy are grouped in a way that connects the parts describing Sicily to the northern part of Italy. That looks weird. The same happens with Sardinia.
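The effect can be reproduced on a tiny scale. A sketch with a made-up country consisting of two separate squares: grouped by piece, ggplot sees two polygons, while grouped by country name it sees one and has to connect the two squares with a line.

```r
library(ggplot2)

# Two separate unit squares belonging to one made-up country
shape <- data.frame(
  long    = c(0, 1, 1, 0,  3, 4, 4, 3),
  lat     = c(0, 0, 1, 1,  0, 0, 1, 1),
  piece   = rep(c("mainland", "island"), each = 4),
  country = "Toyland"
)

by_piece   <- ggplot(shape) + geom_polygon(aes(long, lat, group = piece))
by_country <- ggplot(shape) + geom_polygon(aes(long, lat, group = country))

# Number of separate polygons ggplot will draw in each case
n_piece   <- length(unique(layer_data(by_piece)$group))
n_country <- length(unique(layer_data(by_country)$group))
c(n_piece, n_country)
```

Printing by_country shows the stray connection between the two squares, which is the small-scale version of what happens to Sicily and Sardinia.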
Finally, I have not been very consistent in my use of words. I have used “received” and “welcomed” interchangeably. Hungary and Denmark have not been very welcoming. But we are talking about real humans here, and welcoming simply sounds nicer than received. Complicating the situation was the fact that a lot of the arrivals were not actually what we would normally call refugees. At least not refugees from war. So I have also not been consistent in the use of “migrant” vs “refugee”. That is not really my point. The point is that we should always think about how these kinds of numbers are presented.