Let’s get to work on the network. I don’t have any citation lists, so what I’m after here is coauthorship. I’m going to draw on some unpublished work I did on Zika-virus research.
I can begin by disregarding all papers with only one author. I want to keep the publication year, as I want to animate the network.
As always, we’ll begin by reading the data:
data <- readRDS(file="d:\\acta\\consistentdata.rda")
head(data)
## title
## 1 Some Improvements in Electrophoresis.
## 2 Di-O-alkylmonothiophosphates and Di-O-alkylmonoselenophosphates and the Corresponding Pseudohalogens.
## 3 On the determination of Reducing Sugars.
## 4 On the Application of p-Carboxyphenylhydrazones in the Identification of Carbonyl Compounds.
## 5 A Note on the Growth Promoting Properties of an Enzymatic Hydrolysate of Casein.
## 6 Die Konstitution der Harzphenole und ihre biogenetischen Zusammenhänge. X. Herausspaltung des "MIttelstückes" des Pinoresinols
## authors language pages
## 1 Astrup, Tage; Brodersen, Rolf english 1-7
## 2 Foss, Olav english 8-31
## 3 Blom, Jakob; Rosted, Carl Olof english 32-53
## 4 Veibel, Stig english 54-68
## 5 Ågren, Gunnar english 69-70
## 6 Erdtman, Holger; Gripenberg, Jarl german 71-75
## doi volume year
## 1 10.3891/acta.chem.scand.01-0001 1 1947
## 2 10.3891/acta.chem.scand.01-0008 1 1947
## 3 10.3891/acta.chem.scand.01-0032 1 1947
## 4 10.3891/acta.chem.scand.01-0054 1 1947
## 5 10.3891/acta.chem.scand.01-0069 1 1947
## 6 10.3891/acta.chem.scand.01-0071 1 1947
## url
## 1 /pdf/acta_vol_01_p0001-0007.pdf
## 2 /pdf/acta_vol_01_p0008-0031.pdf
## 3 /pdf/acta_vol_01_p0032-0053.pdf
## 4 /pdf/acta_vol_01_p0054-0068.pdf
## 5 /pdf/acta_vol_01_p0069-0070.pdf
## 6 /pdf/acta_vol_01_p0071-0075.pdf
First, I’ll need to extract the author names and get a count.
For each row in the dataframe, there is a column with author names. I extracted the number of authors in a previous post:
for(i in 1:nrow(data)){
  # count the names in the semicolon-separated author string
  data$number[i] <- length(unlist(strsplit(data$authors[i],';')))
}
head(data[,c('authors', 'number', 'year')])
## authors number year
## 1 Astrup, Tage; Brodersen, Rolf 2 1947
## 2 Foss, Olav 1 1947
## 3 Blom, Jakob; Rosted, Carl Olof 2 1947
## 4 Veibel, Stig 1 1947
## 5 Ågren, Gunnar 1 1947
## 6 Erdtman, Holger; Gripenberg, Jarl 2 1947
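The same count can be had without a loop. This is only a minimal sketch on a made-up `authors` vector (the three strings are my own toy input, not the real data), and it assumes that a paper without authors is stored as an empty string:

```r
# Toy author strings in the same "Last, First; Last, First" layout as the data
authors <- c("Astrup, Tage; Brodersen, Rolf", "Foss, Olav", "")

# strsplit() returns one character vector per string,
# and lengths() counts the elements of each of them
number <- lengths(strsplit(authors, ";", fixed = TRUE))
number
```

An empty string splits into a vector of length zero, which would explain how a paper can end up with zero authors.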
That’s the data I need. I am going to assume that no two different authors share the same name. And even though I might be interested in adding other data to the plots later, I’m going to limit myself to this.
First I’m going to get rid of the papers with only one author. I’m not sure that the iterative reuse of the same variable name is good practice. I’m gonna do it anyway:
data <- data[,c('authors', 'number', 'year')]
head(data)
## authors number year
## 1 Astrup, Tage; Brodersen, Rolf 2 1947
## 2 Foss, Olav 1 1947
## 3 Blom, Jakob; Rosted, Carl Olof 2 1947
## 4 Veibel, Stig 1 1947
## 5 Ågren, Gunnar 1 1947
## 6 Erdtman, Holger; Gripenberg, Jarl 2 1947
data <- data[data$number != 1,]
head(data)
## authors
## 1 Astrup, Tage; Brodersen, Rolf
## 3 Blom, Jakob; Rosted, Carl Olof
## 6 Erdtman, Holger; Gripenberg, Jarl
## 7 Byström, Anders; Almin, Karl Erik
## 8 Virtanen, Artturi I.; Jorma, Juho; Linkola, Hilkka; Linnasalmi, Annikki
## 9 Sörensen, Nils Andreas; Bruun, Torger
## number year
## 1 2 1947
## 3 2 1947
## 6 2 1947
## 7 2 1947
## 8 4 1947
## 9 2 1947
But wait! There was something about a paper with zero authors!
min(data$number)
## [1] 0
Damn. I should probably have written data <- data[data$number > 1,] instead. Anyway, easily remedied:
data <- data[,c('authors', 'number', 'year')]
head(data)
## authors
## 1 Astrup, Tage; Brodersen, Rolf
## 3 Blom, Jakob; Rosted, Carl Olof
## 6 Erdtman, Holger; Gripenberg, Jarl
## 7 Byström, Anders; Almin, Karl Erik
## 8 Virtanen, Artturi I.; Jorma, Juho; Linkola, Hilkka; Linnasalmi, Annikki
## 9 Sörensen, Nils Andreas; Bruun, Torger
## number year
## 1 2 1947
## 3 2 1947
## 6 2 1947
## 7 2 1947
## 8 4 1947
## 9 2 1947
data <- data[data$number != 0,]
head(data)
## authors
## 1 Astrup, Tage; Brodersen, Rolf
## 3 Blom, Jakob; Rosted, Carl Olof
## 6 Erdtman, Holger; Gripenberg, Jarl
## 7 Byström, Anders; Almin, Karl Erik
## 8 Virtanen, Artturi I.; Jorma, Juho; Linkola, Hilkka; Linnasalmi, Annikki
## 9 Sörensen, Nils Andreas; Bruun, Torger
## number year
## 1 2 1947
## 3 2 1947
## 6 2 1947
## 7 2 1947
## 8 4 1947
## 9 2 1947
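For the record, the two filtering steps can be collapsed into one by keeping only papers with more than one author. A small sketch, where `papers` and `coauthored` are hypothetical names and the three rows are toy data:

```r
# Toy data: one coauthored paper, one single-author paper, one with no authors
papers <- data.frame(
  authors = c("Astrup, Tage; Brodersen, Rolf", "Foss, Olav", ""),
  number  = c(2L, 1L, 0L),
  year    = c(1947, 1947, 1948),
  stringsAsFactors = FALSE
)

# "> 1" drops both the single-author and the zero-author rows in one go
coauthored <- papers[papers$number > 1, ]
coauthored
```

Writing the result to a new variable instead of overwriting `data` also makes it easier to rerun a single step.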
For each line in this dataframe, I need to do the following:
1. Extract all the author names.
2. Make all combinations of two author names.
3. Write each of those combinations to a new dataframe,
4. along with the publication year of the paper.
Step 1: Get a list with the authors for a line
authorList <- unlist(strsplit(data$authors[1],';'))
authorList
## [1] "Astrup, Tage" " Brodersen, Rolf"
Step 2: Use combn to make the combinations:
combinations <- NULL
combinations <- combn(authorList, 2, simplify = FALSE)
combinations
## [[1]]
## [1] "Astrup, Tage" " Brodersen, Rolf"
Hmm. That is rather simple. Let’s look at an example with more than two authors:
authorList <- unlist(strsplit(data$authors[5],';'))
combinations <- combn(authorList, 2, simplify = FALSE)
combinations
## [[1]]
## [1] "Virtanen, Artturi I." " Jorma, Juho"
##
## [[2]]
## [1] "Virtanen, Artturi I." " Linkola, Hilkka"
##
## [[3]]
## [1] "Virtanen, Artturi I." " Linnasalmi, Annikki"
##
## [[4]]
## [1] " Jorma, Juho" " Linkola, Hilkka"
##
## [[5]]
## [1] " Jorma, Juho" " Linnasalmi, Annikki"
##
## [[6]]
## [1] " Linkola, Hilkka" " Linnasalmi, Annikki"
Okay. combn takes a vector of author names. The “2” tells combn how many elements I want to choose at a time (in this case 2). simplify determines whether the result is returned as a matrix (TRUE) or as a list (FALSE). I’m going to go with the list here.
Therefore, I get a list with 6 items, each a vector of two names, because there are 4 names and 6 different ways to pick 2 names from those 4.
They can be addressed individually:
combinations[[6]]
## [1] " Linkola, Hilkka" " Linnasalmi, Annikki"
combinations[[6]][1]
## [1] " Linkola, Hilkka"
combinations[[6]][2]
## [1] " Linnasalmi, Annikki"
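The six pairs are no coincidence: picking 2 names out of 4 can be done in 4 × 3 / 2 = 6 ways. A quick check, with the four names typed in by hand (and without the leading spaces) so that the snippet runs on its own:

```r
authorList <- c("Virtanen, Artturi I.", "Jorma, Juho",
                "Linkola, Hilkka", "Linnasalmi, Annikki")

# choose(n, k) is the number of ways to pick k items out of n
n_pairs <- choose(length(authorList), 2)

# simplify = FALSE gives a list with one element per pair
pairs <- combn(authorList, 2, simplify = FALSE)
length(pairs) == n_pairs

# simplify = TRUE (the default) gives a matrix with one column per pair
dim(combn(authorList, 2))
```

A paper with ten authors would already contribute choose(10, 2) = 45 edges, so a few large author groups can quickly dominate the network.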
Let’s put that together:
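raw_df <- NULL
for(i in 1:nrow(data)){
  # split the author string for paper i into individual names
  authorList <- unlist(strsplit(data$authors[i],';'))
  # every pair of coauthors on this paper
  combinations <- combn(authorList, 2, simplify = FALSE)
  for(j in 1:length(combinations)){
    # one row per pair: first author, second author, publication year
    newrow <- c(combinations[[j]][1], combinations[[j]][2], data$year[i])
    raw_df <- rbind(raw_df, newrow)
  }
}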
colnames(raw_df) <- c("author1", "author2", "year")
head(raw_df)
## author1 author2 year
## newrow "Astrup, Tage" " Brodersen, Rolf" "1947"
## newrow "Blom, Jakob" " Rosted, Carl Olof" "1947"
## newrow "Erdtman, Holger" " Gripenberg, Jarl" "1947"
## newrow "Byström, Anders" " Almin, Karl Erik" "1947"
## newrow "Virtanen, Artturi I." " Jorma, Juho" "1947"
## newrow "Virtanen, Artturi I." " Linkola, Hilkka" "1947"
data <- raw_df
What was I doing? I began by defining a new dataframe, raw_df (strictly speaking it ends up as a character matrix, which is why the year shows up in quotes). Then, for each row in my data, I extracted a list of author names. I made all combinations of those names. For each of the combinations, I took the first author name, the second author name, and the year of the paper, and made a vector, which I added to the new dataframe. And then I continued with the next row in the original data.
Oh, and I gave the dataframe some column names. And a new name.
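Growing an object with rbind inside a double loop works, but it gets slow on large data. Here is a sketch of another way to build the same edge list as a proper dataframe, so that the year stays numeric. The helper `pairs_for_paper` and the toy `papers` dataframe are names I made up for the illustration, and the sketch assumes that every paper has at least two authors:

```r
# Toy data: one paper with two authors and one with three
papers <- data.frame(
  authors = c("Astrup, Tage; Brodersen, Rolf",
              "Virtanen, Artturi I.; Jorma, Juho; Linkola, Hilkka"),
  year = c(1947, 1947),
  stringsAsFactors = FALSE
)

# Hypothetical helper: all author pairs for one paper, one row per pair
pairs_for_paper <- function(authors, year) {
  author_names <- strsplit(authors, ";", fixed = TRUE)[[1]]
  pairs <- t(combn(author_names, 2))   # matrix with two columns
  data.frame(author1 = pairs[, 1], author2 = pairs[, 2], year = year,
             stringsAsFactors = FALSE)
}

# Build one small dataframe per paper and stack them in a single step
edge_list <- do.call(rbind, Map(pairs_for_paper, papers$authors, papers$year))
rownames(edge_list) <- NULL
edge_list
```

Just like the loop, this keeps the leading space in front of every name except the first one on each paper.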
And now I should be ready to make a graph.
A graph consists of nodes (individual authors), connected by edges (the fact that they co-wrote a paper). Those edges may have weights, e.g. if two authors have written ten papers together, the edge between them might be drawn stronger or darker than the edge between two authors who have only written one paper together.
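One way to get such weights is simply to count how often each pair occurs in the edge list. A minimal base R sketch on made-up authors (the `edges` dataframe and its three rows are invented for the illustration):

```r
# Toy edge list: A and B share two papers (listed in different order), A and C one
edges <- data.frame(
  author1 = c("Author A", "Author B", "Author A"),
  author2 = c("Author B", "Author A", "Author C"),
  stringsAsFactors = FALSE
)

# Put each pair in alphabetical order so that A-B and B-A count as the same edge
a <- pmin(edges$author1, edges$author2)
b <- pmax(edges$author1, edges$author2)

# Sum a constant 1 per row within each pair: the number of shared papers
weighted <- aggregate(weight ~ a + b,
                      data = data.frame(a, b, weight = 1),
                      FUN = sum)
weighted
```

The resulting weight column could later be attached to the edges of the graph.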
A commonly used library for working with networks is igraph. There is a LOT of functionality in it.
I’ll begin by making the network graph object.
require(igraph)
## Loading required package: igraph
## Warning: package 'igraph' was built under R version 3.2.5
##
## Attaching package: 'igraph'
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
network <- graph.data.frame(data, directed=F)
network
## IGRAPH UN-- 12145 30520 --
## + attr: name (v/c), year (e/c)
## + edges (vertex names):
## [1] Astrup, Tage -- Brodersen, Rolf
## [2] Blom, Jakob -- Rosted, Carl Olof
## [3] Erdtman, Holger -- Gripenberg, Jarl
## [4] Byström, Anders -- Almin, Karl Erik
## [5] Virtanen, Artturi I. -- Jorma, Juho
## [6] Virtanen, Artturi I. -- Linkola, Hilkka
## [7] Virtanen, Artturi I. -- Linnasalmi, Annikki
## [8] Jorma, Juho -- Linkola, Hilkka
## + ... omitted several edges
Simple enough. Or something. IGRAPH tells us that this is an igraph graph. The “UN--” part: U tells us that this is an undirected graph. Author A is connected to author B, but there is no hierarchy or direction in that connection. The N tells us that the graph is named, meaning the vertices have a name attribute. And then the two dashes. The first could have been a W instead, if there had been weights on the edges, e.g. if some connections had more weight or importance than others. There are none yet. The last could have been a B, if the graph was bipartite. What that actually means? No idea (yet).
The two next numbers, 12145 and 30520, are the numbers of vertices/nodes and edges respectively. In this case there are 12145 authors, and 30520 connections between them. I would guess that we don’t actually have that many authors.
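The flags and counts in that first line can also be asked for one at a time. A small sketch on a made-up graph with three authors, using graph_from_data_frame, which is the newer igraph name for graph.data.frame:

```r
library(igraph)

# Toy edge list; the third column becomes an edge attribute
edges <- data.frame(
  author1 = c("Author A", "Author A"),
  author2 = c("Author B", "Author C"),
  year    = c("1947", "1948"),
  stringsAsFactors = FALSE
)
g <- graph_from_data_frame(edges, directed = FALSE)

is_directed(g)   # FALSE, hence the "U"
is_named(g)      # TRUE, hence the "N": the vertices have a name attribute
is_weighted(g)   # FALSE: there is no edge attribute called "weight"
is_bipartite(g)  # FALSE: there is no vertex attribute called "type"

vcount(g)        # number of vertices
ecount(g)        # number of edges
```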
We can take a closer look at the vertices and the edges:
V(network)
## + 12145/12145 vertices, named:
## [1] Astrup, Tage
## [2] Blom, Jakob
## [3] Erdtman, Holger
## [4] Byström, Anders
## [5] Virtanen, Artturi I.
## [6] Jorma, Juho
## [7] Linkola, Hilkka
## [8] Sörensen, Nils Andreas
## [9] Toivonen, N. J.
## [10] Niininen (Tommila), Salli
## + ... omitted several vertices
E(network)
## + 30520/30520 edges (vertex names):
## [1] Astrup, Tage -- Brodersen, Rolf
## [2] Blom, Jakob -- Rosted, Carl Olof
## [3] Erdtman, Holger -- Gripenberg, Jarl
## [4] Byström, Anders -- Almin, Karl Erik
## [5] Virtanen, Artturi I. -- Jorma, Juho
## [6] Virtanen, Artturi I. -- Linkola, Hilkka
## [7] Virtanen, Artturi I. -- Linnasalmi, Annikki
## [8] Jorma, Juho -- Linkola, Hilkka
## [9] Jorma, Juho -- Linnasalmi, Annikki
## [10] Linkola, Hilkka -- Linnasalmi, Annikki
## + ... omitted several edges
V returns the vertices, and E the edges. One of the things I note is that there appear to be extra spaces in some of the names. I should do something about that. But it will have to wait, I’m impatient!
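For when I get around to it, a minimal sketch of why those spaces matter and how they could be removed. The three names are a toy example, not taken from the graph:

```r
# The same author with and without a leading space counts as two different names
raw_names <- c("Astrup, Tage", " Brodersen, Rolf", "Brodersen, Rolf")
length(unique(raw_names))

# trimws() strips leading and trailing whitespace
clean_names <- trimws(raw_names)
length(unique(clean_names))

# Alternatively, split on the semicolon plus any whitespace that follows it
strsplit("Astrup, Tage; Brodersen, Rolf", ";\\s*")[[1]]
```

This is probably part of the reason the graph reports more authors than I would expect.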
And now, a graph!:
plot(network)
Oh dear. That is not at all nice!
Let’s begin by trying to only plot the graph for the year 1947:
graph_1947 <- subgraph.edges(network, which(E(network)$year=="1947"))
graph_1947
## IGRAPH UN-- 92 83 --
## + attr: name (v/c), year (e/c)
## + edges (vertex names):
## [1] Astrup, Tage -- Brodersen, Rolf
## [2] Blom, Jakob -- Rosted, Carl Olof
## [3] Erdtman, Holger -- Gripenberg, Jarl
## [4] Byström, Anders -- Almin, Karl Erik
## [5] Virtanen, Artturi I. -- Jorma, Juho
## [6] Virtanen, Artturi I. -- Linkola, Hilkka
## [7] Virtanen, Artturi I. -- Linnasalmi, Annikki
## [8] Jorma, Juho -- Linkola, Hilkka
## + ... omitted several edges
subgraph.edges creates a new graph, a subgraph, that only includes the specified edges and the vertices they connect. Here I take the network. Its edges have an attribute, year, and I keep the edges where that attribute equals 1947.
I now have a new graph, with 92 nodes and 83 edges. The plot is still a bit confusing. I’m not actually sure that the edges are plotted. Let’s remove the author names (vertex.label=NA), and make the vertices smaller (vertex.size=4):
plot(graph_1947,vertex.label=NA, vertex.size=4)
Much better. There are some “loops”. They appear when authors have written more than one paper. There are also a couple of crossing edges. And what is much worse, plotting the graph repeatedly, gives different looking graphs. There is an element of randomness in the plot. That is going to give me problems when I get to animating the networks. I need to be sure that the vertices stay in the same place.
The way to do something about that? The function set.seed().
set.seed(47)
plot(graph_1947,vertex.label=NA, vertex.size=4)
Geeks amongst my readers will note that I’m a trekkie 🙂
Oh, and the seeding does not look necessary here – but when I ran the script through R again, it was set, and the two graphs are therefore identical.
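A seed makes one plot reproducible, but for the animation the same author also has to sit in the same spot in every year. One possible approach, sketched here on a made-up graph and not necessarily what I will end up doing, is to compute the layout once for the whole network and reuse the relevant rows for each yearly subgraph:

```r
library(igraph)

# Toy edge list spanning two years
edges <- data.frame(
  author1 = c("Author A", "Author A", "Author C"),
  author2 = c("Author B", "Author C", "Author D"),
  year    = c("1947", "1947", "1948"),
  stringsAsFactors = FALSE
)
g <- graph_from_data_frame(edges, directed = FALSE)

# One layout for the full graph, scaled to [-1, 1] and indexed by author name
set.seed(47)
coords <- norm_coords(layout_with_fr(g))
rownames(coords) <- V(g)$name

# Subgraph for a single year, as before
g_1947 <- subgraph.edges(g, which(E(g)$year == "1947"))

# Pick the coordinates of the authors present in this year;
# rescale = FALSE stops plot() from stretching them to fill the frame
coords_1947 <- coords[V(g_1947)$name, , drop = FALSE]
plot(g_1947, layout = coords_1947, rescale = FALSE,
     vertex.label = NA, vertex.size = 4)
```

Without rescale = FALSE, each yearly plot would be stretched to fill the frame, and the vertices would move between frames anyway.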
That’s it for now, time to go to bed!