Guessing languages

Those issue numbers are still frustrating me. Until I get over that, let's take a look at guessing languages.

We’ll begin by importing the data (and a library):

data <- readRDS(file="d:\\acta\\consistentdata.rda")
library(textcat)
## Warning: package 'textcat' was built under R version 3.2.5

As I mentioned, it is possible to help textcat by telling it that only certain languages are possible.

After the close analysis I made earlier, I now have the correct languages in the dataset. I have no idea how long textcat took to determine the language back then, since that was done on the fly. But let's find out.

timer <- proc.time()
textcat_guess <- textcat(data$title)
time_textcat <- proc.time() - timer
time_textcat
##    user  system elapsed 
##   91.67    0.07   92.15

That actually takes some time! There are:

nrow(data)
## [1] 16494

Which means that textcat processes

paste((nrow(data)/time_textcat[3]),"rows/second", sep=" ")
## [1] "178.990775908844 rows/second"

on this computer, while I’m doing other stuff. It still takes more than a minute to get through them all.
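The proc.time() bookkeeping can also be written with base R's system.time(), which evaluates an expression and returns the same user/system/elapsed triple. A minimal sketch: slow_guess() is a made-up stand-in for textcat(), there only so the snippet runs without the data.

```r
# Hypothetical stand-in for textcat(): anything that takes a little time per title.
slow_guess <- function(titles) {
  vapply(titles, function(t) { Sys.sleep(0.01); toupper(t) }, character(1))
}

titles <- c("first title", "second title", "third title")

# system.time() evaluates the expression and reports user, system and elapsed time.
timing <- system.time(guess <- slow_guess(titles))

# Rows per second, based on elapsed (wall-clock) time.
rows_per_second <- length(titles) / timing[["elapsed"]]
rows_per_second
```

The assignment inside system.time() still lands in the calling environment, so guess is available afterwards.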

Well, that can be done faster!

my.profiles <- TC_byte_profiles[names(TC_byte_profiles) %in% c("english", "french", "german")]
timer2 <- proc.time()
textcat_guess_withhint <- textcat(data$title, p=my.profiles)
time_textcat_withhint <- proc.time() - timer2
time_textcat_withhint
##    user  system elapsed 
##   19.01    0.00   19.05

Nice! That was fast. At least by comparison.

paste(round(time_textcat_withhint[3]/time_textcat[3]*100,1), " % of the time it took without hints.", sep=" ")
## [1] "20.7  % of the time it took without hints."

That’s about five times as fast.
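For the record, here is the arithmetic spelled out, using only the elapsed times and the row count printed above. The variable names are mine, chosen for this snippet.

```r
rows         <- 16494   # nrow(data)
elapsed_all  <- 92.15   # elapsed seconds without hints
elapsed_hint <- 19.05   # elapsed seconds with hints

share   <- round(elapsed_hint / elapsed_all * 100, 1)  # percent of the original time
speedup <- round(elapsed_all / elapsed_hint, 1)        # speed-up factor
rate    <- round(rows / elapsed_hint)                  # rows per second with hints

c(share = share, speedup = speedup, rate = rate)
##   share speedup    rate 
##    20.7     4.8   866.0
```

So "about five times" is really 4.8 times, and the hinted run gets through roughly 866 rows per second.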

But how good are the guesses? Let’s find out:

list1 <- c(data$language == textcat_guess)
list2 <- c(data$language == textcat_guess_withhint)

So. Without hints, we get:

paste(round(sum(list1, na.rm=TRUE)/length(data$language)*100,1),"%")
## [1] "91.2 %"

correct guesses.

And with hints, we get:

paste(round(sum(list2, na.rm=TRUE)/length(data$language)*100,1),"%")
## [1] "97.7 %"

correct guesses.

That is quite an improvement.
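One detail in those percentages: with na.rm=TRUE in the numerator and all rows in the denominator, any title where the comparison comes out as NA counts as a wrong guess. I assume that is what happens when textcat returns no guess at all. A small sketch with made-up data (not from the dataset) shows the difference from simply dropping those rows:

```r
# Toy data, not taken from the dataset.
truth <- c("english", "english", "french", "german")
guess <- c("english", "catalan", "french", NA)  # NA: no guess for the last title

hits <- truth == guess  # TRUE FALSE TRUE NA

# As in the post: an NA counts as a miss, because the denominator is all rows.
acc_all_rows <- sum(hits, na.rm = TRUE) / length(truth)

# Alternative: drop the rows without a guess before averaging.
acc_guessed_only <- mean(hits, na.rm = TRUE)

c(all_rows = acc_all_rows, guessed_only = acc_guessed_only)
```

Here the first version gives 2 out of 4 and the second 2 out of 3. Counting a missing guess as a miss seems the fairer choice for this comparison.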

Where are the mistakes? Basically, if textcat guesses that the language is Romanian, how many of those papers are actually English, German or French? Does textcat think that Romanian lies closer to German than to English?

Please note that I’m not trying to bash textcat. I’m asking it to guess the language, based on no more than a single sentence, riddled with chemical terms. I’m actually quite impressed that it does it that well.

I have two lists: data$language, which is the correct language, and textcat_guess, which is the guessed language.

So, let’s define a new dataframe with those two lists:

correct <- data$language
guessed <- textcat_guess

confmat <- data.frame(correct, guessed, stringsAsFactors = FALSE)

I’m calling it a confusion matrix. I’m not sure that is actually the correct term. Anyway, this is the result:

table(confmat$guessed, confmat$correct)
##                      
##                       english french german
##   afrikaans                13      0      0
##   albanian                  1      0      0
##   basque                   11      0      0
##   catalan                 397      4      0
##   croatian-ascii            2      0      0
##   czech-iso8859_2           2      0      0
##   danish                  139      0      2
##   dutch                    12      0      2
##   english               14763      0      2
##   esperanto                18      0      1
##   estonian                  2      0      0
##   finnish                   3      0      1
##   french                   39     23      5
##   frisian                  26      0      2
##   german                   58      0    253
##   indonesian                3      0      1
##   irish                     4      0      0
##   italian                   2      0      0
##   latin                   162      0      0
##   lithuanian                2      0      0
##   manx                      4      0      0
##   middle_frisian           52      0      1
##   portuguese                6      0      0
##   romanian                117      0      1
##   rumantsch                38      0      0
##   scots                    78      0      0
##   scots_gaelic             12      0      0
##   serbian-ascii             1      0      0
##   slovak-ascii            148      0      2
##   slovak-windows1250        1      0      0
##   slovenian-iso8859_2       9      0      0
##   spanish                  20      0      0
##   swedish                   7      0      1
##   tagalog                   3      0      0
##   welsh                    34      0      2

Neat. Mostly it is English-language papers that are assigned a wrong language. Of course it is: there are far more English papers than anything else. But it is interesting that textcat thinks 39 papers that are actually English are in French. Among the papers it labels French, more are actually English (39) than French (23). Four French papers are assigned the language Catalan.
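If the full table is too much to scan, the wrong guesses can be pulled out and sorted. This is only a sketch on made-up vectors, not the real data; naming the arguments to table() also labels the two axes, which saves explaining afterwards which one is which.

```r
# Toy vectors standing in for data$language and textcat_guess.
correct <- c("english", "english", "english", "french", "german", "english")
guessed <- c("english", "catalan", "catalan", "french", "german", "latin")

# Named arguments become the dimension names of the table.
tab <- table(guessed = guessed, correct = correct)

# Long format: one row per (guessed, correct) pair with its count in Freq.
pairs <- as.data.frame(tab)

# Keep only the pairs where the guess was wrong and actually occurred.
wrong  <- as.character(pairs$guessed) != as.character(pairs$correct) & pairs$Freq > 0
errors <- pairs[wrong, ]

# Most frequent mistakes first.
errors <- errors[order(-errors$Freq), ]
errors
```

On the toy data this leaves two rows: catalan guessed for english twice, and latin guessed for english once.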

Anyway, this is not that interesting. It would be interesting if I had another library to determine languages, that I could compare textcat with.

And a meta-result. From when I began thinking about making this table until I actually found a way to do it, 20 hours passed. I wasn’t thinking about it all the time. But still.

10 minutes after I figured out a very complicated way to do it, I realised that there was a far simpler way. It might not even be the simplest. But it is a lot simpler than what I came up with first. I’m obviously a long way from being good at R.

Back to the data: how well does textcat do when we help it by telling it that there are only papers in English, German and French?

hinted <- textcat_guess_withhint

conf2mat <- data.frame(correct, hinted, stringsAsFactors = FALSE)
table(conf2mat$hinted, conf2mat$correct)
##          
##           english french german
##   english   15825      0      3
##   french      136     26      9
##   german      228      1    264

The correct language runs across the columns, and the guessed language down the rows. Of the papers that are actually French, 26 are recognized as such. One is determined to be German.
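The table can be boiled down to two numbers per language: the share of papers in a language that textcat recognizes (usually called recall), and the share of its guesses for a language that are right (precision). A sketch that retypes the counts from the hinted table by hand; the matrix m and the two vector names are mine.

```r
# Counts copied from the hinted table: rows = guessed, columns = correct.
langs <- c("english", "french", "german")
m <- matrix(
  c(15825,  0,   3,
      136, 26,   9,
      228,  1, 264),
  nrow = 3, byrow = TRUE,
  dimnames = list(guessed = langs, correct = langs)
)

# Share of the papers in each language that were recognized as such.
recall <- diag(m) / colSums(m)

# Share of the guesses for each language that were right.
precision <- diag(m) / rowSums(m)

round(rbind(recall, precision), 3)
```

Read this way, 26 of the 27 French papers are found, but only 26 of the 171 papers labelled French really are French. With so few French and German papers, even a small fraction of misread English titles swamps them.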

I still think textcat does a good job.

That was that. Let’s take a look at the historical decline of French and German.