The docs for the Text_LanguageDetect library say that you need to pass it 4-5 sentences to get an accurate language identification, but as we saw in part 1 of this tutorial, even a single sentence seems to work. This is great, since we will need this to work with tweets that average 5-6 words. So how small a string will give you accurate results? It varies with each language, but from my tests you need at least 3-4 words in most languages.
This sample script demonstrates the problem with very short strings.
```php
<?php
// language_detect2.php

require_once 'Text/LanguageDetect.php';
$oLang = new Text_LanguageDetect();

$long_french = 'qui propose une école maternelle bilingue français';
print "Long French: $long_french<br/>";
print "Language: " . $oLang->detectSimple($long_french) . "<br/>";

$short_french = 'école maternelle';
print "<br/>Short French: $short_french<br/>";
print "Language: " . $oLang->detectSimple($short_french) . "<br/>";

$long_english = 'the latest episode of american idol sucks';
print "<br/>Long English: $long_english<br/>";
print "Language: " . $oLang->detectSimple($long_english) . "<br/>";

$short_english = 'american idol';
print "<br/>Short English: $short_english<br/>";
print "Language: " . $oLang->detectSimple($short_english) . "<br/>";

?>
```
Running this example in a browser shows that with just 2 words, the language returned by the library can’t be trusted.
Long French: qui propose une école maternelle bilingue français
Language: french
Short French: école maternelle
Language: danish
Long English: the latest episode of american idol sucks
Language: english
Short English: american idol
Language: welsh
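One way to act on this is to skip detection for strings that are too short to trust. The following is only a sketch, assuming PHP 7 or later: `countWords()` and `detectIfLongEnough()` are hypothetical helper names, not part of Text_LanguageDetect, and the default minimum of 3 words is simply the lower end of the 3-4 word estimate from my tests.

```php
<?php
// Sketch only: countWords() and detectIfLongEnough() are hypothetical helpers,
// not part of Text_LanguageDetect.

// Count whitespace-separated words in a UTF-8 string.
function countWords(string $text): int
{
    $words = preg_split('/\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
    // preg_split() returns false if the input is not valid UTF-8.
    return $words === false ? 0 : count($words);
}

// Run $detect only when the text has at least $minWords words.
// $detect is any callable that takes a string and returns a language name,
// for example [$oLang, 'detectSimple'].
function detectIfLongEnough(callable $detect, string $text, int $minWords = 3)
{
    if (countWords($text) < $minWords) {
        return null; // too short to trust the result
    }
    return $detect($text);
}
```

With the strings from the script above, calling `detectIfLongEnough([$oLang, 'detectSimple'], $short_english)` would return null rather than a guess, because 'american idol' has only 2 words.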
I’ve found that the best way to test the accuracy of this language detection method is to process a sample set of tweets with it, and examine the results for different languages. The next script will do this with a list of 16 tweets I pulled out of a database I built for a firm that consults for drug companies in Europe. They need to collect tweets about different diseases, and separate the results by language. The sample table we’ll process here has 4 tweets in each of 4 languages. The code uses my standard db_lib.php database library to read the tweets from the database.
```php
<?php
// language_detect3.php

// Get ready to use the language library
require_once 'Text/LanguageDetect.php';
$oLang = new Text_LanguageDetect();

// Connect to the database with the sample tweet table
require_once('db_lib.php');
$oDB = new db;

// Loop through the sample tweets
$query = "SELECT tweet_text FROM language";
$result = $oDB->select($query);
while ($row=mysqli_fetch_assoc($result)) {

  // Print the detected language
  $text = $row['tweet_text'];
  print "Text: $text<br/>";
  print "Language: " . $oLang->detectSimple($text) . "<br/><br/>";
}
?>
```
Running this script in a browser shows that with a reasonably long tweet, the language identification is really good, especially for a free library.
Text: Neugründung von “Deutsche Diabetes-Hilfe – Menschen mit Diabetes” http://t.co/mRRAhvPh
Language: german
Text: RT @minihex: BMI denkt sich wiedermal:ein bissl rassistischer gehts noch-gesetzesentwurf sieht neue schikanen f asylwerberInnen vor htt ...
Language: german
Text: @myMONK_de naja, da wäre noch die üble Bronchitis, die ich seit über 2 Wochen habe, aber Magen-Darm ist wenigstens wieder okay endlich
Language: german
Text: COPD - eine Gefahr für die Lunge nicht nur bei Rauchern: http://t.co/8B2UnJng
Language: german
Text: Mierda de profesor que no supo explicar nada de las columnas de dominancia ocular y ahora no entiendo nada
Language: spanish
Text: #diabetesla Hace poco puse el enlace a 1foro de diabetes. Se comenta que insulina Lantus provoca depresión.¿Algo de cierto? 10 días con ella
Language: spanish
Text: Queridos padres, tengo casi 16 años, creerme ya he aprendido a vivir con la diabetes, me acompaña desde los 3 años, así que por favor +
Language: spanish
Text: La OMS advierte sobre el aumento de casos de hipertensión y diabetes en el mundo - http://t.co/074JHwBh http://t.co/PKUyPLuG
Language: spanish
Text: Les problemes ou sa fait maigrir ou sa fait grossir. Personnellemnt je suis devenue obese. Fais chié!
Language: french
Text: @Mangeunepomme C'est ce que je compte faire
Language: french
Text: Genre c'est une grosse limite obese et elle fait la meuf genre c'est une salope
Language: french
Text: @GlodieGabrielle hehehehehe. A kelke kilos detr obese, u va mettre ta tente dans une salle de gym de la place
Language: french
Text: Oh shoot looks like I've got hay fever... This is bad :/
Language: english
Text: RT @StephenAtHome: A study predicts nearly half of all Americans will be obese by 2030. But with a little American ingenuity I bet we ca ...
Language: english
Text: Vital Signs: Options for weight loss: In addition to dietitians, counselors and life coaches who can walk you th... http://t.co/hBam9XiF
Language: english
Text: I was looking for some weight loss computer support in my area, but there's no low-cal IT in my locality.
Language: english
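Since the goal for this client is to separate tweets by language, the same loop could collect the tweets into per-language groups instead of printing them one at a time. This is a minimal sketch, assuming PHP 7 or later: `groupByLanguage()` is a hypothetical helper, not part of Text_LanguageDetect or db_lib.php, and the 'unknown' key is my own choice for texts where the detector returns nothing.

```php
<?php
// Sketch only: groupByLanguage() is a hypothetical helper, not part of
// Text_LanguageDetect or db_lib.php.

// Group texts by the language that $detect reports for each one.
// $detect is any callable that takes a string and returns a language name
// (or null), for example [$oLang, 'detectSimple'].
function groupByLanguage(array $texts, callable $detect): array
{
    $groups = [];
    foreach ($texts as $text) {
        $language = $detect($text);
        // Keep undetected texts together under their own key.
        $key = is_string($language) ? $language : 'unknown';
        $groups[$key][] = $text;
    }
    return $groups;
}
```

If the tweet_text values are first collected into an array `$tweets`, then `groupByLanguage($tweets, [$oLang, 'detectSimple'])` would return an array keyed by language name. Going by the results shown above, the 16 sample tweets would end up in four groups of four.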
You can now see how easy it is to get the language for a series of tweets. We’ll dig deeper tomorrow and learn how to use this library’s confidence level results. That will let you select only the tweets that have a high chance of being in the language you need. Then later in the week I’ll create a standard language detection function that you can call whenever you need to process tweets.