Language detection for tweets: Part 2

by Adam Green on May 22, 2012

in Twitter API Tutorials, Twitter Language Detection

The docs for the Text_LanguageDetect library say that you need to pass it 4-5 sentences to get an accurate language identification, but as we saw in part 1 of this tutorial, even a single sentence seems to work. This is great, since we will need this to work with tweets that average 5-6 words. So how small a string will give you accurate results? It varies with each language, but from my tests you need at least 3-4 words in most languages.

This sample script demonstrates the problem.

language_detect2.php

<?php
// language_detect2.php

require_once 'Text/LanguageDetect.php';
$oLang = new Text_LanguageDetect();

$long_french = 'qui propose une école maternelle bilingue français';
print "Long French: $long_french<br/>";
print "Language: " . $oLang->detectSimple($long_french) . "<br/>";

$short_french = 'école maternelle';
print "<br/>Short French: $short_french<br/>";
print "Language: " . $oLang->detectSimple($short_french) . "<br/>";

$long_english = 'the latest episode of american idol sucks';
print "<br/>Long English: $long_english<br/>";
print "Language: " . $oLang->detectSimple($long_english) . "<br/>";

$short_english = 'american idol';
print "<br/>Short English: $short_english<br/>";
print "Language: " . $oLang->detectSimple($short_english) . "<br/>";

?>

Running this example in a browser shows that with just 2 words, the language returned by the library can’t be trusted.

Long French: qui propose une école maternelle bilingue français
Language: french

Short French: école maternelle
Language: danish

Long English: the latest episode of american idol sucks
Language: english

Short English: american idol
Language: welsh
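
Given results like these, one simple guard is to skip detection when a string is too short. Here is a minimal sketch of that idea. The function names count_words() and long_enough_to_detect() are hypothetical helpers, not part of Text_LanguageDetect, and the default of 3 words is only an assumption based on the rough 3-4 word estimate above.

```php
<?php
// Hypothetical helpers, not part of Text_LanguageDetect.

// Count whitespace-separated words; the /u flag treats the text as UTF-8.
function count_words($text) {
    $words = preg_split('/\s+/u', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    // preg_split() returns false on invalid UTF-8; treat that as zero words.
    return $words === false ? 0 : count($words);
}

// Assumption: 3 words is the cutoff; tune it for the languages you collect.
function long_enough_to_detect($text, $min_words = 3) {
    return count_words($text) >= $min_words;
}

$samples = array('école maternelle', 'the latest episode of american idol sucks');
foreach ($samples as $sample) {
    $verdict = long_enough_to_detect($sample) ? 'detect' : 'skip';
    print "$sample: $verdict<br/>";
}
```

In a real script you would call $oLang->detectSimple($text) only when long_enough_to_detect($text) returns true, and leave the language empty otherwise.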

I’ve found that the best way to test the accuracy of this language detection method is to process a sample set of tweets with it, and examine the results for different languages. The next script will do this with a list of 16 tweets I pulled out of a database I built for a firm that consults for drug companies in Europe. They need to collect tweets about different diseases, and separate the results by language. The sample table we’ll process here has 4 tweets in each of 4 languages. The code uses my standard db_lib.php database library to read the tweets from the database.

language_detect3.php

<?php
// language_detect3.php

// Get ready to use the language library
require_once 'Text/LanguageDetect.php';
$oLang = new Text_LanguageDetect();

// Connect to the database with the sample tweet table
require_once('db_lib.php');
$oDB = new db;

// Loop through the sample tweets
$query = "SELECT tweet_text FROM language";
$result = $oDB->select($query);
while ($row=mysqli_fetch_assoc($result)) {

	// Print the detected language	
	$text = $row['tweet_text'];
	print "Text: $text<br/>";
	print "Language: " . $oLang->detectSimple($text) . "<br/><br/>";
}
?>

Running this script in a browser shows that with a reasonably long tweet, the language identification is really good, especially for a free library.

Text: Neugründung von “Deutsche Diabetes-Hilfe – Menschen mit Diabetes” http://t.co/mRRAhvPh
Language: german

Text: RT @minihex: BMI denkt sich wiedermal:ein bissl rassistischer gehts noch-gesetzesentwurf sieht neue schikanen f asylwerberInnen vor htt ...
Language: german

Text: @myMONK_de naja, da wäre noch die üble Bronchitis, die ich seit über 2 Wochen habe, aber Magen-Darm ist wenigstens wieder okay endlich
Language: german

Text: COPD - eine Gefahr für die Lunge nicht nur bei Rauchern: http://t.co/8B2UnJng
Language: german

Text: Mierda de profesor que no supo explicar nada de las columnas de dominancia ocular y ahora no entiendo nada
Language: spanish

Text: #diabetesla Hace poco puse el enlace a 1foro de diabetes. Se comenta que insulina Lantus provoca depresión.¿Algo de cierto? 10 días con ella
Language: spanish

Text: Queridos padres, tengo casi 16 años, creerme ya he aprendido a vivir con la diabetes, me acompaña desde los 3 años, así que por favor +
Language: spanish

Text: La OMS advierte sobre el aumento de casos de hipertensión y diabetes en el mundo - http://t.co/074JHwBh http://t.co/PKUyPLuG
Language: spanish

Text: Les problemes ou sa fait maigrir ou sa fait grossir. Personnellemnt je suis devenue obese. Fais chié!
Language: french

Text: @Mangeunepomme C'est ce que je compte faire
Language: french

Text: Genre c'est une grosse limite obese et elle fait la meuf genre c'est une salope
Language: french

Text: @GlodieGabrielle hehehehehe. A kelke kilos detr obese, u va mettre ta tente dans une salle de gym de la place
Language: french

Text: Oh shoot looks like I've got hay fever... This is bad :/
Language: english

Text: RT @StephenAtHome: A study predicts nearly half of all Americans will be obese by 2030. But with a little American ingenuity I bet we ca ...
Language: english

Text: Vital Signs: Options for weight loss: In addition to dietitians, counselors and life coaches who can walk you th... http://t.co/hBam9XiF
Language: english

Text: I was looking for some weight loss computer support in my area, but there's no low-cal IT in my locality.
Language: english
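
Since the goal described earlier is to separate tweets by language, here is a rough sketch of how that grouping could look. The function group_by_language() is a hypothetical helper, and the stand-in detector just looks up the results printed above so that the sketch runs without PEAR installed. With the library loaded, you would pass array($oLang, 'detectSimple') as the detector instead. The 'unknown' bucket is an assumption for cases where the detector returns null.

```php
<?php
// Hypothetical helper: group texts by the language name that $detect returns.
// $detect is any callable that takes a string and returns a name or null.
function group_by_language(array $texts, $detect) {
    $groups = array();
    foreach ($texts as $text) {
        $language = call_user_func($detect, $text);
        if ($language === null) {
            $language = 'unknown'; // assumption: keep undetected text apart
        }
        $groups[$language][] = $text;
    }
    return $groups;
}

// Stand-in data: a few of the tweets above with the languages reported for them.
$known = array(
    "COPD - eine Gefahr für die Lunge nicht nur bei Rauchern: http://t.co/8B2UnJng" => 'german',
    "@Mangeunepomme C'est ce que je compte faire" => 'french',
    "Oh shoot looks like I've got hay fever... This is bad :/" => 'english',
    "I was looking for some weight loss computer support in my area, but there's no low-cal IT in my locality." => 'english',
);

// Stand-in detector: a plain lookup, used here in place of detectSimple().
$stand_in = function ($text) use ($known) {
    return isset($known[$text]) ? $known[$text] : null;
};

$groups = group_by_language(array_keys($known), $stand_in);
foreach ($groups as $language => $texts) {
    print "$language: " . count($texts) . "<br/>";
}
```

Passing the detector in as a callable keeps the grouping code separate from the library, which also makes it easy to try a different detection method later.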

You can now see how easy it is to get the language for a series of tweets. We’ll dig deeper tomorrow and learn how to use this library’s confidence level results. That will let you select only the tweets that have a high chance of being in the language you need. Then later in the week I’ll create a standard language detection function that you can call whenever you need to process tweets.
