Language detection for tweets: Part 3

by Adam Green on May 23, 2012

in Twitter API Tutorials, Twitter Language Detection

In yesterday’s installment we learned how to get the most likely language for a tweet with the detectSimple() function. We also discovered that this library sometimes fails when the text is only 2 or 3 words long. The Text_LanguageDetect library has a more advanced function, called detect(), that returns an array of possible language matches with a numeric confidence level for each. The higher the confidence level, the more likely the language is a match.

language_detect4.php

<?php
// language_detect4.php

require_once 'Text/LanguageDetect.php';
$oLang = new Text_LanguageDetect();

$long_french = 'qui propose une école maternelle bilingue français';
print "Long French: $long_french<br/>";
print "Language: <br/>";
print_r($oLang->detect($long_french));

$long_english = 'the latest episode of american idol sucks';
print "<br/><br/>Long English: $long_english<br/>";
print "Language: <br/>";
print_r($oLang->detect($long_english));

?>

If you run this script in a browser, you will see every candidate language the library considered, sorted from highest to lowest confidence level.

Long French: qui propose une école maternelle bilingue français
Language:
Array ( [french] => 0.32340136054422 [romanian] => 0.25102040816327 [slovene] => 0.24061224489796 [danish] => 0.23877551020408 [latin] => 0.21857142857143 [italian] => 0.21761904761905 [english] => 0.21040816326531 [norwegian] => 0.20884353741497 [portuguese] => 0.20047619047619 [estonian] => 0.18700680272109 [spanish] => 0.18503401360544 [croatian] => 0.18428571428571 [pidgin] => 0.17292517006803 [slovak] => 0.16809523809524 [dutch] => 0.16224489795918 [czech] => 0.14707482993197 [german] => 0.14544217687075 [tagalog] => 0.14510204081633 [cebuano] => 0.11734693877551 [finnish] => 0.1147619047619 [swedish] => 0.11469387755102 [lithuanian] => 0.11333333333333 [latvian] => 0.10857142857143 [polish] => 0.1069387755102 [swahili] => 0.10551020408163 [turkish] => 0.094149659863946 [hawaiian] => 0.09204081632653 [indonesian] => 0.089727891156463 [albanian] => 0.080544217687075 [hausa] => 0.077142857142857 [azeri] => 0.067074829931973 [hungarian] => 0.052517006802721 [icelandic] => 0.052448979591837 [vietnamese] => 0.051768707482993 [welsh] => 0.051700680272109 [somali] => 0.037142857142857 [bengali] => 0 [mongolian] => 0 )

Long English: the latest episode of american idol sucks
Language:
Array ( [english] => 0.26414634146341 [pidgin] => 0.20056910569106 [spanish] => 0.17081300813008 [slovak] => 0.16130081300813 [estonian] => 0.15845528455285 [italian] => 0.15471544715447 [welsh] => 0.14829268292683 [latin] => 0.14739837398374 [danish] => 0.14585365853659 [romanian] => 0.14268292682927 [french] => 0.1409756097561 [norwegian] => 0.14048780487805 [dutch] => 0.12666666666667 [portuguese] => 0.12065040650406 [german] => 0.1130081300813 [indonesian] => 0.1079674796748 [slovene] => 0.090487804878049 [swahili] => 0.09 [latvian] => 0.086991869918699 [turkish] => 0.08 [azeri] => 0.079512195121951 [swedish] => 0.075447154471545 [albanian] => 0.07479674796748 [hungarian] => 0.074065040650407 [hawaiian] => 0.072926829268293 [finnish] => 0.07260162601626 [tagalog] => 0.072113821138211 [cebuano] => 0.060894308943089 [hausa] => 0.059105691056911 [croatian] => 0.057967479674797 [lithuanian] => 0.055528455284553 [somali] => 0.053170731707317 [polish] => 0.043170731707317 [czech] => 0.041219512195122 [vietnamese] => 0.040975609756098 [icelandic] => 0.034146341463415 [mongolian] => 0 [bengali] => 0 )
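When only the strongest few candidates matter, the full array can be trimmed down. Here is a minimal sketch, assuming an array shaped like the output above. The top_candidates() function is a hypothetical helper, not part of Text_LanguageDetect, and the scores are rounded copies of the French output so the script runs without the library.

```php
<?php
// Hypothetical helper (not part of Text_LanguageDetect): keep the $n
// best candidates from an array of language => confidence pairs.
function top_candidates(array $scores, $n = 3) {
	// Sort by confidence, highest first, keeping the language-name keys
	arsort($scores);

	// Keep only the first $n entries
	return array_slice($scores, 0, $n, true);
}

// Rounded scores copied from the French output above
$french_scores = array(
	'french'   => 0.3234,
	'romanian' => 0.2510,
	'slovene'  => 0.2406,
	'danish'   => 0.2388,
);

$top = top_candidates($french_scores, 2);
foreach ($top as $language => $score) {
	print "$language: " . round($score * 100) . "%<br/>";
}

// Gap between the best match and the runner-up
$values = array_values($top);
$margin = $values[0] - $values[1];
print "Margin: " . round($margin * 100) . " points<br/>";
```

A small margin between the first two candidates is one hint that the top match may not be reliable, although how small is too small is a judgment call this sketch does not try to make.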

Manipulating arrays is sometimes tricky, so here is an extension of this script that delivers the most likely language for a string, along with its confidence level and number of words.

language_detect5.php

<?php
// language_detect5.php

require_once 'Text/LanguageDetect.php';
$oLang = new Text_LanguageDetect();

$long_french = 'qui propose   une école maternelle bilingue français';
print "Long French: $long_french<br/>";
language_info($long_french);

$long_english = 'the latest episode of american idol sucks';
print "<br/><br/>Long English: $long_english<br/>";
language_info($long_english);

function language_info($text) {
	global $oLang;
	
	// detect() lists matches in order by confidence level, so the
	// first key and value are the most likely language and its score
	$matches = $oLang->detect($text);
	$confidence = reset($matches);
	$language = key($matches);
	
	// Convert the confidence level to a whole-number percentage for convenience
	$confidence = round($confidence * 100);
	
	// Get the number of words, treating any run of spaces as one separator
	$words = preg_split('/ +/', $text, -1, PREG_SPLIT_NO_EMPTY);
	$word_count = count($words);
	
	print "Language: $language<br/>";
	print "Confidence: $confidence%<br/>";
	print "Words: $word_count<br/>";
}

?>

Long French: qui propose une école maternelle bilingue français
Language: french
Confidence: 33%
Words: 7

Long English: the latest episode of american idol sucks
Language: english
Confidence: 26%
Words: 7
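Printing from inside the function is handy for a demo, but returning the values makes them easier to reuse. Here is a minimal sketch of that variation. The language_summary() name and its array keys are hypothetical, and it takes the detect() result as a parameter (with rounded scores copied from the English output above) so that it runs without the library installed.

```php
<?php
// Hypothetical helper: summarize a detect()-style result without printing.
// $matches is assumed to be ordered with the best match first.
function language_summary($text, array $matches) {
	// The first key and value are the most likely language and its score;
	// key() returns null when the array is empty
	$score = reset($matches);
	$language = key($matches);
	$confidence = ($language === null) ? 0 : (int) round($score * 100);

	// Count the words, treating any run of spaces as one separator
	$words = preg_split('/ +/', $text, -1, PREG_SPLIT_NO_EMPTY);

	return array(
		'language'   => $language,
		'confidence' => $confidence,
		'words'      => count($words),
	);
}

// Rounded scores copied from the English output above
$english_scores = array(
	'english' => 0.2641,
	'pidgin'  => 0.2006,
	'spanish' => 0.1708,
);

$info = language_summary('the latest   episode of american idol sucks', $english_scores);
print "Language: {$info['language']}<br/>";
print "Confidence: {$info['confidence']}%<br/>";
print "Words: {$info['words']}<br/>";
```

In a real script the second argument would be the array returned by $oLang->detect($text).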

We now have the basic tools to create a library of language functions that can be used when processing tweets from the Twitter API. Come back tomorrow and we’ll work out the details of such a library.
