In yesterday’s installment we learned how to get the most likely language for a tweet with the detectSimple() function. We also discovered that this library sometimes fails when you get down to just 2 or 3 words. The Text_LanguageDetect library has a more advanced function, called detect(), that delivers an array of possible language matches and a numeric confidence level for each. The higher the confidence level, the more likely the language is a match.
```php
<?php
// language_detect4.php
require_once 'Text/LanguageDetect.php';
$oLang = new Text_LanguageDetect();

$long_french = 'qui propose une école maternelle bilingue français';
print "Long French: $long_french<br/>";
print "Language: <br/>";
print_r($oLang->detect($long_french));

$long_english = 'the latest episode of american idol sucks';
print "<br/><br/>Long English: $long_english<br/>";
print "Language: <br/>";
print_r($oLang->detect($long_english));
?>
```
If you run this script in a browser, you will see that detect() returns many candidate languages, sorted from highest to lowest confidence level.
Long French: qui propose une école maternelle bilingue français
Language:
Array ( [french] => 0.32340136054422 [romanian] => 0.25102040816327 [slovene] => 0.24061224489796 [danish] => 0.23877551020408 [latin] => 0.21857142857143 [italian] => 0.21761904761905 [english] => 0.21040816326531 [norwegian] => 0.20884353741497 [portuguese] => 0.20047619047619 [estonian] => 0.18700680272109 [spanish] => 0.18503401360544 [croatian] => 0.18428571428571 [pidgin] => 0.17292517006803 [slovak] => 0.16809523809524 [dutch] => 0.16224489795918 [czech] => 0.14707482993197 [german] => 0.14544217687075 [tagalog] => 0.14510204081633 [cebuano] => 0.11734693877551 [finnish] => 0.1147619047619 [swedish] => 0.11469387755102 [lithuanian] => 0.11333333333333 [latvian] => 0.10857142857143 [polish] => 0.1069387755102 [swahili] => 0.10551020408163 [turkish] => 0.094149659863946 [hawaiian] => 0.09204081632653 [indonesian] => 0.089727891156463 [albanian] => 0.080544217687075 [hausa] => 0.077142857142857 [azeri] => 0.067074829931973 [hungarian] => 0.052517006802721 [icelandic] => 0.052448979591837 [vietnamese] => 0.051768707482993 [welsh] => 0.051700680272109 [somali] => 0.037142857142857 [bengali] => 0 [mongolian] => 0 )
Long English: the latest episode of american idol sucks
Language:
Array ( [english] => 0.26414634146341 [pidgin] => 0.20056910569106 [spanish] => 0.17081300813008 [slovak] => 0.16130081300813 [estonian] => 0.15845528455285 [italian] => 0.15471544715447 [welsh] => 0.14829268292683 [latin] => 0.14739837398374 [danish] => 0.14585365853659 [romanian] => 0.14268292682927 [french] => 0.1409756097561 [norwegian] => 0.14048780487805 [dutch] => 0.12666666666667 [portuguese] => 0.12065040650406 [german] => 0.1130081300813 [indonesian] => 0.1079674796748 [slovene] => 0.090487804878049 [swahili] => 0.09 [latvian] => 0.086991869918699 [turkish] => 0.08 [azeri] => 0.079512195121951 [swedish] => 0.075447154471545 [albanian] => 0.07479674796748 [hungarian] => 0.074065040650407 [hawaiian] => 0.072926829268293 [finnish] => 0.07260162601626 [tagalog] => 0.072113821138211 [cebuano] => 0.060894308943089 [hausa] => 0.059105691056911 [croatian] => 0.057967479674797 [lithuanian] => 0.055528455284553 [somali] => 0.053170731707317 [polish] => 0.043170731707317 [czech] => 0.041219512195122 [vietnamese] => 0.040975609756098 [icelandic] => 0.034146341463415 [mongolian] => 0 [bengali] => 0 )
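Because the result is an ordinary associative array that is already sorted from best to worst match, trimming it down takes very little code. Here is a minimal sketch, assuming only the array shape shown in the output above. The helper names top_matches() and matches_above() are made up for this example and are not part of Text_LanguageDetect, and the sample scores are copied from the French output rather than produced by a live call.

```php
<?php
// Hypothetical helpers, not part of Text_LanguageDetect.
// Both expect an array shaped like the return value of detect():
// language name => confidence, sorted from best to worst match.

// Keep only the $n best matches
function top_matches(array $scores, $n) {
    return array_slice($scores, 0, $n, true);
}

// Keep only matches whose confidence is at least $min_confidence
function matches_above(array $scores, $min_confidence) {
    return array_filter($scores, function ($confidence) use ($min_confidence) {
        return $confidence >= $min_confidence;
    });
}

// Sample scores copied from the top of the French output above
$french_scores = array(
    'french'   => 0.32340136054422,
    'romanian' => 0.25102040816327,
    'slovene'  => 0.24061224489796,
    'danish'   => 0.23877551020408,
);

print_r(array_keys(top_matches($french_scores, 2)));       // french, romanian
print_r(array_keys(matches_above($french_scores, 0.25)));  // french, romanian
```

The 0.25 cutoff is only an illustration. Confidence values for short texts tend to be low and close together, as the output above shows, so any threshold would need tuning against real tweets.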
Manipulating arrays is sometimes tricky, so here is an extension of this script that delivers the most likely language for a string, along with its confidence level and number of words.
```php
<?php
// language_detect5.php
require_once 'Text/LanguageDetect.php';
$oLang = new Text_LanguageDetect();

$long_french = 'qui propose une école maternelle bilingue français';
print "Long French: $long_french<br/>";
language_info($long_french);

$long_english = 'the latest episode of american idol sucks';
print "<br/><br/>Long English: $long_english<br/>";
language_info($long_english);

function language_info($text) {
    global $oLang;

    // detect() returns language => confidence pairs, best match first
    $matches = $oLang->detect($text);

    // Split out the key and value of the first array element
    // (each() was removed in PHP 8, so use reset() and key() instead)
    $confidence = reset($matches);
    $language = key($matches);

    // Convert the confidence level to a whole-number percentage for convenience
    $confidence = round($confidence * 100, 0);

    // Get the number of words in this string, collapsing runs of spaces first
    // (eregi_replace() was removed in PHP 7, so use preg_replace() instead)
    $string = preg_replace('/ +/', ' ', $text);
    $array = explode(' ', $string);
    $word_count = count($array);

    print "Language: $language<br/>";
    print "Confidence: $confidence%<br/>";
    print "Words: $word_count<br/>";
}
?>
```
Long French: qui propose une école maternelle bilingue français
Language: french
Confidence: 32%
Words: 7
Long English: the latest episode of american idol sucks
Language: english
Confidence: 26%
Words: 7
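For reuse in other scripts, it may be handier to return these three values than to print them. The following is a rough sketch of that idea, not a finished design. The function name language_summary() is hypothetical, and it takes the scores array as a parameter so it can be tried without the library installed. The sample scores are copied from the English output earlier in this post.

```php
<?php
// Hypothetical helper, not part of Text_LanguageDetect.
// $scores is an array shaped like the return value of detect():
// language name => confidence, best match first.
function language_summary(array $scores, $text) {
    // The best match is the first element; an empty array gives false/null
    $confidence = reset($scores);
    $language = key($scores);

    // Count words, ignoring repeated, leading, and trailing whitespace
    $words = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);

    return array(
        'language'   => $language,
        'confidence' => ($confidence === false) ? 0 : (int) round($confidence * 100),
        'words'      => count($words),
    );
}

// Sample scores copied from the top of the English output above
$english_scores = array(
    'english' => 0.26414634146341,
    'pidgin'  => 0.20056910569106,
    'spanish' => 0.17081300813008,
);

$info = language_summary($english_scores, 'the latest episode of american idol sucks');
print "{$info['language']} {$info['confidence']}% {$info['words']} words\n";
// prints "english 26% 7 words"
```

With a live detector, the call would look like language_summary($oLang->detect($text), $text). Passing the scores in also avoids the global variable used in language_info().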
We now have the basic tools to create a library of language functions that can be used when processing tweets from the Twitter API. Come back tomorrow and we’ll work out the details of such a library.