<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>140dev &#187; Twitter Language Detection</title>
	<atom:link href="http://140dev.com/twitter-api-programming-blog/category/twitter-language-detection/feed/" rel="self" type="application/rss+xml" />
	<link>http://140dev.com</link>
	<description>Twitter API Programming Tips, Tutorials, Source Code Libraries and Consulting</description>
	<lastBuildDate>Wed, 31 Jul 2019 10:03:15 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.6</generator>
		<item>
		<title>Language detection for tweets: Part 4</title>
		<link>http://140dev.com/twitter-api-programming-blog/language-detection-for-tweets-part-4/</link>
		<comments>http://140dev.com/twitter-api-programming-blog/language-detection-for-tweets-part-4/#comments</comments>
		<pubDate>Thu, 24 May 2012 15:50:46 +0000</pubDate>
		<dc:creator>Adam Green</dc:creator>
				<category><![CDATA[Twitter API Tutorials]]></category>
		<category><![CDATA[Twitter Language Detection]]></category>

		<guid isPermaLink="false">http://140dev.com/?p=1472</guid>
		<description><![CDATA[There are two problems you typically want to solve with language detection for tweets. First you need to analyse the types of languages you end up with for a specific set of keywords, and determine the minimum confidence level needed to get a clean result. Then when you have that data, you can process a [&#8230;]]]></description>
				<content:encoded><![CDATA[<p></p><p>There are two problems you typically want to solve with <strong>language detection for tweets</strong>. First you need to analyse the types of languages you end up with for a specific set of keywords, and determine the minimum confidence level needed to get a clean result. Then when you have that data, you can process a tweet stream and pull out just the tweets that meet your goals. This simple language library will address both of these issues. </p>
<p><strong>language_lib.php</strong><br />
<table><tr><td class="code"><pre>&lt;?php
// language_lib.php

require_once 'Text/LanguageDetect.php';
$oLang = new Text_LanguageDetect();

// Return an array with the language data for any text
function language_info($text) {
 	global $oLang;
	
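 	// Take the key and value of the first (most likely) array element.
 	// each() was removed in PHP 8 and cannot take a function result by reference.
	$results = $oLang-&gt;detect($text);
	$language = key($results);
	$confidence = current($results);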
	
	// Convert the confidence level to a 2 digit integer for convenience
	$confidence = round($confidence*100,0);
	
	// Get the number of words in this string
	// preg_replace() replaces eregi_replace(), which was removed in PHP 7
	$string = preg_replace('/ +/', ' ', $text);
	$array = explode(&quot; &quot;, $string);
	$word_count = sizeof($array);
	
	return array( 'language' =&gt; $language,
		'confidence' =&gt; $confidence, 
		'word_count' =&gt; $word_count); 	
}

// Return 1 if the text meets your requirements, and 0 if not
function is_language($text, $target_language, $min_confidence, $min_words) {
 	global $oLang;
	
	// Get the number of words in this string
	// preg_replace() replaces eregi_replace(), which was removed in PHP 7
	$string = preg_replace('/ +/', ' ', $text);
	$array = explode(&quot; &quot;, $string);
	// Exit if there aren't enough words
	if (sizeof($array) &lt; $min_words) {return 0;}
	
 	// Test all the possible languages returned by detect()
	foreach($oLang-&gt;detect($text) as $language =&gt; $confidence) {
		$confidence = round($confidence*100,0);
		
		// We have a good tweet
		if ((strtolower($language) == strtolower($target_language)) &amp;&amp; 
			($confidence &gt;= $min_confidence)) {
				
			return 1;
		}
	}

	// No acceptable languages were found
	return 0;	
}

?&gt;</pre></td></tr></table></p>
<p>Let&#8217;s use the first library function, language_info(), to examine all the tweets in the sample database. In a real application, I would typically store the results in a database, so I could run some queries to find things like the average confidence level and number of words for tweets in different languages. Based on that data, I could build a quality control routine to pick out just the best tweets. For now you can test this idea by <a href="http://140dev.com/tutorials/language_detection/language_detect6.php">running</a> the next script in a browser. </p>
<p><strong><a href="http://140dev.com/tutorials/language_detection/language_detect6.php">language_detect6.php</a></strong><br />
<table><tr><td class="code"><pre>&lt;?php
// language_detect6.php

require_once 'language_lib.php';

// Connect to the database with the sample tweet table
require_once('db_lib.php');
$oDB = new db;

// Loop through the sample tweets
$query = &quot;SELECT tweet_text FROM language&quot;;
$result = $oDB-&gt;select($query);
while ($row=mysqli_fetch_assoc($result)) {

	// Print the detected language info
	$text = $row['tweet_text'];
	print &quot;Text: $text&lt;br/&gt;&quot;;
	print_r( language_info($text));
	print &quot;&lt;br/&gt;&lt;br/&gt;&quot;;
}
?&gt;
</pre></td></tr></table></p>
<p>I won&#8217;t bother including all the results from this script, but you can see from the first few tweets that we now have a way of extracting what we need for all the tweets in a stream.</p>
<p><code>Text: Neugründung von “Deutsche Diabetes-Hilfe – Menschen mit Diabetes” http://t.co/mRRAhvPh<br />
Array ( [language] => german [confidence] => 26 [word_count] => 9 ) </p>
<p>Text: RT @minihex: BMI denkt sich wiedermal:ein bissl rassistischer gehts noch-gesetzesentwurf sieht neue schikanen f asylwerberInnen vor htt ...<br />
Array ( [language] => german [confidence] => 32 [word_count] => 18 ) </p>
<p>Text: @myMONK_de naja, da wäre noch die üble Bronchitis, die ich seit über 2 Wochen habe, aber Magen-Darm ist wenigstens wieder okay endlich<br />
Array ( [language] => german [confidence] => 32 [word_count] => 23 ) </p>
<p>Text: COPD - eine Gefahr für die Lunge nicht nur bei Rauchern: http://t.co/8B2UnJng<br />
Array ( [language] => german [confidence] => 41 [word_count] => 12 ) </code></p>
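<p>The kind of summary mentioned earlier (average confidence level and number of words per language) can also be sketched in plain PHP before any database is involved. This is only a rough illustration: summarize_languages() is a hypothetical helper, not part of the library, and it assumes each row has the same keys that language_info() returns. The sample rows are copied from the output shown in this post.</p>

```php
<?php
// Hypothetical helper: average confidence and word count per language.
// Assumes each row has 'language', 'confidence' and 'word_count' keys,
// in the same shape that language_info() returns.
function summarize_languages(array $rows) {
    $totals = array();
    foreach ($rows as $row) {
        $lang = $row['language'];
        if (!isset($totals[$lang])) {
            $totals[$lang] = array('tweets' => 0, 'confidence' => 0, 'words' => 0);
        }
        $totals[$lang]['tweets']++;
        $totals[$lang]['confidence'] += $row['confidence'];
        $totals[$lang]['words'] += $row['word_count'];
    }

    // Turn the running totals into averages
    $summary = array();
    foreach ($totals as $lang => $t) {
        $summary[$lang] = array(
            'tweets' => $t['tweets'],
            'avg_confidence' => $t['confidence'] / $t['tweets'],
            'avg_words' => $t['words'] / $t['tweets'],
        );
    }
    return $summary;
}

// Sample rows copied from the script output in this post
$rows = array(
    array('language' => 'german', 'confidence' => 26, 'word_count' => 9),
    array('language' => 'german', 'confidence' => 32, 'word_count' => 18),
    array('language' => 'german', 'confidence' => 32, 'word_count' => 23),
    array('language' => 'german', 'confidence' => 41, 'word_count' => 12),
    array('language' => 'english', 'confidence' => 32, 'word_count' => 26),
    array('language' => 'english', 'confidence' => 35, 'word_count' => 20),
);
$summary = summarize_languages($rows);
print_r($summary);
```

<p>With a real tweet table the same averages would more likely come from a GROUP BY query, but an in-memory version is handy for a quick look at a small sample.</p>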
<p>The next example uses the library&#8217;s is_language() function to select just the tweets for a specific language. In this case, I&#8217;ve tested for English, but the function will work with any of the languages that the Text_LanguageDetect code returns. We saw <a href="http://140dev.com/twitter-api-programming-blog/language-detection-for-tweets-part-3/">yesterday</a> that there is a large number of possible languages. </p>
<p><strong><a href="http://140dev.com/tutorials/language_detection/language_detect7.php">language_detect7.php</a></strong><br />
<table><tr><td class="code"><pre>&lt;?php
// language_detect7.php

require_once 'language_lib.php';

// Connect to the database with the sample tweet table
require_once('db_lib.php');
$oDB = new db;

// Loop through the sample tweets
$query = &quot;SELECT tweet_text FROM language&quot;;
$result = $oDB-&gt;select($query);
while ($row=mysqli_fetch_assoc($result)) {

	// Print the detected language info
	$text = $row['tweet_text'];
	
	// Only display tweets in English 
	// with a confidence level of at least 30%,
	// and at least 5 words
	if (is_language($text,'english',30,5)) {
		print &quot;Text: $text&lt;br/&gt;&quot;;
		print_r( language_info($text));
		print &quot;&lt;br/&gt;&lt;br/&gt;&quot;;
	}
}
?&gt;</pre></td></tr></table></p>
<p>If you <a href="http://140dev.com/tutorials/language_detection/language_detect7.php">run this script</a> in your browser, you&#8217;ll see just the English tweets that have a confidence level of at least 30% and 5 words or more. I chose to reject tweets that didn&#8217;t meet the minimum word count, but another option would have been to set the minimum word count to 0, so that all English tweets that met the confidence level were displayed. </p>
<p><code>Text: RT @StephenAtHome: A study predicts nearly half of all Americans will be obese by 2030. But with a little American ingenuity I bet we ca ...<br />
Array ( [language] => english [confidence] => 32 [word_count] => 26 ) </p>
<p>Text: I was looking for some weight loss computer support in my area, but there's no low-cal IT in my locality.<br />
Array ( [language] => english [confidence] => 35 [word_count] => 20 ) </code></p>
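<p>To pick a minimum confidence level, it can help to see how many tweets would survive at several candidate thresholds. The sketch below is one possible way to do that; count_passing() is a hypothetical helper, and it only looks at the top language of each tweet (the shape language_info() returns), whereas is_language() checks every candidate from detect(). The sample rows are copied from the output in this post.</p>

```php
<?php
// Hypothetical helper: for each candidate threshold, count the rows whose
// top language matches the target and whose confidence is high enough.
// Assumes each row has 'language' and 'confidence' (0-100) keys.
function count_passing(array $rows, $target_language, array $thresholds) {
    $counts = array();
    foreach ($thresholds as $min_confidence) {
        $counts[$min_confidence] = 0;
        foreach ($rows as $row) {
            if (strtolower($row['language']) == strtolower($target_language)
                && $row['confidence'] >= $min_confidence) {
                $counts[$min_confidence]++;
            }
        }
    }
    return $counts;
}

// Sample rows copied from the script output in this post
$rows = array(
    array('language' => 'german', 'confidence' => 26),
    array('language' => 'german', 'confidence' => 32),
    array('language' => 'german', 'confidence' => 32),
    array('language' => 'german', 'confidence' => 41),
    array('language' => 'english', 'confidence' => 32),
    array('language' => 'english', 'confidence' => 35),
);
$german_counts = count_passing($rows, 'german', array(25, 30, 40));
print_r($german_counts);
```

<p>A threshold that throws away most of the tweets you know are in the target language is probably too strict for that keyword set.</p>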
]]></content:encoded>
			<wfw:commentRss>http://140dev.com/twitter-api-programming-blog/language-detection-for-tweets-part-4/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Language detection for tweets: Part 3</title>
		<link>http://140dev.com/twitter-api-programming-blog/language-detection-for-tweets-part-3/</link>
		<comments>http://140dev.com/twitter-api-programming-blog/language-detection-for-tweets-part-3/#comments</comments>
		<pubDate>Wed, 23 May 2012 15:02:21 +0000</pubDate>
		<dc:creator>Adam Green</dc:creator>
				<category><![CDATA[Twitter API Tutorials]]></category>
		<category><![CDATA[Twitter Language Detection]]></category>

		<guid isPermaLink="false">http://140dev.com/?p=1465</guid>
		<description><![CDATA[In yesterday&#8217;s installment we learned how to get the most likely language for a tweet with the detectSimple() function. We also discovered that this library sometimes fails when you get down to just 2 or 3 words. The Text_LanguageDetect library has a more advanced function, called detect(), that delivers an array of possible language matches [&#8230;]]]></description>
				<content:encoded><![CDATA[<p></p><p>In <a href="http://140dev.com/twitter-api-programming-blog/language-detection-for-tweets-part-2/">yesterday&#8217;s installment</a> we learned how to get the most likely language for a tweet with the detectSimple() function. We also discovered that this library sometimes fails when you get down to just 2 or 3 words. The Text_LanguageDetect library has a more advanced function, called detect(), that delivers an array of possible language matches and a numeric confidence level for each. The higher the confidence level, the more likely the language is a match. </p>
<p><strong><a href="http://140dev.com/tutorials/language_detection/language_detect4.php">language_detect4.php</a></strong><br />
<table><tr><td class="code"><pre>&lt;?php
// language_detect4.php

require_once 'Text/LanguageDetect.php';
$oLang = new Text_LanguageDetect();

$long_french = 'qui propose une &eacute;cole maternelle bilingue fran&ccedil;ais';
print &quot;Long French: $long_french&lt;br/&gt;&quot;;
print &quot;Language: &lt;br/&gt;&quot;;
print_r($oLang-&gt;detect($long_french));

$long_english = 'the latest episode of american idol sucks';
print &quot;&lt;br/&gt;&lt;br/&gt;Long English: $long_english&lt;br/&gt;&quot;;
print &quot;Language: &lt;br/&gt;&quot;;
print_r($oLang-&gt;detect($long_english));

?&gt;</pre></td></tr></table></p>
<p>If you <a href="http://140dev.com/tutorials/language_detection/language_detect4.php">run this script</a> in a browser, you will see that there are many possible languages to choose from, in order by confidence level. </p>
<p><code>Long French: qui propose une école maternelle bilingue français<br />
Language:<br />
Array ( [french] => 0.32340136054422 [romanian] => 0.25102040816327 [slovene] => 0.24061224489796 [danish] => 0.23877551020408 [latin] => 0.21857142857143 [italian] => 0.21761904761905 [english] => 0.21040816326531 [norwegian] => 0.20884353741497 [portuguese] => 0.20047619047619 [estonian] => 0.18700680272109 [spanish] => 0.18503401360544 [croatian] => 0.18428571428571 [pidgin] => 0.17292517006803 [slovak] => 0.16809523809524 [dutch] => 0.16224489795918 [czech] => 0.14707482993197 [german] => 0.14544217687075 [tagalog] => 0.14510204081633 [cebuano] => 0.11734693877551 [finnish] => 0.1147619047619 [swedish] => 0.11469387755102 [lithuanian] => 0.11333333333333 [latvian] => 0.10857142857143 [polish] => 0.1069387755102 [swahili] => 0.10551020408163 [turkish] => 0.094149659863946 [hawaiian] => 0.09204081632653 [indonesian] => 0.089727891156463 [albanian] => 0.080544217687075 [hausa] => 0.077142857142857 [azeri] => 0.067074829931973 [hungarian] => 0.052517006802721 [icelandic] => 0.052448979591837 [vietnamese] => 0.051768707482993 [welsh] => 0.051700680272109 [somali] => 0.037142857142857 [bengali] => 0 [mongolian] => 0 ) </p>
<p>Long English: the latest episode of american idol sucks<br />
Language:<br />
Array ( [english] => 0.26414634146341 [pidgin] => 0.20056910569106 [spanish] => 0.17081300813008 [slovak] => 0.16130081300813 [estonian] => 0.15845528455285 [italian] => 0.15471544715447 [welsh] => 0.14829268292683 [latin] => 0.14739837398374 [danish] => 0.14585365853659 [romanian] => 0.14268292682927 [french] => 0.1409756097561 [norwegian] => 0.14048780487805 [dutch] => 0.12666666666667 [portuguese] => 0.12065040650406 [german] => 0.1130081300813 [indonesian] => 0.1079674796748 [slovene] => 0.090487804878049 [swahili] => 0.09 [latvian] => 0.086991869918699 [turkish] => 0.08 [azeri] => 0.079512195121951 [swedish] => 0.075447154471545 [albanian] => 0.07479674796748 [hungarian] => 0.074065040650407 [hawaiian] => 0.072926829268293 [finnish] => 0.07260162601626 [tagalog] => 0.072113821138211 [cebuano] => 0.060894308943089 [hausa] => 0.059105691056911 [croatian] => 0.057967479674797 [lithuanian] => 0.055528455284553 [somali] => 0.053170731707317 [polish] => 0.043170731707317 [czech] => 0.041219512195122 [vietnamese] => 0.040975609756098 [icelandic] => 0.034146341463415 [mongolian] => 0 [bengali] => 0 )</code></p>
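<p>Most of that list is noise, so it may be easier to read if you keep only the first few candidates. Here is a minimal sketch, assuming the array is already sorted by confidence as in the output above; top_candidates() is a hypothetical helper, and the sample scores are the first five French values copied from that output.</p>

```php
<?php
// Hypothetical helper: keep the first $n entries of a detect()-style array
// (language => confidence) and convert the scores to whole percentages.
// Assumes the array is already sorted from most to least likely.
function top_candidates(array $scores, $n) {
    // array_slice() keeps the string keys, so the language names survive
    $top = array_slice($scores, 0, $n);
    foreach ($top as $language => $confidence) {
        $top[$language] = (int) round($confidence * 100);
    }
    return $top;
}

// First five entries copied from the French output above
$scores = array(
    'french' => 0.32340136054422,
    'romanian' => 0.25102040816327,
    'slovene' => 0.24061224489796,
    'danish' => 0.23877551020408,
    'latin' => 0.21857142857143,
);
$top3 = top_candidates($scores, 3);
print_r($top3);
```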
<p>Manipulating arrays is sometimes tricky, so here is an extension of this script that delivers the most likely language for a string, along with its confidence level and number of words. </p>
<p><strong><a href="http://140dev.com/tutorials/language_detection/language_detect5.php">language_detect5.php</a></strong><br />
<table><tr><td class="code"><pre>&lt;?php
// language_detect5.php

require_once 'Text/LanguageDetect.php';
$oLang = new Text_LanguageDetect();

$long_french = 'qui propose   une &eacute;cole maternelle bilingue fran&ccedil;ais';
print &quot;Long French: $long_french&lt;br/&gt;&quot;;
language_info($long_french);

$long_english = 'the latest episode of american idol sucks';
print &quot;&lt;br/&gt;&lt;br/&gt;Long English: $long_english&lt;br/&gt;&quot;;
language_info($long_english);

function language_info($text) {
	global $oLang;
	
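	// Take the key and value of the first (most likely) array element.
	// each() was removed in PHP 8 and cannot take a function result by reference.
	$results = $oLang-&gt;detect($text);
	$language = key($results);
	$confidence = current($results);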
	
	// Convert the confidence level to a 2 digit integer for convenience
	$confidence = round($confidence*100,0);
	
	// Get the number of words in this string
	// preg_replace() replaces eregi_replace(), which was removed in PHP 7
	$string = preg_replace('/ +/', ' ', $text);
	$array = explode(&quot; &quot;, $string);
	$word_count = sizeof($array);
	
	print &quot;Language: $language&lt;br/&gt;&quot;;
	print &quot;Confidence: $confidence%&lt;br/&gt;&quot;; 
	print &quot;Words: $word_count&lt;br/&gt;&quot;; 	
}

?&gt;</pre></td></tr></table></p>
<p><code>Long French: qui propose une école maternelle bilingue français<br />
Language: french<br />
Confidence: 33%<br />
Words: 7</p>
<p>Long English: the latest episode of american idol sucks<br />
Language: english<br />
Confidence: 26%<br />
Words: 7</code></p>
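<p>The top confidence level is not the only useful number in the detect() array. How far the winner is ahead of the runner-up may also say something about how much to trust the result. The sketch below shows one way to compute that gap; top_with_margin() is a hypothetical helper, and treating the margin as a quality signal is my assumption rather than something the library documents. The sample scores are copied from the detect() output earlier in this post.</p>

```php
<?php
// Hypothetical helper: return the most likely language, its confidence,
// and its lead over the second most likely language.
// Assumes $scores is a detect()-style array (language => confidence).
function top_with_margin(array $scores) {
    if (count($scores) == 0) {
        return null;
    }
    arsort($scores); // defensive: make sure the highest score comes first
    $languages = array_keys($scores);
    $values = array_values($scores);
    $runner_up = isset($values[1]) ? $values[1] : 0;
    return array(
        'language' => $languages[0],
        'confidence' => $values[0],
        'margin' => $values[0] - $runner_up,
    );
}

// Top entries copied from the detect() output shown earlier
$french = top_with_margin(array(
    'french' => 0.32340136054422,
    'romanian' => 0.25102040816327,
    'slovene' => 0.24061224489796,
));
$english = top_with_margin(array(
    'english' => 0.26414634146341,
    'pidgin' => 0.20056910569106,
    'spanish' => 0.17081300813008,
));
print_r($french);
print_r($english);
```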
<p>We now have the basic tools to create a library of language functions that can be used when processing tweets from the Twitter API. Come back tomorrow and we&#8217;ll work out the details of such a library. </p>
]]></content:encoded>
			<wfw:commentRss>http://140dev.com/twitter-api-programming-blog/language-detection-for-tweets-part-3/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Language detection for tweets: Part 2</title>
		<link>http://140dev.com/twitter-api-programming-blog/language-detection-for-tweets-part-2/</link>
		<comments>http://140dev.com/twitter-api-programming-blog/language-detection-for-tweets-part-2/#comments</comments>
		<pubDate>Tue, 22 May 2012 13:32:11 +0000</pubDate>
		<dc:creator>Adam Green</dc:creator>
				<category><![CDATA[Twitter API Tutorials]]></category>
		<category><![CDATA[Twitter Language Detection]]></category>

		<guid isPermaLink="false">http://140dev.com/?p=1448</guid>
		<description><![CDATA[The docs for the Text_LanguageDetect library say that you need to pass it 4-5 sentences to get an accurate language identification, but as we saw in part 1 of this tutorial, even a single sentence seems to work. This is great, since we will need this to work with tweets that average 5-6 words. So [&#8230;]]]></description>
				<content:encoded><![CDATA[<p></p><p>The docs for the <a href="http://pear.php.net/package/Text_LanguageDetect/docs">Text_LanguageDetect</a> library say that you need to pass it 4-5 sentences to get an accurate language identification, but as we saw in <a href="http://140dev.com/twitter-api-programming-blog/language-detection-for-tweets-part-1/">part 1</a> of this tutorial, even a single sentence seems to work. This is great, since we will need this to work with tweets that average 5-6 words. So how small a string will give you accurate results? It varies with each language, but from my tests you need at least 3-4 words in most languages. </p>
<p>This <a href="http://140dev.com/tutorials/language_detection/language_detect2.php">sample script</a> demonstrates the problem. </p>
<p><strong><a href="http://140dev.com/tutorials/language_detection/language_detect2.php">language_detect2.php</a></strong><br />
<table><tr><td class="code"><pre>&lt;?php
// language_detect2.php

require_once 'Text/LanguageDetect.php';
$oLang = new Text_LanguageDetect();

$long_french = 'qui propose une &eacute;cole maternelle bilingue fran&ccedil;ais';
print &quot;Long French: $long_french&lt;br/&gt;&quot;;
print &quot;Language: &quot; . $oLang-&gt;detectSimple($long_french) . &quot;&lt;br/&gt;&quot;;

$short_french = '&eacute;cole maternelle';
print &quot;&lt;br/&gt;Short French: $short_french&lt;br/&gt;&quot;;
print &quot;Language: &quot; . $oLang-&gt;detectSimple($short_french) . &quot;&lt;br/&gt;&quot;;

$long_english = 'the latest episode of american idol sucks';
print &quot;&lt;br/&gt;Long English: $long_english&lt;br/&gt;&quot;;
print &quot;Language: &quot; . $oLang-&gt;detectSimple($long_english) . &quot;&lt;br/&gt;&quot;;

$short_english = 'american idol';
print &quot;&lt;br/&gt;Short English: $short_english&lt;br/&gt;&quot;;
print &quot;Language: &quot; . $oLang-&gt;detectSimple($short_english) . &quot;&lt;br/&gt;&quot;;

?&gt;</pre></td></tr></table></p>
<p>Running this example in a browser shows that with just 2 words, the language returned by the library can&#8217;t be trusted.</p>
<p><code>Long French: qui propose une école maternelle bilingue français<br />
Language: french</p>
<p>Short French: école maternelle<br />
Language: danish</p>
<p>Long English: the latest episode of american idol sucks<br />
Language: english</p>
<p>Short English: american idol<br />
Language: welsh</code></p>
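<p>Given results like these, one simple defence is to skip detection entirely when a string is too short. The following is a rough sketch of that idea, not code from the library: count_words() and detect_if_long_enough() are hypothetical helpers, and the detector is passed in as a callable so that the example runs without Pear installed. In real use you could pass array($oLang, 'detectSimple') instead of the stand-in shown here.</p>

```php
<?php
// Hypothetical helper: count the words separated by spaces, tabs or newlines
function count_words($text) {
    $words = preg_split('/[ \t\r\n]+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    return count($words);
}

// Hypothetical helper: call $detector only when the text has at least
// $min_words words; return null for anything shorter.
function detect_if_long_enough($text, $min_words, $detector) {
    if (count_words($text) < $min_words) {
        return null;
    }
    return call_user_func($detector, $text);
}

// Stand-in detector for demonstration only; it always answers 'french'.
// With the library loaded, pass array($oLang, 'detectSimple') instead.
$fake_detector = function ($text) {
    return 'french';
};

$short = detect_if_long_enough('école maternelle', 3, $fake_detector);
$long = detect_if_long_enough(
    'qui propose une école maternelle bilingue français', 3, $fake_detector);

var_dump($short); // too short, so no detection was attempted
var_dump($long);
```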
<p>I&#8217;ve found that the best way to test the accuracy of this language detection method is to process a sample set of tweets with it, and examine the results for different languages. The next script will do this with a list of 16 tweets I pulled out of a database I built for a firm that consults for drug companies in Europe. They need to collect tweets for different diseases, and separate the results by language. The sample table we&#8217;ll process here has 4 tweets in each of 4 languages. The code uses my standard <a href="http://140dev.com/twitter-api-programming-blog/simple-php-mysql-database-library-source-code/">db_lib.php database library</a> to read the tweets from the database. </p>
<p><strong><a href="http://140dev.com/tutorials/language_detection/language_detect3.php">language_detect3.php</a></strong><br />
<table><tr><td class="code"><pre>&lt;?php
// language_detect3.php

// Get ready to use the language library
require_once 'Text/LanguageDetect.php';
$oLang = new Text_LanguageDetect();

// Connect to the database with the sample tweet table
require_once('db_lib.php');
$oDB = new db;

// Loop through the sample tweets
$query = &quot;SELECT tweet_text FROM language&quot;;
$result = $oDB-&gt;select($query);
while ($row=mysqli_fetch_assoc($result)) {

	// Print the detected language	
	$text = $row['tweet_text'];
	print &quot;Text: $text&lt;br/&gt;&quot;;
	print &quot;Language: &quot; . $oLang-&gt;detectSimple($text) . &quot;&lt;br/&gt;&lt;br/&gt;&quot;;
}
?&gt;</pre></td></tr></table></p>
<p><a href="http://140dev.com/tutorials/language_detection/language_detect3.php">Running this script</a> in a browser shows that with a reasonably long tweet, the language identification is really good, especially for a free library. </p>
<p><code>Text: Neugründung von “Deutsche Diabetes-Hilfe – Menschen mit Diabetes” http://t.co/mRRAhvPh<br />
Language: german</p>
<p>Text: RT @minihex: BMI denkt sich wiedermal:ein bissl rassistischer gehts noch-gesetzesentwurf sieht neue schikanen f asylwerberInnen vor htt ...<br />
Language: german</p>
<p>Text: @myMONK_de naja, da wäre noch die üble Bronchitis, die ich seit über 2 Wochen habe, aber Magen-Darm ist wenigstens wieder okay endlich<br />
Language: german</p>
<p>Text: COPD - eine Gefahr für die Lunge nicht nur bei Rauchern: http://t.co/8B2UnJng<br />
Language: german</p>
<p>Text: Mierda de profesor que no supo explicar nada de las columnas de dominancia ocular y ahora no entiendo nada<br />
Language: spanish</p>
<p>Text: #diabetesla Hace poco puse el enlace a 1foro de diabetes. Se comenta que insulina Lantus provoca depresión.¿Algo de cierto? 10 días con ella<br />
Language: spanish</p>
<p>Text: Queridos padres, tengo casi 16 años, creerme ya he aprendido a vivir con la diabetes, me acompaña desde los 3 años, así que por favor +<br />
Language: spanish</p>
<p>Text: La OMS advierte sobre el aumento de casos de hipertensión y diabetes en el mundo - http://t.co/074JHwBh http://t.co/PKUyPLuG<br />
Language: spanish</p>
<p>Text: Les problemes ou sa fait maigrir ou sa fait grossir. Personnellemnt je suis devenue obese. Fais chié!<br />
Language: french</p>
<p>Text: @Mangeunepomme C'est ce que je compte faire<br />
Language: french</p>
<p>Text: Genre c'est une grosse limite obese et elle fait la meuf genre c'est une salope<br />
Language: french</p>
<p>Text: @GlodieGabrielle hehehehehe. A kelke kilos detr obese, u va mettre ta tente dans une salle de gym de la place<br />
Language: french</p>
<p>Text: Oh shoot looks like I've got hay fever... This is bad :/<br />
Language: english</p>
<p>Text: RT @StephenAtHome: A study predicts nearly half of all Americans will be obese by 2030. But with a little American ingenuity I bet we ca ...<br />
Language: english</p>
<p>Text: Vital Signs: Options for weight loss: In addition to dietitians, counselors and life coaches who can walk you th... http://t.co/hBam9XiF<br />
Language: english</p>
<p>Text: I was looking for some weight loss computer support in my area, but there's no low-cal IT in my locality.<br />
Language: english</code></p>
<p>You can now see how easy it is to get the language for a series of tweets. We&#8217;ll dig deeper tomorrow and learn how to use this library&#8217;s confidence level results. That will let you select only the tweets that have a high chance of being in the language you need. Then later in the week I&#8217;ll create a standard language detection function that you can call whenever you need to process tweets. </p>
]]></content:encoded>
			<wfw:commentRss>http://140dev.com/twitter-api-programming-blog/language-detection-for-tweets-part-2/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Language detection for tweets: Part 1</title>
		<link>http://140dev.com/twitter-api-programming-blog/language-detection-for-tweets-part-1/</link>
		<comments>http://140dev.com/twitter-api-programming-blog/language-detection-for-tweets-part-1/#comments</comments>
		<pubDate>Tue, 22 May 2012 01:48:06 +0000</pubDate>
		<dc:creator>Adam Green</dc:creator>
				<category><![CDATA[Twitter API Tutorials]]></category>
		<category><![CDATA[Twitter Language Detection]]></category>

		<guid isPermaLink="false">http://140dev.com/?p=1439</guid>
		<description><![CDATA[One thing I learned early on in building tweet aggregation sites for clients is that they expect to only see tweets in English. After all, Google can do it, why can&#8217;t I? In theory there is a lang=en argument in the search API, but it doesn&#8217;t help much, because it only uses the language setting [&#8230;]]]></description>
				<content:encoded><![CDATA[<p></p><p>One thing I learned early on in building tweet aggregation sites for clients is that they expect to see only tweets in English. After all, Google can do it, so why can&#8217;t I? In theory there is a lang=en argument in the search API, but it doesn&#8217;t help much, because it only uses the language setting entered by users in their profile. Since English is the default, and hardly anyone changes it, almost all tweets are labelled as English. I seem to remember the streaming API having a lang argument also, but it isn&#8217;t in the docs now. Either way, I gave up and found my own solution a long time ago. The good thing is that it doesn&#8217;t just work for English. It also does a remarkably good job with the more than a dozen languages I have tested, and claims to support many more. Best of all, it is free and open source. </p>
<p>The library I use is called Text_LanguageDetect, and it is available as a Pear module, which makes installation very easy for PHP. You can download the code <a href="http://pear.php.net/package/Text_LanguageDetect/download">here</a>, and get docs <a href="http://pear.php.net/package/Text_LanguageDetect/docs">here</a>. It requires PHP 5.3 and Pear 1.9. You don&#8217;t have to download and install it manually; you can just use the Pear install command:<br />
<code>pear install Text_LanguageDetect-0.3.0</code></p>
<p>Using the library only takes a few lines of code. It is a class, so you have to create an instance of the class, and then you can call its functions.<br />
<code>require_once 'Text/LanguageDetect.php';<br />
$oLang = new Text_LanguageDetect();</code></p>
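<p>If you are not sure whether the Pear install worked, a script can check for the library file before trying to load it. This is only a sketch: language_detect_available() is a hypothetical helper name, and it relies on PHP's stream_resolve_include_path() to look for the file on the include path.</p>

```php
<?php
// Hypothetical helper: true if Text/LanguageDetect.php is on the include path
function language_detect_available() {
    return stream_resolve_include_path('Text/LanguageDetect.php') !== false;
}

if (language_detect_available()) {
    require_once 'Text/LanguageDetect.php';
    $oLang = new Text_LanguageDetect();
    print "Text_LanguageDetect is ready\n";
} else {
    print "Text_LanguageDetect was not found on the include path\n";
}
```

<p>A plain require_once stops the script with a fatal error when the file is missing, so a check like this is mostly useful on servers where you do not control the Pear setup.</p>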
<p>The simplest function you can call is detectSimple(), which returns the most likely language for the text it is passed. Here is a <a href="http://140dev.com/tutorials/language_detection/language_detect1.php">basic test script</a>. </p>
<p><strong><a href="http://140dev.com/tutorials/language_detection/language_detect1.php">language_detect1.php</a></strong><br />
<table><tr><td class="code"><pre>&lt;?php
// language_detect1.php

require_once 'Text/LanguageDetect.php';
$oLang = new Text_LanguageDetect();

$text = 'La OMS advierte sobre el aumento de casos de hipertensi&oacute;n y diabetes en el mundo';
print &quot;Text: $text&lt;br/&gt;&quot;;
print &quot;Language: &quot; . $oLang-&gt;detectSimple($text);

?&gt;</pre></td></tr></table></p>
<p>Running this script through a browser shows that the language detection library correctly identified the text as Spanish.<br />
<code>Text: La OMS advierte sobre el aumento de casos de hipertensión y diabetes en el mundo<br />
Language: spanish</code></p>
<p>Tomorrow we&#8217;ll dig deeper into this library, and see how to handle tweets that are more borderline as to their language. </p>
]]></content:encoded>
			<wfw:commentRss>http://140dev.com/twitter-api-programming-blog/language-detection-for-tweets-part-1/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
