
Search API: Are search results filtered for user quality?

Adam Green — Fri, 29 Nov 2013 18:08:35 +0000

A continual question on the Twitter developer mailing list is why certain tweets and even entire accounts don’t show up in search results. The standard answer is that the search API filters out tweets that don’t meet a minimum quality threshold. That makes a lot of sense, and should definitely be done, if it produces better results.

I decided to test the Search API using the quality metrics I would typically apply in my own code to filter out spam accounts: account age, number of followers, and number of tweets. I wrote the following search_quality.php script, and ran it against 100 different query terms. Each execution of the script collected up to 100 tweets and returned the minimum account values found for these quality criteria. What I found was very surprising. For most queries I was able to get back tweets from accounts that were as little as 1 day old, had zero followers, and had sent only 1 or 2 tweets.

The test script uses the tmhOAuth.php OAuth library, as I do in all my code. If you don’t already have a copy of this library, you can download it along with the search_quality.php script. You will need to fill in a set of OAuth tokens to make the API request. You also need to fill in your own query. Try different queries and see what you get.

search_quality.php

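<?php
// Load the tmhOAuth library and fill in your own OAuth tokens
require 'tmhOAuth.php';
$connection = new tmhOAuth(array(
'consumer_key' => '*****',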
'consumer_secret' => '*****',
'user_token' => '*****',
'user_secret' => '*****'
));

// Get up to 100 tweets
// You must fill in the query term
$query = '*****';
$connection->request('GET', $connection->url('1.1/search/tweets'),
array('q' => $query,
'result_type' => 'recent',
'count' => 100));

// Extract tweets
$results = json_decode($connection->response['response']);
$tweets = $results->statuses;

if (sizeof($tweets)==0) {
  print "No tweets found for: $query";
  exit;
}

// Loop through all tweets found
$tweets_found = 0;
$min_account_age = account_age($tweets[0]->user->created_at);
$min_followers_count = $tweets[0]->user->followers_count;
$min_statuses_count = $tweets[0]->user->statuses_count;
foreach($tweets as $tweet) {
  ++$tweets_found;

  if ($min_account_age > account_age($tweet->user->created_at)){
    $min_account_age = account_age($tweet->user->created_at);
  }
  if ($min_followers_count > $tweet->user->followers_count) {
    $min_followers_count = $tweet->user->followers_count;
  }
  if ($min_statuses_count > $tweet->user->statuses_count) {
    $min_statuses_count = $tweet->user->statuses_count;
  }
}

print "Tweets found: $tweets_found Minimum account age: $min_account_age " .
"Minimum followers: $min_followers_count Minimum tweets: $min_statuses_count";

// Return number of days since start date
function account_age($start) {
  date_default_timezone_set('America/New_York');
  $end = date('Y-m-d H:i:s',time());
  return round(abs(strtotime($start)-strtotime($end))/86400) + 1;
}

?>

After you run this, tweet your results to me @140dev, and I’ll pass them along to the rest of the 140dev community.

Of course, if you write your own search API code, and collect the results, you can filter out the tweets based on any quality control rules you want. This is one of the ways developers can add value to Twitter API results.
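
As a minimal sketch of that kind of client-side filtering, the function below keeps only tweets whose authors meet minimum thresholds for account age, followers, and tweet count. The name filter_by_quality and the threshold values are hypothetical, chosen for illustration. It assumes $tweets is the decoded statuses array from the script above.

```php
<?php
// Hypothetical helper: keep tweets whose authors pass simple quality thresholds.
// $tweets is assumed to be the decoded 'statuses' array from a search response.
// $now is an optional Unix timestamp, mainly useful for repeatable testing.
function filter_by_quality($tweets, $min_age_days, $min_followers, $min_statuses, $now = null) {
  $now = ($now === null) ? time() : $now;
  $kept = array();
  foreach ($tweets as $tweet) {
    // Whole days since the account was created
    $age_days = floor(($now - strtotime($tweet->user->created_at)) / 86400);
    if ($age_days >= $min_age_days
        && $tweet->user->followers_count >= $min_followers
        && $tweet->user->statuses_count >= $min_statuses) {
      $kept[] = $tweet;
    }
  }
  return $kept;
}

// Example call; the thresholds are illustrative, not recommendations:
// $good_tweets = filter_by_quality($results->statuses, 30, 10, 20);
```

The right thresholds depend on the application, so it is worth running the test script first to see what the unfiltered results look like for your queries.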

Test the search API to see if poor quality users are filtered out.

Search API: More secret search operators

Adam Green — Tue, 26 Nov 2013 14:47:17 +0000

Last week I pointed out the undocumented search operator min_retweets. I’ve been searching tweets about this operator (yes, that is pretty meta) and found two more operators that aren’t in the official docs: min_replies and min_faves. You’ll have to experiment with these to see which are best for different needs. Before we get into the details, you should be aware that as undocumented features there is no guarantee they will continue to be available. I guess that is true for documented features of the API as well.

Personally, I find min_retweets most useful for identifying influential users. Min_replies also implies influence, but an even better use is finding a group of people who know each other well enough to carry on an extended conversation. You can collect all the replies to a tweet and extract the user names. If this is done repeatedly, you can build up a highly engaged circle of friends. Min_faves implies that the tweets are some of the most informative, even if their authors aren’t that influential.
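
One way to sketch the reply-collecting idea: given tweets already gathered from a search, pull out the screen names of the users who replied to a particular tweet. The name reply_authors is hypothetical, and the code assumes each decoded tweet carries the in_reply_to_status_id_str and user->screen_name fields found in API v1.1 responses. How you gather the candidate tweets is left open.

```php
<?php
// Hypothetical helper: list the unique screen names of users who replied
// to the tweet with id $original_id.
// $tweets is assumed to be an array of decoded v1.1 tweet objects.
function reply_authors($tweets, $original_id) {
  $names = array();
  foreach ($tweets as $tweet) {
    // in_reply_to_status_id_str is null when the tweet is not a reply
    if (isset($tweet->in_reply_to_status_id_str)
        && $tweet->in_reply_to_status_id_str === (string) $original_id) {
      $names[] = $tweet->user->screen_name;
    }
  }
  // Remove duplicates and reindex from zero
  return array_values(array_unique($names));
}
```

Repeating this for each tweet in a conversation and merging the lists would build up the circle of users described above.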

You can use these operators directly in search.twitter.com:
https://twitter.com/search?q=obama%20min_retweets%3A5&src=typd&f=realtime
https://twitter.com/search?q=obama%20min_replies%3A5&src=typd&f=realtime
https://twitter.com/search?q=obama%20min_faves%3A5&src=typd&f=realtime

There is a trick when you use them with the search API. You have to include the operators as part of the query parameter, not as a separate parameter. To get tweets with “obama” and a minimum of 5 retweets, you set the q parameter to “obama min_retweets:5”. Here is an example in the API Console:
http://140dev.com/twitter-api-console/?method=GET&url=1.1/search/tweets&q=obama%20min_retweets:5&run=1

The complete search API call when using the tmhOAuth.php library is:
$connection->request('GET',
$connection->url('1.1/search/tweets'),
array('q' => 'obama min_retweets:5'));
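
Because the operators ride along inside q, a small helper can assemble the query string. This is only a sketch: build_query is a hypothetical name, and the operators are the undocumented ones described above, so they may stop working at any time. Whether several operators can be combined in one query is something to test for yourself.

```php
<?php
// Hypothetical helper: append operator:value pairs to a search term.
// $operators maps an operator name (such as 'min_retweets') to an integer.
function build_query($term, $operators = array()) {
  $parts = array($term);
  foreach ($operators as $name => $value) {
    $parts[] = $name . ':' . (int) $value;
  }
  return implode(' ', $parts);
}

$q = build_query('obama', array('min_retweets' => 5));
// $q is now "obama min_retweets:5", ready to pass as array('q' => $q)
```

The library takes care of encoding the parameter for the request. If you build a twitter.com search URL by hand, rawurlencode($q) gives the encoded form used in the links above.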

Search API: What are the real limits on tweet results?

Adam Green — Mon, 25 Nov 2013 21:31:44 +0000

A common question asked by potential clients is how many tweets they can expect to get from the search API. Although I have been telling them “1,500 tweets up to 7 days old” for years, I decided to confirm that. To my surprise, the limit is no longer provided in the official docs. I tried asking for the current limits on the developer mailing list and got no answer, so I’m going to try an experiment. My hope is that the developer community can come up with our own answers for this important limit.

I wrote a simple test script that counts the tweets and also reports on the date of the oldest tweet returned. Running this myself gave me two possible answers. When I used a low volume query of my town of “lexington mass”, I got 24 tweets going back 7 days, which is what I expected. But when I used the high volume query of “obama”, I got 17,773 tweets before the script hit the rate limit for requests. Clearly something has changed in a big, yet undocumented way.

Here is the script I used, called search_limits.php. It uses the tmhOAuth.php OAuth library, as I do in all my code. If you don’t already have a copy of this library, you can download it along with the search_limits.php script. You will need to fill in a set of OAuth tokens to make the API request. You also need to fill in your own query. Try different queries and see what you get.

search_limits.php

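<?php
// Load the tmhOAuth library and fill in your own OAuth tokens
require 'tmhOAuth.php';
$connection = new tmhOAuth(array(
	'consumer_key'    => '*************',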
	'consumer_secret' => '*************',
	'user_token'      => '*************',
	'user_secret'     => '*************'
));
	
// Loop through search results and accumulate count
$query = '*************';
$max_id = 0;
$oldest_tweet = '';
$tweets_found = 0;
while (true) {

  // First API call
  if ($max_id == 0) {
    $connection->request('GET', $connection->url('1.1/search/tweets'), 
      array('q' => $query,
      'result_type' => 'recent',
      'count' => 100));
				
  // Repeated API call
  } else {
    // Collect older tweets
    --$max_id;
		
    $connection->request('GET', $connection->url('1.1/search/tweets'), 
      array('q' => $query,
      'result_type' => 'recent',
      'count' => 100,
      'max_id' => $max_id));
  }			

  // Exit on error
  if ($connection->response['code'] != 200) {
    print "Exited with error: " . $connection->response['code'] . "\n";
    break;			
  } 

  // Process each tweet returned
  $results = json_decode($connection->response['response']);
  $tweets = $results->statuses;

  // Exit when no more tweets are returned
  if (sizeof($tweets)==0) {
    break;
  }
  foreach($tweets as $tweet) {
    ++$tweets_found;
    $max_id = $tweet->id;
    $oldest_tweet = $tweet->created_at;
  } 
}

print "Tweets found: $tweets_found Oldest tweet: $oldest_tweet";

?>
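
The script above steps backward by decrementing max_id as a number. Tweet ids are 64-bit values, so on a PHP build with 32-bit integers the decoded id can lose precision. A possible workaround, sketched here under the assumption that you read $tweet->id_str instead of $tweet->id, is to decrement the id as a decimal string. The name decrement_id is hypothetical.

```php
<?php
// Hypothetical helper: subtract 1 from a positive integer held as a decimal
// string, so 64-bit tweet ids never pass through PHP's integer type.
function decrement_id($id) {
  $digits = str_split($id);
  for ($i = count($digits) - 1; $i >= 0; --$i) {
    if ($digits[$i] === '0') {
      // Borrow from the next digit to the left
      $digits[$i] = '9';
    } else {
      $digits[$i] = (string) ($digits[$i] - 1);
      break;
    }
  }
  // Drop any leading zero left behind by the borrow
  $result = ltrim(implode('', $digits), '0');
  return ($result === '') ? '0' : $result;
}

// In the loop this would replace --$max_id, for example:
// $max_id = decrement_id($tweet->id_str);
```

On a 64-bit PHP build the original numeric decrement works as written, so this only matters if the script may run on older or 32-bit systems.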

It would be great if a group of developers ran this script with their own queries and reported the results. I’d like to know how many tweets you got back from the API and how far back in time they went. What do you say? Can we start working together to answer questions rather than waiting for answers on the developer mailing list? Tweet your answers to me @140dev. Thanks for your help.

Help me solve the mystery of search API limits.

Search Programming: Secret query operator min_retweets

Adam Green — Fri, 22 Nov 2013 13:49:06 +0000

Twitter search has been steadily improving since it was acquired from Summize in 2008. At first it returned tweets in a different format from the rest of the API, and had other integration problems, but Twitter has been working on it steadily. I’m writing a book on search API programming, and the first step is testing every possible query option and documenting them on this blog. I thought I’d start with a very cool operator that isn’t documented by Twitter. It is min_retweets. As the name implies, it will identify tweets that have gotten at least the specified number of retweets.

This is a great form of quality control. I’m always telling clients that Twitter is a focus group for their messaging, and this option clears out the noise so they can identify the tweets on any subject that have gotten the most reaction.

For example, this search for Obama has no minimum set for retweets:
https://twitter.com/search?q=obama&src=typd&f=realtime

Here is the same search for Obama with a min_retweets setting of 100:
https://twitter.com/search?q=obama%20min_retweets%3A100&src=typd&f=realtime

When I run these two queries, the difference in quality is obvious. Another benefit of min_retweets is that it reveals the most influential users on any subject. Anyone who can get 100 retweets or more has a lot of influence on that subject.
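
To turn those results into a list of influential users, one option is to tally authors by the retweet counts of their returned tweets. A rough sketch follows. The name top_authors is hypothetical, and the code assumes each decoded tweet has the retweet_count and user->screen_name fields from a v1.1 search response.

```php
<?php
// Hypothetical helper: total the retweet counts per author for tweets that
// reached at least $min_retweets, sorted with the highest total first.
function top_authors($tweets, $min_retweets) {
  $totals = array();
  foreach ($tweets as $tweet) {
    if ($tweet->retweet_count < $min_retweets) {
      continue;
    }
    $name = $tweet->user->screen_name;
    if (!isset($totals[$name])) {
      $totals[$name] = 0;
    }
    $totals[$name] += $tweet->retweet_count;
  }
  // Sort by total retweets, descending, keeping screen names as keys
  arsort($totals);
  return $totals;
}
```

Summing retweets is only one possible measure. Counting how many qualifying tweets each author has would be an equally reasonable choice.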

Do you know any secret search operators? I’d love to hear about them.

Exceeding the search API rate limit

Adam Green — Sat, 10 Dec 2011 17:20:13 +0000

We recently built a cool site called This R That for a client.

Besides having a great UI that my son, Zach, built, it also has a neat architecture for a Twitter search site. The major weakness of the Twitter search API is that rate limiting is based on the IP making the request. While Twitter won’t reveal the actual limit, it is believed to be about 200 an hour. If you build a web page that takes the search request and sends it to a server to do the work, that server’s IP will be capped at the rate limit across all users. A popular site would reach that limit fast.

The solution we used was to do the search with JavaScript from within the user’s browser. Then we used JavaScript to parse the JSON results and display them as tweet streams. With this model, the rate limit is applied to the IP of the user’s computer rather than to our server. So each user can do up to 200 search requests every hour, or more if Twitter is feeling generous. Any number of users can be running the same web page simultaneously.

OK, Twitter. This time you have to fix the Search API

Adam Green — Tue, 30 Nov 2010 15:16:44 +0000

Twitter bought the code for the search API when they acquired Summize, and while it did give them a fast search, I get the feeling they aren’t too happy about the quality of the code. The biggest hint is that they never fix it. The best example is the documented bug about the search API returning invalid user ids. That’s right. When you do a search with this API, the id reported for the author of the tweets is often invalid. Not always, but a high percentage. This was added to their bug tracking system 2 years ago! Another big issue is the lack of tweet entities in search results. This doesn’t mean that the search API is worthless. I often use it to get applications going, and there are even some benefits over the streaming API, as I note in my tutorial on this subject.

Now Twitter has a real problem with search. Sometime last week the ability to use the lang parameter in a search broke. I use it as lang=en to get tweets in English. This has never been 100% effective, but it used to filter out most of the non-English tweets. Throughout the week various uses of the search API with the lang parameter stopped returning any results at all, which I would classify as a serious bug. There were complaints about this all week, but since it was Thanksgiving, nobody at Twitter replied. On Monday they did finally say they were looking at it.

I understand how the Twitter engineers feel. I’ve been in the position of maintaining and improving acquired code, and it can be a bitch. My development team at Andover.net had the unpleasant task of rewriting the code for Slashdot after we acquired it and we were in the middle of going public. So I sympathize with the problem Twitter faces, but they can’t push this off to some long-term version 2 solution. If they leave this broken for more than another week, things will get hot, and the press will discover the problem. There are already signs a lot of the MSM is starting to try and tear down what has become a major competitor. This problem is now strategic, not just a wish list item that can be ignored.

Update: Twitter now says this problem is fixed, and my tests show that the lang parameter is now working. Good job. Now about that user_id issue.

New Twitter API tutorial comparing search API and streaming API

Adam Green — Sun, 10 Oct 2010 15:34:39 +0000

I just finished a tutorial on the two methods of searching for tweets. Whenever this subject comes up on the Twitter developers mailing list, the usual response is that the streaming API is best, but that depends on your goals and programming ability. If you want to search for tweets in the past, or if you are not a very experienced programmer, the search API is the right choice. On the other hand, the streaming API will deliver tweets in real-time, which is very impressive for an app. I lay out all the pros and cons here.