Aggregating tweets: Search API vs. Streaming API
Whether to use the search API or the streaming API is one of the most common questions on the Twitter developers mailing list. Twitter HQ seems to want to move people away from the search API, so most responses to this question recommend the streaming API. There are valid reasons to use both, so here is my list of the pros and cons of each method of collecting tweets for an aggregation database.
The search API is the easier of the two methods to implement. It is called with a REST URL that can be retrieved with a simple HTTP GET request. In PHP most people do this with the cURL library. Each request will return up to 100 tweets, and you can use a page parameter to request up to 15 pages, giving you a theoretical maximum of 1,500 tweets for a single query. By running the search script as a cron job every 60 seconds you can keep up with most search topics without missing any tweets.
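As a rough sketch of the paging arithmetic above, this Python snippet builds the list of request URLs for a single query. The search.twitter.com host is taken from the Twitter docs; the `/search.json` path and the `q`, `rpp`, and `page` parameter names are assumptions on my part, so check them against the current docs before relying on them.

```python
from urllib.parse import urlencode

# Assumed endpoint and parameter names; verify against the Twitter docs.
SEARCH_URL = "http://search.twitter.com/search.json"
TWEETS_PER_PAGE = 100   # maximum tweets returned per request
MAX_PAGES = 15          # maximum number of pages per query


def build_search_urls(query, pages=MAX_PAGES):
    """Return one request URL per result page for the given query."""
    pages = min(pages, MAX_PAGES)
    return [
        SEARCH_URL + "?" + urlencode({"q": query, "rpp": TWEETS_PER_PAGE, "page": page})
        for page in range(1, pages + 1)
    ]


urls = build_search_urls("#php OR #mysql")
max_tweets = len(urls) * TWEETS_PER_PAGE  # theoretical ceiling for one query
```

A cron job would fetch these URLs in order and stop as soon as a page comes back empty.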
The streaming API, on the other hand, is used through a continuous connection to the Twitter servers. You need to run a script that establishes a connection with your query request, and then accepts tweets as they are returned by the server. This connection is maintained for as long as you need to collect tweets, possibly for months or even years. This approach requires greater knowledge of network programming, and most people use a library, such as Phirehose for PHP, to manage the connection. The script that communicates with the streaming API must be run as a long-term background process or a system daemon, a style of programming that is unfamiliar to most Web developers.
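To give a feel for the processing side of such a script, here is a minimal Python sketch of the loop that reads from an open connection. It assumes the stream delivers one JSON object per line, with blank lines sent as keep-alives, and that tweets can be recognized by a `text` field; the function name is my own. Connection setup, authentication, and reconnecting after a drop are left to the caller or to a library like Phirehose.

```python
import json


def iter_tweets(lines):
    """Yield one decoded tweet per non-empty line of a streaming response.

    `lines` is any iterable of str or bytes, for example the line iterator
    of an open HTTP response. Blank lines are treated as keep-alives.
    """
    for raw in lines:
        if isinstance(raw, bytes):
            raw = raw.decode("utf-8")
        raw = raw.strip()
        if not raw:
            continue  # keep-alive newline, nothing to process
        try:
            message = json.loads(raw)
        except ValueError:
            continue  # partial or corrupt line; skip it rather than crash
        # Only pass on objects that look like tweets (assumed to carry "text").
        if isinstance(message, dict) and "text" in message:
            yield message
```

Because the function is a generator, the daemon can store each tweet as it arrives instead of waiting for a batch.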
Rate limit differences
The search API is rate limited, but Twitter is even more secretive than usual about the exact limits. The docs say:
Requests to the Search API, hosted on search.twitter.com, do not count towards the REST API limit. However, all requests coming from an IP address are applied to a Search Rate Limit. The Search Rate Limit isn’t made public to discourage unnecessary search usage and abuse, but it is higher than the REST Rate Limit. We feel the Search Rate Limit is both liberal and sufficient for most applications and know that many application vendors have found it suitable for their needs.
In practice I have found that a rate of once every minute has worked for months at a time without any rate limit errors.
The streaming API is a single connection that is continuously maintained, so rate limiting doesn’t really apply. The docs claim that there is a rate limit, but give no explanation of what it is. I have never experienced a rate limit error when using the streaming API.
Lag in retrieving results
Low lag is one of the major advantages of the streaming API. Tweets delivered with this method are basically real-time, with a lag of a second or two at most between the time a tweet is posted and the time it is received from the API, although this lag can be greater when Twitter is overloaded. Generally, a tweet aggregation system based on the streaming API can be assumed to be real-time. Since aggregation with the search API is run as a repeated series of requests, the lag can be as long as the delay between requests. I’ve never run a search-based aggregation at a rate faster than one request every 60 seconds.
Finding tweets in the past
The search API wins by default in this area, because the streaming API doesn’t deliver any past tweets. You only receive tweets starting from the time the server connection is established. The search API will return tweets matching the current query up to 7 days old in theory, but that is entirely up to Twitter’s current load. At times this interval has been as short as 24 hours. In addition, you can only receive up to 1,500 tweets per query, regardless of how old they are.
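To see how these two limits interact, here is a small Python calculation. It assumes a steady tweet rate for the topic, which real topics rarely have, so treat the result as a rough estimate only. The function name is my own.

```python
MAX_TWEETS = 1500          # most tweets one query can return
MAX_AGE_HOURS = 7 * 24     # theoretical age limit of search results


def backfill_reach_hours(tweets_per_hour):
    """Estimate how many hours back a single query can reach."""
    if tweets_per_hour <= 0:
        return float(MAX_AGE_HOURS)
    return min(float(MAX_AGE_HOURS), MAX_TWEETS / tweets_per_hour)


quiet_topic = backfill_reach_hours(5)    # limited by the 7-day window
busy_topic = backfill_reach_hours(500)   # limited by the 1,500-tweet cap
```

At 5 tweets an hour the 7-day window is the limit, while at 500 tweets an hour a single query only reaches back about 3 hours.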
Type of queries
There is a lot of overlap in capabilities in this area. Both the search and streaming APIs let you search for keywords or user names in tweets, and both have their own syntax for combining search terms in an AND or OR fashion. You can also search for tweets by longitude and latitude, if coordinates are attached to the tweet.
The differences are subtle. For example, the search API lets you search for an exact phrase using multiple words in a specific order, while the streaming API only allows for multiple word queries that match all the words in any order. Another difference is that the search API lets you specify the author of requested tweets by their screen name, while the streaming API wants user ids.
Complexity of queries
There are no documented limits on the allowed complexity of a search API request, but there are frequent complaints on the Twitter developers mailing list that requests with more than a dozen or so keywords often fail. The usual response from Twitter is to reduce the complexity. I guess that means that you can only tell if a search API request will handle what you want to search for by trying it out for yourself.
The streaming API is much more robust in this area. As the Twitter docs say:
The default access level allows up to 400 track keywords, 5,000 follow userids and 25 0.1-360 degree location boxes. Increased access levels allow 100,000 follow userids (“shadow” role), 400,000 follow userids (“birddog” role), 10,000 track keywords (“restricted track” role), 200,000 track keywords (“partner track” role), and 200 0.1-360 degree location boxes (“locRestricted” role).
You can request higher access levels for your Twitter app by sending a request to firstname.lastname@example.org.
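Since the streaming limits are documented, you can check a request against them before connecting. This Python sketch uses the default access numbers quoted above; the `track`, `follow`, and `locations` names and the function itself are my own labels for illustration, not part of any Twitter library.

```python
# Default access limits quoted from the Twitter docs above.
DEFAULT_LIMITS = {"track": 400, "follow": 5000, "locations": 25}


def check_filter_limits(track=(), follow=(), locations=(), limits=DEFAULT_LIMITS):
    """Return a list of human-readable problems; empty means within limits."""
    requested = {
        "track": len(track),
        "follow": len(follow),
        "locations": len(locations),
    }
    return [
        f"{name}: {count} requested, limit is {limits[name]}"
        for name, count in requested.items()
        if count > limits[name]
    ]
```

Passing a different `limits` dictionary lets you reuse the same check for an account with an increased access level.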
Limitations in data returned by search API
The search API has a number of serious deficiencies in this area.
- The biggest problem is that the user id of the account that posted a tweet is incorrect in search results. Yes, you read that correctly. When you get a tweet from the search API, it includes the user id, screen name, user name, and profile image URL of the person who tweeted. Unfortunately the user id cannot be trusted. It randomly returns the wrong value. Even more amazing, this is a known bug that has been in place for a long time, probably since the start of this API. You can overcome this problem by using the screen name, which is correct, to look up the proper user id, but that requires additional REST API calls, which are rate limited.
- Another limitation is the frequent failure of the since_id parameter that is part of a search API call. In theory, each request returns a since_id value that can be used to request only more recent tweets on subsequent calls. In practice this parameter is often ignored. I get around this by checking all tweets returned by a search API call, and stopping the requests when a tweet with a status id that is already in my database is returned. Since search API tweets are returned in date order, this method works well.
- The streaming API delivers a very complete set of user data with each tweet that gives you everything you’d want to know, such as the user’s follower and friends counts, and number of tweets made since the user’s account started. This data is not included with the search API results, although it can be obtained with additional REST API requests.
- The search API also fails to deliver tweet entities, a useful package of parsed user mentions, tags, and URLs found within a tweet. These values are available in a streaming API result, and largely eliminate the need to parse them in your own code. As someone who avoids regular expression programming whenever possible, I consider this a major weakness of the search API.
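The since_id workaround described in the list above can be sketched in Python as follows. It assumes results arrive newest first and that each tweet is a dictionary with an `id` key. `fetch_page` is a hypothetical callable standing in for whatever code makes the actual search request.

```python
def collect_new_tweets(fetch_page, known_ids, max_pages=15):
    """Walk result pages, newest first, until a stored tweet is seen.

    fetch_page(page) is a hypothetical callable returning a list of dicts
    with an "id" key; known_ids is the set of status ids already stored.
    """
    new_tweets = []
    for page in range(1, max_pages + 1):
        tweets = fetch_page(page)
        if not tweets:
            break  # no more results
        for tweet in tweets:
            if tweet["id"] in known_ids:
                return new_tweets  # everything after this is already stored
            new_tweets.append(tweet)
    return new_tweets
```

Stopping at the first known id also saves requests, since later pages are never fetched once the overlap is found.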
Sampling the entire Twitter stream
The streaming API is unique in delivering a subset of the entire flow of tweets passing through Twitter. The size of this sample has changed over time, but it does make it possible to estimate things like popular URLs and trending topics by examining this subset. The search API has no comparable feature.
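As an illustration of what you can do with the sample, this Python sketch counts the most frequently shared URLs in a batch of sampled tweets. It assumes each tweet carries an `entities` dictionary with a `urls` list whose items have `expanded_url` or `url` keys; treat those field names as assumptions and check them against the data you actually receive.

```python
from collections import Counter


def top_urls(tweets, n=10):
    """Count URLs across sampled tweets and return the n most common."""
    counts = Counter()
    for tweet in tweets:
        for item in tweet.get("entities", {}).get("urls", []):
            # Prefer the expanded form so shortened links are grouped together.
            url = item.get("expanded_url") or item.get("url")
            if url:
                counts[url] += 1
    return counts.most_common(n)
```

Because the input is only a sample, the counts are useful for ranking URLs against each other, not as absolute totals.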
Authentication
The search API does not require any authentication, but rate limiting is based on the IP address of the server making the request. This is a form of authentication, since Twitter has blocked abusive IPs. The good thing is that server IPs can easily be changed, making blacklisting by Twitter easy to overcome. Not that I would ever do such a thing.
Connections to the streaming API are done on an account basis, by using either basic authentication with the username and password of a Twitter account, or through OAuth using an account’s access tokens. Basic authentication has been turned off for all other parts of the Twitter API, but for some reason it is still allowed in this case.
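For reference, basic authentication is nothing more than a base64-encoded header, as this Python sketch shows. This is standard HTTP rather than anything specific to Twitter, the function name is my own, and the credentials in the usage line are placeholders.

```python
import base64


def basic_auth_header(username, password):
    """Build the HTTP Basic Authorization header for a username/password."""
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return {"Authorization": "Basic " + token}


headers = basic_auth_header("Aladdin", "open sesame")  # placeholder credentials
```

Keep in mind that base64 is an encoding, not encryption, so this header should only be sent over an HTTPS connection.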
Long-term viability of search API
I have real doubts about the search API as a long-term feature. The search API was added to Twitter’s code base through the acquisition of the Summize.com site. The fact that bugs in this API, such as the user_id return value and the since_id query parameter, have remained broken for more than a year, and that tweet entities were never added to the search API return data, tells me that this is a set of code that the Twitter development team either can’t or won’t modify. I’ve seen this in other examples of companies integrating acquired code. I’m willing to bet that the search API will remain in place for a few years longer, but will never be improved. The long-term future lies with the streaming API.
Choosing between the search API and streaming API
If you have a large-scale tweet aggregation project with many keywords or users to track, and you want lightning-fast delivery of tweets, the streaming API is without doubt the way to go. On the other hand, the search API is so much easier to implement that it makes sense for simple, proof-of-concept projects where you just want to see what data is available for a small set of keywords.
Another place where the search API is essential is “back filling” an aggregation database. If a client asks you to start tracking a new set of keywords, you can use the search API to gather older tweets as a base, and then use the streaming API to collect all future tweets.
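When you combine the two sources this way, some tweets will arrive from both the back fill and the stream. One way to handle the overlap, sketched here in Python with SQLite, is to make the status id the primary key and let the database ignore duplicates. The table and column names are hypothetical, and a production aggregation database would store far more fields.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open a tweet store whose primary key is the status id."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS tweets "
        "(id INTEGER PRIMARY KEY, text TEXT, source TEXT)"
    )
    return db


def save_tweets(db, tweets, source):
    """Insert tweets, silently skipping status ids that are already stored."""
    with db:  # commit on success, roll back on error
        db.executemany(
            "INSERT OR IGNORE INTO tweets (id, text, source) VALUES (?, ?, ?)",
            [(t["id"], t["text"], source) for t in tweets],
        )


db = open_store()
save_tweets(db, [{"id": 1, "text": "old"}, {"id": 2, "text": "older"}], "search")
save_tweets(db, [{"id": 2, "text": "older"}, {"id": 3, "text": "new"}], "stream")
stored = db.execute("SELECT id, source FROM tweets ORDER BY id").fetchall()
```

With this approach the order in which the back fill and the stream run does not matter, since whichever source delivers a tweet first wins.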