Aggregating tweets: Search API vs. Streaming API
As part of upgrading the 140dev.com site for API 1.1, this page has replaced the old posting on this subject.
My attitude towards the search API has changed since the early days when Twitter bought it from Summize. I had real doubts back then as to Twitter’s ability and inclination to integrate the search API correctly. I was wrong. Search has been fixed, strengthened and integrated. Both search and streaming APIs are now essential parts of Twitter programming for Tweet collection. It is still useful to lay out their differences side by side. It isn’t a matter of using one or the other. Now you need to know when to use each API for maximum efficiency.
Past vs. Future
This is the essential difference between the two APIs. Search goes back in time and streaming goes forward. When you first decide to collect tweets on a subject, you have nothing to start with. At that point the search API is the best way to fill in the past 7 days for your subject matter. This is often called back-filling. Once your database has caught up to the present moment, you can turn on the streaming API and capture all tweets going forward.
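The back-fill-then-stream sequence can be sketched as follows. This is only a minimal illustration, not a real client: `collect`, `backfill`, `live` and `store` are hypothetical names, the two inputs stand in for whatever code actually calls the search and streaming APIs, and the only assumption about a tweet is that it carries a unique `id`.

```python
from typing import Any, Dict, Iterable

Tweet = Dict[str, Any]


def collect(backfill: Iterable[Tweet],
            live: Iterable[Tweet],
            store: Dict[int, Tweet]) -> Dict[int, Tweet]:
    """Fill `store` from past tweets first, then from the live stream."""
    # Phase 1: back-fill from search results (the past week or so).
    for tweet in backfill:
        store.setdefault(tweet["id"], tweet)
    # Phase 2: stream going forward. Ids that are already stored are
    # skipped, so any overlap between the phases creates no duplicates.
    for tweet in live:
        store.setdefault(tweet["id"], tweet)
    return store
```

Keying the store on the tweet id means it does not matter if the last few searched tweets also show up at the start of the stream.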
Both need OAuth to connect to Twitter
The search API used to be limited just by IP address with no login required, but since version 1.1 was released, you have to log in with OAuth for all requests, including search and streaming. If you never learned how to use OAuth, this free e-book will get you started.
Rate limit rules are completely different
There is a subtle, yet very powerful difference between the search and streaming APIs when it comes to OAuth tokens. You are only allowed to make a single streaming API OAuth connection for each Twitter account that owns the app. No matter how many people log into your site, the app they all log into can only use a single streaming API connection. The search API, on the other hand, allows a separate rate-limited bucket of requests for each user who logs into your app. There are many implications of this difference, but I can’t digress to cover them all here. The takeaway is that you have to plan your rate limit utilization with these two sets of limits in mind. You might find it more effective to use both APIs for different portions of your collection process.
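To make the per-user bucket idea concrete, here is a rough sketch of how an app might track its search budget for each user token. `SearchBudget` and `try_spend` are hypothetical names, and the window figures are assumptions: 180 requests per 15-minute window matches the 720 requests per hour used later in this article, but check the current rate limit documentation before relying on it.

```python
import time
from typing import Callable, Dict, Tuple

WINDOW_SECONDS = 15 * 60   # assumed length of one rate-limit window
REQUESTS_PER_WINDOW = 180  # assumed per-user search allowance per window


class SearchBudget:
    """Tracks one bucket of search requests per user token."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        # token -> (window start time, requests used in that window)
        self._buckets: Dict[str, Tuple[float, int]] = {}

    def try_spend(self, token: str) -> bool:
        """Count one request and return True if the token has budget left."""
        now = self._clock()
        start, used = self._buckets.get(token, (now, 0))
        if now - start >= WINDOW_SECONDS:
            start, used = now, 0  # window expired, start a fresh one
        if used >= REQUESTS_PER_WINDOW:
            self._buckets[token] = (start, used)
            return False
        self._buckets[token] = (start, used + 1)
        return True
```

Each user token gets its own counter, so one user running out of search requests does not block searches made on behalf of another. The streaming connection needs no such bookkeeping, because there is only one of it.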
Data formats are almost the same
I would never say that the search and streaming APIs return data in exactly the same format, but the differences are small enough not to matter. You might have to tweak the collection script for each API, but the dependencies are isolated in those two scripts. The key is that the data each API returns is the same, even when their JSON return structures aren’t. You can safely mix data from both search and streaming in the same database. Once that is done, it doesn’t matter where the data originally came from.
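Here is a hedged sketch of what isolating those differences might look like. It assumes the API 1.1 shapes as I understand them: a search response wraps its tweets in a `statuses` list, while the stream delivers one JSON object per line, mixed with blank keep-alive lines and non-tweet messages. The function names and the column choice in `to_row` are made up for illustration.

```python
import json
from typing import Any, Dict, List, Optional

Tweet = Dict[str, Any]


def tweets_from_search(payload: Dict[str, Any]) -> List[Tweet]:
    """Pull the tweet objects out of one search response."""
    # Assumption: search wraps its tweets in a "statuses" list.
    return list(payload.get("statuses", []))


def tweet_from_stream_line(line: str) -> Optional[Tweet]:
    """Parse one line of streaming output, or return None if not a tweet."""
    line = line.strip()
    if not line:
        return None  # blank keep-alive line
    message = json.loads(line)
    # Assumption: tweets carry "id" and "text"; other messages
    # (such as delete or limit notices) do not and are skipped here.
    if isinstance(message, dict) and "id" in message and "text" in message:
        return message
    return None


def to_row(tweet: Tweet) -> Dict[str, Any]:
    """Reduce a tweet from either API to the columns kept in the database."""
    return {
        "tweet_id": tweet["id"],
        "text": tweet["text"],
        "screen_name": tweet["user"]["screen_name"],
        "created_at": tweet["created_at"],
    }
```

Only the first two functions know which API a tweet came from. Everything downstream of `to_row` sees the same row either way.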
Search API has more powerful queries
The search API has a fairly rich set of operators that can filter results based on attributes like location of sender, language, and various popularity measurements. The streaming API has a more limited approach of only collecting tweets containing words, sent by specific accounts, or within a geographic area.
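The difference shows up in the request parameters each API accepts. The sketch below only builds parameter dictionaries and sends nothing. The parameter names (`q`, `lang`, `geocode`, `result_type` for search; `track`, `follow`, `locations` for streaming) are the ones I recall from the API 1.1 documentation, so treat them as assumptions and verify them before use. The helper names are hypothetical.

```python
from typing import Dict, Optional, Sequence, Tuple

# One bounding box: west, south, east, north (longitude/latitude).
BoundingBox = Tuple[float, float, float, float]


def search_params(query: str,
                  lang: Optional[str] = None,
                  geocode: Optional[str] = None,
                  result_type: str = "recent") -> Dict[str, str]:
    """Build search parameters; operators such as from: go inside `query`."""
    params = {"q": query, "result_type": result_type}
    if lang is not None:
        params["lang"] = lang
    if geocode is not None:
        params["geocode"] = geocode  # "latitude,longitude,radius"
    return params


def stream_params(track: Sequence[str] = (),
                  follow: Sequence[int] = (),
                  locations: Sequence[BoundingBox] = ()) -> Dict[str, str]:
    """Build streaming filter parameters: words, accounts, or areas only."""
    params: Dict[str, str] = {}
    if track:
        params["track"] = ",".join(track)
    if follow:
        params["follow"] = ",".join(str(user_id) for user_id in follow)
    if locations:
        params["locations"] = ",".join(
            str(value) for box in locations for value in box)
    return params
```

Search gets a query language plus extra filters, while the stream only gets three flat lists.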
Search API can collect a wider range of data
The targets for tweet collection vary in several ways. The streaming API can collect all tweets that contain any of up to 400 keyword phrases, were sent by any of up to 5,000 accounts, or originated in any of up to 25 geographic areas. The exact limits on search API queries aren’t documented, but a good estimate is that a query cannot contain more than 15-20 keywords. On the other hand, you can make up to 12 search API requests a minute (720 an hour). That works out to roughly 180 to 240 keywords being searched each minute, or about 11,000 to 14,000 keywords an hour. It is possible to switch the streaming keywords, but not at as high a rate as search.
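Spreading a long keyword list across many search requests might look like the sketch below. `build_queries` and `keywords_per_hour` are hypothetical helpers. The default of 15 terms per query is just the low end of the estimate above, not a documented limit, and the request rate is left as a parameter because it depends on the rate limit in force.

```python
from typing import List, Sequence


def build_queries(keywords: Sequence[str], per_query: int = 15) -> List[str]:
    """Split keywords into OR-joined queries of at most `per_query` terms."""
    if per_query < 1:
        raise ValueError("per_query must be at least 1")
    queries = []
    for start in range(0, len(keywords), per_query):
        chunk = keywords[start:start + per_query]
        # Quote multi-word phrases so they are matched as phrases.
        terms = ['"%s"' % word if " " in word else word for word in chunk]
        queries.append(" OR ".join(terms))
    return queries


def keywords_per_hour(requests_per_minute: int, per_query: int) -> int:
    """How many keywords can be searched in an hour at a given request rate."""
    return requests_per_minute * per_query * 60
```

Each string returned by `build_queries` costs one search request, so rotating through the list is what lets search cover far more keywords than a single streaming filter can.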
Streaming API usually returns a much higher flow of tweets
Another limit that isn’t precisely documented is the total flow from the streaming API. The docs say only that you get up to 1% of the full firehose of tweets. I’ve found that the streaming API has maxed out at around 3,000 tweets a minute, although that may have changed. This delivers a maximum flow of 180,000 tweets an hour. The search API returns up to 100 tweets per search and allows 720 requests per hour, giving us a max of 72,000 tweets per hour. On the other, other hand, if each user who logs into your app asks you to make search requests, then you can get up to 72,000 tweets per hour for every user.
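The arithmetic is easy to play with in a few lines of code. The defaults below are simply the figures quoted in this section (3,000 streamed tweets a minute as an observed ceiling, 100 tweets per search, 720 searches an hour), not guaranteed limits, and the function names are made up for this sketch.

```python
def stream_tweets_per_hour(tweets_per_minute: int = 3000) -> int:
    """Maximum hourly flow from the single streaming connection."""
    return tweets_per_minute * 60


def search_tweets_per_hour(users: int = 1,
                           requests_per_hour: int = 720,
                           tweets_per_request: int = 100) -> int:
    """Maximum hourly flow from search when each user has a separate bucket."""
    return users * requests_per_hour * tweets_per_request


def users_to_match_stream() -> int:
    """Smallest number of users whose search buckets reach the stream maximum."""
    per_user = search_tweets_per_hour(users=1)
    # Ceiling division without floating point.
    return -(-stream_tweets_per_hour() // per_user)
```

With these figures, search buckets from three logged-in users add up to 216,000 tweets an hour, which is already more than the single stream can deliver.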
You can see that this comparison is not as easy as it once was. If you need to squeeze out the maximum results from Twitter, you need to juggle the various factors to get the best combination of both search and streaming API calls. In the simplest case where you have a relatively fixed set of keywords, you should first run a search to collect the old tweets going back a week or so. Then turn on the streaming API for the same keywords.