Since the announcement of the Twitter-Gnip partnership, there have been lots of news stories and blog posts stating that this is the end of the independent developer, because there is no more free Twitter data. This is completely wrong. You can get all the Twitter data you need, as long as you don’t want *all* the Twitter data. What Twitter is selling through Gnip is up to 50% of the full Firehose, which means 50% of all tweets. That is 50 million tweets a day at the present time. Twitter is also selling the entire Firehose to search engines, like Google and Bing.
Nobody is going to convince me that an independent consultant or a private corporation needs a copy of every single tweet. What these people need is all the tweets for a specific set of keywords or from specific users, and that is still free through the streaming API. Using the /statuses/filter request you can get all the tweets for up to 400 keywords and 5,000 users. All you have to do is decide which words or users you need to track when you make the request.
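To make that concrete, here is a minimal sketch of how the parameters for a /statuses/filter request could be assembled. The `track` and `follow` parameter names are the ones the streaming API uses. The function name `build_filter_params` and the two constants are made up for illustration, and the limits are simply the 400 and 5,000 quoted above, so check them against your own access level. The sketch only builds the form-encoded request body. It does not open a connection or handle authentication.

```python
from urllib.parse import urlencode

# Limits quoted in the post; assumed here to be the default access level.
MAX_KEYWORDS = 400
MAX_USERS = 5000


def build_filter_params(keywords=(), user_ids=()):
    """Return a form-encoded body for a statuses/filter request.

    Hypothetical helper: it validates the list sizes and joins each
    list with commas, which is how track and follow values are passed.
    """
    keywords = list(keywords)
    user_ids = list(user_ids)

    if not keywords and not user_ids:
        raise ValueError("give at least one keyword or user id")
    if len(keywords) > MAX_KEYWORDS:
        raise ValueError("too many keywords: %d" % len(keywords))
    if len(user_ids) > MAX_USERS:
        raise ValueError("too many user ids: %d" % len(user_ids))

    params = {}
    if keywords:
        params["track"] = ",".join(keywords)
    if user_ids:
        params["follow"] = ",".join(str(uid) for uid in user_ids)

    # urlencode percent-encodes the commas and any spaces in keywords.
    return urlencode(params)


if __name__ == "__main__":
    print(build_filter_params(["gnip", "firehose"], [12, 13]))
```

The point of validating up front is that the decision about which words and users to track is made once, when the request is built, rather than after the tweets arrive.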
What Twitter won’t let you do is try to grab every single tweet, store them in a database, and then deliver them for selected keywords or users. That is the definition of a search engine. If you really have the bandwidth and server capacity needed to do this for 100,000,000 tweets each day, why in the world should Twitter deliver this to you for free and foot the bill for its side of the bandwidth and server capacity? That is just absurd. But then a lot of Web 2.0 was exactly that. Thankfully, it is now over.
This ability to get all the tweets, as long as you limit the keywords and users to a reasonable number, was restated today by John Kalucki of the Twitter API team. A developer asked on the API mailing list:
If I am using the statuses/filter streaming API, with a “track=” query
that is not overly broad, and my client never receives any “limit”
responses, can I assume that the results returned represent all the
results from the entire firehose? In other words, in the absence of
“limit” response, is my visibility into the firehose 100%?
Kalucki replied:
Yes, where firehose is the stream of all public statuses, with some low-quality accounts removed.
From my usage of the streaming API, this is correct.
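If you want to check this for your own stream, one way is to watch for the “limit” responses mentioned in the question. The sketch below assumes each line of the stream is a JSON object, that a limit notice looks like `{"limit": {"track": N}}` with N being the running count of undelivered tweets since the connection opened, and that ordinary tweets carry a top-level `text` field. The function name `scan_stream` is made up for illustration, and it works on any iterable of lines rather than a live connection.

```python
import json


def scan_stream(lines):
    """Split raw stream lines into tweets and a count of missed tweets.

    Hypothetical helper. Returns (tweets, missed), where missed is the
    highest undelivered count seen in any limit notice, or 0 if the
    stream never sent one.
    """
    tweets = []
    missed = 0

    for line in lines:
        line = line.strip()
        if not line:
            # Blank lines are keep-alives, not data.
            continue

        message = json.loads(line)

        if "limit" in message:
            # Assumed shape: {"limit": {"track": <undelivered so far>}}
            missed = max(missed, message["limit"].get("track", 0))
        elif "text" in message:
            tweets.append(message)
        # Anything else (delete notices, etc.) is ignored in this sketch.

    return tweets, missed


if __name__ == "__main__":
    sample = [
        '{"id": 1, "text": "first tweet"}',
        "",
        '{"limit": {"track": 5}}',
        '{"id": 2, "text": "second tweet"}',
    ]
    tweets, missed = scan_stream(sample)
    print(len(tweets), "tweets received,", missed, "missed")
```

If `missed` stays at 0 for the life of the connection, you are in the situation the question describes: no “limit” responses, so nothing matching your filter was held back.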
But what about even higher limits? Shouldn’t data be free? Maybe it should be in a perfect world, but in the real world bandwidth, servers, and labor aren’t. If you actually need more than the default level of access, you can request a higher level, which is often granted for free. If you need all of Twitter’s data, you should share Twitter’s costs, because you had better have a business model that supports your side of the costs.
And if you need source code for gathering tweets from the streaming API and storing them in a database, that is still free too.