When new Twitter consulting clients ask me to plan a tweet collection database, the first question they always ask is how much it will cost. I can give them a rough estimate for the cost of my programming time based on their desired features, but it is impossible to know how much server power they will have to pay for without testing first.
Calling the REST API or the Search API is predictable, because there is a one to one correspondence between what you ask for and receive. The Streaming API, on the other hand, is completely unpredictable. The only thing you can be sure of is that the maximum you will receive is 1% of the total tweet flow or 3.5 million tweets a day. Exactly how many you will receive from the Streaming API up to that limit is dependent on the keywords and accounts you choose to follow.
The average Twitter account in our various tweet databases has sent about 6 tweets a day since they were created, but each account is allowed to send up to 1,000 tweets a day, and the Streaming API also delivers retweets. @JustinBieber, for example, can get 10,000 to 20,000 retweets for a single tweet, and @BarackObama has gotten as many as 40,000 retweets. So if you follow the maximum of 5,000 accounts with the Streaming API, the flow could range from an average of 30,000 a tweets a day up to the streaming limit of 3.5 million tweets.
The truly variable flow is when tracking keywords with the Streaming API. You can get tweets for up to 400 keywords or phrases, but there is no reliable way to predict the amount. You have to collect tweets for a week or two, and see what you get. One way to speed up this evaluation process by using the search API to see what the daily average has been for the last few days. The search API only handles about 10 keywords at a time, so you will have to break up your queries into pieces of that size.
Even when you have some data on the normal flow for keywords, you have to be prepared for bursts. I’ve written about bursts before. There are lots of techniques for handling them, ranging from getting the biggest server you can afford to dropping any tweets that exceed a predetermined hourly limit.
So how do I synthesize all these ideas to tell a client what they will need to spend on servers for their Twitter application? My general approach is to use a cloud service, like Rackspace, and start with the smallest server instance possible. Then I build a first version of the tweet collection code and start collecting stats on the flow from each user and keyword. Once I have a good handle on the average, I upsize the server to an amount of memory, disk and CPU that I know will handle that average. Then I add an initial set of burst control techniques until I get a better idea of the long-term variability. If the slow is high enough to require more than 4G of RAM, I find that a dedicated server is more cost effective, but starting with the cloud server is a good way to ramp up slowly.
The important take away for Twitter consultants is that you cannot know what you will need to handle a tweet collection project until you do the real world testing.