Yesterday’s streaming API post described a multiple-server model for handling high-rate tweet collection. Today I’d like to cover a different architecture that addresses this problem with a single server running multiple databases.
Let’s say you want to display tweets for the most active stocks each day. The streaming API lets you collect tweets for 400 keywords, or in this case, the 400 most active stock symbols. That produces a high flow rate and a large database, which is far more than you need to query if your site only displays tweets for 20 or 30 stocks at any one time.
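As a quick sketch, the streaming API’s track parameter takes a comma-separated keyword list capped at 400 terms. The helper below is hypothetical, and it assumes plain symbol text is what you track (the streaming API matches plain terms, so a `$` prefix is stripped here):

```python
# Hypothetical helper: build the streaming API "track" parameter from
# a ranked list of stock symbols, keeping only the first 400.
def build_track_param(symbols, limit=400):
    # Strip any "$" prefix so we track the plain symbol text.
    terms = [s.lstrip("$") for s in symbols[:limit]]
    return ",".join(terms)
```

You would pass the resulting string as the `track` value when opening the streaming connection.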
A solution is to store all the tweets, users and related data you receive for all 400 stocks in one database; we’ll call it tweet_collect. You can then create a separate database, called tweet_serve, and have your code copy just the tweets for active stocks into it as they arrive. Your website only needs to read from tweet_serve, which will be much smaller and therefore deliver query results faster.
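The routing logic can be sketched like this, using in-memory SQLite to stand in for the two databases. The table schema, column names, and `store_tweet` function are all illustrative assumptions, not part of the post’s design:

```python
import sqlite3

# Stand-ins for the two databases described above.
collect = sqlite3.connect(":memory:")  # tweet_collect: everything
serve = sqlite3.connect(":memory:")    # tweet_serve: active stocks only

for db in (collect, serve):
    db.execute("CREATE TABLE tweets (id INTEGER PRIMARY KEY, symbol TEXT, text TEXT)")

# The 20 or 30 stocks the site currently displays.
ACTIVE = {"AAPL", "GOOG"}

def store_tweet(tweet_id, symbol, text):
    # Every tweet lands in tweet_collect...
    collect.execute("INSERT INTO tweets VALUES (?, ?, ?)", (tweet_id, symbol, text))
    # ...but only tweets for active stocks are copied to tweet_serve.
    if symbol in ACTIVE:
        serve.execute("INSERT INTO tweets VALUES (?, ?, ?)", (tweet_id, symbol, text))
```

The website then queries only the `serve` connection, which stays small no matter how much history accumulates in `collect`.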
When a new stock becomes active, you will already have its tweets available in tweet_collect, so you can quickly copy its tweets to tweet_serve and be ready to display on the site. When the stock is no longer active, you can delete its data from tweet_serve.
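The activate/deactivate steps might look like the sketch below, again using in-memory SQLite and an assumed schema; the function names are mine, not from the post:

```python
import sqlite3

# Stand-ins for tweet_collect and tweet_serve.
collect = sqlite3.connect(":memory:")
serve = sqlite3.connect(":memory:")
for db in (collect, serve):
    db.execute("CREATE TABLE tweets (id INTEGER PRIMARY KEY, symbol TEXT, text TEXT)")

def activate(symbol):
    # Backfill tweet_serve with the history already sitting in tweet_collect.
    rows = collect.execute(
        "SELECT id, symbol, text FROM tweets WHERE symbol = ?", (symbol,)
    ).fetchall()
    serve.executemany("INSERT INTO tweets VALUES (?, ?, ?)", rows)

def deactivate(symbol):
    # Drop the stock's rows once the site no longer displays it.
    serve.execute("DELETE FROM tweets WHERE symbol = ?", (symbol,))
```

Because the backfill reads from data you already collected, a newly active stock can appear on the site immediately rather than waiting for new tweets to arrive.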
The limitation of this technique is that you are restricted to topics that can be covered adequately within the 400-keyword limit. As long as this fits your application needs, this model will produce a much faster website display.
When a keyword becomes active that isn’t in your normal collection list, you can fill in its data with the search API as needed. Search isn’t as powerful as streaming for large amounts of data, but if you need ad hoc collection of tweets for a few extra keywords, it does a good job. You can query it up to 720 times an hour and request tweets for about 10 to 15 keywords each time. These tweets would also go into the tweet_serve database.
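A small sketch of how those numbers work out: the search API’s OR operator lets you join several keywords into one query, and 720 requests per hour is one request every 5 seconds. The batching helper and batch size below are illustrative assumptions:

```python
# Hypothetical helper: group extra keywords into search queries of
# roughly 10-15 terms each, joined with the search API's OR operator.
def batch_queries(keywords, per_query=12):
    batches = [keywords[i:i + per_query] for i in range(0, len(keywords), per_query)]
    return [" OR ".join(b) for b in batches]

# 720 requests/hour budget -> pause this long between search calls.
SECONDS_BETWEEN_REQUESTS = 3600 / 720  # 5 seconds
```

Each returned query string would be sent to the search API on that schedule, and the resulting tweets written into tweet_serve alongside the streamed ones.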