Now that I’ve upgraded the streaming API framework to make it easier to manage keyword tweet collection, the next step is handling the increased data flow that results from more keywords. One simple solution is to upgrade your server. MySQL loves as much RAM as you can give it, and switching to a solid-state drive is another fix I highly recommend. But building one monstrous server may not be the most cost-effective solution, especially if you are operating “in the cloud”. Cloud servers get really expensive when you try to load them up with lots of RAM.
An alternative worth considering is to distribute your tweet collection across more than one server, each of which doesn’t have to be that powerful. The result is often more bang for the buck. I’m going to cover some of the multiple-server architectures I’ve built for various projects over the past few years.
One solution is to dedicate one server to tweet collection, and another to data mining and data processing. I tend to call the first one the collection server, and the second the db server. In terms of my streaming API code, I would put a database with just the json_cache table on the collection server. The only code running on this machine would be get_tweets.php, which writes new tweets to its copy of json_cache. The db server would have the complete database schema, including its own copy of json_cache. It would run parse_tweets.php and any other database code you need, such as queries for a web interface to display the tweets.
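As a sketch, the collection server’s database really can be this small. The column names below are hypothetical, not necessarily the framework’s actual schema; the point is that this server only needs to store raw JSON, not the full parsed tables.

```sql
-- Hypothetical minimal json_cache table for the collection server.
-- All the collection server does is append raw tweet JSON here.
CREATE TABLE json_cache (
  cache_id  BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  raw_tweet MEDIUMTEXT NOT NULL,  -- the tweet's JSON exactly as received
  PRIMARY KEY (cache_id)
);
```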
The goal is to give the db server only as many new tweets as it can handle while maintaining good parsing and query performance. This can be done with a script that copies new tweets from json_cache on the collection server to json_cache on the db server, then deletes those tweets from the collection server. The db server parses the new tweets it finds in its copy of json_cache, just the way it normally does. The nice thing is that, apart from the script that transfers tweets between servers, none of the code changes.
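The transfer script itself isn’t part of the framework, but a shell sketch of the copy-then-delete step might look like this. The database name, the internal host name `db-internal`, and the `cache_id` column are all assumptions for illustration, not the framework’s actual names.

```shell
#!/bin/sh
# Hypothetical transfer script, run periodically on the collection server.
# Assumes: a database `tweets`, a json_cache table with an auto-increment
# cache_id column, and the db server reachable as `db-internal` over the
# webhost's private network.

BATCH=5000  # tweets to move per run; tune to the db server's parsing speed

# Find the highest cache_id among the oldest BATCH buffered tweets.
MAX_ID=$(mysql -N tweets -e \
  "SELECT COALESCE(MAX(cache_id), 0)
     FROM (SELECT cache_id FROM json_cache
           ORDER BY cache_id LIMIT $BATCH) AS oldest")

[ "$MAX_ID" -eq 0 ] && exit 0  # buffer is empty, nothing to do

# Copy that batch into the db server's json_cache over the internal network...
mysqldump --no-create-info --where="cache_id <= $MAX_ID" tweets json_cache \
  | mysql -h db-internal tweets

# ...and only delete locally after the copy succeeds.
mysql tweets -e "DELETE FROM json_cache WHERE cache_id <= $MAX_ID"
```

Deleting only after the copy completes means a failed transfer leaves the tweets safely buffered for the next run.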
In effect, the collection server is now a buffer, holding new tweets as they arrive from the streaming API and protecting the db server from being crushed by too high a flow or a sudden burst. The transfer rate from the collection server to the db server can be managed with a timetable that moves more tweets at night, when the db server is unlikely to be handling user requests. During the day, if the flow is too fast to parse, the number of tweets stored on the collection server rises; at night the higher transfer rate draws the buffer back down.
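One way to implement that timetable is with cron on the collection server, running a transfer script more aggressively overnight. The script name and batch sizes here are made up for illustration:

```shell
# Hypothetical crontab on the collection server.
# Daytime (8am-11pm): small batches every 10 minutes, keeping the
# db server responsive to user queries.
*/10 8-23 * * *  /usr/local/bin/transfer_tweets.sh 2000
# Overnight (midnight-7am): large batches every 5 minutes to drain
# whatever buffer built up during the day.
*/5  0-7  * * *  /usr/local/bin/transfer_tweets.sh 20000
```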
For maximum performance and minimum cost, make sure the two servers communicate through the webhost’s internal network. You don’t want to pay bandwidth charges to move this data across the public internet, which would also be a lot slower.
The benefit of this model is that as long as you only transfer new tweets to the db server at a rate it can handle, you are guaranteed an acceptable level of performance. A sudden trending topic or other increase in flow would impact the collection server, but have no effect on the db server. You don’t have to build up the db server’s hardware to handle the largest possible burst. That can save money, even with the addition of the collection server. The collection server can be kept small, since all it does is grab tweets from the API and insert them into json_cache.
The obvious downside of this architecture is the lag between when tweets arrive from the API and when they become available for queries on the db server. That is fine for an application doing long-term analysis, but may not be acceptable for a site that needs to display new tweets in real time.
In future posts I’ll cover other server architectures that fit different application requirements.