<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>140dev &#187; Server configuration</title>
	<atom:link href="http://140dev.com/twitter-api-programming-blog/category/server-configuration/feed/" rel="self" type="application/rss+xml" />
	<link>http://140dev.com</link>
	<description>Twitter API Programming Tips, Tutorials, Source Code Libraries and Consulting</description>
	<lastBuildDate>Wed, 31 Jul 2019 10:03:15 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.6</generator>
		<item>
		<title>Streaming API: Multiple server collection architecture</title>
		<link>http://140dev.com/twitter-api-programming-blog/streaming-api-multiple-server-collection-architecture/</link>
		<comments>http://140dev.com/twitter-api-programming-blog/streaming-api-multiple-server-collection-architecture/#comments</comments>
		<pubDate>Wed, 12 Feb 2014 21:33:58 +0000</pubDate>
		<dc:creator>Adam Green</dc:creator>
				<category><![CDATA[Server configuration]]></category>
		<category><![CDATA[Streaming API]]></category>

		<guid isPermaLink="false">http://140dev.com/?p=2912</guid>
		<description><![CDATA[Now that I&#8217;ve upgraded the streaming API framework to make it easier to manage keyword tweet collection, the next step is handling the increased data flow that results from more keywords. One simple solution is to upgrade your server. MySQL loves as much RAM as it can be given, and switching to a solid state [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>Now that I&#8217;ve <a href="http://140dev.com/twitter-api-programming-blog/streaming-api-keyword-collection-enhancements-part-1/">upgraded</a> the <a href="http://140dev.com/free-twitter-api-source-code-library/">streaming API framework</a> to make it easier to manage keyword tweet collection, the next step is handling the increased data flow that results from more keywords. One simple solution is to upgrade your server. MySQL loves as much RAM as it can be given, and switching to a solid state drive is another fix that I highly recommend. But building one monstrous server may not be the most cost-effective solution, especially if you are operating &#8220;in the cloud&#8221;. Cloud servers get really expensive when you try to load up lots of RAM. </p>
<p>An alternative solution that should be considered is to distribute your tweet collection across more than one server, each of which may not be that powerful. The result is often more bang for the buck. I&#8217;m going to cover some possible multiple server architectures that I&#8217;ve built for various projects over the past few years. </p>
<p>One solution is to dedicate one server to tweet collection, and another to data mining and data processing. I tend to call the first one the collection server, and the second the db server. In terms of my streaming API code, I would put a database with just the json_cache table on the collection server. The only code running on this machine would be get_tweets.php, which writes new tweets to its copy of json_cache. The db server would have the complete database schema, including its own copy of json_cache. It would run parse_tweets.php and any other database code you need, such as queries for a web interface to display the tweets.</p>
<p>The goal is to only give the db server as many new tweets as it can handle while maintaining good parsing and query performance. This can be done by a script that copies new tweets from json_cache on the collection server to json_cache on the db server, then deletes these tweets from the collection server. The db server would parse the new tweets it finds in its copy of json_cache, just the way it normally does. The nice thing is that other than the code to transfer tweets between servers, none of the other code changes. </p>
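<p>The transfer script described above can be sketched in a few lines of PHP with mysqli. This is only a sketch under assumed names: the hosts, credentials, and the id/tweet_json columns are illustrative, not the exact framework schema.</p>

```php
<?php
// Hypothetical transfer script: copies a batch of raw tweets from the
// collection server's json_cache to the db server's copy, then clears
// the buffer. Host names, credentials, and column names are
// illustrative, not the exact framework schema.

// Build the cleanup statement once a batch has been copied.
function delete_sql($ids) {
    return 'DELETE FROM json_cache WHERE id IN (' . implode(',', $ids) . ')';
}

function transfer_batch($src, $dst, $batch = 500) {
    // Oldest tweets first, limited to what the db server can parse
    $res = $src->query("SELECT id, tweet_json FROM json_cache ORDER BY id LIMIT $batch");
    $ins = $dst->prepare('INSERT INTO json_cache (tweet_json) VALUES (?)');
    $ids = array();
    while ($row = $res->fetch_assoc()) {
        $ins->bind_param('s', $row['tweet_json']);
        $ins->execute();
        $ids[] = (int) $row['id'];
    }
    $ins->close();
    // Only delete from the buffer after the copy has succeeded
    if (count($ids) > 0) {
        $src->query(delete_sql($ids));
    }
    return count($ids);
}

// Usage (over the webhost's internal network):
// $src = new mysqli('collection.internal', 'user', 'pass', 'tweets');
// $dst = new mysqli('db.internal', 'user', 'pass', 'tweets');
// transfer_batch($src, $dst, 500);
```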
<p>In effect, the collection server is now a buffer, holding new tweets as they arrive from the streaming API and protecting the db server from being crushed by too high a flow or a sudden burst. The tweet transfer rate from the collection server to the db server can be managed by a timetable that transfers more tweets at night, when the db server is unlikely to be running user requests. During the day, the number of tweets stored on the collection server would rise if the flow were too fast to parse. Then at night the higher transfer rate would draw down the buffer. </p>
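<p>The timetable can be as simple as running the transfer script from cron every few minutes and keying the batch size to the hour of day. The hours and batch sizes below are invented for illustration; tune them to your own flow:</p>

```php
<?php
// Illustrative rate schedule: the hour ranges and batch sizes are
// invented for this example, not framework defaults.
function batch_size($hour) {
    // Overnight, drain the buffer aggressively
    if ($hour >= 1 && $hour < 7) {
        return 5000;
    }
    // During the day, trickle tweets so queries stay responsive
    return 500;
}

// The transfer script would call this with the current hour:
// $batch = batch_size((int) date('G'));
```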
<p>For maximum performance and minimum cost, you have to make sure the two servers can communicate through the webhost&#8217;s internal network. You don&#8217;t want to pay for bandwidth costs to move this data across the public internet, which would also be a lot slower. </p>
<p>The benefit of this model is that as long as you only transfer new tweets to the db server at a rate it can handle, you are guaranteed an acceptable level of performance. A sudden trending topic or other increase in flow would impact the collection server, but have no effect on the db server. You don&#8217;t have to build up the db server&#8217;s hardware to handle the largest possible burst. That can save money, even with the addition of the collection server. The collection server can be kept small, since all it does is grab tweets from the API and insert them into json_cache. </p>
<p>The obvious downside of this architecture is that there would be a lag between the time tweets arrive from the API and when they become available for queries on the db server. This is fine for an application that does long-term analysis, but may not be acceptable for a site that needs to display new tweets in real time. </p>
<p>I&#8217;ll cover other possible server architectures in future posts that can fit different application requirements. </p>
]]></content:encoded>
			<wfw:commentRss>http://140dev.com/twitter-api-programming-blog/streaming-api-multiple-server-collection-architecture/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The full streaming API stack</title>
		<link>http://140dev.com/twitter-api-programming-blog/the-full-streaming-api-stack/</link>
		<comments>http://140dev.com/twitter-api-programming-blog/the-full-streaming-api-stack/#comments</comments>
		<pubDate>Wed, 29 Jan 2014 16:07:47 +0000</pubDate>
		<dc:creator>Adam Green</dc:creator>
				<category><![CDATA[140dev Source Code]]></category>
		<category><![CDATA[Phirehose]]></category>
		<category><![CDATA[Server configuration]]></category>
		<category><![CDATA[Streaming API]]></category>
		<category><![CDATA[tmhOAuth]]></category>

		<guid isPermaLink="false">http://140dev.com/?p=2747</guid>
		<description><![CDATA[I&#8217;ve been spending the last few days helping people install the latest version of the streaming API framework. This has reminded me of how many moving parts there are, and how this can get in the way of building a mental model of what is actually going on. One of the biggest confusions seems to [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>I&#8217;ve been spending the last few days helping people install the latest version of the <a href="http://140dev.com/free-twitter-api-source-code-library/">streaming API framework</a>. This has reminded me of how many moving parts there are, and how this can get in the way of building a mental model of what is actually going on. One of the biggest confusions seems to be the idea that I wrote Twitter&#8217;s streaming API. Actually, all I&#8217;ve done is put a thin layer of code on top of a very deep stack. That code may tie things together, but there are many levels of code that need to be installed and configured. Let&#8217;s work our way up from the basic server level:</p>
<p>- <strong>Operating system</strong>. The streaming API code will run on *nix variants, Windows, and Mac OS X machines. Windows has its own unique quirks, but if you are willing to run a Windows machine as a Web server, you have already discovered that.<br />
- <strong>Apache</strong> must be installed and configured to run PHP. You should also configure Apache to run PHP within HTML pages. This is not always set by default.<br />
- <strong>PHP</strong> runs within Apache. You will need version 5.2 or greater. I&#8217;ve recently seen problems on Windows servers unless PHP 5.2.17 or greater is installed.<br />
- <strong>cURL</strong> is a library that runs within PHP and allows connections to remote servers, such as the Twitter API. You won&#8217;t need to call cURL directly in your code, but it is used by the Phirehose and tmhOAuth libraries. cURL should be enabled by default, but some webhosts turn it off.<br />
- <strong>MySQL</strong>. I try to use version 5.0 or greater.<br />
- The <a href="http://140dev.com/free-twitter-api-source-code-library/twitter-database-server/db-lib-php/">db_lib.php</a> code in the framework uses the <strong>mysqli</strong> PHP library to communicate with MySQL, so that must be installed within PHP.<br />
- <strong>Phirehose</strong> is the library that makes the actual connection to the streaming API in <a href="http://140dev.com/free-twitter-api-source-code-library/twitter-database-server/get-tweets-php/">get_tweets.php</a>. I didn&#8217;t write this, but the author, Fenn Bailey, allows me to include it in the framework&#8217;s source code. It lives <a href="https://github.com/fennb/phirehose/tree/master/lib">here</a>.<br />
- <strong>tmhOAuth</strong> is a library that lets you make OAuth calls to Twitter&#8217;s REST API, such as searching and reading timelines. It isn&#8217;t used by the streaming API framework, but it is part of my engagement programming code and many sample scripts on this site, so I&#8217;m including it here. It is written by Matt Harris and <a href="https://github.com/themattharris/tmhOAuth">lives here</a>.<br />
- Finally we get to my streaming API framework code, which rests on all this work by thousands of other people. Open source is an amazing thing, but finding the right path to an app isn&#8217;t easy at first. </p>
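<p>A quick way to confirm most of this stack at once is a short PHP script. The version threshold follows the list above (PHP 5.2+); the functions used are standard PHP, but treat the script as a sketch:</p>

```php
<?php
// Quick sanity check for the stack described above. Run it from the
// command line or a browser before installing the framework.
function stack_report() {
    return array(
        'php_ok'    => version_compare(PHP_VERSION, '5.2.0', '>='),
        'curl_ok'   => extension_loaded('curl'),
        'mysqli_ok' => extension_loaded('mysqli'),
    );
}

foreach (stack_report() as $check => $ok) {
    echo $check . ': ' . ($ok ? 'yes' : 'NO') . "\n";
}
```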
]]></content:encoded>
			<wfw:commentRss>http://140dev.com/twitter-api-programming-blog/the-full-streaming-api-stack/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Initial impressions of Rackspace&#8217;s cloud servers</title>
		<link>http://140dev.com/twitter-api-programming-blog/initial-impressions-of-rackspaces-cloud-servers/</link>
		<comments>http://140dev.com/twitter-api-programming-blog/initial-impressions-of-rackspaces-cloud-servers/#comments</comments>
		<pubDate>Mon, 20 Feb 2012 15:06:50 +0000</pubDate>
		<dc:creator>Adam Green</dc:creator>
				<category><![CDATA[Server configuration]]></category>
		<category><![CDATA[Tweet Aggregation]]></category>

		<guid isPermaLink="false">http://140dev.com/?p=1318</guid>
		<description><![CDATA[I have a bad habit acquired from my years as a Dot Com CTO. When the time comes to pick a server for a new project, I always overbuy. I&#8217;d rather pay a hundred dollars more per month than have a server that can&#8217;t take the load. One of the driving forces behind this decision [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>I have a bad habit acquired from my years as a Dot Com CTO. When the time comes to pick a server for a new project, I always overbuy. I&#8217;d rather pay a hundred dollars more per month than have a server that can&#8217;t take the load. One of the driving forces behind this decision is the time it takes to migrate to a more powerful server, if I discover that it is needed. So I have a collection of dedicated servers that I lease from various webhosts that collectively waste about $500 a month. That&#8217;s a small percentage when considered as a cost of doing business across all my clients, but it is still wasteful. </p>
<p>I&#8217;ve thought about moving to Amazon&#8217;s AWS system, but every time I look at the docs I get turned off. I&#8217;m an application builder, not a professional sysadmin. I have no problem managing a Linux based server, but when it comes to configuring and optimizing Apache and MySQL, I turn to professionals. The AWS docs make it clear that this system is built by people who LOVE tweaking servers. They also seem to love really detailed command driven operations spanning several lines, with very complex parameter names, with very odd capitalization. I couldn&#8217;t care less about that. If I could just say, &#8220;Create a new server instance, and make it this big.&#8221; I&#8217;d be thrilled. </p>
<p>That is what I have now found with <a href="http://rackspace.com">Rackspace.com</a>. Their cloud servers let me clone multiple server instances, and upsize or downsize them with a menu. I&#8217;m testing this for a client who wants to collect tweets that have a lot of flow. His search terms for the Twitter streaming API retrieve about 60,000 tweets an hour. If I had to lease a dedicated server for this, I would have spent at least $150 to $200 a month to be sure I could handle the load. Instead I had my sysadmin create the cheapest server instance at Rackspace, at $11 per month. The entire pricing structure is <a href="http://www.rackspace.com/cloud/cloud_hosting_products/servers/pricing/">here</a>. </p>
<p>Once the basic server configuration with all of my code was set up, I made a server image with Rackspace&#8217;s control panel that could be used to create a new server instance in minutes without having to pay my sysadmin again. I ran the tweet collection for a few hours, and found that this server size was too small. The server load went up above 3.0, and queries were completely stalled. All I had to do was ask for the next size server ($22 a month) using the menu, and 10 minutes later I was up again with the new configuration. This ran much better for inserting tweets, but queries were still too slow. So after a few hours of watching the server, I decided to bump it up again to the next size at $44. This configuration looks like it will work. Server load is about 0.5, and the queries we need to run complete in a few seconds. </p>
<p>Overall this has been a great experience. I love the idea of being able to size up gradually until I see the server handling a real-world load, and then downsizing if that load drops. I&#8217;m going to start moving some of my existing sites across to Rackspace next with the hope of saving at least $300 to $400 a month. Then I&#8217;ll be experimenting with clusters of different-sized servers to handle more complex site requirements. </p>
]]></content:encoded>
			<wfw:commentRss>http://140dev.com/twitter-api-programming-blog/initial-impressions-of-rackspaces-cloud-servers/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
