The previous post had an overview of my planned enhancements to the streaming API framework. The first step is adding some new tables to the 140dev tweet collection database:
collection_words
The words in this table will be used to collect matching tweets. We’ll see how to add this to the get_tweets.php script so that the collection list is automatically updated for the streaming API when the table changes. This means that you can add and remove words from the table, and have the new list collected without having to restart the get_tweets.php script.
CREATE TABLE IF NOT EXISTS `collection_words` ( `words` varchar(60) NOT NULL, `type` enum('words','phrase') NOT NULL DEFAULT 'words', `out_words` varchar(100) DEFAULT NULL, KEY `words` (`words`) ) ENGINE=MyISAM DEFAULT CHARSET=utf8;
The streaming API automatically ANDs multi-word entries, and ORs the full set of keywords. Let’s assume this table is given the following entries:
pizza recipe
cookbook
This will deliver tweets that contain:
pizza recipe OR cookbook
Multi-word phrases return tweets that contain all the words even if they are not next to each other. An entry of pizza recipe will return tweets that contain both of these words, no matter where they are in the tweet. The Twitter docs have more examples.
The table also contains a enum type field, so you can restrict results to a complete phrase. Let’s say you want to match “I love apple pie”, but not “I don’t want an apple in my blueberry pie.” You can set the words field to apple pie and the type to phrase.
False positives can be a real problem with keyword collection. One of the first tweet aggregation systems I built was for an intellectual property lawyer, and I ran into the problem of searching for the word patent and getting matches for patent leather. The out_words field is included to handle these types of false positives. A collection word of patent, could have the word leather placed in the out_words field to block this false positive. The out_words field is optional. It is only needed for collection words that may return unintended tweets.
There is a one-to-many relationship between collection words and out words, and I’m a strong relational database guy, as you may have noticed. The rule I follow is to normalize relationships into linked tables, except when it makes life harder and queries slower. In this case, creating a separate out_words table would be a pain. Looking up each out word for the matching collection words would be slow, and there is the risk of deleting the collection word while leaving the matching out word in the database. To simplify this data structure, I’ve made the out_words field large, and expect to put all the words into the field with comma delimiters. You’ll see how this is used when we put this table into use in the new version of parse_tweets.php.
exclusion_words
This table contains words that cause tweets to be rejected. It will be used by parse_tweets.php to test each tweet before adding it to the database. Because this is done in our code, rather than by the API, we can add logic to do partial or exact matches based on the type field. For example, if fuck is added along with the type of partial, parse_tweets.php can exclude tweets with: fuck, fucks, fucker, and fucking. You can use this table to make sure that your tweet display system doesn’t display tweets that embarrass you or your client.
CREATE TABLE IF NOT EXISTS `exclusion_words` ( `words` varchar(60) NOT NULL, `type` enum('partial','exact') NOT NULL DEFAULT 'partial', KEY `words` (`words`) ) ENGINE=MyISAM DEFAULT CHARSET=utf8;
tweet_words
Once you have collected tweets based on keywords, you’ll want a fast way of retrieving just the tweets that match specific keywords. You’ll also probably want to report on which keywords are used the most in the tweets you collect. Parse_tweets.php can accomplish this by recording the tweet_id and keywords found in this table. In effect, you are creating an index of all the tweets based on keywords. This is much faster than searching within the text of each tweet, especially when the number of tweets gets large.
CREATE TABLE IF NOT EXISTS `tweet_words` ( `tweet_id` bigint(20) unsigned NOT NULL, `words` varchar(60) NOT NULL, KEY `tweet_id` (`tweet_id`), KEY `words` (`words`) ) ENGINE=MyISAM DEFAULT CHARSET=utf8;
This will let you run the following MySQL queries:
SELECT tweets.* FROM tweets, tweet_keywords WHERE tweets.tweet_id = tweet_keywords.tweet_id AND tweet_keywords.words = "pizza recipe"
SELECT count(*) as cnt, words FROM tweet_keywords GROUP BY words ORDER BY cnt DESC
Part 3 of this series with a new version of get_tweets.php is here.