The next few posts will describe a set of enhancements to the streaming API framework that will greatly expand the capabilities of the code for collecting tweets based on keywords. I thought I’d start with an overview of what I want to accomplish:
- Add a collection_keywords table to the database to hold keywords to be used for collection.
- Add an exclusion_keywords table to the database to hold words (typically curse words) that identify tweets to be rejected.
- Add a tweet_keywords table to the database to record the tweet_id of any tweet that contains a collection keyword. This will greatly speed up queries that get tweets for specific keywords.
- Modify get_tweets.php to collect tweets that contain any of the keywords in the collection_keywords table.
- Modify parse_tweets.php to test each tweet and reject it if it contains any word from the exclusion_keywords table.
- Modify parse_tweets.php to record in the tweet_keywords table any collection keywords found in each tweet.
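To make the two parse_tweets.php changes more concrete, here is a rough sketch of the matching logic. This is not the framework's actual code. The function names find_keywords() and classify_tweet() are made up for illustration, and the keyword arrays are assumed to have been loaded already from the collection_keywords and exclusion_keywords tables.

```php
<?php
// Hypothetical helper (not part of the framework): returns every keyword
// from $keywords that appears somewhere in $text.
function find_keywords(string $text, array $keywords): array
{
    $found = [];
    foreach ($keywords as $keyword) {
        // stripos() is a substring search that ignores case for ASCII letters.
        // Empty keywords are skipped so they never count as a match.
        if ($keyword !== '' && stripos($text, $keyword) !== false) {
            $found[] = $keyword;
        }
    }
    return $found;
}

// Hypothetical helper: returns null if the tweet should be rejected,
// otherwise the list of collection keywords found in it. That list is
// what would be recorded in tweet_keywords alongside the tweet_id.
function classify_tweet(string $text, array $collection, array $exclusion): ?array
{
    if (find_keywords($text, $exclusion) !== []) {
        return null; // contains an exclusion keyword, so reject the tweet
    }
    return find_keywords($text, $collection);
}

// Example with made-up keywords:
$matches = classify_tweet('Learning PHP and MySQL', ['php', 'mysql'], ['darn']);
// $matches now holds the collection keywords found in the tweet text.
```

Because this sketch uses plain substring matching, a keyword like "class" would also match "classic". Whether whole-word matching is needed depends on the keywords being collected.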
I’m going to leave the current version of the framework code unchanged, so the enhanced scripts will be called get_tweets_keyword.php and parse_tweets_keyword.php. Once people have had a chance to test this code, I will integrate it into a new release version of the framework.
The next post in this series is available here.