
The Twitter economy will employ a diverse labor force

Adam Green — Thu, 20 Mar 2014 20:26:00 +0000

This is the third part of a series on the future of Twitter development: Part I, Part II.

In order for Twitter to reach everywhere, a skilled labor force of API developers is needed around the world. Developers make it possible to integrate Twitter into businesses in a more useful and personalized manner. This type of integration will give Twitter ubiquity and longevity, two attributes that are almost impossible for competitors to overcome.

The API developer community is more complex than most realize. New people enter continually as students and self-taught programmers. Others come in as corporate developers who are told to work on Twitter projects. As top-tier Twitter-related companies continue to grow multi-million dollar products, they bring in experienced coders from outside the Twitter world to manage big databases and build enterprise-level tools. This is a dynamic and growing group.

Developer Labor Force

One of the great strengths of the Twitter API is that it can be run by self-taught programmers who quickly turn themselves into productive tool builders. Self-taught programmers often come out of an actual need within a business or other organization. Software start-ups and consulting companies are born as a result of this group’s work.

Students are the other influx of talent I often see in the Twitter world. Doing a research project on Twitter data now seems to be a standard task for Computer Science students. With that skill set, Twitter will be the obvious place for this cohort to turn when adding features to any system. It becomes a generation’s standard for social media data.

Marketing automation companies are doing so well that HubSpot has IPO buzz. Big Data companies based on Twitter are also booming and signing multi-million dollar partnerships.

So we have an established API infrastructure, a billion dollar ad revenue stream to work beside, and a rich set of tools built by a growing and diverse tech community. It sounds like my road trip view of economics is about to play out big-time with Twitter.

Streaming API: Multi-level tweet collection databases

Adam Green — Fri, 14 Feb 2014 01:13:40 +0000

Yesterday’s streaming API post described a multiple server model for handling high rate tweet collection. Today I’d like to cover a different architecture that addresses this problem with a single server running multiple databases.

Let’s say you want to display tweets for the most active stocks each day. The streaming API lets you collect tweets for 400 keywords, or in this case, the 400 most active stock symbols. That will be a high flow rate, and a large database to query if your site only needs to display tweets for 20 or 30 stocks at any one time.

A solution is to store all the tweets, users and related data you receive for all 400 stocks in one database, which we’ll call tweet_collect. You can then create a separate database, called tweet_serve, and have your code copy just the tweets for active stocks to this database as they arrive. Your website only needs to read from tweet_serve, which will be much smaller and therefore deliver query results faster.

When a new stock becomes active, you will already have its tweets available in tweet_collect, so you can quickly copy its tweets to tweet_serve and be ready to display on the site. When the stock is no longer active, you can delete its data from tweet_serve.
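
The copy-and-delete cycle comes down to comparing the old and new sets of active stocks. Here is a minimal sketch of that comparison; the function name and the sample symbols are hypothetical, and the actual copy and delete queries would depend on your own table layout.

```php
<?php
// Hypothetical helper, not part of the 140dev framework.
// Compare the previous and current active symbols to decide which
// tweets to copy into tweet_serve and which to delete from it.
function plan_serve_changes($old_active, $new_active) {
  return array(
    // Newly active: copy their tweets from tweet_collect
    'copy'   => array_values(array_diff($new_active, $old_active)),
    // No longer active: delete their tweets from tweet_serve
    'delete' => array_values(array_diff($old_active, $new_active)),
  );
}

$plan = plan_serve_changes(
  array('AAPL', 'GOOG', 'MSFT'),
  array('GOOG', 'MSFT', 'TSLA')
);
print_r($plan);
```

Symbols that stay active appear in neither list, so their tweets are left alone in tweet_serve.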

The limitation of this technique is that you can only cover topics that fit adequately within the 400 keyword limit. As long as this fits your application’s needs, this model will produce a much faster website display.

When a keyword becomes active that isn’t in your normal collection list, you can fill in the data for this with the search API as needed. Search isn’t as powerful as streaming for large amounts of data, but if you need ad hoc collection of tweets for a few extra keywords, it does a good job. You can query it up to 720 times an hour and request tweets for about 10 to 15 keywords each time. These tweets would also go into the tweet_serve database.
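
One way to stay within the 10 to 15 keywords per request is to split the extra keywords into groups and join each group with the search API’s OR operator. This is only a sketch: the function name, the default group size of 10, and the sample keywords are assumptions.

```php
<?php
// Hypothetical helper: split ad hoc keywords into OR queries,
// one query string per search API request.
function build_search_queries($keywords, $per_query = 10) {
  $queries = array();
  foreach (array_chunk($keywords, $per_query) as $chunk) {
    $queries[] = implode(' OR ', $chunk);
  }
  return $queries;
}

// Three extra keywords, two per request, gives two query strings
$queries = build_search_queries(array('AAPL', 'GOOG', 'MSFT'), 2);
print_r($queries);
```

Each string in the result would be sent as the q parameter of one /search/tweets request.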

Lead Generation: Data mining Twitter lists for the best leads

Adam Green — Wed, 05 Feb 2014 18:38:42 +0000

Twitter lists don’t get the respect they deserve, but here is a way to use a well curated list as the source of great leads. You could just use the members of a list as leads themselves, but that will only give you a few hundred accounts. If you use the following procedure, you can turn a list into tens of thousands of accounts that are all interested in a specific subject.

First you start with a good Twitter list whose members have been carefully selected. Ironically, Twitter doesn’t offer a method of searching for its own lists, but Google does a good job of this. I find that a search like “twitter list [subject]” works best. With the Olympics in the news, let’s try to find a great list of sports journalists as a starting point. This can be used to find thousands of Twitter users who are avid consumers of sports news, a great lead list for anyone promoting sports related products.

A Google search for twitter list sports journalists leads me to the @SportSJA account. Clicking on the account’s Lists link gave me this excellent list.

The assumption I will make when data mining this list is that the more members of this list someone follows, the more they are interested in sports. Someone who follows a few members may just be an accident, but if an account follows a couple of dozen list members, they are a sports fanatic. Those are the people I’m looking for.

The next step is to collect all the followers of every member of this list. I do this by writing code that first collects all the list’s members with the /lists/members API call. This gives me the user ids of the list members, which are stored in a list_members table.

Then I have a script loop through the user ids in the list_members table, and collect all the followers of each one with the /followers/ids request. This is the slowest part of the process, because only 60 follower requests can be made per hour, retrieving up to 5,000 followers each time. Depending on the number of list members and average followers of each, this can take a few hours up to several days. Running a script that does 15 requests as a cron job once every 15 minutes will eventually chew through all the data.
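
To get a feel for how long this stage will take, you can estimate the number of requests from the follower counts of the list members. The sketch below uses the limits quoted above (5,000 followers per request, 60 requests per hour); the function name and the 12,000 follower average are made-up values for illustration.

```php
<?php
// Hypothetical estimator for the /followers/ids collection stage.
// $follower_counts holds the follower count of each list member.
function estimate_follower_hours($follower_counts, $per_request = 5000, $requests_per_hour = 60) {
  $requests = 0;
  foreach ($follower_counts as $count) {
    // Every member costs at least one request, even with no followers
    $requests += max(1, (int) ceil($count / $per_request));
  }
  return $requests / $requests_per_hour;
}

// Assumed example: 485 members with 12,000 followers each
$hours = estimate_follower_hours(array_fill(0, 485, 12000));
printf("Estimated collection time: %.1f hours\n", $hours);
```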

All of the follower user ids I get in this stage are added to a list_member_followers table. The important trick is that I allow duplicates. If someone follows 10 members of the list, their user id is added to the list_member_followers table 10 times.

Since users are added multiple times to this table, you can get a count of the total unique users with the query:

SELECT COUNT(DISTINCT user_id)
FROM list_member_followers

From my experience, a list with close to 500 members (this one has 485) will deliver on the order of 500,000 to 1M unique followers.

I don’t want all the followers, just the best ones. The top 10% is usually enough, so if there are 500,000 followers, the 50,000 who follow the most list members are my target. I can get these with:

SELECT count(*) as cnt, user_id
FROM list_member_followers
GROUP BY user_id
ORDER BY cnt DESC
LIMIT 50000
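
The same ranking can be illustrated in plain PHP, which makes it clearer why the duplicates matter. This is a small in-memory sketch with invented user ids, not a replacement for the SQL query on a table of this size.

```php
<?php
// Hypothetical in-memory version of the GROUP BY query.
// $follower_ids has one entry per (list member, follower) pair,
// so a user who follows three members appears three times.
function top_followers($follower_ids, $limit) {
  $counts = array_count_values($follower_ids); // user_id => times seen
  arsort($counts);                             // highest count first
  return array_slice($counts, 0, $limit, true); // keep user_id keys
}

$ids = array(101, 202, 101, 303, 101, 202);
$top = top_followers($ids, 2);
print_r($top);
```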

To make sure these are all good leads, I need to review their account stats, such as follower count and number of tweets. I can have a script read through these 50,000 accounts, use /users/lookup to collect the account info, and store it in a list_member_leads table. This works on 100 users at a time, and can be run 720 times an hour. This means that collecting all the account data will take 50,000 / 100 / 720, or about 0.7 hours. Not bad for 50,000 potential leads.
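
The batching and the time estimate can be sketched as follows. The function name is hypothetical, and the sequential ids stand in for the real user ids read from the database; /users/lookup takes a comma-separated user_id parameter, so each batch is joined into one string.

```php
<?php
// Hypothetical planner for the /users/lookup stage.
function plan_lookup($user_ids, $batch_size = 100, $requests_per_hour = 720) {
  $batches = array_chunk($user_ids, $batch_size);
  return array(
    'batches' => $batches,
    'hours'   => count($batches) / $requests_per_hour,
  );
}

$plan = plan_lookup(range(1, 50000));

// One comma-separated user_id parameter per request
$first_param = implode(',', $plan['batches'][0]);

printf("%d requests, about %.2f hours\n", count($plan['batches']), $plan['hours']);
```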

Finally, I can run a SQL request that deletes any of the lead accounts that have poor stats, such as fewer than 50 friends, 50 followers, or 50 tweets. I also tend to delete accounts with egg avatars and no descriptions. This usually eliminates about 20% of the total.
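
If you would rather filter in code before storing the leads, the same rules can be written as a PHP test on each user object. This sketch assumes the objects come from a decoded /users/lookup response, and it treats an egg avatar or an empty description as disqualifying on its own, which is one reading of the rule above.

```php
<?php
// Hypothetical filter, assuming $user is a decoded Twitter user object.
function is_quality_lead($user, $min = 50) {
  if ($user->friends_count < $min) return false;
  if ($user->followers_count < $min) return false;
  if ($user->statuses_count < $min) return false;
  // default_profile_image is true for accounts that kept the egg avatar
  if (!empty($user->default_profile_image)) return false;
  if (trim((string) $user->description) === '') return false;
  return true;
}

// Invented sample data for illustration
$leads = json_decode('[
  {"screen_name":"fan1","friends_count":120,"followers_count":300,
   "statuses_count":900,"default_profile_image":false,"description":"Sports nut"},
  {"screen_name":"egg1","friends_count":80,"followers_count":60,
   "statuses_count":75,"default_profile_image":true,"description":""},
  {"screen_name":"new1","friends_count":10,"followers_count":3,
   "statuses_count":2,"default_profile_image":false,"description":"Hello"}
]');

$good_leads = array_values(array_filter($leads, 'is_quality_lead'));
```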

The result is about 40,000 leads, all of whom have shown that they really want info on sports. And best of all, this data is free. Pretty amazing. Remember, automated following of a list like this is forbidden by Twitter’s terms of service, but there are many ways to track and engage with these users. What you need is a solid reporting and engagement management system to approach these users and record your engagement. Sort of like a CRM for Twitter leads. Luckily, I happen to have a book on just this subject.

Streaming API enhancements, part 2: Keyword collection database changes

Adam Green — Tue, 04 Feb 2014 20:48:48 +0000

The previous post had an overview of my planned enhancements to the streaming API framework. The first step is adding some new tables to the 140dev tweet collection database:

collection_words
The words in this table will be used to collect matching tweets. We’ll see how to add this to the get_tweets.php script so that the collection list is automatically updated for the streaming API when the table changes. This means that you can add and remove words from the table, and have the new list collected without having to restart the get_tweets.php script.

CREATE TABLE IF NOT EXISTS `collection_words` (
  `words` varchar(60) NOT NULL,
  `type` enum('words','phrase') NOT NULL DEFAULT 'words',
  `out_words` varchar(100) DEFAULT NULL,
  KEY `words` (`words`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

The streaming API automatically ANDs multi-word entries, and ORs the full set of keywords. Let’s assume this table is given the following entries:
pizza recipe
cookbook

This will deliver tweets that contain:
(pizza AND recipe) OR cookbook

Multi-word phrases return tweets that contain all the words even if they are not next to each other. An entry of pizza recipe will return tweets that contain both of these words, no matter where they are in the tweet. The Twitter docs have more examples.
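
To check a word list before sending it to Twitter, you can imitate the AND/OR rule locally. In the track parameter, commas separate entries (OR) and spaces within an entry mean all words must be present (AND). The sketch below is a rough approximation for testing only; both function names are hypothetical, and Twitter’s own tokenizing is more involved than this simple split.

```php
<?php
// Hypothetical helpers for experimenting with collection words.

// Commas act as OR between entries in the track parameter
function build_track($entries) {
  return implode(',', $entries);
}

// Rough local imitation: true if every word of at least one entry
// appears somewhere in the tweet, in any order
function matches_track($tweet_text, $entries) {
  $tokens = preg_split('/[^a-z0-9_#@]+/', strtolower($tweet_text), -1, PREG_SPLIT_NO_EMPTY);
  foreach ($entries as $entry) {
    $words = preg_split('/\s+/', strtolower(trim($entry)), -1, PREG_SPLIT_NO_EMPTY);
    if ($words && count(array_diff($words, $tokens)) === 0) {
      return true;
    }
  }
  return false;
}

$entries = array('pizza recipe', 'cookbook');
$track = build_track($entries);
$hit  = matches_track('Best recipe for homemade pizza', $entries);
$miss = matches_track('I had pizza tonight', $entries);
```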

The table also contains an enum type field, so you can restrict results to a complete phrase. Let’s say you want to match “I love apple pie”, but not “I don’t want an apple in my blueberry pie.” You can set the words field to apple pie and the type to phrase.
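
Since the streaming API itself only does the AND match, a phrase entry has to be checked in your own code after the tweet arrives. Here is one possible check, assuming a case-insensitive match on whole words; the function name is hypothetical and this is not the code from parse_tweets.php.

```php
<?php
// Hypothetical phrase test: the words must appear next to each other,
// in order, as whole words.
function matches_phrase($tweet_text, $phrase) {
  $pattern = '/\b' . preg_quote($phrase, '/') . '\b/i';
  return preg_match($pattern, $tweet_text) === 1;
}

$love = matches_phrase('I love apple pie', 'apple pie');
$nope = matches_phrase("I don't want an apple in my blueberry pie.", 'apple pie');
```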

False positives can be a real problem with keyword collection. One of the first tweet aggregation systems I built was for an intellectual property lawyer, and I ran into the problem of searching for the word patent and getting matches for patent leather. The out_words field is included to handle these types of false positives. A collection word of patent could have the word leather placed in the out_words field to block this false positive. The out_words field is optional. It is only needed for collection words that may return unintended tweets.

There is a one-to-many relationship between collection words and out words, and I’m a strong relational database guy, as you may have noticed. The rule I follow is to normalize relationships into linked tables, except when it makes life harder and queries slower. In this case, creating a separate out_words table would be a pain. Looking up each out word for the matching collection words would be slow, and there is the risk of deleting the collection word while leaving the matching out word in the database. To simplify this data structure, I’ve made the out_words field large, and expect to put all the words into the field with comma delimiters. You’ll see how this is used when we put this table into use in the new version of parse_tweets.php.
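
Until that version is covered, here is a minimal sketch of how a comma-delimited out_words value could be tested against a tweet. The function name is hypothetical, and the simple case-insensitive substring match is an assumption rather than the framework’s actual logic.

```php
<?php
// Hypothetical check: does the tweet contain any out word?
// $out_words is the comma-delimited out_words value, or null.
function has_out_word($tweet_text, $out_words) {
  if ($out_words === null || trim($out_words) === '') {
    return false; // the field is optional
  }
  foreach (explode(',', $out_words) as $word) {
    $word = trim($word);
    if ($word !== '' && stripos($tweet_text, $word) !== false) {
      return true;
    }
  }
  return false;
}

$blocked = has_out_word('New patent leather shoes', 'leather, vinyl');
$allowed = has_out_word('Apple files a new patent', 'leather, vinyl');
```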

exclusion_words
This table contains words that cause tweets to be rejected. It will be used by parse_tweets.php to test each tweet before adding it to the database. Because this is done in our code, rather than by the API, we can add logic to do partial or exact matches based on the type field. For example, if fuck is added along with the type of partial, parse_tweets.php can exclude tweets with: fuck, fucks, fucker, and fucking. You can use this table to make sure that your tweet display system doesn’t display tweets that embarrass you or your client.

CREATE TABLE IF NOT EXISTS `exclusion_words` (
  `words` varchar(60) NOT NULL,
  `type` enum('partial','exact') NOT NULL DEFAULT 'partial',
  KEY `words` (`words`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
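
The difference between the two types can be sketched like this. It assumes the rows have been read from exclusion_words into arrays, that partial means a match anywhere inside a word, and that exact means a whole-word match; the function name and sample rows are made up for illustration.

```php
<?php
// Hypothetical exclusion test.
// $exclusions: rows as array('words' => ..., 'type' => 'partial'|'exact')
function is_excluded($tweet_text, $exclusions) {
  foreach ($exclusions as $row) {
    $quoted = preg_quote($row['words'], '/');
    // exact: whole word only; partial: anywhere inside a word
    $pattern = ($row['type'] === 'exact')
      ? '/\b' . $quoted . '\b/i'
      : '/' . $quoted . '/i';
    if (preg_match($pattern, $tweet_text) === 1) {
      return true;
    }
  }
  return false;
}

$partial = array(array('words' => 'spam', 'type' => 'partial'));
$exact   = array(array('words' => 'spam', 'type' => 'exact'));

$a = is_excluded('Beware of spammers', $partial); // partial matches inside "spammers"
$b = is_excluded('Beware of spammers', $exact);   // exact needs the whole word
```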

tweet_words
Once you have collected tweets based on keywords, you’ll want a fast way of retrieving just the tweets that match specific keywords. You’ll also probably want to report on which keywords are used the most in the tweets you collect. The parse_tweets.php script can accomplish this by recording the tweet_id and keywords found in this table. In effect, you are creating an index of all the tweets based on keywords. This is much faster than searching within the text of each tweet, especially when the number of tweets gets large.

CREATE TABLE IF NOT EXISTS `tweet_words` (
  `tweet_id` bigint(20) unsigned NOT NULL,
  `words` varchar(60) NOT NULL,
  KEY `tweet_id` (`tweet_id`),
  KEY `words` (`words`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

This will let you run the following MySQL queries:

SELECT tweets.*
FROM tweets, tweet_words
WHERE tweets.tweet_id = tweet_words.tweet_id
AND tweet_words.words = "pizza recipe";

SELECT count(*) as cnt, words
FROM tweet_words
GROUP BY words
ORDER BY cnt DESC;

Part 3 of this series with a new version of get_tweets.php is here.

Gaining Followers: Use the right tags in your account bio

Adam Green — Tue, 04 Feb 2014 12:53:21 +0000

My last post on lead generation described the best techniques for identifying popular tags for any subject. You can use this approach to select accounts to follow, but that is just half of the solution. You also want these accounts to follow you back.

When you follow someone, the most likely thing that person will do is check out your account quickly to see if you are worth a follow back. If they see a tag in your description that shows you are passionate about that subject, they are probably going to follow you in return. It is a way of proving that you are a member of their community.

The other place that description tags are helpful is in a Twitter search. Most people think of search.twitter.com as a way of finding tweets, but when someone clicks the people link at the top of the search results, Twitter is going to rank your account high if it finds the right tag in your description. Think of this as Twitter SEO for gaining followers.

Lead Generation: Discover the most popular tags in user descriptions

Adam Green — Tue, 04 Feb 2014 12:38:09 +0000

Searching for hashtags is a common method of finding new leads on Twitter, but tags in tweets can be misleading. Sarcasm and irony are so often used in tweets that finding a tag doesn’t tell you if that user is for or against the tag’s meaning. In fact, tags are often used in tweets to make sure the opposite side sees the tweet in search. It is a way of saying “in your face” in a tweet.

A much more accurate indicator is the use of a tag within an account description. After this last Super Bowl, I’m sure lots of Seahawks fans sent derisive tweets with the #broncos tag, but I doubt if any of them put #broncos in their account bio. From my experience, a tag used in an account bio is the most accurate way of qualifying a lead for any issue.

My Engagement Programming book has a couple of scripts that can help you find the most popular tags for any subject. You can use this script to extract all the tags found in user descriptions you have collected with the API, and this report to see which tags are most popular.

Once you have identified the best tags for any subject, you can qualify a list of users by finding those who have these tags in their description.

Lead Generation: Qualifying leads by age of account and last tweet date

Adam Green — Tue, 04 Feb 2014 12:08:03 +0000

Getting thousands of potential leads for any topic is no problem with the Twitter API. Identifying the best leads is the real challenge. There are two dates that can be effective in this task: the date when the Twitter account was created, and the date of the last tweet sent by the account. I like to qualify leads by making sure that the account is at least 30 days old, and has tweeted within the last 30 days. The Twitter API makes it easy to get both of these values. Each API request, such as the search or streaming APIs, includes these dates for each user it returns.

A simple way to see this in action is with the /users/show API call. Here is an example for @LadyGaga using this site’s API Console. You’ll have to log in with your Twitter account to use this tool, but if you just want a quick look at the results, here is a screenshot:

You can apply this technique when you extract results from either the search or streaming API. For example, if you have the user’s data in a variable called $user, you can get the account’s creation date with $user->created_at, and the last tweet date with $user->status->created_at.
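
Putting the two dates together, a qualification test might look like the sketch below. It assumes $user is a decoded user object with the status element present, and it takes the current time as a parameter so the result is easy to test; the function name and sample dates are invented.

```php
<?php
// Hypothetical test: account at least $days old, last tweet within $days.
function is_active_lead($user, $now, $days = 30) {
  if (!isset($user->status->created_at)) {
    return false; // no last tweet available, e.g. a protected account
  }
  $window = $days * 86400;
  $created = strtotime($user->created_at);
  $last_tweet = strtotime($user->status->created_at);
  if ($created === false || $last_tweet === false) {
    return false; // unparseable date
  }
  return ($now - $created) >= $window && ($now - $last_tweet) <= $window;
}

$user = json_decode('{
  "created_at": "Wed Mar 26 04:38:55 +0000 2008",
  "status": {"created_at": "Mon Feb 03 10:00:00 +0000 2014"}
}');
$now = strtotime('2014-02-04 12:00:00 +0000');
$qualified = is_active_lead($user, $now);
```

In a live script you would pass time() as the second argument.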

The full streaming API stack

Adam Green — Wed, 29 Jan 2014 16:07:47 +0000

I’ve been spending the last few days helping people install the latest version of the streaming API framework. This has reminded me of how many moving parts there are, and how this can get in the way of building a mental model of what is actually going on. One of the biggest confusions seems to be the idea that I wrote Twitter’s streaming API. Actually, all I’ve done is put a thin layer of code on top of a very deep stack. That code may tie things together, but there are many levels of code that need to be installed and configured. Let’s work our way up from the basic server level:

- Operating system. The streaming API code will run on *nix variants, Windows, and Mac OS X machines. Windows has its own unique quirks, but if you are willing to run a Windows machine as a Web server, you have already discovered that.
- Apache must be installed and configured to run PHP. You should also configure Apache to run PHP within HTML pages. This is not always set by default.
- PHP runs within Apache. You will need version 5.2 or greater. I’ve recently seen problems on Windows servers unless PHP 5.2.17 or greater is installed.
- cURL is a library that runs within PHP and allows connections to remote servers, such as the Twitter API. You won’t need to call cURL directly in your code, but it is used by the Phirehose and tmhOAuth libraries. cURL should be enabled by default, but some webhosts turn it off.
- MySQL. I try to use version 5.0 or greater.
- The db_lib.php code in the framework uses the mysqli PHP library to communicate with MySQL, so that must be installed within PHP.
- Phirehose is the library that makes the actual connection to the streaming API in get_tweets.php. I didn’t write this, but the author, Fenn Bailey, allows me to include it in the framework’s source code. It lives here.
- tmhOAuth is a library that lets you make OAuth calls to Twitter’s REST API, such as searching and reading timelines. It isn’t used by the streaming API framework, but is part of my engagement programming code, and many sample scripts on this site, so I’m including it here. It is written by Matt Harris and lives here.
- Finally we get to my streaming API framework code, which rests on all this work by thousands of other people. Open source is an amazing thing, but finding the right path to an app isn’t easy at first.

Search API: Are search results filtered for user quality?

Adam Green — Fri, 29 Nov 2013 18:08:35 +0000

A continual question on the Twitter developer mailing list is why certain tweets and even entire accounts don’t show up in search results. The standard answer is that the search API filters out tweets that don’t meet a minimum quality threshold. That makes a lot of sense, and should definitely be done, if it produces better results.

I decided to test the Search API using the quality metrics I would typically apply in my own code to filter out spam accounts: account age, number of followers, and number of tweets. I wrote the following search_quality.php script, and ran it against 100 different query terms. Each execution of the script collected up to 100 tweets and returned the minimum account values found for these quality criteria. What I found was very surprising. For most queries I was able to get back tweets from accounts that were as little as 1 day old, had zero followers, and had sent only 1 or 2 tweets.

The test script uses the tmhOAuth.php OAuth library, as I do in all my code. If you don’t already have a copy of this library, you can download it along with the search_quality.php script. You will need to fill in a set of OAuth tokens to make the API request. You also need to fill in your own query. Try different queries and see what you get.

search_quality.php

<?php
// Load the tmhOAuth library and create a connection
// You must fill in your own OAuth tokens
require 'tmhOAuth.php';
$connection = new tmhOAuth(array(
  'consumer_key' => '*****',
  'consumer_secret' => '*****',
  'user_token' => '*****',
  'user_secret' => '*****'
));

// You must fill in the query term
$query = '*****';

// Get up to 100 tweets
$connection->request('GET', $connection->url('1.1/search/tweets'),
  array('q' => $query,
    'result_type' => 'recent',
    'count' => 100));

// Extract tweets
$results = json_decode($connection->response['response']);
$tweets = isset($results->statuses) ? $results->statuses : array();

if (sizeof($tweets)==0) {
  print "No tweets found for: $query";
  exit;
}

// Loop through all tweets found, tracking the lowest value of each metric
$tweets_found = 0;
$min_account_age = account_age($tweets[0]->user->created_at);
$min_followers_count = $tweets[0]->user->followers_count;
$min_statuses_count = $tweets[0]->user->statuses_count;
foreach($tweets as $tweet) {
  ++$tweets_found;

  if ($min_account_age > account_age($tweet->user->created_at)){
    $min_account_age = account_age($tweet->user->created_at);
  }
  if ($min_followers_count > $tweet->user->followers_count) {
    $min_followers_count = $tweet->user->followers_count;
  }
  if ($min_statuses_count > $tweet->user->statuses_count) {
    $min_statuses_count = $tweet->user->statuses_count;
  }
}

print "Tweets found: $tweets_found Minimum account age: $min_account_age " .
"Minimum followers: $min_followers_count Minimum tweets: $min_statuses_count";

// Return number of days since start date
function account_age($start) {
  date_default_timezone_set('America/New_York');
  return round(abs(time()-strtotime($start))/86400) + 1;
}

?>

After you run this, tweet your results to me @140dev, and I’ll pass them along to the rest of the 140dev community.

Of course, if you write your own search API code, and collect the results, you can filter out the tweets based on any quality control rules you want. This is one of the ways developers can add value to Twitter API results.


Search API: More secret search operators

Adam Green — Tue, 26 Nov 2013 14:47:17 +0000

Last week I pointed out the undocumented search operator min_retweets. I’ve been searching tweets about this operator (yes, that is pretty meta) and found two more operators that aren’t in the official docs: min_replies and min_faves. You’ll have to experiment with these to see which are best for different needs. Before we get into the details, you should be aware that as undocumented features there is no guarantee they will continue to be available. I guess that is true for documented features of the API as well.

Personally, I find min_retweets most useful for identifying influential users. Min_replies also implies influence, but an even better use is finding a group of people who know each other well enough to carry on an extended conversation. You can collect all the replies to a tweet and extract the user names. If this is done repeatedly, you can build up a highly engaged circle of friends. Min_faves implies that the tweets are some of the most informative, even if their authors aren’t that influential.

You can use these operators directly in search.twitter.com:
https://twitter.com/search?q=obama%20min_retweets%3A5&src=typd&f=realtime
https://twitter.com/search?q=obama%20min_replies%3A5&src=typd&f=realtime
https://twitter.com/search?q=obama%20min_faves%3A5&src=typd&f=realtime

There is a trick when you use them with the search API. You have to include the operators as part of the query parameter, not as a separate parameter. To get tweets with “obama” and a minimum of 5 retweets, you set the q parameter to “obama min_retweets:5”. Here is an example in the API Console:
http://140dev.com/twitter-api-console/?method=GET&url=1.1/search/tweets&q=obama%20min_retweets:5&run=1

The complete search API call when using the tmhOAuth.php library is:
$connection->request('GET',
$connection->url('1.1/search/tweets'),
array('q' => 'obama min_retweets:5'));
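
If you want to combine several of these operators, a small helper can build the q value for you. This is a sketch with a hypothetical function name; the operators themselves are undocumented, so treat the whole thing as experimental.

```php
<?php
// Hypothetical helper: append min_* operators to a search term.
// $operators maps operator name to minimum count.
function build_search_q($term, $operators = array()) {
  $parts = array($term);
  foreach ($operators as $name => $value) {
    $parts[] = $name . ':' . (int) $value;
  }
  return implode(' ', $parts);
}

// Pass this as array('q' => $q) in the request
$q = build_search_q('obama', array('min_retweets' => 5));

// URL-encoded form, as it appears in a browser search URL
$encoded = rawurlencode($q);
```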