Bias in Twitter API measurements
Apr 18th, 2011 by Maurice

I’m working on the analysis of the tweets on the Dutch general elections of 2010. In the five weeks prior to Election Day there was a considerable number of tweets to analyse: 4,585,614 to be precise. Because of this large number of tweets, I use SPSS to organize the data and create variables. While working on the data, what struck me was that Twitter’s identification of, for instance, retweets is quite sloppy, to say the least.

As you may know, Twitter provides an API that allows access to their data, so that tweets, locations etc. can be re-used in mashups. A few of these variables are of interest to researchers, for instance user characteristics such as name, location, and the size of the follower and following networks. I use these in my research as well, combined with data from other sources (see the upcoming issue of Party Politics on Twitter use by candidates in the EU elections of 2009).

As I was working on the actual tweet content I found some curious discrepancies between Twitter’s measurements and mine. For instance, retweets, identified by “RT” in the tweet text, are only flagged as retweets by Twitter when the “RT” is positioned right at the beginning of the tweet. Even if there is just a blank space before the “RT”, Twitter fails to flag the tweet as a retweet. Furthermore, “RT” codes put somewhere in the middle of the tweet are not identified correctly either. You could consider these occurrences false negatives.
There are also false positives: tweets flagged as retweets that actually are not. Consider this: in the Netherlands, there is a broadcasting organisation called RTL. Indeed, its first two characters are identified as a retweet by Twitter. Similarly, tweets starting with the text RTV (often used as an abbreviation for Radio and Television) are also identified as retweets.
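
To make the difference concrete, here is a minimal Python sketch of both approaches. The function `naive_is_retweet` only mimics the behaviour described above (it is my assumption, not Twitter’s actual code), and `is_retweet` is one possible corrected rule that requires “RT” to stand alone as a word. The example tweets are made up.

```python
import re

# "RT" must stand alone as a word, so "RTL" and "RTV" do not match,
# while " RT @user" and "... RT @user" in mid-tweet do.
RT_PATTERN = re.compile(r"\bRT\b")


def naive_is_retweet(text: str) -> bool:
    """Mimics the flagging described above: the tweet starts with 'RT'.

    This is an assumption about the behaviour, not Twitter's real code.
    """
    return text.startswith("RT")


def is_retweet(text: str) -> bool:
    """Corrected rule: a stand-alone 'RT' anywhere in the tweet."""
    return RT_PATTERN.search(text) is not None


# Made-up example tweets.
examples = [
    "RT @user: ga stemmen!",          # flagged by both
    " RT @user: ga stemmen!",         # false negative for the naive rule
    "Mooi gezegd! RT @user: stem",    # false negative for the naive rule
    "RTL meldt een hoge opkomst",     # false positive for the naive rule
    "RTV Oost zendt het debat uit",   # false positive for the naive rule
]

for tweet in examples:
    print(f"{tweet!r:35} naive={naive_is_retweet(tweet)} corrected={is_retweet(tweet)}")
```

The corrected rule is still a simplification: it is case-sensitive and would also count a tweet that merely talks about “RT” without quoting anyone.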

So, to what extent does this influence the findings? In the table below I cross-tabulated the original classification against my corrected version.

 

What we see here is that there are 787 false positives and 127,994 (= 127,987 + 7) false negatives. That’s 2.8% of tweets incorrectly classified. Well, this small fraction does not seem too disturbing. Or does it? In terms of descriptive analysis it might be negligible, as long as the misclassifications are random (which I haven’t checked yet).
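
The 2.8% can be checked with a few lines of Python. The counts are the ones quoted above; using the full set of 4,585,614 tweets as the denominator is my assumption about what the cross-tabulation covers.

```python
# Counts quoted in the text.
total_tweets = 4_585_614          # assumed denominator: all collected tweets
false_positives = 787
false_negatives = 127_987 + 7     # the two false-negative cells of the table

misclassified = false_positives + false_negatives
rate = misclassified / total_tweets

print(f"misclassified tweets: {misclassified}")
print(f"share of all tweets:  {rate:.1%}")
print(f"share of errors that are false negatives: {false_negatives / misclassified:.1%}")
```

Almost all of the errors are false negatives, so the original flag mainly undercounts retweets.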

At the same time, if one wants to use these retweets to construct a social network of people retweeting each other (yes, you can do that), things might be different. Even if the misclassifications are random, they might seriously affect network structure indicators.
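
As a rough illustration of how such a network could be built, the sketch below turns (author, text) pairs into weighted retweeter-to-original-author edges. The pattern `RT_USER`, the function `retweet_edges` and the sample tweets are hypothetical; every retweet that is missed or wrongly flagged changes this edge list directly.

```python
import re
from collections import Counter

# A stand-alone "RT" followed by "@username" (letters, digits, underscore).
RT_USER = re.compile(r"\bRT\s*@([A-Za-z0-9_]+)")


def retweet_edges(tweets):
    """Count directed edges (retweeter -> retweeted user).

    `tweets` is an iterable of (author, text) pairs; names are lower-cased
    so that "@Bob" and "@bob" end up as the same node.
    """
    edges = Counter()
    for author, text in tweets:
        for original in RT_USER.findall(text):
            edges[(author.lower(), original.lower())] += 1
    return edges


# Made-up sample data.
sample = [
    ("anna", "RT @Bob: ga stemmen!"),
    ("carl", "Eens! RT @bob: ga stemmen! RT @dana: doe het"),
    ("anna", "RTL meldt een hoge opkomst"),  # not a retweet, adds no edge
]

for (source, target), weight in sorted(retweet_edges(sample).items()):
    print(f"{source} -> {target} (weight {weight})")
```

The resulting edge list can be loaded into any network analysis package to compute the structure indicators.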

Similar classification issues are at play for mentions and replies. Only the first mentioned name is identified, whereas many tweets mention multiple names. Furthermore, a reply only counts as a reply when the Twitter name (including the @-sign) begins at the first position of the tweet.
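
A similar sketch for mentions and replies: `all_mentions` collects every @-name in a tweet instead of only the first, and `is_reply` tolerates leading blanks before the @-sign. Both function names and the decision to accept leading whitespace are my own assumptions for illustration.

```python
import re

# "@name" not preceded by a letter, digit or underscore,
# so e-mail addresses such as jan@example.nl are skipped.
MENTION = re.compile(r"(?<![A-Za-z0-9_])@([A-Za-z0-9_]+)")

# A reply: optional leading whitespace, then "@name".
REPLY = re.compile(r"\s*@[A-Za-z0-9_]+")


def all_mentions(text: str) -> list:
    """Return every mentioned user name, in order of appearance."""
    return MENTION.findall(text)


def is_reply(text: str) -> bool:
    """True when the tweet opens with an @-name, ignoring leading blanks."""
    return REPLY.match(text) is not None


print(all_mentions("@anna eens met @bob en @carl"))
print(all_mentions("mail me: jan@example.nl"))
print(is_reply("  @anna dank je wel"))
print(is_reply("Dank je wel @anna"))
```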

In my opinion, it’s surprising that the programmers at Twitter Inc. haven’t used something like regular expressions to classify the data correctly. To be fair to Twitter, they acknowledge that their retweet count is an approximation. In the meantime it’s better to be safe than sorry: classify them yourself if you can.

Still, since there are tools available (I already mentioned regular expressions, or think of the string functions available in SPSS), researchers studying the actual tweets might consider using these to get more accurate results.

Surfing the web in South Korea
Nov 11th, 2009 by Maurice

As we all know, South Korea is famous for its Internet landscape: high adoption rates and high bandwidth. However, I have mixed feelings about browsing the web in South Korea. Of course the up- and download speed is very high, even higher than that of my subscription in the Netherlands (one of the fastest available from UPC). However, the speed at YU is considerably slower. In the Netherlands, in my experience, speeds at universities are higher than in regular homes.

The websites in South Korea have an altogether different look and feel than Western web pages. An important factor is of course the Hangeul character set. These characters seem to make the design more delicate. This is enhanced by the use of light and pastel colours on the page. Furthermore, the pages often show some animation as the page builds up quickly. See for instance these pages of YeungNam University, Lotte department store, and Ohmynews.

Unfortunately, there is a big problem, especially for people coming from abroad. All websites and computers are totally dedicated to Microsoft. First of all, you will not find any other browser than Internet Explorer. While the European Union forces Microsoft to offer a choice of browsers when a newly bought computer is first started, in South Korea you are forced to use Internet Explorer. Without it you will not venture far in Korean cyberspace. Not only that, the dominant IE version is 6, whereas Internet Explorer 8 was released some time ago. There is an important reason for this: it turns out many Korean websites use ActiveX controls for loading apps into the browser. Because ActiveX is not supported by other browsers, people can only use IE. Apart from this, ActiveX is said to suffer from security issues.

Coming to South Korea, oblivious of this issue and assuming browsing is browsing, I found using Firefox quite frustrating. Gmail simply didn’t work on my university computer: I’d see my list of emails but was not able to open any of them. So what do you do when you’re an avid Firefox user with many add-ons installed? Do you switch to Internet Explorer? I didn’t. Why? Well, when using Internet Explorer for anything beyond basic navigation, I was asked to install all kinds of ActiveX things. What they were I still don’t know, because the prompts were in Korean. My colleagues at YU assured me it was OK to install them. Also, IE appears slow and has little additional functionality. So, I decided to abandon IE altogether. Now I use Google Chrome, and it works fine but not perfectly. OK, it’s lightweight, but it offers little additional functionality. And in South Korea it is also a bit buggy: sometimes I navigate to another website by clicking a link, which is not an uncommon thing to do. Then Chrome tells me the link is broken. Oh? Copy-pasting the URL into the address bar subsequently shows no problems with the link whatsoever.

So, I still miss my Firefox, especially because of some essential Firefox add-ons, such as Zotero, Delicious, Downthemall! and Mouse-gestures. I hope, for the Korean people and for researchers in general, that things change rapidly, because although Korea is famous for its Internet speed and Internet adoption, it is also infamous for the Microsoft monopoly.
