US presidential debates analyzed with VOSViewer

Over the last few weeks the US was treated to three presidential debates and one vice-presidential debate. These are the most watched and most tweeted-about events of the year. Still, the year is not over yet. As a way to get partial insight into the debates, the maps show the words used in each debate: the closer two words are in the Euclidean space, the more often they are used in the same line. The software can be downloaded from vosviewer.com. The manual is short but contains relevant references on the algorithms and on how to use the software. The transcripts were downloaded from Debates.org. Before importing the texts into VOSViewer, they were stripped of anchor elements such as who is speaking and whether there is applause, laughter or crosstalk.
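The preparation step and the "same line" idea can be sketched in a few lines of Python. This is only an illustration and not what VOSViewer does internally: the speaker-label and audience-cue patterns are assumptions about how the transcripts are marked up, and the function names are made up for this example.

```python
import re
from collections import Counter
from itertools import combinations

# Assumed transcript conventions (not confirmed by the source):
# speaker labels such as "OBAMA:" at the start of a line, and
# audience cues such as "(APPLAUSE)" or "(CROSSTALK)".
SPEAKER = re.compile(r"^[A-Z][A-Z .'-]*:\s*")
CUE = re.compile(r"\((?:APPLAUSE|LAUGHTER|CROSSTALK)\)", re.IGNORECASE)

def clean_line(line):
    """Strip the speaker label and audience cues from one transcript line."""
    line = SPEAKER.sub("", line)
    line = CUE.sub(" ", line)
    return " ".join(line.split())  # collapse leftover whitespace

def cooccurrences(lines):
    """Count how often each pair of distinct words shares a line."""
    counts = Counter()
    for line in lines:
        words = sorted(set(re.findall(r"[a-z']+", line.lower())))
        counts.update(combinations(words, 2))
    return counts

lines = [clean_line(raw) for raw in [
    "ROMNEY: Small business tax. (APPLAUSE)",
    "OBAMA: Tax on small business.",
]]
pairs = cooccurrences(lines)
print(pairs[("business", "tax")])
```

Pairs with high counts are the ones a mapping tool would be expected to place close together.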

Below are the maps showing the results for each of the debates. The first map, representing the first presidential debate, shows four clusters. The first cluster (green) indicates the debate focused on the tax system as it affects small businesses. Because “governor” is part of this cluster, it appears that Obama or the interviewer directly addressed Romney on these issues. The second cluster (purple) shows the debate also strongly focused on Obamacare, insurance and the elderly. The third cluster (red) shows healthcare issues, education and the Dodd-Frank reform. The fourth cluster (yellow) is difficult to interpret.

First debate on national politics

The map of the second presidential debate (the “townhall” format) shows that the first cluster (green) focuses on businesses, deductions, the economy and women. The second cluster (red) focuses on the younger citizens of the US, judging from the words: school, kid, candy, college, chance. The third cluster (blue) is difficult to interpret from these words: day, time, question, lot, mr. president, governor. The fourth and final cluster (brownish) focuses on the US-China relation.

Debate on national issues. The “Townhall” meeting

The map of the third presidential debate shows that the debate is clustered around four topics. The first cluster (red) revolves around the relation between the government and American businesses. The second cluster (green) is about the Middle East and the recent unrest (Syria and Libya). The third cluster (blue) is about the American economy and the role of China as the culprit taking away American jobs. The final cluster (yellow) deals with another part of Asia: Iraq, Pakistan and Afghanistan, and the American troops that stay there to prevent war.

Third debate on foreign politics


Bias in Twitter API measurements

I’m working on the analysis of the tweets on the Dutch general elections of 2010. In the five weeks prior to Election Day there was a considerable number of tweets to analyse: 4,585,614 to be precise. Because of the large number of tweets I use SPSS to organize the data and create variables. While working on the data, what struck me was that Twitter’s identification of, for instance, retweets is quite sloppy, to say the least.

As you may know, Twitter provides an API that gives access to its data, so that tweets, locations, etc. can be reused in mashups. A few of these variables are of interest to researchers, for instance user characteristics such as name, location, and follower and following network size. I use these in my research as well, combined with data from other sources (see the upcoming issue of Party Politics on Twitter use by candidates in the EU elections of 2009).

As I was working on the actual tweet content I found some curious discrepancies between Twitter’s measurements and mine. For instance, retweets, identified by “RT” in the tweet text, are only flagged as retweets by Twitter when the “RT” is positioned right at the beginning of the tweet. Even if there is a single blank space before the “RT”, Twitter fails to flag the tweet as a retweet. Furthermore, “RT” codes placed somewhere in the middle of the tweet are not identified correctly either. You could consider these occurrences false negatives.
There are also false positives: tweets flagged as retweets that actually are not. Consider this: in the Netherlands there is a broadcasting organisation called RTL. Indeed, its first two characters are identified as a retweet by Twitter. Similarly, a tweet starting with the text RTV (often used as an abbreviation for Radio and Television) is also identified as a retweet.
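To make both kinds of error concrete, here is a minimal Python sketch. The function naive_is_retweet only mimics the behaviour described above (it is not Twitter's actual code), and is_retweet is one possible corrected rule, not necessarily the one used for the analysis in this post. Requiring an @-name after the “RT” is an assumption of this sketch.

```python
import re

# Assumption: a genuine retweet marker is a standalone "RT" token,
# optionally followed by a colon, followed by an @-name.
RETWEET = re.compile(r"(?<!\w)RT\b\s*:?\s*@\w+")

def naive_is_retweet(text):
    """Mimics the flag described above: only the first two characters count."""
    return text.startswith("RT")

def is_retweet(text):
    """Hypothetical corrected rule: an RT marker anywhere in the tweet."""
    return RETWEET.search(text) is not None

examples = [
    "RT @user: stem morgen!",         # flagged correctly by both rules
    " RT @user: stem morgen!",        # false negative for the naive rule
    "Eens! RT @user: stem morgen!",   # false negative for the naive rule
    "RTL Nieuws begint zo",           # false positive for the naive rule
    "RTV Oost meldt files",           # false positive for the naive rule
]
for text in examples:
    print(naive_is_retweet(text), is_retweet(text), repr(text))
```

The negative lookbehind and the word boundary are what keep RTL and RTV from matching, while the search (rather than a match at position zero) picks up markers after leading spaces or in the middle of the tweet.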

So, to what extent does this influence the findings? In the table below I cross-tabulated the original classification against my corrected version.

 

What we see here is that there are 787 false positives and 127,994 (= 127,987 + 7) false negatives. That’s 2.8% of tweets incorrectly classified. Well, this small fraction does not seem too disturbing. Or does it? In terms of descriptive analysis it might be negligible, as long as the misclassifications are random (which I haven’t checked yet).
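The 2.8% can be checked with a few lines of Python. The counts are the ones quoted in this post; taking all 4,585,614 tweets as the denominator is an assumption of this sketch.

```python
total_tweets = 4_585_614           # all tweets in the data set (from the post)
false_positives = 787              # flagged as retweet, but not a retweet
false_negatives = 127_987 + 7      # retweets that were not flagged

misclassified = false_positives + false_negatives
share = misclassified / total_tweets

print(f"{misclassified} of {total_tweets} tweets ({share:.1%})")
# prints: 128781 of 4585614 tweets (2.8%)
```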

At the same time, if one wants to use these retweets to construct a social network of people retweeting each other (yes, you can do that), things might be different. Even if the misclassifications are random, they might seriously affect network structure indicators.
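As a rough illustration of why this matters, a retweet network can be built as a list of directed edges from the retweeter to the retweeted account. The sketch below is hypothetical: the (author, text) input format, the function name and the retweet pattern are all assumptions made for the example.

```python
import re
from collections import Counter

# Assumption: the retweeted account is the @-name right after a standalone "RT".
RT_SOURCE = re.compile(r"(?<!\w)RT\b\s*:?\s*@(\w+)")

def retweet_edges(tweets):
    """Map (author, text) pairs to a Counter of (retweeter, retweeted) edges."""
    edges = Counter()
    for author, text in tweets:
        for source in RT_SOURCE.findall(text):
            edges[(author.lower(), source.lower())] += 1
    return edges

tweets = [
    ("jan", "RT @piet: ga stemmen!"),
    ("kees", "Eens! RT @piet: ga stemmen!"),  # missed if only position 0 counts
    ("jan", "RTL Nieuws kijken"),             # not a retweet, so no edge
]
edges = retweet_edges(tweets)
print(edges)
```

Every false negative is a missing edge and every false positive a spurious one, so indicators such as in-degree (how often someone is retweeted) shift accordingly.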

Similar classification issues are at play for mentions and replies. Only the first mentioned names are identified, whereas many tweets mention multiple names. Furthermore, a reply is only a reply when the twittername begins at the first position (i.e. when one includes the @-sign).
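Mentions and replies can be handled in the same spirit. This is a sketch under assumptions: it treats any @-name not preceded by a letter, digit or underscore as a mention, and it counts a tweet as a reply when an @-name opens the tweet after optional leading whitespace. Both rules are illustrative choices, not Twitter's definitions.

```python
import re

# The lookbehind keeps e-mail addresses such as me@example.com from matching.
MENTION = re.compile(r"(?<!\w)@(\w+)")

def mentions(text):
    """Return every mentioned name in the tweet, not just the first."""
    return MENTION.findall(text)

def is_reply(text):
    """Assumed rule: the tweet opens with an @-name, ignoring leading spaces."""
    return re.match(r"\s*@\w+", text) is not None

print(mentions("@anna dank, cc @bob en @carol_1"))
print(is_reply(" @anna dank"), is_reply("dank @anna"))
```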

In my opinion, it’s surprising that the programmers at Twitter Inc. haven’t used something like regular expressions to classify the data correctly. To be fair to Twitter, they acknowledge that their retweet count is an approximation. In the meantime it’s better to be safe than sorry: classify the tweets yourself if you can.

Still, if there are tools available – I already mentioned regular expressions or think of the string functions available in SPSS – researchers studying the actual tweets might consider these to get more accurate results.


Surfing the web in South Korea

As we all know, South Korea is famous for its Internet landscape: high adoption rates and high bandwidth. However, I have mixed feelings about browsing the web in South Korea. Of course the speed is very high: upload and download speeds are even higher than those of my subscription in the Netherlands (one of the fastest available from UPC). However, the speed at YU is considerably slower. In the Netherlands, in my experience, speeds at universities are higher than in regular homes.

The websites in South Korea have an altogether different feel or appearance (whatchamacallit) than Western web pages. An important factor is of course the Hangeul character set. These characters seem to make the design more delicate. This is enhanced by the use of light and pastel colours on the page. Furthermore, the pages often show some animation as the page builds up quickly. See for instance the pages of YeungNam University, Lotte department store, and Ohmynews.

Unfortunately, there is a big problem, especially for people coming from abroad. All websites and computers are totally dedicated to Microsoft. First of all, you will not find any browser other than Internet Explorer. While the European Union forces Microsoft to offer a choice of browsers when a newly bought computer is started up, in South Korea you are forced to use Internet Explorer. Without it you will not venture far in Korean cyberspace. Not only that, the dominant IE version is 6, even though Internet Explorer 8 was released some time ago. There is an important reason for this: it turns out many Korean websites use ‘Active X’ controls for loading apps into the browser. Because ‘Active X’ is not supported by other browsers, people can only use IE. Apart from this, ‘Active X’ is said to suffer from security issues.

Coming to South Korea oblivious of this issue and assuming browsing is browsing, I found using Firefox quite frustrating. Gmail simply didn’t work on my university computer: I’d see my list of emails but was not able to open any of them. So what do you do when you’re an avid Firefox user with many add-ons installed? Do you switch to Internet Explorer? I didn’t. Why? Well, when using Internet Explorer for any of the more special navigation activities, I was asked to install all kinds of ActiveX things. What they were I still don’t know, because the prompts were in Korean. My colleagues at YU assured me it was OK to install them. Also, IE appears slow and has little additional functionality. So, I decided to abandon IE altogether. Now I use Google Chrome, and it works fine but not perfectly. OK, it’s lightweight, but it offers little additional functionality. And in South Korea it is also a bit buggy: sometimes I navigate to another website by clicking a link, which is not an uncommon thing to do, and Chrome tells me the link is broken. Oh? Copy-pasting the URL into the address bar subsequently shows no problems with the link whatsoever.

So, I still miss my Firefox, especially because of some essential Firefox add-ons, such as Zotero, Delicious, Downthemall! and Mouse-gestures. I hope, for the Korean people and for researchers in general, that things change rapidly, because although Korea is famous for its Internet speed and Internet adoption, it is also infamous for the Microsoft monopoly.
