Visualising Twitter activity – a work in progress part IX: Collecting more data

So. I want to collect more data. That means I have to add it to the spreadsheet. I also want to enumerate the tweets. Most importantly – I need to be able to handle missing fields in the JSON I get from Twitter. If there are no hashtags in the tweet, the field "hashtags" won't be there. The script should be able to handle that. A lot of the tweets I have collected are retweets. That's fine. But they begin with the string "RT". Should I remove that?

It is tempting to collect all the data. But I don't think I'm going to do that. A lot of the fields do not appear in every tweet. And I'm not sure I'll ever figure out what to do with it. So it's going to be a selection of them.

Oh, and by the way – why should I break the current spreadsheet? I'll just add new columns. No need to rework previous graphs. The actual content may vary, but what the heck. And those coordinates that are more or less useless (because no one uses them)? Let's replace them with the counter.

Anyway, which fields?

Column  What            Code                                 Note
A       Timestamp       data['timestamp_ms']
B       Counter         =ROW()                               Replaces the coordinates
C       Screen name     data['user']['screen_name']
D       Link to image   data['user']['profile_image_url']    Profile image of the tweeter
E       Text            data['text']                         The actual tweet
F       ID string       data['id_str']                       ID of the tweet – as a string
G       Location        data['user']['location']             Registered location of the user
H       Hashtags        data['entities']['hashtags']         The hashtags of the tweet
I       Language        data['lang']                         Language of the tweet, as identified by Twitter

I think that's gonna be it.
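
Before moving on, here is a minimal sketch of how such a row could be built defensively with dict.get(), so that a missing field gives an empty string instead of a crash. The row variable and the empty-string defaults are my own illustration, not necessarily what the final script will do:

user = data.get('user', {})
row = [
    data.get('timestamp_ms', ''),       # column A: timestamp
    '',                                 # column B: counter – filled in by =ROW() in the sheet itself
    user.get('screen_name', ''),        # column C: screen name
    user.get('profile_image_url', ''),  # column D: profile image link
    data.get('text', ''),               # column E: the actual tweet
    data.get('id_str', ''),             # column F: tweet ID as a string
    user.get('location', ''),           # column G: the user's registered location
    ' '.join(h['text'] for h in data.get('entities', {}).get('hashtags', [])),  # column H: hashtags
    data.get('lang', ''),               # column I: language
]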

Getting to the hashtags is a bit difficult. First of all, there might be none. Secondly, there might be more than one. Remember that I get the tweet in JSON format, and that I call the variable "data"?

The first hashtag – given that there is one – can be accessed like this:

data['entities']['hashtags'][0]['text']

The second, of course, is found simply by replacing "0" with "1", the third with "2", and so on. The number of hashtags can be found like this:

len(data['entities']['hashtags'])

Putting it together:

hashtag = ""
try:
    # concatenate every hashtag in the tweet, separated by spaces
    for x in range(len(data['entities']['hashtags'])):
        hashtag = hashtag + data['entities']['hashtags'][x]['text'] + " "
except KeyError:
    # the tweet has no 'entities' or no 'hashtags' field
    hashtag = " "
til_regneark.append(hashtag)

Done. I should put some timers in. It does take quite some time from when the tweet is caught until it is written to the spreadsheet. That's simple. Let me just try that.

Yeah. It is writing to the spreadsheet that takes time. About 4 seconds. Not much I can do about that.

And the timing? I already have the "time" library imported. Just place a "start_a = time.clock()" before what you want to time, and an "end_a = time.clock()" right after. And then "print(end_a - start_a)".
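
Put together, the pattern looks something like this. A minimal sketch – I'm assuming the slow spreadsheet write is a gspread append_row call on the wks worksheet from the earlier parts, and note that time.clock() was later deprecated in favour of time.perf_counter():

import time

start_a = time.clock()          # deprecated on newer Pythons; time.perf_counter() is the replacement
wks.append_row(til_regneark)    # assumption: the slow spreadsheet write being timed
end_a = time.clock()
print(end_a - start_a)          # elapsed time in seconds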

Done. Now I have the data I need. Or I will have it in a little while. All that is left is to plot it.

Visualising Twitter activity – a work in progress part VII: Getting started on visualising stuff

Now we’re getting somewhere! At the bottom of this post will be links to the previous parts of this odyssey.

I now have a spreadsheet with some data. How do I make that interesting to look at? First of all, I have to decide on a framework for graphing this. In part IV, I made a choice: I chose to save the data in a Google Spreadsheet. For two reasons. One is that I know something about how to work with it. As the previous parts have revealed, I don't know much. But I know enough. The other reason is that Google has a chart API that I should be able to use. That in itself is not a very good reason. I know how to work with HighCharts, so why not use that instead of learning something new? Because Google promises that the graphs will update live, as I fill in more data in the spreadsheet.

How cool is that?

Where to begin? I don't think that Google actually provides any good examples. A lot of other people do – it just takes a bit of googling to find something that is helpful. I think this example is a good beginning. Go on, take a look.

OK, first of all it begins like this:

function drawChart() {
    var query = new google.visualization.Query('http://docs.google.com/spreadsheet/tq?key=0AhCv9Xu_eRnSdHBELWZoZHRaQTl5VXdRa3JLTzByUlE&gid=1');

That tells the chart software where to find the data. It does not look like the ordinary link to a spreadsheet. That looks a bit more like this:

https://docs.google.com/spreadsheets/d/14kCXRFEOxBMgbGDhwBP1Zy8yhZhC7ty54iBVlCovKxc/edit#gid=1

I'm not sure exactly what the difference is, but when I follow the URL in the script, my browser downloads a JSON file. I think it contains the data from the spreadsheet. So I need to take the URL I already have for my spreadsheet, and format it like the one in the example.

https://docs.google.com/spreadsheet/tq?key=1kDIxZ5jBWYB2eiAjfsM_jRBr2mYWxudORECFApXBULE&gid=3
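
To make the rewrite explicit, here is a throwaway Python sketch of it. The tq_url function name is mine; the key is the long ID between /d/ and /edit in the ordinary URL:

def tq_url(key, gid):
    # build the Google Visualization query URL from a spreadsheet key and sheet gid
    return 'https://docs.google.com/spreadsheet/tq?key=' + key + '&gid=' + str(gid)

print(tq_url('1kDIxZ5jBWYB2eiAjfsM_jRBr2mYWxudORECFApXBULE', 3))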

I'm almost certain that gid=3 points to sheet number 3 in the spreadsheet. I've accessed the spreadsheet from the example, and copied the data to sheet no. 3 in my own sheet. If I swap my own long string of random letters into the example, nothing should happen.

It didn't. The graph on that fiddle looks exactly the same. Now, let's try the real data. That is in sheet number 1.
Let's also change the "query.setQuery('SELECT A, B, etc')" to "query.setQuery('SELECT B, A')". That should collect the data from columns B and A in the sheet.
Try again. Yep. I'm now successfully plotting the data.

[Image: the graph]

It does not, however, appear to update live…

It appears that Google does not live up to its promises. I'll need HighCharts for that. If, that is, I can figure out how to get live data.
There is another problem. This is a very bad graph. It plots the timestamp of the tweet against the coordinates from the tweet (remember that those were almost always "None"?). Not that interesting. Maybe Google supports some way of counting? Quick test: exchange 'SELECT B, A' with 'SELECT COUNT(A), A'. The theory is that I should get the count of column A on the x-axis, and the timestamp on the y-axis.

Nope. Better take a look at the query language. It's some variant of SQL. The documentation can be found here. Note – the graph above is not the real thing. I might have to change the spreadsheet, and that will change the graph. Therefore it is an image of the graph. The code is given below.
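
Skimming the documentation, I think I can see why the quick test failed: in this query language, an aggregate like COUNT() can only be combined with an ordinary column if that column appears in a GROUP BY clause. So a working counting query would presumably look more like this:

SELECT A, COUNT(B) GROUP BY A

That is, as it turns out, pretty much the shape of the query I end up with in part VIII.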

And now, before continuing, I have to do something about the laundry.

And now, the promised list of the previous parts:

Part I: The idea, and the hardware
Part II: Getting my hands on tweets
Part III: Deciding what data to save (that is going to change over the next few days)
Part IV: Storing the data
Part V: Fixing some problems
Part VI: Problems that can wait

And the code for the graph:

<script src="http://www.google.com/jsapi?fake=.js" type="text/javascript"></script>
<script type="text/javascript">// <![CDATA[
function drawChart() {
    var query = new google.visualization.Query('http://docs.google.com/spreadsheet/tq?key=1kDIxZ5jBWYB2eiAjfsM_jRBr2mYWxudORECFApXBULE&gid=1');
    query.setQuery('SELECT B, A');
    query.send(function (response) {
        if (response.isError()) {
            alert('Error in query: ' + response.getMessage() + ' ' + response.getDetailedMessage());
            return;
        }
        var data = response.getDataTable();
        var chart = new google.visualization.LineChart(document.querySelector('#google_test_container'));
        chart.draw(data, {
            chartArea: {width: '70%', height: '50%'},
            fontName: ["Arial"],
            colors: ['#274358', '#5e87a5', '#a2cdf6'],
            curveType: ['none'],
            fontSize: ['11'],
            hAxis: {title: 'Tweets over time', titleTextStyle: {italic: false, color: 'black', fontSize: 12}},
            legend: {position: 'right', textStyle: {color: 'black', fontSize: 12}},
            lineWidth: 2,
            pointSize: 7,
            tooltip: {textStyle: {color: 'Black'}, showColorCode: false}
        });
    });
}
google.load('visualization', '1', {packages: ['corechart'], callback: drawChart});
// ]]></script>
<div id="google_test_container"></div>

Yep, it's messy.

Visualising Twitter activity – a work in progress part VI: Work to be postponed

Done. The test at the end of the last chapter was a success. An ugly hack, to be sure, but as long as it works. This is going to be a shorter part in the series. There are a number of different things I should do. But now I'm getting close to the interesting part. A list of stuff that can wait, because I'm impatient:

  • I really should have more exception handling. I'm communicating with the internet, and a lot of things can go wrong. During testing I got an Internal Server Error from Google. I should be able to handle that without crashing.
  • Having handled that, I should be able to collect the tweets that were tweeted during the connection problem, and add them to the spreadsheet later.
  • I should also be able to handle getting kicked off the connection to Twitter. How do I re-establish it?
  • There is a potential problem: Twitter does not guarantee that all fields will be in all tweets. I should harden the script to handle cases where data is missing from a tweet.
  • I know what the hashtag will be. But as I'm going to leave the Raspberry Pi at home, I should be able to handle a situation where I'm at work at 8 am, the hashtag is announced at 9 am, and I have to enter the hashtag into the code and start collecting.
  • Remember the exceptions back in part V? That exception gets thrown each and every time. Of course the tweets are saved, but I re-authorize the connection to Google every time. It's not important enough to crash anything. But wks should be declared as a global variable. And apparently I can't figure out how to do that correctly (see the sketch after this list).

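For that last point, here is a minimal sketch of the global-variable pattern. The get_worksheet function and the authorize_and_open_sheet() helper are my own stand-ins, not the actual code from part IV:

wks = None  # module-level worksheet handle

def get_worksheet():
    global wks  # rebind the module-level name instead of creating a local one
    if wks is None:
        wks = authorize_and_open_sheet()  # hypothetical stand-in for the gspread authorization
    return wks

The point is the global statement: without it, an assignment to wks inside a function creates a new local variable, and the module-level one stays untouched.
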
That will have to wait. Apart from the last one, that is – it is actually somewhat critical. I'm getting close to being able to visualise some data. I'll begin playing in a short while, but right now I think I need to spend some time on the couch.


Oh, and for those just joining the party:

Part I: The idea, and the hardware
Part II: Getting my hands on tweets
Part III: Deciding what data to save (that is going to change over the next few days)
Part IV: Storing the data
Part V: Fixing some problems

Visualising Twitter activity – a work in progress part II: Getting the tweets

In part I, I covered the basic idea and the hardware. Now I'm ready to turn my attention to actually getting my hands on some tweets.

Twitter has an API, an Application Programming Interface, that allows programs to talk directly with the Twitter servers. In order to gain access, I need to have an authenticated Twitter account. I already have that. It was not easy, as Twitter requires a phone number to be associated with the account. That can only happen if the phone company supports receiving SMSes from Twitter. My phone company did not until recently.

That hurdle tackled, I go to the Twitter Apps site. I have to be logged in, and click "Create New App". After I've filled out the required information, I'm provided with four long random strings. I'm not going to show them here. Let's just say that they are long, and contain both digits and letters, both upper and lower case. I get a Consumer Key and a Consumer Secret – basically a login to Twitter. And I get an Access Token and an Access Token Secret, which are specific to the given application I'm gonna write. Those four strings are the credentials I'm going to use to gain access.

Next step. I could write a program that sends a request to the Twitter API, and receives data. Or rather, I couldn't, because I'm not that good a programmer. As luck would have it, I'm not the first person in this situation.

But I’m getting ahead of myself. The first question should actually be: Which language should I write this in?

The de facto preferred language on the Raspberry Pi is Python. Python is not the fastest language in the world – it's an interpreted language rather than a compiled one. But it has a very large userbase, and the fact that it is interpreted makes development quick. And it is very popular in data science circles. Probably because it is pretty accessible to people who are not really programmers.

I could write a program – script, actually – from the bottom up, that handles all the connectivity with Twitter. Or rather, I can't. I'm not that good a programmer. But someone else has already written a library, a collection of code, that handles all the really difficult parts. Python comes with a LOT of libraries. One of them is called Twython, a portmanteau of Twitter and Python. The intelligent reader of this post can guess what Twython does.

I do not need to understand everything myself. Others have done that, and on this page I find an example of how to get tweets with Twython in Python. The code is as follows:

from twython import TwythonStreamer

class TweetStreamer(TwythonStreamer):
    def on_success(self, data):
        if 'text' in data:
            print data['text'].encode('utf-8')

    def on_error(self, status_code, data):
        print status_code
        self.disconnect()
# replace these with the details from your Twitter Application
consumer_key = ''
consumer_secret = ''
access_token = ''
access_token_secret = ''

streamer = TweetStreamer(consumer_key, consumer_secret,
                         access_token, access_token_secret)

streamer.statuses.filter(track = 'python')


OK. I enter my credentials from Twitter, and run the code: sudo python collect.py.

And tweets containing the word "python" are streaming down my screen:

[Screenshot: tweets streaming down the terminal]

Actually, it was a bit more difficult. The Twitter API requires a secure connection, and none of the requisite libraries handling secure connections were installed. That took quite a lot of googling. A pro tip: read the error messages you get. Some of them contain links to solutions.

I must admit that I'm not quite sure what is actually happening. Object-oriented programming is not my strong suit. Never mind, I just need to understand enough.

The line

streamer.statuses.filter(track = 'python')

tells the API that I'm interested in tweets containing the word "python".

And this part of the script:

    def on_success(self, data):
        if 'text' in data:
            print data['text'].encode('utf-8')

does stuff to the tweet I receive. The tweet is returned from Twitter in a format called JSON, and it is saved in the variable called "data". There are different fields in the format; one of them is "text". I can access the content of that field by calling data['text']. I may need to do something about the encoding. People tweet in a lot of different languages, so tweets are encoded in the character set UTF-8. As you can see in the screenshot, my computer does not really know what to do with exotic characters.

There are a lot of other fields containing data in the tweets. But the main objective is achieved – I can get tweets on my screen in realtime. If you tweet something containing the word "python", I will get it on my screen moments after.

The second part of the mission is accomplished. On to the third. But first – we’re going to the mall for a burger.

Visualising Twitter activity – a work in progress part VIII: Getting serious about visualising stuff

Now we're getting somewhere! It took most of last night reading up on long-forgotten SQL skills. I present to you: the most prolific tweeters.

OK, I need to do something about the graphical stuff. The text is pretty cramped.

The magic? This SQL-style selection.

SELECT C, COUNT(A) GROUP BY C ORDER BY COUNT(A) DESC LIMIT 15

Combined with choosing BarChart rather than lines. It really is that simple. Although it certainly was anything but simple to figure out (until I gave up and consulted my SQL books).
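
Concretely, I believe only two lines change relative to the chart code from part VII – the query and the chart constructor (just those two lines, not a full listing):

query.setQuery('SELECT C, COUNT(A) GROUP BY C ORDER BY COUNT(A) DESC LIMIT 15');
var chart = new google.visualization.BarChart(document.querySelector('#google_test_container'));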

What else? I want a graph showing the number of tweets as a function of time. I would also like a wordcloud. And a tagcloud. I don't think I can make those with the Google framework. Or maybe I can: I'd better take a look at this page. The tag cloud would be easier if I import other fields from the tweets to the spreadsheet.

And if I want a count – it appears that I have to add "=ROW()" to a column in the spreadsheet. That will, I hope, give me a counter: =ROW() in cell B2 shows 2, in B3 shows 3, and so on.

Of course, that is going to seriously mess up the chart above. But it will only take a change of two letters to fix. Next step will be to import all the fields from the tweets to the database. Just a moment 🙂