Harvesting and visualizing data | Nusses hjem på nettet

OK. The datalab should know how to harvest data. And we should know how to visualize them. Otherwise – how will we be able to help our users?

Libraries should have HUGE amounts of data. Unfortunately, we have not been able to make much headway making them publicly accessible. What to do? Get some data from another source. Twitter appears to be a very important way to communicate. I do. Why not harvest the number of tweets I tweet, and visualize them?

The correct way to do this, would be to harvest the data from Twitters API. That requires a developer account, which requires a regular account that has been verified by SMS. My phone company does not support this feature. However – the data is freely available on the web. Just go to my twitterprofile, and take a look.

And how do I get this data? Enter Python. Use the urllib2 library to harvest the site:

import urllib2 
page = urllib2.urlopen('http://www.twitter.com/chrbknudsen')

Then convert it to something that we can work with:

from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(page.read())

page.read() returns the content of the site we read. BeautifulSoup converts that content into an object with a lot of nifty methods. Take a look:

tweets = soup.find('li', {'class' : 'ProfileNav-item ProfileNav-item--tweets is-active'}).find('span', {'class' : 'ProfileNav-value'})

BeautifulSoup lets us find content that is enclosed by specific tags. The number of tweets is in a LI-tag, with the class “ProfileNav-item ProfileNav-item–tweets is-active”. But there is more under that tag than just the number of tweets. Therefore we use the method “find” twice, now with the tagt SPAN, and the class “ProfileNav-value”.

Now we have the number of tweets (the moment I reach a thousand tweets, we will have to do something – twitter shows two thousand tweets as 2,000. And if I have even more, they might show 14563 tweets as 14.6td. Simple conversions to remove punctuation and changes 14,6td to 14600. We loose some precision, but nevermind).

OK. Now we have a Python script, that can access a twitter-profile, and harvest the number of tweets. On a Raspberry Pi (or other *nix-boxes) we can run a cron-job to harvest once every hour. Note that this method of harvesting is not exactly approved by Twitter.

We need to save the data. Enter Google Spreadsheets. I wont go into details about how to aquire the json-file that is used to authorize your use of Google Spreadsheets. Maybe I’ll get back to it. But we need some libraries:

import json
import gspread
import ssl
from oauth2client.client import SignedJwtAssertionCredentials

Those are used by this code:

json_key = json.load(open('client_secret_2.json'))
scope = ['https://spreadsheets.google.com/feeds']

ssl._create_default_https_context = ssl._create_unverified_context

credentials = SignedJwtAssertionCredentials(json_key['client_email'], json_key['private_key'].encode(), scope)

gc = gspread.authorize(credentials)

wks = gc.open("twitterdata").sheet1

Note that you will need to provide a full path to the json-file in order to run the cron-job. That fact gave me an ulcer yesterday…

We have connected to Google Spreadsheets through the Google API, and we have opened the spreadsheet “twitterdata”, and the first worksheet. We now have an object “wks”, that we can manipulate.

We use this method:

wks.append_row(rowToAdd)

That appends a row with some data to the active sheet. The data is contained in a list, “rowToAdd”. Now we just have to make sure that there is some data there. Before calling rowToAdd, we populate the variable:

rowToAdd = [time.time()]
rowToAdd.append(tweets)

Yeah, we need to import time as well.

That was it. Not exactly a working example, but a pretty accurate description of my thought processes writing it.

So far so good. Theres a cronjob running on an RPi back home, it harvest the number of tweets once every hour, and writes it to a google spreadsheet.

Now we need to visualize it.

Enter Highcharts. It’s free, its written in javascript and it can be embedded.

<div id="chart_container">
</div>
<script src="//code.highcharts.com/adapters/standalone-framework.js"></script>
<script src="//code.highcharts.com/highcharts.js"></script>
<script src="http://code.highcharts.com/modules/data.js"></script>

<script>
new Highcharts.Chart({
 "chart":{
 "backgroundColor":"#fefefe",
 "renderTo":"chart_container",
 "type":"spline"
 },
 "title":{
 "text":"Title"
 },
 "colors":["#476974","#3ca1c1","#4ccbf4","#96dff6","#c9e8f6"],
 "legend":{
 "enabled":true,
 "margin":30
 },
 data: { googleSpreadsheetKey: '[enter spreadsheetkey here]', 
 googleSpreadsheetWorksheet: 2, 
 },
 });
</script>

That was it. This is the result:

Some work still needs to be done. The UNIX-time is not formatted as dates – it needs to be converted to javascript-time. This is not actually live data. And theres a bunch of other details I would like to tinker with. But I’m still rather satisfied.