Visualising Twitter activity – a work in progress part IX: Collecting more data

So. I want to collect more data. That means I have to add it to the spreadsheet. I also want to enumerate the tweets. Most importantly, I need to be able to handle missing fields in the JSON I get from Twitter. If there are no hashtags in the tweet, the field "hashtags" won't be there. The script should be able to handle that. A lot of the tweets I have collected are retweets. That's fine. But they begin with the string "RT". Should I remove that?
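If the answer to that last question turns out to be yes, dropping the marker is a one-liner. A minimal sketch, assuming retweets always start with exactly "RT " (with the space); `strip_rt` is a made-up name, not something from the existing script.

```python
def strip_rt(text):
    """Return the tweet text without a leading 'RT ' marker, if present."""
    # Only the marker itself is dropped; the "@user:" part that follows is kept.
    if text.startswith("RT "):
        return text[len("RT "):]
    return text


print(strip_rt("RT @someone: hello world"))
print(strip_rt("RTFM, people"))  # no space after "RT", so left alone
```

Checking for "RT " rather than just "RT" avoids mangling ordinary tweets that happen to start with those two letters.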

It is tempting to collect all the data. But I don't think I'm going to do that. A lot of the fields do not appear in every tweet. And I'm not sure I'll ever figure out what to do with them. So it's going to be a selection.

Oh, and by the way: why should I break the current spreadsheet? I'll just add new columns. No need to rework previous graphs. The actual content may vary, but what the heck. And those coordinates that are more or less useless (because no one uses them)? Let's replace them with the counter.

Anyway, which fields?

Column	What	Code	Note
A	Timestamp	data['timestamp_ms']
B	Counter	=row()	Replaces coordinates
C	Screenname	data['user']['screen_name']
D	Link for image	data['user']['profile_image_url']	Profile image for the tweeter
E	Text	data['text']	The actual tweet
F	ID-string	data['id_str']	ID of the tweet, as a string
G	Location	data['user']['location']	Registered location of the user
H	Hashtags	data['entities']['hashtags']	The hashtags of the tweet
I	Language	data['lang']	Language of the tweet, as identified by Twitter

I think that's gonna be it.
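Since any of these fields can be missing, it helps to look them up in a way that never raises an error. A rough sketch of how one row could be built from the table above; `get_field` and `build_row` are hypothetical names, and the empty string as the fallback value is an assumption, not something the original script does.

```python
def get_field(data, *keys, default=""):
    """Walk down nested dict keys; return default if any key is missing."""
    current = data
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    # JSON null (None in Python) is treated like a missing field.
    return default if current is None else current


def build_row(data, hashtags=""):
    """Return one spreadsheet row, in the column order A to I."""
    return [
        get_field(data, "timestamp_ms"),               # A: Timestamp
        "=row()",                                      # B: Counter
        get_field(data, "user", "screen_name"),        # C: Screenname
        get_field(data, "user", "profile_image_url"),  # D: Link for image
        get_field(data, "text"),                       # E: Text
        get_field(data, "id_str"),                     # F: ID-string
        get_field(data, "user", "location"),           # G: Location
        hashtags,                                      # H: Hashtags (built separately)
        get_field(data, "lang"),                       # I: Language
    ]


# A made-up tweet with several fields missing, just to show the fallback.
sample = {"timestamp_ms": "1500000000000", "user": {"screen_name": "someone"}}
print(build_row(sample))
```

Whether "=row()" ends up as a formula or as plain text depends on how the row is written to the spreadsheet, so that part needs checking against the real sheet.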

Getting to the hashtags is a bit difficult. First of all, there might be none. Secondly, there might be more than one. Remember that I get the tweet in JSON-format? And I call the variable “data”.

The first hashtag, given that there is one, can be accessed like this:

data['entities']['hashtags'][0]['text']

The second of course is found simply by replacing “0” with “1”. The third with “2” etc. The number of hashtags can be found like this:

len(data['entities']['hashtags'])

Putting it together:

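hashtag = ""
try:
    # Collect the text of every hashtag in the tweet, separated by spaces
    for x in range(len(data['entities']['hashtags'])):
        hashtag = hashtag + data['entities']['hashtags'][x]['text'] + " "
except (KeyError, TypeError):
    # The tweet has no usable "entities" or "hashtags" field
    hashtag = " "
til_regneark.append(hashtag)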
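The same idea can be written without the try/except by asking for the fields with `.get()`. This is only a sketch of an alternative, not the script's actual code; `hashtags_as_text` is a hypothetical helper name.

```python
def hashtags_as_text(data):
    """Return all hashtags of a tweet as one space-separated string."""
    # "or {}" / "or []" cover both a missing field and an explicit null.
    entities = data.get("entities") or {}
    tags = entities.get("hashtags") or []
    return " ".join(tag["text"] for tag in tags)


# Made-up tweets: one with two hashtags, one with no "entities" at all.
with_tags = {"entities": {"hashtags": [{"text": "dataviz"}, {"text": "python"}]}}
without_tags = {"text": "no hashtags here"}

print(hashtags_as_text(with_tags))
print(repr(hashtags_as_text(without_tags)))
```

One small difference from the loop: there is no trailing space, and a tweet without hashtags gives an empty string rather than a single space.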

Done. I should put some timers in. It takes quite some time from when the tweet is caught until it is written to the spreadsheet. That's simple. Let me just try that.

Yeah. It is writing to the spreadsheet that takes time. About 4 seconds. Not much I can do about that.

And the timing? I already have the "time" library imported. Just place a "start_a = time.clock()" before what you want to time. And an "end_a = time.clock()" right after. And then "print(end_a - start_a)". On Python 3.8 and later time.clock() no longer exists; time.perf_counter() does the same job.
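With more than one step to time, the start/end pair can be wrapped in a small helper. A minimal sketch using `time.perf_counter()` as the clock; `timed` and `write_row` are hypothetical names, and `write_row` only sleeps to stand in for the real spreadsheet call.

```python
import time


def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def write_row(row):
    """Hypothetical stand-in for the slow spreadsheet write."""
    time.sleep(0.05)
    return len(row)


_, seconds = timed(write_row, ["1500000000000", "someone", "hello"])
print("writing took %.2f seconds" % seconds)
```

Wall-clock time is the interesting number here, since most of the delay is waiting on the network rather than the CPU doing work.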

Done. Now I have the data I need. Or I will have in a little while. All that is left is to plot it.