
Mapping Twitter Data: Sherlock vs NYE

Perusing social media on New Year’s Day, one couldn’t help but notice the large number of tweets about the start of Season 3 of the BBC TV series ‘Sherlock’, nearly surpassing those about the New Year and reflections on the year just passed.

It seemed appropriate to use some of my new social media toolkit to collect this data. The data was collected from 3:00pm until midnight, the show being broadcast from 9:00pm until 10:00pm.

Below is the result, showing the widespread hype throughout London during the day: approximately 1,800 geo-located tweets, which appears to be only a very small subset of the tweets that were actually posted.

Nonetheless, it is an interesting map to observe (mouse over a green dot to view the tweet text).



Data mining with XPath

 

At the recent CASA Hackathon I was part of a team, led by Steven Gray, that developed a Social Radar (originally envisioned as a Google Glass app), retrieving data from Twitter and producing maps. Steve has produced an accessible and very useful presentation on how to retrieve these feeds here.

While working on this project I thought that it would be useful for the radar to display what events are currently on in the city.

One useful way to retrieve data when there is no API, or you have no time to learn an API, is by using XPath.

In the simplest terms, XPath is a way to retrieve specified nodes within XML (or simply HTML) pages.

For example, if you have the following code:


<div class="name">
  <firstname>John</firstname>
  <surname>Smith</surname>
</div>

You could query the ‘firstname’ element and it would return ‘John’.
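Using Python’s lxml library, for instance, a minimal sketch of this query might look like the following (pasting the snippet in as a string):

import lxml.html

# Parse the example snippet and query the 'firstname' element
snippet = '<div class="name"><firstname>John</firstname><surname>Smith</surname></div>'
doc = lxml.html.fromstring(snippet)
print doc.xpath('//firstname/text()')  # ['John']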

One can write a script in a language such as Python to automatically retrieve data from webpages, to either display it live or collect it over time.
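As a minimal sketch (with a hypothetical URL and XPath – substitute your own), such a script might look like this:

import lxml.html
import urllib2

# Hypothetical page and query: fetch the HTML and print the text of
# every node matching the XPath. Running this on a schedule (e.g. via
# cron) would collect the data over time.
url = 'http://example.com/events'
doc = lxml.html.document_fromstring(urllib2.urlopen(url).read())
for node in doc.xpath('//h3/a'):
    print node.text_content()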

Google Drive has made it very simple to undertake such tasks, and here I will post a short tutorial.

Using Google Drive

1. Create a Spreadsheet in Google Drive
2. In Sheet1 fill the first two column headings (A1, B1) with ‘Source’ and ‘XPath’ respectively.
3. In the Source column (A2), put the URL of a website with some data you would like to retrieve (e.g. TimeOut Events: http://www.timeout.com/london/search?_source=global&profile=london&_dd=&page_zone=events&keyword=&section=events&on=today&locationText=&_section_search=events)
4. In the XPath column (B2), place the XPath (e.g. //div[@class='topSection']/h3). This is the hardest bit – reading this might help.
5. Create Sheet2, and use the IMPORTXML function to run the XPath query, e.g. =IMPORTXML(Sheet1!A2, Sheet1!B2) (a mock-up of both sheets is shown after this list).
6. Once you have tried a few of these feeds you can publish the document to the web, or geocode the results if you have chosen a dataset with addresses (such as this TimeOut dataset). Geo for Google Docs is quite useful for this, particularly if you use TileMill/MapBox. Google Fusion Tables also has good mapping capabilities.
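As an illustrative mock-up of steps 2–5 (cell contents abbreviated), the two sheets might look like this:

Sheet1:
     A                                    B
1    Source                               XPath
2    http://www.timeout.com/london/...    //div[@class='topSection']/h3

Sheet2 (cell A1):
=IMPORTXML(Sheet1!A2, Sheet1!B2)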

Click here for an example of one of these documents.

Using Python

One limitation of the method mentioned above is that it is capped at a maximum of 50 XPath queries per document (last time I checked). If you are interested in harvesting a large dataset (for example, parsing through a real estate website for all houses currently on sale in a city, or collecting them continuously over time), one can utilise XPath in Python. The following script was produced at the event to retrieve data from TimeOut, clean it, geocode it and write it to a CSV file.

import lxml.html
import urllib
import urllib2
import json
import csv

invalid = 0
with open('timeout-london.csv', 'wb') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter=',',
                            quotechar='"', quoting=csv.QUOTE_MINIMAL)
    spamwriter.writerow(['title', 'time', 'venue', 'location', 'lat', 'lon'])
    print 'title;time;venue;location;lat;lon'
    # Request one event per page and step through the paginated results,
    # stopping after five failed pages.
    for num in range(1, 500):
        if invalid >= 5:
            break
        try:
            url = ("http://www.timeout.com/london/search?"
                   "language=en_GB&profile=london&order=popularity&page=" + str(num) +
                   "&page_size=1&source=admin&type=event&on=today&_section_search=&section=")
            page = urllib.urlopen(url)
            doc = lxml.html.document_fromstring(page.read())
            # The event title sits in an <h3><a>; the 'time | venue | area'
            # breakdown sits in the second <h4>
            title = doc.xpath("//h3/a")[0].text_content()
            breakdown = doc.xpath("//h4")[1].text_content().split('|')
            # Collapse runs of whitespace in each field
            time = " ".join(breakdown[0].split())
            loc_name = " ".join(breakdown[1].split())
            try:
                loc_area = " ".join(breakdown[2].split())
            except IndexError:
                loc_area = 'London'
            # Geocode the venue with the Google Maps geocoding API
            add = urllib2.quote(loc_name + ', ' + loc_area + ', London')
            geocode_url = "http://maps.googleapis.com/maps/api/geocode/json?address=%s&sensor=false&region=uk" % add
            jsonResponse = json.loads(urllib2.urlopen(geocode_url).read())
            location = jsonResponse['results'][0]['geometry']['location']
            lat, lon = str(location['lat']), str(location['lng'])
            spamwriter.writerow([title, time, loc_name, loc_area, lat, lon])
            print title + ';' + time + ';' + loc_name + ';' + loc_area + ';' + lat + ';' + lon
        except Exception:
            # Malformed or missing page – count it and move on
            invalid += 1

The output looks something like this, with lat/lon columns suitable for mapping.


title;time;venue;location;lat;lon
New Year's Eve 2013 Firework Display;Tue Dec 31;EDF Energy London Eye;Waterloo;51.5033;-0.11475
Christmas at Kew Gardens 2013;Until Sat Jan 4 2014;Kew Gardens;Kew, Surrey;51.4782342;-0.2984129
Carnaby Christmas 2013: The Year of the Robin;Until Mon Jan 6 2014;Carnaby Street;Soho;51.5148445;-0.1413416
The Book of Mormon;Until Sat Apr 5 2014;Prince of Wales Theatre;Leicester Square;51.51121389999999;-0.1198244
Coriolanus;Until Sat Feb 8 2014;Donmar Warehouse;Leicester Square;51.51121389999999;-0.1198244
Mojo;Until Sat Feb 8 2014;Harold Pinter Theatre;Trafalgar Square;51.51121389999999;-0.1198244
Winter Lights at Canary Wharf 2013;Tue Dec 31 - Sat Jan 25 2014;Canary Wharf;Docklands;51.50755299999999;-0.024526
......
......
......

I hope this code and these examples prove useful for anyone trying to retrieve data from difficult pages on the web where no feeds are available. If you would like to adapt this code for other webpages, feel free to contact me or try it yourself. Quick example:

(example Social Event map using code from this tutorial; data: TimeOut London, map: MapBox)


Re.Work Cities

Today I was fortunate enough to attend the RE.WORK Cities conference at the stunning Tobacco Dock in London.

It was a great opportunity to meet people from fields as diverse as architecture, economics, data security, programming, planning and social work, all with a shared, positive interest in making our cities better.

The conference covered topics ranging from urban mobility, the internet of things, sensors and urban interaction design to synthetic biology and 3D printing.

While there were many highlights, some topics in particular stood out.

1) Getting there & shared mobility

Part of the conference included a voucher to use Uber. At first glance Uber seems like a regular taxi service: one rings up, asks for a taxi, and the taxi comes and takes you from A to B. Experiencing it first hand, however, makes you realise the difference. The whole process was seamless.

One simply opens the app: your location is automatically detected, a fare is calculated if you enter a destination, and all nearby drivers are mapped. Once ‘Go’ was pressed, I only had to wait a few minutes until an SMS arrived informing me of the sleek black car waiting outside. The journey was minimal fuss – just a confirmation of the destination, with the fare charged automatically upon arrival.

This experience set the tone for the rest of the day. From this well-integrated process it was clear that some of the topics mentioned, such as driverless cars incorporated into car-sharing schemes, are not impossible future scenarios, and are quite achievable with only minor modifications to the user-friendly yet clearly comprehensive dispatch system offered by this application.

2) Heal-able materials

Erik Schlangen’s self-healing asphalt and concrete: not only reducing maintenance cost and time but, implemented on a much larger scale, reducing transport disruptions and the cost of repairing buildings (such as affordable housing).

(Image source: BBC)

3) Drones

Earlier this year I was exposed to drone delivery through this news article:

China grounds world’s first CAKE DRONES over fears they might fall on someone’s head as novelty delivery service goes from sweet to sour

While this use of drones was a bit tongue-in-cheek, drones were brought up a few times in the conference.

For example, for the creation of parametric structures and for assisted navigation.

4) Changing nature of work / Future Londoners

(Image: Future Londoners)

Globalisation and the Internet are changing the traditional notions of work and working hours; work-life patterns are shifting and no longer fit the same paradigms. The Future Londoners project, for example, imagines some of these future citizens and how they will live and work.

5) New approaches to sustainability

Collecting carbon pollution and using it as a resource for cities – creating structures from thin air.


There were many more new ideas and projects that I would like to write about in future – hopefully a recording of the event will be released so it can be shared.