:::: MENU ::::

Data mining with XPath

 

At the recent CASA Hackathon I was part of a team that developed a Social Radar (originally forecast to be a Google Glass app), retrieving data from Twitter and producing maps led by Steven Gray. Steve has produced a ergonomic and very useful presentation on how to retrieve these feeds here.

While working on this project I thought that it would be useful for the radar to display what events are currently on in the city.

One useful way to retrieve data when there is no API, or you have no time to learn an API, is by using XPath.

In my most simple definition, XPath is a way to retrieve specified nodes within XML (or simply HTML) pages.

For example, if you have the following code:


< div class = "name" >
< firstname > John < / firstname >
< surname > Smith < /surname
< div >

You could query the ‘firstname’ element and it would return ‘John’.

One can write a script in a language such as Python to automatically retrieve data from webpages, to either display it live or collect it over time.

Google Drive have made it very simple to undertake such tasks, and here I will post a short tutorial.

Using Google Drive

1. Create a Spreadsheet in Google Drive
2. In Sheet1 fill the first two column headings (A1, B1) with ‘Source’ and ‘XPath’ respectively.
3. In the Source column (B2), put in the name of a website with some data you would like to retrieve (e.g. TimeOut Events : http://www.timeout.com/london/search?_source=global&profile=london&_dd=&page_zone=events&keyword=&section=events&on=today&locationText=&_section_search=events )
4. In the XPath column, place the XPath (e.g. //div[@class=’topSection’]/h3 ). This is the hardest bit – reading this might help.
5. Create Sheet2, and use the importxml function to run the XPath query e.g. ( importxml(Sheet1!A2,Sheet1!B2)
6. Once you have tried a few of these feeds you can publish the document to the web, or geocode the results if you have chosen a dataset with addresses (such as this TimeOut dataset). Geo for Google Docs is quite useful for this, particularly if you use TileMill/Mapbox. Google Fusion tables also has good mapping capabilities.

Click here for an example of one of these documents.

Using Python

One of the limitations of the method mentioned above is it is limited to a maximum of 50 XPath queries (last time I checked) per document. If you are interested in harvesting a large dataset (for example, parsing through a real estate website for all houses currently on sale in a city, or collecting them continuously over time) one can utilise XPath in Python. The following script was produced at the event to retrieve data from TimeOut, clean the data, Geocode it and place it in a CSV file.

import lxml.html
import urllib
import urllib2
import pprint
import json
import csv
invalid = 0
with open('timeout-london.csv', 'wb') as csvfile:
spamwriter = csv.writer(csvfile, delimiter=',',
quotechar='"', quoting=csv.QUOTE_MINIMAL)
spamwriter.writerow(['title','time','venue','location','lat','lon'])
print 'title;time;venue;location;lat;lon'
while invalid < 5:
for num in range(1,500):
try:
s1 = "http://www.timeout.com/london/search?"
s2 = "language=en_GB&profile=london&order=popularity&page="+str(num)+"&page_size=1&source=admin&type=event&on=today&_section_search=&section="
s3 = str(s1+s2)
page = urllib.urlopen(s3)
doc = lxml.html.document_fromstring(page.read())
xme = doc.xpath("//h3/a")
xmf = doc.xpath("//h4")
title = xme[0].text_content()
location = xmf[0].text_content()
breakdown = xmf[1].text_content().split('|')
line = breakdown
time = line[0]
loc_name = line[1]
try:
loc_area = line[2]
except:
loc_area = 'London'
time = " ".join(time.split())
loc_name = " ".join(loc_name.split())
loc_area = " ".join(loc_area.split())
locstr = loc_name+', '+loc_area+', '+'London'
add = str(locstr)
add = urllib2.quote(add)
geocode_url = "http://maps.googleapis.com/maps/api/geocode/json?address=%s&sensor=false&region=uk" % add
req = urllib2.urlopen(geocode_url)
jsonResponse = json.loads(req.read())
x = json.dumps([s['geometry']['location']['lat'] for s in jsonResponse['results']], indent=0)
x = x.strip('[')
x = x.strip(']')
y = x.strip("'")
x = x.split()
x = x[0]
x = x.strip('[')
x = x.strip(']')
x = x.strip(',')
y = json.dumps([s['geometry']['location']['lng'] for s in jsonResponse['results']], indent=0)
y = y.strip('[')
y = y.strip(']')
y = y.strip("'")
y = y.split()
y = y[0]
y = y.strip('[')
y = y.strip(']')
y = y.strip(',')
spamwriter.writerow([title,time,loc_name,loc_area,x,y])
print title+';'+time+';'+loc_name+';'+loc_area+';'+x+';'+y
except:
invalid +=1

The output looks something like this, with lat/long columns suitable for mapping.


title;time;venue;location;lat;lon
New Year's Eve 2013 Firework Display;Tue Dec 31;EDF Energy London Eye;Waterloo;51.5033;-0.11475
Christmas at Kew Gardens 2013;Until Sat Jan 4 2014;Kew Gardens;Kew, Surrey;51.4782342;-0.2984129
Carnaby Christmas 2013: The Year of the Robin;Until Mon Jan 6 2014;Carnaby Street;Soho;51.5148445;-0.1413416
The Book of Mormon;Until Sat Apr 5 2014;Prince of Wales Theatre;Leicester Square;51.51121389999999;-0.1198244
Coriolanus;Until Sat Feb 8 2014;Donmar Warehouse;Leicester Square;51.51121389999999;-0.1198244
Mojo;Until Sat Feb 8 2014;Harold Pinter Theatre;Trafalgar Square;51.51121389999999;-0.1198244
Winter Lights at Canary Wharf 2013;Tue Dec 31 - Sat Jan 25 2014;Canary Wharf;Docklands;51.50755299999999;-0.024526
......
......
......

I hope this code and examples prove useful for anyone trying to retrieve data from difficult pages on the web where no feeds are available. If you would like to appropriate this code for other webpages, feel free to contact me or try yourself. Quick example:

(example Social Event map using code from this tutorial ; data: TimeOut London, map: MapBox)