
Creating spatial data analytics dashboards in Cartoframes

With Carto’s strength in spatial science and location intelligence, and the easy access to data science packages in Python, Carto’s new project ‘Cartoframes’ has a lot of potential to provide excellent mapping dashboards for data-hungry workflows.

Below is a quick tutorial I have made which will hopefully help new users figure out how to use it. It is in no way comprehensive, and there are probably some pieces missing; but it should be enough to get you started! The tutorial covers some of the elements of creating a ‘live’ weather data dashboard for New South Wales in Australia.

What is Cartoframes? (from https://github.com/CartoDB/cartoframes)
A Python package for integrating CARTO maps, analysis, and data services into data science workflows.

Python data analysis workflows often rely on the de facto standards pandas and Jupyter notebooks. Integrating CARTO into this workflow saves data scientists time and energy by not having to export datasets as files or retain multiple copies of the data. Instead, CARTOframes give the ability to communicate reproducible analysis while providing the ability to gain from CARTO’s services like hosted, dynamic or static maps and Data Observatory augmentation.


Write pandas DataFrames to CARTO tables
Read CARTO tables and queries into pandas DataFrames
Create customizable, interactive CARTO maps in a Jupyter notebook
Interact with CARTO’s Data Observatory
Use CARTO’s spatially-enabled database for analysis

Step 1 – Install libraries
Install all of the relevant libraries. I’m using Canopy, which provides Python 2.7 and 3.5, with easy installation and updates, via a graphical package manager, of over 450 pre-built and tested scientific and analytic Python packages from the Enthought Python Distribution. These include NumPy, Pandas, SciPy, matplotlib, scikit-learn, and Jupyter / IPython. You can get Canopy for free here.

Once installed, open the console and install the packages:
pip install cartoframes
pip install pandas

Step 2 – Import libraries

In a new Jupyter notebook, start by importing the libraries in the first block. These are the ones you’ll generally need (though you can go to town with other numerical / statistical packages here!):

import cartoframes
import pandas as pd
import numpy as np

Step 3 – Set up a Carto account and register for an API key

Start by going to Carto.com and signing up through the prompts.

Once you have signed up, in the top-right of your home page there should be a settings toggle which shows you:

View your public profile
Your account
Your API keys
Close session

Click on ‘Your API keys’ and copy what shows up on the next page. It should be a long string of text, looking something like this:


Step 4 – Connecting to your Carto account in Python
Try the following line of code in your next Jupyter code block, where xxxxxxxxxxxx is your new API key. This key allows Cartoframes to communicate directly with the data in your Carto account.

Where it says ‘oclock’ you should put your own username.

cc = cartoframes.CartoContext(base_url='https://oclock.carto.com',api_key='xxxxxxxxxxxx')

When you run this code and call ‘cc’ it should provide you with a message such as this:
cartoframes.context.CartoContext at 0x1ea3fa2c518

This means that cartoframes has successfully accessed your Carto account and you can call ‘cc’ to reference accessing this account from now on. Make sure you keep your API key safe!
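One way to keep the key safe is to read it from an environment variable rather than pasting it into a notebook you might share. A small sketch, where CARTO_API_KEY is just a variable name I have chosen for illustration (it is not part of cartoframes):

```python
import os

# 'CARTO_API_KEY' is an illustrative name; set it in your shell
# (e.g. export CARTO_API_KEY=...) before launching Jupyter
api_key = os.environ.get('CARTO_API_KEY', 'xxxxxxxxxxxx')

# then, exactly as above:
# cc = cartoframes.CartoContext(base_url='https://oclock.carto.com', api_key=api_key)
```

This way the notebook can be committed or shared without ever exposing the key itself.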

Step 5 – Upload some data to Carto
For this task, I downloaded the shapefile components of weather locations from the Australian Bureau of Meteorology. This is all of the spatial files (.shp, .shx, .dbf etc.) for IDM00013 from:

These are all the files prefixed by IDM00013 and suffixed by .dbf, .prj, .sbn, .sbx, .shp, .shx, .shp.xml. Carto will need them all in a single .zip file before you upload them.

The metadata for this dataset can be found here:

IDM00013 – point places (precis, fire, marine)

Once you have downloaded these you can upload the shapefile and it should give you a series of geolocated dots covering all of Australia, with many attributes as described in the metadata above. For this I called the dataset ‘idm00013’.

Step 6 – Read the data in Jupyter
Let’s test if everything is working. The following should display a dataframe of all of the aspatial information stored in each weather location:

carto_df = cc.read('idm00013')

The following should give you a list of all of the variables available to you to access and change:
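(The snippet for this step appears to have gone missing from the post; a minimal guess, using pandas’ standard columns attribute, shown here against a tiny stand-in frame. In your notebook, call it on the carto_df returned above.)

```python
import pandas as pd

# Tiny stand-in for the DataFrame returned by cc.read('idm00013') above;
# these two column names are illustrative, not the real schema
carto_df = pd.DataFrame({'aac': ['NSW_PT131'], 'elevation': [39.0]})

# On the real table this lists every attribute field of the weather locations
print(list(carto_df.columns))
```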

Step 7 – Making a map
Now for the exciting bit – creating a Carto map inside the Jupyter notebook.
Here I’ve picked the elevation column with a brown colour scheme, try:

from cartoframes import Layer, BaseMap, styling
cc.map(layers=[BaseMap('light'),
               Layer('idm00013',
                     color={'column': 'elevation', 'scheme': styling.brwnYl(7)},
                     size=5)])

The following map should display, with light brown showing weather points at low elevation, and higher points shown in darker brown.

Extension – Accessing and parsing a live data feed

The code below retrieves the latest weather forecasts for the weekend ahead from the Bureau of Meteorology’s FTP feed and stores them in a dataframe ‘df’.

import xml.etree.ElementTree as ET
import pandas as pd
import urllib.request

req = urllib.request.Request('ftp://ftp.bom.gov.au/anon/gen/fwo/IDN11060.xml')
with urllib.request.urlopen(req) as response:
    xml_data = response.read()

list_dict = []
root = ET.XML(xml_data)
for forecast_el in root.findall('forecast'):
    for area in forecast_el:
        for forecast in area:
            min_temp = ''
            max_temp = ''
            aac_id = area.get('aac')
            forecast_date = forecast.get('start-time-local')
            # each child element carries its meaning in the 'type' attribute;
            # .get() avoids a KeyError on elements without one
            for el in forecast:
                if el.get('type') == 'air_temperature_minimum':
                    min_temp = el.text
                elif el.get('type') == 'air_temperature_maximum':
                    max_temp = el.text
            list_dict.append({'aac': aac_id, 'forecast_date': forecast_date,
                              'low_temp': min_temp, 'max_temp': max_temp})
df = pd.DataFrame(list_dict)
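To see why the loops key off the ‘type’ attribute, here is the same access pattern run on a tiny hand-written fragment that mimics, in simplified and made-up form, the shape of the BOM feed:

```python
import xml.etree.ElementTree as ET

# Simplified, made-up fragment in the shape of the BOM precis XML
xml_data = """
<product>
  <forecast>
    <area aac="NSW_PT131">
      <forecast-period start-time-local="2018-01-16T00:00:00+11:00">
        <element type="air_temperature_minimum">18</element>
        <element type="air_temperature_maximum">29</element>
      </forecast-period>
    </area>
  </forecast>
</product>
"""

root = ET.XML(xml_data)
area = root.find('forecast/area')
period = area.find('forecast-period')
# Each <element> is distinguished only by its 'type' attribute
for el in period:
    print(el.get('type'), el.text)
```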

Extension Part 1 – Joining in a live data source

We now want to join the geographical data from the first exercise with this live data feed.
This is done with a ‘left’ join, so we keep all of the weather forecast records and add the geographic data to them.

merged_data = pd.merge(df, carto_df, on='aac', how='left')
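If join types are unfamiliar, here is the behaviour on two made-up frames (the ‘aac’ values are invented): a left join keeps every forecast row even when no geography matches it, whereas pandas’ default inner join would silently drop the unmatched row.

```python
import pandas as pd

# Made-up frames: two forecast rows, but geography for only one of them
forecasts = pd.DataFrame({'aac': ['NSW_PT131', 'NSW_PT999'],
                          'max_temp': [29, 31]})
places = pd.DataFrame({'aac': ['NSW_PT131'],
                       'elevation': [39.0]})

# how='left' keeps all forecast rows; the unmatched one gets NaN geography
merged = pd.merge(forecasts, places, on='aac', how='left')
print(merged)
```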

Extension Part 2 – Selecting some data

Now we filter the records to get one particular day’s forecast (you will need to change the date here to the current date).
The filtered data is then written to a new dataset in Carto called ‘merged_weathermap’.

one_forecast = merged_data[merged_data['forecast_date']=='2018-01-16T00:00:00+11:00']
cc.write(one_forecast, 'merged_weathermap',overwrite=True)

Extension Part 3 – Putting it all together

Now let’s add the data from the weather feed API to a Cartoframes map. The following reads in the merged_weathermap dataset we just created and colours each weather point in New South Wales by its maximum forecast temperature: pink for high temperatures, blue for lower temperatures.

from cartoframes import Layer, BaseMap, styling
cc.map(layers=[BaseMap('light'),
               Layer('merged_weathermap',
                     color={'column': 'max_temp', 'scheme': styling.tropic(10)},
                     size=10)])

That’s it! With a bit of extra work, such as scripts that continuously poll the APIs, we are only a few steps away from creating live dashboards which integrate other statistical and mathematical packages, including machine learning.

Looking forward to seeing developments in this space and if you have any feedback or ideas let me know!

For more information on Cartoframes have a look at their documentation.


Basic plotting / visualisation in Python

This is a very quick introduction to graphing / visualising in Python, done for the Sydney Python course at Arup.

Arup Sydney Python Club – Session 8


Today’s demand for data visualisation tools goes beyond the standard charts and graphs used in Excel spreadsheets, displaying data in more sophisticated ways such as infographics, dials and gauges, geographic maps, heat maps, and detailed bar, pie and rose charts. This demand extends to visualisations with interactive capabilities: there is an expectation that we can create visualisations that allow manipulating and drilling into the data for querying and analysis.

Using scripting / programmatic techniques for our data analysis and visualisation means we can surpass a number of barriers found in conventional software packages (again, Excel), especially when faced with issues of speed, display, customisation and data size. Python has several options for creating both static and interactive data analysis/visualisation environments.

The Python scientific stack is fairly mature, and there are libraries for a variety of use cases, including machine learning (http://scikit-learn.org/stable/), network analysis (https://networkx.github.io/) and data analysis (http://pandas.pydata.org/).

Many new Python data visualization libraries have been created in the past few years to close the gap. This class will focus on using ‘matplotlib’, one of the many libraries available. Some alternate libraries will be explained below.

What is Matplotlib?

Simply put, matplotlib is a graphing library for Python. It has an array of tools that you can use to create anything from simple scatter plots, to sine curves, to 3D graphs. It is used heavily in the scientific Python community for data visualisation. It is designed to closely resemble MATLAB (which can be found in many engineering courses).
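As a quick taste of how little code a plot takes, here is the sine-curve case just mentioned, as a minimal sketch (saved to a file name of my choosing, ‘sine.png’, and using the Agg backend so it also runs outside Jupyter):

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # headless backend; unnecessary inside a Jupyter notebook
import matplotlib.pyplot as plt

# 200 evenly spaced x values across one full period
x = np.linspace(0, 2 * np.pi, 200)
y = np.sin(x)

plt.plot(x, y)
plt.title('A simple sine curve')
plt.savefig('sine.png')
```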

You can read more about the ideas behind matplotlib on their website, but I especially recommend taking a look at their gallery to see the amazing things you can pull off with this library.


What else is out there?

Some of the many alternatives to matplotlib include seaborn, ggplot and plotly.

Seaborn: The main difference between matplotlib and seaborn is seaborn’s default styles and palettes, which are more aesthetically pleasing and modern. Seaborn is built on top of matplotlib, so it doesn’t hurt to learn matplotlib first.

ggplot: ggplot is based on ggplot2, from the statistical language R, and on concepts from ‘The Grammar of Graphics’. The outputs are much easier to create and much better looking, but it isn’t designed for high levels of customisation. It sacrifices complexity for a simple method of plotting (kind of like iOS vs Android). One of its advantages is its close integration with ‘pandas’ – a popular data analysis library. If you are familiar with R or SQL it might be useful to look at this.

Plotly: Plotly is an online platform for data visualisation, and within Python one can create interactive graphs and charts. Plotly’s forte is making interactive plots, and it offers some charts you won’t find in most libraries, like contour plots, dendrograms, and 3D charts. If you are thinking of making something interactive or for the web with your Python code, this would be worth investigating.

Some examples

In [8]:
# Bus network analysis with a San Francisco 7 terabyte data set spanning ~2008 - 2014.
# The extracts shown here are from a tool that allows quick graphing and mapping /
# visualisation of any time span, at stop, route and network level, for the very large
# dataset generated from the bus positioning and passenger counting system.
# See: http://pandas.pydata.org/pandas-docs/stable/visualization.html

In [74]:
# This combines the bus data with the analytical package 'networkx' to study the network
# according to graph theory, generating visualisations of how network stops perform over
# a specified time period.
# See: https://networkx.github.io/

What makes a good data visualisation?

In his 1983 book The Visual Display of Quantitative Information, Edward Tufte defines ‘graphical displays’ and principles for effective graphical display. Many of these ideas have stuck in the industry, and the principles below remain a staple introduction to data visualisation.

A good data visualisation should:

  1. Show the data
  2. Provoke thought about the subject at hand
  3. Avoid distorting the data
  4. Present many numbers in a small space
  5. Make large datasets coherent
  6. Encourage eyes to compare data (Interactivity)
  7. Reveal data at several levels of detail
  8. Serve a reasonably clear purpose
  9. Be closely integrated with statistical and verbal descriptions of the dataset

Section 1 – Line Graphs

1A) Nailing the basic line graph

Start by importing these Python modules

In [3]:
import matplotlib.pyplot as plt 
import numpy as np
%matplotlib inline

Generate some data

For example, here we have given some values to the variable x in a list, and values to the variable y. The order is important here!
In [76]:
x = [1,2,3,4,5,6,7,8,9,10]
y = [1,2,4,8,16,32,64,128,256,512]

Plot a graph

In [77]:
#plot with all default settings
plt.plot(x, y)
In [78]:
#flip x and y axes
plt.plot(y, x)
In [79]:
# red dashed line
plt.plot(x, y, 'r--')

Scatter plot

In [80]:
#a line plot can easily become a scatter plot by changing markers
#plot with blue markers
plt.plot(x, y, 'bo')
In [81]:
#Two datasets, same plot
x1 = [1,2,3,4,5,6,7,8,9,10]
y1 = [1,2,4,8,16,32,64,128,256,512]
x2 = [1,2,3,4,5,6,7,8,9,10,11]
y2 = [10,20,40,89,160,300,640,450,500,510,700]
plt.plot(x1, y1, x2, y2)
In [82]:
#More control: plotting more than one dataset with alternate styling
x1 = [1,2,3,4,5,6,7,8,9,10,11]
y1 = [1,2,4,8,16,32,64,128,256,512,1024]
x2 = [1,2,3,4,5,6,7,8,9,10,11]
y2 = [10,20,40,89,160,300,640,450,500,510,700]
plt.plot(x1, y1, 'go-', label='Expected', linewidth=2)
plt.plot(x2, y2, 'rs',  label='Sampled')
plt.axis([0, 12, 0, 800])
plt.legend()
In [1]:
#running this command will give you documentation on plot styles etc
help(plt.plot)

Bar chart

In [4]:
#example bar chart with positive and negative values

import numpy as np
import matplotlib.pyplot as plt

n = 12

X = np.arange(n)

Y1 = (1-X/float(n)) * np.random.uniform(0.5,1.0,n)
Y2 = (1-X/float(n)) * np.random.uniform(0.5,1.0,n)

plt.bar(X, +Y1, facecolor='#9999ff', edgecolor='white')
plt.bar(X, -Y2, facecolor='#ff9999', edgecolor='white')


1B) Tidying / extending our graphs

Now we can extend this further by looking at some other aspects of a chart, with more sophisticated data

Load in some data

In [84]:
#Create a range of years
x = range(1940,2016)
#Create a list of a fluctuating random value the same length of that date range
y = np.random.randn(len(x)).cumsum()
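The cumsum call is what turns independent random draws into a smoothly wandering series: each value is the running total of all the noise before it. On a fixed set of steps, so the result is easy to check:

```python
import numpy as np

# cumsum turns independent steps into a running total (a 'random walk')
steps = np.array([1.0, -0.5, 2.0, -1.0])
walk = steps.cumsum()
print(list(walk))  # [1.0, 0.5, 2.5, 1.5]
```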

Plot the data

In [85]:
plt.plot(x, y)

Add titles / legends / symbology

In [86]:
plt.plot(x, y, label='Simulated temperature')

plt.title('Fake Global Sea Temperature over Time')

plt.ylabel('Sea Temperature')

plt.figtext(0.995, 0.01, 'Data Source: Bureau of Astrology, 2016', ha='right', va='bottom')

plt.legend(loc='best', framealpha=0.5,  prop={'size':'small'})

plt.gcf().set_size_inches(8, 4)


After your previous exercises with similar weather station data, you have become a well-known international expert.

You have been requested to produce the following visualisations.

The data can be found here:


1) Produce an image of all of the temperature or rainfall values (regardless of geographic data)

2) Produce an image that explains the potential relationship between latitude and rainfall

3) Produce an image which shows the countries in their geographic location (using supplied latitude/longitude field)

4) Produce the previous image this time with the plot points coloured or scaled by the temperature attribute

5) Extension – Visualise these values on a world map basemap

6) Extension – Visualise these values as a choropleth map (i.e. a map that shows country shapes rather than just points)

Reading CSVs – https://docs.python.org/2/library/csv.html

Extracting column from CSV – https://www.raspberrypi.org/learning/astro-pi-flight-data-analysis/graphing/ (read 1 & 2 here)

Styling data – run help(plt)
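As a starting point for the exercises, here is a minimal sketch. The CSV layout, its column names (‘country’, ‘latitude’, ‘longitude’, ‘temperature’), the sample rows and the output file name are all my assumptions for illustration, not the real dataset:

```python
import csv
import io
import matplotlib
matplotlib.use('Agg')  # headless backend so this runs outside Jupyter
import matplotlib.pyplot as plt

# Made-up sample rows standing in for the real weather-station CSV
sample = io.StringIO(
    "country,latitude,longitude,temperature\n"
    "Australia,-33.9,151.2,22.1\n"
    "Iceland,64.1,-21.9,4.3\n"
    "Kenya,-1.3,36.8,24.8\n"
)
rows = list(csv.DictReader(sample))
lons = [float(r['longitude']) for r in rows]
lats = [float(r['latitude']) for r in rows]
temps = [float(r['temperature']) for r in rows]

# Exercises 3 and 4: plot stations in their geographic location,
# coloured by the temperature attribute
plt.scatter(lons, lats, c=temps, cmap='coolwarm')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
cb = plt.colorbar()
cb.set_label('Temperature')
plt.savefig('stations.png')
```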

Any questions, please email oliver.lock@arup.com! :-)


Data mining with XPath


At the recent CASA Hackathon I was part of a team, led by Steven Gray, that developed a Social Radar (originally planned as a Google Glass app), retrieving data from Twitter and producing maps. Steve has produced an ergonomic and very useful presentation on how to retrieve these feeds here.

While working on this project I thought that it would be useful for the radar to display what events are currently on in the city.

One useful way to retrieve data when there is no API, or you have no time to learn an API, is by using XPath.

In my most simple definition, XPath is a way to retrieve specified nodes within XML (or simply HTML) pages.

For example, if you have the following code:

<div class="name">
  <firstname>John</firstname>
  <surname>Smith</surname>
</div>

You could query the ‘firstname’ element and it would return ‘John’.
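That query can be reproduced in Python with the standard library’s xml.etree.ElementTree, which understands a useful subset of XPath (the lxml package, used further below, supports the full language):

```python
import xml.etree.ElementTree as ET

html = '<div class="name"><firstname>John</firstname><surname>Smith</surname></div>'
root = ET.fromstring(html)

# XPath-style query for the 'firstname' element anywhere under the root
first = root.find('.//firstname')
print(first.text)  # John
```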

One can write a script in a language such as Python to automatically retrieve data from webpages, to either display it live or collect it over time.

Google Drive have made it very simple to undertake such tasks, and here I will post a short tutorial.

Using Google Drive

1. Create a Spreadsheet in Google Drive
2. In Sheet1 fill the first two column headings (A1, B1) with ‘Source’ and ‘XPath’ respectively.
3. In the Source column (A2), put in the URL of a website with some data you would like to retrieve (e.g. TimeOut Events : http://www.timeout.com/london/search?_source=global&profile=london&_dd=&page_zone=events&keyword=&section=events&on=today&locationText=&_section_search=events )
4. In the XPath column, place the XPath (e.g. //div[@class=’topSection’]/h3 ). This is the hardest bit – reading this might help.
5. Create Sheet2, and use the IMPORTXML function to run the XPath query, e.g. =IMPORTXML(Sheet1!A2, Sheet1!B2)
6. Once you have tried a few of these feeds you can publish the document to the web, or geocode the results if you have chosen a dataset with addresses (such as this TimeOut dataset). Geo for Google Docs is quite useful for this, particularly if you use TileMill/Mapbox. Google Fusion tables also has good mapping capabilities.

Click here for an example of one of these documents.

Using Python

One of the limitations of the method mentioned above is it is limited to a maximum of 50 XPath queries (last time I checked) per document. If you are interested in harvesting a large dataset (for example, parsing through a real estate website for all houses currently on sale in a city, or collecting them continuously over time) one can utilise XPath in Python. The following script was produced at the event to retrieve data from TimeOut, clean the data, Geocode it and place it in a CSV file.

import lxml.html
import urllib
import urllib2
import json
import csv

with open('timeout-london.csv', 'wb') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter=';',
                            quotechar='"', quoting=csv.QUOTE_MINIMAL)
    spamwriter.writerow(['title', 'time', 'venue', 'location', 'lat', 'lon'])
    print 'title;time;venue;location;lat;lon'
    for num in range(1, 500):
        s1 = "http://www.timeout.com/london/search?"
        s2 = "language=en_GB&profile=london&order=popularity&page=" + str(num) + "&page_size=1&source=admin&type=event&on=today&_section_search=&section="
        page = urllib.urlopen(s1 + s2)
        doc = lxml.html.document_fromstring(page.read())
        xme = doc.xpath("//h3/a")   # event title links
        xmf = doc.xpath("//h4")     # venue / time / location breakdown
        if not xme or len(xmf) < 2:
            break  # no more results
        title = xme[0].text_content()
        breakdown = xmf[1].text_content().split('|')
        # collapse runs of whitespace in each field
        time = " ".join(breakdown[0].split())
        loc_name = " ".join(breakdown[1].split())
        loc_area = " ".join(breakdown[2].split())
        # geocode the venue via the Google Maps geocoding API
        add = urllib2.quote(loc_name + ', ' + loc_area + ', London')
        geocode_url = "http://maps.googleapis.com/maps/api/geocode/json?address=%s&sensor=false&region=uk" % add
        req = urllib2.urlopen(geocode_url)
        jsonResponse = json.loads(req.read())
        if not jsonResponse['results']:
            continue  # skip events that fail to geocode
        location = jsonResponse['results'][0]['geometry']['location']
        lat, lon = location['lat'], location['lng']
        print title + ';' + time + ';' + loc_name + ';' + loc_area + ';' + str(lat) + ';' + str(lon)
        spamwriter.writerow([title, time, loc_name, loc_area, lat, lon])

The output looks something like this, with lat/long columns suitable for mapping.

New Year's Eve 2013 Firework Display;Tue Dec 31;EDF Energy London Eye;Waterloo;51.5033;-0.11475
Christmas at Kew Gardens 2013;Until Sat Jan 4 2014;Kew Gardens;Kew, Surrey;51.4782342;-0.2984129
Carnaby Christmas 2013: The Year of the Robin;Until Mon Jan 6 2014;Carnaby Street;Soho;51.5148445;-0.1413416
The Book of Mormon;Until Sat Apr 5 2014;Prince of Wales Theatre;Leicester Square;51.51121389999999;-0.1198244
Coriolanus;Until Sat Feb 8 2014;Donmar Warehouse;Leicester Square;51.51121389999999;-0.1198244
Mojo;Until Sat Feb 8 2014;Harold Pinter Theatre;Trafalgar Square;51.51121389999999;-0.1198244
Winter Lights at Canary Wharf 2013;Tue Dec 31 - Sat Jan 25 2014;Canary Wharf;Docklands;51.50755299999999;-0.024526

I hope this code and these examples prove useful for anyone trying to retrieve data from difficult pages on the web where no feeds are available. If you would like to adapt this code for other webpages, feel free to contact me or try it yourself. Quick example:

(example Social Event map using code from this tutorial; data: TimeOut London, map: MapBox)