
How to scrape data from a website in 10 lines using Beautiful Soup and Python

Have you ever wanted to scrape data from a webpage whose data isn’t available in an open format? It is actually quite straightforward for anyone with a little coding knowledge to retrieve a lot of data with the power of Python and libraries such as Beautiful Soup.

Beautiful Soup is a Python package for parsing HTML and XML documents. It creates a parse tree for parsed pages that can be used to extract data from HTML, which is useful for web scraping. It is available for Python 2.6+ and Python 3.

You can perform research with this data scraped over time, or simply use it for personal use.

Below are two common data sets that everyday people would find useful – Property and Jobs!

Real Estate – Finding sold history of properties

1. Navigate to the webpage – for example http://realestate.com.au/sold

2. Perform a search; for example, here we have searched for all properties sold in Newcastle (you can do this for large areas or for specific streets – it’s up to you).

3. Extract the URL of the search results to see if you can loop over the results:

https://www.realestate.com.au/sold/in-newcastle+-+greater+region,+nsw%3b/list-1

For example, as above, the value ‘1’ returns the first page; if we replace it with ‘2’, we get the second page:

https://www.realestate.com.au/sold/in-newcastle+-+greater+region,+nsw%3b/list-2

This means we can perform a simple loop over the data.

4. Right-click the element you want to retrieve and click Inspect. This shows you which part of the site’s HTML you need to target. For example, by right-clicking a price we can see that prices are stored in a <span> tag with the class ‘property-price’.

Run this script:

from bs4 import BeautifulSoup
import requests

for num in range(1, 21):  # results pages 1 to 20
    url = "http://www.realestate.com.au/sold/in-olivers+hill,+vic+3199%3b/list-" + str(num)
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "html.parser")
    # Prices sit in <span> tags with the 'property-price' class (see step 4)
    for price in soup.find_all("span", {"class": "property-price"}):
        print(price.text)

This will print the prices from the first 20 pages of results. You can extend the script to extract particular features relevant to your property search – such as the number of bedrooms, parking spaces and bathrooms. You can also modify it to save the results to a text file or CSV, or even collect them over time to build a history of property sales by type.
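As a sketch of the CSV idea, here is one way to write scraped prices out with Python’s csv module. It runs against a small in-memory HTML snippet standing in for a fetched page (so no network call is needed); the sample prices are made up for illustration.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical snippet mimicking the markup found in the inspect step;
# on the live site you would use requests to fetch each results page.
html = """
<div>
  <span class="property-price">$750,000</span>
  <span class="property-price">$1,200,000</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Write to an in-memory buffer here; swap in
# open("sold_prices.csv", "w", newline="") to produce a real file.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["price"])
for span in soup.find_all("span", {"class": "property-price"}):
    writer.writerow([span.text.strip()])

csv_text = out.getvalue()
print(csv_text)
```

The same writer calls work unchanged once the snippet is replaced with a page fetched via requests inside the loop above.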

Sites like ‘Inside AirBnB’ perform this kind of scraping exercise (note that, as with AirBnB itself, whether this work is allowed is a grey area):

http://insideairbnb.com/

Jobs! 

We’ve all been there – looking for a new job or seeing how much your skills are worth in the market.

Here’s an example of scraping Indeed.com for jobs data:

from bs4 import BeautifulSoup
import requests

for num in range(0, 2000, 10):  # Indeed paginates results 10 at a time
    url = "http://au.indeed.com/jobs?q=python+data+analytics&l=australia&start=" + str(num)
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "html.parser")
    # Job titles are in <a> tags carrying the data-tn-element="jobTitle" attribute
    for job in soup.find_all("a", {"data-tn-element": "jobTitle"}):
        print(job.text)

This will return jobs matching ‘python data analytics’ in Australia. Have a go at changing it to search by salary, or to retrieve additional information with each result.
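As one way of retrieving additional information, the sketch below pulls a company name alongside each job title. It runs against a small hypothetical snippet of a results page; the ‘row’ and ‘company’ class names are assumptions – inspect the live page to confirm the real attribute names.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet mimicking one search result; the "row" and
# "company" class names are assumed and should be checked against
# the live page with the browser's Inspect tool.
html = """
<div class="row">
  <a data-tn-element="jobTitle">Data Analyst (Python)</a>
  <span class="company">Acme Analytics</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
pairs = []
for row in soup.find_all("div", {"class": "row"}):
    title = row.find("a", {"data-tn-element": "jobTitle"}).text
    company = row.find("span", {"class": "company"}).text
    pairs.append((title, company))
    print(title, "-", company)
```

Working per-result (the outer div) rather than per-tag keeps titles and companies paired correctly even when a listing is missing one of the fields.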

Happy scraping! Always read a website’s terms of service before performing any of the above, and do so at your own risk!

 


Live transport performance dashboards – a light example

Sydney bus performance dashboard

This shows an up-to-date overview of the current performance of Sydney’s bus network, queried every 15 minutes from the Transport for NSW Vehicles API.

The infrastructure behind this is quite simple and powerful, and you can learn 90% of it through this tutorial.
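The query-every-15-minutes part can be sketched as a simple polling loop. The fetch function below is a stub – a real version would call the Transport for NSW vehicle positions endpoint with an API key, which is not covered here.

```python
import time


def poll(fetch, handle, interval_s=900, iterations=None):
    """Call fetch() every interval_s seconds (900 s = 15 minutes) and
    pass each result to handle().

    iterations=None runs forever; passing a number is handy for testing.
    """
    n = 0
    while iterations is None or n < iterations:
        handle(fetch())
        n += 1
        if iterations is None or n < iterations:
            time.sleep(interval_s)


# Demo with a stub fetcher and zero delay; swap the lambda for a
# function that hits the real API and parses its response.
seen = []
poll(lambda: "vehicle positions payload", seen.append, interval_s=0, iterations=2)
```

In production you would typically run something like this under a scheduler (cron, or a cloud function on a timer) rather than a long-lived sleep loop.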

Possible extensions
> Integrate real vs scheduled time
> Historic performance / animations throughout the day

Try:
> See the buses in the most congested traffic routes
> Query individual bus routes
> Analyse the occupancy of the bus network
> Scroll into individual areas to see changes in average km/h speed
> Summarise one of the above variables for the area within field of view
> Screenshot / compare different times of day
> Select one variable (such as ‘Standing Only’) and see the routes ranked by their count of this variable

Bus performance (last 15 minutes)

