
How to scrape data from a website in 10 lines using Beautiful Soup and Python

Have you ever wanted to scrape data from a webpage where no open data set is available? It is actually quite straightforward for anyone with a little coding knowledge to retrieve a lot of data using the power of Python and libraries such as Beautiful Soup.

Beautiful Soup is a Python package for parsing HTML and XML documents. It creates a parse tree for parsed pages that can be used to extract data from HTML, which is useful for web scraping. It is available for Python 2.6+ and Python 3.

You can perform research with this data scraped over time, or simply use it for personal use.

Below are two common data sets that everyday people would find useful – Property and Jobs!

Real Estate – Finding sold history of properties

1) Navigate to the webpage – for example http://realestate.com.au/sold

2) Perform a search; for example, here we have searched for all properties sold in Newcastle (you can do this for large areas or specific streets – it’s up to you).

3) Extract the URL of the search results to see if you can loop over the results:

https://www.realestate.com.au/sold/in-newcastle+-+greater+region,+nsw%3b/list-1

As above, the value ‘1’ at the end of the URL returns the first page; replacing it with ‘2’ gives the second page:

https://www.realestate.com.au/sold/in-newcastle+-+greater+region,+nsw%3b/list-2

This means we can perform a simple loop over the data.
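Before touching Beautiful Soup at all, we can sketch that loop on its own – here generating the first 20 page URLs from the Newcastle search above (no requests are made yet; this just shows the pattern):

```python
# Build the URL for each page of sold-listing results by swapping the
# trailing page number. The search path is the Newcastle example above.
base = "https://www.realestate.com.au/sold/in-newcastle+-+greater+region,+nsw%3b/list-{}"

page_urls = [base.format(page) for page in range(1, 21)]  # pages 1 to 20

print(page_urls[0])   # first page of results
print(page_urls[-1])  # twentieth page
```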

4) Right-click the element you want to retrieve and click Inspect. This shows you which part of the site’s HTML holds the data. For example, by right-clicking a price we can see that prices are stored in a <span> tag with the class ‘property-price’.
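To see how Beautiful Soup picks those elements out, here is a minimal, self-contained example. The HTML fragment is hand-written to match the markup described above – it is illustrative only, not copied from the real page:

```python
from bs4 import BeautifulSoup

# An illustrative fragment shaped like the markup described above:
# prices sit in <span> tags with the class 'property-price'.
html = """
<div class="listing">
  <span class="property-price">$750,000</span>
  <span class="property-price">$1,200,000</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
prices = [span.text for span in soup.find_all("span", {"class": "property-price"})]
print(prices)
```

The same `find_all` call is what the full script below relies on, just pointed at live pages instead of a string.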

Run this script:

from bs4 import BeautifulSoup
import requests

for num in range(1, 21):
    url = "http://www.realestate.com.au/sold/in-olivers+hill,+vic+3199%3b/list-" + str(num)
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "html.parser")
    mydivs = soup.find_all("span", {"class": "property-price"})
    for line in mydivs:
        print(line.text)

This will print the prices from the first 20 pages of results. You can extend this to extract particular features relevant to your property search – such as the number of bedrooms, parking spaces and bathrooms. You can also modify it to save the results to a text file or CSV, or collect them over time to build a history of property sales by type.
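Saving to CSV is a small change using Python’s built-in csv module. A sketch of the idea, again using an illustrative fragment rather than a live request – the ‘property-beds’ class here is made up to stand in for whatever extra feature you inspect on the real page:

```python
import csv
from bs4 import BeautifulSoup

# Illustrative fragment using the 'property-price' class from the script
# above, plus a made-up 'property-beds' class standing in for an extra
# feature you would find by inspecting the real page.
html = """
<span class="property-price">$650,000</span><span class="property-beds">3</span>
<span class="property-price">$820,000</span><span class="property-beds">4</span>
"""

soup = BeautifulSoup(html, "html.parser")
prices = [s.text for s in soup.find_all("span", {"class": "property-price"})]
beds = [s.text for s in soup.find_all("span", {"class": "property-beds"})]

# Write one row per property so repeated runs build up a spreadsheet-ready history.
with open("sold_properties.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["price", "bedrooms"])
    writer.writerows(zip(prices, beds))
```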

Sites like ‘Inside AirBnB’ perform these kinds of scraping exercises (note that, as with scraping AirBnB itself, whether this work is permitted is a grey area):

http://insideairbnb.com/

Jobs! 

We’ve all been there – looking for a new job, or wondering how much your skills are worth on the market.

Here’s an example of scraping Indeed.com for jobs data:

from bs4 import BeautifulSoup
import requests

for num in range(0, 2000, 10):
    url = "http://au.indeed.com/jobs?q=python+data+analytics&l=australia&start=" + str(num)
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "html.parser")
    mydivs = soup.find_all("a", {"data-tn-element": "jobTitle"})
    for line in mydivs:
        print(line.text)

This will return all jobs related to ‘python data analytics’ in Australia. Have a go at changing it to search by salary, or to retrieve additional information when printed.
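Retrieving extra information is mostly a matter of reading other attributes off the same matched tags. A sketch, using an illustrative fragment shaped like the job links the script above matches on (the data-tn-element attribute comes from the Indeed example; the href values here are made up):

```python
from bs4 import BeautifulSoup

# Illustrative fragment shaped like the anchors the Indeed script matches;
# the href values are made up for demonstration.
html = """
<a data-tn-element="jobTitle" href="/rc/clk?jk=abc123">Data Analyst</a>
<a data-tn-element="jobTitle" href="/rc/clk?jk=def456">Python Developer</a>
"""

soup = BeautifulSoup(html, "html.parser")
for link in soup.find_all("a", {"data-tn-element": "jobTitle"}):
    # .text gives the visible job title; ["href"] gives the relative
    # link to the posting, which you could save alongside the title.
    print(link.text, "->", link["href"])
```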

Happy scraping! Always make sure you read a website’s terms of service before doing any of the above, and do so at your own risk!

 
