
Overview of Python web scraping tools


Maik Röder
Barcelona Python Meetup Group
17.05.2012
Friday, May 18, 2012
Data Scraping

Automated Process

Explore and download pages

Grab content

Store in a database or in a text file


urlparse

Manipulate URL strings


urlparse.urlparse()
urlparse.urljoin()
urlparse.urlunparse()
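A minimal sketch of these three helpers; note that in Python 3 the `urlparse` module was renamed to `urllib.parse`, with the same function names:

```python
# Python 3: the old urlparse module is now urllib.parse.
from urllib.parse import urlparse, urljoin, urlunparse

# Split a URL into its six components.
parts = urlparse('http://www.wunderground.com/history/?q=BCN')
print(parts.netloc)   # 'www.wunderground.com'
print(parts.path)     # '/history/'

# Resolve a relative link against a base page, as a scraper would.
print(urljoin('http://example.com/a/b.html', 'c.html'))
# 'http://example.com/a/c.html'

# Reassemble the components into a URL string.
print(urlunparse(parts))  # 'http://www.wunderground.com/history/?q=BCN'
```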
urllib

Download data through different protocols

HTTP, FTP, ...


urllib.urlencode()
urllib.urlopen()
urllib.urlretrieve()
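In Python 3, `urllib` was split: `urlopen()` and `urlretrieve()` moved to `urllib.request`, and the parsing helpers to `urllib.parse`. A small sketch using a `data:` URL so it runs without a network connection (supported by `urlopen` since Python 3.4):

```python
# Python 3 location of urlopen(); in Python 2 it was urllib.urlopen().
from urllib.request import urlopen

# A data: URL exercises urlopen() without touching the network.
with urlopen('data:,Hello') as f:
    body = f.read()
print(body)  # b'Hello'
```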
Scrape a web site

Example: http://www.wunderground.com/
Preparation
>>> from StringIO import StringIO
>>> from urllib2 import urlopen
>>> f = urlopen('http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html')
>>> p = f.read()
>>> d = StringIO(p)
>>> f.close()
BeautifulSoup

HTML/XML parser

designed for quick turnaround projects like screen-scraping

http://www.crummy.com/software/BeautifulSoup
BeautifulSoup
from BeautifulSoup import *
a = BeautifulSoup(d).findAll('a')
[x['href'] for x in a]
Faster BeautifulSoup
from BeautifulSoup import *
p = SoupStrainer('a')
a = BeautifulSoup(d, parseOnlyThese=p)
[x['href'] for x in a]
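The same strainer idea in today's bs4 package (BeautifulSoup 4 is an assumption here; the slides use the old BeautifulSoup 3 API, where the keyword was `parseOnlyThese` — in bs4 it is `parse_only`):

```python
from bs4 import BeautifulSoup, SoupStrainer  # assumes the bs4 package

html = '<p><a href="/a">A</a><a href="/b">B</a></p>'

# SoupStrainer tells the parser to build only <a> tags,
# which is faster than parsing the whole document.
only_a = SoupStrainer('a')
soup = BeautifulSoup(html, 'html.parser', parse_only=only_a)

links = [a['href'] for a in soup.find_all('a')]
print(links)  # ['/a', '/b']
```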
Inspect the Element

Inspect the Maximum temperature


Find the node
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup(d)
>>> attrs = {'class': 'nobr'}
>>> nobrs = soup.findAll(attrs=attrs)
>>> temperature = nobrs[3].span.string
>>> print temperature
23
htmllib.HTMLParser

Interesting only for historical reasons

based on sgmllib
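For reference, the event-driven style survives in the standard library: Python 3 dropped `htmllib` and `sgmllib`, but `html.parser.HTMLParser` works the same way. A small sketch that collects link targets:

```python
from html.parser import HTMLParser  # Python 3 stdlib

class LinkParser(HTMLParser):
    """Collect href attributes from <a> start tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        if tag == 'a':
            self.links.extend(v for k, v in attrs if k == 'href')

p = LinkParser()
p.feed('<a href="/x">x</a><a href="/y">y</a>')
print(p.links)  # ['/x', '/y']
```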
html5lib

Using the custom simpletree format

a built-in DOM-ish tree type (pythonic idioms)


from html5lib import parse
from html5lib import treebuilders
e = treebuilders.simpletree.Element
i = parse(d)
a = [x for x in i if isinstance(x, e) and x.name == 'a']
[x.attributes['href'] for x in a]
lxml

Library for processing XML and HTML

Based on C libraries
sudo aptitude install libxml2-dev
sudo aptitude install libxslt-dev

Extends the ElementTree API

e.g. with XPath


lxml
from lxml import etree
t = etree.parse('t.xml')
for node in t.xpath('//a'):
    node.tag
    node.get('href')
    node.items()
    node.text
    node.getparent()
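The same API sketched with an inline fragment instead of a `t.xml` file on disk (assumes lxml is installed; note the method is `getparent()`, all lowercase):

```python
from lxml import etree  # assumes the lxml package is installed

# Parse an inline fragment rather than a file.
doc = etree.fromstring('<p><a href="/a">first</a><a href="/b">second</a></p>')

# XPath support comes from the underlying libxml2 library.
for node in doc.xpath('//a'):
    print(node.tag, node.get('href'), node.text)

# Navigate upward to the enclosing element.
print(doc.xpath('//a')[0].getparent().tag)  # p
```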
twill

Simple

No JavaScript

http://twill.idyll.org

Some more interesting concepts

Pages, Scenarios

State Machines
twill

Commonly used methods:


go()
code()
show()
showforms()
formvalue() (or fv())
submit()
Twill
>>> from twill import commands as twill
>>> from twill import get_browser
>>> twill.go('http://www.google.com')
>>> twill.showforms()
>>> twill.formvalue(1, 'q', 'Python')
>>> twill.showforms()
>>> twill.submit()
>>> get_browser().get_html()
Twill - acknowledge_equiv_refresh
>>> twill.go("http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html")
...
twill.errors.TwillException: infinite refresh loop discovered; aborting.
Try turning off acknowledge_equiv_refresh...
Twill
>>> twill.config("acknowledge_equiv_refresh", "false")
>>> twill.go("http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html")
==> at http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html
'http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html'
mechanize

Stateful programmatic web browsing

navigation history

HTML form state

cookies

ftp:, http: and file: URL schemes

redirections

proxies

Basic and Digest HTTP authentication


mechanize - robots.txt
>>> import mechanize
>>> browser = mechanize.Browser()
>>> browser.open('http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html')
mechanize._response.httperror_seek_wrapper: HTTP Error 403: request disallowed by robots.txt
mechanize - robots.txt

Do not handle robots.txt


browser.set_handle_robots(False)

Do not handle equiv


browser.set_handle_equiv(False)
browser.open('http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html')
Selenium

http://seleniumhq.org

Support for JavaScript


Selenium
from selenium import webdriver
from selenium.common.exceptions \
import NoSuchElementException
from selenium.webdriver.common.keys \
import Keys
import time
Selenium
>>> browser = webdriver.Firefox()
>>> browser.get("http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html")
>>> a = browser.find_element_by_xpath("(//span[contains(@class,'nobr')])[position()=2]/span").text
>>> browser.close()
>>> print a
23
PhantomJS

http://www.phantomjs.org/
