
Overview of Python web scraping tools


Maik Röder
Barcelona Python Meetup Group
17.05.2012
Friday, May 18, 2012
Data Scraping

Automated Process

Explore and download pages

Grab content

Store in a database or in a text file


urlparse

Manipulate URL strings


urlparse.urlparse()
urlparse.urljoin()
urlparse.urlunparse()
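A minimal sketch of these three helpers; note that in Python 3 the `urlparse` module was renamed to `urllib.parse`, with the same function names:

```python
# Python 3: the old urlparse module is now urllib.parse.
from urllib.parse import urlparse, urljoin, urlunparse

# Split a URL into its six components.
parts = urlparse('http://www.wunderground.com/history/?q=BCN')
print(parts.netloc)   # 'www.wunderground.com'
print(parts.path)     # '/history/'

# Resolve a relative link against a base page, as a scraper would.
print(urljoin('http://example.com/a/b.html', 'c.html'))
# 'http://example.com/a/c.html'

# Reassemble the components into a URL string.
print(urlunparse(parts))  # 'http://www.wunderground.com/history/?q=BCN'
```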
urllib

Download data through different protocols

HTTP, FTP, ...


urllib.urlencode()
urllib.urlopen()
urllib.urlretrieve()
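In Python 3, `urllib` was split: `urlopen()` and `urlretrieve()` moved to `urllib.request`, and the parsing helpers to `urllib.parse`. A small sketch using a `data:` URL so it runs without a network connection (supported by `urlopen` since Python 3.4):

```python
# Python 3 location of urlopen(); in Python 2 it was urllib.urlopen().
from urllib.request import urlopen

# A data: URL exercises urlopen() without touching the network.
with urlopen('data:,Hello') as f:
    body = f.read()
print(body)  # b'Hello'
```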
Scrape a web site

Example: http://www.wunderground.com/
Preparation
>>> from StringIO import StringIO
>>> from urllib2 import urlopen
>>> f = urlopen('http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html')
>>> p = f.read()
>>> d = StringIO(p)
>>> f.close()
BeautifulSoup

HTML/XML parser

designed for quick turnaround projects like screen-scraping

http://www.crummy.com/software/BeautifulSoup
BeautifulSoup
from BeautifulSoup import *
a = BeautifulSoup(d).findAll('a')
[x['href'] for x in a]
Faster BeautifulSoup
from BeautifulSoup import *
p = SoupStrainer('a')
a = BeautifulSoup(d, parseOnlyThese=p)
[x['href'] for x in a]
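The same strainer idea in today's bs4 package (BeautifulSoup 4 is an assumption here; the slides use the old BeautifulSoup 3 API, where the keyword was `parseOnlyThese` — in bs4 it is `parse_only`):

```python
from bs4 import BeautifulSoup, SoupStrainer  # assumes the bs4 package

html = '<p><a href="/a">A</a><a href="/b">B</a></p>'

# SoupStrainer tells the parser to build only <a> tags,
# which is faster than parsing the whole document.
only_a = SoupStrainer('a')
soup = BeautifulSoup(html, 'html.parser', parse_only=only_a)

links = [a['href'] for a in soup.find_all('a')]
print(links)  # ['/a', '/b']
```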
Inspect the Element

Inspect the Maximum temperature


Find the node
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup(d)
>>> attrs = {'class': 'nobr'}
>>> nobrs = soup.findAll(attrs=attrs)
>>> temperature = nobrs[3].span.string
>>> print temperature
23
htmllib.HTMLParser

Interesting only for historical reasons

based on sgmllib
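For reference, the event-driven style survives in the standard library: Python 3 dropped `htmllib` and `sgmllib`, but `html.parser.HTMLParser` works the same way. A small sketch that collects link targets:

```python
from html.parser import HTMLParser  # Python 3 stdlib

class LinkParser(HTMLParser):
    """Collect href attributes from <a> start tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        if tag == 'a':
            self.links.extend(v for k, v in attrs if k == 'href')

p = LinkParser()
p.feed('<a href="/x">x</a><a href="/y">y</a>')
print(p.links)  # ['/x', '/y']
```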
html5lib

Using the custom simpletree format

a built-in DOM-ish tree type (pythonic idioms)


from html5lib import parse
from html5lib import treebuilders
e = treebuilders.simpletree.Element
i = parse(d)
a = [x for x in i if isinstance(x, e) and x.name == 'a']
[x.attributes['href'] for x in a]
lxml

Library for processing XML and HTML

Based on C libraries
sudo aptitude install libxml2-dev
sudo aptitude install libxslt-dev

Extends the ElementTree API

e.g. with XPath


lxml
from lxml import etree
t = etree.parse('t.xml')
for node in t.xpath('//a'):
    node.tag
    node.get('href')
    node.items()
    node.text
    node.getparent()
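The same API sketched with an inline fragment instead of a `t.xml` file on disk (assumes lxml is installed; note the method is `getparent()`, all lowercase):

```python
from lxml import etree  # assumes the lxml package is installed

# Parse an inline fragment rather than a file.
doc = etree.fromstring('<p><a href="/a">first</a><a href="/b">second</a></p>')

# XPath support comes from the underlying libxml2 library.
for node in doc.xpath('//a'):
    print(node.tag, node.get('href'), node.text)

# Navigate upward to the enclosing element.
print(doc.xpath('//a')[0].getparent().tag)  # p
```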
twill

Simple

No JavaScript

http://twill.idyll.org

Some more interesting concepts

Pages, Scenarios

State Machines
twill

Commonly used methods:


go()
code()
show()
showforms()
formvalue() (or fv())
submit()
Twill
>>> from twill import commands as twill
>>> from twill import get_browser
>>> twill.go('http://www.google.com')
>>> twill.showforms()
>>> twill.formvalue(1, 'q', 'Python')
>>> twill.showforms()
>>> twill.submit()
>>> get_browser().get_html()
Twill - acknowledge_equiv_refresh
>>> twill.go("http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html")
...
twill.errors.TwillException: infinite refresh loop discovered; aborting.
Try turning off acknowledge_equiv_refresh...
Twill
>>> twill.config("acknowledge_equiv_refresh", "false")
>>> twill.go("http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html")
==> at http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html
'http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html'
mechanize

Stateful programmatic web browsing

navigation history

HTML form state

cookies

ftp:, http: and file: URL schemes

redirections

proxies

Basic and Digest HTTP authentication


mechanize - robots.txt
>>> import mechanize
>>> browser = mechanize.Browser()
>>> browser.open('http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html')
mechanize._response.httperror_seek_wrapper: HTTP Error 403: request disallowed by robots.txt
mechanize - robots.txt

Do not handle robots.txt


browser.set_handle_robots(False)

Do not handle equiv


browser.set_handle_equiv(False)
browser.open('http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html')
Selenium

http://seleniumhq.org

Support for JavaScript


Selenium
from selenium import webdriver
from selenium.common.exceptions \
import NoSuchElementException
from selenium.webdriver.common.keys \
import Keys
import time
Selenium
>>> browser = webdriver.Firefox()
>>> browser.get("http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html")
>>> a = browser.find_element_by_xpath("(//span[contains(@class,'nobr')])[position()=2]/span").text
>>> browser.close()
>>> print a
23
PhantomJS

http://www.phantomjs.org/
