Automated Process
Grab content
Example: http://www.wunderground.com/
Friday, May 18, 2012
Preparation
>>> from StringIO import StringIO
>>> from urllib2 import urlopen
>>> f = urlopen('http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html')
>>> p = f.read()
>>> d = StringIO(p)
>>> f.close()
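The session above uses the Python 2 modules urllib2 and StringIO. As a rough sketch only, the same preparation step might look like this on Python 3, using the standard-library replacements urllib.request and io. The helper name fetch_html, the timeout value, and the UTF-8 fallback are assumptions made for illustration; they are not part of the original slides.

```python
from io import StringIO
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Download a page and return it wrapped in a file-like StringIO.

    fetch_html is a hypothetical helper; the timeout and the UTF-8
    fallback are illustrative assumptions.
    """
    # The with-block closes the connection, like f.close() in the slide.
    with urlopen(url, timeout=timeout) as response:
        # Use the charset declared by the server, if any.
        charset = response.headers.get_content_charset() or "utf-8"
        text = response.read().decode(charset, errors="replace")
    return StringIO(text)


# Usage (needs network access, so it is left commented out):
# d = fetch_html('http://www.wunderground.com/history/airport/'
#                'BCN/2007/5/17/DailyHistory.html')
```

Returning a StringIO keeps the result file-like, so it can be handed to a parser in the same way as d in the slide.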
BeautifulSoup
HTML/XML parser
http://www.crummy.com/software/BeautifulSoup
BeautifulSoup
from BeautifulSoup import BeautifulSoup
# href=True skips anchors without an href, which would raise KeyError below
a = BeautifulSoup(d).findAll('a', href=True)
[x['href'] for x in a]
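For readers without BeautifulSoup installed, the same idea (collect the href of every anchor) can be sketched with the standard library alone. This is a stand-in using Python 3's html.parser, not the BeautifulSoup API; the names LinkCollector and extract_links are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags (hypothetical helper class)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html):
    """Return the href of every anchor in an HTML string, in order."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


sample = '<p><a href="/a">A</a> <a name="x">no href</a> <a href="/b">B</a></p>'
print(extract_links(sample))
```

Anchors without an href are skipped rather than raising an error, which matters on real pages where named anchors are common.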
Faster BeautifulSoup
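from BeautifulSoup import BeautifulSoup, SoupStrainer
d.seek(0)  # rewind in case d has already been read by an earlier parse
# Build the tree only for <a> tags; all other markup is skipped
p = SoupStrainer('a')
a = BeautifulSoup(d, parseOnlyThese=p)
[x['href'] for x in a.findAll('a', href=True)]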
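The snippet above targets BeautifulSoup 3. A rough equivalent for the later bs4 package might look like the following sketch, which assumes bs4 is installed and uses its parse_only argument in place of parseOnlyThese. The function name links_only and the choice of the built-in "html.parser" backend are assumptions for illustration.

```python
from bs4 import BeautifulSoup, SoupStrainer


def links_only(html):
    """Parse only <a> tags and return their href values.

    links_only is a hypothetical helper; "html.parser" is chosen so the
    sketch needs no extra parser library.
    """
    only_anchors = SoupStrainer("a")
    soup = BeautifulSoup(html, "html.parser", parse_only=only_anchors)
    # href=True keeps only anchors that actually carry an href.
    return [a["href"] for a in soup.find_all("a", href=True)]


sample = '<p><a href="/a">A</a> some text <a href="/b">B</a></p>'
print(links_only(sample))
```

Restricting the parse to anchors avoids building a tree for the rest of the page, which is where the speed-up comes from.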
Inspect the Element
based on sgmllib
html5lib
lxml
Based on C libraries
sudo aptitude install libxml2-dev
sudo aptitude install libxslt-dev
Simple
No JavaScript
http://twill.idyll.org
Pages, Scenarios
State Machines
twill
navigation history
cookies
redirections
proxies
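To make the features listed above more concrete, here is a minimal sketch of a stateful opener built from the Python 3 standard library. It is not twill's API: make_session, the User-Agent string, and the proxy mapping are hypothetical and for illustration only. Cookies are kept in a CookieJar, redirects are followed by the default redirect handler that build_opener installs, and proxies are set through ProxyHandler. Navigation history is not covered here.

```python
from http.cookiejar import CookieJar
from urllib.request import HTTPCookieProcessor, ProxyHandler, build_opener


def make_session(proxies=None):
    """Return (opener, jar) for stateful browsing (hypothetical helper).

    proxies: optional mapping such as {"http": "http://127.0.0.1:8080"}
    (an illustrative address). None keeps the default behaviour of
    reading proxy settings from the environment.
    """
    jar = CookieJar()
    # Cookies set by responses are stored in jar and sent on later requests.
    handlers = [HTTPCookieProcessor(jar)]
    if proxies is not None:
        # An explicit ProxyHandler replaces the environment-based default.
        handlers.append(ProxyHandler(proxies))
    # build_opener also adds the default handlers, including the one
    # that follows HTTP redirects.
    opener = build_opener(*handlers)
    opener.addheaders = [("User-Agent", "scraping-sketch/0.1")]
    return opener, jar


# Usage (needs network access, so it is left commented out):
# opener, jar = make_session()
# page = opener.open("http://www.wunderground.com/").read()
# print(len(jar), "cookies stored")
```

Reusing one opener for every request is what makes the session stateful: cookies collected on one page are sent with the next.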
http://seleniumhq.org
http://www.phantomjs.org/