
SCRAPY

------
Sources:
** VIDEO TUTORIAL: http://www.youtube.com/watch?v=1EFnX1UkXVU
** SCRAPY TUTORIAL: http://doc.scrapy.org/en/latest/intro/tutorial.html
** XPATH TUTORIAL/INFO: http://www.w3schools.com/xpath/
** INSTALLATION: http://doc.scrapy.org/en/latest/intro/install.html#intro-install
        Note: (important) Install the dependencies that match your OS (e.g. 64-bit) and Python version.
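        e.g. (assuming pip is available) pip install Scrapy; on Windows, dependencies such as
        lxml and pywin32 may need to be installed separately, as described in the installation guide above.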
--------------------------------------------------------------------------------
|                          STARTING A SCRAPY PROJECT                           |
--------------------------------------------------------------------------------
** Step 1: Go to the command prompt.
** Step 2: Change to the directory where you want to store the project.
        e.g. cd desktop                 # current directory is now the desktop
** Step 3: Type in: scrapy startproject your_project_name (to create a new scrapy project)
        e.g. scrapy startproject mlim
        Note: A folder should appear in the current working directory with the same name as your_project_name.
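        For reference, the layout that scrapy startproject typically generates looks roughly like this
        (shown for the mlim example; minor differences between Scrapy versions are possible):
            mlim/
                scrapy.cfg          # project configuration file
                mlim/               # the project's Python module
                    __init__.py
                    items.py        # item definitions (see Step 1 of the next section)
                    pipelines.py
                    settings.py
                    spiders/        # spider files go here (see Step 2 of the next section)
                        __init__.py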
--------------------------------------------------------------------------------
|                 SCRAPING INFORMATION FROM WEBSITES VIA XPATH                 |
--------------------------------------------------------------------------------
** Step 1: Open items.py
        Step 1.1: Add to the class the fields you want to obtain, then save the file
                  (see the sample items.py just below).
        e.g. link = Field() etc. (remove the pass statement)
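        A minimal items.py sketch, assuming the fields and class name used by the spider code
        below (CraigslistSampleItem from the craigslist_sample project); adjust the class name
        and fields to match your own project:
----------------------------------------------- CODE -----------------------------------------------
from scrapy.item import Item, Field


class CraigslistSampleItem(Item):
    title = Field()   # text of each listing link
    link = Field()    # href of each listing link
--------------------------------------------------------------------------------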
** Step 2: Create the spider using the BaseSpider class (the simplest form of a spider).
        Step 2.1: Open a new python file and save it (file extension .py) in the spiders folder.
        Sample Code (dependent on XPath):
----------------------------------------------- CODE -----------------------------------------------


from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from craigslist_sample.items import CraigslistSampleItem


class MySpider(BaseSpider):
    name = "craig"                                           # name of the spider
    allowed_domains = ["craigslist.org"]                     # "homepage" or main domain
    start_urls = ["http://sfbay.craigslist.org/sfc/npo/"]    # link of the page to be parsed

    def parse(self, response):                               # parsing function
        hxs = HtmlXPathSelector(response)

        # main body xpath (where the information to be parsed is grouped)
        titles = hxs.select("//div[@id='toc_rows']/div[2]/p/span[2]")
        items = []                        # list where the information of each article is stored
        for title in titles:
            item = CraigslistSampleItem()                    # see items.py (item contains link and title)
            item["title"] = title.select("a/text()").extract()   # title of the link (from inspection)
            item["link"] = title.select("a/@href").extract()     # link; @href accesses the attribute value
            items.append(item)                               # append the item to the list
        return items
--------------------------------------------------------------------------------
** Step 3: Save the file and go back to the command prompt.
** Step 4: Run the code by first changing the directory to your project folder.
        Type in: cd your_project_name
** Step 5: Crawl the website by running the spider.
        Type in: scrapy crawl spider_name (using the name attribute of the class you defined)
        e.g. scrapy crawl craig
** Step 6: To save the parsed information, type in: scrapy crawl spider_name -o filename.csv -t csv
        e.g. scrapy crawl craig -o items.csv -t csv     # the file is created in the project directory
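        Note: the exported CSV should contain one column per field defined in items.py (here, title and link).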
--------------------------------------------------------------------------------
|            SCRAPING INFORMATION FROM WEBSITES USING BeautifulSoup            |
--------------------------------------------------------------------------------
Source: https://gist.github.com/davepeck/790721
Sample Code:
import re

from scrapy.link import Link
from urlparse import urljoin
from BeautifulSoup import BeautifulSoup


class SoupLinkExtractor(object):
    """Link extractor that uses BeautifulSoup instead of Scrapy's built-in extractors."""

    def __init__(self, *args, **kwargs):
        super(SoupLinkExtractor, self).__init__()
        allow_re = kwargs.get('allow', None)
        self._allow = re.compile(allow_re) if allow_re else None

    def extract_links(self, response):
        raw_follow_urls = []

        # collect the href of every anchor tag on the page, skipping in-page "#" anchors
        soup = BeautifulSoup(response.body_as_unicode())
        anchors = soup.findAll('a')
        for anchor in anchors:
            anchor_href = anchor.get('href', None)
            if anchor_href and not anchor_href.startswith('#'):
                raw_follow_urls.append(anchor_href)

        # resolve relative URLs against the URL of the current page
        potential_follow_urls = [urljoin(response.url, raw_follow_url)
                                 for raw_follow_url in raw_follow_urls]

        # keep only the URLs matching the 'allow' pattern, if one was given
        if self._allow:
            follow_urls = [potential_follow_url
                           for potential_follow_url in potential_follow_urls
                           if self._allow.search(potential_follow_url) is not None]
        else:
            follow_urls = potential_follow_urls

        return [Link(url=follow_url) for follow_url in follow_urls]
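
A hedged usage sketch (not part of the gist above): one way the SoupLinkExtractor defined
above could be plugged into a CrawlSpider rule. The spider name, domain, and allow pattern
are hypothetical placeholders, and the sketch assumes the extractor lives in the same file.

from scrapy.contrib.spiders import CrawlSpider, Rule


class SoupCrawlSpider(CrawlSpider):
    name = "soup_crawler"                     # hypothetical spider name
    allowed_domains = ["example.com"]         # hypothetical domain
    start_urls = ["http://www.example.com/"]

    # follow only the links whose URL matches the (illustrative) allow pattern,
    # extracted with SoupLinkExtractor instead of Scrapy's built-in link extractors
    rules = (
        Rule(SoupLinkExtractor(allow=r'/articles/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # parse each followed page here, e.g. with HtmlXPathSelector as in the XPath section above
        pass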
