
Crawl data with Scrapy [Public] [Draft]

Scrapy Architecture (source: scrapy.org)

License

This document is licensed under a Creative Commons Share-alike 4.0 license.

Contributors

No.  Name                 Email
1    Cuong Tran           tranhuucuong91@gmail.com
2    Nguyễn Quang Dương   duongnq094@gmail.com
3    Nguyễn Đình Khải     khainguyenptiter@gmail.com
4    Nguyễn Bá Cường      cuongnb14@gmail.com
5    Phan Công Huân       nauh94@gmail.com

Table of Contents

License
Contributors
1. Scrapy Architecture
    1.1 Components
    1.2 Data flow
2. Scrapy Tutorial
    Install Scrapy
    Main steps in the tutorial:
        1. Defining our Item
        2. Our first Spider
        3. Crawling
        4. Extracting Items
        5. Storing the scraped data
3. Problems to solve with Scrapy
    Python Conventions
    XPath
    Docker
4. Store into database
    4.1 MongoDB (NoSQL)
        MongoDB to MySQL
        Store into mongo database
        Export/Import MongoDB
    4.2 MySQL database
5. Pipeline
    Duplicates filter
    Write items to a JSON file
    Price validation and dropping items with no prices
    Clean whitespace and HTML
6. Extractor, Spider
    Crawling all pages
    Passing additional data to callback functions
    XPath pattern
    XPath Tips from the Web Scraping Trenches
    Configure ItemLoader defaults: extract first and strip
    Extractor libraries
7. Downloader
    Configuring and using a proxy
        Working with an HTTP proxy
    Scrapy Download images
8. Scrapy settings
    ITEM_PIPELINES
    DownloaderStats
Scrapy handle AJAX Website
Scrapy handle AJAX Website with Splash
Scrapy debug
Scrapy caching
Scrapy revisit for update
Continue download: Jobs: pausing and resuming crawls
Monitoring scrapy, status, log
Handling multiple spiders in one project
Scrapy-fake-useragent
Crawling websites that require login
    Filling Login Forms Automatically
    Scrapy - how to manage cookies/sessions
    Multiple cookie sessions per spider
    How to send cookie with scrapy CrawlSpider requests?
By-pass anti-crawler
Practical experience
    Alibaba redirects and requires login
    Handling the products list
    Getting the company list
    The next_page problem
Deploy a Scrapy project with ScrapingHub
Scheduling spider runs
Django-dynamic-scraper (DDS)
    Requirements
    Documents
    SETUP:
        1. Install docker, compose
        2. Run docker django-dynamic-scraper
        3. Defining the object to be scraped
        4. Run crawl data
        5. Run a scheduled crawl
References

1. Scrapy Architecture
http://doc.scrapy.org/en/latest/topics/architecture.html

Figure 1: Scrapy Architecture

1.1 Components

Scheduler: schedules the order in which URLs are downloaded.
Downloader: downloads the data, handles download errors, and avoids duplicate downloads.
Spiders: extract the downloaded data into items and new requests.
Item Pipeline: processes the extracted data and stores it in the database.
Scrapy Engine: coordinates all of the components above.

1.2 Data flow

Step 1: The start URLs (start_urls) are turned into Requests and stored in the Scheduler.
Steps 2-3: The Scheduler sends the Requests, one by one, to the Downloader.
Steps 4-5: The Downloader fetches the data from the internet and sends Responses to the Spiders.
Steps 6-7: The Spiders:
    extract the data into Items and send them to the Item Pipeline;
    extract URLs, build new Requests, and send them to the Scheduler.
Step 8: The Item Pipeline processes the extracted data; in the simplest case it just stores the data in a database.
Step 9: Check whether the Scheduler still has Requests:
    Yes: go back to Step 2.
    No: finish.

2. Scrapy Tutorial
References:
1. Scrapy Tutorial: http://doc.scrapy.org/en/latest/intro/tutorial.html
2. Web scraping and crawling with Scrapy and SQLAlchemy:
https://viblo.asia/naa/posts/6BkGyxOLM5aV
3. Advanced web scraping and crawling techniques with Scrapy and SQLAlchemy:
https://viblo.asia/naa/posts/6BkGyxzeM5aV
4. GitHub: https://github.com/tranhuucuong91/scrapy-tutorial

Install Scrapy
# install virtualenv
sudo pip install virtualenv
virtualenv venv -p python3
source venv/bin/activate
# install scrapy dependencies
sudo apt-get install -y gcc g++
sudo apt-get install -y python3-dev
sudo apt-get install -y libssl-dev libxml2-dev libxslt1-dev libffi-dev
# install mysql dependencies
sudo apt-get install -y libmysqlclient-dev
# install python libs: scrapy, mysql
pip install -r requirements.txt

pip install twisted w3lib lxml cssselect pydispatch


# install scrapy in system
sudo apt-get install -y libssl-dev libxml2-dev libxslt1-dev
sudo apt-get install -y python-dev
sudo pip2 install scrapy pyOpenSSL
sudo apt-get install -y libssl-dev libxml2-dev libxslt1-dev
sudo apt-get install -y python-dev
sudo pip3 install scrapy

Main steps in the tutorial:

Create a Scrapy project:
scrapy startproject tutorial
1. Define the Items to extract
2. Write a spider to crawl a site and extract Items
3. Write an Item Pipeline to store the extracted Items
Source files:
    __init__.py
    items.py : defines the data structures to extract.
    pipelines.py : defines the functions that insert the data into the database.
    settings.py : configuration settings.
    spiders/
        __init__.py
        vietnamnet_vn.py : defines the data extraction functions

1. Defining our Item

Items are containers loaded with the scraped data. They work like Python dicts, but add some extra features.
Create a class in the file tutorial/items.py.
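
A minimal sketch of such a class, using the DmozItem name referenced later in this tutorial (the field names are illustrative):

import scrapy

class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()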

2. Our first Spider

A Spider is a class that we define and that Scrapy uses to scrape information from a domain (or a group of domains).
We define an initial list of URLs to download, how to follow links, and how to parse the contents of pages to extract items.
To create a spider, we subclass scrapy.Spider and define some attributes:
    name: identifies the spider and must be unique.
    start_urls: a list of URLs where the spider starts crawling. The first pages downloaded come from these URLs; the remaining ones are generated from the downloaded data.
    parse(): a method that is called with the Response object downloaded for each start URL. The response is passed to the method as its first and only argument. This method is responsible for parsing the response data and extracting the scraped data (as scraped Items) and more URLs to follow (as Request objects).
Create a spider in the directory tutorial/spiders.
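
A minimal sketch of such a spider, assuming the dmoz name used in the crawling step below (the start URL is only an example):

import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    ]

    def parse(self, response):
        # called with the downloaded Response of each start URL
        self.logger.info('Visited %s', response.url)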

3. Crawling
Go to the project's root directory and run:
scrapy crawl dmoz
// dmoz is the spider's name (the name attribute)
=> What happens:
- Scrapy creates a scrapy.Request for each URL in the spider's start_urls list and assigns the parse method as their callback.
- The Requests are scheduled and executed, returning scrapy.http.Response objects that are fed back to the spider through the parse() method.

4. Extracting Items
Introduction to Selectors
- Scrapy Selectors are a mechanism based on XPath or CSS expressions.
Note: XPath is more powerful than CSS.
- Scrapy provides the Selector class and some conventions and shortcuts for working with XPath and CSS expressions.
- A Selector object represents nodes in a structured document, so the first selector you create is bound to the root node, i.e. the whole document.
- Selectors have four basic methods:
1. xpath(): returns a list of selectors, each representing a node selected by the XPath expression passed as an argument.
2. css(): returns a list of selectors, each representing a node selected by the CSS expression passed as an argument.
3. extract(): returns a list of unicode strings with the selected data -> you can use extract_first() to get only the first element.
4. re(): returns a list of unicode strings extracted by applying the regular expression passed as an argument.
Note: the response object has a selector attribute, which is an instance of the Selector class. You can query it with response.selector.xpath() or response.selector.css(),
or use the shortcuts response.xpath() or response.css().
Using our item
Item objects are custom Python dicts; you can access their fields like this:
item = DmozItem()
// DmozItem is the class that defines the item
item['title'] = 'Example title'
Use the item in the parse() method (yield the Item object).
What yield means: http://phocode.com/python/python-iterator-va-generator/
Following links
url = response.urljoin(href.extract())
yield scrapy.Request(url, callback=self.parse_dir_contents)
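
Putting the pieces together, a hedged sketch of a parse() method (to be placed inside the DmozSpider above) that extracts items with selectors and follows a next-page link; the XPath expressions are illustrative:

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract_first()
            item['link'] = sel.xpath('a/@href').extract_first()
            item['desc'] = sel.xpath('text()').extract_first()
            yield item
        # follow the next listing page, if any
        for href in response.xpath('//a[@class="next"]/@href'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse)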

5. Storing the scraped data

Use the command:
scrapy crawl dmoz -o items.json

3. Problems to solve with Scrapy

Scrapy TODO:
Getting the data
    [x] How to extract data?
    [ ] A library of extraction patterns and extraction examples.
    [ ] Paging through a site to collect data.
    [x] Store data into database.
Speed and performance
    [x] Using a proxy with Scrapy.
    [ ] Caching.
    [x] Speeding up (multithreading) -> Scrapy does not support multithreading, but it does support asynchronous I/O.
    [ ] Scrapy continued downloads.
    [ ] Re-visit for updates.
    [ ] Monitoring Scrapy: status, logs.
    [ ] Using Scrapy with Docker.
Scrapy for dev
    [ ] Limit total requests (for testing).
    [ ] Scrapy debugging?
Multi-level crawling.
Deduplication.
Multiple spiders.

Find solutions for the following problems:

1. Scrapy:
- re-extraction (for example, using caching)
- continued downloads
- re-visiting for updates
- caching: experiment with it.
2. Monitoring Scrapy (status, logs): find a way to collect information about the crawler's state, like Scrapy stats but in real time, so we always know how the crawler is doing.
3. Using Scrapy with Docker: package Scrapy into a Docker image and run it inside a container.

Python Conventions
https://www.python.org/dev/peps/pep-0008/
http://docs.python-guide.org/en/latest/writing/style/

XPath
XPath testing tool:
https://chrome.google.com/webstore/detail/xpath-helper/hgimnogjllphhhkhlmebbmlgjoejdpjl?utm_source=chrome-app-launcher-info-dialog
XPath documentation:
https://drive.google.com/open?id=0ByyO0Po-LQ5aVnlobzNBOHhjWW8

Docker
Install docker and docker-compose
# install docker
wget -qO- https://get.docker.com/ | sh
sudo usermod -a -G docker `whoami`
# install docker-compose
sudo wget https://github.com/docker/compose/releases/download/1.9.0/docker-compose-`uname -s`-`uname -m` -O /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose

Reference:
https://github.com/tranhuucuong91/docker-training

Note: if you crawl large amounts of data, or sites with strict anti-bot checks, use a proxy to avoid getting your IP banned.
See the section "Configuring and using a proxy".

4. Store into database

4.1 MongoDB (NoSQL)
Pull the MongoDB Docker image and run it:
https://hub.docker.com/_/mongo/
Create a docker-compose.yml file with the following content:

version: "2"
services:
  mongodb:
    image: mongo:3.2
    ports:
      - "27017:27017"
    volumes:
      - ./mongodb-data/:/data/db
    # hostname: mongodb
    # domainname: coclab.lan
    cpu_shares: 512        # 0.5 CPU
    mem_limit: 536870912   # 512 MB RAM
    # privileged: true
    # restart: always
    # stdin_open: true
    # tty: true

MongoDB to MySQL
Database == Database
Collection == Table
Document == Row
Querying Mongo: https://docs.mongodb.org/getting-started/python/query/
Further reading: Why MongoDB Is a Bad Choice for Storing Our Scraped Data
https://blog.scrapinghub.com/2013/05/13/mongo-bad-for-scraped-data/
-> covers the problems encountered when using MongoDB:
1. Locking
2. Poor space efficiency
3. Too Many Databases
4. Ordered data
5. Skip + Limit Queries are slow
6. Restrictions
7. Impossible to keep the working set in memory
8. Data that should be good, ends up bad!

Store into mongo database

http://doc.scrapy.org/en/latest/topics/item-pipeline.html
Template pipeline:

import pymongo

class MongoPipeline(object):

    collection_name = 'scrapy_items'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db[self.collection_name].insert(dict(item))
        return item

Export/Import MongoDB
Export from the server:

mongodump --archive=crawler.`date +%Y-%m-%d"_"%H-%M-%S`.gz --gzip --db crawler

Import into MongoDB:

# copy the gzip file from the local machine into the container
docker cp /path/to/file container_id:/root
# restore
mongorestore --gzip --archive=/root/crawler.2016-04-18_07-40-11.gz --db crawler

4.2 MySQL database

Which character set and collation are appropriate? utf8_unicode_ci vs utf8_general_ci:
- utf8mb4_unicode_ci: sorts correctly, but is slower.
- utf8mb4_general_ci: sorting is less accurate, but faster.
-> choose utf8mb4_unicode_ci.
CREATE DATABASE crawler CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

Install mysqlclient:
sudo apt-get install -y libmysqlclient-dev
sudo pip2 install mysqlclient
sudo pip2 install sqlalchemy

MySQL commands:
# Login
mysql -u username -p
# Create a new database
> CREATE DATABASE name;
# Import:
> use name
> source import.sql

The pipeline controls the storing process; the models handle creating the database tables.
__init__: connects to the database, creates the tables, and sets up the session factory (sessionmaker) that handles saving data into the tables.
process_item: takes an item and a spider as parameters. The data crawled by the spider is put into the item and then stored in the database through the session.

TODO: write a demo of storing data into MySQL.
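
In the meantime, a minimal sketch of such a pipeline with SQLAlchemy; the Product model, the products table, the field names and the connection string are assumptions to adapt to your project:

from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

class Product(Base):
    __tablename__ = 'products'        # hypothetical table
    id = Column(Integer, primary_key=True)
    title = Column(String(255))
    url = Column(String(255))

class MySQLStorePipeline(object):
    def __init__(self):
        # connection string is an assumption; adjust user/password/host
        engine = create_engine(
            'mysql://user:password@localhost/crawler?charset=utf8mb4')
        Base.metadata.create_all(engine)          # create the table if missing
        self.Session = sessionmaker(bind=engine)

    def process_item(self, item, spider):
        session = self.Session()
        try:
            session.add(Product(title=item.get('title'), url=item.get('url')))
            session.commit()
        except Exception:
            session.rollback()
            raise
        finally:
            session.close()
        return item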



5. Pipeline
Duplicates filter

from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):

    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.ids_seen.add(item['id'])
            return item

Write items to a JSON file

import json

class JsonWriterPipeline(object):

    def __init__(self):
        # open in text mode, since json.dumps() returns a str
        self.file = open('items.jl', 'w')

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item

Price validation and dropping items with no prices

from scrapy.exceptions import DropItem

class PricePipeline(object):

    vat_factor = 1.15

    def process_item(self, item, spider):
        if item['price']:
            if item['price_excludes_vat']:
                item['price'] = item['price'] * self.vat_factor
            return item
        else:
            raise DropItem("Missing price in %s" % item)

Clean whitespace and HTML

import re

# cleans whitespace & HTML
class CleanerPipeline(object):

    def process_item(self, item, spider):
        # general tidying up
        for (name, val) in item.items():
            #utils.devlog("Working on %s [%s]" % (name, val))
            if val is None:
                item[name] = ""
                continue
            item[name] = re.sub(r'\s+', ' ', val).strip()  # remove whitespace
            #item['blurb'] = re.sub(r'<[^<]+?>', '', item['blurb']).strip()  # remove HTML tags

        # spider specific
        if spider.name == "techmeme":
            item['blurb'] = item['blurb'].replace('&nbsp; &mdash;&nbsp;', '').strip()

        return item

6. Extractor, Spider
Crawling all pages
You need to identify the page levels clearly:
    start_urls (level 1)
    parse() follows links to the level-2 pages -> these are the pages whose sub-pages must all be visited
    parse_next_page: crawls the data


def parse_articles_follow_next_page(self, response):
    for article in response.xpath("//article"):
        item = ArticleItem()

        # ... extract article data here

        yield item

    next_page = response.css("ul.navigation > li.next-page > a::attr('href')")
    if next_page:
        url = response.urljoin(next_page[0].extract())
        yield scrapy.Request(url, self.parse_articles_follow_next_page)

Note: to test what a selector returns, use the Scrapy shell.

Syntax: scrapy shell [url]
https://doc.scrapy.org/en/latest/topics/shell.html#topics-shell

Passing additional data to callback functions

http://doc.scrapy.org/en/latest/topics/request-response.html#topics-request-response-ref-request-callback-arguments
In some cases you may be interested in passing arguments to those callback functions so you can receive the arguments later, in the second callback.
You can use the Request.meta attribute for that.
Scenario: while crawling categories -> ... -> company websites -> extracting company data.
Problem: we want to keep the company's category information.
-> Solution: pass the extra information to the callback functions.

def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    request = scrapy.Request("http://www.example.com/some_page.html",
                             callback=self.parse_page2)
    request.meta['item'] = item
    yield request

def parse_page2(self, response):
    item = response.meta['item']
    item['other_url'] = response.url
    yield item

See the hands-on example further below.

XPath pattern
//p[contains(string(),'Address:')]

vietnamese_title: //p[contains(string(),'Vietnamese Title:')]
english_title: //p[contains(string(),'English Title:')]
address: //p[contains(string(), 'Address:') or contains(string(), 'Địa chỉ:') or contains(string(), 'Trụ sở chính:')]
province: //p[contains(string(),'Province:')]
director: //p[contains(string(),'Director:')]
tel: //p[contains(string(),'Tel:') or contains(string(),'Điện thoại:')]
fax: //p[contains(string(),'Fax:')]
email: //p[contains(string(),'Email:')]
main_business: //p[contains(string(),'Main Business:')]
business: //p[(contains(string(),'Business:') and not(contains(string(),'Main Business'))) or contains(string(),'Ngành nghề kinh doanh:')]
website: //p[contains(string(),'Website:')]
company_title: //p[contains(string(),'Vietnamese Title:')]
    fn:tokenize(//p[contains(string(),'Vietnamese Title:')], ':')[0]

XPath Tips from the Web Scraping Trenches

In the context of web scraping, XPath is a nice tool to have in your belt, as it allows you to write specifications of document locations more flexibly than CSS selectors. In case you're looking for a tutorial, here is an XPath tutorial with nice examples.

Avoid using contains(.//text(), 'search text') in your XPath conditions. Use contains(., 'search text') instead.

GOOD:
>>> xp("//a[contains(., 'Next Page')]")
[u'<a href="#">Click here to go to the <strong>Next Page</strong></a>']

BAD:
>>> xp("//a[contains(.//text(), 'Next Page')]")
[]

GOOD:
>>> xp("substring-after(//a, 'Next ')")
[u'Page']

BAD:
>>> xp("substring-after(//a//text(), 'Next ')")
[u'']

Beware of the difference between //node[1] and (//node)[1]

//node[1] selects all the nodes occurring first under their respective parents.
(//node)[1] selects all the nodes in the document, and then gets only the first of them.

When selecting by class, be as specific as necessary
If you want to select elements by a CSS class, the equivalent XPath expression is rather verbose, so consider using a CSS selector and chaining to XPath where needed:
>>> sel.css(".content").extract()
[u'<p class="content text-wrap">Some content</p>']
>>> sel.css('.content').xpath('@class').extract()
[u'content text-wrap']

Learn to use all the different axes

It is handy to know how to use the axes; you can follow the examples given in the tutorial to quickly review this.

Useful trick to get text content

Here is another XPath trick that you may use to get the interesting text contents:

//*[not(self::script or self::style)]/text()[normalize-space(.)]

This excludes the content from script and style tags and also skips whitespace-only text nodes.
Source: http://stackoverflow.com/a/19350897/2572383

Configure ItemLoader defaults: extract first and strip

http://stackoverflow.com/questions/17000640/scrapy-why-extracted-strings-are-in-this-format
There's a nice solution to this using Item Loaders. Item Loaders are objects that get data from responses, process the data and build Items for you. Here's an example of an Item Loader that will strip the strings and return the first value that matches the XPath, if any:

from scrapy.contrib.loader import XPathItemLoader
from scrapy.contrib.loader.processor import MapCompose, TakeFirst

class MyItemLoader(XPathItemLoader):
    default_item_class = MyItem
    default_input_processor = MapCompose(lambda string: string.strip())
    default_output_processor = TakeFirst()

And you use it like this:

def parse(self, response):
    loader = MyItemLoader(response=response)
    loader.add_xpath('desc', 'a/text()')
    return loader.load_item()

Extractor libraries
http://jeroenjanssens.com/2013/08/31/extracting-text-from-html-with-reporter.html
Three HTML text extractors in Python:
1. python-readability
https://github.com/buriy/python-readability
2. python-boilerpipe
https://github.com/misja/python-boilerpipe
Python interface to Boilerpipe, boilerplate removal and full-text extraction from HTML pages
3. python-goose
https://github.com/grangier/python-goose
HTML content / article extractor, a web scraping library in Python
pip2 install goose-extractor

newspaper:
https://github.com/codelucas/newspaper
pip3 install newspaper3k

Examples:
https://blog.openshift.com/day-16-goose-extractor-an-article-extractor-that-just-works/
https://github.com/shekhargulati/day16-goose-extractor-demo
http://vietnamnet.vn/vn/thoi-su/du-bao-thoi-tiet-hom-nay-6-12-ha-noi-ret-14-do-mien-trung-mua-cuc-lon-344734.html
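
As a hedged illustration of the newspaper3k API on the article linked above:

from newspaper import Article

url = ('http://vietnamnet.vn/vn/thoi-su/du-bao-thoi-tiet-hom-nay-6-12-ha-noi-ret-14-do-'
       'mien-trung-mua-cuc-lon-344734.html')
article = Article(url)
article.download()
article.parse()
print(article.title)
print(article.text[:200])   # first 200 characters of the extracted body text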

7. Downloader
Configuring and using a proxy
http://doc.scrapy.org/en/latest/topics/downloader-middleware.html
https://rohitnarurkar.wordpress.com/2013/10/29/scrapy-working-with-a-proxy-network/
1. Go into your project directory.
2. Create a file middlewares.py and add the following code:

# Importing the base64 library because we'll need it ONLY if the proxy we are
# going to use requires authentication
import base64

# Start your middleware class
class ProxyMiddleware(object):
    # overwrite process_request
    def process_request(self, request, spider):
        # Set the location of the proxy
        request.meta['proxy'] = "http://YOUR_PROXY:PORT"

        # Use the following lines if your proxy requires authentication
        proxy_user_pass = "USERNAME:PASSWORD"
        # set up basic authentication for the proxy (b64encode expects bytes)
        encoded_user_pass = base64.b64encode(proxy_user_pass.encode()).decode()
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass

3. Add the following lines to your settings.py:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
    'sample.middlewares.ProxyMiddleware': 100,
}

http://wayback.archive.org/web/20150828053704/http://blog.michaelyin.info/2014/02/19/scrapy-socket-proxy/

Working with an HTTP proxy

When crawling information from some websites, such as Google Shopping, the site will detect the source IP and restrict some services to specific IP addresses. The Scrapy framework can handle this situation by making the requests through a proxy.
Scrapy provides HttpProxyMiddleware to support HTTP proxies. If you want your web crawler to go through a proxy, the first thing you need to do is modify your settings file (see the DOWNLOADER_MIDDLEWARES example above).

Convert a SOCKS proxy to an HTTP proxy

http://stackoverflow.com/questions/19446536/proxy-ip-for-scrapy-framework

Until now I have been using a middleware in Scrapy to manually rotate IPs from the free proxy lists available on various websites. Now I am confused about which option I should choose:
1. Buy a premium proxy list from http://www.ninjasproxy.com/ or http://hidemyass.com/
2. Use TOR
3. Use a VPN service like http://www.hotspotshield.com/
4. Any option better than the above three

Free proxy list:
http://proxylist.hidemyass.com/
Proxy checker: a Python script.
Pick a random proxy from a list:
https://pypi.python.org/pypi/proxylist

from proxylist import ProxyList

pl = ProxyList()
pl.load_file('/web/proxy.txt')
pl.random()
# <proxylist.base.Proxy object at 0x7f1882d599e8>

pl.random().address()
# '1.1.1.1:8085'

Scrapy Download images


http://doc.scrapy.org/en/0.24/topics/images.html
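
A short sketch of enabling the built-in images pipeline in recent Scrapy versions; the storage path is an example, and the item that goes through the pipeline must have image_urls and images fields:

# settings.py
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
IMAGES_STORE = '/path/to/images'   # directory where downloaded images are stored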

8. Scrapy settings
http://doc.scrapy.org/en/latest/topics/settings.html

ITEM_PIPELINES
Default: {}
A dict containing the item pipelines to use, and their orders. The dict is empty by default. Order values are arbitrary, but it is customary to define them in the 0-1000 range.

ITEM_PIPELINES = {
    'mybot.pipelines.validate.ValidateMyItem': 300,
    'mybot.pipelines.validate.StoreMyItem': 800,
}


DownloaderStats
Middleware that stores statistics of all the requests, responses and exceptions that pass through it. To disable it, set it to None in DOWNLOADER_MIDDLEWARES:
'scrapy.downloadermiddlewares.stats.DownloaderStats': None,

http://doc.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.downloadermiddlewares.stats

Scrapy handle AJAX Website

http://stackoverflow.com/questions/24652170/scrapy-how-to-catch-the-unexpected-case-of-return-a-response-with-partial-html
During my crawling, some pages return a response with a partial HTML body and status 200. After I compare the response body with the one I open in the browser, the former is missing something. How can I catch this unexpected partial response body case in the spider or in a downloader middleware?
Below is a log example:
2014-01-23 16:31:53+0100 [filmweb_multi] DEBUG: Crawled (408)
http://www.filmweb.pl/film/Labirynt-2013-507169/photos>
(referer: http://www.filmweb.pl/film/Labirynt-2013-507169) ['partial']

Answer:
It's not partial content as such. The rest of the content is dynamically loaded by a JavaScript AJAX call.
To debug what content is being sent as the response for a particular request, use Scrapy's open_in_browser() function.
There's another thread on how to extract dynamic content from websites that are using AJAX; refer to it for a workaround.

To handle websites that use AJAX, one option is to use Selenium to fetch the page content the same way a normal browser would: the browser renders the data.
With this approach the crawler is much slower.
With Selenium the usual webdriver is Firefox, which requires a graphical environment. To run on a server without a GUI, use PhantomJS.


Headless with PhantomJS: use PhantomJS for automation testing and crawling

Selenium with the headless PhantomJS webdriver:
https://realpython.com/blog/python/headless-selenium-testing-with-python-and-phantomjs/
1. Setup
2. Example
3. Benchmarking
https://dzone.com/articles/python-testing-phantomjs
Scrapy with AJAX:
http://stackoverflow.com/questions/8550114/can-scrapy-be-used-to-scrape-dynamic-content-from-websites-that-are-using-ajax
Scrape hidden web data with Python:
http://www.6020peaks.com/2014/12/how-to-scrape-hidden-web-data-with-python/
Using PhantomJS as a downloader:
https://github.com/flisky/scrapy-phantomjs-downloader/blob/master/scrapy_phantomjs/downloader/handler.py
https://github.com/flisky/scrapy-phantomjs-downloader

"Element not found in the cache - perhaps the page has changed since it was looked up"
-> you need to find the element again after the page reloads.
Web elements you stored before clicking on the login button will not be present in the cache after login, because of the page refresh or page changes. You need to store these web elements again in order to make them available under the cache. I have modified your code a bit, which might help:

Selenium: looping over a select dropdown

Example site: http://dnxnk.moit.gov.vn/
-> Solution:
- Keep track of the index of the selected option.
- Each time you select, re-fetch the select element.

# 'driver' is an already created Selenium webdriver instance
element = driver.find_element_by_name('Years')
all_options = element.find_elements_by_tag_name('option')
for index in range(len(all_options)):
    element = driver.find_element_by_name('Years')
    all_options = element.find_elements_by_tag_name('option')
    option = all_options[index]
    print('Select: year is {}'.format(option.get_attribute('value')))
    option.click()

"Message: 'phantomjs' executable needs to be in PATH."

-> Add the phantomjs path to PATH (edit ~/.bashrc or ~/.zshrc).

Scrapy handle AJAX Website with Splash

https://github.com/scrapy-plugins/scrapy-splash
Using Splash with a proxy:
https://github.com/tranhuucuong91/docker-training/blob/master/compose/splash/docker-compose.yml
In your code, use a proxy for the Splash request:
SplashRequest(url, self.parse_data, args={'wait': 0.5, 'proxy': 'splash_proxy'})

Reference: Crawling dynamic pages: Splash + Scrapyjs => S2

http://www.thecodeknight.com/post_categories/search/posts/scrapy_python
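
A minimal sketch of a spider using scrapy-splash, assuming a Splash instance is reachable and the scrapy-splash middlewares and SPLASH_URL are configured in settings.py as described in the plugin's README; the spider name and URL are illustrative:

import scrapy
from scrapy_splash import SplashRequest

class JsSpider(scrapy.Spider):
    name = 'js_example'

    def start_requests(self):
        # render the page in Splash before it reaches the spider
        yield SplashRequest('http://example.com', self.parse_data, args={'wait': 0.5})

    def parse_data(self, response):
        yield {'title': response.xpath('//title/text()').extract_first()}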

Scrapy debug
http://doc.scrapy.org/en/latest/topics/debug.html#open-in-browser
How to use PyCharm to debug Scrapy projects:
http://unknownerror.org/opensource/scrapy/scrapy/q/stackoverflow/21788939/how-to-use-pycharm-to-debug-scrapy-projects

Scrapy caching
http://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-storage-fs
HTTPCACHE_ENABLED = True
HTTPCACHE_GZIP = True
HTTPCACHE_EXPIRATION_SECS = 30 * 24 * 60 * 60


Scrapy revisit for update:

http://stackoverflow.com/questions/23950184/avoid-scrapy-revisit-on-a-different-run

pipelines.py

from mybot.utils import connect_url_database

class DedupPipeline(object):
    def __init__(self):
        self.db = connect_url_database()

    def process_item(self, item, spider):
        url = item['url']
        self.db.insert(url)
        return item

middlewares.py

from scrapy import log
from scrapy.exceptions import IgnoreRequest
from mybot.utils import connect_url_database

class DedupMiddleware(object):
    def __init__(self):
        self.db = connect_url_database()

    def process_request(self, request, spider):
        url = request.url
        if self.db.has(url):
            log.msg('ignore duplicated url: <%s>' % url, level=log.DEBUG)
            raise IgnoreRequest()

settings.py

ITEM_PIPELINES = {
    'mybot.pipelines.DedupPipeline': 0
}
DOWNLOADER_MIDDLEWARES = {
    'mybot.middlewares.DedupMiddleware': 0
}

Continue download: Jobs: pausing and resuming crawls

http://scrapy.readthedocs.io/en/latest/topics/jobs.html
Sometimes, for big sites, it's desirable to pause crawls and be able to resume them later.

How to use it

To start a spider with persistence support enabled, run it like this:

scrapy crawl somespider -s JOBDIR=crawls/somespider-1

Then, you can stop the spider safely at any time (by pressing Ctrl-C or sending a signal), and resume it later by issuing the same command:

scrapy crawl somespider -s JOBDIR=crawls/somespider-1

Monitoring scrapy, status, log:


https://github.com/scrapinghub/scrapyrt/tree/master/scrapyrt

Handling multiple spiders in one project:

http://stackoverflow.com/questions/8372703/how-can-i-use-different-pipelines-for-different-spiders-in-a-single-scrapy-proje
I can think of at least four approaches:
1. Use a different scrapy project per set of spiders+pipelines (might be appropriate if your spiders are different enough to warrant being in different projects).
2. On the scrapy tool command line, change the pipeline setting with scrapy settings in between each invocation of your spider.
3. Isolate your spiders into their own scrapy tool commands, and define the default_settings['ITEM_PIPELINES'] on your command class to the pipeline list you want for that command. See line 6 of this example.
4. In the pipeline classes themselves, have process_item() check which spider it is running against, and do nothing if it should be ignored for that spider. See the example using resources per spider to get you started. (This seems like an ugly solution because it tightly couples spiders and item pipelines. You probably shouldn't use this one.)

class CustomPipeline(object):
    def process_item(self, item, spider):
        if spider.name == 'spider1':
            # do something
            return item
        return item


Scrapy-fake-useragent
Random User-Agent middleware based on fake-useragent. It picks up User-Agent strings based on usage statistics from a real-world database.
Configuration

Turn off the built-in UserAgentMiddleware and add RandomUserAgentMiddleware.

In Scrapy >= 1.0:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
}

Six examples of using scrapy.http.Request:
http://www.programcreek.com/python/example/71420/scrapy.http.Request
A Scrapy web scraping tutorial:
http://hopefulramble.blogspot.com/2014/08/web-scraping-with-scrapy-first-steps_30.html

Crawling websites that require login

http://blog.javachen.com/2014/06/08/using-scrapy-to-cralw-zhihu.html

Filling Login Forms Automatically


https://blog.scrapinghub.com/2012/10/26/filling-login-forms-automatically/
We often have to write spiders that need to login to sites, in order to scrape data from them.
Our customers provide us with the site, username and password, and we do the rest.
The classic way to approach this problem is:
1. launch a browser, go to site and search for the login page
2. inspect the source code of the page to find out:
1. which one is the login form (a page can have many forms, but usually one of
them is the login form)
2. which are the field names used for username and password (these could vary
a lot)
3. if there are other fields that must be submitted (like an authentication token)

3. write the Scrapy spider to replicate the form submission using FormRequest (here is
an example)
Being fans of automation, we figured we could write some code to automate point 2 (which is
actually the most time-consuming) and the result is loginform, a library to automatically fill
login forms given the login page, username and password.
Here is the code of a simple spider that would use loginform to login to sites automatically:
In addition to being open source, loginform code is very simple and easy to hack (check the
README on Github for more details). It also contains a collection of HTML samples to keep
the library well-tested, and a convenient tool to manage them. Even with the simple code so
far, we have seen accuracy rates of 95% in our tests. We encourage everyone with similar
needs to give it a try, provide feedback and contribute patches.

https://blog.scrapinghub.com/2016/05/11/monkeylearn-addon-retail-classifier-tutorial/

Scrapy - how to manage cookies/sessions


http://stackoverflow.com/questions/4981440/scrapy-how-to-manage-cookies-sessions

My script:
1. My spider has a start url of searchpage_url
2. The searchpage is requested by parse() and the search form response gets passed
to search_generator()
3. search_generator() then yields lots of search requests using FormRequest and the
search form response.
4. Each of those FormRequests, and the subsequent child requests, need to have their own session, so each needs its own individual cookiejar and its own session cookie.
=> Solution

Multiple cookie sessions per spider


http://doc.scrapy.org/en/latest/topics/downloader-middleware.html#std:reqmeta-cookiejar
There is support for keeping multiple cookie sessions per spider by using the cookiejar
Request meta key. By default it uses a single cookie jar (session), but you can pass an
identifier to use different ones.
For example:

for i, url in enumerate(urls):
    yield scrapy.Request("http://www.example.com", meta={'cookiejar': i},
        callback=self.parse_page)

Keep in mind that the cookiejar meta key is not "sticky". You need to keep passing it along on subsequent requests. For example:

def parse_page(self, response):
    # do some processing
    return scrapy.Request("http://www.example.com/otherpage",
        meta={'cookiejar': response.meta['cookiejar']},
        callback=self.parse_other_page)

How to send cookie with scrapy CrawlSpider requests?


http://stackoverflow.com/questions/32623285/how-to-send-cookie-with-scrapy-crawlspider-requests

def start_requests(self):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.85 Safari/537.36'}
    for i, url in enumerate(self.start_urls):
        yield Request(url, cookies={'over18': '1'}, callback=self.parse_item,
            headers=headers)

Don't know what's wrong with CrawlSpider, but Spider could work anyway.

# encoding: utf-8
import scrapy

class MySpider(scrapy.Spider):
    name = 'redditscraper'
    allowed_domains = ['reddit.com', 'imgur.com']
    start_urls = ['https://www.reddit.com/r/nsfw']

    def request(self, url, callback):
        """
        wrapper for scrapy.Request
        """
        request = scrapy.Request(url=url, callback=callback)
        request.cookies['over18'] = 1
        request.headers['User-Agent'] = (
            'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, '
            'like Gecko) Chrome/45.0.2454.85 Safari/537.36')
        return request

    def start_requests(self):
        for i, url in enumerate(self.start_urls):
            yield self.request(url, self.parse_item)

    def parse_item(self, response):
        titleList = response.css('a.title')
        for title in titleList:
            item = {}
            item['url'] = title.xpath('@href').extract()
            item['title'] = title.xpath('text()').extract()
            yield item
        url = response.xpath('//a[@rel="nofollow next"]/@href').extract_first()
        if url:
            yield self.request(url, self.parse_item)
        # you may consider scrapy.pipelines.images.ImagesPipeline :D

By-pass anti-crawler
- Some ways to tune Scrapy settings (see the settings sketch below):
    + Increase the delay time
    + Reduce the number of concurrent requests
    + Use proxies
    + Other site-specific tricks (AJAX, ...)
- https://learn.scrapehero.com/how-to-prevent-getting-blacklisted-while-scraping/
- http://doc.scrapy.org/en/latest/topics/practices.html#avoiding-getting-banned
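
A settings.py sketch with the knobs listed above; the values are examples to adjust per site:

DOWNLOAD_DELAY = 2                  # increase the delay between requests
CONCURRENT_REQUESTS = 8             # reduce overall concurrency
CONCURRENT_REQUESTS_PER_DOMAIN = 2  # and per-domain concurrency
AUTOTHROTTLE_ENABLED = True         # let Scrapy adapt the delay automatically
# proxies are configured through a downloader middleware, see the Downloader section above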


Practical experience

Alibaba redirects and requires login
-> the login step can be performed in the spider:

from scrapy.http import FormRequest

def parse(self, response):
    if response.url.startswith('https://login.alibaba.com'):
        self.logger.debug('Login: {}'.format(response.url))
        return self.login(response)
    else:
        return self.parse_product(response)

def login(self, response):
    return [FormRequest.from_response(response,
        formdata={'loginId': 'csdk@gmail.com', 'password': 'BSvrvY'},
        callback=self.parse)]

Handling the products list
Example: http://www.alibaba.com/Agricultural-Growing-Media_pid144
scrapy shell http://www.alibaba.com/Agricultural-Growing-Media_pid144

company: //div[@class="cbrand ellipsis"]
next_page: //a[@class="next"]/@href

=> Problem: from page 2 onward the next-page button can change; the "Next" button cannot be obtained via urljoin, and the next-page link cannot be extracted anymore.
response.xpath('//div[@class="cbrand ellipsis"]')

The next page is: http://www.alibaba.com/catalogs/products/CID144/2

Getting the company list:

Go to the page: http://www.alibaba.com/catalogs/products/CID144/2
company: //div[@class="stitle util-ellipsis"]
Run the command:
scrapy shell http://www.alibaba.com/catalogs/products/CID144/2
response.xpath('//div[@class="stitle util-ellipsis"]')
-> null. No data is extracted.
-> Cause: the website uses JavaScript to render the data.
- With a browser, the XPath extracts the data.
- With Scrapy, it does not.

Approach:

If you can see the data in a browser and can extract it with XPath, but Scrapy cannot extract it,
-> consider that the browser renders the data differently from Scrapy:
- the browser runs JavaScript and CSS and renders the data;
- Scrapy only fetches the raw HTML.
-> fetch the raw HTML and analyze it.
Example:
wget http://www.alibaba.com/catalogs/products/CID144/2
Reading the raw HTML, you see:
page.setPageData({"baseServer":"//www.alibaba.com","isForbiddenSell":false,"isForbidden":false,"clearAllHref":"//www.alibaba.com/catalogs/products/CID144","quotationSupplierNum":250375,"allCategory":null,"searchbarFixed":
-> we need to find a way to process this data with Scrapy.

Example: get the raw data

wget http://www.alibaba.com/catalogs/products/CID144/2
wget http://www.alibaba.com/catalogs/products/CID144/3
Extracting the JSON part gives the JSON data of the two pages:
http://pastebin.com/TBxYswGD
http://pastebin.com/ZFEMcuST

You can use the following site to read the JSON file more comfortably:
http://www.jsoneditoronline.org/
-> Approaches:
1. Use a regular expression to extract the JSON part, then parse the JSON and pull out the desired data.
2. Use a regular expression to extract exactly the data you want.


Example for approach 1:
json_data = response.xpath('string(//body)').re(r'page.setPageData\((.*?\})\);')[0]
# continue processing the JSON

Example for approach 2:
Pattern:
"supplierHref":"http://dfl-vermiculite.en.alibaba.com/company_profile.html#top-nav-bar"
Extraction code:
response.xpath('string(//body)').re(r'"supplierHref":"([^#]+)')

In this problem we choose the simpler approach 2: use a regular expression to extract exactly the piece of data we want.

Lessons learned:
1. If you can see the data in a browser but not with Scrapy, fetch the raw data and analyze it.
2. Notice the difference between what you see in the browser and the raw data. "What we see is not what we get." This is also one of the ways websites limit data crawling.
3. When XPath cannot be used, fall back to regular expressions. Regular expressions are the most basic tool and can handle almost every extraction problem.

The next_page problem:
It can be solved by iterating over the pages with an increasing index.
http://www.alibaba.com/catalogs/products/CID144/2
-> next_page: http://www.alibaba.com/catalogs/products/CID144/3

origin_url = 'http://www.alibaba.com/catalogs/products/CID144/2'
url_token = origin_url.split('/')
next_page_url = '/'.join(url_token[:-1] + [str(int(url_token[-1]) + 1)])
print(next_page_url)

When do the pages end?
-> when the Next button has no <a> tag.

So the next_page solution is:
Check: next_page = response.xpath('//a[@class="next"]/@href').extract_first('').strip()
=> this is not reliable, so we need another approach:
- If there is a next page: its URL is the current URL with the trailing index incremented by 1.
- If not: stop.
To crawl the product information, the following fields still need to be added:
- description
- cost, currency
- category, if it can be extracted

Deploy a Scrapy project with ScrapingHub

Source: https://scrapinghub.com/
Scrapinghub is a cloud-based web crawling platform that supports deploying and scaling your Scrapy projects, so you do not have to worry about servers, monitoring, backups, or scheduling. It offers many add-ons for extending your spiders, together with a smart rotating proxy that helps avoid being blocked by websites and increases crawl speed.
Main features:
    Jobs dashboard: a UI for managing jobs, with detailed statistics, that makes runs easy to manage.
    Item browser: displays the crawled data in a readable, structured way.
    Log inspector: inspect the logs generated during a run; errors are shown clearly.
    Data storage, usage reports and API: all crawled data is stored in ScrapingHub's database and can be accessed through the API. There is also a scheduling system and ready-made add-ons, such as MonkeyLearn, Splash, Crawlera, BigML, DeltaFetch, Images, Monitoring, Portia...
Install the shub command-line client:
$ pip/pip3 install shub
Log in before deploying:
$ shub login
Enter your API key (you can get it at https://dash.scrapinghub.com/account/apikey).
After logging in, the credentials are stored in ~/.scrapinghub.yml.
To deploy a project to scrapinghub:
Create a new project to hold your code under Scrapy Cloud Projects on scrapinghub.
Click on the project you just created, go to Code & Deploy, and note the project ID.
$ cd <your scrapy project>
$ shub deploy
Optionally, to install extra libraries for the run, edit scrapinghub.yml:
projects:
  default: 123
requirements_file: requirements.txt

Enter the project ID and the project will be deployed to scrapinghub.

To run the project, click Run and choose the spider name; there are a few run options such as priority, tags and arguments, depending on your needs.

Note that you can only run one spider at a time; subsequent runs are queued as next jobs.
The data is exported as CSV and stored in scrapinghub's data storage for one week. To keep it longer you will need to upgrade to a paid plan.


Scheduling spider runs:

Scrapinghub supports scheduling spider runs and it is very easy to use.

The free tier lets you deploy and run one project, with the supporting features and tools, 1 GB of RAM and 1 concurrent crawl. If needed, you can upgrade and use the other add-ons.

Django-dynamic-scraper (DDS)
Django-dynamic-scraper builds on Scrapy and the Django framework, and uses the Django admin interface to create Scrapy crawlers for many websites.
Dockerfile: https://github.com/khainguyen95/django-dynamic-scraper
Image: https://hub.docker.com/r/khainguyendinh/django-dynamic-scraper/

Requirements:
Python 2.7+ or 3.4+
Django 1.8/1.9
Scrapy 1.1
Scrapy-djangoitem 1.1
Python JSONPath RW 1.4+
Python future
scrapyd
django-celery
django-dynamic-scraper

Documents
Tutorial DDS
Scrapyd-client
DjangoItem in scrapy

SETUP:
1. Install docker, compose
Install docker:
$ wget -qO- https://get.docker.com/ | sh
$ sudo usermod -a -G docker `whoami`
Install compose:
$ sudo wget -q https://github.com/docker/compose/releases/download/1.6.2/docker-compose-`uname -s`-`uname -m` \
    -O /usr/local/bin/docker-compose
$ sudo chmod +x /usr/local/bin/docker-compose

Tip: after that, log out and log back in so the group change takes effect.

2. Run docker django-dynamic-scraper

Pull the docker image:
$ docker pull khainguyendinh/django-dynamic-scraper

3. Defining the object to be scraped

Create a UTF-8 database:
CREATE DATABASE news CHARACTER SET utf8 COLLATE utf8_general_ci;

$ cd djangoItem
Create an admin user:
python manage.py createsuperuser
Run the Django server:
python manage.py runserver 0.0.0.0:8000
Open the Django admin in a browser:
http://localhost:8000/admin
Add new Scraped object classes.
Add new Scrapers.
Add news websites.

4. Run crawl data

Run:
$ scrapy crawl [--output=FILE --output-format=FORMAT] SPIDERNAME -a id=REF_OBJECT_ID \
    [-a do_action=(yes|no) -a run_type=(TASK|SHELL) -a max_items_read={Int} \
    -a max_items_save={Int} -a max_pages_read={Int} \
    -a output_num_mp_response_bodies={Int} -a output_num_dp_response_bodies={Int}]
$ scrapy crawl news -a id=1 -a do_action=yes

5. Run a scheduled crawl:

Deploy the scrapy project:
$ cd crawl
$ scrapyd-deploy -p crawl
$ scrapyd

Run the scrapy schedule (via celery):
$ python manage.py celeryd -l info -B --settings=example_project.settings
python manage.py celeryd -l info -B --settings=djangoItem.settings

Run the XPath error checker:
$ scrapy crawl news_checker -a id=ITEM_ID -a do_action=yes


References

1. Scrapy Documentation: https://doc.scrapy.org/en/latest/
2. Scrapinghub Documentation: https://doc.scrapinghub.com
3. Django Dynamic Scraper: http://django-dynamic-scraper.readthedocs.io/
TODO: many other sources were used; each section has its own reference links. We will update the full reference list soon.

