License

Contributors

No.  Name               Email
1    Cuong Tran         tranhuucuong91@gmail.com
2                       duongnq094@gmail.com
3    Nguyễn Đình Khải   khainguyenptiter@gmail.com
4    Nguyễn Bá Cường    cuongnb14@gmail.com
5                       nauh94@gmail.com
Table of Contents

License
Contributors
1. Scrapy Architecture
    1.1 Components
    1.2 Data flow
2. Tutorial with Scrapy
    Install Scrapy
    Main steps in the tutorial:
    1. Defining our Item
    2. Our first Spider
    3. Crawling
    4. Extracting Items
    5. Storing the scraped data
5. Pipeline
    Duplicates filter
    Write items to a JSON file
    Price validation and dropping items with no prices
    Clean whitespace and HTML
6. Extractor, Spider
    Crawling all pages
    Passing additional data to callback functions
    XPath pattern
    XPath Tips from the Web Scraping Trenches
    Configure ItemLoader defaults: extract first and strip
    Extractor libraries
7. Downloader
    Configuring and using a proxy
8. Scrapy settings
    ITEM_PIPELINES
    DownloaderStats
    Scrapy debug
    Scrapy caching
    Scrapy-fake-useragent
    By-pass anti-crawler
Django-dynamic-scraper (DDS)
    Requirements
    Documents
    SETUP:
    1. Install docker, compose
    2. Run docker django-dynamic-scraper
    3. Defining the object to be scraped
1. Scrapy Architecture
http://doc.scrapy.org/en/latest/topics/architecture.html
2. Tutorial with Scrapy
References:
1. Scrapy Tutorial: http://doc.scrapy.org/en/latest/intro/tutorial.html
2. Web scraping and crawling with Scrapy and SQLAlchemy:
https://viblo.asia/naa/posts/6BkGyxOLM5aV
3. Advanced web scraping and crawling techniques with Scrapy and SQLAlchemy:
https://viblo.asia/naa/posts/6BkGyxzeM5aV
4. Github: https://github.com/tranhuucuong91/scrapy-tutorial
Install Scrapy
# install virtualenv
sudo pip install virtualenv
virtualenv venv -p python3
source venv/bin/activate
# install scrapy dependencies
sudo apt-get install -y gcc g++
sudo apt-get install -y python3-dev
sudo apt-get install -y libssl-dev libxml2-dev libxslt1-dev libffi-dev
# install mysql dependencies
sudo apt-get install -y libmysqlclient-dev
# install python libs: scrapy, mysql
pip install -r requirements.txt
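The requirements.txt belongs to the tutorial repo and is not reproduced here; a plausible minimal version, assuming only the Scrapy and MySQL dependencies set up above, would be:

# requirements.txt (illustrative; the repo's exact packages and pins may differ)
scrapy
mysqlclient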
3. Crawling
From the root directory of the project, run:
scrapy crawl dmoz
# dmoz is the spider's name (its `name` attribute)
=> What happens:
- Scrapy creates a scrapy.Request for each URL in the spider's start_urls list and assigns
the spider's parse() method as their callback function.
- The Requests are scheduled, then executed, returning scrapy.http.Response objects that
are fed back to the spider through the parse() method.
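For reference, a minimal sketch of what such a spider looks like (names follow the Scrapy tutorial; the project's actual spider file is not reproduced here):

import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"  # the name used by `scrapy crawl dmoz`
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    ]

    def parse(self, response):
        # called with the Response downloaded for each URL in start_urls
        self.log('Visited %s' % response.url)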
4. Extracting Items
Introduction to Selectors
- Scrapy uses a mechanism based on XPath or CSS expressions, called Scrapy Selectors.
Note: XPath is more powerful than CSS.
- Scrapy provides the Selector class plus a number of conventions and shortcuts for working
with XPath and CSS expressions.
- A Selector object represents nodes in a structured document, so the first instantiated
selector is bound to the root node, i.e. the entire document.
- Selectors have 4 basic methods:
1. xpath(): returns a list of selectors, each representing a node selected by the XPath
expression passed as an argument.
2. css(): returns a list of selectors, each representing a node selected by the CSS
expression passed as an argument.
3. extract(): returns a list of unicode strings with the selected data -> you can use
extract_first() to get only the first element.
4. re(): returns a list of unicode strings extracted by applying the regular expression
passed as an argument.
Note: the response object has a selector attribute, an instance of the Selector class. You can
query it with response.selector.xpath() or response.selector.css(),
or use the shortcuts: response.xpath() or response.css()
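A quick sketch of the four methods on a toy document (a standalone Selector, not from the tutorial project):

from scrapy.selector import Selector

sel = Selector(text='<html><body><span class="year">2016</span></body></html>')

sel.xpath('//span/text()')             # list of selectors matched by the XPath expression
sel.css('span::text')                  # the same nodes, selected with a CSS expression
sel.xpath('//span/text()').extract()   # -> [u'2016']
sel.css('span::text').extract_first()  # -> u'2016'
sel.xpath('//span/text()').re(r'\d+')  # -> [u'2016']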
Using our item
Item objects are custom Python dicts; you can access their fields like this:
item = DmozItem()
# DmozItem is the class that defines the item
item['title'] = 'Example title'
Use the item in the parse() method (yield the Item object).
What yield does: http://phocode.com/python/python-iterator-va-generator/
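Putting it together, a parse() method that yields items might look like this (the selectors are illustrative, following the Scrapy tutorial's DmozItem fields):

def parse(self, response):
    for sel in response.xpath('//ul/li'):
        item = DmozItem()
        item['title'] = sel.xpath('a/text()').extract_first()
        item['link'] = sel.xpath('a/@href').extract_first()
        item['desc'] = sel.xpath('text()').extract_first()
        yield item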
Following links
# for each link found on the page, build an absolute URL and follow it
for href in response.css('a::attr(href)'):
    url = response.urljoin(href.extract())
    yield scrapy.Request(url, callback=self.parse_dir_contents)
2. Monitoring scrapy, status, log: find a way to collect crawler status information, like
Scrapy stats but in real time. The goal is to know how the crawler is doing.
3. Using scrapy with Docker: package scrapy into a Docker image and run it in Docker.
Python Conventions
https://www.python.org/dev/peps/pep-0008/
http://docs.python-guide.org/en/latest/writing/style/
Xpath
Tool for testing XPath expressions:
https://chrome.google.com/webstore/detail/xpath-helper/hgimnogjllphhhkhlmebbmlgjoejdpjl?utm_source=chrome-app-launcher-info-dialog
XPath documentation:
https://drive.google.com/open?id=0ByyO0Po-LQ5aVnlobzNBOHhjWW8
Docker
Install docker and docker-compose
# install docker
wget -qO- https://get.docker.com/ | sh
sudo usermod -a -G docker `whoami`
# install docker-compose
sudo wget https://github.com/docker/compose/releases/download/1.9.0/docker-compose-`uname -s`-`uname -m` -O /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
Reference:
https://github.com/tranhuucuong91/docker-training
Note: if you crawl a large amount of data, or crawl strictly policed sites, you should use a
proxy to avoid getting your IP banned.
See the section "Configuring and using a proxy".
MongoDB to MySQL
Database == Database
Collection == Table
Document == Row
Querying Mongo: https://docs.mongodb.org/getting-started/python/query/
Further reading: Why MongoDB Is a Bad Choice for Storing Our Scraped Data
https://blog.scrapinghub.com/2013/05/13/mongo-bad-for-scraped-data/
-> covers the problems encountered when using MongoDB:
1. Locking
2. Poor space efficiency
3. Too Many Databases
4. Ordered data
Export/Import MongoDB
Export from server:
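Presumably a mongodump whose options mirror the restore below (a sketch, not the original command):

# dump the crawler database into a gzipped archive
mongodump --gzip --archive=crawler.2016-04-18_07-40-11.gz --db crawler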
Import into MongoDB:
# copy the gzip file from local to the container
docker cp /path/to/file container_id:/root
# restore
mongorestore --gzip --archive=/root/crawler.2016-04-18_07-40-11.gz --db crawler
5. Pipeline
Duplicates filter

from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):

    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.ids_seen.add(item['id'])
            return item
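Write items to a JSON file

A minimal version of this pipeline, following the example in the Scrapy docs (the items.jl file name and the one-object-per-line "JSON lines" format are the docs' convention):

import json

class JsonWriterPipeline(object):

    def __init__(self):
        # one JSON object per line ("JSON lines" format)
        self.file = open('items.jl', 'w')

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item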
Price validation and dropping items with no prices

from scrapy.exceptions import DropItem

class PricePipeline(object):

    vat_factor = 1.15

    def process_item(self, item, spider):
        if item['price']:
            if item['price_excludes_vat']:
                item['price'] = item['price'] * self.vat_factor
            return item
        else:
            raise DropItem("Missing price in %s" % item)
6. Extractor, Spider
Crawling all pages
Clearly identify the level of each page:
start_urls (level 1)
in parse(), follow links to level-2 pages -> these are the pages whose child pages all
need to be visited

if next_page:
    url = response.urljoin(next_page[0].extract())
    yield scrapy.Request(url, callback=self.parse)

Passing additional data to callback functions

def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    request = scrapy.Request("http://www.example.com/some_page.html",
                             callback=self.parse_page2)
    request.meta['item'] = item
    yield request

def parse_page2(self, response):
    item = response.meta['item']
    item['other_url'] = response.url
    yield item
XPath pattern
//p[contains(string(),'Address:')]

Field             XPath pattern
vietnamese_title  //p[contains(string(),'Vietnamese Title:')]
english_title     //p[contains(string(),'English Title:')]
address           //p[contains(string(),'Address:') or contains(string(),'Địa chỉ:') or contains(string(),'Trụ sở chính:')]
province          //p[contains(string(),'Province:')]
director          //p[contains(string(),'Director:')]
tel               //p[contains(string(),'Tel:') or contains(string(),'Điện thoại:')]
fax               //p[contains(string(),'Fax:')]
email             //p[contains(string(),'Email:')]
main_business     //p[contains(string(),'Main Business:')]
business          //p[(contains(string(),'Business:') and not(contains(string(),'Main Business'))) or contains(string(),'Ngành nghề kinh doanh:')]
website           //p[contains(string(),'Website:')]
company_title     //p[contains(string(),'Vietnamese Title:')]
XPath Tips from the Web Scraping Trenches
Avoid using contains(.//text(), 'search text') in your XPath conditions. Use
contains(., 'search text') instead. The reason: .//text() yields a node-set, and when a
node-set is converted to a string for contains(), only its first text node is used, so the
condition can miss text that sits inside a child element; "." converts to the full string
value of the node.
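A minimal sketch of the difference (a standalone Selector, not from this project):

from scrapy import Selector

sel = Selector(text='<a href="#">Click here to go to the <strong>Next Page</strong></a>')

# contains() looks only at the first node of the .//text() node-set,
# which is 'Click here to go to the ' -> no match
sel.xpath("//a[contains(.//text(), 'Next Page')]").extract()   # -> []

# '.' is the element's full string value, including <strong>'s text -> match
sel.xpath("//a[contains(., 'Next Page')]").extract_first()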
Beware of the difference between //node[1] and (//node)[1]: //node[1] selects all the nodes
occurring first under their respective parents, while (//node)[1] selects all the nodes in
the document, and then gets only the first of them.

When selecting by class, be as specific as necessary.
If you want to select elements by a CSS class, the XPath way to do that is rather verbose:
*[contains(concat(' ', normalize-space(@class), ' '), ' someclass ')]
It is usually simpler to select by class with CSS and chain into XPath when needed:
>>> sel.css(".content").extract()
[u'<p class="content text-wrap">Some content</p>']
>>> sel.css('.content').xpath('@class').extract()
[u'content text-wrap']

//*[not(self::script or self::style)]/text()[normalize-space(.)]
This excludes the content of script and style tags and also skips whitespace-only text nodes.
Source: http://stackoverflow.com/a/19350897/2572383
Configure ItemLoader defaults: extract first and strip

# note: XPathItemLoader is the old (pre-1.0) Scrapy API; newer versions use ItemLoader
from scrapy.contrib.loader import XPathItemLoader
from scrapy.contrib.loader.processor import MapCompose, TakeFirst

class MyItemLoader(XPathItemLoader):
    default_item_class = MyItem
    default_input_processor = MapCompose(lambda string: string.strip())
    default_output_processor = TakeFirst()
Extractor libraries
http://jeroenjanssens.com/2013/08/31/extracting-text-from-html-with-reporter.html
3 HTML text extractors in Python:
1. python-readability
https://github.com/buriy/python-readability
2. python-boilerpipe
https://github.com/misja/python-boilerpipe
Python interface to Boilerpipe: boilerplate removal and fulltext extraction from HTML pages
3. python-goose
https://github.com/grangier/python-goose
HTML content / article extractor, a web scraping lib in Python
pip2 install goose-extractor
https://github.com/codelucas/newspaper
pip3 install newspaper3k
Examples:
https://blog.openshift.com/day-16-goose-extractor-an-article-extractor-that-just-works/
https://github.com/shekhargulati/day16-goose-extractor-demo
http://vietnamnet.vn/vn/thoi-su/du-bao-thoi-tiet-hom-nay-6-12-ha-noi-ret-14-do-mien-trung-mua-cuc-lon-344734.html
7. Downloader
Configuring and using a proxy
http://doc.scrapy.org/en/latest/topics/downloader-middleware.html
https://rohitnarurkar.wordpress.com/2013/10/29/scrapy-working-with-a-proxy-network/
1. Go into your project directory
2. Create a file middlewares.py and add the following code:

# Import the base64 library; we only need it if the proxy requires authentication
import base64

# Start your middleware class
class ProxyMiddleware(object):
    # override process_request
    def process_request(self, request, spider):
        # Set the location of the proxy
        request.meta['proxy'] = "http://YOUR_PROXY:PORT"
        # Use the following lines if your proxy requires authentication
        proxy_user_pass = "USERNAME:PASSWORD"
        # set up basic authentication for the proxy
        # (b64encode instead of the deprecated base64.encodestring,
        # which also appended a trailing newline)
        encoded_user_pass = base64.b64encode(proxy_user_pass.encode()).decode()
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
3. Add the following lines in your settings.py script:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
    'sample.middlewares.ProxyMiddleware': 100,
}
http://wayback.archive.org/web/20150828053704/http://blog.michaelyin.info/2014/02/19/scrapy-socket-proxy/
When crawling information from websites like Google Shopping, the site may detect the source
IP and restrict some services to specific IP addresses; Scrapy can handle this situation by
making requests through a proxy.
Scrapy provides HttpProxyMiddleware to support HTTP proxies. If you want your web crawler to
go through a proxy, the first thing you need to do is modify your settings file as shown
above.
Until now I have been using middleware in Scrapy to manually rotate IPs from the free proxy
lists available on various websites.
Now I am unsure which of these options I should choose:
1. Buy a premium proxy list from http://www.ninjasproxy.com/ or http://hidemyass.com/
2. Use TOR
3. Use a VPN service like http://www.hotspotshield.com/
4. Any option better than the above three
Free proxy list:
http://proxylist.hidemyass.com/
Proxy checker: a Python script
Pick a random proxy from a list:
https://pypi.python.org/pypi/proxylist
from proxylist import ProxyList
pl = ProxyList()
pl.load_file('/web/proxy.txt')
pl.random()
# <proxylist.base.Proxy object at 0x7f1882d599e8>
pl.random().address()
# '1.1.1.1:8085'
8. Scrapy settings
http://doc.scrapy.org/en/latest/topics/settings.html
ITEM_PIPELINES
Default: {}
A dict containing the item pipelines to use, and their orders. The dict is empty by
default. Order values are arbitrary, but it is customary to define them in the 0-1000
range.
ITEM_PIPELINES = {
    'mybot.pipelines.validate.ValidateMyItem': 300,
    'mybot.pipelines.validate.StoreMyItem': 800,
}
DownloaderStats
Middleware that stores statistics of all requests, responses and exceptions that pass
through it. To disable it:
'scrapy.downloadermiddleware.stats.DownloaderStats': None,
http://doc.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.downloadermiddlewares.stats
Answer:
It's not partial content as such. The rest of the content is dynamically loaded by
JavaScript (AJAX).
To debug what content is being sent as the response for a particular request, use Scrapy's
open_in_browser() function.
There's another thread on "How to extract dynamic content from websites that are using
AJAX?". Refer to it for a workaround.
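A sketch of how open_in_browser() is typically dropped into a callback while debugging:

from scrapy.utils.response import open_in_browser

def parse(self, response):
    # opens this exact response in your browser, so you can see
    # the page as Scrapy received it (before any JavaScript runs)
    open_in_browser(response)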
# Selenium: pick a year from a <select> dropdown
# all_options: a list of <option> WebElements,
# e.g. from select_element.find_elements_by_tag_name('option')
option = all_options[index]
print('Select: year is {}'.format(option.get_attribute('value')))
option.click()
Scrapy debug
http://doc.scrapy.org/en/latest/topics/debug.html#open-in-browser
How to use pycharm to debug scrapy projects:
http://unknownerror.org/opensource/scrapy/scrapy/q/stackoverflow/21788939/how-to-use-pycharm-to-debug-scrapy-projects
Scrapy caching
http://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-storage-fs
HTTPCACHE_ENABLED = True
HTTPCACHE_GZIP = True
HTTPCACHE_EXPIRATION_SECS = 30 * 24 * 60 * 60  # expire cached pages after 30 days
middlewares.py

from scrapy import log
from scrapy.exceptions import IgnoreRequest
from mybot.utils import connect_url_database

class DedupMiddleware(object):

    def __init__(self):
        self.db = connect_url_database()

    def process_request(self, request, spider):
        url = request.url
        if self.db.has(url):
            log.msg('ignore duplicated url: <%s>' % url, level=log.DEBUG)
            raise IgnoreRequest()

settings.py

ITEM_PIPELINES = {
    'mybot.pipelines.DedupPipeline': 0
}
DOWNLOADER_MIDDLEWARES = {
    'mybot.middlewares.DedupMiddleware': 0
}
How to use it
Run your spider with persistence enabled:
scrapy crawl somespider -s JOBDIR=crawls/somespider-1
Then, you can stop the spider safely at any time (by pressing Ctrl-C or sending a
signal), and resume it later by issuing the same command:
scrapy crawl somespider -s JOBDIR=crawls/somespider-1
3. Isolate your spiders into their own scrapy tool commands, and define the settings
you need per command.
4. In the pipeline classes themselves, have process_item() check what spider it's
running against, and do nothing if it should be ignored for that spider. See the
example using resources per spider to get you started. (This seems like an ugly
solution because it tightly couples spiders and item pipelines. You probably shouldn't
use this one.)

class CustomPipeline(object):

    def process_item(self, item, spider):
        if spider.name == 'spider1':
            # do something
            return item
        return item
Scrapy-fake-useragent
Random User-Agent middleware based on fake-useragent. It picks up User-Agent strings
based on usage statistics from a real-world database.
Configuration
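A typical configuration, along the lines of the scrapy-fake-useragent README: disable the built-in UserAgentMiddleware and register the random one (treat the order value as a convention, not a requirement):

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
}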
3. Write the Scrapy spider to replicate the form submission using FormRequest (here is
an example).
Being fans of automation, we figured we could write some code to automate point 2 (which is
actually the most time-consuming part), and the result is loginform, a library to
automatically fill in login forms given the login page, username and password.
Here is the code of a simple spider that would use loginform to log in to sites automatically:
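The original listing is not reproduced here; a sketch, assuming loginform's fill_login_form(url, body, username, password) helper returns the filled form data, the action URL, and the HTTP method:

import scrapy
from loginform import fill_login_form

class LoginSpider(scrapy.Spider):
    name = 'login-demo'
    start_urls = ['http://example.com/login']  # hypothetical login page

    def parse(self, response):
        # let loginform locate the login form and fill in the credentials
        data, url, method = fill_login_form(response.url, response.body,
                                            'my_username', 'my_password')
        return scrapy.FormRequest(url, formdata=dict(data),
                                  method=method, callback=self.after_login)

    def after_login(self, response):
        # the session is now authenticated; continue crawling here
        pass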
In addition to being open source, loginform's code is very simple and easy to hack (check the
README on GitHub for more details). It also contains a collection of HTML samples to keep
the library well tested, and a convenient tool to manage them. Even with the simple code so
far, we have seen accuracy rates of 95% in our tests. We encourage everyone with similar
needs to give it a try, provide feedback and contribute patches.
https://blog.scrapinghub.com/2016/05/11/monkeylearn-addon-retail-classifier-tutorial/
My script:
1. My spider has a start URL of searchpage_url.
2. The search page is requested by parse() and the search form response gets passed
to search_generator().
3. search_generator() then yields lots of search requests using FormRequest and the
search form response.
4. Each of those FormRequests, and subsequent child requests, needs to have its own
session, so needs to have its own individual cookiejar and its own session cookie.
=> Solution
Keep in mind that the cookiejar meta key is not sticky. You need to keep passing it along
on subsequent requests. For example:
def parse_page(self, response):
    # do some processing
    return scrapy.Request("http://www.example.com/otherpage",
                          meta={'cookiejar': response.meta['cookiejar']},
                          callback=self.parse_other_page)
Don't know what's wrong with CrawlSpider, but a plain Spider works anyway.

# encoding: utf-8
import scrapy

class MySpider(scrapy.Spider):
    name = 'redditscraper'
    allowed_domains = ['reddit.com', 'imgur.com']
    start_urls = ['https://www.reddit.com/r/nsfw']

    def request(self, url, callback):
        """
By-pass anti-crawler
- Some ways to tune Scrapy's settings (see the sketch after the links below):
+ Increase the delay time
+ Reduce concurrent requests
+ Use a proxy
+ Some other tricks, depending on the site (ajax, ...)
- https://learn.scrapehero.com/how-to-prevent-getting-blacklisted-while-scraping/
- http://doc.scrapy.org/en/latest/topics/practices.html#avoiding-getting-banned
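A sketch of the corresponding settings.py tweaks (the values are illustrative; tune them per site):

# settings.py - crawl more politely to reduce the chance of being banned
DOWNLOAD_DELAY = 2                    # increase the delay between requests
CONCURRENT_REQUESTS_PER_DOMAIN = 1    # reduce concurrency
AUTOTHROTTLE_ENABLED = True           # let Scrapy adapt the request rate automatically
# plus a proxy, e.g. via the ProxyMiddleware from section 7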
Processing a products list
Example: http://www.alibaba.com/Agricultural-Growing-Media_pid144
scrapy shell http://www.alibaba.com/Agricultural-Growing-Media_pid144
next_page is: http://www.alibaba.com/catalogs/products/CID144/2

response.xpath('//div[@class="stitle util-ellipsis"]')
-> null. No data is returned.
-> Cause: the website uses JavaScript to render the data.
- With a browser, the same XPath does return the data.
- With Scrapy, it does not.
Example:
json_data = response.xpath('string(//body)').re(r'page.setPageData\((.*?\})\);')[0]
# continue processing the JSON
Or:
2. Use a regular expression to extract exactly the data you want.
Example:
Pattern:
"supplierHref":"http://dfl-vermiculite.en.alibaba.com/company_profile.html#top-nav-bar"
Extraction code:
response.xpath('string(//body)').re(r'"supplierHref":"([^#]+)')
In this problem, we choose the simpler option, option 2: use a regular expression to
extract exactly the piece of data we want.
Lessons:
1. The browser shows the data, but Scrapy does not see it
-> you need to fetch the raw data and analyze it.
2. You need to spot the differences between what the browser renders and the raw data.
"What we see is not what we think."
This is also how websites limit the crawling of their data.
3. When XPath cannot be used, fall back to regular expressions. Regular expressions are the
most basic level and can handle almost any extraction problem.
The next_page problem:
It can be solved by walking the pages with an increasing index.
http://www.alibaba.com/catalogs/products/CID144/2
-> next_page: http://www.alibaba.com/catalogs/products/CID144/3

origin_url = 'http://www.alibaba.com/catalogs/products/CID144/2'
url_token = origin_url.split('/')
next_page_url = '/'.join(url_token[:-1] + [str(int(url_token[-1]) + 1)])
print(next_page_url)

When do the pages run out?
To run the project, click Run and choose the spider name; there are a few options when
running, such as priority, tags, and arguments, depending on your needs.

The free tier lets you deploy and run one project, with the supporting features and tools,
with 1 GB RAM and 1 concurrent crawl. If needed, you can upgrade, customize,
and use the other addons.
Django-dynamic-scraper (DDS)
Django-dynamic-scraper uses Scrapy on top of the Django framework, with the Django admin
interface used to create Scrapy crawlers for many websites.
Dockerfile: https://github.com/khainguyen95/django-dynamic-scraper
Image: https://hub.docker.com/r/khainguyendinh/django-dynamic-scraper/
Requirements:
Python 2.7+ or 3.4+
Django 1.8/1.9
Scrapy 1.1
Scrapy-djangoitem 1.1
Python JSONPath RW 1.4+
Python future
scrapyd
django-celery
django-dynamic-scraper
Documents
Tutorial DDS
Scrapyd-client
DjangoItem in scrapy
SETUP:
1. Install docker, compose
Install docker:
$ wget -qO- https://get.docker.com/ | sh
$ sudo usermod -a -G docker `whoami`
Install compose:
$ sudo wget -q https://github.com/docker/compose/releases/download/1.6.2/docker-compose-`uname -s`-`uname -m` \
    -O /usr/local/bin/docker-compose