Source code on Gitee: https://gitee.com/lizhaojie/Scrapy.git
Preparation: download a few required packages. I had prepared download links, but CSDN keeps posts with more than five links stuck in review, so I'm not including them; you can find the official download pages with a quick search.
This tutorial uses Python 3.6.
Step 1: install wheel
pip install wheel
Step 2: download lxml, the parsing library Scrapy depends on; it can parse HTML and XML.
Once it has downloaded, run the following in cmd:
pip install <path to the downloaded .whl file>
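For example, if the lxml wheel were saved to your Downloads folder, the command would look something like the line below. The filename here is only a placeholder; use the actual name of the wheel you downloaded for your Python version and architecture.
pip install C:\Downloads\lxml-4.2.5-cp36-cp36m-win_amd64.whl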
Step 3: download pyOpenSSL, a cryptography-based security package for Python.
As before, install the downloaded wheel with pip in cmd.
Step 4: download Twisted, the framework Scrapy uses for asynchronous networking.
Again, install the downloaded wheel with pip in cmd.
Step 5: download pywin32, which gives Python access to the Windows APIs (it can simulate keyboard input, among other things).
After downloading, just click Next through the installer; it detects your Python installation automatically.
Final step: install Scrapy
pip install scrapy
Once the installation finishes, type scrapy in cmd to check whether everything worked; if a list of available commands is printed, the installation was successful.
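You can also print the installed version directly with Scrapy's built-in version command:
scrapy version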
Before writing code, here is the overall workflow of this Scrapy project:
Crawl the first page
Extract the content and the link to the next page
Save the scraped results
Follow that link and repeat for each subsequent page
OK, time for the hands-on part.
First, run the following on the command line:
scrapy startproject quotetutorial
This creates a new Scrapy project. Since we will practice on the official demo site http://quotes.toscrape.com, we name the project quotetutorial.
Then, inside the quotetutorial folder, run:
scrapy genspider quotes quotes.toscrape.com
This generates a spider named quotes and restricts crawling to the domain quotes.toscrape.com.
With that, the spider has been created.
OK, next open the project in an editor. I'll use PyCharm for the demo, but any other editor or IDE works just as well.
After opening the newly created project, you will see a directory layout like this:
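For reference, a freshly generated project with the spider we just created typically looks like this:
quotetutorial/
    scrapy.cfg
    quotetutorial/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            quotes.py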
quotes.py is where the core crawling code lives, the spider entry class.
items.py defines the data structure used to store what we scrape.
pipelines.py does further processing on the scraped results, such as saving them to a database or doing text processing.
settings.py holds default paths and other configuration, for example the database settings.
middlewares.py defines the middlewares we use to hook into request and response handling; a minimal sketch follows below, just for illustration.
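We don't actually write any middleware in this tutorial, but to give an idea of what goes into middlewares.py, here is a minimal, hypothetical downloader middleware that just sets a custom User-Agent header on every request (the class name and header value are made up for illustration):
class CustomUserAgentMiddleware(object):
    # hypothetical example, not part of this project
    def process_request(self, request, spider):
        # a downloader middleware can modify every outgoing request here
        request.headers['User-Agent'] = 'Mozilla/5.0 (quotetutorial demo)'
        return None  # returning None lets Scrapy continue processing the request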
Let's start with quotes.py.
Our goal is to scrape each quote's text, its author, and its tags.
Inspect the page elements to find the tags we need: each quote sits in an element with class quote, the text has class text, the author has class author, and the tags are the .tag elements inside .tags.
# -*- coding: utf-8 -*-
import scrapy

from quotetutorial.items import QuoteItem


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # each quote block on the page has the CSS class "quote"
        quotes = response.css('.quote')
        for quote in quotes:
            item = QuoteItem()
            text = quote.css('.text::text').extract_first()
            author = quote.css('.author::text').extract_first()
            tags = quote.css('.tags .tag::text').extract()
            item['text'] = text
            item['author'] = author
            item['tags'] = tags
            yield item
        # follow the "next page" link; on the last page there is no such link,
        # so only schedule a new request when one exists
        next_page = response.css('.pager .next a::attr(href)').extract_first()
        if next_page:
            url = response.urljoin(next_page)
            yield scrapy.Request(url=url, callback=self.parse)
A quick note on quote.css('.text::text').extract_first() versus quote.css('.tags .tag::text').extract(): they are the rough equivalents of find and find_all in BeautifulSoup. extract_first() returns only the first match, while extract() returns all matches, which is why a multi-valued field like tags uses quote.css('.tags .tag::text').extract().
Scrapy also provides a shell for interactive work.
In the Terminal, run
scrapy shell quotes.toscrape.com
This drops you into an interactive command-line session where you can experiment with the response object.
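For example, you can try the same selectors the spider uses; the output will look roughly like this (the exact text depends on the page):
>>> response.css('.quote .text::text').extract_first()
'"The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking."'
>>> response.css('.quote .tags .tag::text').extract()
['change', 'deep-thoughts', 'thinking', 'world']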
To save the crawl output, run
scrapy crawl quotes -o quotes.json
and the results are written to a JSON file. You can also run scrapy crawl quotes -o quotes.csv to save CSV; many other output formats are supported, so I won't list them all (a couple more examples follow below).
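For reference, two of the other built-in feed formats:
scrapy crawl quotes -o quotes.jl
scrapy crawl quotes -o quotes.xml
The first writes JSON lines (one item per line), the second writes XML.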
items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class QuoteItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import pymongo
from scrapy.exceptions import DropItem


class TextPipeline(object):
    """Truncate overly long quote texts and drop items without text."""

    def __init__(self):
        self.limit = 50

    def process_item(self, item, spider):
        if item['text']:
            if len(item['text']) > self.limit:
                item['text'] = item['text'][0:self.limit].rstrip() + "..."
            return item
        else:
            # DropItem is an exception and must be raised, not returned
            raise DropItem('Missing Text')


class MongoPipeline(object):
    """Store every item in MongoDB, one collection per item class."""

    def __init__(self, mongo_url, mongo_db):
        self.mongo_url = mongo_url
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # read the connection settings defined in settings.py
        return cls(
            mongo_url=crawler.settings.get('MONGO_URL'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_url)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        name = item.__class__.__name__
        self.db[name].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()
settings.py
# -*- coding: utf-8 -*-
# Scrapy settings for quotetutorial project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://doc.scrapy.org/en/latest/topics/settings.html
# https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html
ITEM_PIPELINES = {
    'quotetutorial.pipelines.TextPipeline': 300,
    'quotetutorial.pipelines.MongoPipeline': 400,
}
BOT_NAME = 'quotetutorial'
SPIDER_MODULES = ['quotetutorial.spiders']
NEWSPIDER_MODULE = 'quotetutorial.spiders'
MONGO_URL = 'localhost'
MONGO_DB = 'quotetutorial'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'quotetutorial (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'quotetutorial.middlewares.QuotetutorialSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'quotetutorial.middlewares.QuotetutorialDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
# 'quotetutorial.pipelines.QuotetutorialPipeline': 300,
#}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
Finally, run scrapy crawl quotes in the Terminal.
Wait for the crawl to finish and you're done; you can check the results in MongoDB, and the Terminal will also print output as the crawl runs.
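If you want to double-check the MongoDB side, a quick sketch like the one below works, assuming MongoDB is running locally with the MONGO_URL and MONGO_DB values from settings.py; the collection is named QuoteItem because MongoPipeline uses the item class name:
import pymongo

# connect with the same values as MONGO_URL / MONGO_DB in settings.py
client = pymongo.MongoClient('localhost')
db = client['quotetutorial']

# MongoPipeline stores items in a collection named after the item class
print(db['QuoteItem'].count_documents({}))  # requires pymongo 3.7+; number of saved quotes
print(db['QuoteItem'].find_one())           # peek at one document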