A Detailed Beginner's Guide to Scrapy: From Installation to a Working Crawler Demo


csroad: 看网云空间 provides free sites for testing and learning

2019-08-05


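Source code on Gitee: https://gitee.com/lizhaojie/Scrapy.git

Preparation: a few libraries need to be downloaded first. I had collected the download links, but CSDN holds any post with more than five links for review, so I have left them out. Search for each project's official site and download from there.

This guide uses Python 3.6.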

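Step 1: install wheel.

pip install wheel

Step 2: download lxml, a parsing library that Scrapy depends on. It parses HTML and XML.

Once it is downloaded, run this in cmd:

pip install <path to the downloaded file>

Step 3: download pyOpenSSL, a Python wrapper around the OpenSSL cryptography library.

Install it from cmd in the same way.

Step 4: download Twisted, which Scrapy uses for asynchronous networking.

Install it from cmd in the same way.

Step 5: download pywin32, which gives Python access to Windows APIs (simulating keyboard input is one of its uses).

It ships as an installer: click Next all the way through, and it detects your Python installation automatically.

Final step: install Scrapy.

pip install scrapy

When the installation finishes, type scrapy in cmd to check it. If it prints a list of available commands, the installation succeeded.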
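As a complement to typing scrapy in cmd, the dependencies can also be checked from Python. This is a minimal sketch using only the standard library; the list of import names is my assumption about how the packages above are imported (pyOpenSSL is imported as OpenSSL, for example), and the helper name check_installed is made up for illustration.

```python
import importlib.util

# Import names assumed for the packages installed above.
REQUIRED = ["lxml", "OpenSSL", "twisted", "scrapy"]


def check_installed(names):
    """Return {import_name: True/False} depending on whether Python can find it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, ok in check_installed(REQUIRED).items():
        print(f"{name:10s} {'OK' if ok else 'MISSING'}")
```

Any package reported as MISSING needs to be installed again before Scrapy will run.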

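First, the overall flow of a Scrapy crawl:

1. Fetch the first page.

2. Extract the content and the link to the next page.

3. Save the scraped results.

4. Follow the link and repeat on the next page.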
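The flow above can be sketched in plain Python before bringing Scrapy in. This is only an illustration of the loop, not how Scrapy is implemented: the in-memory FAKE_SITE dictionary, the fetch helper, and the page paths are all made up, and no network access happens.

```python
# Hypothetical in-memory "site": each page has some items and an optional next link.
FAKE_SITE = {
    "/page/1/": {"items": ["quote A", "quote B"], "next": "/page/2/"},
    "/page/2/": {"items": ["quote C"], "next": None},
}


def fetch(url):
    """Stand-in for an HTTP download."""
    return FAKE_SITE[url]


def crawl(start_url):
    results = []
    url = start_url
    while url is not None:
        page = fetch(url)              # fetch the current page
        results.extend(page["items"])  # extract the content and save it
        url = page["next"]             # follow the next-page link, if there is one
    return results


print(crawl("/page/1/"))
```

In Scrapy the same loop is expressed differently: the parse callback yields items and a new request for the next page, and the framework does the fetching and scheduling.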

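OK, on to the hands-on part.

First, run this on the command line:

scrapy startproject quotetutorial

This creates a new Scrapy project. We will practice on the official demo site, http://quotes.toscrape.com, hence the project name quotetutorial.

Then, inside the quotetutorial folder, run:

scrapy genspider quotes quotes.toscrape.com

The first argument is the spider's name and the second is the domain to crawl.

That completes the creation of the spider.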

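Next, open the project in PyCharm. Any other IDE or editor works too; I use PyCharm for the demonstration.

The new project contains the following files:

quotes.py holds the core crawling code; the spider class here is the entry point.

items.py defines the data structure of the items we save.

pipelines.py post-processes the scraped results, for example storing them in a database or running word segmentation.

settings.py holds project settings such as default paths and properties, for example database configuration.

middlewares.py defines the middlewares we use to process requests and responses.

Start with quotes.py.

Our goal is to scrape each quote's text, author, and tags.

Use the browser's Inspect Element tool to find the tags to scrape.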

# -*- coding: utf-8 -*-
import scrapy

from quotetutorial.items import QuoteItem


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Each quote on the page sits in an element with class "quote".
        quotes = response.css('.quote')
        for quote in quotes:
            item = QuoteItem()
            text = quote.css('.text::text').extract_first()
            author = quote.css('.author::text').extract_first()
            tags = quote.css('.tags .tag::text').extract()
            item['text'] = text
            item['author'] = author
            item['tags'] = tags
            yield item

        # Follow the "next" link; the last page has none, so stop there.
        next_page = response.css('.pager .next a::attr(href)').extract_first()
        if next_page:
            url = response.urljoin(next_page)
            yield scrapy.Request(url=url, callback=self.parse)
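The spider calls response.urljoin because the next-page link is a relative href. As far as I know, it resolves the href against the URL of the current response, much like the standard library's urljoin. A small stand-alone sketch of that resolution follows; the href value /page/2/ is an assumed example of what the pager link looks like.

```python
from urllib.parse import urljoin

# URL of the page currently being parsed.
base = "http://quotes.toscrape.com/"

# Assumed example of the relative href found in the "next" link.
next_href = "/page/2/"

absolute = urljoin(base, next_href)
print(absolute)  # http://quotes.toscrape.com/page/2/
```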

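A note on quote.css('.text::text').extract_first() and quote.css('.tags .tag::text').extract():

They are comparable to find and find_all in BeautifulSoup: the first returns a single result, the second returns a list of all matches. A field with multiple values, such as tags, therefore uses quote.css('.tags .tag::text').extract().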
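To try the one-versus-many distinction without running a crawl, here is a stand-in that uses the standard library's xml.etree.ElementTree, whose find and findall split results the same way. This is not Scrapy's selector API, and the snippet is a made-up fragment shaped like one quote block.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed fragment shaped like one quote block on the practice site.
SNIPPET = """
<div class="quote">
  <span class="text">An example quote.</span>
  <small class="author">Some Author</small>
  <div class="tags">
    <a class="tag">life</a>
    <a class="tag">books</a>
  </div>
</div>
"""

quote = ET.fromstring(SNIPPET.strip())

# Single result: the first matching element (like extract_first).
first_text = quote.find(".//span[@class='text']").text

# Multiple results: every matching element, as a list (like extract).
all_tags = [a.text for a in quote.findall(".//a[@class='tag']")]

# No match: find gives None, findall gives an empty list.
missing = quote.find(".//span[@class='nope']")

print(first_text)
print(all_tags)
print(missing)
```

Scrapy behaves similarly when nothing matches: extract_first() returns None and extract() returns an empty list, so a missing field does not raise an error by itself.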

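Scrapy also provides an interactive shell.

In the Terminal, run:

scrapy shell quotes.toscrape.com

This opens an interactive session where you can experiment with the response object.

Running

scrapy crawl quotes -o quotes.json

saves the results as a JSON file, and scrapy crawl quotes -o quotes.csv saves them as CSV. Many other output formats are supported; I won't list them all.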
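Once the crawl has written quotes.json, the file can be read back with the standard json module, since the JSON feed is a single array of item objects. The sketch below counts quotes per author; so that it runs on its own, it writes a small made-up sample in the same shape to a temporary file instead of assuming a real crawl has run. The function name quotes_per_author is hypothetical.

```python
import json
import os
import tempfile
from collections import Counter


def quotes_per_author(path):
    """Load a JSON feed of quote items and count the quotes by author."""
    with open(path, encoding="utf-8") as f:
        items = json.load(f)  # a list of dicts with text, author, and tags
    return Counter(item["author"] for item in items)


# Made-up sample data with the same fields as QuoteItem.
sample = [
    {"text": "First quote.", "author": "Author A", "tags": ["life"]},
    {"text": "Second quote.", "author": "Author B", "tags": []},
    {"text": "Third quote.", "author": "Author A", "tags": ["books", "life"]},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "quotes.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample, f)
    counts = quotes_per_author(path)

print(counts.most_common())
```

To run it against real data, call quotes_per_author("quotes.json") from the project folder after the crawl.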

items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class QuoteItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()

pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import pymongo

from scrapy.exceptions import DropItem


class TextPipeline(object):
    """Truncate long quote texts; drop items that have no text."""

    def __init__(self):
        self.limit = 50

    def process_item(self, item, spider):
        if item['text']:
            if len(item['text']) > self.limit:
                item['text'] = item['text'][0:self.limit].rstrip() + "..."
            return item
        else:
            # DropItem is an exception, so it must be raised, not returned.
            raise DropItem('Missing Text')


class MongoPipeline(object):
    """Store every item in MongoDB, in a collection named after the item class."""

    def __init__(self, mongo_url, mongo_db):
        self.mongo_url = mongo_url
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Read the connection settings from settings.py.
        return cls(
            mongo_url=crawler.settings.get('MONGO_URL'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_url)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        name = item.__class__.__name__
        # insert_one replaces the deprecated Collection.insert.
        self.db[name].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()
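The truncation rule inside TextPipeline is easy to try on its own. The sketch below copies that rule into a plain function (the name truncate and the sample strings are made up) so its effect can be seen without Scrapy or MongoDB.

```python
def truncate(text, limit=50):
    """Same rule as TextPipeline: cut to `limit` characters, strip trailing
    whitespace, then append '...'. Shorter texts are returned unchanged."""
    if len(text) > limit:
        return text[:limit].rstrip() + "..."
    return text


short = "Short quote."
long_quote = "word " * 20  # 100 characters

print(truncate(short))
print(truncate(long_quote))
print(len(truncate(long_quote)))
```

Note that the result can be up to three characters longer than the limit, because the ellipsis is added after the cut.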

settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for quotetutorial project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

ITEM_PIPELINES = {
    'quotetutorial.pipelines.TextPipeline': 300,
    'quotetutorial.pipelines.MongoPipeline': 400,
}

BOT_NAME = 'quotetutorial'

SPIDER_MODULES = ['quotetutorial.spiders']
NEWSPIDER_MODULE = 'quotetutorial.spiders'

MONGO_URL = 'localhost'
MONGO_DB = 'quotetutorial'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'quotetutorial (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'quotetutorial.middlewares.QuotetutorialSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'quotetutorial.middlewares.QuotetutorialDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'quotetutorial.pipelines.QuotetutorialPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
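The numbers in ITEM_PIPELINES decide the order in which items pass through the pipelines: lower values run first, which is why TextPipeline (300) trims the text before MongoPipeline (400) stores it. The sketch below only sorts the dictionary to make that order visible; it does not show how Scrapy loads the classes.

```python
# Same mapping as in settings.py, deliberately written out of order.
ITEM_PIPELINES = {
    'quotetutorial.pipelines.MongoPipeline': 400,
    'quotetutorial.pipelines.TextPipeline': 300,
}

# Sort the class paths by their priority value, lowest first.
order = [path for path, priority in sorted(ITEM_PIPELINES.items(), key=lambda kv: kv[1])]

for position, path in enumerate(order, start=1):
    print(position, path)
```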

Finally, run scrapy crawl quotes in the Terminal.

Wait for the crawl to finish and you are done. You can check the results in MongoDB; the Terminal also prints the output.
