爬虫相关知识

2020-07-08

python爬虫

Word count: 7.2k | Reading time≈ 30 min

本文主要对爬虫相关知识进行梳理、回顾，以及一些常见情况的掌握。

爬虫知识

1. Scrapy框架原理

（1）Engine首先打开一个网站，找到处理该网站的Spider，并向该Spider请求第一个要爬取的URL
（2）Engine从Spider中获取到第一个要爬取的URL，并通过Scheduler以Request的形式调度
（3）Engine向Scheduler请求下一个呀哦爬取的URL
（4）Scheduler返回下一个要爬取的URL给Engine，Engine将URL通过Downloader Middlewares转发给Downloader下载
（5）一旦页面下载完毕，Downloader生成该页面的Response，并将其通过Downloader Middlewares发送给Engine
（6）Engine从下载器中接收到Response，并将其通过Spider Middlewares发送给Spider处理
（7）Spider处理Response，并返回提取到的item及新的Request给Engine
（8）Engine将Spider返回给item给Item Pipeline，将新的Request给Scheduler
（9）重复第（2）~（8）步，直到Scheduler中没有更多的Request，Engine关闭该网站，爬取结束

2. 反爬手段

遵守robots协议：网络爬虫排除标准，也叫爬虫协议、机器人协议，用来告诉爬虫或者搜索引擎那些页面可以抓取，那些不可抓取。
限制访问频率
验证码
多账号反爬
代理池
分布式爬虫
cookies
移动端app爬取
PhantomJS/Selenium模拟浏览器

3. 常见反爬虫应对方法

通过Headers反爬虫
基于用户行为反爬虫
动态页面的反爬虫
通过前端css和HTML标签混淆干扰
- 自定义自提干扰（汽车之家帖子、猫眼电影评分）
- 伪元素隐藏（汽车之家、美团价格）
- html标签干扰
- 字体替换（去哪儿m版酒店价格）
投毒喂毒反爬虫

4. 投毒反爬策略

从链接发现下手可以搞一些正常的url pattern 的页面链接做成隐藏链接正常用户看不到（比如白底白字，或者被遮盖）但是大部分爬虫就爬进去了然后就对这些ip做惩罚就行了比如www.xxx.com?id=12345 大部分id参数是正常页面把一些id做成只有隐藏链接的假id 监控这些页面被访问的ip 然后收割

反爬分级：

青铜反爬：请求头上做点基本的检验
白银反爬：限制访问频率，封 IP，跳图形验证码，infinite redirect loop，需要登录
黄金反爬：前端 js script 实时计算 parameter 加给请求在后端进行验证
铂金反爬：翻页使用 ajax ，没有 pagination ，数据以 json 形式在前端异步加载，验证参数不正确随机截断 html
钻石反爬：异地登录需要手机短信验证，拖动、拼图等各类人机验证码
王者反爬：前端写的代码让你的爬虫在 parse html 阶段的测试，难度上升为 nlp

5. urllib与requests

（1）urllib

（2）requests

6. 正则表达式

正则表达式常用的匹配规则

模式	描述
	\*一般字符***
.	匹配除换行符”\n”和”\r”之外的任意字符，在re.S模式下则能匹配任意字符
\	转义字符，使下一个字符标记为或特殊字符、或原义字符、或向后引用、或八进制转义符，如果原始字符串中含有`* . ? + $ ^ [ ] ( ) { }
[…]	字符集，用来表示一组字符，对应的位置可以是字符集中任意一个字符，字符集中的字符可以逐个列出，也可以给出范围如[abc]或[a-c]，所有的特殊字符在字符集中都失去本原有的含义，需要加\进行转义；常见的字符集[0-9]、[a-z]、[A-Z]、`^[\u4e00-\u9fa5]*$`
[^…]	在字符集内的开头加入非，表示匹配不在字符集内的其他任意字符，如[^abc]表示匹配除abc之外的任意字符
	\*预定义字符集（可以写在字符集[…]中）***
\d	匹配任意数字，等价于[0-9]
\D	匹配任意非数字的字符，等价于[^\d]
\s	匹配空白字符，等价于[\t\n\r\f]，如\t(Tab)、\r(回车)、’ ‘(空格)、\f(换页符)、\n(换行符)、\v(垂直制表符)
\S	匹配非空白字符，等价于[^\s]
\w	匹配字母数字及下划线，等价于[A-Za-z0-9_]
\W	匹配非字母数字及下划线的其他字符，等价于[^A-Za-z0-9_]
	\*数量词（用在字符或(…)之后）***
*	匹配前一个字符、子表达式0次或多次，例如`abc`能匹配ab，也能匹配abc、abcc，等价于{0,}
+	匹配前一个字符、子表达式1次或多次，例如`abc+`，能匹配abc、abcc，但是不能匹配到ab，+号前的字符至少要匹配一次，+等价于{1,}
?	匹配前一个字符、子表达式0次或1次，例如`do(es)?`可以匹配到do或does，?等价于{0,1}
{n}	精确匹配前一个字符、子表达式n次，例如`ab{2}c`可以匹配到abbc，n为非负整数
{n,}	匹配前一个字符、子表达式至少n次(即[n,+∞])，例如`ab{2,}c`可以匹配到abbc或abbbbbbbc，n为非负整数
{,n}	匹配前一个字符、子表达式至多n次(即[0,n])，例如`ab{,2}c`可以匹配到ac、abc或abbc，n为非负整数
{n,m}	匹配前一个字符、子表达式最少匹配n次，最多m次(即[n,m])，例如`ab{1,2}c`可以匹配到abc或abbc，n, m为非负整数，且n<=m
`*?, +?, ??`	默认情况下、+和?的匹配模式是贪婪模式，即会尽可能对的匹配符合规则的字符，?、+?和??表示启用对应的非贪婪模式。如对于字符串”Pythonnn”，正则表达式Python+能匹配大整个字符串，而Python+?则匹配Python
`{n,m}?`	同上，启用非贪婪模式，即只匹配n次，n, m为非负整数，且n<=m
	\*边界匹配（不消耗待匹配字符串中的字符）***
^	匹配字符串的开头，在多行模式下(re.M)匹配每一行的开头，如^abc可以匹配abc
$	匹配字符串的末尾，在多行模式下(re.M)匹配每一行的末尾，如abc$可以匹配abc
\A	仅匹配字符串开头，如\Aabc可以匹配abc
\Z	仅匹配字符串末尾，如果存在换行，只匹配到换行前的结束字符串，如abc\Z可以匹配abc
\z	仅匹配字符串末尾，如果存在换行，同时还会匹配到换行符
\b	匹配单词边界，也就是指单词和空格间的位置，如`er\b`可以匹配到never中的er，但不能匹配到verb中的er
\B	匹配非单词边界，也就是指单词和空格间的位置，如`er\B`可以匹配到verb中的er，但不能匹配到never中的er，等价于[^\b]
\G	匹配最后匹配完成的位置
	\*逻辑、分组***
`	`
(…)	匹配圆括号中的正则表达式，或者指定一个子组的开始和结束位置，被括起来的表达式将作为分组，从表达式的左边开始每遇到一个分组的左括号’(‘，编号+1；另外分组表达式作为一个整体，后面可以接数量词；表达式中的`
`(?P<name>...)`	给分组命名，除了愿有你的编号外在指定一个额外的别名，通过分组名字name既可以访问到子组匹配的字串，例如`(?P<id>abc){2}`能够匹配到abcabc
`\<number>`	引用序号为`<number>`对应的子组所匹配到的字符串，子组的序号从1开始计算；如果序号以0开头，或者3个数字的长度，那么不会被引用对应的子组，而是用于匹配八进制数字所表示的ASCII码值所对应的字符。例如`(.+) \1`会匹配”python python” 或 “66 66”，但不会匹配holysll”
`(?P=name)`	引用别名为`<name>`的分组匹配到的字符串，如`(?P<id>\d)abc(?P=id)`能够匹配到1abc1、5abc5
	\*特殊构造（不作为分组）***
`(?:...)`	(…)的不分组版本，用于使用`
`(?aiLmsux)`	aiLmsux的每个字符代表一种匹配模式，`(?` 后可以紧跟着 ‘a’，’i’，’L’，’m’，’s’，’u’，’x’ 中的一个或多个字符，只能在正则表达式的开头使用，如`(?i)abc`匹配模式是忽略大小写，能够匹配abc、Abc、aBc、abC、ABc、AbC、aBC、ABC
`(?#...)`	#后的内容将作为主食被忽略，如`abc(?#comment)123`能够匹配abc123
`(?=...)`	之后的字符串内容需要匹配表达式才能匹配成功，不消耗字符创的内容。如`a(?=\d)`能匹配后面全是数字的a
`(?!...)`	之后的字符串内容需要不匹配表达式才能匹配成功，不消耗字符创的内容。如`a(?!\d)`能匹配后面不是数字的a
`(?<=...)`	之前的字符串内容需要匹配表达式才能匹配成功，不消耗字符创的内容。如`a(?<=\d)`能匹配前面是数字的a
`(?<!...)`	之前的字符串内容需要不匹配表达式才能匹配成功，不消耗字符创的内容。如`a(?<!\d)`能匹配后面不是数字的a
`(?(id/name)yes-pattern	no-pattern)`

正则表达式的匹配模式：

flags	描述
re.I(IGNORECASE)	忽略大小写，使匹配对大小写不敏感
re.L(LOCALE)	做本地化识别（locale-aware）匹配
re.M(MULTILINE)	多行匹配模式，影响^ 和 $
re.S(DOTALL)	使 . 匹配包括换行在内的所有任意字符
re.U	根据Unicode字符集解析字符，这个标志影响\w、\W、\b、\B、\d、\D、\s、\S
re.X(VERBOSE)	该标志通过给予更灵活的格式以便将正则表达式更易于理解，详细表达式
re.A	只匹配ASCII字符

re分组匹配对象的方法：

方法	描述
group([group1, …])	用于获得一个或者多个分组匹配的字符串，当要获得整个匹配的子串时，使用group()或group(0)；groups()等价于(group(1), group(2), …)
start([group])	用于获取分组匹配的子串在整个字符串的起始位置（子串第一个字符的索引），参数值默认为0
end([group])	用于获取分组匹配的子串在整个字符串的结束位置（子串最后一个字符的索引+1），参数值默认为0
span([group])	返回(start(group), end(group))；span(0) 返回匹配成功的整个子串的索引；span(1) 返回第一个分组匹配成功的子串的索引

re模块中一些重要的函数：

search：扫描整个字符串并返回第一个成功的匹配，匹配成功返回一个匹配的对象，否则返回None。

search()函数语法：re.search(pattern, string[, flags]) 其中，参数pattern是匹配的正则表达式；参数string是需要匹配的字符串；参数flags是标志位，用于控制正则表达式的匹配模式。

import re

res = re.search(r'www\.(.*)\.(.{3})', 'www.baidu.com', re.I)  # 忽略大小写

if res:
    print(res.groups())  # 从分组1算起

    print("分组0：")
    print(res.group())
    print(res.group(0))

    print("分组1：")
    print(res.group(1))
    print(res.start(1))
    print(res.end(1))
    print(res.span(1))

    print("分组2：")
    print(res.group(2))
    print(res.start(2))
    print(res.end(2))
    print(res.span(2))

# 结果
'''
('baidu', 'com')
分组0：
www.baidu.com
www.baidu.com
分组1：
baidu
4
9
(4, 9)
分组2：
com
10
13
(10, 13)
'''

match：尝试从字符串的起始位置匹配一个模式，如果不是起始位置匹配成功的话，match()就返回none。

match()函数语法：re.match(pattern, string[, flags])其中，参数pattern是匹配的正则表达式；参数string是需要匹配的字符串；参数flags是标志位，用于控制正则表达式的匹配模式。

import re

res = re.search(r'baidu', 'www.baidu.com', re.I)
if res:
    print("search匹配成功")
    print(res.group())
else:
    print("search匹配失败")

rem = re.match(r'baidu', 'www.baidu.com', re.I)
if rem:
    print("match匹配成功")
    print(rem.group())
else:
    print("match匹配失败")

# 结果
'''
search匹配成功
baidu
match匹配失败
'''

compile：用于编译正则表达式，生成一个正则表达式(pattern)对象，供match()和search()这两个函数使用。

compile()函数语法：re.compile(pattern[, flags])其中，参数pattern是匹配的正则表达式；可选参数flags是匹配模式，用于控制正则表达式的匹配模式。

import re

pattern = re.compile(r'\w+', re.I)
res = pattern.search('www.baidu.com')
if res:
    print("compile search匹配成功")
    print(res.group(0))
    print(res.start(0))
    print(res.end(0))
    print(res.span(0))
else:
    print("compile search匹配失败")

rem = pattern.match('www.baidu.com', 4, 8)
if rem:
    print("compile match匹配成功")
    print(rem.group(0))
    print(rem.start(0))
    print(rem.end(0))
    print(rem.span(0))
else:
    print("compile match匹配失败")

# 结果
'''
compile search匹配成功
www
0
3
(0, 3)
compile match匹配成功
baidu
4
9
(4, 9)
'''

findall：在字符串中找到正则表达式所匹配的所有子串，并返回一个列表，如果没有找到匹配的，则返回空列表。match 和 search 是匹配一次 findall 匹配所有。

findall()函数语法：re.findall(pattern, string[, flags])其中，参数pattern是匹配的正则表达式；参数string是需要匹配的字符串；参数flags是标志位，用于控制正则表达式的匹配模式。

import re

res = re.findall(r'\w+', 'Hello World!', re.I)
print(res)  # 返回的是一个列表

# 结果
'''
['Hello', 'World']
'''

# 通过compile编译正则表达式
pattern = re.compile(r'\w+', re.I)
rec = re.findall(pattern, 'Hello World!')
print(rec)

# 结果
'''
['Hello', 'World']
'''

sub：将字符串中与模式pattern匹配的子串都替换为repl。

sub()函数语法：re.sub(pattern, repl, string[, count=0, flags])其中，参数pattern是匹配的正则表达式；参数repl是替换的字符串，也可以是一个函数；参数string是需要匹配的字符串；可选参数count是模式匹配后替换的最大次数，默认0表示替换所有的匹配项；可选参数flags是标志位，用于控制正则表达式的匹配模式。

import re

res0 = re.sub(r'\D', '', 'abc123#￥@', 0, re.I)  # 把非数字部分删除
res1 = re.sub(r'\D', '', 'abc123#￥@', 1, re.I)
res2 = re.sub(r'\D', '', 'abc123#￥@', 3, re.I)
print(res0)
print(res1)
print(res2)

# 结果
'''
123
bc123#￥@
123#￥@
'''

# 通过compile编译正则表达式
pattern = re.compile(r'\D', re.I)
res = re.sub(pattern, '', 'abc123#￥@', 0)  # 这时，不能传入flags匹配模式，而要在compile里传入
print(res)
# 结果
'''
123
'''

# repl是一个函数时，将匹配到的数字乘以2
def double(num):
    print(num.group('value'))
    value = int(num.group('value'))
    return str(value * 2)

result = re.sub(r'(?P<value>\d+)', double, 'a1B23C456d7890')
print(result)

# 结果
'''
1
23
456
7890
a2B46C912d15780
'''

split：按照能够匹配的子串将字符串分割后返回列表。

split()函数语法：re.split(pattern, string[, maxsplit=0, flags=0])其中，参数pattern是匹配的正则表达式；参数string是需要匹配的字符串；可选参数maxsplit是分割次数，默认为0表示不限制次数；可选参数flags是标志位，用于控制正则表达式的匹配模式。

import re

res1 = re.split('\W+', 'www.baidu.com')
res2 = re.split('(\W+)', 'www.baidu.com')
res3 = re.split('\W+', 'www.baidu.com', 1)  # 分割一次
res4 = re.split('p*', 'www.baidu.com')  # 匹配不到的正则表达式
print(res1)
print(res2)
print(res3)
print(res4)

# 结果
'''
['www', 'baidu', 'com']
['www', '.', 'baidu', '.', 'com']
['www', 'baidu.com']
['', 'h', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd', '']
'''

escape：将字符串中所有的特殊字正则表达式字符转义，当大量主要加反斜杠\进行转义时，这个函数很有用，避免一些不必要的错误，实际上功能有点类似正则表达式前面加r。

escape()函数语法：re.escape(pattern)

import re

res = re.escape('www.python.org')
print(res)

result = re.findall(re.escape('.py'), "python www.python.org proxy.py")
print(result)

# 结果
'''
www\.python\.org
['.py', '.py']
'''

手写正则邮箱地址：

import re

email_addr = 'Please reply this email to <abc23@sample.com.cn>'
# tool.chinaz.com给出的是：\w[-\w.+]*@([A-Za-z0-9][-A-Za-z0-9]+\.)+[A-Za-z]{2,14}
pattern = re.compile(r'([\w\.-]+)@([\w\.-]+)(\.[\w\.]+)')
res = re.search(pattern, email_addr)
if res:
    print(res.group)
else:
    print("匹配失败")

# 结果
'''
abc23@sample.com.cn
'''

常用正则表达式

目标	表达式
中文字符	`[\u4e00-\u9fa5]`
双字节字符(包括汉字在内)	`[^\x00-\xff]`
空白行	`\n\s*\r`
Email地址	`\w[-\w.+]*@([A-Za-z0-9][-A-Za-z0-9]+\.)+[A-Za-z]{2,14}`
网址URL	`[a-zA-z]+://[^\s]*` 或 `^((https
国内电话号码	`[0-9-()（）]{7,18}`
国内手机号码	`0?(13
QQ号码	`[1-9]([0-9]{5,11})`
邮政编码	`\d{6}`
身份证号	`^(\d{6})(\d{4})(\d{2})(\d{2})(\d{3})([0-9]
日期格式	`\d{4}(-
IP地址	`(25[0-5]
整数	`-?[1-9]\d*`
正整数	`[1-9]\d*`
负整数	`-[1-9]\d*`
正浮点数	`[1-9]\d.\d
负浮点数	`-([1-9]\d.\d
用户名(字母、数字、下划线、-、以及中文)	`[A-Za-z0-9_\-\u4e00-\u9fa5]+`

7. Selenium的使用

8. 验证码的识别

（1）图形验证码识别

最简单的图片验证码，利用OCR技术识别验证码，通过库tesserocr。

import tesserocr
from PIL import Image


# 识别验证码图片转换为文本
image = Image.open("Verify.jpg")
result = tesserocr.image_to_text(image)
print(result)


# 直接将验证码图片转换为字符串（效果不如上者）
print(tesserocr.file_to_text("Verify.jpg"))

如果图片存在干扰，需要去除噪声，如转灰度、二值化等操作，提高验证码识别的准确率。

import tesserocr
from PIL import Image

image = Image.open('code.jpg')
image = image.convert('L')  # 图片转灰度

# image = image.convert('1')  # 二值化处理

threshold = 127  # 指定二值化的阈值
table = []
for i in range(256):
    if i < threshold:
        table.append(0)
    else:
        table.append(1)

image = image.point(table, '1')
result = tesserocr.image_to_text(text)
print(result)

（2）极验滑动验证码识别

极验滑动验证码官网：http://www.geetest.com/，一般是点击滑动到缺口位置，或者生成滑块拖动路径，有效避免人际，犯伪造，防密集攻击了暴力识别，主要用到selenium库和配置ChromeDriver。

实现步骤：

模拟点击验证按钮（为了显示缺口）
识别滑动缺口位置（对照前后两张图片）
模拟滑动滑块（人拖动轨迹是先慢后快s=v0*t + 0.5*a*t*t和v=v0 + a*t，前面加速度a=2，后面为-3）

（3）点击选择验证码的识别

经典12306验证码，文字识别+图片识别，还有TouClick官网按顺序在验证图片找字的验证码。

互联网也有很多验证码服务平台，如超级鹰https://www.chaojiying.com，支持英文数字混合(20)、中文汉字(7)、纯英文(12)、纯数字(11)、任意特殊字符、坐标选择识别、计算题、选择题、问答题、选字、选物、选动物等。

（4）拖动旋转至正的验证码识别

（5）宫格连线验证码识别

找规律，C型、Z型、X型等等，箭头朝向遍历组合，全图匹配组合

9. 代理的使用

访问太频繁，导致IP被封禁。通过代理，伪装ip让服务器无法识别是本机发起的请求，免费的代理服务，如西刺http://www.xichidaili.com、66免费代理http://www.66ip.cn/、有代理IPhttp://www.youdaili.net/、快代理https://www.kuaidaili.com/free/等等，可以参考知乎这篇文章。

HTTP代理服务的端口是9743，SOCKS代理服务的端口是9742。

代理池的维护，搭建高可用的代理池，防止ip别人占用被封禁或代理服务器出现网络故障，提高爬虫效率。利用Redis的Sorted Set来存储抓取下来的代理，定时抓取各大代理网站的代理，定时检测数据库中的代理，评分与阈值淘汰制，链接数据库通过Web API提供接口服务，随机返回某个可代理的接口。

10. 模拟登陆（Headers与Cookies）

Headers里面主要包含Cookie、Host、Referer、User-Agent等

Form Data包含

commit: Sign in
utf8: ✔
authenticity_token: xxxxxxxxxxxxxxxxxxxxxxxxxxxxx
login: amdin
password: 123456

11. APP爬取

12. PySpider框架

PySpider框架的功能：

提供方便易用的WebUI系统，可视化地编写和调试爬虫
提供爬取进度监控、爬取结果查看、爬虫项目管理等功能
支持多种数据库，如MySQL、MongoDB、Redis、SQLite、Elasticsearch、PostgreSQL
支持多种消息队列，如RabbitMQ、Beanstalk、Redis、Kombu
提供优先级控制、失败重试、定时抓取等功能
对接了PhantomJS，可以抓取JavaScript渲染的页面
支持单击和分布式部署，支持Docker部署

PySpider与Scrapy比较：

PySpider提供了WebUI可视化，爬虫的编写和调试都可以在WebUI中进行的；而Scrapy原生的是不具备这个功能的，它采用的是代码和命令操作，但可以通过对接Portia实现可视化配置。
PySpider调试非常方便，WebUI可视化界面直观；而Scrapy是使用parse命令进行调试，其方便程度不及PySpider。
PySpider支持PhantomJS来进行JavaScript渲染页面的采集；而Scrapy可以对接Scrapy-Splash组件，这需要额外牌子。
PySpider中内置了pyquery作为选择器；而Scrapy对接了XPath、CSS选择器和正则匹配。
PySpider的可扩展程度不足，可配置化程度不高；而Scrapy可以通过Middleware、Pipeline、Extension等组件实现非常强大的功能，模块间耦合程度低，可扩展程度极高。

13. Scrapy框架

Scrapy框架是基于Twisted的一部处理框架，纯python实现，模块间耦合度极低，可扩展性极强。框架包含几个部分：

Engine：引擎，处理整个系统的数据流处理、触发事务，是整个框架的核心。
Item：项目，它定义了爬取结果的数据结构，爬取的数据会被赋值成该Item对象。
Scheduler：调度器，接受引擎发过来的请求并将其加入队列中，在引擎再次请求的时候将请求提供给引擎。
Downloader：下载器，下载网页内容，并将网页内容返回给spider。
Spiders：蜘蛛，其内定义了爬取的逻辑和网页的解析规则，它主要负责解析相应并生成提取结果和新的请求。
Item Pipeline：项目管道，负责处理由蜘蛛从网页抽取的项目，它的主要任务是清洗、验证和存储数据。
Downloader Middlewares：下载器中间件，位于引擎和下载器的链接框架，主要处理引擎与下载器之间的请求及响应。
Spider Middlewares：蜘蛛中间件，位于引擎和蜘蛛之间的链接框架，主要处理向蜘蛛输入的响应和输出的结果及新的请求。

搭建Scrapy项目

# 创建项目
scrapy startproject test

# 创建一个Spider
cd test
scrapy genspider baidu www.baidu.com  # 网址是所要爬取网页的域名

import scrapy

class BaiduSpider(scrapy.Spider):
    name = "baidu"
    allowed_domains = ["www.baidu.com"]
    start_urls = ["http://www.baidu.com"]

    def parse(self, response):  # 这里提取方式用CSS选择器或者XPath选择器
        res = response.css('.quote')
        for s in res:
            item = BaiduItem()
            text = s.css('.text::text').extract_first()
            author = s.css('.author::text').extract_first()
            tags = s.css('.tags tag::text').extract()
            yield item

# 创建 Item，定义类型为scrapy.Field
import scrapy

class BaiduItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()

# 找到后续的Request（即翻页）
next = response.css('.pager .next a::attr(href)').extract_first()
url = response.urljoin(next)
yield scrapy.Request(url=url, callback=self.parse)

# 运行项目
scrapy crawl baidu

# 保存文件
scrapy crawl baidu -o baidu.jl  # 保存为一行json
scrapy crawl baidu -o baidu.csv  # 保存为csv
scrapy crawl baidu -o baidu.xml  # 保存为xml

# 使用Item Pipeline实现保存到数据库，修改pipeline.py
from scrapy.exceptions import DropItem
import pymongo

class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri = crawler.setting.get("MONGO.URI"),
            mongo_db = crawler.setting.get("MONGO.DB")
        )

    def open_spider(self, spider):  # 初始化操作
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        name = item.__class__.__name__
        self.db[name].insert(dict(item))
        return item

    def close_spider(self, spider):  # 关闭数据库链接
        self.client.close()

# 在setting.py中加入
ITEM_PIPELINES = {
    'test.pipilines.TextPipeline': 300,  # 保存为txt
    'test.pipilines.MongoPipeline': 400,  # 保存到mongodb
}
MONGO_URI = 'localhost'
MONGO_DB = test

项目目录结构：

myMpider/
    scrapy.cfg
    mySpider/
        __init__.py
        items.py
        pipelines.py
        settings.py
        middlewares.py
        spiders/
            __init__.py
            ...

scrapy.cfg：它是Scrapy项目的配置文件，其内定义了项目的配置文件路径、部署相关信息等内容。
item.py：它定义Item数据结构，所有的Item的定义Item Pipeline。为了定义通用输出数据格式，Scrapy提供了Item类。Item对象是用于收集抓取数据的简单容器。它们提供类似字典的API，并具有用于声明其可用字段的方便语法。

# -*- coding: utf-8 -*-
# Define here the models for your scraped items
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy
from scrapy.item import Item, Field

class PropertiesItem(scrapy.Item):
    # Primary fields
    title = Field()
    price = Field()
    description = Field()
    address  = Field()
    image_urls = Field()

    # Calculated fields
    images = Field()
    location = Field()

    # Housekeeping fields
    url = Field()
    project = Field()
    spider = Field()
    server = Field()
    date = Field()

pipelines.py：它定义Item Pipeline的实现，所有的Item Pipeline的实现Item Pipeline。主要负责处理Spider中获取到的Item，并进行（详细分析、过滤、存储等）后期处理。项目管道的典型用途是：
- 项目管道的典型用途是
- 验证已删除的数据（检查项目是否包含某些字段）
- 检查重复项（并删除它们）
- 将已删除的项目存储在数据库中

# 几种项目管道的方法：
def process_item(self, item, spider):
    return item   #返回的数据类型是dict类型的

def open_spider(self, spider):
        self.file = open('items.jl', 'w')

def close_spider(self, spider):
        self.file.close()

# 把item保存为json
import json

class JsonWriterPipeline(object):

    def open_spider(self, spider):
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item

# 保存到MongoDB数据库
import pymongo

class MongoPipeline(object):

    collection_name = 'scrapy_items'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db[self.collection_name].insert_one(dict(item))
        return item

settings.py：它定义项目的全局配置。Scrapy设置允许您自定义所有Scrapy组件的行为，包括核心，扩展，管道和爬虫本身。设置的基础结构提供了键值映射的全局命名空间，代码可以使用该命名空间从中提取配置值，可以通过不同的机制填充设置。
middlewares.py：它定义Spider Middlewares和Downloader Middlewares的实现。Spider Middlewares是一个可以自定扩展和操作引擎和Spider中间通信的功能组件。Downloader Middlewares是一个可以自定义扩展下载功能的组件。

# -*- coding: utf-8 -*-
# Define here the models for your spider middleware
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals

class PropertiesSpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Response, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn’t have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

class PropertiesDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

spiders：其内包含一个个Spider的实现，每个Spider都有一个文件。Scrapy用他来从网页里抓取内容，并解析抓取的结果。

# -*- coding: utf-8 -*-
import scrapy


class MyspiderSpider(scrapy.Spider):
    name = 'myspider'
    allowed_domains = ['www.baidu.com']
    start_urls = ['http://www.baidu.com/']

    def parse(self, response):
        pass

14. 分布式爬虫

Copyright： Copyright is owned by the author. For commercial reprints, please contact the author for authorization. For non-commercial reprints, please indicate the source.

爬虫知识

1. Scrapy框架原理

2. 反爬手段

3. 常见反爬虫应对方法

4. 投毒反爬策略

5. urllib与requests

（1）urllib

（2）requests

6. 正则表达式

7. Selenium的使用

8. 验证码的识别

（1）图形验证码识别

（2）极验滑动验证码识别

（3） 点击选择验证码的识别

（4） 拖动旋转至正的验证码识别

（5）宫格连线验证码识别

9. 代理的使用

10. 模拟登陆（Headers与Cookies）

11. APP爬取

12. PySpider框架

13. Scrapy框架

14. 分布式爬虫

（3）点击选择验证码的识别

（4）拖动旋转至正的验证码识别