Punching Through CloudFlare's Five-Second Shield in Three Lines of Code

#Programming 2021-05-14 14:50:20


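Anyone who writes scrapers regularly will know CloudFlare's five-second shield. When a site is accessed by something other than a normal browser, it returns the following text: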

Checking your browser before accessing xxx.

This process is automatic. Your browser will redirect to your requested content shortly.

Please allow up to 5 seconds…

Even with a complete set of headers and a proxy IP, it still detects you. Let's look at an example.

Mountain View Whisman students sent home after children test positive for COVID-19

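Opened in a normal browser, this article looks like the figure below:

Looking at the raw page source, the news headline and body are right there in the HTML, which means they are rendered on the backend rather than loaded asynchronously, as shown below:

Now let's request the same page with requests, sending a complete set of request headers. The result is shown below: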

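The site recognized the scraper and blocked the request. Many people are stuck at this point. The scraper was blocked on its very first request, so the site is not checking the IP or the request rate, which is why a proxy IP does not help either. And if even a complete set of request headers gets detected, what other way is there around this check?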
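A scraper can at least recognize when it has been handed the challenge page instead of the article. The sketch below is a heuristic, not documented Cloudflare behavior: the marker strings come from the challenge text quoted earlier, while the 403/503 status codes and the name `looks_like_challenge` are assumptions made for illustration.

```python
# Heuristic check for the five-second-shield interstitial.
# Assumption: the challenge arrives with HTTP 403 or 503. The marker
# strings are taken from the challenge text quoted earlier in the post.
CHALLENGE_MARKERS = (
    "checking your browser before accessing",
    "please allow up to 5 seconds",
)
CHALLENGE_STATUS_CODES = {403, 503}


def looks_like_challenge(status_code: int, body: str) -> bool:
    """Return True when a response resembles the challenge page."""
    if status_code not in CHALLENGE_STATUS_CODES:
        return False
    lowered = body.lower()
    return any(marker in lowered for marker in CHALLENGE_MARKERS)
```

With a requests response object this would be called as `looks_like_challenge(resp.status_code, resp.text)`, so a blocked request can be logged or retried instead of being parsed as if it were the article.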

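The Free Five-Second Shield

In fact, getting past this five-second shield is very simple: all it takes is a third-party library called cloudscraper. It can be installed with pip: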

python3 -m pip install cloudscraper

Once it is installed, three lines of code are enough to get past CloudFlare's five-second shield:

import cloudscraper
scraper = cloudscraper.create_scraper()
resp = scraper.get('target URL').text

Let's use the same site as above:

import cloudscraper
from lxml.html import fromstring

scraper = cloudscraper.create_scraper()
resp = scraper.get('https://mv-voice.com/news/2021/05/04/mountain-view-whisman-students-sent-home-after-children-test-positive-for-covid-19').text
selector = fromstring(resp)
title = selector.xpath('//h1/text()')[0]
print(title)

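The result is shown below:

The shield is broken.

cloudscraper is very capable: it can get past every version of the five-second shield on CloudFlare's free plan. Its interface also matches requests, so code written against requests only needs requests.xxx changed to scraper.xxx.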
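Because the two interfaces match, fetching code can be written once against "anything with a requests-style get()" and handed either client. The following is a minimal sketch of that idea; `fetch_html` and `demo` are hypothetical helper names, and the 30-second timeout is an arbitrary choice.

```python
def fetch_html(session, url: str, timeout: float = 30.0) -> str:
    """Fetch a page with any requests-style session and return its HTML.

    `session` may be a requests.Session or the object returned by
    cloudscraper.create_scraper(); both are called the same way here.
    """
    resp = session.get(url, timeout=timeout)
    resp.raise_for_status()  # surface 4xx/5xx instead of returning an error page
    return resp.text


def demo() -> None:
    # Imported lazily so fetch_html itself has no hard dependency on cloudscraper.
    import cloudscraper

    scraper = cloudscraper.create_scraper()
    html = fetch_html(
        scraper,
        "https://mv-voice.com/news/2021/05/04/mountain-view-whisman-students-sent-home-after-children-test-positive-for-covid-19",
    )
    print(html[:200])
```

Swapping `scraper` for `requests.Session()` in `demo` is the only change needed to go back to plain requests, which makes it easy to compare the two against the same URL.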

The Paid Five-Second Shield

If a site has the paid version of the five-second shield enabled, Coinbase for example, scraping it with cloudscraper produces the following error:

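So is there currently any way around the paid version of CloudFlare's five-second shield?

The method is actually very simple: just run a container with Docker. The start command is: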

docker run -d \
  --name=flaresolverr \
  -p 8191:8191 \
  -e LOG_LEVEL=info \
  --restart unless-stopped \
  ghcr.io/flaresolverr/flaresolverr:latest

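Once the container is running, it listens on port 8191. We send HTTP requests to that port and have it forward them to the target site, which gets us past the five-second shield.

A concrete usage example: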

import requests
import json

url = "http://localhost:8191/v1"

payload = json.dumps({
  "cmd": "request.get",
  "url": "https://www.coinbase.com/ventures/content",
  "maxTimeout": 60000
})
headers = {
  'Content-Type': 'application/json'
}

response = requests.post(url, headers=headers, data=payload)

The API exposed by this Docker image returns JSON, and the page source is in its .solution.response field:

print(response.json()['solution']['response'])

The result is shown below:

Let's write a few more lines to extract the title:
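The original listing for this step is not reproduced here. As a minimal sketch of what such an extraction could look like, the code below takes the decoded JSON reply and pulls out the page's `<title>` text with the standard library. It is an illustration, not the author's code: the helper names are hypothetical, treating a `status` value of `"ok"` as success is an assumption about the reply format, and reading "the title" as the `<title>` element is also an assumption.

```python
from html.parser import HTMLParser


class _TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self._done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True  # ignore any later <title> elements

    def handle_data(self, data):
        if self._in_title:
            self.parts.append(data)


def title_from_flaresolverr(result: dict) -> str:
    """Extract the page <title> from a decoded FlareSolverr reply."""
    # Assumption: a successful reply carries status == "ok".
    if result.get("status") != "ok":
        raise RuntimeError(f"FlareSolverr failed: {result.get('message')}")
    html = result["solution"]["response"]  # page source, as described above
    parser = _TitleParser()
    parser.feed(html)
    return "".join(parser.parts).strip()
```

With the earlier request code, this would be called as `print(title_from_flaresolverr(response.json()))`. The lxml-based XPath approach from the free-tier example would work equally well on the same HTML string.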

Why can this container get past CloudFlare's five-second shield? The answer lies in the project behind it: FlareSolverr. You can read its source code to see how it does it.

via

One Trick a Day: How to Punch Through Cloud Flare's 5-Second Shield https://mp.weixin.qq.com/s/zwmatF3yTgSyS0gz8sinaA

VeNoMouS/cloudscraper: A Python module to bypass Cloudflare’s anti-bot page. https://github.com/venomous/cloudscraper

One Trick a Day: [Latest] Breaking Through the Paid CloudFlare Five-Second Shield Again | 谢乾坤 | Kingname https://www.kingname.info/2023/02/25/crack-cf-2/

#python #cracking #hacks #scraping

Last updated on 2024-04-29 22:20:38
