爬虫步骤

爬虫其实很简单，可以大致分为三个步骤：

发起请求：我们需要先明确如何发起 HTTP 请求，获取到数据。
解析数据：获取到的数据乱七八糟的，我们需要提取出我们想要的数据。
保存数据：将我们想要的数据，保存下载。

发起请求，我们用requests即可；解析数据其实是爬虫中最重要的环节，需要从网页的HTML文本中找到有用的信息，这一步有很多工具可以使用，比如xpath、Beautiful Soup、正则表达式等；保存数据，就是常规的文本保存。

爬取小说示例

下面的示例是爬取http://www.jinyongwang.com/she/ 这一网页内，所有子链接内的小说正文（这些子链接内的正文合起来之后，便为金庸《射雕英雄传》小说全文）：

首先加载需要用到的函数库：

import requests
import re
from bs4 import BeautifulSoup
import time

这一函数用于从网页的HTML文本中获得小说的正文：

def get_text_from_website(target):
    req=requests.get(target)
    req.encoding='utf.8'
    bf=BeautifulSoup(req.text)
    texts=bf.find('div',id='vcon') #相当于是小说正文所对应的HTML元素
    pattern='(<p>)(.*?)(</p>)' #由于小说正文的前后带有一对<p></p>的符号，因此需要使用正则表达式进行文本清洗
    res=re.findall(pattern,str(texts))
    text=str()
    for strs in res:
        text=text+strs[1]
    return text

这一步是从HTML文本中提取出所需的网页链接，即小说每一章节所对应的网址：

content_website='http://www.jinyongwang.com/she/'
root_website='http://www.jinyongwang.com'
req=requests.get(content_website)
bf=BeautifulSoup(req.text)
contents=bf.find(name='ul',attrs={'class':'mlist'})
pattern='(href=\")(.*?)(\")'
res=re.findall(pattern,str(contents))
res=res[::-1]
postfixes=[]
for hrefs in res:
    postfixes.append(hrefs[1])

提取出每一章节的网址之后，便可以发送网页请求，获得对应的HTML文本，并从中提取出小说正文内容：

texts=[]
for postfix in postfixes:
    time.sleep(10)
    texts.append(get_text_from_website(root_website+postfix))

提取完成之后，便可以将文本合并，并保存：

fulltext=str()
for text in texts:
    fulltext=fulltext+text
with open('射雕英雄传.txt','w') as f:
    f.write(fulltext)

下载电视剧示例

下面的示例是从在线视频平台（第三方，像爱奇艺之类的平台一般反爬虫比较严格）下载电视剧《沉默的真相》的示例。首先我们加载需要用到的函数库：

import requests
from bs4 import BeautifulSoup
import ffmpy3
import os
import tqdm
from multiprocessing.dummy import Pool as ThreadPool

由于网站有简单的反爬虫功能，因此我们需要在发送请求的时候添加HTTP报文的头部信息，从而骗过网站的服务器：

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.122 Safari/537.36',
    'Referer': 'http://www.jisudhw.com',
    'Origin': 'http://www.jisudhw.com',
    'Host': 'www.jisudhw.com'
}
tv_lists_web='http://www.jisudhw.com/?m=vod-detail-id-64283.html'

1 2	req=requests.get(tv_lists_web,headers=headers) req.encoding='utf-8'

从HTML文本中解析出视频文件的链接：

1	detail_bf=BeautifulSoup(req.text,'lxml')

1 2	search_res={} num=1

for url in detail_bf.find_all(name='input'):
    if 'm3u8' in url.get('value'):
    #if 'mp4' in url.get('value'):
        tv_url=url.get('value')
        if tv_url not in search_res.keys():
            search_res[tv_url]=num
        num+=1

1	print(search_res)

{'http://iqiyi.cdn9-okzy.com/20200916/15492_d6ed2aea/index.m3u8': 1, 'http://iqiyi.cdn9-okzy.com/20200916/15493_ff7762e5/index.m3u8': 2, 'http://iqiyi.cdn9-okzy.com/20200917/15523_3343d3b8/index.m3u8': 3, 'http://iqiyi.cdn9-okzy.com/20200917/15522_61d94d67/index.m3u8': 4, 'http://iqiyi.cdn9-okzy.com/20200918/15574_3763fbe0/index.m3u8': 5, 'http://iqiyi.cdn9-okzy.com/20200918/15575_ba8a57fb/index.m3u8': 6, 'http://iqiyi.cdn27-okzy.com/20200918/7658_5f20adcc/index.m3u8': 7, 'http://iqiyi.cdn27-okzy.com/20200918/7657_7eb311d4/index.m3u8': 8, 'http://iqiyi.cdn27-okzy.com/20200918/7656_766f0698/index.m3u8': 9, 'http://iqiyi.cdn27-okzy.com/20200918/7655_9afd9602/index.m3u8': 10, 'http://iqiyi.cdn27-okzy.com/20200918/7654_fcf9ce30/index.m3u8': 11, 'http://iqiyi.cdn27-okzy.com/20200918/7653_6959c0d8/index.m3u8': 12}

得到视频文件的链接之后，则需要使用ffmpy3工具去下载视频文件，并将其保存到本地：

video_dir='C:\\Users\\Yufei Luo\\Desktop\\111'
def downVideo(url):
    num = search_res[url]
    name = os.path.join(video_dir, '第%03d集.mp4' % num)
    ffmpy3.FFmpeg(executable='C:\\Program Files\\ffmpeg\\bin\\ffmpeg.exe',inputs={url: None}, outputs={name:None}).run()
            
# 开8个线程池同时下载，可以加快下载进度
pool = ThreadPool(12)
results = pool.map(downVideo, search_res.keys())
pool.close()
pool.join()

Yufei Luo's Blog

网络爬虫-简单示例

爬虫步骤

爬取小说示例

下载电视剧示例