0x01[Preface]

  • Python crawlers are useful for a lot of things. My school happened to assign a big project on this a while back; the code was far from concise, but at least I wrote it myself, so I picked up the basics of bs4 along the way
  • Recently I found a novel that unfortunately offers no txt download, so I decided to scrape it myself, and to take the chance to organize the basic usage of bs4 for my own future reuse

0x02[Code Examples]

  • First we need two libraries [note: everything here is based on Python 3.8]; install them with pip or through PyCharm:

    • requests
    • bs4


  • On Python 3.8 and above, requests kept raising errors for me unless the proxies argument was set [more on this later in the article]
  • After the imports come some basic preparations:

    • Set up the proxies dict
    • Set up a UA header pool [optional]
  • As mentioned, on Python 3.8+ the proxies dict was mandatory in my setup. The site I want to scrape has to be reached through a proxy, which goes through local port 1080, so my proxies looks like this:
proxies = {
    "http": "http://127.0.0.1:1080",
    "https": "http://127.0.0.1:1080"
}
  • If you don't need a proxy, just leave the values empty:
proxies = {
    "http": "",
    "https": ""
}
  • Next, the UA pool. Crawlers usually rotate proxies rather than UAs; the point of a UA pool is to avoid getting banned when scraping search-engine results, e.g. from Google
  • Also, some sites have anti-scraping measures: the novel site I'm targeting checks the UA header and rejects requests that lack one [as far as my testing showed]
  • Note that headers can carry more than the UA; there are other fields such as cookie. When a site's anti-scraping is stricter, you can capture a request and copy its whole set of headers into the headers dict, as sketched below. I won't go deeper into anti-scraping here
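  • A rough sketch of such a dict, where url, the cookie, and the referer values are all hypothetical placeholders:
headers = {
    "User-Agent": random.choice(headers_pool),
    "Cookie": "sessionid=xxxxxxxx",        # hypothetical placeholder
    "Referer": "https://www.example.com/"  # hypothetical placeholder
}
res = requests.get(url, proxies=proxies, headers=headers)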
  • Here is the UA pool I use [also copied from somewhere online]:
headers_pool = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
]
  • When sending requests, import the random library and attach a randomly chosen UA each time, like this:
result = requests.get(baseurl, proxies=proxies, headers={'User-Agent': random.choice(headers_pool)})  # note: 'lxml' is BeautifulSoup's parser, not a requests.get argument
  • Then comes the basic flow: send the request, take the response, and use bs4 to prettify and analyze the response body. The approach is as follows:

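  • A minimal sketch of that flow, assuming baseurl holds the target page URL:
result = requests.get(baseurl, proxies=proxies, headers={'User-Agent': random.choice(headers_pool)})
soup = bs(result.text, 'lxml')
print(soup.prettify())  # the prettified HTML makes it easy to spot the features worth extracting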

  • Once you have the features of the parts you need to extract, you can start "peeling the onion", drilling down through the nested tags (see the excerpt after the next few bullets)


  • Here I need to put every scraped URL into a list so that I can fetch each one's content separately later
  • I originally meant to pull the full details out of data-title, but then decided it wasn't necessary; I only need the URLs, and I could have scraped the href attribute directly, since href is exactly the URL I want
  • But having already written it this way, I went with string slicing instead, extracting the part of data-title that starts with http and putting it into the chapter list
  • Then each URL gets visited and scraped on its own
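  • This step, as it appears in chapterURL in the full script below:
for base_i in bs_baseurl.find_all("div", attrs={"id": "chapterList"}):
    for base_j in base_i.find_all("a"):
        start = base_j['data-title'].index("http")    # the chapter URL starts at "http"
        chapter.append(base_j['data-title'][start:])  # keep everything from "http" onward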


  • You can see there are still some stray tags such as <p> and <br/> mixed into the text; I simply replace them with empty strings

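  • The cleanup, as done in novelpage in the full script below (novel_j is the content div found by bs4):
replacelist = ['<div class="forum-content mt-3" id="">', '<p>', '</p>', '</div>', '<br/>']
savecontents = str(novel_j)
for rep_i in replacelist:
    savecontents = savecontents.replace(rep_i, "")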

  • Save the scraped content into a txt file:

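  • A minimal sketch of the helper, shown here before the simplified-Chinese conversion gets added later:
def savetxt(contents):
    file = open('novel.txt', 'a', encoding='utf-8')  # append mode so each chapter adds on
    file.write(contents + "\n")
    file.close()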

  • First, test on 5 pages of the novel to check for problems: add a limit that quits after scraping the content of 5 URLs (sketched below)

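  • The limit is just a counter around the crawl loop, roughly like this (it survives as the commented-out lines in the full script below):
num = 0
for i in chapter:
    num += 1
    if num == 5:  # bail out after the first 5 chapter URLs
        exit(0)
    # ...fetch and parse chapter i as usual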

  • On inspection, apart from a very few punctuation marks garbled by encoding issues, everything is fine; nothing that really matters
  • The scraped text is in traditional Chinese; to make it easier to read, I want to convert it to simplified Chinese
  • Following tutorials online, I found a conversion library by a skilled author
  • https://github.com/gumblex/zhconv
  • It even shows up directly in PyCharm's package search, which saves a lot of trouble


  • Going by a Zhihu tutorial, usage is also very simple, as shown below

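  • The API is a single convert(text, locale) call; 'zh-cn' targets simplified Chinese:
from zhconv import convert

print(convert('漢字轉換', 'zh-cn'))  # prints: 汉字转换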

  • So, before writing to the txt file, do the traditional-to-simplified conversion first:

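  • Which makes the save helper in the full script below:
def savetxt(contents):
    file = open('novel.txt', 'a', encoding='utf-8')
    file.write(convert(contents, 'zh-cn') + "\n")  # convert to simplified Chinese before writing
    file.close()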

  • Remove the earlier limit and start scraping happily!

0x03[Full Code]

# -*- coding: utf-8 -*-
# @Time :  13:41
# @Author : LTLT
# @File : novel.py
# @Software : PyCharm

import requests
import random
from bs4 import BeautifulSoup as bs
from zhconv import convert

proxies = {
    "http": "http://127.0.0.1:1080",
    "https": "http://127.0.0.1:1080"
}

headers_pool = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
]

chapter = []  # chapter URLs collected from the table of contents
num = 0       # counter for the temporary 5-chapter test limit
# Collect the chapter URLs from the table-of-contents page
def chapterURL():
    baseurl = "https://www.xxxxxxxxxxx.html"
    result = requests.get(baseurl, proxies=proxies, headers={'User-Agent': random.choice(headers_pool)})
    bs_baseurl = bs(result.text, 'lxml')
    for base_i in bs_baseurl.find_all("div", attrs={"id": "chapterList"}):
        for base_j in base_i.find_all("a"):
            start = base_j['data-title'].index("http")    # the chapter URL starts at "http"
            chapter.append(base_j['data-title'][start:])  # keep only the URL part

# Write the scraped content into the txt file
def savetxt(contents):
    file = open('novel.txt', 'a', encoding='utf-8')
    file.write(convert(contents, 'zh-cn') + "\n")  # convert to simplified Chinese before writing
    file.close()

# Scrape the content of each chapter page
def novelpage():
    replacelist = ['<div class="forum-content mt-3" id="">', '<p>', '</p>', '</div>', '<br/>']
    # the strings in this list are the tags to strip out of the text
    for i in chapter:
#       global num
#       num += 1
#       if num == 5:
#           exit(0)
        res = requests.get(i, proxies=proxies, headers={'User-Agent': random.choice(headers_pool)})
        novel_base = bs(res.text, 'lxml')
        novel_i = str(novel_base.find("h2"))
        novel_tit = novel_i[novel_i.index("<h2>")+4: novel_i.index("</h2>")]
        #print(novel_tit)
        # extract the h2 content; the h2 tag is unique on every page, so a plain find()
        # plus a bit of string slicing is enough
        # judging from the novel_base output, all the p tags we need live in the div
        # with class "forum-content mt-3"
        for novel_j in novel_base.find_all("div", attrs={"class": "forum-content mt-3"}):
            savecontents = str(novel_j)
            for rep_i in replacelist:
                savecontents = savecontents.replace(rep_i, "")
            savetxt(novel_tit + savecontents)
            # this is the content that gets saved into the txt file

if __name__ == '__main__':
    chapterURL()
    novelpage()