Learning Python web scraping, day 3: BeautifulSoup

Today I am learning BeautifulSoup and using it to scrape the replies in a DXY (丁香园) forum thread. Official BeautifulSoup documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html

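Since the documentation is already very detailed, I will not walk through it here and will go straight to the task. Opening the target page http://3g.dxy.cn/bbs/topic/509959#!_id=626626 shows that the site requires a login to view the full content, so we need to set a Cookie. Reference: https://blog.csdn.net/eye_water/article/details/78484217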

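If no Cookie is sent when fetching the page, the server cannot identify the user and will not return the full page to an unidentified request, so part of the page source is hidden. How do we get a Cookie? The steps below save it to a txt file: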

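import urllib.request
import http.cookiejar

# File used to store the cookies
filename = 'cookie.txt'
# Create a MozillaCookieJar instance that saves cookies to that file
cookie = http.cookiejar.MozillaCookieJar(filename)
# Create the cookie handler
handler = urllib.request.HTTPCookieProcessor(cookie)
# Build the opener
opener = urllib.request.build_opener(handler)
# This request is made without logging in, so only the cookies set for an anonymous visitor are captured
response = opener.open("http://3g.dxy.cn/bbs/")
# Keep session cookies and expired cookies when writing the file
cookie.save(ignore_discard=True, ignore_expires=True)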

Opening the Cookie file shows the saved entries:
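The saved entries can also be listed from Python instead of opening the file by hand. The sketch below uses only the standard library. The helper names describe_cookies and write_demo_cookie_file are made up for illustration, and the demo cookie (demo_session=abc123) is invented data, not something dxy.cn actually sets.

```python
import http.cookiejar
import os
import tempfile


def describe_cookies(path):
    """Return (domain, name, is_session) for every cookie in a Netscape-format file."""
    jar = http.cookiejar.MozillaCookieJar()
    # ignore_discard=True keeps session cookies, which have no expiry time
    jar.load(path, ignore_discard=True, ignore_expires=True)
    return [(c.domain, c.name, c.expires is None) for c in jar]


def write_demo_cookie_file(path):
    """Write a file holding one invented session cookie (illustration only)."""
    jar = http.cookiejar.MozillaCookieJar(path)
    jar.set_cookie(http.cookiejar.Cookie(
        version=0, name="demo_session", value="abc123",
        port=None, port_specified=False,
        domain="3g.dxy.cn", domain_specified=False, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    ))
    jar.save(ignore_discard=True, ignore_expires=True)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        demo_path = os.path.join(tmp, "cookie.txt")
        write_demo_cookie_file(demo_path)
        for domain, name, is_session in describe_cookies(demo_path):
            print(domain, name, "session" if is_session else "persistent")
```

Pointing describe_cookies at the real cookie.txt gives a quick view of which cookies were captured and whether any of them looks like a login token.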

Next, when visiting the forum, read the Cookie from the file and attach it to the page request.

import http.cookiejar

import requests
from bs4 import BeautifulSoup as bs

cookie = http.cookiejar.MozillaCookieJar()
# Load the cookies saved earlier
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
url = 'http://3g.dxy.cn/bbs/topic/509959#!_id=626626'
res = requests.get(url, cookies=cookie)

# Parse the response into a BeautifulSoup object, which can be searched and pretty-printed
soup = bs(res.text, 'html.parser')

# Select the elements that hold the user names and the reply bodies
name = soup.select(".auth")
cont = soup.select(".postbody")

# zip pairs the two lists and stops at the shorter one, so the last reply is not skipped
for n, c in zip(name, cont):
    print("User: {}".format(n.get_text()))
    print("Reply: {}".format(c.get_text()))
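Because the real page needs a login, the selector logic is easier to try on a small inline document first. The sketch below runs the same .auth and .postbody selectors against invented HTML. The .post wrapper class and the sample names are assumptions for illustration and may not match the real DXY markup. Selecting inside each wrapper keeps every author paired with the right reply even when one of the two elements is missing.

```python
from bs4 import BeautifulSoup

# Invented markup that only mimics the class names used above
SAMPLE_HTML = """
<div class="post"><span class="auth">alice</span><div class="postbody"> First reply </div></div>
<div class="post"><span class="auth">bob</span><div class="postbody">Second reply</div></div>
<div class="post"><span class="auth">carol</span></div>
"""


def extract_replies(html):
    """Return a list of (author, reply) pairs, skipping incomplete posts."""
    soup = BeautifulSoup(html, "html.parser")
    replies = []
    for post in soup.select(".post"):
        author = post.select_one(".auth")
        body = post.select_one(".postbody")
        if author is None or body is None:
            continue  # incomplete post, nothing to pair
        replies.append((author.get_text(strip=True), body.get_text(strip=True)))
    return replies


if __name__ == "__main__":
    for author, reply in extract_replies(SAMPLE_HTML):
        print("User: {}".format(author))
        print("Reply: {}".format(reply))
```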

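Oddly, even after I logged in to DXY, the Cookie I obtained still identifies me as a guest, so the content I need does not appear in the returned page. A likely cause is that the script that saves the Cookie never logs in itself, so the file only holds an anonymous session. This still needs more work. I am short on time, so I am checking in for today.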
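One possible next step, as a sketch only and not verified against dxy.cn: copy the Cookie request header from a logged-in browser session (developer tools, Network tab) and turn it into a dict that requests.get(url, cookies=...) accepts. The helper name parse_cookie_header and the sample header are hypothetical. Real cookie names and values will differ, and values with unusual characters may not parse.

```python
from http.cookies import SimpleCookie


def parse_cookie_header(header):
    """Convert a raw 'name=value; name2=value2' Cookie header into a dict."""
    parsed = SimpleCookie()
    parsed.load(header)
    return {name: morsel.value for name, morsel in parsed.items()}


if __name__ == "__main__":
    # Placeholder header; paste the one copied from the browser instead
    raw_header = "JSESSIONID=abc123; token=xyz789"
    cookies = parse_cookie_header(raw_header)
    print(cookies)
    # The dict could then be passed along as: requests.get(url, cookies=cookies)
```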