利用Python爬虫爬取网页福利图片

import requests
from requests.exceptions import RequestException

def get_one_page(url):
    headers={
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
    }
    try:
        response = requests.get(url,headers=headers)
        response.encoding = 'GBK'
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None

def main():
    url = 'http://www.xinxin103.top/L/lunlipian1.html'
    html = get_one_page(url)
    print(html)
        
main()

这样运行之后，就可以成功获取首页的源代码了。获取源代码后，就需要解析页面，提取出我们想要的信息。

5.正则提取

接下来，回到网页看一下页面的真实源码。在开发者模式下的 Network 监昕组件中查看源代码。

注意，这里不要在 Elements 选项卡中直接查看源码，因为那里的惊码可能经过 JavaScript 操作而与原始请求不同，而是需要从 Network 选项卡部分查看原始请求得到的惊码。其主要内容在<body></body>内，复制粘贴在在线HTML解析可以看到网页内部结构。

可以看到，一部电影信息对应的源代码是一个 <li> 节点，我们用正则表达式来提取这里面的一些影片信息。首先，需要提取它的名称信息。随后需要提取电影的图片。匹配正则表达式如下：

'<li>.*?title="(.*?)" class=.*?<img src="(.*?)" alt=.*?</li>'

接下来，通过调用 findall()方法提取出所有的内容。再接下来，我们再定义解析页面的方法 parse_one_page （），主要是通过正则表达式来从结果中提取我们想要的内容，我们再将匹配结果处理一下，遍历提取结果并生成字典，实现代码如下：

def parse_one_page(html):
    pattern = re.compile('<li>.*?title="(.*?)" class=.*?<img src="(.*?)" alt=.*?</li>', re.S)
    items = re.findall(pattern, html)
    for item in items:
        yield {
            'title': item[0],
            'image': item[1]
        }

6.下载图片

随后，我们将提取的图片URL存入一个链表，之后通过遍历的方式获取所有图片。

def download(imgurls):
    global j
    file_path="D:/我的文档/pictures/AV/"
    for imgurl in imgurls:
        path=file_path+str(j)
        #写入文件
        with open(path+".jpg","wb") as f:
            r=requests.get(imgurl)
            f.write(r.content)
        print("%s下载成功"%path)
        j+=1
   
def main():
    url = 'http://www.xinxin103.top/L/lunlipian.html'
    html = get_one_page(url)
    urls =[]
    for content in parse_one_page(html):
        print(content)
        #print ("dict['Name']: ", content['image'])
        urls.append(content['image'])
    download(urls)

到此为止，我们就完成了单页影片海报的提取。

7.分页爬取

此外，为了实现连续爬取多张网页的照片，这里还需要将 main （）方法修改一下，接收一个 offset 值作为偏移量，然后构造 URL 进行爬取。实现代码如下：

def main(offset):
    url = 'http://www.xinxin103.top/L/lunlipian'+str(offset)+'.html'
    html = get_one_page(url)
    urls =[]
    for content in parse_one_page(html):
        print(content)
        #print ("dict['Name']: ", content['image'])
        urls.append(content['image'])
    download(urls)

if __name__ == '__main__':
    for i in range(5):
        main(offset=i+2)
        print("第{}页下载结束！".format(str(i+2)))

8.完整代码

到此为止，我们的福利影片的爬虫就全部完成了，再稍微整理一下，完整的代码如下：

# -*- coding: utf-8 -*-
"""
Created on Sun Jul 28 22:15:20 2019

@author: Administrator
"""

import json
import requests
from requests.exceptions import RequestException
import re
import time

j=1
def get_one_page(url):
    headers={
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
    }
    try:
        response = requests.get(url,headers=headers)
        response.encoding = 'GBK'
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None

def parse_one_page(html):
    pattern = re.compile('<li>.*?title="(.*?)" class=.*?<img src="(.*?)" alt=.*?</li>', re.S)
    items = re.findall(pattern, html)
    for item in items:
        yield {
            'title': item[0],
            'image': item[1]
        }

def download(imgurls):
    global j
    file_path="D:/我的文档/pictures/AV/"
    for imgurl in imgurls:
        path=file_path+str(j)
        #写入文件
        with open(path+".jpg","wb") as f:
            r=requests.get(imgurl)
            f.write(r.content)
        print("%s下载成功"%path)
        j+=1

def main(offset):
    url = 'http://www.xinxin103.top/L/lunlipian'+str(offset)+'.html'
    html = get_one_page(url)
    urls =[]
    for content in parse_one_page(html):
        print(content)
        #print ("dict['Name']: ", content['image'])
        urls.append(content['image'])
    download(urls)
        

if __name__ == '__main__':
    for i in range(5):
        main(offset=i+2)
        print("第{}页下载结束！".format(str(i+2)))

接下里就是福利时间了。不要谢我，叫我雷锋。（手动滑稽）

参考：《Python 3网络爬虫开发实战》

版权声明：本文来源CSDN，感谢博主原创文章，遵循 CC 4.0 by-sa 版权协议，转载请附上原文出处链接和本声明。
原文链接：https://blog.csdn.net/akenseren/article/details/97625690
站方申明：本站部分内容来自社区用户分享，若涉及侵权，请联系站方删除。

发表于 2020-02-29 19:48:20
阅读 ( 1433 )
分类：