A First Look at Python Web Scraping with the Scrapy Framework

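After three days of trial and error, I finally got a simple crawler project working. I am new to web scraping, and without a systematic framework it is hard to crawl a whole site end to end, so I worked through a number of tutorials and blog posts and ended up with a simple spider that crawls every article on Jobbole (伯乐在线).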

1. Preparation: installing the Scrapy framework

Command "python setup.py egg_info" failed with error code 1 in

If you run into the error above, you can look for a solution on this website. After that, we can create our new project.

2. Start crawling: creating a new Scrapy project

scrapy startproject article
File structure

Now write the spider code:

# -*- coding: utf-8 -*-
import scrapy

from ..items import ArticleItem


class ArticleSpider(scrapy.Spider):
    name = 'article'
    start_urls = ['http://python.jobbole.com/all-posts/']

    def parse(self, response):
        # Each post on the listing page sits in its own div.
        posts = response.xpath('//div[@class="post floated-thumb"]')
        for post in posts:
            # Create a fresh item per post so yielded items do not share state.
            item = ArticleItem()
            item['title'] = post.xpath('.//a[@class="archive-title"]/text()').extract_first()
            # The date is the digits/digits/digits run inside the post-meta text.
            item['date'] = post.xpath('.//div[@class="post-meta"]/p/text()').re_first(r'\d+/\d+/\d+')
            item['short'] = post.xpath('.//span[@class="excerpt"]/p/text()').extract_first()
            item['link'] = post.xpath('.//span[@class="read-more"]/a/@href').extract_first()
            yield item

        # Follow the "next page" link. extract_first() returns None on the last
        # page, whereas extract()[0] would raise IndexError there.
        next_url = response.xpath('//a[@class="next page-numbers"]/@href').extract_first()
        if next_url:
            yield scrapy.Request(response.urljoin(next_url), callback=self.parse)
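
The date is pulled out of the post-meta text with a regular expression for digits separated by slashes, and that pattern needs `\d` (with the backslash) to mean "a digit". This is easy to check outside Scrapy with the standard re module. The sketch below uses a made-up sample string, so the real page text may look different; `extract_date` is a hypothetical helper, not part of the project.

```python
import re

# Hypothetical text standing in for what the post-meta paragraph might contain.
sample_meta = "  2017/08/30 · Python · 2 comments "

DATE_PATTERN = re.compile(r"\d+/\d+/\d+")


def extract_date(text):
    """Return the first digits/digits/digits date in text, or None if absent."""
    match = DATE_PATTERN.search(text)
    return match.group(0) if match else None


print(extract_date(sample_meta))
# Without the backslashes the pattern looks for the literal letter "d",
# so it finds nothing in this sample.
print(re.search("d+/+d+/+d+", sample_meta))
```

The first call prints the date and the second prints None, which is why a pattern written without backslashes makes indexing the result with [0] fail.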

Saving to Excel
pipelines.py

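from openpyxl import Workbook


class TuniuPipeline(object):  # pipeline stage that writes items to an Excel file
    def open_spider(self, spider):
        # Create the workbook once per crawl and write the header row.
        self.wb = Workbook()
        self.ws = self.wb.active
        self.ws.append(['Title', 'Link', 'Date', 'Summary'])

    def process_item(self, item, spider):
        # Append one row per item, in the same column order as the header.
        self.ws.append([item['title'], item['link'], item['date'], item['short']])
        return item

    def close_spider(self, spider):
        # Save once at the end rather than rewriting the file for every item.
        self.wb.save('article.xlsx')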

settings.py

ITEM_PIPELINES = {
    'article.pipelines.TuniuPipeline': 200,  # 200 sets this pipeline's place in the run order (lower values run first)
}

Run the spider with: scrapy crawl article
The results are saved to article.xlsx.

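Copyright notice: this article comes from CSDN, with thanks to the blogger who wrote it. It is licensed under CC 4.0 BY-SA; when reposting, please include a link to the original source and this notice.
Original link: https://blog.csdn.net/qq_38350792/article/details/77698065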

Published 2020-02-02 19:03:07