Software Engineering

Cite this article:

HU Xuejun, LI Jiacheng. Distributed Crawling of Dangdang Book Data based on Scrapy-Redis[J]. Software Engineering, 2022, 25(10): 8-11.

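Distributed Crawling of Dangdang Book Data based on Scrapy-Redis

HU Xuejun, LI Jiacheng

(School of Mechanical Engineering, University of Shanghai for Science and Technology, Shanghai 200082, China)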
HXJ20161645@163.com; jiujiuniansheng@163.com

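Abstract: A single-machine web crawler collects data inefficiently, whereas a distributed web crawler can effectively improve crawling efficiency. This paper adopts the Scrapy-Redis framework, which is simpler to use, to design a distributed web crawler system with a master-slave architecture that crawls book information from Dangdang. It also studies the Bloom filter algorithm, analyzes the parameters that affect its performance, and integrates the algorithm into the deduplication module of the Scrapy-Redis Scheduler. The system uses one host as the Master and two machines as Slaves; after running for 1 hour, it had captured more than 18,000 book records.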

Keywords: web crawler; Scrapy framework; Scrapy-Redis framework; Bloom filter algorithm

CLC number: TP391.1 Document code: A

Distributed Crawling of Dangdang Book Data based on Scrapy-Redis

HU Xuejun, LI Jiacheng

(School of Mechanical Engineering, University of Shanghai for Science and Technology, Shanghai 200082, China)
HXJ20161645@163.com; jiujiuniansheng@163.com

Abstract: To address the low efficiency of single-machine web crawlers, this paper studies distributed web crawling, which can effectively improve data-crawling efficiency. It adopts the Scrapy-Redis framework, which is simpler to use, and designs a distributed web crawler system with a master-slave architecture to crawl book information from Dangdang. In addition, the Bloom filter algorithm is studied, the parameters affecting its performance are analyzed, and the algorithm is integrated into the deduplication module of the Scrapy-Redis Scheduler. With one host as the Master and two machines as Slaves, the system captured more than 18,000 book records after running for one hour.

Keywords: web crawler; Scrapy framework; Scrapy-Redis framework; Bloom Filter algorithm
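
The abstract names the Bloom filter and the parameters that affect its performance but does not list them here. As a rough illustration only, the sketch below uses the standard textbook sizing formulas, where for n expected items and a target false-positive rate p the bit-array size is m = -n ln p / (ln 2)^2 and the number of hash functions is k = (m / n) ln 2. It is not the authors' implementation: the class name, the SHA-256 double-hashing scheme, and the in-memory bit array are all assumptions made for the example (a distributed deployment would presumably keep the bits in Redis so that every Slave shares them).

```python
import hashlib
import math


def optimal_parameters(n: int, p: float) -> tuple[int, int]:
    """Return (m, k): bit-array size and hash count for n items at false-positive rate p."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = max(1, round(m / n * math.log(2)))
    return m, k


class BloomFilter:
    """In-memory Bloom filter (hypothetical helper, not part of Scrapy-Redis)."""

    def __init__(self, expected_items: int, false_positive_rate: float) -> None:
        self.m, self.k = optimal_parameters(expected_items, false_positive_rate)
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, key: str):
        # Double hashing: derive k bit indexes from two 64-bit slices of one digest.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd so the step is never zero
        for i in range(self.k):
            yield (h1 + i * h2) % self.m

    def add(self, key: str) -> bool:
        """Insert key; return True if every one of its bits was already set."""
        seen = True
        for pos in self._positions(key):
            byte, mask = pos >> 3, 1 << (pos & 7)
            if not self.bits[byte] & mask:
                seen = False
                self.bits[byte] |= mask
        return seen

    def __contains__(self, key: str) -> bool:
        return all(self.bits[pos >> 3] & (1 << (pos & 7)) for pos in self._positions(key))


if __name__ == "__main__":
    # Placeholder URL, used only to exercise the filter.
    url = "http://example.com/book/1"
    bf = BloomFilter(expected_items=18_000, false_positive_rate=0.001)
    print(bf.m, bf.k)
    print(bf.add(url))  # first sighting
    print(bf.add(url))  # repeat sighting
```

In Scrapy-Redis the default duplicate filter, `scrapy_redis.dupefilter.RFPDupeFilter`, records request fingerprints in a Redis set. One plausible way to integrate a filter like this is to subclass it and have `request_seen` consult the Bloom filter instead, trading a small false-positive rate (some unseen URLs are skipped) for much lower memory use per fingerprint.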
