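Abstract: Single-machine web crawlers collect data inefficiently, whereas distributed web crawlers can effectively improve crawling efficiency. This paper adopts the Scrapy-Redis framework, which is simpler to use, to design a distributed web crawler system with a master-slave architecture that crawls book information from Dangdang. It also studies the Bloom filter algorithm, analyzes the parameters that affect its performance, and integrates the algorithm into the deduplication module of the Scrapy-Redis Scheduler. The system uses one host as the Master and two machines as Slaves; after running for 1 hour, it had collected more than 18,000 book records.
Keywords: web crawler; Scrapy framework; Scrapy-Redis framework; Bloom filter algorithm
CLC number: TP391.1
Document code: A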
Distributed Crawling of Dangdang Book Data Based on Scrapy-Redis
HU Xuejun, LI Jiacheng
(School of Mechanical Engineering, University of Shanghai for Science and Technology, Shanghai 200082, China)
HXJ20161645@163.com; jiujiuniansheng@163.com
Abstract: Single-machine web crawlers collect data inefficiently, whereas distributed web crawling can effectively improve crawling efficiency. This paper uses the Scrapy-Redis framework, chosen because it is simpler to use, to design a distributed web crawler system with a master-slave architecture that crawls book information from Dangdang. In addition, the Bloom filter algorithm is studied, the parameters affecting its performance are analyzed, and the algorithm is integrated into the deduplication module of the Scrapy-Redis Scheduler. With one host as the Master and two machines as Slaves, the system collected more than 18,000 book records after running for one hour.
Keywords: web crawler; Scrapy framework; Scrapy-Redis framework; Bloom filter algorithm