Software Engineering (软件工程)

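Cite this article:

PAN Hongtao (潘洪涛). Design and Implementation of a Multi-source Uniform-interface Crawler Framework [J]. Software Engineering, 2021, 24(4): 30-33.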

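Design and Implementation of a Multi-source Unified Crawler Framework

PAN Hongtao (潘洪涛)

(Baoding Electric Power Vocational and Technical College, Baoding, Hebei 071051, China)
bddypht@126.com

Abstract: The contest between crawler technology aimed at deep web data and anti-crawler technology has swung back and forth as website technology, big data, and asynchronous transfer have developed. After a comprehensive comparison of current mainstream crawler and anti-crawler techniques, and to meet the needs of efficient development and fast crawling, MUCrawler (a multi-source unified crawler framework) was designed as a Python framework that targets multiple website data sources and supports crawler development through a uniform interface. Test results show that the framework not only gets past different anti-crawler techniques to obtain website data, but also performs well in development efficiency, robustness, and crawling efficiency.

Keywords: Python development; web crawler; browser behavior; HTTP request

CLC number: TP311.1 Document code: A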

Design and Implementation of a Multi-source Uniform-interface Crawler Framework

PAN Hongtao

(Baoding Electric Power Vocational and Technical College, Baoding 071051, China)
bddypht@126.com

Abstract: The contest between crawler technology for deep web data and anti-crawler technology has ebbed and flowed with the development of website technology, big data, and asynchronous transmission. After comprehensively comparing current mainstream crawler and anti-crawler technologies, and considering the needs of efficient development and fast crawling, this paper proposes the Multi-source Uniform-interface Crawler (MUCrawler) framework. MUCrawler is a Python framework that targets multiple website data sources and supports crawler development through a uniform interface. Test results show that the framework not only gets past different anti-crawler technologies to obtain website data, but also performs well in development efficiency, robustness, and crawling efficiency.

Keywords: Python development; web crawler; browser behavior; HTTP (Hypertext Transfer Protocol) request
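
The abstract describes a single interface over several website data sources but does not show MUCrawler's actual API. The following is only a minimal sketch of what such a uniform interface could look like in Python, not the author's implementation. The names `SourceAdapter`, `LineListSource`, `CsvSource`, `UnifiedCrawler`, and `fake_fetch`, as well as the `*.example` URLs, are all hypothetical. The fetch function is injected, so the sketch runs without network access.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List
from urllib.parse import quote


class SourceAdapter(ABC):
    """Hypothetical per-site adapter: hides how one site is queried and parsed."""

    name: str

    @abstractmethod
    def build_url(self, keyword: str) -> str:
        """Return the URL that searches this source for `keyword`."""

    @abstractmethod
    def parse(self, body: str) -> List[str]:
        """Turn the raw response body into a list of records."""


class LineListSource(SourceAdapter):
    """Assumed source that returns one record per line."""

    name = "lines"

    def build_url(self, keyword: str) -> str:
        return f"https://lines.example/search?q={quote(keyword)}"

    def parse(self, body: str) -> List[str]:
        return [line.strip() for line in body.splitlines() if line.strip()]


class CsvSource(SourceAdapter):
    """Assumed source that returns comma-separated records."""

    name = "csv"

    def build_url(self, keyword: str) -> str:
        return f"https://csv.example/find/{quote(keyword)}"

    def parse(self, body: str) -> List[str]:
        return [item.strip() for item in body.split(",") if item.strip()]


class UnifiedCrawler:
    """One entry point for every registered source."""

    def __init__(self, fetch: Callable[[str], str]) -> None:
        self._fetch = fetch  # injected so HTTP or browser fetching can be swapped
        self._sources: Dict[str, SourceAdapter] = {}

    def register(self, adapter: SourceAdapter) -> None:
        self._sources[adapter.name] = adapter

    def crawl(self, source: str, keyword: str) -> List[str]:
        adapter = self._sources[source]
        body = self._fetch(adapter.build_url(keyword))
        return adapter.parse(body)


def fake_fetch(url: str) -> str:
    """Stand-in for a real fetcher; returns canned bodies for the demo."""
    if url.startswith("https://lines.example/"):
        return "alpha\nbeta\n"
    return "gamma, delta"


crawler = UnifiedCrawler(fake_fetch)
crawler.register(LineListSource())
crawler.register(CsvSource())
lines_result = crawler.crawl("lines", "deep web")
csv_result = crawler.crawl("csv", "deep web")
```

In this sketch the caller always uses `crawl(source, keyword)`, and site-specific details stay inside each adapter. A real fetcher, whether based on HTTP requests or on driving a browser, could replace `fake_fetch` without changing the calling code.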
