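Abstract: The contest between crawler technology for deep-web data and anti-crawler technology rises and falls on each side as website technology, big data, and asynchronous transmission develop. Based on a comprehensive comparison of current mainstream crawler and anti-crawler technologies, and aimed at efficient development and fast crawling, MUCrawler (Multi-source Uniform-interface Crawler framework) is designed as a Python framework that targets multiple website data sources and supports crawler development through a uniform interface. Test results show that the framework not only gets past different anti-crawler techniques to obtain website data, but also performs well in development efficiency, robustness, and crawling efficiency.
Keywords: Python development; web crawler; browser behavior; HTTP request
CLC number: TP311.1
Document code: A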
Design and Implementation of a Multi-source Uniform-interface Crawler Framework
PAN Hongtao
(Baoding Electric Power VOC. & TECH College, Baoding 071051, China)
bddypht@126.com
Abstract: The contest between crawler technology for deep-web data and anti-crawler technology has ebbed and flowed with the development of website technology, big data, and asynchronous transmission. After comprehensively comparing current mainstream crawler and anti-crawler technologies, and to meet the needs of efficient development and fast crawling, this paper proposes the Multi-source Uniform-interface Crawler (MUCrawler) framework, a Python framework that supports crawler development against multiple website data sources through a uniform interface. Test results show that the proposed framework not only gets past different anti-crawler techniques to obtain website data, but also performs well in development efficiency, robustness, and crawling efficiency.
Keywords: Python development; web crawler; browser behavior; HTTP (Hypertext Transfer Protocol) request