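Abstract: To address the problem that the wide choice of web crawler technologies affects crawling efficiency and accuracy, this paper analyzes three mainstream Python-based crawler technologies: Requests, Scrapy, and Selenium. First, the development environment is installed and configured, and single-threaded and multithreaded crawlers are developed. Second, 10, 100, 500, and 1,000 pages of resume data are crawled from the "站长之家" (Home of Webmasters) website, and the crawling time is measured. Finally, the ability to get past anti-crawler mechanisms is verified by crawling data from "中国裁判文书网" (China Judgements Online). The experimental results show that the Requests crawler can fetch data with a single line of code and is flexible to develop and customize; the Scrapy crawler averages 0.02 s per page and has outstanding concurrency performance; and the Selenium crawler is strong at getting past website anti-crawler mechanisms. Therefore, crawler development should weigh business requirements against the characteristics of each technology in order to achieve the best crawling results.
Keywords: web crawler; Requests technology; Scrapy technology; Selenium technology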
CLC number: TP302.7
Document code: A
Research on Three Web Crawler Technologies Based on Python
YANG Jian, CHEN Wei
(Zhuji Public Security Bureau, Shaoxing 311800, China)
716291923@qq.com; 396293104@qq.com
Abstract: The wide choice of web crawler technologies makes selection difficult, and the choice affects crawling efficiency and accuracy. This paper therefore analyzes three mainstream Python-based crawler technologies: Requests, Scrapy, and Selenium. First, the development environment is installed and configured, and single-threaded and multithreaded crawlers are developed. Second, the three crawlers fetch 10, 100, 500, and 1,000 pages of resume data from the "Home of Webmasters" website, and the crawling time is measured. Finally, the ability to get past anti-crawler mechanisms is verified by crawling data from the "China Judgements Online" website. The results show that the Requests crawler can fetch data with a single line of code and is flexible to develop and customize; the Scrapy crawler averages 0.02 seconds per page and has outstanding concurrency performance; and the Selenium crawler is strong at getting past website anti-crawler mechanisms. Therefore, crawler development should weigh business requirements against the characteristics of each technology in order to achieve the best crawling results.
Keywords: web crawler; Requests; Scrapy; Selenium
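
The abstract's one-line Requests fetch and its per-page timing comparison between single-threaded and multithreaded crawls can be illustrated with a minimal sketch. This is not the authors' code: the names `fetch_with_requests` and `time_crawl`, the `workers` parameter, and the 10-second timeout are assumptions made for illustration, and the thread pool comes from Python's standard `concurrent.futures` module rather than from any method stated in the paper.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_with_requests(url):
    """Fetch one page with Requests; the download itself is a single call."""
    # Third-party dependency, imported lazily so that time_crawl can also be
    # used with any other fetch function when Requests is not installed.
    import requests

    return requests.get(url, timeout=10).text  # timeout value is an assumption


def time_crawl(urls, fetch=fetch_with_requests, workers=1):
    """Crawl every URL and return (pages, total_seconds, seconds_per_page).

    workers=1 runs a single-threaded crawl; workers>1 uses a thread pool.
    """
    urls = list(urls)
    start = time.perf_counter()
    if workers <= 1:
        pages = [fetch(u) for u in urls]
    else:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # Executor.map returns results in the same order as the input URLs.
            pages = list(pool.map(fetch, urls))
    total = time.perf_counter() - start
    per_page = total / len(urls) if urls else 0.0
    return pages, total, per_page
```

Calling `time_crawl(urls, workers=1)` and then `time_crawl(urls, workers=8)` on the same list of page URLs would give a rough single-threaded versus multithreaded comparison of the kind the abstract describes; the measured times depend on the network and the target site, so no figures are suggested here.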