Abstract: In the Weibo big data environment, with public opinion data collection and user behavior analysis as the application background, this paper presents the design and implementation of a crawler-based data acquisition system. The scheme combines a focused crawler with an incremental crawler and adopts a content-evaluation-based crawling strategy: it searches for user-specified keywords and updates the relevant content when it changes, so as to achieve timely and effective data acquisition. Measured results show that a single machine collects about 880,000 records per day; in practice, users can customize the crawling speed according to their needs, and can also increase the data volume and speed by adding distributed crawlers.
Keywords: big data; data acquisition; web crawler
CLC number: TP319
Document code: A
Fund projects: Philosophy and Social Science Planning Discipline Co-construction Project of Guangdong Province (GD18XXW07); Natural Science Foundation of Guangdong Province (2021A1515011803).
Research and Practice of Network Data Acquisition based on Big Data
HUO Ying1, LI Xiaofan1, QIU Zhimin2, LI Yanting1
|
(1. School of Information Engineering, Shaoguan University, Shaoguan 512005, China; 2. School of Intelligent Engineering, Shaoguan University, Shaoguan 512005, China)
huoying@sgu.edu.cn; 14929099@qq.com; 250437325@qq.com; kidi@qq.com
Abstract: In the context of Weibo big data, this paper presents the design and implementation of a crawler-based data acquisition system, with public opinion data collection and user behavior analysis as the application background. The solution combines a focused crawler with an incremental crawler and uses a content-evaluation-based crawling strategy to search for user-specified keywords and update the relevant content when it changes, so as to achieve timely and effective data acquisition. Measured results show that a single machine collects about 880,000 records per day. In practical applications, users can customize the crawling speed according to their needs, and can also increase the data volume and speed by adding distributed crawlers.
Keywords: big data; data acquisition; web crawler
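The abstract describes combining a focused crawler (keep only pages relevant to the user's keywords, judged by a content-evaluation score) with an incremental crawler (re-collect a page only when its content has changed). The following is a minimal, self-contained sketch of that combination; the page store, scoring function, and threshold are illustrative assumptions, not the authors' actual implementation, which targets live Weibo data.

```python
import hashlib

# Hypothetical in-memory "pages" standing in for fetched Weibo posts;
# in the real system these would come from network requests.
PAGES = {
    "url/1": "weibo public opinion analysis of a trending topic",
    "url/2": "sports results unrelated to the query",
    "url/3": "public opinion data collection on weibo users",
}

def relevance(text, keywords):
    """Content-evaluation score: fraction of query keywords found in the text."""
    words = set(text.split())
    return sum(1 for k in keywords if k in words) / len(keywords)

def crawl(keywords, seen_hashes, threshold=0.5):
    """One crawl pass.

    Focused part: discard pages scoring below `threshold`.
    Incremental part: skip pages whose content hash is unchanged
    since the last pass, so only new or updated content is collected.
    """
    results = []
    for url, text in PAGES.items():
        if relevance(text, keywords) < threshold:
            continue  # focused crawler: off-topic page
        digest = hashlib.sha256(text.encode()).hexdigest()
        if seen_hashes.get(url) == digest:
            continue  # incremental crawler: content unchanged
        seen_hashes[url] = digest
        results.append((url, text))
    return results

seen = {}
first = crawl(["public", "opinion"], seen)   # collects the two relevant pages
second = crawl(["public", "opinion"], seen)  # nothing changed, so collects nothing
```

A second pass returns nothing because every relevant page's hash is already recorded; if a post were edited, its hash would differ and it would be re-collected, which is how the scheme keeps the data set current without re-downloading everything.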