Abstract: In the Weibo big data environment, with public opinion data collection and user behavior analysis as the application background, this paper presents the design and implementation of a crawler-based data acquisition system. The scheme combines a focused crawler with an incremental crawler and adopts a content-evaluation-based crawling strategy: it searches for user-specified keywords and updates the relevant content when it changes, so as to achieve timely and effective data acquisition. Measured results show that a single machine collects about 880,000 records per day; in practice, users can customize the crawling speed according to their needs, and can also increase the data volume and speed by adding distributed crawlers.
Keywords: big data; data acquisition; web crawler
CLC number: TP319
Document code: A
Fund projects: Philosophy and Social Science Planning Discipline Co-construction Project of Guangdong Province (GD18XXW07); Natural Science Foundation of Guangdong Province (2021A1515011803).
Research and Practice of Network Data Acquisition based on Big Data
HUO Ying1, LI Xiaofan1, QIU Zhimin2, LI Yanting1
|
(1. School of Information Engineering, Shaoguan University, Shaoguan 512005, China; 2. School of Intelligent Engineering, Shaoguan University, Shaoguan 512005, China)
huoying@sgu.edu.cn; 14929099@qq.com; 250437325@qq.com; kidi@qq.com
Abstract: In the context of Weibo big data, this paper presents the design and implementation of a crawler-based data acquisition system, with public opinion data collection and user behavior analysis as the application background. The solution combines a focused crawler with an incremental crawler and uses a content-evaluation-based crawling strategy to search for user-specified keywords and update the relevant content when it changes, so as to achieve timely and effective data acquisition. Measured results show that a single machine collects about 880,000 records per day. In practical applications, users can customize the crawling speed according to their needs, and can also increase the data volume and speed by adding distributed crawlers.
Keywords: big data; data acquisition; web crawler
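The abstract describes combining a focused crawler (keep only pages relevant to the user's keywords, judged by a content-evaluation score) with an incremental crawler (re-collect a page only when its content has changed). The following is a minimal, self-contained sketch of that combination; the page store, scoring function, and threshold are illustrative assumptions, not the authors' actual implementation, which targets live Weibo data.

```python
import hashlib

# Hypothetical in-memory "pages" standing in for fetched Weibo posts;
# in the real system these would come from network requests.
PAGES = {
    "url/1": "weibo public opinion analysis of a trending topic",
    "url/2": "sports results unrelated to the query",
    "url/3": "public opinion data collection on weibo users",
}

def relevance(text, keywords):
    """Content-evaluation score: fraction of query keywords found in the text."""
    words = set(text.split())
    return sum(1 for k in keywords if k in words) / len(keywords)

def crawl(keywords, seen_hashes, threshold=0.5):
    """One crawl pass.

    Focused part: discard pages scoring below `threshold`.
    Incremental part: skip pages whose content hash is unchanged
    since the last pass, so only new or updated content is collected.
    """
    results = []
    for url, text in PAGES.items():
        if relevance(text, keywords) < threshold:
            continue  # focused crawler: off-topic page
        digest = hashlib.sha256(text.encode()).hexdigest()
        if seen_hashes.get(url) == digest:
            continue  # incremental crawler: content unchanged
        seen_hashes[url] = digest
        results.append((url, text))
    return results

seen = {}
first = crawl(["public", "opinion"], seen)   # collects the two relevant pages
second = crawl(["public", "opinion"], seen)  # nothing changed, so collects nothing
```

A second pass returns nothing because every relevant page's hash is already recorded; if a post were edited, its hash would differ and it would be re-collected, which is how the scheme keeps the data set current without re-downloading everything.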