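Abstract: In the era of big data, demand for data collection is growing across industries. Demand for collecting data protected by JavaScript encryption is widespread, yet such collection still faces many bottlenecks. This paper uses JavaScript reverse-engineering crawler techniques to reconstruct the parameter encryption process and dynamically construct the Uniform Resource Locator (URL) for product reviews on a shopping website, enabling dynamic collection of review data for multiple products within a specified category and offering a new approach to collecting similarly encrypted data. Sentiment analysis of the collected LEGO review data with SnowNLP, a Python-based Chinese natural language processing (NLP) library, shows that about 66% of buyers gave positive reviews. The sentiment distribution is polarized: high scores cluster between 0.8 and 1.0, and low scores cluster between 0.0 and 0.2. Word cloud analysis indicates that buyers pay considerable attention to the appearance of the delivery packaging. These findings can serve as a reference for online merchants seeking to improve their operations and management.
Keywords: deep web crawler; JavaScript encryption; reverse technology; Ajax; data mining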
CLC number: TP391.
Document code: A
Deep Web Crawlers and Data Analysis Based on Reverse Technology
XING Yuqi, YANG Cheng
(School of Mathematics and Computer Science, Yunnan Minzu University, Kunming 650500, China)
20202134050104@ymu.edu.cn; cheng_yang@ymu.edu.cn
Abstract: In the era of big data, demand for data acquisition is growing across industries. Demand for acquiring data protected by JavaScript encryption is widespread, but such acquisition still faces many bottlenecks. This paper uses JavaScript reverse-engineering crawler techniques to reconstruct the parameter encryption process and dynamically construct the Uniform Resource Locator (URL) for product reviews on a shopping website. This enables dynamic acquisition of review data for multiple products within a specified category and offers a new approach to acquiring similarly encrypted data. SnowNLP, a Python-based Chinese natural language processing (NLP) library, is used to run sentiment analysis on the collected LEGO review data, which shows that about 66% of buyers gave positive reviews. The sentiment distribution is polarized: high scores cluster between 0.8 and 1.0, and low scores cluster between 0.0 and 0.2. Word cloud analysis indicates that buyers pay considerable attention to the appearance of the product's delivery packaging. These findings can serve as a reference for online sellers seeking to improve their business management.
Keywords: deep web crawler; JavaScript encryption; reverse technology; Ajax; data mining