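Abstract: When a Bayesian classifier is used to classify commodity descriptions, a large number of confusing words appear, so independence between features cannot be guaranteed. To address this, a Bayesian classifier algorithm optimized with stop words is proposed: word frequency statistics and part-of-speech screening filter out most of the confusing words, which helps preserve feature independence. To address the problem that similar categories cannot be accurately distinguished, a sub-model training scheme is proposed: easily confused categories are trained separately and the training process is recorded, and in the test phase the sub-model is selected and applied according to the initial result, achieving finer-grained discrimination. Experiments show that the optimized scheme is feasible and achieves an accuracy of 97.80%.
Keywords: Naive Bayes classifier; stop words; sub-model training; commodity classification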
CLC number: TP311
Document code: A
Funding: Natural Science Foundation of Liaoning Province (2019-ZD-0354).
Research on Application of Improved Bayesian Algorithm in Commodity Classification
SHAO Xinxin
(Department of Software Engineering, Dalian Neusoft University of Information, Dalian 116023, China)
Shaoxinxin@neusoft.edu.cn
Abstract: When a Naive Bayes classifier is used to classify commodity descriptions, a large number of confusing words appear, which makes it difficult to ensure independence between features. This paper proposes a Bayesian classifier algorithm optimized with stop words: most confusing words are filtered out through word frequency statistics and part-of-speech screening. To address the problem that similar categories cannot be accurately distinguished, a sub-model training scheme is proposed in which easily confused categories are trained separately and the training process is recorded. In the test phase, the sub-model is selected and applied according to the initial classification result, so as to achieve finer-grained distinction. Experiments show that the optimized scheme is feasible and reaches an accuracy of 97.80%.
Keywords: Naive Bayes classifier; stop words; sub-model training; commodity classification
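The two ideas summarized in the abstract, stop-word filtering and routing confusable categories to a separately trained sub-model, can be illustrated with a minimal sketch. It is not the paper's implementation. All names (`NaiveBayes`, `frequent_across_labels`, `TwoStageClassifier`, `min_label_fraction`) are hypothetical. The stop-word rule used here (drop tokens that occur in most categories) is only an assumed stand-in for the word frequency statistics, and part-of-speech screening is omitted because it needs an external tagger. Tokens are split on whitespace, so Chinese text would first need a word segmenter.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes with Laplace smoothing over whitespace tokens."""

    def __init__(self, stop_words=frozenset()):
        self.stop_words = set(stop_words)
        self.word_counts = defaultdict(Counter)  # label -> token counts
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()

    def tokens(self, text):
        # Lowercase, split on whitespace, and drop stop words.
        return [t for t in text.lower().split() if t not in self.stop_words]

    def fit(self, docs):
        for text, label in docs:
            toks = self.tokens(text)
            self.word_counts[label].update(toks)
            self.doc_counts[label] += 1
            self.vocab.update(toks)
        return self

    def predict(self, text):
        # Tokens never seen in training are ignored.
        toks = [t for t in self.tokens(text) if t in self.vocab]
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in sorted(self.doc_counts):
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # log prior + sum of smoothed log likelihoods
            score = math.log(self.doc_counts[label] / total_docs)
            score += sum(math.log((counts[t] + 1) / denom) for t in toks)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def frequent_across_labels(docs, min_label_fraction=0.8):
    """Assumed stop-word rule: tokens seen in at least this fraction of labels."""
    labels_per_token = defaultdict(set)
    labels = set()
    for text, label in docs:
        labels.add(label)
        for tok in set(text.lower().split()):
            labels_per_token[tok].add(label)
    return {tok for tok, seen in labels_per_token.items()
            if len(seen) / len(labels) >= min_label_fraction}


class TwoStageClassifier:
    """Main model over all labels, plus a sub-model for one confusable group."""

    def __init__(self, docs, confusable):
        self.confusable = set(confusable)
        self.main = NaiveBayes(frequent_across_labels(docs)).fit(docs)
        # The sub-model sees only the confusable labels and builds its own
        # stop-word list, so words shared by those labels are filtered too.
        sub_docs = [(t, l) for t, l in docs if l in self.confusable]
        self.sub = NaiveBayes(frequent_across_labels(sub_docs)).fit(sub_docs)

    def predict(self, text):
        label = self.main.predict(text)
        if label in self.confusable:
            return self.sub.predict(text)  # refine within the confusable group
        return label
```

In this sketch the set of confusable categories is supplied by hand. How such groups are identified, and what is recorded during sub-model training, are not specified in the abstract and are left out here.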