Software Engineering

Cite this article:

DONG Zongran, WEN Baizhi, ZHU Yi. Design of a New Efficient Full-text Search Engine[J]. Software Engineering, 2024, (2): 44-48.


Design of a New Efficient Full-text Search Engine

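DONG Zongran^1,2, WEN Baizhi^3, ZHU Yi^1,2

[1. School of Software, Dalian University of Foreign Languages, Dalian, Liaoning 116044;
2. Big Data Library and Information Research Center, Dalian University of Foreign Languages, Dalian, Liaoning 116044;
3. China Unicom (Liaoning) Industrial Internet Co., Ltd., Shenyang, Liaoning 110041]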
dongzongran@163.com; wbz1234569@126.com; zhuyidl@163.com

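Abstract: To address the poor fuzzy-query performance of conventional storage methods, an efficient fuzzy query method for large text document data is proposed. An inverted index is built over the documents, and the index together with part of the document information is loaded into memory to reduce disk input and output (I/O). Data are queried through the mapping between the in-memory inverted index and the primary keys in the database, the results are then ranked by a relevance algorithm, and a trie (dictionary tree) provides search suggestions, achieving efficient full-text search. Experimental results show that, when using the same word set as ElasticSearch, the query efficiency of the designed full-text search engine is 80 to 1 200 times that of ElasticSearch as the amount of test data varies, and this efficiency advantage varies in inverse proportion to the data volume. With 17 919 documents, its memory usage does not exceed 2.5 GB, making it suitable for massive document data retrieval.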

Keywords: inverted index; full-text search; search engine; fuzzy query; trie

CLC number: TP391.3  Document code: A

Funding: 2022 Basic Scientific Research Project of Higher Education Institutions of Liaoning Province (LJKMZ20221547)
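The retrieval path described in the abstract (an in-memory inverted index mapped to database primary keys, followed by relevance ranking) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class name `InvertedIndex`, the whitespace tokenizer, and the TF-IDF-style score are all assumptions chosen for illustration, since the abstract does not specify the word segmenter or the relevance algorithm.

```python
import math
from collections import Counter, defaultdict


class InvertedIndex:
    """Hypothetical in-memory index: term -> {primary key -> term frequency}."""

    def __init__(self):
        self.postings = defaultdict(dict)
        self.doc_count = 0

    def add(self, pk, text):
        # Assumption: whitespace tokenization stands in for a real word segmenter.
        for term, tf in Counter(text.lower().split()).items():
            self.postings[term][pk] = tf
        self.doc_count += 1

    def search(self, query, top_k=10):
        scores = defaultdict(float)
        for term in set(query.lower().split()):
            docs = self.postings.get(term, {})
            if not docs:
                continue
            # Assumption: a simple TF-IDF weight as the relevance score.
            idf = math.log(1 + self.doc_count / len(docs))
            for pk, tf in docs.items():
                scores[pk] += tf * idf
        # Highest score first; ties are broken by primary key for determinism.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        return [pk for pk, _ in ranked[:top_k]]
```

In such a design, `search` returns only primary keys; the full rows would then be fetched from the database by key, so the disk is touched only for the documents that are actually returned.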

Design of a New Efficient Full-text Search Engine

DONG Zongran^1,2, WEN Baizhi^3, ZHU Yi^1,2

[1. School of Software, Dalian University of Foreign Languages, Dalian 116044, China;
2. Big Data Library and Information Research Center, Dalian University of Foreign Languages, Dalian 116044, China;
3.China Unicom (Liaoning) Industrial Internet Co., Ltd., Shenyang 110041, China]


Abstract: To address the poor fuzzy-query performance of conventional storage, an efficient fuzzy query method for large text document data is proposed. Inverted indexes are built over the documents, and the indexes together with some document information are loaded into memory to reduce disk I/O. Data are queried through the mapping between the in-memory inverted indexes and the primary keys in the database, the results are then sorted by a relevance algorithm, and a trie is used for search suggestions, achieving efficient full-text search. The experimental results show that, when using the same word set as ElasticSearch, the query efficiency of the designed full-text search engine is 80 to 1 200 times that of ElasticSearch, depending on the amount of test data. With 17 919 documents, its memory usage does not exceed 2.5 GB, making it suitable for massive document data retrieval.

Keywords: inverted index; full-text search; search engine; fuzzy query; trie
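The abstract mentions a trie used for search suggestions. A minimal Python sketch of prefix-based suggestion is given here as an illustration only; the names `TrieNode`, `Trie`, and `suggest`, the `limit` parameter, and the alphabetical ordering of suggestions are assumptions, not details taken from the paper.

```python
class TrieNode:
    """One node per character; is_word marks the end of an indexed term."""

    def __init__(self):
        self.children = {}
        self.is_word = False


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            if ch not in node.children:
                node.children[ch] = TrieNode()
            node = node.children[ch]
        node.is_word = True

    def suggest(self, prefix, limit=5):
        # Walk down to the node that represents the prefix.
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        # Depth-first traversal collects completions in alphabetical order.
        results = []
        stack = [(node, prefix)]
        while stack and len(results) < limit:
            current, word = stack.pop()
            if current.is_word:
                results.append(word)
            # Push children in reverse order so the smallest pops first.
            for ch in sorted(current.children, reverse=True):
                stack.append((current.children[ch], word + ch))
        return results
```

Because the trie is keyed on individual characters, the same structure works for Chinese terms as well as Latin-script ones. A production engine would likely order suggestions by popularity or relevance rather than alphabetically.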
