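Abstract: To address the poor fuzzy-query performance of conventional storage approaches, an efficient fuzzy query method for large text document data is proposed. An inverted index is built over the documents, and the index together with part of the document information is loaded into memory to reduce disk input/output (I/O). Data are queried through the mapping between the in-memory inverted index and the primary keys in the database, the results are then ranked by a relevance algorithm, and a trie is used to provide search suggestions, achieving efficient full-text retrieval. Experimental results show that, when using the same word set as ElasticSearch, the query efficiency of the designed full-text search engine is 80 to 1 200 times that of ElasticSearch as the volume of test data varies, with the efficiency advantage varying in inverse proportion to the data volume. With 17 919 documents, its memory usage does not exceed 2.5 GB, making it suitable for retrieval over massive document data.
Keywords: inverted index; full-text search; search engine; fuzzy query; trie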
CLC number: TP391.3
Document code: A
Funding: 2022 Basic Scientific Research Project of Higher Education Institutions of Liaoning Province (LJKMZ20221547)
Design of a New Efficient Full-text Search Engine
DONG Zongran1,2, WEN Baizhi3, ZHU Yi1,2
[1. School of Software, Dalian University of Foreign Languages, Dalian 116044, China; 2. Big Data Library and Information Research Center, Dalian University of Foreign Languages, Dalian 116044, China; 3. China Unicom (Liaoning) Industrial Internet Co., Ltd., Shenyang 110041, China]
dongzongran@163.com; wbz1234569@126.com; zhuyidl@163.com
Abstract: To address the low performance of fuzzy queries in conventional storage, an efficient fuzzy query method for large-text document data is proposed. An inverted index is built over the documents, and the index and some document information are loaded into memory to reduce disk I/O. Data are queried through the mapping between the in-memory inverted index and the primary keys in the database, the results are then sorted by a relevance algorithm, and a trie is used to provide search suggestions, achieving efficient full-text search. The experimental results show that, when using the same word set as ElasticSearch, the query efficiency of the designed full-text search engine is 80 to 1 200 times that of ElasticSearch, depending on the amount of test data, with the advantage varying in inverse proportion to the data volume. With 17 919 documents, memory usage does not exceed 2.5 GB, making the engine suitable for massive document data retrieval.
Keywords: inverted index; full-text search; search engine; fuzzy query; trie
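To make the pipeline summarized in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes whitespace tokenization, uses raw term frequency as a stand-in for the unspecified relevance algorithm, and treats integer document IDs as the database primary keys. The names `Trie`, `InvertedIndex`, `add_document`, `search`, and `suggest` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


class TrieNode:
    def __init__(self):
        self.children = {}    # character -> TrieNode
        self.is_word = False  # True if a complete term ends at this node


class Trie:
    """Prefix tree used here only to produce search suggestions."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def suggest(self, prefix, limit=5):
        # Walk down to the node that matches the prefix.
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        # Depth-first traversal; children are pushed in reverse-sorted
        # order so that terms come out in alphabetical order.
        results = []
        stack = [(node, prefix)]
        while stack and len(results) < limit:
            current, text = stack.pop()
            if current.is_word:
                results.append(text)
            for ch in sorted(current.children, reverse=True):
                stack.append((current.children[ch], text + ch))
        return results


class InvertedIndex:
    """In-memory inverted index: term -> {primary key: term frequency}."""

    def __init__(self):
        self.postings = defaultdict(dict)
        self.trie = Trie()

    def add_document(self, doc_id, text):
        # Assumption: whitespace tokenization; a real engine would use a
        # proper word segmenter, especially for Chinese text.
        for term in text.lower().split():
            self.postings[term][doc_id] = self.postings[term].get(doc_id, 0) + 1
            self.trie.insert(term)

    def search(self, query):
        # Assumption: relevance is the summed term frequency of query terms.
        scores = defaultdict(int)
        for term in query.lower().split():
            for doc_id, tf in self.postings.get(term, {}).items():
                scores[doc_id] += tf
        # Highest score first; ties broken by ascending primary key.
        return sorted(scores, key=lambda d: (-scores[d], d))
```

In this sketch, `search` returns only primary keys, so the full document text would be fetched from the database by key afterward. That mirrors the idea of keeping the index in memory while leaving bulk text on disk.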