
How to Install Elasticsearch Plugins


1. Installing an analyzer online, using analysis-icu as an example

What analysis-icu provides:

  • Advanced text analysis and processing built on the ICU (International Components for Unicode) library
  • Support for multilingual and complex Unicode text processing
  • Includes the ICU Tokenizer and the ICU Normalizer filter

Typical use cases for analysis-icu:

  • Multilingual text analysis across a wide range of languages
  • Unicode normalization and handling of complex characters
  • Advanced text processing such as regular-expression replacement and text transforms
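As a sketch of how these pieces fit together, the settings below define a custom analyzer that normalizes Unicode with the icu_normalizer char filter before tokenizing with icu_tokenizer. The index name my_index and analyzer name my_icu_analyzer are illustrative, not from the original article:

```json
PUT my_index
{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_icu_analyzer": {
                    "type": "custom",
                    "char_filter": ["icu_normalizer"],
                    "tokenizer": "icu_tokenizer"
                }
            }
        }
    }
}
```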

List installed plugins

cd /home/es/elasticsearch-8.17.0
bin/elasticsearch-plugin list

Install a plugin online (restart the node afterwards so the plugin is loaded)

cd /home/es/elasticsearch-8.17.0
bin/elasticsearch-plugin install analysis-icu

Remove a plugin

cd /home/es/elasticsearch-8.17.0
bin/elasticsearch-plugin remove analysis-icu

Test the analyzer

POST _analyze
{
    "analyzer": "icu_analyzer",
    "text": "中华人民共和国"
}

Test result

{
    "tokens": [
        {
            "token": "中华",
            "start_offset": 0,
            "end_offset": 2,
            "type": "<IDEOGRAPHIC>",
            "position": 0
        },
        {
            "token": "人民",
            "start_offset": 2,
            "end_offset": 4,
            "type": "<IDEOGRAPHIC>",
            "position": 1
        },
        {
            "token": "共和国",
            "start_offset": 4,
            "end_offset": 7,
            "type": "<IDEOGRAPHIC>",
            "position": 2
        }
    ]
}

2. Installing an analyzer offline, using the IK Chinese analyzer as an example

Manually download the plugin package, upload it to the plugins directory under the Elasticsearch installation directory, and then restart the Elasticsearch instance.

IK Chinese analyzer plugin source repository: https://github.com/infinilabs/analysis-ik

The IK plugin version must exactly match the Elasticsearch version; a mismatch causes compatibility problems and Elasticsearch will fail to start.

This article uses version 8.17.0. If the source repository's releases do not include a matching version, you can download one from: https://release.infinilabs.com/analysis-ik/stable/
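The offline steps described above can be sketched as the shell commands below. The zip filename and paths are illustrative assumptions; adjust them to your environment. The elasticsearch-plugin tool also accepts a local file URL, which is often simpler than unzipping by hand:

```shell
# Assumed install path and package name; adjust to your environment
cd /home/es/elasticsearch-8.17.0

# Option 1: install the downloaded zip via the plugin tool
bin/elasticsearch-plugin install file:///home/es/elasticsearch-analysis-ik-8.17.0.zip

# Option 2: unzip the package into its own directory under plugins/
mkdir plugins/analysis-ik
unzip /home/es/elasticsearch-analysis-ik-8.17.0.zip -d plugins/analysis-ik

# Either way, restart the Elasticsearch instance so the plugin is loaded
```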

standard mode

# Default analyzer mode: standard, which splits Chinese text into single characters
POST _analyze
{
    "analyzer": "standard",
    "text": "中华人民共和国"
}

# Test result
{
    "tokens": [
        {
            "token": "中",
            "start_offset": 0,
            "end_offset": 1,
            "type": "<IDEOGRAPHIC>",
            "position": 0
        },
        {
            "token": "华",
            "start_offset": 1,
            "end_offset": 2,
            "type": "<IDEOGRAPHIC>",
            "position": 1
        },
        {
            "token": "人",
            "start_offset": 2,
            "end_offset": 3,
            "type": "<IDEOGRAPHIC>",
            "position": 2
        },
        {
            "token": "民",
            "start_offset": 3,
            "end_offset": 4,
            "type": "<IDEOGRAPHIC>",
            "position": 3
        },
        {
            "token": "共",
            "start_offset": 4,
            "end_offset": 5,
            "type": "<IDEOGRAPHIC>",
            "position": 4
        },
        {
            "token": "和",
            "start_offset": 5,
            "end_offset": 6,
            "type": "<IDEOGRAPHIC>",
            "position": 5
        },
        {
            "token": "国",
            "start_offset": 6,
            "end_offset": 7,
            "type": "<IDEOGRAPHIC>",
            "position": 6
        }
    ]
}

ik_smart mode

# Analyzer mode ik_smart: coarsest-grained splitting, suitable for tagging scenarios
POST _analyze
{
    "analyzer": "ik_smart",
    "text": "中华人民共和国"
}

# Test result
{
    "tokens": [
        {
            "token": "中华人民共和国",
            "start_offset": 0,
            "end_offset": 7,
            "type": "CN_WORD",
            "position": 0
        }
    ]
}

#############################################

POST _analyze
{
    "analyzer": "ik_smart",
    "text": "中华渔船"
}

# Test result
{
    "tokens": [
        {
            "token": "中华",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "渔船",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 1
        }
    ]
}
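When IK's built-in dictionary misses domain-specific terms, it can be extended through its IKAnalyzer.cfg.xml configuration file (found under the plugin's config directory in typical installations). The sketch below assumes an extension dictionary file named custom.dic, one word per line; the filename is illustrative, and the node must be restarted for dictionary changes to take effect:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- extension dictionary: one word per line, UTF-8 encoded -->
    <entry key="ext_dict">custom.dic</entry>
    <!-- extension stopword dictionary -->
    <entry key="ext_stopwords">stopword.dic</entry>
</properties>
```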

ik_max_word mode

# Analyzer mode ik_max_word: finest-grained splitting, suitable for fuzzy-match query scenarios
POST _analyze
{
    "analyzer": "ik_max_word",
    "text": "中华人民共和国"
}

# Test result
{
    "tokens": [
        {
            "token": "中华人民共和国",
            "start_offset": 0,
            "end_offset": 7,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "中华人民",
            "start_offset": 0,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "中华",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 2
        },
        {
            "token": "华人",
            "start_offset": 1,
            "end_offset": 3,
            "type": "CN_WORD",
            "position": 3
        },
        {
            "token": "人民共和国",
            "start_offset": 2,
            "end_offset": 7,
            "type": "CN_WORD",
            "position": 4
        },
        {
            "token": "人民",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 5
        },
        {
            "token": "共和国",
            "start_offset": 4,
            "end_offset": 7,
            "type": "CN_WORD",
            "position": 6
        },
        {
            "token": "共和",
            "start_offset": 4,
            "end_offset": 6,
            "type": "CN_WORD",
            "position": 7
        },
        {
            "token": "国",
            "start_offset": 6,
            "end_offset": 7,
            "type": "CN_WORD",
            "position": 8
        }
    ]
}
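The two IK modes are commonly combined in one mapping: ik_max_word at index time to maximize recall, and ik_smart at query time for less noisy matching. The index name my_index and field name content below are illustrative:

```json
PUT my_index
{
    "mappings": {
        "properties": {
            "content": {
                "type": "text",
                "analyzer": "ik_max_word",
                "search_analyzer": "ik_smart"
            }
        }
    }
}
```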
Tags: elasticsearch

  • Author: 一介闲人
  • Published: 2025-03-12 13:48
  • Copyright: original work; reposts must retain attribution
  • WeChat official account reposts: please append a link to this article at the end