
How to Install Elasticsearch Plugins


1. Installing an analyzer online, using analysis-icu as an example

What analysis-icu provides:

  • Advanced text analysis and processing built on the ICU (International Components for Unicode) library
  • Support for multilingual and complex Unicode text processing
  • Includes the ICU Tokenizer and the ICU Normalizer filter

Typical use cases for analysis-icu:

  • Multilingual text analysis across a wide range of languages
  • Unicode normalization and handling of complex characters
  • Advanced text processing such as regular-expression replacement and text transforms
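As a sketch of how these pieces fit together, the settings below define a custom analyzer that normalizes Unicode with the icu_normalizer char filter before tokenizing with icu_tokenizer. The index name my_index and analyzer name my_icu_analyzer are illustrative, not from the original article:

```json
PUT my_index
{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_icu_analyzer": {
                    "type": "custom",
                    "char_filter": ["icu_normalizer"],
                    "tokenizer": "icu_tokenizer"
                }
            }
        }
    }
}
```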

List installed plugins

cd /home/es/elasticsearch-8.17.0
bin/elasticsearch-plugin list

Install a plugin online (restart the node afterwards so the plugin is loaded)

cd /home/es/elasticsearch-8.17.0
bin/elasticsearch-plugin install analysis-icu

Remove a plugin

cd /home/es/elasticsearch-8.17.0
bin/elasticsearch-plugin remove analysis-icu

Test the analyzer

POST _analyze
{
    "analyzer": "icu_analyzer",
    "text": "中华人民共和国"
}

Test result

{
    "tokens": [
        {
            "token": "中华",
            "start_offset": 0,
            "end_offset": 2,
            "type": "<IDEOGRAPHIC>",
            "position": 0
        },
        {
            "token": "人民",
            "start_offset": 2,
            "end_offset": 4,
            "type": "<IDEOGRAPHIC>",
            "position": 1
        },
        {
            "token": "共和国",
            "start_offset": 4,
            "end_offset": 7,
            "type": "<IDEOGRAPHIC>",
            "position": 2
        }
    ]
}

2. Installing an analyzer offline, using the IK Chinese analyzer as an example

Manually download the plugin package, upload it to the plugins directory under the Elasticsearch installation directory, and then restart the Elasticsearch instance.

IK Chinese analyzer plugin source repository: https://github.com/infinilabs/analysis-ik

The IK plugin version must exactly match the Elasticsearch version; a mismatch causes compatibility problems and Elasticsearch will fail to start.

This article uses version 8.17.0. If the source repository's releases do not include a matching version, you can download one from: https://release.infinilabs.com/analysis-ik/stable/
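The offline steps described above can be sketched as the shell commands below. The zip filename and paths are illustrative assumptions; adjust them to your environment. The elasticsearch-plugin tool also accepts a local file URL, which is often simpler than unzipping by hand:

```shell
# Assumed install path and package name; adjust to your environment
cd /home/es/elasticsearch-8.17.0

# Option 1: install the downloaded zip via the plugin tool
bin/elasticsearch-plugin install file:///home/es/elasticsearch-analysis-ik-8.17.0.zip

# Option 2: unzip the package into its own directory under plugins/
mkdir plugins/analysis-ik
unzip /home/es/elasticsearch-analysis-ik-8.17.0.zip -d plugins/analysis-ik

# Either way, restart the Elasticsearch instance so the plugin is loaded
```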

standard mode

# Default analyzer mode: standard, which splits Chinese text into single characters
POST _analyze
{
    "analyzer": "standard",
    "text": "中华人民共和国"
}

# Test result
{
    "tokens": [
        {
            "token": "中",
            "start_offset": 0,
            "end_offset": 1,
            "type": "<IDEOGRAPHIC>",
            "position": 0
        },
        {
            "token": "华",
            "start_offset": 1,
            "end_offset": 2,
            "type": "<IDEOGRAPHIC>",
            "position": 1
        },
        {
            "token": "人",
            "start_offset": 2,
            "end_offset": 3,
            "type": "<IDEOGRAPHIC>",
            "position": 2
        },
        {
            "token": "民",
            "start_offset": 3,
            "end_offset": 4,
            "type": "<IDEOGRAPHIC>",
            "position": 3
        },
        {
            "token": "共",
            "start_offset": 4,
            "end_offset": 5,
            "type": "<IDEOGRAPHIC>",
            "position": 4
        },
        {
            "token": "和",
            "start_offset": 5,
            "end_offset": 6,
            "type": "<IDEOGRAPHIC>",
            "position": 5
        },
        {
            "token": "国",
            "start_offset": 6,
            "end_offset": 7,
            "type": "<IDEOGRAPHIC>",
            "position": 6
        }
    ]
}

ik_smart mode

# Analyzer mode ik_smart: coarsest-grained splitting, suitable for tagging scenarios
POST _analyze
{
    "analyzer": "ik_smart",
    "text": "中华人民共和国"
}

# Test result
{
    "tokens": [
        {
            "token": "中华人民共和国",
            "start_offset": 0,
            "end_offset": 7,
            "type": "CN_WORD",
            "position": 0
        }
    ]
}

#############################################

POST _analyze
{
    "analyzer": "ik_smart",
    "text": "中华渔船"
}

# Test result
{
    "tokens": [
        {
            "token": "中华",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "渔船",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 1
        }
    ]
}
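When IK's built-in dictionary misses domain-specific terms, it can be extended through its IKAnalyzer.cfg.xml configuration file (found under the plugin's config directory in typical installations). The sketch below assumes an extension dictionary file named custom.dic, one word per line; the filename is illustrative, and the node must be restarted for dictionary changes to take effect:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- extension dictionary: one word per line, UTF-8 encoded -->
    <entry key="ext_dict">custom.dic</entry>
    <!-- extension stopword dictionary -->
    <entry key="ext_stopwords">stopword.dic</entry>
</properties>
```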

ik_max_word mode

# Analyzer mode ik_max_word: finest-grained splitting, suitable for fuzzy-match query scenarios
POST _analyze
{
    "analyzer": "ik_max_word",
    "text": "中华人民共和国"
}

# Test result
{
    "tokens": [
        {
            "token": "中华人民共和国",
            "start_offset": 0,
            "end_offset": 7,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "中华人民",
            "start_offset": 0,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "中华",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 2
        },
        {
            "token": "华人",
            "start_offset": 1,
            "end_offset": 3,
            "type": "CN_WORD",
            "position": 3
        },
        {
            "token": "人民共和国",
            "start_offset": 2,
            "end_offset": 7,
            "type": "CN_WORD",
            "position": 4
        },
        {
            "token": "人民",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 5
        },
        {
            "token": "共和国",
            "start_offset": 4,
            "end_offset": 7,
            "type": "CN_WORD",
            "position": 6
        },
        {
            "token": "共和",
            "start_offset": 4,
            "end_offset": 6,
            "type": "CN_WORD",
            "position": 7
        },
        {
            "token": "国",
            "start_offset": 6,
            "end_offset": 7,
            "type": "CN_WORD",
            "position": 8
        }
    ]
}
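The two IK modes are commonly combined in one mapping: ik_max_word at index time to maximize recall, and ik_smart at query time for less noisy matching. The index name my_index and field name content below are illustrative:

```json
PUT my_index
{
    "mappings": {
        "properties": {
            "content": {
                "type": "text",
                "analyzer": "ik_max_word",
                "search_analyzer": "ik_smart"
            }
        }
    }
}
```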
Tags: elasticsearch

  • Author: 一介闲人
  • Published: 2025-03-12 13:48
  • Copyright: original work; reposts must retain attribution
  • WeChat official account reposts: please append a link to this article at the end