Elasticsearch update API vs. update_by_query: which one is better?

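It has been a while since I last wrote up any notes. Today I want to correct a long-standing misconception about how we use ES, one that you have quite likely fallen for too. Many people use update_by_query as if it had the semantics of a MySQL UPDATE ... WHERE statement, and that is a serious mistake. It is worth revisiting my earlier post on update_by_query here. In principle, the near-real-time behavior of ES applies only to search, whereas a GET by id is real-time.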

Practice is the sole criterion for testing truth, so let's look at the demo below.

<!-- 1. Disable refresh -->
PUT http://xxx:9200/mytest_user/_settings
{
    "index" : {
        "refresh_interval" : -1
    }
}
<!-- 2. Index a new document -->
PUT http://xxx:9200/mytest_user/_doc/2
{
    "product_type": "test",
    "product_code": "324049",
    "shop_code": "9N72"
}
<!-- 3. Update by id; these requests can run concurrently, yet no update is lost -->
POST http://xxx:9200/mytest_user/_doc/2/_update
{
    "doc" : {
        "name" : "new_name"
    }
}
POST http://xxx:9200/mytest_user/_doc/2/_update
{
    "doc" : {
        "name" : "alex",
        "age" : 20
    }
}
...
<!-- 4. update_by_query; this has no effect -->
POST http://xxx:9200/mytest_user/_update_by_query?conflicts=proceed
{
    "query" : {
        "term" : { "product_code": "324049" }
    },
    "script": {
        "source": "ctx._source.en_product_name='cn';ctx._source.plu_code='00';"
    }
}
<!-- 5. GET by id returns the latest data -->
GET http://xxx:9200/mytest_user/_doc/2

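That's right: _update_by_query relies on search, so nothing happens, because the document has not been refreshed and is not yet visible to search. The update API, by contrast, works because it builds on the real-time behavior of the get API (it first does a GET by document ID, then modifies the latest document and writes it back). The get API also has a parameter that switches it to non-real-time (http://xxx:9200/mytest_user/_doc/4?realtime=false).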
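The difference between the two read paths can be sketched with a toy in-memory model. This is only an illustration under simplifying assumptions: `ToyShard` and all of its methods are hypothetical names, not Elasticsearch APIs, and a real-time read is modeled simply as "look at unrefreshed writes first".

```python
class ToyShard:
    """Toy model of one shard: writes land in a buffer, search sees only refreshed docs."""

    def __init__(self):
        self.buffer = {}      # written but not yet refreshed
        self.searchable = {}  # what search can see (state as of the last refresh)

    def index(self, doc_id, source):
        self.buffer[doc_id] = dict(source)

    def refresh(self):
        self.searchable.update(self.buffer)
        self.buffer.clear()

    def get(self, doc_id, realtime=True):
        # Simplification: a real-time read looks at unrefreshed writes first.
        if realtime and doc_id in self.buffer:
            return dict(self.buffer[doc_id])
        doc = self.searchable.get(doc_id)
        return None if doc is None else dict(doc)

    def update(self, doc_id, partial):
        # Update by id: real-time GET, merge the partial doc, write back.
        doc = self.get(doc_id, realtime=True)
        if doc is None:
            raise KeyError(doc_id)
        doc.update(partial)
        self.index(doc_id, doc)

    def update_by_query(self, field, value, changes):
        # Search-based: only documents visible to search can be matched.
        hits = [i for i, src in self.searchable.items() if src.get(field) == value]
        for doc_id in hits:
            self.index(doc_id, {**self.searchable[doc_id], **changes})
        return len(hits)


shard = ToyShard()
shard.index("2", {"product_type": "test", "product_code": "324049", "shop_code": "9N72"})
shard.update("2", {"name": "new_name"})
shard.update("2", {"name": "alex", "age": 20})
updated = shard.update_by_query("product_code", "324049", {"plu_code": "00"})
print(updated)                         # number of documents matched by the search
print(shard.get("2"))                  # real-time read
print(shard.get("2", realtime=False))  # read restricted to refreshed data
```

In this model the search-based update matches nothing until `refresh()` is called, while the id-based updates always see the latest write.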

realtime

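According to the official documentation, the get API is real-time by default and is not affected by the index refresh rate (the point at which data becomes visible to search). If a document has been updated but not yet refreshed, the get API issues a refresh call to make the document visible. This also makes other documents changed since the last refresh visible. To disable real-time GET, set the realtime parameter to false.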

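Neither the documentation nor the source code of the update API offers a parameter to "disable" real-time behavior. When update calls GET, the realtime argument it passes is hard-coded to true.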

Why does the GET have to be real-time?

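update allows partial updates to a document's fields. Suppose two requests update different fields. Without a real-time read, the first update may still live only in the writer buffer and be invisible to the searcher, so the second update would be applied on top of the old version of the document, and part of the update would be lost.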
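A minimal sketch of this lost-update scenario, assuming a toy model in which `buffer` holds unrefreshed writes and `searchable` holds the last refreshed state (the names `partial_update` and `run` are hypothetical, used only for illustration):

```python
def partial_update(buffer, searchable, doc_id, partial, realtime):
    """Read-modify-write: read the current doc, merge the partial fields, write to the buffer."""
    if realtime and doc_id in buffer:
        current = buffer[doc_id]              # sees the unrefreshed write
    else:
        current = searchable.get(doc_id, {})  # sees only refreshed data
    merged = {**current, **partial}
    buffer[doc_id] = merged
    return merged


def run(realtime):
    searchable = {"2": {"product_code": "324049"}}  # last refreshed version
    buffer = {}                                     # unrefreshed writes
    partial_update(buffer, searchable, "2", {"name": "new_name"}, realtime)
    partial_update(buffer, searchable, "2", {"age": 20}, realtime)
    return buffer["2"]


print(run(realtime=False))  # second update was built on the stale version
print(run(realtime=True))   # second update was built on the first one
```

With `realtime=False` the second write starts from the refreshed copy and silently drops the `name` field set by the first write; with `realtime=True` both fields survive.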

The conclusion above can be observed through the cat segments API: http://xxx:9200/_cat/segments/mytest_user?v

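If realtime is set to false, the document is fetched from the searcher, and the searcher can only access data that has already been refreshed. Newly written data sits in the index writer buffer and cannot be searched yet, so data obtained this way is near-real-time.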

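In 5.x and later, a real-time GET can see data that is still in the index writer buffer, and it does so by forcing a refresh (independent of refresh_interval), which produces a new segment file. If the same doc id is updated over and over within a short time, the overly frequent refreshes create many small segments in quick succession, which then have to be merged continually, and that costs performance. The official position is that this loss in a relatively rare scenario is an acceptable price for better performance in most use cases, and that it should be handled at the application level.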
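The text does not say how to handle this at the application level. One possible approach, shown here purely as an assumption, is to merge pending partial updates for the same doc id on the client before sending them, so that each id is updated once per batch. `coalesce_updates` is a hypothetical helper, and it assumes flat (non-nested) fields.

```python
def coalesce_updates(updates):
    """Merge partial updates per doc id; later values override earlier ones.

    `updates` is an iterable of (doc_id, partial_doc) pairs in arrival order.
    Assumes flat fields: nested objects are replaced, not merged recursively.
    """
    merged = {}
    for doc_id, partial in updates:
        merged.setdefault(doc_id, {}).update(partial)
    return merged


pending = [
    ("2", {"name": "new_name"}),
    ("2", {"name": "alex", "age": 20}),
    ("3", {"shop_code": "9N72"}),
]
for doc_id, doc in coalesce_updates(pending).items():
    # One update request per id would be sent here instead of one per change.
    print(doc_id, doc)
```

Whether batching like this is acceptable depends on how quickly each change must become visible, so it is a trade-off rather than a general fix.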

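Note: when updating a document with update, if upsert is not enabled, updating a document id that does not exist throws a document missing exception. Throwing and catching large numbers of these exceptions is expensive.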
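As a small illustration, the update API request body accepts a `doc_as_upsert` flag that inserts the partial document when the id does not exist yet. The sketch below only builds such a body; `build_update_body` is a hypothetical helper, and sending the request is left out.

```python
import json


def build_update_body(partial_doc, upsert=True):
    """Build an update request body; doc_as_upsert inserts the doc if the id is missing."""
    body = {"doc": dict(partial_doc)}
    if upsert:
        body["doc_as_upsert"] = True
    return body


print(json.dumps(build_update_body({"name": "alex", "age": 20}), indent=2))
```

With the flag set, a missing id leads to an insert instead of a document missing exception.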

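In 2.4, real-time GET was not implemented through refresh; instead, the translog was read directly to guarantee that GET was real-time. In the change https://github.com/elastic/elasticsearch/pull/20102 this was switched to refresh. The rationale was that ES used the translog to track data locations in many places, which made many operations slow, and removing the dependency on the translog improves performance.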

Verifying in the code

The implementation does indeed contain both the realtime parameter and a call to refresh("realtime_get");

// From core/src/main/java/org/elasticsearch/index/engine/InternalEngine.java
public GetResult get(Get get, Function<String, Searcher> searcherFactory, LongConsumer onRefresh) throws EngineException {
        assert Objects.equals(get.uid().field(), uidField) : get.uid().field();
        try (ReleasableLock lock = readLock.acquire()) {
            ensureOpen();
            if (get.realtime()) {
                VersionValue versionValue = versionMap.getUnderLock(get.uid());
                if (versionValue != null) {
                    if (versionValue.isDelete()) {
                        return GetResult.NOT_EXISTS;
                    }
                    if (get.versionType().isVersionConflictForReads(versionValue.getVersion(), get.version())) {
                        throw new VersionConflictEngineException(shardId, get.type(), get.id(),
                            get.versionType().explainConflictForReads(versionValue.getVersion(), get.version()));
                    }
                    long time = System.nanoTime();
                    refresh("realtime_get");
                    onRefresh.accept(System.nanoTime() - time);
                }
            }

            // no version, get the version from the index, we know that we refresh on flush
            return getFromSearcher(get, searcherFactory);
        }
    }

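It should now be clear that if your workload updates ES heavily, you should first consider id-based updates that rely on the real-time get API (which requires knowing the _id). Otherwise, whether you use update_by_query or hand-write similar semantics yourself (search first, then update), you have to accept the risk of lost updates.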

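Copyright notice: this article comes from CSDN, with thanks to the original author. It is licensed under CC 4.0 BY-SA; when reposting, please include a link to the original source and this notice.
Original link: https://blog.csdn.net/itsoftchenfei/article/details/104424823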

Published 2020-02-25 00:30:44