Elasticsearch - Primary/Replica Inconsistent Scoring
We have a cluster with 3 primary shards and 2 replicas per primary. The total doc count is the same on the primary and replica shards; however, we're getting 3 distinct scores for the same query and document. When we add preference=_primary as a query parameter, we get consistent scores each time.
The only explanation I can think of is different df counts between the primary and the replicas. If there is an inconsistency between the primary and replica shards, how does one go about fixing it? We're using 1.4.2.
Edit: We reindexed the doctype we are querying, and there's still inconsistent scoring.
Primary and replica shards follow different "paths" when it comes to segment merging. Meaning, the number and size of segments can differ between them. Each shard takes care of its own segments, independently of the other shards.
Why this matters when it comes to calculating the score is that merging is the moment when deleted documents are actually deleted. Until then, deleted documents are only marked as deleted (and are taken out of the query results after the query has run). So, this means they can influence the algorithm by which the score is calculated.
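A toy model may make this more concrete. The sketch below is not Lucene or Elasticsearch code; `Segment`, `merge` and `shard_counts` are hypothetical names invented for illustration. It only assumes what is described above: a delete merely marks a document, and a merge is the point where marked documents are physically dropped.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """Hypothetical stand-in for a segment: stored doc ids plus delete marks."""
    docs: list
    deleted: set = field(default_factory=set)

    def delete(self, doc_id):
        # A delete only marks the document; it stays in the segment.
        if doc_id in self.docs:
            self.deleted.add(doc_id)

    @property
    def max_doc(self):
        # Counts every stored document, including those marked as deleted.
        return len(self.docs)

    @property
    def live_docs(self):
        return len(self.docs) - len(self.deleted)


def merge(segments):
    """Combine segments, physically dropping documents marked as deleted."""
    kept = [d for s in segments for d in s.docs if d not in s.deleted]
    return Segment(docs=kept)


def shard_counts(segments):
    """Return (stored docs, live docs) summed over a shard copy's segments."""
    return (sum(s.max_doc for s in segments),
            sum(s.live_docs for s in segments))


# Two copies of the same shard receive the same writes...
primary = [Segment(docs=[1, 2, 3]), Segment(docs=[4, 5])]
replica = [Segment(docs=[1, 2, 3]), Segment(docs=[4, 5])]
for shard_copy in (primary, replica):
    shard_copy[0].delete(2)

# ...but only the primary happens to have merged its segments so far.
primary = [merge(primary)]

print(shard_counts(primary))
print(shard_counts(replica))
```

Both copies report the same number of live documents, which matches the identical doc counts in the question, yet the number of stored documents differs until the replica merges as well.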
To be more specific, the total number of docs in a shard is used in the [idf calculation](http://lucene.apache.org/core/4_3_0/core/org/apache/lucene/search/similarities/defaultsimilarity.html#idf(long, long)), together with the document frequency (docFreq):
    return (float)(Math.log(numDocs/(double)(docFreq+1)) + 1.0);
And this number of docs includes the deleted (marked as deleted, to be more precise) documents. Also take a look at this GitHub issue and Simon's comments regarding the same subject.
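To get a feel for the size of the effect, here is a minimal Python transcription of the formula above (the original is Java and casts the result to a float, which this sketch skips). The counts are made up for illustration: both copies of the shard are assumed to hold the same 1,000 live documents, while the replica still carries 200 documents marked as deleted, none of which contain the term.

```python
import math


def idf(doc_freq, num_docs):
    # Python transcription of the Java formula quoted above.
    return math.log(num_docs / (doc_freq + 1)) + 1.0


# Made-up counts for one term on two copies of the same shard.
primary_idf = idf(doc_freq=50, num_docs=1000)  # no marked-deleted docs left
replica_idf = idf(doc_freq=50, num_docs=1200)  # 200 marked-deleted docs remain

print(f"primary idf: {primary_idf:.4f}")
print(f"replica idf: {replica_idf:.4f}")
```

With these assumed numbers the replica produces a higher idf for the same term, so the same document would score differently depending on which copy of the shard serves the query.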