Elasticsearch 扩容设计

The Unit of scale 扩容单元

  • A share is a unit of scale. A shard is a Lucene index.
  • Elasticsearch is a collection of shards. Your application talks to an Elasticsearch index, and Elasticsearch routes to the appropriate shard.
shard = hash(routing) % number_of_primary_shards

Shard Overallocation 分片预分配

  • Elaticsearch doesn't support sharding-split

Kagillion Shards 海量分片

  • 并不是分片越多越好
  • A shard is not free
  • A shard is a Lucene index under the covers, which uses file handles, memory, and CPU cycles. (多一个shard多一份资源利用)
  • Every search request needs to hit a copy of every shard in the index. That’s fine if every shard is sitting on a different node, but not if many shards have to compete for the same resources.
  • Term statistics, used to calculate relevance, are per shard. Having a small amount of data in many shards leads to poor relevance.

Capacity Planning (容量规划)

  • 预先计划好所需要的shard数目

Replica Shards

  • The main purpose of replicas is for failover(容错)
  • Increasing the number of replicas does not change the capacity of the index.(repica 不能扩容)
  • replica shards can serve read requests(在增加硬件的同时增加replica可以增加读取速度)

Multiple Indices

  • there is no rule that limits your application to using only a single index(一个application可以同时使用多个index)
  • Searching 1 index of 50 shards is exactly equivalent to searching 50 indices with 1 shard each: both search requests hit 50 shards.(搜索一个index的多个shard跟搜索多个index的多个1个shard 效果是一样的)
  • This can be a useful fact to remember when you need to add capacity on the fly. Instead of having to reindex your data into a bigger index, you can just do the following: (可以通过增加一个新的index来实现线上动态扩容)

Create a new index to hold new data.
Search across both indices to retrieve new and old data.

  • 以下是通过alias来实现动态扩容
PUT /tweets_1/_alias/tweets_search 
PUT /tweets_1/_alias/tweets_index 
POST /_aliases
{
  "actions": [
    { "add":    { "index": "tweets_2", "alias": "tweets_search" }}, 
    { "remove": { "index": "tweets_1", "alias": "tweets_index"  }}, 
    { "add":    { "index": "tweets_2", "alias": "tweets_index"  }}  
  ]
}

Time-Based Data

  • Index per Time Frame
  • If we were to have one big index for documents of this type, we would soon run out of space. Instead, use an index per time frame(per year, per month, per day)
  • 根据不同的时间段来创建不同的index,然后
POST /_aliases
{
  "actions": [
    { "add":    { "alias": "logs_current",  "index": "logs_2014-10" }}, 
    { "remove": { "alias": "logs_current",  "index": "logs_2014-09" }}, 
    { "add":    { "alias": "last_3_months", "index": "logs_2014-10" }}, 
    { "remove": { "alias": "last_3_months", "index": "logs_2014-07" }}  
  ]
}