Cluster monitoring, cluster state, metrics. How clusters work. Cluster management tools

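What drains my willpower and makes me want to stop reading after a short while? It is running into endless material I am not interested in, or a lot of material I think I already know.
So the countermeasure is to keep checking the table of contents, predict what each chapter covers, and ask whether I already know it, whether I am interested, and whether I need to read it. If I don't want to read a chapter, I simply skip it.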
https://www.elastic.co/guide/en/elasticsearch/guide/current/distributed-cluster.html

  • One node in the cluster is elected to be the master node, which is in charge of managing cluster-wide changes like creating or deleting an index, or adding or removing a node from the cluster.
  • The master node does not need to be involved in document-level changes or searches, which means that having just one master node will not become a bottleneck as traffic grows.
  • Any node can become the master.
  • As users, we can talk to any node in the cluster, including the master node.
  • Every node knows where each document lives and can forward our request directly to the nodes that hold the data we are interested in.
  • Whichever node we talk to manages the process of gathering the response from the node or nodes holding the data and returning the final response to the client.

Cluster Health

  • get cluster health
GET _cluster/health
{
   "cluster_name":          "elasticsearch",
   "status":                "green", 
   "timed_out":             false,
   "number_of_nodes":       1,
   "number_of_data_nodes":  1,
   "active_primary_shards": 0,
   "active_shards":         0,
   "relocating_shards":     0,
   "initializing_shards":   0,
   "unassigned_shards":     0
}
  • status field
green
All primary and replica shards are active.
yellow
All primary shards are active, but not all replica shards are active.
red
Not all primary shards are active.
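
The three status values can be restated as a small function. This is only a sketch of the definitions above: the function name and the (is_primary, is_active) input format are made up for this note, and a real cluster reports its status directly in GET _cluster/health.

```python
def cluster_status(shards):
    """Derive green/yellow/red from a list of (is_primary, is_active) tuples.

    The input format is hypothetical; it only mirrors the definitions above.
    """
    primaries_ok = all(active for is_primary, active in shards if is_primary)
    replicas_ok = all(active for is_primary, active in shards if not is_primary)
    if not primaries_ok:
        return "red"      # at least one primary shard is not active
    if not replicas_ok:
        return "yellow"   # all primaries active, some replica is not
    return "green"        # every primary and replica is active


print(cluster_status([(True, True), (False, True)]))    # green
print(cluster_status([(True, True), (False, False)]))   # yellow
print(cluster_status([(True, False), (False, True)]))   # red
```

An empty list yields green, which matches the sample response above (zero shards, status green).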

Add an index

  • An index is just a logical namespace that points to one or more physical shards.
  • TODO (Inside a shard) How shard works.
  • A shard is a low-level worker unit that holds just a slice of all the data in the index.
  • A shard is a single instance of Lucene, and is a complete search engine in its own right.
  • TODO (How Lucene works).
  • Our documents are stored and indexed in shards, but our applications don’t talk to them directly. Instead, they talk to an index.
  • Documents are stored in shards, and shards are allocated to nodes in your cluster.
  • As your cluster grows or shrinks, Elasticsearch will automatically migrate shards between nodes so that the cluster remains balanced.
  • A shard can be either a primary shard or a replica shard.
  • Each document in your index belongs to a single primary shard, so the number of primary shards that you have determines the maximum amount of data that your index can hold.
  • A replica shard is just a copy of a primary shard. Replicas are used to provide redundant copies of your data to protect against hardware failure, and to serve read requests like searching or retrieving a document.
  • The number of primary shards in an index is fixed at the time that an index is created, but the number of replica shards can be changed at any time.

Prefer Unicast over Multicast

  • Elasticsearch is configured to use unicast discovery out of the box to prevent nodes from accidentally joining a cluster. Only nodes running on the same machine will automatically form a cluster.
  • To use unicast, you provide Elasticsearch a list of nodes that it should try to contact. When a node contacts a member of the unicast list, it receives a full cluster state that lists all of the nodes in the cluster. It then contacts the master and joins the cluster.
  • This means your unicast list does not need to include all of the nodes in your cluster. It just needs enough nodes that a new node can find someone to talk to.

Zen Discovery

  • The zen discovery is the built in discovery module for elasticsearch and the default. It provides unicast discovery, but can be extended to support cloud environments and other forms of discovery.
  • It is separated into several submodules: Ping, Unicast, and Master Election.
  • Unicast discovery requires a list of hosts to use that will act as gossip routers.
  • With the Java security manager in place, the JVM defaults to caching positive hostname resolutions indefinitely. This can be modified by adding networkaddress.cache.ttl= to your Java security policy.
  • With the Java security manager in place, the JVM defaults to caching negative hostname resolutions for ten seconds. This can be modified by adding networkaddress.cache.negative.ttl= to your Java security policy.

Scale Horizontally

  • The number of primary shards is fixed at the moment an index is created. Effectively, that number defines the maximum amount of data that can be stored in the index. (The actual number depends on your data, your hardware and your use case.) However, read requests—searches or document retrieval—can be handled by a primary or a replica shard, so the more copies of data that you have, the more search throughput you can handle.

Dealing with Conflicts

  • Elasticsearch uses optimistic concurrency control.
  • We can take advantage of the _version number to ensure that conflicting changes made by our application do not result in data loss. We do this by specifying the version number of the document that we wish to change. If that version is no longer current, our request fails.
PUT /website/blog/1?version=1 
{
  "title": "My first blog entry",
  "text":  "Starting to get the hang of this..."
}
  • Using Versions from an External System
  • The way external version numbers are handled is a bit different from the internal version numbers we discussed previously. Instead of checking that the current _version is the same as the one specified in the request, Elasticsearch checks that the current _version is less than the specified version. If the request succeeds, the external version number is stored as the document’s new _version.
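
The two version checks described above can be compared side by side in a toy in-memory store. This is a sketch, not Elasticsearch code: TinyStore, VersionConflict, and the external flag are hypothetical names chosen for this note. It assumes internal versions start at 1 and increase by 1 on every write.

```python
class VersionConflict(Exception):
    """Raised when a write carries a version that fails the check."""


class TinyStore:
    """In-memory stand-in for a versioned document store (hypothetical)."""

    def __init__(self):
        self.docs = {}  # doc_id -> (version, body)

    def index(self, doc_id, body, version=None, external=False):
        current = self.docs.get(doc_id, (0, None))[0]
        if version is None:
            new_version = current + 1            # no check requested
        elif external:
            # External versioning: stored version must be LESS than the given one.
            if not current < version:
                raise VersionConflict(f"stored {current} >= given {version}")
            new_version = version                # the given number is stored as-is
        else:
            # Internal versioning: given version must EQUAL the stored one.
            if version != current:
                raise VersionConflict(f"stored {current} != given {version}")
            new_version = current + 1
        self.docs[doc_id] = (new_version, body)
        return new_version


store = TinyStore()
store.index("1", {"title": "My first blog entry"})             # version becomes 1
store.index("1", {"title": "Edited"}, version=1)               # ok, version becomes 2
try:
    store.index("1", {"title": "Stale edit"}, version=1)       # version 1 is no longer current
except VersionConflict as err:
    print("conflict:", err)

store.index("2", {"title": "From another system"}, version=5, external=True)
print(store.docs["2"][0])  # 5
```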

Distributed document store

Routing a Document to a shard

  • When you index a document, it is stored on a single primary shard. How does Elasticsearch know which shard a document belongs to?
  • It is determined by a simple formula:
shard = hash(routing) % number_of_primary_shards
  • The routing value is an arbitrary string, which defaults to the document’s _id but can also be set to a custom value.

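  • This explains why the number of primary shards can be set only when an index is created and never changed: if the number changed, the formula would send existing routing values to different shards and documents could no longer be found.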

  • Users sometimes think that having a fixed number of primary shards makes it difficult to scale out an index later. In reality, there are techniques that make it easy to scale out as and when you need. We talk more about these in Designing for Scale.

  • TODO (how to scale out when the number of primary shards is fixed; see Designing for Scale)

  • All document APIs (get, index, delete, bulk, update, and mget) accept a routing parameter that can be used to customize the document-to-shard mapping. A custom routing value could be used to ensure that all related documents (for instance, all the documents belonging to the same user) are stored on the same shard. We discuss in detail why you may want to do this in Designing for Scale. (I did not understand this passage at all and need to reread it closely.)

  • By default, shard placement — or routing — is controlled by using a hash of the document’s id value. For more explicit control, the value fed into the hash function used by the router can be directly specified on a per-operation basis using the routing parameter

  • When setting up explicit mapping, the _routing field can be optionally used to direct the index operation to extract the routing value from the document itself. This does come at the (very minimal) cost of an additional document parsing pass. If the _routing mapping is defined and set to be required, the index operation will fail if no routing value is provided or extracted.
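
The routing formula from this section can be tried out in a few lines. This is a sketch under stated assumptions: zlib.crc32 is used only because it is stable across runs (it is not the hash Elasticsearch uses), and shard_for plus the sample ids and the user_42 routing value are made up for illustration.

```python
import zlib


def shard_for(routing, number_of_primary_shards):
    """shard = hash(routing) % number_of_primary_shards.

    zlib.crc32 is a stand-in hash, not the one Elasticsearch uses.
    """
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards


doc_ids = ["1", "2", "3", "4", "5", "6"]

# Default routing: the document _id is the routing value.
with_5 = {doc_id: shard_for(doc_id, 5) for doc_id in doc_ids}
with_6 = {doc_id: shard_for(doc_id, 6) for doc_id in doc_ids}

# Any id listed here would be looked up on a different shard after the change,
# which is why the number of primary shards cannot be changed later.
moved = [doc_id for doc_id in doc_ids if with_5[doc_id] != with_6[doc_id]]
print("5 primaries:", with_5)
print("6 primaries:", with_6)
print("ids that map to a different shard:", moved)

# Custom routing: every document indexed with routing="user_42" shares a shard,
# regardless of its own _id.
print("shard for routing user_42:", shard_for("user_42", 5))
```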

Creating, indexing, or deleting a single document

  • There are a number of optional request parameters that allow you to influence this process, possibly increasing performance at the cost of data security. These options are seldom used because Elasticsearch is already fast, but they are explained here for the sake of completeness:
  • consistency
    By default, the primary shard requires a quorum, or majority, of shard copies (where a shard copy can be a primary or a replica shard) to be available before even attempting a write operation. This is to prevent writing data to the “wrong side” of a network partition. A quorum is defined as follows:
int( (primary + number_of_replicas) / 2 ) + 1
  • The allowed values for consistency are one (just the primary shard), all (the primary and all replicas), or the default quorum, or majority, of shard copies.

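  • When a document is updated, Elasticsearch first applies the change on the primary shard, then forwards it to all replica shards, and finally returns the result.

  • The consistency parameter controls how many shard copies must be available before the write is attempted. Per the passage above, relaxing it may improve performance at the cost of data safety. The allowed values are one, all, and quorum (quorum means a majority, so majority is not a separate value).
    The default is quorum, computed as int( (primary + number_of_replicas) / 2 ) + 1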
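
The quorum formula is easy to check with concrete numbers. A minimal sketch: the function names are made up for this note, and primary is taken to be 1 (one primary per shard), which is an assumption about how to read the formula.

```python
def quorum(number_of_replicas, primary=1):
    """int( (primary + number_of_replicas) / 2 ) + 1, as given above."""
    return int((primary + number_of_replicas) / 2) + 1


def required_copies(consistency, number_of_replicas):
    """Shard copies that must be available for each consistency value."""
    if consistency == "one":
        return 1                          # just the primary
    if consistency == "all":
        return 1 + number_of_replicas     # the primary and every replica
    if consistency == "quorum":
        return quorum(number_of_replicas)
    raise ValueError(f"unknown consistency value: {consistency}")


for replicas in (1, 2, 3, 4):
    print(replicas, "replicas -> quorum of", quorum(replicas))
# 1 replicas -> quorum of 2
# 2 replicas -> quorum of 2
# 3 replicas -> quorum of 3
# 4 replicas -> quorum of 3
```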

Multi-index, Multitype

/_search
Search all types in all indices
/gb/_search
Search all types in the gb index
/gb,us/_search
Search all types in the gb and us indices
/g*,u*/_search
Search all types in any indices beginning with g or beginning with u
/gb/user/_search
Search type user in the gb index
/gb,us/user,tweet/_search
Search types user and tweet in the gb and us indices
/_all/user,tweet/_search
Search types user and tweet in all indices

Pagination

  • Deep Paging in Distributed Systems

  • Imagine that we ask for page 1,000 (results 10,001 to 10,010) from an index with five primary shards. Everything works in the same way except that each shard has to produce its top 10,010 results. The coordinating node then sorts through all 50,050 results and discards 50,040 of them!

  • use scroll to retrieve batches of documents

GET /old_index/_search?scroll=1m
{
    "query": {
        "range": {
            "date": {
                "gte":  "2014-01-01",
                "lt":   "2014-02-01"
            }
        }
    },
    "sort": ["_doc"],
    "size":  1000
}
  • _doc has no real use-case besides being the most efficient sort order. So if you don’t care about the order in which documents are returned, then you should sort by _doc. This especially helps when scrolling.
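  • The from and size parameters are all you need to page through Elasticsearch results.
  • Deep paging in a distributed system is slow when many results are requested, mainly because every shard has to sort its own results before the coordinating node merges them.
  • Use scroll when a large number of documents must be retrieved. It helps in two ways: the search context is kept alive between requests (for the time given in the scroll parameter), and sorting by _doc skips the cost of sorting.
  • Because _source holds the complete original document, scroll plus the bulk API can be used to reindex.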
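
The deep-paging numbers quoted in this section follow from simple arithmetic. The sketch below assumes five primary shards (the count implied by 50,050 / 10,010); deep_paging_cost is a made-up helper name, and from_ stands for the from parameter because from is a reserved word in Python.

```python
def deep_paging_cost(from_, size, number_of_shards):
    """Return (results per shard, results gathered, results discarded)."""
    per_shard = from_ + size                 # each shard must produce its top from+size hits
    gathered = per_shard * number_of_shards  # the coordinating node sorts all of them
    discarded = gathered - size              # only `size` hits are returned to the client
    return per_shard, gathered, discarded


# Page 1,000 with 10 results per page: from=10000, size=10.
print(deep_paging_cost(10000, 10, 5))  # (10010, 50050, 50040)
```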

Search Lite

  • There are two forms of the search API: a “lite” query-string version that expects all its parameters to be passed in the query string, and the full request body version that expects a JSON request body and uses a rich search language called the query DSL
GET /_all/tweet/_search?q=tweet:elasticsearch
+name:john +tweet:mary
GET /_search?q=%2Bname%3Ajohn+%2Btweet%3Amary
  • When you index a document, Elasticsearch takes the string values of all of its fields and concatenates them into one big string, which it indexes as the special _all field.

  • There are two kinds of search: the request-body form and the query-string form.

  • A query-string search that names no field searches the contents of the _all field.
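
The percent-encoded query string shown earlier in this section can be reproduced with the Python standard library. A small sketch: only the query text comes from these notes, the rest is plain urllib usage.

```python
from urllib.parse import quote_plus

query = "+name:john +tweet:mary"

# quote_plus encodes "+" as %2B and ":" as %3A, and turns spaces into "+".
encoded = quote_plus(query)
print("/_search?q=" + encoded)  # /_search?q=%2Bname%3Ajohn+%2Btweet%3Amary
```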

Analysis and Analyzers

  • tokenizing into individual terms
  • improve terms for "searchability"

three functions

  • Character filters: tidy up the string before it is tokenized. A character filter could be used to strip out HTML, or to convert & characters to the word and.
  • Tokenizer: splits the string into individual terms. A simple tokenizer might split the text into terms on whitespace and punctuation.
  • Token filters: each term passes through the token filters in turn, which can change terms (for example, lowercasing Quick), remove terms (for example, stopwords like a, and), or add terms (for example, synonyms like jump and leap).
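
The three stages can be imitated with a toy analyzer. This is a sketch of the idea only, not how Elasticsearch or Lucene implements analysis; the stopword and synonym lists and all function names are assumptions made for this note.

```python
import re

STOPWORDS = {"a", "and", "the"}     # illustrative list
SYNONYMS = {"jump": ["leap"]}       # illustrative list


def char_filter(text):
    """Character filter: strip HTML tags and convert & to the word and."""
    text = re.sub(r"<[^>]+>", "", text)
    return text.replace("&", " and ")


def tokenize(text):
    """Tokenizer: split on whitespace and punctuation."""
    return [token for token in re.split(r"[^\w]+", text) if token]


def token_filters(tokens):
    """Token filters: lowercase, drop stopwords, add synonyms."""
    terms = []
    for token in tokens:
        token = token.lower()
        if token in STOPWORDS:
            continue
        terms.append(token)
        terms.extend(SYNONYMS.get(token, []))
    return terms


def analyze(text):
    return token_filters(tokenize(char_filter(text)))


print(analyze("<p>The Quick fox & a dog jump</p>"))
# ['quick', 'fox', 'dog', 'jump', 'leap']
```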

Built-in Analyzers

  • You can use the built-in analyzers as well as language-specific analyzers.

When Analyzers Are Used

  • When we index a document
  • When we search a full-text field

Testing Analyzers

GET /_analyze
{
  "analyzer": "standard",
  "text": "Text to analyze"
}

Viewing the Mapping

GET /gb/_mapping/tweet

Complex Core Field Types

Multivalue Fields

{ "tag": [ "search", "nosql" ]}
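  • All the values of an array must be of the same datatype.
  • When a new field is created by indexing an array, Elasticsearch uses the datatype of the first value in the array to determine the field's type.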
  • When you get a document back from Elasticsearch, any arrays will be in the same order as when you indexed the document.
  • However, arrays are indexed—made searchable—as multivalue fields, which are unordered. At search time, you can’t refer to “the first element” or “the last element.” Rather, think of an array as a bag of values.
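  • In short: an array comes back in the same order in which it was indexed, but a search cannot address its elements by position, so there is no "first" or "second" element to refer to.

How Inner Objects are Indexed
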
{
            "id":           { "type": "string" },
            "gender":       { "type": "string" },
            "age":          { "type": "long"   },
            "name":   { 
              "type":         "object",
              "properties": {
                "full":     { "type": "string" },
                "first":    { "type": "string" },
                "last":     { "type": "string" }
              }
            }
}
{
    "tweet":            [elasticsearch, flexible, very],
    "user.id":          [@johnsmith],
    "user.gender":      [male],
    "user.age":         [26],
    "user.name.full":   [john, smith],
    "user.name.first":  [john],
    "user.name.last":   [smith]
}

Arrays of Inner Objects

{
    "followers": [
        { "age": 35, "name": "Mary White"},
        { "age": 26, "name": "Alex Jones"},
        { "age": 19, "name": "Lisa Smith"}
    ]
}
{
    "followers.age":    [19, 26, 35],
    "followers.name":   [alex, jones, lisa, smith, mary, white]
}
  • we can’t get an accurate answer to this: "Is there a follower who is 26 years old and who is called Alex Jones?"

  • Correlated inner objects, which are able to answer queries like these, are called nested objects, and we cover them later, in Nested Objects.
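
Why the flattened form cannot answer that question can be shown with a small simulation. This is a sketch: flatten, flat_match, and nested_match are made-up helpers that mimic the flattened representation above and a per-object check, not actual Elasticsearch behavior.

```python
followers = [
    {"age": 35, "name": "Mary White"},
    {"age": 26, "name": "Alex Jones"},
    {"age": 19, "name": "Lisa Smith"},
]


def flatten(objects):
    """Mimic the flattened multivalue fields shown above."""
    ages = sorted(obj["age"] for obj in objects)
    names = sorted({word.lower() for obj in objects for word in obj["name"].split()})
    return {"followers.age": ages, "followers.name": names}


def flat_match(flat, age, name_terms):
    """Match against the flattened fields: the age/name pairing is lost."""
    return age in flat["followers.age"] and all(
        term in flat["followers.name"] for term in name_terms
    )


def nested_match(objects, age, name_terms):
    """Match each object on its own, as correlated (nested) objects would."""
    return any(
        obj["age"] == age and all(term in obj["name"].lower().split() for term in name_terms)
        for obj in objects
    )


flat = flatten(followers)
# Nobody is 26 and called Mary White, yet the flattened form says yes.
print(flat_match(flat, 26, ["mary", "white"]))         # True
print(nested_match(followers, 26, ["mary", "white"]))  # False
print(nested_match(followers, 26, ["alex", "jones"]))  # True
```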

Full body search

Combining Multiple Clauses

{
    "bool": {
        "must": { "match":   { "email": "business opportunity" }},
        "should": [
            { "match":       { "starred": true }},
            { "bool": {
                "must":      { "match": { "folder": "inbox" }},
                "must_not":  { "match": { "spam": true }}
            }}
        ],
        "minimum_should_match": 1
    }
}

Queries and Filters

  • When used in filtering context, the query is said to be a "non-scoring" or "filtering" query. That is, the query simply asks the question: "Does this document match?". The answer is always a simple, binary yes|no.

  • Is the created date in the range 2013 - 2014?

  • Does the status field contain the term published?

  • Is the lat_lon field within 10km of a specified point?

  • When used in a querying context, the query becomes a "scoring" query. Similar to its non-scoring sibling, this determines if a document matches and how well the document matches.

  • A typical use for a query is to find documents:

  • Best matching the words full text search

  • Containing the word run, but maybe also matching runs, running, jog, or sprint

  • Containing the words quick, brown, and fox—the closer together they are, the more relevant the document

  • Tagged with lucene, search, or java—the more tags, the more relevant the document

Most important queries

  • match_all Query
{ "match_all": {}}
  • match Query: query for a full-text or exact value in almost any field.
  • If you run a match query against a full-text field, it will analyze the query string by using the correct analyzer for that field before executing the search. If you use it on a field containing an exact value, such as a number, a date, a Boolean, or a not_analyzed string field, then it will search for that exact value.
{ "match": { "tweet": "About Search" }}
{ "match": { "age":    26           }}
{ "match": { "date":   "2014-09-01" }}
{ "match": { "public": true         }}
{ "match": { "tag":    "full_text"  }}
  • multi_match Query
{
    "multi_match": {
        "query":    "full text search",
        "fields":   [ "title", "body" ]
    }
}
  • range Query
{
    "range": {
        "age": {
            "gte":  20,
            "lt":   30
        }
    }
}
  • term Query
  • The term query is used to search by exact values, be they numbers, dates, Booleans, or not_analyzed exact-value string fields:
{ "term": { "age":    26           }}
{ "term": { "date":   "2014-09-01" }}
{ "term": { "public": true         }}
{ "term": { "tag":    "full_text"  }}
  • terms Query
{ "terms": { "tag": [ "search", "full_text", "nosql" ] }}

  • exists and missing Queries
{
    "exists":   {
        "field":    "title"
    }
}

Combining queries together

{
    "bool": {
        "must":     { "match": { "title": "how to make millions" }},
        "must_not": { "match": { "tag":   "spam" }},
        "should": [
            { "match": { "tag": "starred" }}
        ],
        "filter": {
          "bool": { 
              "must": [
                  { "range": { "date": { "gte": "2014-01-01" }}},
                  { "range": { "price": { "lte": 29.99 }}}
              ],
              "must_not": [
                  { "term": { "category": "ebooks" }}
              ]
          }
        }
    }
}

Validating Queries

GET /gb/tweet/_validate/query?explain 
{
   "query": {
      "tweet" : {
         "match" : "really powerful"
      }
   }
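}
  • The query above is deliberately malformed (the field name and the query type are swapped), so validation reports it as invalid, and the explain parameter adds the reason.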

Sorting and Relevance

  • By default, results are returned sorted by relevance—with the most relevant docs first.

Sorting

  • In Elasticsearch, the relevance score is represented by the floating-point number returned in the search results as the _score.