Efficiency in Elasticsearch
How much data can a node hold?
It depends!
Elastic Stack
Hot, Warm, Cold Architecture
Hot
Indexing Load: High
Query Load: High
CPU: High
Storage: Very Fast (SSDs)
Data Volume: Low
Warm
Indexing Load: None
Query Load: Medium
CPU: High/Medium
Storage: Fast (HDDs/SSDs)
Data Volume: Medium
Cold
Indexing Load: None
Query Load: Low
CPU: Medium/Low
Storage: Medium/Slow (HDDs)
Data Volume: High
Example
1 Master Node
For high availability, 3 master-eligible nodes would be needed
1 Coordinator Node
2 Cold Data Nodes
Elasticsearch 6.3
Data simulated from Nginx access logs (Rally eventdata track):
https://github.com/elastic/rally-eventdata-track
~285GB of snapshots
5 primary shards / 1 replica each
AWS EC2 d2.4xlarge data nodes
Results
Very large latency values: ~10s for search
39.8TB; 4 nodes; 360 shards
JVM heap per node: ~30GB, ~75% used
Lucene index memory per node: ~20GB
Optimize High Density Data
Warm and Cold nodes
What limits storage density?
Query Latency Requirements
Random reads disk performance
Low page cache hit ratio
Comment: the page cache will not help much, since queries usually touch very different data
Data & query patterns
Comment: e.g. the queried time window can strongly influence query time
Network and Recovery Performance
Cluster Size
Recovery Window Length
Comment: We can lose data if another copy fails while recovery is still running. Nodes holding a lot of data are problematic here, since replicating their data takes a long time.
Heap usage (most important one)
Circuit Breakers
Comment: Circuit breakers are designed to handle situations where processing a request would need more memory than is available
Static Lucene Data
Transient Data
Comment: Elasticsearch builds the transient data structures of each shard of a frozen index each time that shard is searched, and discards these data structures as soon as the search is complete
Use Large Shards
Each shard has overhead
Ideal size depends on the use case
Each shard can hold at most ~2 billion docs
Practical guideline: 50GB-200GB
Minimize Heap Usage and Size on Disk
ALWAYS optimize mappings
Comment: Logstash's default mappings are geared towards flexibility, not density
Avoid nested and parent-child mappings
Watch out for high cardinality keyword fields
Comment: Avoid storing high cardinality keywords, like unique IDs, text with timestamps, etc.
Force-merge down to a single segment (if you do only one optimization, this is the most important one)
I/O intensive
Comment: the larger the shard, the longer it takes
Only run on read-only indexes
Comment: for example, when an index moves from a Hot to a Warm node
Reduce Heap Usage
Select use-case appropriate identifier
Document IDs have different profiles
Comment: Auto-generated IDs are not always ideal
https://www.elastic.co/blog/efficient-duplicate-prevention-for-event-based-data-in-elasticsearch
Heap space vs. disk usage trade-off
Use Case dependent
Lucene Data Structures and Storage Density
Resource Usage
Types
Disk Usage
Off-heap memory
On-heap memory
Stats can be viewed (per index) at:
http://my.elasticsearch.cluster/_cluster/stats
indices/segments and indices/fielddata
Inverted Index
Terms dictionary
Read into heap memory
Posting Lists
Read from disk into byte buffers
Doc Values
Per-document values
Stored on disk, with compression
Numeric values are stored in byte buffers
Keyword values use a dual structure
term dictionary (on heap)
per document ordinals (in byte buffers)
Segmented Structure
Lucene indexes are written as multiple small indexes called segments
Each segment will have its own set of data structures
Most structures can be used per-segment
Keyword doc values need to re-map per-segment ordinals to a global structure (on heap): field data / global ordinals
Field data / global ordinals can consume a lot of heap space
Global ordinals are unnecessary if you only have one segment
Summary - Main consumers of Heap
Term dictionaries of high cardinality fields
Global ordinals - can be avoided if you force-merge down to a single segment
Types of data
Inverted index
Index storing a mapping from content, such as words or numbers, to its locations in a document or set of documents
✓ Most disk space is used here
✓ Most heap is used here
Doc Values
Per document information used for sorting and aggregation
✓ Most heap is used here
Points
Range querying
Stored Fields
Original document
✓ Most disk space is used here
Term vectors
Per document inverted index
Uses a lot of space
Disabled by default
Q&A
How much data should you put on the Hot Node?
Warm nodes should be able to cope with queries as well, without being much slower than Hot nodes. If you have a lot of data and queries hitting it, you should have big Warm nodes.
Shard size?
With small shards you will suffer from per-shard overhead
The larger you can make them, the better
Usually 30-60GB, but it varies a lot per use case