M201: MongoDB Performance (Coggle Diagram)
Hardware
Memory
RAM is roughly 25x faster than SSD
MongoDB's storage architecture relies heavily on RAM, eg for writes, aggregation pipelines, queries and connection handling
CPU
Mongo will use all cores by default
Important for storage
And concurrency
IO
The higher the IOPS, the faster data can be read and written
RAID 1+0 is preferred for performance and redundancy
RAID 5 or 6 discouraged because of performance overhead
RAID 0 no redundancy
Network
Client - Shell - Database
Network latency affects distributed reads and writes, especially with read/write concern settings
INDEXES
_id automatically indexed
Indexes speed up reads
However they slow down write, update and delete operations
Single Field Indexes
db.<collection>.createIndex( { <field>: <direction> } )
direction 1 means
ascending
Keys from only one field
Use dot notation to index sub documents
db.people.explain("executionStats").find( { ssn: 1 } )
Understanding Explain
explain()
tells us:
is your query using the index you expect
is your query using an index to provide a sort
is your query using an index to provide the projection
how selective is your index
which part of your plan is the most expensive
exp = db.people.explain() // shows what would happen without running
exp = db.people.explain("queryPlanner") // default and same as above
exp = db.people.explain("executionStats") // executes the query and returns details about the execution
exp = db.people.explain("allPlansExecution") // the most verbose way we can get output, shows alternate plans
Explain for Sharded Clusters
In this scenario each shard will have its own plan in the shards array
Sorting with Indexes
Use indexes to sort
Create indexes that match and sort
For single key index order isn't important since it can traverse forward and backward
For compound key index order becomes important
Compound indexes can be used with sorting
Can also use a compound index
prefix
in our sorts
Also applies if prefix is split across filter and sort
For compound keys the sort can only use the index if the sort keys match the index fields (or a prefix of them), or the
inverse
of either (ie every direction flipped, 1 to -1 and vice versa)
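A minimal sketch of the prefix-or-inverse rule in plain Node.js (the function name is hypothetical, not a MongoDB API):

```javascript
// Sketch: can a compound index support a given sort?
// The sort keys must be a leading prefix of the index keys,
// either with the same directions or with every direction flipped.
function canIndexSupportSort(index, sort) {
  const idxFields = Object.keys(index);
  const sortFields = Object.keys(sort);
  if (sortFields.length === 0 || sortFields.length > idxFields.length) return false;
  // Sort fields must match the index fields from the start, in order.
  if (!sortFields.every((f, i) => f === idxFields[i])) return false;
  const same = sortFields.every(f => sort[f] === index[f]);
  const inverse = sortFields.every(f => sort[f] === -index[f]);
  return same || inverse; // a mix of flipped and unflipped directions can't use the index
}
```

eg with index { a: 1, b: -1 }, sorts { a: 1, b: -1 } and { a: -1, b: 1 } can use it, but { a: 1, b: 1 } cannot.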
Compound Indexes
Index on
two or more
fields
An index prefix is a contiguous, leading subset of the indexed fields
eg compound index { "item": 1, "location": 1, "stock": 1 }
prefixes would be:
item
item, location
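The prefix rule from the example above can be sketched in plain Node.js (the helper name is an illustration, not a MongoDB API):

```javascript
// Sketch: enumerate the prefixes of a compound index specification.
// A prefix is a leading, contiguous subset of the indexed fields
// (the full key pattern itself is not usually listed as a prefix).
function indexPrefixes(indexSpec) {
  const fields = Object.keys(indexSpec);
  return fields.slice(0, -1).map((_, i) => fields.slice(0, i + 1));
}
```

eg indexPrefixes({ item: 1, location: 1, stock: 1 }) gives [["item"], ["item", "location"]], matching the two prefixes listed above.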
Multikey Indexes
Essentially an index on a field that holds an array value, MongoDB creates an index key for each element in the array
A compound index can include at most one multikey field (ie a single array-valued field)
Don't support covered queries
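A toy sketch of the one-key-per-array-element idea in plain Node.js (a simplified model of multikey expansion, not MongoDB's internal representation):

```javascript
// Sketch: how a multikey index expands one document into index entries,
// one entry per element of the array-valued field.
function multikeyEntries(doc, field) {
  const value = doc[field];
  const values = Array.isArray(value) ? value : [value]; // scalar fields get one entry
  return values.map(v => ({ key: v, docId: doc._id }));
}
```

eg a document with tags: ["mongodb", "index"] contributes two index entries, both pointing back at the same document.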
Partial Indexes
Index part of a collection
Instead of creating
db.restaurants.createIndex( { "address.city": 1, cuisine: 1, stars: 1 } )
We could create:
db.restaurants.createIndex(
{ "address.city": 1, cuisine: 1 },
{ partialFilterExpression: { 'stars': { $gte: 3.5 } } }
)
To use the index, the query filter must be guaranteed to satisfy the partial filter expression (eg include stars: { $gte: 3.5 })
A
sparse
index is a special case that only indexes documents where the indexed field exists
db.restaurants.createIndex(
{ stars: 1 },
{ sparse: true }
)
Restrictions
can't specify both partialFilterExpression and sparse options
_id indexes cannot be partial indexes
shard key indexes cannot be partial indexes
Text Indexes
db.products.createIndex( { productName: "text" } )
Project the relevance score with { score: { $meta: "textScore" } }
Because text search is
OR
based for each term, the score shows best ranking for match
Can also be part of a compound index
Search on this using:
db.textExample.find( { $text: { $search: "MongoDB best" } } )
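A toy sketch of the OR-based matching and scoring in plain Node.js (a crude stand-in for textScore, ignoring stemming and term weights):

```javascript
// Toy sketch of OR-based text search: each search term matches
// independently, and documents matching more terms score higher.
function textScore(docText, search) {
  const terms = search.toLowerCase().split(/\s+/);
  const words = new Set(docText.toLowerCase().split(/\s+/));
  return terms.filter(t => words.has(t)).length; // count of matched terms
}
```

eg searching "MongoDB best" matches a document containing only "MongoDB" (score 1), but a document containing both terms ranks higher (score 2).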
Collations
Collation allows users to specify language-specific rules for string comparison, such as rules for lettercase and accent marks.
Can be defined at:
Collection
Index
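Node's built-in Intl.Collator exposes similar locale-aware comparison rules, which can sketch what a MongoDB collation does (this is the ECMAScript API, not MongoDB's collation document, though the locale/strength options are analogous):

```javascript
// Sketch of collation-aware string comparison using Intl.Collator.
// sensitivity: "base" compares base letters only, ignoring case and accents
// (roughly analogous to a low collation strength in MongoDB).
const collator = new Intl.Collator("en", { sensitivity: "base" });

const caseInsensitive = collator.compare("MongoDB", "mongodb");   // 0 means "equal"
const accentInsensitive = collator.compare("café", "cafe");       // accents ignored too
```

With the default (case- and accent-sensitive) collation, both comparisons would be non-zero.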
Wildcard Indexes
Used to create indexes on all or some sub parts of nested documents, eg
db.data.createIndex( { "waveMeasurement.waves.$**": 1 } )
Resource Allocation for Indexes
Determine index size, easy to see in Compass
Or can run
db.stats()
Indexes need
disk
and
memory
db.<collection>.stats( {indexDetails:true} )
Can tell us how much of our indexes are in RAM
Performance Benchmarking
Low level benchmarking
File IO
Scheduler
Memory allocation
Thread perf
DB server perf
Transaction Isolation
Tools like sysbench
Database Server Benchmarking
Data set load
Writes per sec
Reads per sec
Balanced workloads
Read / Write ratio
Tools like YCSB or TPC
Distributed Systems Benchmarking
Linearisability
Serialisability
Fault tolerance
Tools like HiBench and Jepsen
Benchmarking Conditions
Tools like POCDriver
Fair comparisons against hardware, clients and load
CRUD Optimisations
Equality queries make better use of indexes than range queries
Equality > Sort > Range
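The Equality > Sort > Range rule of thumb can be sketched in plain Node.js (the helper and its usage map are hypothetical illustrations; MongoDB provides no such function):

```javascript
// Sketch: order candidate index fields by the Equality > Sort > Range rule.
// `usage` maps each field to how the query uses it.
function esrOrder(usage) {
  const rank = { equality: 0, sort: 1, range: 2 };
  // Array.prototype.sort is stable, so ties keep their original order.
  return Object.keys(usage).sort((a, b) => rank[usage[a]] - rank[usage[b]]);
}
```

eg for find({ city: "NY", age: { $gte: 21 } }).sort({ name: 1 }), a reasonable index order is city (equality), then name (sort), then age (range).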
Covered Queries
Completely satisfied by our index keys
Doesn't therefore need to examine documents
Usually requires projecting only the indexed fields
Regex Performance
Usually poor performance on unindexed fields
Adding a caret (^) at the start anchors the regex, so it can use some of the index's B-tree optimisation (a prefix scan)
db.users.find( { username: /^kirby/ } )
Aggregation Performance
Once a stage can't use indexes, no subsequent stage can
Note the optimizer will try and reorder stages if possible
Can add the
{ explain: true }
option to the aggregation
$match can use indexes, esp if it's at the start of pipeline
$sort should be as early as possible
$limit should be near $sort, again as early as possible, so the top-k sorting algorithm can be used
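A minimal top-k sketch in plain Node.js (illustrating the idea only; MongoDB's actual implementation differs):

```javascript
// Sketch of top-k selection: keep only the k best elements seen so far
// instead of sorting the whole input, which is what $sort + $limit enables.
function topK(values, k, cmp = (a, b) => a - b) {
  const best = [];
  for (const v of values) {
    best.push(v);
    best.sort(cmp);                   // tiny array: at most k + 1 elements
    if (best.length > k) best.pop();  // discard the current worst
  }
  return best;
}
```

This keeps memory bounded by k rather than the full result set.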
Results are subject to
16Mb
document limit
Mitigate this by using $limit and $project
There is a 100Mb memory limit per stage
Mitigate this by using indexes, or add the
{ allowDiskUse: true }
option
Doesn't work with $graphLookup
Performance on Clusters
Increasing Write Performance with Sharding
Chunk default max size is
64Mb
, grows to this then splits
Shard key design should consider:
Cardinality
: we want
high
, this controls max number of shards
Frequency
: we want this close to
evenly distributed
, otherwise can get
jumbo chunks
Rate of Change
: avoid monotonically changing fields, eg object id, times, dates
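Why monotonically increasing shard keys hurt can be sketched in plain Node.js (a simplified model of range-based chunk routing, with hypothetical helper names):

```javascript
// Sketch: route inserts to range-based chunks by shard key value.
// chunkBounds are sorted upper bounds; the last chunk is [last bound, +inf).
function routeInserts(keys, chunkBounds) {
  const counts = new Array(chunkBounds.length + 1).fill(0);
  for (const k of keys) {
    let i = chunkBounds.findIndex(b => k < b);
    if (i === -1) i = chunkBounds.length; // beyond every bound: last chunk
    counts[i]++;
  }
  return counts;
}
```

With a monotonically increasing key, every new insert lands in the last chunk (a hotspot), whereas well-distributed keys spread across chunks.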
Bulk Writes
Ordered bulk writes are executed sequentially; if one fails, execution stops immediately
Unordered bulk writes are executed in parallel and do not depend on each other
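The two semantics can be sketched in plain Node.js (a toy model where each op is a function, not the real bulkWrite API):

```javascript
// Sketch of ordered vs unordered bulk-write semantics:
// ordered stops at the first failure, unordered attempts every operation.
function runBulk(ops, { ordered }) {
  const results = [];
  for (const op of ops) {
    try {
      results.push({ ok: true, value: op() });
    } catch (e) {
      results.push({ ok: false, error: e.message });
      if (ordered) break; // ordered: stop immediately on failure
    }
  }
  return results;
}
```

With ops [insertA, failingInsert, insertC], ordered mode never attempts insertC, while unordered mode reports one failure and two successes.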
Performance Considerations in Distributed Systems
Try using shard key in queries otherwise you will end up having scatter-gather across multiple shards
Reading from Secondaries
db.people.find().readPref("primary") // by default all reads in replica set go to primary
... readPref("secondaryPreferred") // reads go to secondary unless none available, writes still go to primary
... readPref("nearest") // reads go to node with lowest network latency
If reading from secondary may be reading stale data
Reading from secondary is a
GOOD IDEA
when:
doing analytics queries (these are usually resource-intensive and long-running)
Local reads in geographically distributed replica sets
Reading from secondary is a
BAD IDEA
when:
providing extra capacity for reads
Replica Sets with Differing Indexes
Not common
Such a secondary should not be allowed to become primary, ie set
{ priority: 0 }
Aggregation Pipeline on Sharded Cluster
When merging results from multiple shards, a randomly chosen shard usually does this
The primary shard will always be used for $out, $facet, $lookup and $graphLookup