Big Data Platform
Service Availability
HDFS Metrics
Name Node Metrics
Failure of the NameNode without any replica standing by will cause data stored in the cluster to become unavailable. Therefore monitoring the primary NameNode and its Secondary/Standby NameNode is critical to ensure high cluster availability.
NameNode-emitted metrics
CapacityRemaining
:explode:
Available capacity (Resource: Utilization)
Disk use not to exceed 80% capacity.
CorruptBlocks/MissingBlocks
:explode:
Number of corrupt/missing blocks (Resource: Error/Resource: Availability)
Corrupt blocks are replicated from healthy copies.
Missing blocks have no known copy.
VolumeFailuresTotal
:explode:
Number of failed volumes (Resource: Error)
A failed volume will not bring your cluster to a grinding halt, but you will most likely want to know when hardware failures occur so that you can replace the failed hardware.
NumLiveDataNodes/NumDeadDataNodes
:explode:
Count of alive DataNodes/Count of dead DataNodes (Resource: Availability)
When the NameNode does not hear from a DataNode for 30 seconds, that DataNode is marked as “stale.” Should the DataNode fail to communicate with the NameNode for 10 minutes following the transition to the “stale” state, the DataNode is marked “dead.”
FilesTotal
Total count of files tracked by the NameNode (Resource: Utilization)
TotalLoad
The current number of concurrent file accesses (read/write) across all DataNodes. (Resource: Utilization)
BlockCapacity/BlocksTotal
Maximum number of blocks allocable/Count of blocks tracked by NameNode (Resource: Utilization)
UnderReplicatedBlocks
Count of under-replicated blocks (Resource: Availability)
NumStaleDataNodes
Count of stale DataNodes (Resource: Availability)
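The NameNode publishes the metrics above through its JMX servlet (`/jmx`, in the `Hadoop:service=NameNode,name=FSNamesystem` bean). A minimal monitoring sketch, assuming a hypothetical NameNode host (the port is 9870 by default in Hadoop 3, 50070 in Hadoop 2) and using the thresholds discussed above:

```python
import json
from urllib.request import urlopen

# Hypothetical NameNode address; adjust host and port for your cluster.
NAMENODE_JMX = ("http://namenode.example.com:9870/jmx"
                "?qry=Hadoop:service=NameNode,name=FSNamesystem")

def check_namenode(fsnamesystem: dict) -> list[str]:
    """Return alert strings for an FSNamesystem bean, per the thresholds above."""
    alerts = []
    used = fsnamesystem["CapacityUsed"]
    remaining = fsnamesystem["CapacityRemaining"]
    if used / (used + remaining) > 0.80:      # disk use not to exceed 80%
        alerts.append("capacity above 80%")
    if fsnamesystem["MissingBlocks"] > 0:     # no known copy exists
        alerts.append("missing blocks")
    if fsnamesystem["CorruptBlocks"] > 0:
        alerts.append("corrupt blocks")
    if fsnamesystem["VolumeFailuresTotal"] > 0:
        alerts.append("failed volumes")
    if fsnamesystem["NumDeadDataNodes"] > 0:
        alerts.append("dead DataNodes")
    return alerts

def fetch_fsnamesystem(url: str = NAMENODE_JMX) -> dict:
    """Fetch the FSNamesystem bean from the NameNode JMX servlet."""
    return json.load(urlopen(url))["beans"][0]
```

The same `/jmx` servlet exposes the other beans referenced in this map (JvmMetrics, DataNode metrics), so the fetch-and-check pattern generalizes.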
NameNode JVM Metrics
Because the NameNode runs in the Java Virtual Machine (JVM), it relies on Java garbage collection to free up memory. The more activity in your HDFS cluster, the more often garbage collection will run.
ConcurrentMarkSweep count
Number of old-generation collections. ConcurrentMarkSweep collections free up unused memory in the old generation of the heap. (Other)
ConcurrentMarkSweep time
Elapsed time of old-generation collections, in milliseconds (Other)
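Both counters are cumulative, so what matters for alerting is their rate of change between polls. A sketch of computing the fraction of a polling interval spent in old-generation collections, assuming samples of (CollectionCount, CollectionTime in ms) taken from the `java.lang:type=GarbageCollector,name=ConcurrentMarkSweep` MBean:

```python
def gc_overhead(prev: tuple, curr: tuple, interval_ms: int) -> float:
    """Fraction of the polling interval spent in old-generation (CMS) GC.

    prev and curr are (CollectionCount, CollectionTime-in-ms) samples
    taken interval_ms apart.
    """
    count_delta = curr[0] - prev[0]
    time_delta = curr[1] - prev[1]
    if count_delta == 0:
        return 0.0  # no old-generation collections in this interval
    return time_delta / interval_ms
```

A sustained overhead of more than a few percent suggests the NameNode heap is undersized for the cluster's activity.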
Data Node Metrics
DataNode metrics are host-level metrics specific to a particular DataNode.
Remaining disk space :explode:
(Resource: Utilization)
A single DataNode running out of space could quickly cascade into failures across the entire cluster, as data is written to a shrinking pool of available DataNodes.
Alert on this metric when the remaining space falls dangerously low (less than 10 percent).
NumFailedVolumes
Number of failed storage volumes (Resource: Error)
By default, a single volume failing on a DataNode will cause the entire node to go offline. Depending on your environment, that may not be desired.
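That behavior is configurable via the `dfs.datanode.failed.volumes.tolerated` property. A sketch of the relevant `hdfs-site.xml` entry (the value shown is illustrative, not a recommendation):

```xml
<!-- hdfs-site.xml: tolerate up to two failed volumes before the
     DataNode takes itself offline (the default is 0). -->
<property>
  <name>dfs.datanode.failed.volumes.tolerated</name>
  <value>2</value>
</property>
```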
YARN Metrics
Cluster metrics
unhealthyNodes
:explode:
Number of unhealthy nodes (Resource: Error)
YARN considers any node with disk utilization exceeding the value specified under the property yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage (in yarn-site.xml) to be unhealthy. Ample disk space is critical to ensure uninterrupted operation of a Hadoop cluster, and large numbers of unhealthyNodes (the number to alert on depends on the size of your cluster) should be quickly investigated and resolved.
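A sketch of the corresponding `yarn-site.xml` entry, using the property named above (90 is the usual default; the value is illustrative):

```xml
<!-- yarn-site.xml: mark a node unhealthy once any disk exceeds
     this utilization percentage. -->
<property>
  <name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
  <value>90.0</value>
</property>
```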
activeNodes
Number of currently active nodes (Resource: Availability)
This metric should remain static in the absence of anticipated maintenance.
lostNodes
Number of lost nodes (Resource: Error)
If a NodeManager fails to maintain contact with the ResourceManager, it will eventually be marked as “lost” and its resources will become unavailable for allocation.
appsFailed
Number of failed applications (Work: Error)
Assuming you are running MapReduce on YARN, if the percentage of failed map or reduce tasks exceeds a specific threshold the application as a whole will fail.
totalMB/allocatedMB
Total amount of memory/amount of memory allocated (Resource: Utilization). This gives a high-level view of your cluster’s memory usage. Memory is currency in Hadoop, and if you are nearing the ceiling of your memory usage you have a couple of options. To increase the memory available to your cluster, you can add new NodeManager nodes, tweak the amount of memory reserved for YARN applications, or change the minimum amount of RAM allocated to containers.
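Both values are available from the ResourceManager REST API under `/ws/v1/cluster/metrics`. A minimal sketch, assuming a hypothetical ResourceManager host (8088 is the usual webapp port):

```python
import json
from urllib.request import urlopen

# Hypothetical ResourceManager address; adjust for your cluster.
RM_METRICS = "http://resourcemanager.example.com:8088/ws/v1/cluster/metrics"

def memory_utilization(cluster_metrics: dict) -> float:
    """Fraction of cluster memory currently allocated to containers."""
    return cluster_metrics["allocatedMB"] / cluster_metrics["totalMB"]

def fetch_cluster_metrics(url: str = RM_METRICS) -> dict:
    """Fetch the clusterMetrics object from the ResourceManager REST API."""
    return json.load(urlopen(url))["clusterMetrics"]
```

The same response object also carries `unhealthyNodes`, `activeNodes`, `lostNodes`, and `appsFailed`, so one poll covers all of the cluster metrics in this section.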
Application Metrics
progress
Application execution progress meter (Work: Performance)
Progress gives you a real-time window into the execution of a YARN application. Its reported value will always be in the range of zero to one (inclusive), with a value of one indicating completed execution.
NodeManager Metrics
containersFailed
Number of containers that failed to launch (Resource: Error)
MapReduce Counters
Job counters
MILLIS_MAPS/MILLIS_REDUCES
Processing time for maps/reduces (Work: Performance)
Tracks the wall time spent across all of your map and reduce tasks.
NUM_FAILED_MAPS/NUM_FAILED_REDUCES
Number of failed maps/reduces (Work: Error)
Tracks the number of failed map/reduce tasks for a job.
RACK_LOCAL_MAPS/DATA_LOCAL_MAPS/OTHER_LOCAL_MAPS
Counters tracking where map tasks were executed
Track the number of map tasks that were performed, aggregated by location. Data can be located on the node executing the map task, on a node in the same rack as the node performing the map task, or on a node in a different rack elsewhere in the cluster (Resource: Other)
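Taken together, these three counters give a job's data-locality ratio, which is worth tracking because non-local maps pay network-transfer costs. A small sketch, assuming the counters are read into a plain dict keyed by counter name:

```python
def data_locality(counters: dict) -> float:
    """Fraction of map tasks that read their input from the local node.

    counters maps job-counter names (DATA_LOCAL_MAPS, RACK_LOCAL_MAPS,
    OTHER_LOCAL_MAPS) to their values for a finished job.
    """
    local = counters.get("DATA_LOCAL_MAPS", 0)
    total = (local
             + counters.get("RACK_LOCAL_MAPS", 0)
             + counters.get("OTHER_LOCAL_MAPS", 0))
    return local / total if total else 0.0
```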
Task counters
Task counters track the results of task operations in aggregate. Each counter below represents the total across all tasks associated with a particular job.
REDUCE_INPUT_RECORDS
Number of input records for reduce tasks (Other)
SPILLED_RECORDS
Number of records spilled to disk (Resource: Saturation)
GC_TIME_MILLIS
Processing time spent in garbage collection (Other)
Custom counters
MapReduce allows users to implement custom counters that are specific to their application. Custom counters can be used to track more fine-grained counts, such as counting the number of malformed or missing records.
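From a Hadoop Streaming task, custom counters are incremented by writing specially formatted lines to stderr. A sketch of the malformed-records example above (the group and counter names are illustrative):

```python
import sys

def increment_counter(group: str, counter: str, amount: int = 1) -> str:
    """Emit a Hadoop Streaming counter update on stderr.

    Streaming tasks increment custom counters by writing lines of the
    form 'reporter:counter:<group>,<counter>,<amount>' to stderr.
    """
    line = f"reporter:counter:{group},{counter},{amount}"
    print(line, file=sys.stderr)
    return line

# e.g. inside a streaming mapper, on encountering a bad record:
# increment_counter("DataQuality", "MALFORMED_RECORDS")
```

In native Java MapReduce the equivalent is `context.getCounter(group, name).increment(n)`.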
File system counters
Sources:
https://www.datadoghq.com/blog/monitoring-101-collecting-data/
https://www.datadoghq.com/blog/monitor-hadoop-metrics/#namenodeemitted-metrics
http://www.jointhegrid.com/hadoop-cacti-jtg-walk/
ZooKeeper Metrics
zk_followers
Number of active followers (Resource: Availability)
zk_avg_latency
Amount of time it takes to respond to a client request (in ms) (Work: Performance)
zk_num_alive_connections
Number of clients connected to ZooKeeper (Resource: Availability)
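All three metrics are emitted by ZooKeeper's `mntr` four-letter command, which returns tab-separated key/value pairs over the client port (on ZooKeeper 3.5+, `mntr` must be enabled via `4lw.commands.whitelist`). A sketch, assuming a hypothetical ZooKeeper host:

```python
import socket

def parse_mntr(output: str) -> dict:
    """Parse the tab-separated output of ZooKeeper's 'mntr' command."""
    stats = {}
    for line in output.strip().splitlines():
        key, _, value = line.partition("\t")
        stats[key] = value
    return stats

def fetch_mntr(host: str = "zk.example.com", port: int = 2181) -> dict:
    """Send the 'mntr' command to a ZooKeeper server and parse the reply."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(b"mntr")
        data = sock.recv(65536)
    return parse_mntr(data.decode())
```

Note that `zk_followers` is only reported by the ensemble leader; followers omit it.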