Big Data Platform Service Availability
Big Data Platform
Processing time for maps/reduces (Work: Performance)
Tracks the wall time spent across all of your map and reduce tasks.
Number of failed maps/reduces (Work: Error)
Tracks the number of failed map/reduce tasks for a job.
Counters tracking where map tasks were executed (Resource: Other)
These counters track the number of map tasks that were performed, aggregated by location. The input data can be located on the node executing the map task, on a node in the same rack as the node performing the map task, or on a node in a different rack elsewhere in the cluster.
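A minimal sketch of how the three locality counts can be turned into shares, assuming the values have already been read from the job's counters (commonly exposed as DATA_LOCAL_MAPS, RACK_LOCAL_MAPS, and OTHER_LOCAL_MAPS). The function name and the sample numbers are illustrative only.

```python
def locality_breakdown(data_local, rack_local, other_local):
    """Return the fraction of map tasks that ran in each locality class."""
    total = data_local + rack_local + other_local
    if total == 0:
        # No map tasks recorded yet; avoid dividing by zero.
        return {"data_local": 0.0, "rack_local": 0.0, "other_local": 0.0}
    return {
        "data_local": data_local / total,
        "rack_local": rack_local / total,
        "other_local": other_local / total,
    }


# Illustrative counts, not taken from a real job.
shares = locality_breakdown(data_local=80, rack_local=15, other_local=5)
print(shares)
```

A falling data-local share means more input is being pulled over the network, which is usually the first thing to check when map times grow.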
Task counters track the results of task operations in aggregate. Each counter below represents the total across all tasks associated with a particular job.
Number of input records for reduce tasks (Other)
Number of records spilled to disk (Resource: Saturation)
Processing time spent in garbage collection (Other)
MapReduce allows users to implement custom counters that are specific to their application. Custom counters can be used to track more fine-grained counts, such as counting the number of malformed or missing records.
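As a sketch of a custom counter for malformed records: with Hadoop Streaming, a task can update a counter by writing a line of the form `reporter:counter:<group>,<counter>,<amount>` to standard error. The group and counter names (`DataQuality`, `MALFORMED_RECORDS`), the three-field tab-separated record layout, and the helper names below are assumptions made for illustration.

```python
import io
import sys


def counter_line(group, name, amount=1):
    """Build the stderr line that Hadoop Streaming interprets as a counter update."""
    return f"reporter:counter:{group},{name},{amount}"


def is_malformed(line, expected_fields=3, sep="\t"):
    """Hypothetical validity rule: a record must have exactly expected_fields fields."""
    return len(line.rstrip("\n").split(sep)) != expected_fields


def run_mapper(lines, out=sys.stdout, err=sys.stderr):
    """Pass well-formed records through; count malformed ones via a custom counter."""
    malformed = 0
    for line in lines:
        if is_malformed(line):
            malformed += 1
            print(counter_line("DataQuality", "MALFORMED_RECORDS"), file=err)
            continue
        out.write(line if line.endswith("\n") else line + "\n")
    return malformed


# Local dry run on sample input, capturing both streams in memory.
sample = ["a\tb\tc\n", "broken line\n", "d\te\tf\n"]
out, err = io.StringIO(), io.StringIO()
bad = run_mapper(sample, out=out, err=err)
```

In an actual streaming job the mapper would call `run_mapper(sys.stdin)` instead of the in-memory dry run.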
File system counters
Number of unhealthy nodes (Resource: Error)
YARN considers any node with disk utilization exceeding the value specified under the property yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage (in yarn-site.xml) to be unhealthy. Ample disk space is critical to ensure uninterrupted operation of a Hadoop cluster, and large numbers of unhealthyNodes (the number to alert on depends on the size of your cluster) should be quickly investigated and resolved.
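The rule described above can be sketched as a simple threshold check. This is a simplified model, not YARN's actual health checker: the per-disk utilization figures are assumed to be collected elsewhere, the mount paths are hypothetical, and the default of 90.0 is an assumption about the property's default value that should be verified against your own yarn-site.xml.

```python
def over_limit_disks(disk_used_pct, max_pct=90.0):
    """Return the disks whose utilization exceeds the configured maximum."""
    return sorted(path for path, used in disk_used_pct.items() if used > max_pct)


def node_is_unhealthy(disk_used_pct, max_pct=90.0):
    """Simplified rule from the text: any disk over the limit marks the node unhealthy."""
    return bool(over_limit_disks(disk_used_pct, max_pct))


# Hypothetical mount points and utilization percentages.
disks = {"/data1": 72.5, "/data2": 93.1}
flagged = over_limit_disks(disks)
unhealthy = node_is_unhealthy(disks)
print(flagged, unhealthy)
```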
Number of currently active nodes (Resource: Availability)
This metric should remain static in the absence of anticipated maintenance.
Number of lost nodes (Resource: Error)
If a NodeManager fails to maintain contact with the ResourceManager, it will eventually be marked as “lost” and its resources will become unavailable for allocation.
Number of failed applications (Work: Error)
Assuming you are running MapReduce on YARN, if the percentage of failed map or reduce tasks exceeds a specific threshold, the application as a whole will fail.
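The threshold rule can be sketched as follows. The function and its `max_failed_percent` parameter are hypothetical stand-ins for whatever failure tolerance your jobs are configured with; the numbers are illustrative.

```python
def application_fails(failed_tasks, total_tasks, max_failed_percent):
    """Return True when the share of failed tasks exceeds the allowed percentage."""
    if total_tasks == 0:
        return False
    failed_percent = 100.0 * failed_tasks / total_tasks
    return failed_percent > max_failed_percent


# With a 5% tolerance, 3 failures out of 100 tasks are tolerated, 6 are not.
tolerated = application_fails(3, 100, max_failed_percent=5)
exceeded = application_fails(6, 100, max_failed_percent=5)
print(tolerated, exceeded)
```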
Total amount of memory/amount of memory allocated (Resource: Utilization)
This gives a high-level view of your cluster’s memory usage. Memory is currency in Hadoop, and if you are nearing the ceiling of your memory usage you have several options. To increase the memory available to your cluster, you can add new NodeManager nodes, tweak the amount of memory reserved for YARN applications, or change the minimum amount of RAM allocated to containers.
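A minimal sketch of the utilization calculation, assuming the two values are available as the `totalMB` and `allocatedMB` fields of the ResourceManager's cluster metrics. The 0.9 alert level and the sample figures are arbitrary choices for illustration.

```python
def memory_utilization(cluster_metrics):
    """Return allocated memory as a fraction of total cluster memory."""
    total = cluster_metrics["totalMB"]
    allocated = cluster_metrics["allocatedMB"]
    if total <= 0:
        return 0.0
    return allocated / total


# Illustrative values: 48 GiB allocated out of 64 GiB.
sample_metrics = {"totalMB": 65536, "allocatedMB": 49152}
ratio = memory_utilization(sample_metrics)

# Hypothetical alert level; tune it to your cluster.
nearing_ceiling = ratio >= 0.9
print(ratio, nearing_ceiling)
```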
Application execution progress meter (Work: Performance)
Progress gives you a real-time window into the execution of a YARN application. Its reported value will always be in the range of zero to one (inclusive), with a value of one indicating completed execution.
Number of containers that failed to launch (Resource: Error)
Number of active followers (Resource: Availability)
Amount of time it takes to respond to a client request (in ms) (Work: Performance)
Number of clients connected to ZooKeeper (Resource: Availability)
NameNode Metrics
Failure of the NameNode without any replica standing by will cause data stored in the cluster to become unavailable. Therefore monitoring the primary NameNode and its Secondary/Standby NameNode is critical to ensure high cluster availability.
Available capacity (Resource: Utilization)
Disk use should not exceed 80% of capacity.
Number of corrupt/missing blocks (Resource: Error/Resource: Availability)
A corrupt block still has healthy replicas elsewhere in the cluster, from which it can be re-replicated.
A missing block has no known healthy copy.
Number of failed volumes (Resource: Error)
Although a failed volume will not bring your cluster to a grinding halt, you will most likely want to know when hardware failures occur so that you can replace the failed hardware.
Count of alive DataNodes/Count of dead DataNodes (Resource: Availability)
When the NameNode does not hear from a DataNode for 30 seconds, that DataNode is marked as “stale.” Should the DataNode fail to communicate with the NameNode for 10 minutes following the transition to the “stale” state, the DataNode is marked “dead.”
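The two timeouts can be expressed as a small classifier. This is a sketch using the 30-second and 10-minute figures from the text as constants; both intervals are configurable in practice, and the state names are illustrative.

```python
STALE_AFTER_S = 30             # no heartbeat for 30 seconds: stale
DEAD_AFTER_STALE_S = 10 * 60   # a further 10 minutes of silence: dead


def datanode_state(seconds_since_heartbeat):
    """Classify a DataNode by the time elapsed since its last heartbeat."""
    if seconds_since_heartbeat >= STALE_AFTER_S + DEAD_AFTER_STALE_S:
        return "dead"
    if seconds_since_heartbeat >= STALE_AFTER_S:
        return "stale"
    return "live"


states = [datanode_state(s) for s in (5, 45, 629, 630)]
print(states)
```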
Total count of files tracked by the NameNode (Resource: Utilization)
Current number of concurrent file accesses (read/write) across all DataNodes (Resource: Utilization)
Maximum number of blocks allocable/Count of blocks tracked by NameNode (Resource: Utilization)
Count of under-replicated blocks (Resource: Availability)
Count of stale DataNodes (Resource: Availability)
NameNode JVM Metrics
Because the NameNode runs in the Java Virtual Machine (JVM), it relies on Java garbage collection processes to free up memory. The more activity in your HDFS cluster, the more often garbage collection will run.
Number of old-generation collections (Other)
ConcurrentMarkSweep collections free up unused memory in the old generation of the heap.
Elapsed time of old-generation collections, in milliseconds (Other)
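The collection count and the elapsed time are most useful together. A minimal sketch, assuming both are cumulative counters sampled at two points in time; the `(count, total_time_ms)` tuple layout and the sample values are hypothetical.

```python
def average_pause_ms(prev, curr):
    """Average old-generation collection time between two samples.

    Each sample is a (collection_count, total_time_ms) pair of cumulative values.
    """
    delta_count = curr[0] - prev[0]
    delta_time_ms = curr[1] - prev[1]
    if delta_count <= 0:
        # No new collections in this interval (or the counters were reset).
        return 0.0
    return delta_time_ms / delta_count


# Illustrative samples: 4 collections taking 800 ms in total.
avg = average_pause_ms(prev=(10, 1200), curr=(14, 2000))
print(avg)
```

A rising average per collection, rather than a rising count alone, is the stronger sign that the NameNode heap is under pressure.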
DataNode Metrics
DataNode metrics are host-level metrics specific to a particular DataNode.
Remaining disk space
A single DataNode running out of space could quickly cascade into failures across the entire cluster as data is written to an ever-shrinking pool of available DataNodes.
Alert on this metric when the remaining space falls dangerously low (less than 10 percent).
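A sketch of the alert condition, assuming each DataNode reports its remaining and total capacity in bytes. The host names, the sample figures, and the function name are hypothetical; the 10 percent floor comes from the text.

```python
def low_space_datanodes(nodes, min_remaining=0.10):
    """Return hosts whose remaining space is below the given fraction of capacity.

    nodes maps a host name to a (remaining_bytes, capacity_bytes) pair.
    """
    return sorted(
        host
        for host, (remaining, capacity) in nodes.items()
        if capacity > 0 and remaining / capacity < min_remaining
    )


# Hypothetical hosts with capacities scaled down for readability.
datanodes = {"dn1": (50, 1000), "dn2": (400, 1000), "dn3": (99, 1000)}
alerts = low_space_datanodes(datanodes)
print(alerts)
```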
Number of failed storage volumes (Resource: Error)
By default, a single volume failing on a DataNode will cause the entire node to go offline. Depending on your environment, that may not be desired.