Platform Health Dashboard

Access / User permissions

Different levels of user access

Viewer (Read-only)

Editor

Administrator - able to change data sources and server configuration ✅

By default, all BBC staff users with a desktop certificate can access the dashboard - Story created (CSPHEALTH-29)

Fabl Health

Fabl-KPI1 - Availability
(SLO 99.95% of requests served successfully)

What do we mean by a request in the context of Fabl? ✅ Any call coming to Fabl

Which requests do we need to consider? Is there more than one type of request?

How do we identify all requests for Fabl?
Through Fabl API calls only ✅

What is the definition of a successfully served request?

Is there such a thing as a partially served request? ✅ (No)

Or is it binary: either served or not served? ✅ (Yes)

What does the SLI calculation look like for this KPI?
Number of successful requests / total requests (success rate) for a defined time period
A reasonable minimum time period would be one minute ✅
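The success-rate calculation above could be sketched as follows. This is only an illustration: it assumes each request is available as an (epoch seconds, HTTP status) pair, and it treats any status outside 500-599 as successful, in line with the status code note in this document. The function names and sample data are hypothetical, and the handling of 408 is still an open question.

```python
from collections import defaultdict

SLO_TARGET = 0.9995  # 99.95% availability target from the KPI definition


def is_successful(status: int) -> bool:
    """Treat any status outside 500-599 as served successfully (assumption).

    How 408 (timeout) should be classified is still an open question.
    """
    return not 500 <= status <= 599


def availability_per_minute(requests):
    """Return {minute_start_epoch: success_rate} for (epoch_seconds, status) pairs."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for epoch_seconds, status in requests:
        minute = int(epoch_seconds) // 60 * 60  # start of the one-minute window
        totals[minute] += 1
        if is_successful(status):
            successes[minute] += 1
    return {minute: successes[minute] / totals[minute] for minute in totals}


# Hypothetical sample: three requests in the first minute, one in the second.
sample = [(0, 200), (10, 404), (59, 503), (60, 200)]
rates = availability_per_minute(sample)
breaches = [minute for minute, rate in rates.items() if rate < SLO_TARGET]
```

Minutes with no requests simply do not appear in the result, so they are neither counted as available nor as a breach.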

Fabl-KPI2 - Response time (cached)
98% of cached requests are completed in < 50ms.

HTTP status codes ❓
Any HTTP status other than 500–599 is considered successful ✅
408 (Request Timeout) status codes: check with John ❓

What defines a cached request? The parameter x-proxy-cache: HIT ✅


Do we need to consider HTTP status codes? ❓

Which parameter do we look at for the request received date/timestamp?
A duration parameter is available to look at ✅


We already have a duration parameter defined, so we won't need the date/timestamp

Data sources

What are the data sources we need to consider for this?
CloudWatch > WebCore account ✅

Data sources ❓

What are the data sources we need to consider for this? ✅
CloudWatch > WebCore account

Data retention ❓

Is this set per KPI or at dashboard level?

How long do we want to retain the data?

Do we want to allow data to be restored after deletion?

What is the data archival policy?

How long?

Who can restore it?

What is the requirement for revoking access or removing a user? (PR process - same as above) Also, the JML document must be updated to instruct access removal for leavers

What is the SLI calculation? ✅


Number of cached responses that completed successfully in < 50ms / total number of cached responses
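A minimal sketch of this ratio, assuming each log record exposes the x-proxy-cache value, an HTTP status and a duration in milliseconds. The dictionary keys "status" and "duration", the function name and the sample records are illustrative assumptions. Because whether HTTP status codes should count is still an open question, the status check is a switchable flag.

```python
def latency_sli(records, cache_status, threshold_ms, require_success=True):
    """Fraction of requests with the given x-proxy-cache value that finished
    in under threshold_ms.

    Returns None when there are no matching requests, so an empty window
    is not reported as 0% or 100%.
    """
    matching = [r for r in records if r["x-proxy-cache"] == cache_status]
    if not matching:
        return None
    good = 0
    for r in matching:
        succeeded = not 500 <= r["status"] <= 599  # assumed success rule
        if r["duration"] < threshold_ms and (succeeded or not require_success):
            good += 1
    return good / len(matching)


# Hypothetical records; durations are assumed to be in milliseconds.
sample = [
    {"x-proxy-cache": "HIT", "status": 200, "duration": 12},
    {"x-proxy-cache": "HIT", "status": 200, "duration": 80},
    {"x-proxy-cache": "HIT", "status": 503, "duration": 5},
    {"x-proxy-cache": "MISS", "status": 200, "duration": 120},
]
cached_sli = latency_sli(sample, "HIT", 50)       # compare against the 98% target
uncached_sli = latency_sli(sample, "MISS", 250)   # same ratio with the 250 ms limit
```

The same function covers the uncached KPI by switching the cache value to MISS and the threshold to 250 ms.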

Single sign-on to access the dashboard, so that users don't need a BBC dev cert? (Nice to have)

Fabl-KPI3 - Response time (uncached)
98% of uncached requests are completed in < 250ms.

What defines an uncached request? The parameter x-proxy-cache: MISS ✅


Do we need to consider HTTP status codes? ❓

Data sources - What are the data sources we need to consider for this? ✅
CloudWatch > WebCore account


What more information do we need about the data sources? ❓

Fabl-KPI4 - Freshness
99% of single module requests serve data that is < 50 seconds old.

Monitoring

Alarms & notifications

Which Fabl modules do we need to consider here?
Do we need to track this for each of the Fabl modules?

Data sources ❓

Are we assuming that the freshness of data differs for each module?

What is the SLI calculation? ✅


Number of uncached requests that completed successfully in < 250ms / total number of uncached requests

Does this ensure that we are not missing any requests? ✅

What about requests arriving during a network or backend failure, or during planned or unplanned outages? ✅

Define freshness ❓
The proportion of records read from the source that were updated "recently" (< 50 seconds old)
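One way to read this definition is sketched below. It assumes that, for every record served, both the time it was read and the time it was last updated at the source are known. The function name and sample values are hypothetical, and the definition itself is still marked as open.

```python
from datetime import datetime, timedelta, timezone

FRESHNESS_LIMIT = timedelta(seconds=50)  # "< 50 seconds old" from the KPI


def freshness_sli(records, limit=FRESHNESS_LIMIT):
    """Proportion of (read_at, updated_at) pairs whose age at read time is below limit.

    Returns None for an empty input rather than guessing a value.
    """
    if not records:
        return None
    fresh = sum(1 for read_at, updated_at in records if read_at - updated_at < limit)
    return fresh / len(records)


# Hypothetical sample: ages of 10 s, 49 s, 50 s and 120 s at read time.
now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
sample = [
    (now, now - timedelta(seconds=10)),
    (now, now - timedelta(seconds=49)),
    (now, now - timedelta(seconds=50)),   # exactly 50 s old, so not fresh
    (now, now - timedelta(seconds=120)),
]
sli = freshness_sli(sample)  # compare against the 99% target
```

If freshness turns out to differ per module, the same function could be applied to each module's records separately.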

Any consideration for non-BBC users? (Do we need to allow any non-BBC users to view the dashboard?) - Story (CSPHEALTH-29)

What is the process for getting Editor or Administrator access? ✅


This would be via a pull request with a valid reason.
The respective user's email address would be added to a file somewhere in the repo to grant an elevated level of access to the dashboard

What is the process for changing the level of access?
e.g. Editor > Viewer
Editor > Admin
Admin > Editor or Viewer
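If level changes were handled through the same repo file as elevated access, the lookup might look like the sketch below. The "email,role" line format, the role names and both function names are purely hypothetical, since the notes do not specify the file's layout. Anyone not listed falls back to Viewer, which matches the read-only default.

```python
ROLES = ("viewer", "editor", "admin")  # hypothetical role identifiers


def parse_access_file(text):
    """Parse 'email,role' lines into a dict; blank lines and # comments are skipped."""
    elevated = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        email, role = (part.strip().lower() for part in line.split(","))
        if role not in ROLES:
            raise ValueError(f"unknown role {role!r} for {email}")
        elevated[email] = role
    return elevated


def role_for(email, elevated):
    """Return the user's role, defaulting to read-only viewer access."""
    return elevated.get(email.lower(), "viewer")


# Hypothetical file contents; changing a level would be a one-line edit in a pull request.
access_file = """
# email,role
alice@example.com,admin
bob@example.com,editor
"""
elevated = parse_access_file(access_file)
```

Under this layout, moving a user from Editor to Viewer would mean deleting their line, and Editor to Admin would mean changing the role on it.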

RatePerMinute: How frequently do we need to collect the metrics?
The existing Lambda function is triggered every minute and makes 6 API calls to Fabl

What metrics do we need to collect? The existing Lambda function for monitoring Fabl freshness collects the metrics below:
ServiceToFablDependencyDuration
FablDependencyToFablConsumerDuration
FablConsumerToPreFetchDuration
PresFetchToResponseDuration
TotalDuration
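As an illustration of how these five values might be shaped for CloudWatch, the sketch below builds a list of metric entries for a single measurement. The metric names are copied as listed above. The millisecond unit, the function name and the sample values are assumptions, and this is not the existing Lambda's code.

```python
from datetime import datetime, timezone

# Metric names exactly as listed in these notes.
METRIC_NAMES = [
    "ServiceToFablDependencyDuration",
    "FablDependencyToFablConsumerDuration",
    "FablConsumerToPreFetchDuration",
    "PresFetchToResponseDuration",
    "TotalDuration",
]


def build_metric_data(durations_ms, timestamp):
    """Build one CloudWatch-style metric entry per expected duration.

    Raises ValueError if any expected metric is missing, so a partial
    measurement is not published silently.
    """
    missing = [name for name in METRIC_NAMES if name not in durations_ms]
    if missing:
        raise ValueError(f"missing metrics: {missing}")
    return [
        {
            "MetricName": name,
            "Timestamp": timestamp,
            "Value": float(durations_ms[name]),
            "Unit": "Milliseconds",  # assumed unit
        }
        for name in METRIC_NAMES
    ]


# Hypothetical measurement from one of the per-minute API calls.
sample = {
    "ServiceToFablDependencyDuration": 10,
    "FablDependencyToFablConsumerDuration": 15,
    "FablConsumerToPreFetchDuration": 5,
    "PresFetchToResponseDuration": 20,
    "TotalDuration": 50,
}
payload = build_metric_data(sample, datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc))
```

A list in this shape could then be passed as MetricData to the boto3 CloudWatch client's put_metric_data call, together with a Namespace chosen by the team.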

Alerts ❓

Grafana configuration ❓

Grafana configuration ❓

Panel Type

Threshold

Do we need any alerts?

Panel type

Do we need any alerts?

Threshold?

Alert rule

Alert channel

Alert rule

Alert channel

Grafana configuration ❓

Panel Type

Do we need any alerts?

Alert rule

Alert channel

Threshold