Platform Health Dashboard
Access / User permissions
Different level of user access
Viewer (Read-only)
By default, all BBC staff with a desktop certificate can access the dashboard -
Story created (CSPHEALTH-29)
:check:
Single sign-on to access the dashboard, so that users don't need a BBC dev cert? (Nice to have)
Any consideration for non-BBC users? (Do we need to allow any non-BBC users to view the dashboard?) -
Story (CSPHEALTH-29)
:check:
Editor
Administrator - be able to change data sources & server configuration :check:
What is the requirement for revoking access or removing a user? (PR process - same as above)
Also, the JML document must be updated to instruct access removal for leavers
:question:
What is the process for getting Editor or administrator access? :check:
This would be via a pull request with a valid reason.
The user's email address would be added to a file somewhere in the repo to grant an elevated level of access to the dashboard
What is the process of changing the level of access?
i.e. Editor > Viewer
Editor > Admin
Admin > Editor or Viewer
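A minimal sketch of how the pull-request-managed access file described above could be read. The file format (`email: role` per line), the `Role` names, and the helper functions are hypothetical; the notes only say that email addresses would be added to a file somewhere in the repo.

```python
from enum import IntEnum


class Role(IntEnum):
    """Access levels from the notes; the numeric ordering is an assumption."""
    VIEWER = 1
    EDITOR = 2
    ADMIN = 3


def parse_access_file(text):
    """Parse hypothetical 'email: role' lines into {email: Role}."""
    roles = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        email, role_name = (part.strip() for part in line.split(":", 1))
        roles[email.lower()] = Role[role_name.upper()]
    return roles


def role_for(email, roles):
    """Anyone not listed falls back to the default read-only Viewer level."""
    return roles.get(email.lower(), Role.VIEWER)


# Hypothetical file contents, changed only through reviewed pull requests.
ACCESS_FILE = """
# elevated access
alice@example.com: editor
bob@example.com: admin
"""

roles = parse_access_file(ACCESS_FILE)
```

Under this layout, changing a level (for example Editor > Admin) is a one-line edit in a pull request, and removing a line drops the user back to Viewer.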
Fabl Health
Fabl-KPI1 - Availability
(SLO 99.95% of requests served successfully)
What do we mean by a request in the context of Fabl? :check: Any call coming to Fabl
Which requests do we need to consider? - Is there more than one type of request?
Does this ensure that we are not missing any requests? :check:
What about requests coming during network failure/backend failure or any planned or unplanned outages? :check:
How do we identify all requests for Fabl?
Through Fabl API Call only :check:
What is the definition of a successfully served request?
Is there anything like partially served requests? :check:(NO)
Or is it binary, either served or not served? :check:(YES)
HTTP Status codes :question:
i.e. Any HTTP status other than 500–599 is considered successful :check:
Check 408 - Timeout status codes - Check with John :question:
What does the SLI calculation look like for this KPI? -
Number of successful requests / total requests (success rate) for a defined time period
Reasonable minimum time period would be a minute :check:
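The availability calculation above can be sketched as follows. This assumes each request is available as a `(timestamp, status)` pair; that record shape and the function names are illustrative, not taken from the CloudWatch data. The treatment of 408 is still open in the notes, so here it counts as successful like any other non-5xx status.

```python
from collections import defaultdict
from datetime import datetime

SLO_TARGET = 0.9995  # 99.95% of requests served successfully


def is_successful(status):
    """Any HTTP status outside 500-599 counts as successful (408 still to be confirmed)."""
    return not 500 <= status <= 599


def availability_per_minute(requests):
    """Return {minute_start: successful requests / total requests}."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for timestamp, status in requests:
        minute = timestamp.replace(second=0, microsecond=0)  # one-minute buckets
        totals[minute] += 1
        if is_successful(status):
            successes[minute] += 1
    return {minute: successes[minute] / totals[minute] for minute in totals}


# Made-up sample data for illustration only.
sample = [
    (datetime(2024, 1, 1, 12, 0, 5), 200),
    (datetime(2024, 1, 1, 12, 0, 30), 503),
    (datetime(2024, 1, 1, 12, 1, 10), 404),
]
rates = availability_per_minute(sample)
breaches = [minute for minute, rate in rates.items() if rate < SLO_TARGET]
```

With so few requests per bucket a single 5xx breaches the target, which is one reason a longer window than the one-minute minimum may be more useful for alerting.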
Data sources
:question:
What are the data sources we need to consider for this?
CloudWatch > WebCore account :check:
Grafana configuration :question:
Panel type
Do we need any alerts?
Alert rule
Alert channel
Threshold?
Fabl-KPI2 - Response time (cached)
98% of cached requests are completed in < 50ms.
What defines cached request? parameter x-proxy-cache: HIT :check:
Do we need to consider HTTP status codes? :question:
Which parameter do we look at for the request received date/timestamp?
A duration parameter is available to look at :check:
The duration is already recorded per request, so we won't need the date/timestamp
Data sources :question:
What are the data sources we need to consider for this? :check:
CloudWatch > WebCore account
What is the SLI calculation? :check:
Number of cached responses that completed successfully in < 50ms / total number of cached responses
Grafana configuration :question:
Panel Type
Threshold
Do we need any alerts?
Alert rule
Alert channel
Data retention :question:
Is this at each KPI level or dashboard level?
How long do we want to retain the data?
Do we want to allow data to be restored after deletion?
How long?
Who can restore it?
What is the data archival policy?
Fabl-KPI3 - Response time (uncached)
98% of uncached requests are completed in < 250ms.
What defines an uncached request? parameter x-proxy-cache: MISS :check:
Do we need to consider HTTP status codes? :question:
Data sources -
What are the data sources we need to consider for this? :check:
CloudWatch > WebCore account
What more information do we need about the data sources? :question:
What is the SLI calculation? :check:
Number of uncached requests that completed successfully in < 250ms / total number of uncached requests
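The response-time SLIs for KPI2 and KPI3 share one shape, so a single sketch covers both. It assumes each request is a dict carrying the `x-proxy-cache` and `duration` parameters named above, with `duration` in milliseconds; the record shape and the unit are assumptions. Whether HTTP status codes should also filter the numerator is still an open question, so no status filter is applied.

```python
def latency_sli(requests, cache_status, threshold_ms):
    """Share of requests with the given cache status that completed under the threshold."""
    matching = [r for r in requests if r["x-proxy-cache"] == cache_status]
    if not matching:
        return None  # no traffic of this kind in the period
    fast = sum(1 for r in matching if r["duration"] < threshold_ms)
    return fast / len(matching)


# Made-up sample data for illustration only.
sample = [
    {"x-proxy-cache": "HIT", "duration": 12},
    {"x-proxy-cache": "HIT", "duration": 48},
    {"x-proxy-cache": "HIT", "duration": 70},
    {"x-proxy-cache": "MISS", "duration": 180},
    {"x-proxy-cache": "MISS", "duration": 300},
]

cached_sli = latency_sli(sample, "HIT", 50)      # KPI2: target 98% in < 50ms
uncached_sli = latency_sli(sample, "MISS", 250)  # KPI3: target 98% in < 250ms
```

Returning `None` for an empty period keeps "no traffic" distinct from "0% within threshold", which matters if an alert rule is later attached to the value.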
Grafana configuration :question:
Panel Type
Do we need any alerts?
Alert rule
Alert channel
Threshold
Fabl-KPI4 - Freshness
99% of single module requests serve data that is < 50 seconds old.
Which Fabl modules do we need to consider here?
Do we need to track this for each of the Fabl modules?
Data sources :question:
Assuming freshness of data for each module differs?
Define freshness :question:
The proportion of records read from the source that were updated "recently" (< 50 seconds old)
RatePerMinute: How frequently do we need to collect the metrics?
The existing Lambda function is triggered every minute and makes 6 API calls to Fabl
Which metrics do we need to collect? The existing Lambda function for monitoring Fabl freshness collects the metrics below:
ServiceToFablDependencyDuration
FablDependencyToFablConsumerDuration
FablConsumerToPreFetchDuration
PresFetchToResponseDuration
TotalDuration
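A minimal sketch of the freshness SLI as defined above: the proportion of responses whose data is under 50 seconds old. It assumes a data age in seconds is available for each call, for example derived from the duration metrics listed above; that mapping is an assumption and is not confirmed by these notes.

```python
FRESHNESS_TARGET = 0.99  # 99% of single module requests
MAX_AGE_SECONDS = 50.0


def freshness_sli(ages_seconds, max_age_seconds=MAX_AGE_SECONDS):
    """Share of sampled responses whose data age is below the limit."""
    if not ages_seconds:
        return None  # nothing sampled in the period
    fresh = sum(1 for age in ages_seconds if age < max_age_seconds)
    return fresh / len(ages_seconds)


# Made-up ages for one minute of sampling (six calls, as in the existing schedule).
sample_ages = [12.0, 31.5, 49.9, 50.0, 72.3, 8.4]
sli = freshness_sli(sample_ages)
meets_target = sli >= FRESHNESS_TARGET
```

With only six samples per minute a single stale response moves the SLI by almost 17 percentage points, so the 99% target is only meaningful over a longer aggregation window.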
Alerts :question:
Monitoring
Alarms & notifications