Allow for uptime check not to run for up to 120s, and explicitly monitor for uptime check failure

Monty Dawson requested to merge issue-1-monitoring-resilience into master

Allow up to 120s for the uptime check to run, and explicitly look for failing uptime checks, so that we can alert early if a failure occurs.

Currently deployed to the identity project (staging) using https://gitlab.developers.cam.ac.uk/uis/devops/iam/deploy-identity/-/commit/b22148ed3cd88914f8392d877b51b75ae097657f, which enables the changes made here.

To test, I undeployed the Function that proxies the card API monitoring request, which resulted in the following:

Screenshot_2021-04-14_at_17.08.51

The top metric (number of successful uptime checks) falls to 0, and therefore below its threshold. The bottom metric (number of failing uptime checks) rises above its threshold of 0, so the alert fires. Alert emails were sent only to wgd23, so that this testing did not worry the team.
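The two conditions described above can be sketched in Python as a rough illustration. This is not the alert policy in this merge request: the names `Sample`, `alert_should_fire` and `GRACE_SECONDS` are hypothetical, a success threshold of "at least one" is assumed, and it is assumed that either condition alone is enough to fire the alert.

```python
from dataclasses import dataclass
from typing import Iterable

# From the merge request: allow up to 120s for an uptime check to run.
GRACE_SECONDS = 120


@dataclass(frozen=True)
class Sample:
    """One uptime check result (hypothetical shape, for illustration only)."""
    timestamp: float  # seconds since some fixed epoch
    passed: bool


def alert_should_fire(samples: Iterable[Sample], now: float,
                      grace_seconds: float = GRACE_SECONDS) -> bool:
    """Return True if the alert would fire under the assumed conditions."""
    # Only consider checks that ran within the grace window.
    recent = [s for s in samples if 0 <= now - s.timestamp <= grace_seconds]
    successes = sum(1 for s in recent if s.passed)
    failures = sum(1 for s in recent if not s.passed)
    # Condition 1 (assumed threshold): no successful check in the window.
    # Condition 2: the number of failing checks rises above 0.
    return successes == 0 or failures > 0
```

With this sketch, a single successful check 50s ago keeps the alert quiet, while either a missing check for more than 120s or any failing check in the window fires it.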

Note: after merging (and rebasing branch v1 on master), we should redeploy deploy-identity from master on staging and production, both to remove the test deployment from staging and to deploy these changes to production.

Closes #1 (closed)

Edited by Monty Dawson
