Allow for uptime check not to run for up to 120s, and explicitly monitor for uptime check failure (!4)

Monty Dawson requested to merge issue-1-monitoring-resilience into master Apr 14, 2021

Allow up to 120s for the uptime check to run, and explicitly monitor for uptime check failures so that we can alert early when one occurs.

This is currently deployed to the identity project (staging) using https://gitlab.developers.cam.ac.uk/uis/devops/iam/deploy-identity/-/commit/b22148ed3cd88914f8392d877b51b75ae097657f, which enables the changes made here.

To test, I undeployed the Function that proxies the card API monitoring request, with the following result:

The top metric (number of successful uptime checks) falls to 0, and therefore below its threshold. The bottom metric (number of failing uptime checks) rises above its threshold of 0, so the alert fires. During this test, alert emails were sent only to wgd23 so that the testing did not worry the team.
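
As a rough illustration of the two conditions (a sketch of the logic only, not the module's actual Terraform), the behaviour can be modelled in a few lines of Python. Only the 120s allowance and the failure threshold of 0 come from this merge request. The names `CheckResult` and `firing_conditions`, the success threshold of 1, and the choice to evaluate both conditions over the same 120s window are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List

MAX_GAP_SECONDS = 120   # from this merge request: how long a check may go without running
FAILURE_THRESHOLD = 0   # from this merge request: any failing check exceeds this
SUCCESS_THRESHOLD = 1   # hypothetical: minimum number of successes expected in the window


@dataclass
class CheckResult:
    """One uptime check result (hypothetical structure for this sketch)."""
    timestamp: float  # seconds since an arbitrary start
    passed: bool


def firing_conditions(results: List[CheckResult], now: float) -> List[str]:
    """Return the names of the conditions that would fire at time `now`."""
    # Only consider results reported within the last MAX_GAP_SECONDS.
    window = [r for r in results if now - MAX_GAP_SECONDS < r.timestamp <= now]
    successes = sum(1 for r in window if r.passed)
    failures = sum(1 for r in window if not r.passed)

    firing = []
    if successes < SUCCESS_THRESHOLD:
        # No successful check for the whole window: the check is absent or broken.
        firing.append("too_few_successes")
    if failures > FAILURE_THRESHOLD:
        # An explicit failure was reported: fire early, without waiting out the window.
        firing.append("failures_present")
    return firing


# A check that last passed 100s ago is tolerated; a reported failure fires at once.
late_but_ok = firing_conditions([CheckResult(0, True)], now=100)
early_failure = firing_conditions(
    [CheckResult(0, True), CheckResult(60, False)], now=61
)
```

In this model a late check does not fire anything until 120s have passed without a success, whereas a single reported failure fires straight away, which matches the intent of alerting early on failure while allowing the check some slack to run.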

Note: after merging (and rebasing branch v1 on master), we should redeploy deploy-identity from master to staging and production, both to remove the test deployment from staging and to deploy these changes to production.

Closes #1 (closed)

Edited Apr 16, 2021 by Monty Dawson

