Commit 7b0e06e1 authored by Dr Rich Wareham

WIP: add documentation of terraform deployments

This is a WIP containing a basic skeleton for documentation on
deployments and content for a few pages within it. I think that filling
out the rest of the pages should end up with a pretty comprehensive
documentation of how we deploy things.

Closes #64
# Continuous Deployment
This page documents how we use GitLab CI to deploy new versions of our
applications automatically.
# DNS
This page documents how we configure DNS zones for our deployments. It also
discusses how we verify ownership of domains under `cam.ac.uk` with Google.
## Zone delegation
The DNS zone `gcp.uis.cam.ac.uk` is delegated to a Google Cloud DNS hosted zone
managed by
[gcp-infra](https://gitlab.developers.cam.ac.uk/uis/devops/infra/gcp-infra/)
(University members only).
When a new product is added, a *product zone* is created by configuration in
[dns.tf](https://gitlab.developers.cam.ac.uk/uis/devops/infra/gcp-infra/-/blob/master/product/dns.tf).
This will be of the form `[product-slug].gcp.uis.cam.ac.uk`. The _product admin_
service account is granted the ability to add records to this zone. DNSSEC
records are added as appropriate.
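As an illustration only (not the actual gcp-infra code), a product zone resource might look roughly like this, with the zone name, DNS name and description as placeholder values:

```tf
# Illustrative sketch: a delegated product zone with DNSSEC enabled.
# The real configuration lives in gcp-infra's dns.tf.
resource "google_dns_managed_zone" "product" {
  name        = "example-product"
  dns_name    = "example.gcp.uis.cam.ac.uk."
  description = "Product zone for the example product"

  dnssec_config {
    state = "on"
  }
}
```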
Per-product DNS configuration is handled by [our standard
boilerplate](https://gitlab.developers.cam.ac.uk/uis/devops/gcp-deploy-boilerplate/-/blob/master/%7B%7B%20cookiecutter.product_slug%20%7D%7D-deploy/dns.tf)
(University only). The _product admin_ account is used to create a
per-environment zone which the _project admin_ account can add records to. For
example, the "development" environment for the "example" product would have the
zone `devel.example.gcp.uis.cam.ac.uk` created. Again, DNSSEC records are
added.
At this point, the _project admin_ can freely add records to the
environment-specific zone.
!!! example "Example: Load Balancer IPs"
    When configuring the [Cloud Load
    Balancer](https://gitlab.developers.cam.ac.uk/uis/devops/gcp-deploy-boilerplate/-/blob/master/%7B%7B%20cookiecutter.product_slug%20%7D%7D-deploy/webapp_load_balancer.tf)
    in our boilerplate, we create an "A" record pointing to the load balancer's
    IP with the following config:

    ```tf
    # DNS records for webapp. For load-balanced applications, these are created
    # irrespective of any custom DNS name. For custom DNS name-hosted webapps,
    # you will probably need a further CNAME record pointing to this record.
    resource "google_dns_record_set" "load_balancer_webapp" {
      count = local.webapp_use_cloud_load_balancer ? 1 : 0

      managed_zone = google_dns_managed_zone.project.name
      ttl          = 300

      name = local.webapp_dns_name
      type = "A"
      rrdatas = [
        module.webapp_http_load_balancer[0].external_ip
      ]
    }
    ```
## Adding records under `.cam.ac.uk`
The existing [IP register database](https://www.dns.cam.ac.uk/) is used to
provision records under `.cam.ac.uk`. At the moment this cannot easily be
automated.
To add records you will need to liaise with
[ip-register@uis.cam.ac.uk](mailto:ip-register@uis.cam.ac.uk) to ensure the
following:
* The zone containing the record you wish to register is in the `UIS-DEVOPS`
_mzone_. (For example, if you want to register `foo.raven.cam.ac.uk` then
`raven.cam.ac.uk` needs to be in the mzone.)
* You have the ability to administer the UIS-DEVOPS mzone in the IP register
database. The list of administrators for the mzone can be listed via the
[single_ops page](https://jackdaw.cam.ac.uk/ipreg/single_ops).
Our general approach is to add whatever records we need for the service within
the product environment specific zone under `gcp.uis.cam.ac.uk` and add CNAME
records for the `.cam.ac.uk` "friendly" names.
Adding a CNAME is a bit of a fiddle:
* In [vbox_ops](https://jackdaw.cam.ac.uk/ipreg/vbox_ops) add a vbox which
corresponds to the `.gcp.uis.cam.ac.uk` host name.
* In [cname_ops](https://jackdaw.cam.ac.uk/ipreg/cname_ops) add a CNAME which
points to the `.gcp.uis.cam.ac.uk` host name.
## TLS Certificates and Domain Verification
Google Cloud services which can host web content generally have two options for
TLS certificates:
* **Self-managed certificates** require that one generate private keys and
corresponding TLS certificates signed by some appropriate trust root "out of
band" and provide the key and certificate to the service. They can be a
management burden since one needs to make sure that appropriate procedures are
in place to renew and replace certificates before expiry. As such we are
incentivised to have long-lived certificates.
* **Google managed certificates** are issued and managed as part of the Google
Cloud platform. They are automatically renewed and replaced as necessary. Like
many automated certificate provisioning solutions, Google managed certificates
tend to be short-lived, typically 90 days. Renewing and rotating TLS
certificates frequently is desirable from a security-hygiene perspective. This,
coupled with the lower maintenance burden, means we generally use Google
managed certificates where possible (a sketch follows below).
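For example, a Google-managed certificate for a load-balanced web application can be declared with a resource along these lines; the resource name and domain below are placeholders rather than values from a real deployment:

```tf
# Illustrative sketch: a Google-managed certificate for the webapp's DNS name.
# Google provisions and renews this automatically once the domain is verified
# and traffic is being served through the load balancer.
resource "google_compute_managed_ssl_certificate" "webapp" {
  name = "webapp-cert"

  managed {
    domains = ["devel.example.gcp.uis.cam.ac.uk"]
  }
}
```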
In order for Google to issue certificates, you must verify ownership of the
domain.
### Domains under `.gcp.uis.cam.ac.uk`
Our boilerplate includes semi-automated verification for domains under
`.gcp.uis.cam.ac.uk` by means of verification records.
We will use the example domain `devel.example.gcp.uis.cam.ac.uk` in this
section. This section also assumes the product's terraform configuration has
already been `apply`-ed at least once.
Domain verification is started using the [gcloud
tool](https://cloud.google.com/sdk/install):
```console
gcloud domains verify devel.example.gcp.uis.cam.ac.uk
```
A browser window will appear with the Google Webmaster tools page shown.
1. Open a new browser window in incognito mode and enter the URL shown
previously. (The Chrome browser on Macs will attempt to set up
synchronisation if one signs in to a Google account outside of a private
browsing session.)
2. Click the avatar in the top-right corner and make sure that you are signed in
as the UIS DevOps bot user, devops@uis.cam.ac.uk. Credentials for this user
can be found in 1Password.
3. Select **Other** as a domain name provider and click the **Add a CNAME
record** link.
4. You will be asked to add a CNAME record of the form
`[HOST].devel.example.gcp.uis.cam.ac.uk` pointing to some target.
Add the `[HOST]` part and target to the `workspace_domain_verification`
section of
[locals.tf](https://gitlab.developers.cam.ac.uk/uis/devops/gcp-deploy-boilerplate/-/blob/master/%7B%7B%20cookiecutter.product_slug%20%7D%7D-deploy/locals.tf). For example, if your CNAME host
was `abc1234.devel.example.gcp.uis.cam.ac.uk` and the target was
`gv-XXXXXX.dv.googlehosted.com`, set `cname_host` to `abc1234` and `cname_target` to
`gv-XXXXXX.dv.googlehosted.com`.
5. Apply the configuration so that the verification CNAME record is created (a
sketch of such a record follows these steps).
6. It will take up to 5 minutes for Google's DNS to start serving the CNAME
record. Make a cup of coffee and then click **Verify** to verify ownership.
7. When verification is successful, click **Add additional owners to
devel.example.gcp.uis.cam.ac.uk.** and add the *project admin
service account* email address. This is available in the
`project_admin_service_account_email` terraform output.
8. Add `verified = true` to the workspace's domain verification state in
`workspace_domain_verification` in locals.tf.
9. Apply the configuration again.
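For illustration, a minimal sketch of the kind of record the boilerplate might create from those locals; the local names, zone reference and domain here are illustrative rather than the boilerplate's actual code:

```tf
# Hypothetical sketch: create the domain verification CNAME from the
# host/target values recorded in locals.tf. Names are illustrative only.
resource "google_dns_record_set" "domain_verification" {
  managed_zone = google_dns_managed_zone.project.name
  name         = "${local.workspace_domain_verification.cname_host}.devel.example.gcp.uis.cam.ac.uk."
  type         = "CNAME"
  ttl          = 300
  rrdatas      = ["${local.workspace_domain_verification.cname_target}."]
}
```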
### Domains under `.cam.ac.uk`
Unfortunately, we cannot yet semi-automate the verification of domains outside
of `gcp.uis.cam.ac.uk`. For all other `.cam.ac.uk` domains we need to perform
manual verification.
The G Suite admins can manually verify ownership of domains under `.cam.ac.uk`
with Google. They will need:
* The email address of the _project admin_ service account.
* The domain to verify.
One can contact the G Suite admins via
[gapps-admin@uis.cam.ac.uk](mailto:gapps-admin@uis.cam.ac.uk). Since all G Suite
admins are also in DevOps, you might find it quicker to ask in the divisional
Google Chat.
## Summary
* Domains under `.gcp.uis.cam.ac.uk` are managed via Google's Cloud DNS service.
* Each product receives a dedicated zone named `[PRODUCT].gcp.uis.cam.ac.uk`
which can be administered by the _product admin_ service account.
* Each environment is provisioned with a dedicated zone named
`[ENVIRONMENT].[PRODUCT].gcp.uis.cam.ac.uk` which can be administered by the
_project admin_ service account.
* All zones have appropriate DNSSEC records created.
* Domain ownership must be verified for Google to issue TLS certificates.
* Verification of domain ownership is semi-automated for domains under
`.gcp.uis.cam.ac.uk` but must be performed manually by a G Suite admin for
other `.cam.ac.uk` domains.
# GCP Projects and Folders
This page describes how we make use of Google Cloud projects and folders to
structure our deployments.
## Centrally managed infrastructure
We use the [gcp-infra
project](https://gitlab.developers.cam.ac.uk/uis/devops/infra/gcp-infra/)
(DevOps only link) to configure and provision Google Cloud for the wider
University. This repository contains [terraform](https://www.terraform.io/)
configuration for the scheme described below.
We use a series of Google Folders within Google Cloud to reflect a rough
institution/group hierarchy similar to the hierarchy which has developed on [the
Developer Hub](https://gitlab.developers.cam.ac.uk/).
At the bottom of this hierarchy are *products*. At the Google Cloud management
level we care only about products and each product is given its own *product
folder*.
For example, the "Information Services" folder contains sub-folders for groups
within the institution and each group has a sub-folder for each product:
<center>![Google Cloud hierarchy](./gcp-hierarchy.png)</center>
Each product is associated with a finance cost code and a team name. These are
added to projects within the product folder as labels. The labels are
intended to be used to automate finance reporting.
In addition to the labels, products are allocated a budget and a [budget
alert](https://cloud.google.com/billing/docs/how-to/budgets) is configured to
notify us when a product is likely to go over budget for a month.
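For illustration only (the real configuration is in gcp-infra), a budget with an alert threshold can be expressed along these lines; the billing account, project number and amount are placeholders:

```tf
# Illustrative sketch: alert when spend crosses 90% of a monthly budget for a
# product's projects. All values here are placeholders.
resource "google_billing_budget" "product" {
  billing_account = "012345-6789AB-CDEF01"
  display_name    = "Example product monthly budget"

  budget_filter {
    projects = ["projects/123456789012"]
  }

  amount {
    specified_amount {
      currency_code = "GBP"
      units         = "250"
    }
  }

  threshold_rules {
    threshold_percent = 0.9
  }
}
```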
Individual user accounts can be designated as "product owners". These accounts
will be given "Owner" rights on projects under the product folder. More
information can be found on the [permissions and
roles](./permissions-and-roles.md) page.
## The product folder and meta project
When a new product is provisioned in Google Cloud, we take the budget, finance
code and list of owners and add them to the gcp-infra project. The terraform
will create a *product folder* and a *meta project* for the product.
The *product folder* contains all Google Cloud projects which relate to a given
product. We generally use completely separate Google Cloud projects for each
of our production, test and development environments. Occasionally we'll
provision additional projects within the product folder to develop larger
features which are likely to require multiple iterations before they can be
merged into the product as a whole.
The *meta project* is a project within the product folder which contains
resources which are specific to the product but which are shared by *all*
environments.
For example, here are the projects within the "Applicant Portal" product folder:
<center>![An example product folder](./product-folder.png)</center>
You can see how the meta project is tagged with the team name and cost-centre
code. A running total of costs can also be seen.
Examples of resources which belong within the meta project are:
* A Cloud DNS zone for the product. Usually of the form
`[product].gcp.uis.cam.ac.uk`. Projects within the product folder will
generally contain environment-specific zones such as
`devel.[product].gcp.uis.cam.ac.uk`.
* Product-wide secrets such as GitLab access tokens or credentials for external
services which are shared by all environments.
* A Cloud Storage bucket which hosts the terraform state (a backend
configuration sketch follows this list).
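As an example of the last point, a deployment can point its terraform state at that bucket with a standard GCS backend block; the bucket name below is a placeholder:

```tf
# Illustrative sketch: store terraform state in the product's configuration
# bucket in the meta project. The bucket name is a placeholder.
terraform {
  backend "gcs" {
    bucket = "example-product-config-bucket"
    prefix = "terraform/state"
  }
}
```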
## Admin Accounts
The personal accounts of each product owner are added to the product folder with
the "owner" role. This allows all product owners to examine and modify
configuration within the Google Cloud console. More
information can be found on the [permissions and
roles](./permissions-and-roles.md) page.
In addition a service account is added to the *meta project* which is also given
owner rights over the product folder and the right to associate projects within
the product folder with the University's Billing Account.
This service account is called the *product admin* service account.
Credentials for the product admin service account are added to a Google Secret
Manager secret within the *meta project*. Anyone who can read this secret can,
in effect, act as the product admin.
## Configuration
The gcp-infra configuration will create a generic "configuration" Cloud Storage
bucket in the *meta project* which can be used for terraform state. It will
place two objects in that bucket: a human-readable HTML configuration "cheat
sheet" and a machine-readable JSON document with the same content. Both are only
readable by Product Owners.
An example of the configuration cheat sheet is as below:
<center>![An example configuration page](./configuration-page.png)</center>
The cheat sheet contains the Product Folder id, the meta project id, the billing
account id, the configuration bucket name, the DNS zone name and the product
admin service account name along with the location of the secret which contains
credentials for the service account.
It is intended that the information on the cheat sheet should be enough for
someone to bring their own deployment tooling to GCP. In DevOps, we have a
[boilerplate
deployment](https://gitlab.developers.cam.ac.uk/uis/devops/gcp-deploy-boilerplate).
The cheat sheet links to the boilerplate and provides values which can be
directly pasted into the boilerplate parameters.
Our boilerplate also makes use of the generated JSON document to automatically
fetch many of these parameters meaning a) we can update them and have
deployments use the new values automatically and b) we don't need to copy-paste
them into the deployment. This is an example of the [Don't Repeat
Yourself](https://en.wikipedia.org/wiki/Don%27t_repeat_yourself) (DRY)
principle.
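A sketch of how a deployment might consume the machine-readable document; the bucket name, object name and decoded keys here are placeholders rather than the boilerplate's actual values:

```tf
# Illustrative sketch: read the product configuration JSON from the meta
# project's configuration bucket and decode it for use elsewhere.
data "google_storage_bucket_object_content" "product_config" {
  bucket = "example-product-config-bucket"
  name   = "config-v1.json"
}

locals {
  product_config = jsondecode(
    data.google_storage_bucket_object_content.product_config.content
  )

  # e.g. local.product_config.meta_project_id (hypothetical key names)
}
```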
## Summary
In summary, we manage Google Cloud as a collection of *products* and each
product has a dedicated *product folder* and *meta project* created for it in
Google Cloud.
Within the *meta project* we create:
* A *product admin service account* which has rights to create projects within
the product folder and to associate them with the University billing
account.
* *Credentials for the product admin service account* which are stored in a
Google Secret.
* A *general configuration Cloud Storage bucket* which can be used to store
product-wide configuration.
* A *cheat sheet* and *configuration JSON document* providing human-readable and
machine-readable information on the Google Cloud resources provided for the
product.
* A managed *DNS zone* of the form `[product].gcp.uis.cam.ac.uk` which is
pre-wired in to the University DNS.
Each environment, such as "production", "staging" and "development", should be a
separate Google Cloud project created within the product folder.
The individual accounts for each product owner are added as owners on the
product folder and so they can read the product admin service account
credentials or create new ones.
In the [terraform section](./terraform.md) we'll discuss how our standard
terraform deployment makes use of the resources created for a given product.
# Deployments
This section of the guidebook discusses how we deploy applications to Google
Cloud. It does not discuss legacy deployments.
It is intended as a guide for those new to the division or those interested in
how we do things generally.
Not all deployments will use all of the techniques covered in this section but
all the techniques here are embodied in at least one live service. Where
possible, links to techniques being used in a live service are provided. Some of
those links may be restricted to current DevOps members or members of the
University.
**TODO**: add an "overview" here of how things generally hang together with the
following topics:
* Location and naming of deployment repos
* The "zen" of our deployments (infrastructure as code, declarative, as little
click-ops as possible)
* Our "standard" product (prod, staging, dev environments, DNS zones, billing,
etc)
* Using terraform to wire together products.
* Secrets stay within terraform state if they are a) arbitrary and b) need not
be given to anything not managed by terraform.
# Kubernetes
This page documents how we provision and manage Kubernetes clusters via the
Google Kubernetes Engine service.
# Monitoring and Alerting
This page documents how monitoring and alerting is configured.
# Permissions and Roles
This page documents how we manage personal and service account permissions.
## Permissions granted by gcp-infra
As part of the [product provisioning process](./gcp-folders.md), we have a
product folder and product meta project created automatically for each product.
A *product admin* service account is created in the meta project and has owner
rights over all projects in the product folder. Additionally the product admin
can create new projects within the folder and associate them with the University
Billing Account.
Individual accounts are added at one of three levels:
* **Viewer** These users have "viewer" rights over all projects in the product
folder. They can inspect the deployment in the Google Cloud console but
cannot see sensitive resources such as secrets. Giving someone "viewer"
access is appropriate if they're involved in initial troubleshooting, need
to see logs or are being shown around the deployment.
* **Deployer** These users have the ability to view
the value of the Google Secret Manager secret which contains credentials for
the *product admin*. As such they can run the terraform configuration. Since
they can access the product admin credentials, they can perform any actions
the product admin can but only via terraform or the `gcloud` utility; they
cannot modify things in the Google Cloud Console and can only _see_ things
in the console if they are also "Viewer" users.
* **Owner** Owner rights over all projects. Can see all resources in the Google
Cloud Console and modify them. Generally this is given to those directly
involved in maintaining or creating the product's deployment configuration.
The association between individual accounts and these roles is specified in the
gcp-infra project. For DevOps we have also made this information available via a
[team data
module](https://gitlab.developers.cam.ac.uk/uis/devops/infra/terraform/team-data/)
(University members only).
!!! note "Future improvements"
    Since this system was developed, Google Cloud gained knowledge of [Lookup
    groups](https://www.lookup.cam.ac.uk/). As such we can also start assigning
    these roles to members of Lookup groups by granting permissions to
    the `[GROUPID]@groups.lookup.cam.ac.uk` Google Group.
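A sketch of what granting a role to a Lookup group could look like; the folder id, role and group id are placeholders:

```tf
# Illustrative sketch: grant viewer rights over a product folder to the
# members of a Lookup group known to Google. Values are placeholders.
resource "google_folder_iam_member" "product_viewers" {
  folder = "folders/123456789012"
  role   = "roles/viewer"
  member = "group:example-group@groups.lookup.cam.ac.uk"
}
```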
## Per-environment permissions
### Project admin
Our terraform configurations usually create a *project admin* service account
which is given owner rights over the environment-specific Google Project but
cannot affect resources in other Google Projects. This is usually a safer
service account identity to use and is the identity we associate with the
default `google` and `google-beta` providers.
This service account is created [within the gcp-project terraform
module](https://gitlab.developers.cam.ac.uk/uis/devops/infra/terraform/gcp-project/-/blob/master/main.tf#L29).
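For flavour, the essence of what that module does can be sketched as follows; the project id is a placeholder and the real module (linked above) is more involved:

```tf
# Illustrative sketch of a project admin service account with owner rights
# over a single environment project. The real gcp-project module differs.
resource "google_service_account" "project_admin" {
  project      = "example-devel-project"
  account_id   = "project-admin"
  display_name = "Project admin"
}

resource "google_project_iam_member" "project_admin_owner" {
  project = "example-devel-project"
  role    = "roles/owner"
  member  = "serviceAccount:${google_service_account.project_admin.email}"
}
```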
### Cloud Run Deployer
For [continuous deployment](./continuous-deployment.md) we make use of a service
account which has rights only to deploy new revisions within an existing Cloud
Run service, not to create new ones or delete existing ones.
We do this by means of a custom Cloud IAM role created with the following
terraform:
```tf
resource "google_project_iam_custom_role" "run_editor" {
  role_id     = "runEditor"
  title       = "Run Editor"
  description = "Update existing Cloud Run deployments"
  permissions = [
    "run.services.get",
    "run.services.list",
    "run.services.update",
    "run.routes.get",
    "run.routes.list",
    "run.configurations.get",
    "run.configurations.list",
    "run.revisions.get",
    "run.revisions.list",
    "run.locations.get",
    "run.locations.list",
  ]
}
```
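The custom role can then be bound to a deployment service account with an ordinary IAM member resource; a sketch, with a placeholder project id and service account email:

```tf
# Illustrative sketch: allow a dedicated deployment service account to use the
# custom role above. Project and service account email are placeholders.
resource "google_project_iam_member" "deployer_run_editor" {
  project = "example-devel-project"
  role    = google_project_iam_custom_role.run_editor.id
  member  = "serviceAccount:run-deployer@example-devel-project.iam.gserviceaccount.com"
}
```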
### Service identities
All processes within Google Cloud have an identity associated with them. There
is usually a default identity for each service. For example, unless configured
otherwise, applications running via Cloud Run have the [Compute Engine default
service account
identity](https://cloud.google.com/run/docs/securing/service-identity).
To allow granular control we try to configure all processes to have dedicated
service accounts. In particular you should try to create custom service accounts
for:
* Cloud Run services,
* Cloud Function functions, and
* Kubernetes workloads via [GKE workload
identity](https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity).
!!! example
    Within the [Raven
    infrastructure](https://gitlab.developers.cam.ac.uk/uis/devops/raven/infrastructure/-/blob/master/raven_stats.tf)
    (DevOps only) we create a Cloud Function which fetches device statistics via
    the Google Analytics API and stores a summary report in a Cloud Storage
    bucket. This function runs as a custom service account which is given rights
    to update the summary in the bucket and is added to our Google Analytics
    property as a read-only user.
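As a generic sketch of the Cloud Run case (the names, image and region are placeholders, not taken from a real deployment):

```tf
# Illustrative sketch: a dedicated identity for a Cloud Run service rather
# than the Compute Engine default service account. Values are placeholders.
resource "google_service_account" "webapp" {
  account_id   = "webapp-run"
  display_name = "Web application Cloud Run identity"
}

resource "google_cloud_run_service" "webapp" {
  name     = "webapp"
  location = "europe-west2"

  template {
    spec {
      service_account_name = google_service_account.webapp.email

      containers {
        image = "gcr.io/example-project/webapp:latest"
      }
    }
  }
}
```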
## Summary
* Each *product* has a *product admin* service account which has "Owner" rights
over all projects in the product folder.
* Per *product* permissions are "Owner", "Deployer" and "Viewer". "Owners" can
do everything, "Deployers" can run terraform and use the `gcloud` CLI to do
anything and "Viewers" can see how infrastructure is configured but not some
sensitive resources such as secrets.
* There is one Google project per environment. For example, one project for
"production", one for "staging", etc.
* Each environment-specific Google project has a *project admin* service account
which has "owner" rights over that project.
* The default terraform provider uses the *project admin* service account
credentials.
* We create custom Cloud IAM roles to allow continuous deployment jobs to deploy
new versions but not create new services or delete old ones.
* We try to explicitly create custom service account identities for all
processes.
# SQL Instances
This page documents how we deploy and configure SQL database instances.
We almost always make use of [PostgreSQL](https://www.postgresql.org/) SQL
instances. PostgreSQL is best-of-breed among [Relational Database Management
Systems](https://en.wikipedia.org/wiki/Relational_database) both in terms of
performance and features.
Occasionally we are required to run a [MySQL
instance](https://en.wikipedia.org/wiki/MySQL) when it is the only database
supported by a third-party application.
Although it can have some gnarly edges, the SQL data model is mature, well
documented and there are a lot of resources on the Web which help you get to
grips with it.
!!! example
    The [GitLab service](https://gitlab.developers.cam.ac.uk/) as part of the
    University Developers' Hub stores all non-git state in a PostgreSQL
    database and serves over 1,000 users daily with a highly complex data model.
We make use of [Cloud SQL](https://cloud.google.com/sql/) managed SQL instances.
## "Database" versus "instance"
There is a difference between a _SQL instance_ and a _database_. A SQL
instance corresponds to a single instance of a database server cluster. A
single PostgreSQL cluster supports the creation of multiple databases. A given
SQL instance can, therefore, have multiple databases within it.
Usually we configure one instance per _product_ and create one or more databases
within that instance for each _web application_.
Cloud SQL supports PostgreSQL instances, MySQL instances and [SQL
Server](https://en.wikipedia.org/wiki/Microsoft_SQL_Server) instances.
We have not yet had a need for a SQL Server instance and so this facet of
Cloud SQL is under-explored within the team.
## PostgreSQL versions
Usually we use the latest version of PostgreSQL available when a product is
first deployed. A notable omission in Cloud SQL is any method of automatic
version upgrading. Instead, the [recommended
procedure](https://cloud.google.com/sql/docs/postgres/upgrade-db) is to dump the
database as a SQL document in a Cloud Storage bucket, re-provision the database
with the new engine version and restore the SQL dump.
!!! warning
    This somewhat clunky upgrade procedure means it is almost impossible to
    easily update a service with zero downtime. In principle we could spin up a
    new PostgreSQL instance and use PostgreSQL's native streaming replication
    support to move from one instance to the other with zero downtime.
    We have not yet done this and so this particular achievement is waiting for
    someone to unlock it.
## Boilerplate support
In our standard boilerplate, the Cloud SQL instance is configured in
[sql.tf](https://gitlab.developers.cam.ac.uk/uis/devops/gcp-deploy-boilerplate/-/blob/master/%7B%7B%20cookiecutter.product_slug%20%7D%7D-deploy/sql.tf).
Our boilerplate can generate configuration for PostgreSQL or MySQL instances.
!!! note "SQL instance naming"
    Our boilerplate uses a random name for the SQL instance since, annoyingly,
    SQL instance names cannot easily be re-used for [up to a
    week](https://cloud.google.com/sql/docs/postgres/delete-instance) after
    deletion.
We make use of Google's [terraform
module](https://registry.terraform.io/modules/GoogleCloudPlatform/sql-db/) to
configure the instance. For non-production environments we usually go with a
`db-f1-micro` tier instance with no failover server. For production we tune the
instance size to meet expected demand and configure PostgreSQL in a
highly-available "hot-failover" configuration. The cluster configuration is
managed for us by Cloud SQL.
Automatic backups are scheduled for 1AM each morning and we set the allowable
maintenance window to start at 2AM on Sunday mornings.
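In terms of the underlying resource the module manages on our behalf, the non-production configuration looks roughly like the sketch below, shown directly as a `google_sql_database_instance` for clarity; the names and values are placeholders rather than the boilerplate's actual settings:

```tf
# Illustrative sketch: a small non-production PostgreSQL instance with nightly
# backups and a Sunday-morning maintenance window. Values are placeholders.
resource "random_id" "sql_instance" {
  byte_length = 4
}

resource "google_sql_database_instance" "default" {
  name             = "sql-${random_id.sql_instance.hex}"
  database_version = "POSTGRES_11"
  region           = "europe-west2"

  settings {
    tier              = "db-f1-micro"  # sized to expected demand in production
    availability_type = "ZONAL"        # "REGIONAL" for production hot failover

    backup_configuration {
      enabled    = true
      start_time = "01:00"
    }

    maintenance_window {
      day  = 7  # Sunday
      hour = 2
    }
  }
}
```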
## "Off-site" backups
We have [a custom
tool](https://gitlab.developers.cam.ac.uk/uis/devops/tools/gcp-sql-backup) which
runs nightly to backup our production databases to a Google Cloud Storage
bucket. This bucket is hosted in a Google project dedicated to this purpose. As
such we have a degree of resilience against deleting the SQL instance or the
host project, both of which would delete the automated nightly backups made by
Cloud SQL itself.
No setup is required within our terraform for this. The service account which
performs nightly backups is granted the appropriate IAM rights at the top-level
Google Folder level and uses the Google API to automatically discover instances
which need backing up.
!!! note "The power of APIs"
    One of the advantages of using a Cloud-hosted product for our databases is
    that it is designed to be introspected, managed and provisioned via an API.
    As such the [code for our backup
    utility](https://gitlab.developers.cam.ac.uk/uis/devops/tools/gcp-sql-backup/-/blob/master/gcpsqlbackup/__init__.py)
    is very small: around 500 lines including documentation and comments.
## Database users and service accounts