Commit f45de9cb authored by Dr Abraham Martin
Merge branch 'issue-95-further-deployment-docs' into 'master'

add deployment documentation for k8s and ingress

Closes #119

See merge request !112
parents 70bd7d72 f66a55c9
# Kubernetes Clusters

[Kubernetes](https://kubernetes.io/), often shortened to "k8s", is a cluster
management and workload orchestration system which can be used to host
container-based applications. This page documents how we provision, configure
and make use of kubernetes clusters in our deployments via the Google
Kubernetes Engine service.
## When to use kubernetes
Kubernetes excels when your application is made up of _multiple_ containers
which need to interact with each other and/or maintain some shared state outside
of a database. Kubernetes provides dedicated resources for these use cases which
are tedious to replicate via other means.
That being said, we rarely make use of kubernetes in our deployments for the
following reasons:
* We need to dedicate at least one and typically three VMs along with associated
storage for a minimal cluster. This prevents us from leveraging "scale to
zero" optimisations.
* Even for a fully occupied VM, the per second cost is unfavourable compared to
solutions such as Cloud Run.
* Leaving aside cluster size optimisation, configuring autoscaling within the
cluster per application or per container pod is tricky; the sketch after this
list gives a flavour of what is involved.
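As an illustration, each application needs its own HorizontalPodAutoscaler
object with individually tuned replica limits and thresholds. A minimal sketch
via the terraform kubernetes provider, with a hypothetical deployment name and
hypothetical thresholds:

```tf
# Scale the (hypothetical) "webapp" deployment between 1 and 5 replicas,
# targeting 70% average CPU utilisation. Every application needs its own
# autoscaler object and its own tuning.
resource "kubernetes_horizontal_pod_autoscaler_v2" "webapp" {
  metadata {
    name = "webapp"
  }

  spec {
    min_replicas = 1
    max_replicas = 5

    scale_target_ref {
      api_version = "apps/v1"
      kind        = "Deployment"
      name        = "webapp"
    }

    metric {
      type = "Resource"

      resource {
        name = "cpu"

        target {
          type                = "Utilization"
          average_utilization = 70
        }
      }
    }
  }
}
```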
As such we tend to use kubernetes only when:
* The application we are deploying requires kubernetes, for example by being packaged
as a helm chart. This is the case with GitLab which is deployed via terraform
configuration in a [Dedicated GitLab
project](https://gitlab.developers.cam.ac.uk/uis/devops/devhub/gitlab-deploy)
(DevOps only).
* We require specific container affinity for load balancing. This is the case
for Raven SAML2. The Shibboleth software requires that [conversational
state](https://shibboleth.atlassian.net/wiki/spaces/IDP4/pages/1265631729/Clustering#Conversational-State)
always be maintained within a single container and so requires advanced load
balancing configuration.
* We require use of advanced kubernetes features such as [sidecar
containers](https://www.magalix.com/blog/the-sidecar-pattern) or [stateful
sets](https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/).
This is the case for GitLab.
We prefer the following technologies over kubernetes when possible:
* Use [Cloud Run](https://cloud.google.com/run) for single-container hosting
where that container listens via HTTP. We use these instead of kubernetes
ReplicaSet or DaemonSet resources.
* Use either Cloud Run's inbuilt HTTP load balancer or explicit Google [Cloud
Load Balancing](https://cloud.google.com/load-balancing) resources for
ingress. We use these instead of kubernetes Ingress resources.
* Use [Cloud Scheduler](https://cloud.google.com/scheduler) for triggering
scheduled jobs. We use these instead of kubernetes CronJob resources.
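As a minimal illustration of the first preference, a single HTTP container can
be hosted on Cloud Run with a few lines of terraform. This is a sketch only;
the service name and image below are hypothetical:

```tf
# Host a single HTTP container on Cloud Run instead of a kubernetes
# ReplicaSet; Cloud Run provides scaling, including scale to zero,
# automatically.
resource "google_cloud_run_service" "webapp" {
  name     = "webapp"
  location = "europe-west2"

  template {
    spec {
      containers {
        image = "europe-west2-docker.pkg.dev/example-project/webapp/webapp:latest"
      }
    }
  }
}
```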
In order to increase isolation and to aid migration of service management from
one team to another, when we use kubernetes we create one cluster per
environment and per service.
## Creating the cluster
In Google Cloud, kubernetes clusters consist of one or more VMs. Clusters with
_regional_ high-availability must have at least one VM per availability zone
within the region. We use high-availability clusters for production service
instances and single-VM clusters for test and development instances.
Cluster creation is usually via a single `gke.tf` file taking some values from
the boilerplate's `locals.tf` file:
```tf
module "cluster" {
  source  = "git::ssh://git@gitlab.developers.cam.ac.uk/uis/devops/infra/terraform/gke-cluster.git?ref=v2"
  project = local.project

  # For single VM clusters, we need to use a zone like "europe-west2-a" rather
  # than a region.
  location = local.is_production ? local.region : "${local.region}-a"

  # Usually we find ourselves needing to tweak the VM size to fit a given
  # application. This is simply an example of a 2 vCPU, 16GiB RAM machine.
  machine_type = "e2-custom-2-16384"

  # Google Cloud can associate a Google Cloud IAM identity with each workload
  # in the cluster. It is harmless to enable this and useful to be able to
  # call Google APIs without needing to pass additional credentials.
  enable_workload_identity = true
}
```
Our module can take other arguments as well. See the [full
list](https://gitlab.developers.cam.ac.uk/uis/devops/infra/terraform/gke-cluster/-/blob/master/variables.tf)
in the module project itself.
## Kubernetes terraform provider
The standard [kubernetes
provider](https://registry.terraform.io/providers/hashicorp/kubernetes/latest)
can be used to create some kubernetes resources as outlined in the [provider
documentation](https://registry.terraform.io/providers/hashicorp/kubernetes/latest/docs).
Our cluster module binds roles to the terraform Google Cloud user allowing it to
perform cluster admin tasks. As such we can configure the provider to use the
same credentials as the Google provider in `providers.tf` and `versions.tf`:
```tf
# providers.tf

# The google_client_config data source fetches a token from the Google
# Authorization server, which expires in 1 hour by default.
data "google_client_config" "default" {}

provider "kubernetes" {
  # Depending on the cluster configuration you may also need to pass the
  # cluster's CA certificate here.
  host  = "https://${module.cluster.endpoint}"
  token = data.google_client_config.default.access_token
}

# versions.tf

terraform {
  required_providers {
    kubernetes = {
      source  = "hashicorp/kubernetes"
      version = "~> 2.7"
    }
  }
}
```
Examples of using the associated `kubernetes_...` resources can be found within
the [Raven SAML2 deployment](https://gitlab.developers.cam.ac.uk/uis/devops/raven/infrastructure/-/blob/master/shibboleth.tf) (DevOps only).
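Since that project is not visible to everyone, the following is a minimal
sketch of a `kubernetes_deployment` resource; the names, image and port are
hypothetical:

```tf
# A minimal two-replica deployment of a single-container web application.
resource "kubernetes_deployment" "webapp" {
  metadata {
    name = "webapp"
  }

  spec {
    replicas = 2

    selector {
      match_labels = {
        app = "webapp"
      }
    }

    template {
      metadata {
        labels = {
          app = "webapp"
        }
      }

      spec {
        container {
          name  = "webapp"
          image = "europe-west2-docker.pkg.dev/example-project/webapp/webapp:latest"

          port {
            container_port = 8080
          }
        }
      }
    }
  }
}
```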
## Custom resources
Some kubernetes resources are not exposed directly by the terraform kubernetes
provider. For these resources we use the
[kubernetes_manifest](https://registry.terraform.io/providers/hashicorp/kubernetes/latest/docs/resources/manifest)
resource.
For example, to create a new Google managed certificate for
`example.apps.cam.ac.uk`:
```tf
resource "kubernetes_manifest" "managed_certificates" {
  manifest = {
    apiVersion = "networking.gke.io/v1"
    kind       = "ManagedCertificate"
    metadata = {
      name = "example-cert"
    }
    spec = {
      domains = ["example.apps.cam.ac.uk"]
    }
  }
}
```
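A certificate created this way is consumed by referencing its name from a GKE
Ingress, conventionally via the `networking.gke.io/managed-certificates`
annotation. A hedged sketch, with a hypothetical backend service:

```tf
# Attach the managed certificate to a GKE Ingress by name.
resource "kubernetes_ingress_v1" "webapp" {
  metadata {
    name = "webapp"
    annotations = {
      "networking.gke.io/managed-certificates" = "example-cert"
    }
  }

  spec {
    default_backend {
      service {
        name = "webapp"
        port {
          number = 8080
        }
      }
    }
  }
}
```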
Examples of custom resources can be found within the [Raven SAML2
deployment](https://gitlab.developers.cam.ac.uk/uis/devops/raven/infrastructure/-/blob/master/shibboleth.tf)
(DevOps only).
## Monitoring
Applications hosted by kubernetes can be monitored in the usual fashion. In
addition, we configure alerts for high memory, disk or CPU usage by
individual nodes and pods. See the [monitoring configuration for Raven
SAML2](https://gitlab.developers.cam.ac.uk/uis/devops/raven/infrastructure/-/blob/master/shibboleth_gke_monitoring.tf)
as an example (DevOps only).
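For an indication of what such an alert looks like, here is a sketch of a node
CPU alert using the `google_monitoring_alert_policy` resource. The threshold,
duration and aggregation below are hypothetical and notification channels are
omitted:

```tf
# Alert when any node's allocatable CPU utilisation stays above 90% for
# 15 minutes.
resource "google_monitoring_alert_policy" "node_cpu" {
  display_name = "GKE node CPU usage high"
  combiner     = "OR"

  conditions {
    display_name = "Node allocatable CPU utilisation > 90%"

    condition_threshold {
      filter          = "resource.type = \"k8s_node\" AND metric.type = \"kubernetes.io/node/cpu/allocatable_utilization\""
      comparison      = "COMPARISON_GT"
      threshold_value = 0.9
      duration        = "900s"

      aggregations {
        alignment_period   = "300s"
        per_series_aligner = "ALIGN_MEAN"
      }
    }
  }
}
```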
## Summary
In summary:
* We use kubernetes only when we cannot make use of other cloud-hosting
technologies.
* We have a standard terraform module for creating kubernetes clusters.
* Kubernetes resources are usually managed via terraform directly.
* If a required resource type is supported by the hashicorp kubernetes
provider, use the hashicorp provider's resource.
* Unsupported resources can use the generic `kubernetes_manifest` resource.
* We monitor node CPU, memory and disk usage and alert when any of these become
high for a sustained period.
* We are happy to use helm for third-party applications but have decided that it
is one extra layer of indirection we don't need for our own applications.
# Traffic ingress
Our boilerplate prefers the use of [Cloud Run](https://cloud.google.com/run) to
host web applications. Applications hosted via Cloud Run can use two different
services to connect traffic from the outside world to them: Cloud Load Balancers
and Domain Mappings. This page documents when and how to use both.
## Our Cloud Run module
We have a [standard
module](https://gitlab.developers.cam.ac.uk/uis/devops/infra/terraform/gcp-cloud-run-app)
which we use to configure applications hosted in Cloud Run. This module can be
used to configure both domain mapping and load balancer ingress. See [the
module's documentation for more
details](https://gitlab.developers.cam.ac.uk/uis/devops/infra/terraform/gcp-cloud-run-app/-/tree/master#ingress-style).
## Which ingress to use
This section helps you select which ingress style to use.
### No ingress
If no ingress style is specified and the application is marked as being public,
a URL will be generated for the application under the `.run.app` domain. This is
unlikely to be a human-friendly name but may suffice for test or development
instances.
### Domain mapping
The Cloud Run domain mapping ingress is simple to configure:
1. Verify [a DNS domain](./dns.md) for the application.
2. Specify the DNS domain via our standard module's `dns_names` variable.
3. Add DNS records for that domain according to the `dns_resource_records`
output from our standard module.
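To make steps 2 and 3 concrete, here is a hedged sketch of a module
invocation using the `dns_names` variable and `dns_resource_records` output
named above; the application configuration is elided and the exact interface
should be checked against the module documentation:

```tf
module "webapp" {
  source = "git::ssh://git@gitlab.developers.cam.ac.uk/uis/devops/infra/terraform/gcp-cloud-run-app.git"

  # ... application configuration elided ...

  # Step 2: the verified domain(s) which should be mapped to the service.
  dns_names = ["example.apps.cam.ac.uk"]
}

# Step 3: surface the records which must be added to the DNS zone.
output "dns_resource_records" {
  value = module.webapp.dns_resource_records
}
```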
The application will then be served from the DNS domain provided. There are some
restrictions surrounding domain mapping of which the most pressing is that
**applications hosted in europe-west2 (London) cannot use domain mapping**. _De
facto_ this restricts our more modern applications to using load balancing or
the "no ingress" configuration above.
### Load balancer
A Cloud Load Balancer is appropriate when there are complex needs. Such needs
include:
* Using non-managed TLS certificates. This is often the case if one is
transitioning a service from on-premises to Cloud.
* Using [Cloud Identity Aware Proxy](https://cloud.google.com/iap) to restrict
resources to particular identities.
* The need to shape or otherwise filter traffic via Cloud Armor rules or
backend weighting.
Our standard module comes with a basic Load Balancer configuration. For advanced
use you may find that you need to configure it yourself. Examples of manually
configured load balancers are:
* The [Raven Core IdP
configuration](https://gitlab.developers.cam.ac.uk/uis/devops/raven/infrastructure/-/blob/master/ravencore_load_balancer.tf)
(DevOps only) configures a basic load balancer. This was done ahead of support in our
standard module and so provides a "minimal" configuration example with no
fancy features.
* The [Raven Admin
API configuration](https://gitlab.developers.cam.ac.uk/uis/devops/raven/legacy/infrastructure/-/blob/master/admin_scripts.tf)
includes an example of configuring Cloud Identity Aware Proxy to restrict
certain resources to individual service accounts.
Load Balancers incur additional costs when used and so domain mapping should be
used in preference if possible.
## When **not** to use an ingress
Sometimes you will not need to use either a load balancer or a domain mapping
ingress. This is usually for services which are only ever called by resources
within the parent Google project. A typical example of this is a service which
is used in combination with [Cloud
Scheduler](https://cloud.google.com/scheduler) to perform actions at regular
intervals.
In these cases our standard module may not prove sufficient as it assumes the
application you are hosting is public.
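For instance, an internal-only Cloud Run service can be invoked by Cloud
Scheduler using an OIDC token rather than a public ingress. A hedged sketch,
with hypothetical names, URL and service account:

```tf
# Invoke a private Cloud Run service every hour, authenticating with an
# OIDC token so that no public ingress is required.
resource "google_cloud_scheduler_job" "tick" {
  name      = "webapp-hourly-tick"
  schedule  = "0 * * * *"
  time_zone = "Europe/London"

  http_target {
    http_method = "POST"
    uri         = "https://webapp-abc123-nw.a.run.app/tick"

    oidc_token {
      service_account_email = "scheduler@example-project.iam.gserviceaccount.com"
    }
  }
}
```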
## Summary
In summary:
* Ingresses should be used for externally available production applications or
internal applications which need to make use of Identity Aware Proxy
policies.
* Our standard Cloud Run module can configure domain mapping or load balancer
ingresses.
* Domain mapping ingresses are easy to use but are inflexible.
* Load balancer ingresses are flexible but may require additional configuration.
* Load balancer ingresses should be used when:
* the application is hosted in a region not supported by domain mappings,
* custom TLS certificates need to be used, or
* custom access and routing policies are required.
* Custom access policies for load balancers can be implemented via Cloud Armor
policies.
* There are cost implications with using load balancers which means their use
should be considered carefully.
```diff
@@ -47,6 +47,7 @@ nav:
   - deployment/dns.md
   - deployment/sql-instances.md
   - deployment/web-applications.md
+  - deployment/traffic-ingress.md
   - deployment/k8s-clusters.md
   - deployment/monitoring-and-alerting.md
   - deployment/continuous-deployment.md
```