Commit f45de9cb authored by Dr Abraham Martin
Merge branch 'issue-95-further-deployment-docs' into 'master'

add deployment documentation for k8s and ingress

Closes #119

See merge request !112
parents 70bd7d72 f66a55c9
# Kubernetes Clusters

[Kubernetes](https://kubernetes.io/), often shortened to "k8s", is a cluster
management and workload orchestration system which can be used to host
container-based applications. This page documents how we provision, configure
and make use of kubernetes clusters in our deployments via the Google
Kubernetes Engine service.
## When to use kubernetes
Kubernetes excels when your application is made up of _multiple_ containers
which need to interact with each other and/or maintain some shared state outside
of a database. Kubernetes provides dedicated resources for these use cases which
are tedious to replicate via other means.
That being said, we rarely make use of kubernetes in our deployments for the
following reasons:
* We need to dedicate at least one and typically three VMs along with associated
storage for a minimal cluster. This prevents us from leveraging "scale to
zero" optimisations.
* Even for a fully occupied VM, the per second cost is unfavourable compared to
solutions such as Cloud Run.
* Leaving aside cluster size optimisation, configuring autoscaling within the
cluster per application or per container pod is tricky; the sketch after this
list gives a flavour of what is involved.
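As an illustration, each application needs its own HorizontalPodAutoscaler
object with individually tuned replica limits and thresholds. A minimal sketch
via the terraform kubernetes provider, with a hypothetical deployment name and
hypothetical thresholds:

```tf
# Scale the (hypothetical) "webapp" deployment between 1 and 5 replicas,
# targeting 70% average CPU utilisation. Every application needs its own
# autoscaler object and its own tuning.
resource "kubernetes_horizontal_pod_autoscaler_v2" "webapp" {
  metadata {
    name = "webapp"
  }

  spec {
    min_replicas = 1
    max_replicas = 5

    scale_target_ref {
      api_version = "apps/v1"
      kind        = "Deployment"
      name        = "webapp"
    }

    metric {
      type = "Resource"

      resource {
        name = "cpu"

        target {
          type                = "Utilization"
          average_utilization = 70
        }
      }
    }
  }
}
```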
As such we tend to use kubernetes only when:
* The application we are deploying requires kubernetes, for example by being packaged
as a helm chart. This is the case with GitLab which is deployed via terraform
configuration in a [Dedicated GitLab
project](https://gitlab.developers.cam.ac.uk/uis/devops/devhub/gitlab-deploy)
(DevOps only).
* We require specific container affinity for load balancing. This is the case
for Raven SAML2. The Shibboleth software requires that [conversational
state](https://shibboleth.atlassian.net/wiki/spaces/IDP4/pages/1265631729/Clustering#Conversational-State)
always be maintained within a single container and so requires advanced load
balancing configuration.
* We require use of advanced kubernetes features such as [sidecar
containers](https://www.magalix.com/blog/the-sidecar-pattern) or [stateful
sets](https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/).
This is the case for GitLab.
We prefer the following technologies over kubernetes when possible:
* Use [Cloud Run](https://cloud.google.com/run) for single-container hosting
where that container listens via HTTP. We use these instead of kubernetes
ReplicaSet or DaemonSet resources.
* Use either Cloud Run's inbuilt HTTP load balancer or explicit Google [Cloud
Load Balancing](https://cloud.google.com/load-balancing) resources for
ingress. We use these instead of kubernetes Ingress resources.
* Use [Cloud Scheduler](https://cloud.google.com/scheduler) for triggering
scheduled jobs. We use these instead of kubernetes CronJob resources.
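As a minimal illustration of the first preference, a single HTTP container can
be hosted on Cloud Run with a few lines of terraform. This is a sketch only;
the service name and image below are hypothetical:

```tf
# Host a single HTTP container on Cloud Run instead of a kubernetes
# ReplicaSet; Cloud Run provides scaling, including scale to zero,
# automatically.
resource "google_cloud_run_service" "webapp" {
  name     = "webapp"
  location = "europe-west2"

  template {
    spec {
      containers {
        image = "europe-west2-docker.pkg.dev/example-project/webapp/webapp:latest"
      }
    }
  }
}
```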
In order to increase isolation and to aid migration of service management from
one team to another, when we use kubernetes we create one cluster per
environment and per service.
## Creating the cluster
In Google Cloud, kubernetes clusters consist of one or more VMs. Clusters with
_regional_ high-availability must have at least one VM per availability zone
within the region. We use high-availability clusters for production service
instances and single-VM clusters for test and development instances.
Cluster creation is usually via a single `gke.tf` file taking some values from
the boilerplate's `locals.tf` file:
```tf
module "cluster" {
  source  = "git::ssh://git@gitlab.developers.cam.ac.uk/uis/devops/infra/terraform/gke-cluster.git?ref=v2"
  project = local.project

  # For single VM clusters, we need to use a zone like "europe-west2-a" rather
  # than a region.
  location = local.is_production ? local.region : "${local.region}-a"

  # Usually we find ourselves needing to tweak the VM size to fit a given
  # application. This is simply an example of a 2 vCPU, 16GiB RAM machine.
  machine_type = "e2-custom-2-16384"

  # Google Cloud can associate a Google Cloud IAM identity with each workload
  # in the cluster. It is harmless to enable this and useful to be able to
  # call Google APIs without needing to pass additional credentials.
  enable_workload_identity = true
}
```
Our module can take other arguments as well. See the [full
list](https://gitlab.developers.cam.ac.uk/uis/devops/infra/terraform/gke-cluster/-/blob/master/variables.tf)
in the module project itself.
## Kubernetes terraform provider
The standard [kubernetes
provider](https://registry.terraform.io/providers/hashicorp/kubernetes/latest)
can be used to create some kubernetes resources as outlined in the [provider
documentation](https://registry.terraform.io/providers/hashicorp/kubernetes/latest/docs).
Our cluster module binds roles to the terraform Google Cloud user allowing it to
perform cluster admin tasks. As such we can configure the provider to use the
same credentials as the Google provider in `providers.tf` and `versions.tf`:
```tf
# providers.tf

# The google_client_config data source fetches a token from the Google
# Authorization server, which expires in 1 hour by default.
data "google_client_config" "default" {}

provider "kubernetes" {
  # Depending on the cluster configuration you may also need to pass the
  # cluster's CA certificate here.
  host  = "https://${module.cluster.endpoint}"
  token = data.google_client_config.default.access_token
}

# versions.tf

terraform {
  required_providers {
    kubernetes = {
      source  = "hashicorp/kubernetes"
      version = "~> 2.7"
    }
  }
}
```
Examples of using the associated `kubernetes_...` resources can be found within
the [Raven SAML2 deployment](https://gitlab.developers.cam.ac.uk/uis/devops/raven/infrastructure/-/blob/master/shibboleth.tf) (DevOps only).
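Since that project is not visible to everyone, the following is a minimal
sketch of a `kubernetes_deployment` resource; the names, image and port are
hypothetical:

```tf
# A minimal two-replica deployment of a single-container web application.
resource "kubernetes_deployment" "webapp" {
  metadata {
    name = "webapp"
  }

  spec {
    replicas = 2

    selector {
      match_labels = {
        app = "webapp"
      }
    }

    template {
      metadata {
        labels = {
          app = "webapp"
        }
      }

      spec {
        container {
          name  = "webapp"
          image = "europe-west2-docker.pkg.dev/example-project/webapp/webapp:latest"

          port {
            container_port = 8080
          }
        }
      }
    }
  }
}
```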
## Custom resources
Some kubernetes resources are not exposed directly by the terraform kubernetes
provider. For these resources we use the
[kubernetes_manifest](https://registry.terraform.io/providers/hashicorp/kubernetes/latest/docs/resources/manifest)
resource.
For example, to create a new Google managed certificate for
`example.apps.cam.ac.uk`:
```tf
resource "kubernetes_manifest" "managed_certificates" {
  manifest = {
    apiVersion = "networking.gke.io/v1"
    kind       = "ManagedCertificate"
    metadata = {
      name = "example-cert"
    }
    spec = {
      domains = ["example.apps.cam.ac.uk"]
    }
  }
}
```
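A certificate created this way is consumed by referencing its name from a GKE
Ingress, conventionally via the `networking.gke.io/managed-certificates`
annotation. A hedged sketch, with a hypothetical backend service:

```tf
# Attach the managed certificate to a GKE Ingress by name.
resource "kubernetes_ingress_v1" "webapp" {
  metadata {
    name = "webapp"
    annotations = {
      "networking.gke.io/managed-certificates" = "example-cert"
    }
  }

  spec {
    default_backend {
      service {
        name = "webapp"
        port {
          number = 8080
        }
      }
    }
  }
}
```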
Examples of custom resources can be found within the [Raven SAML2
deployment](https://gitlab.developers.cam.ac.uk/uis/devops/raven/infrastructure/-/blob/master/shibboleth.tf)
(DevOps only).
## Monitoring
Applications hosted by kubernetes can be monitored in the usual fashion. In
addition, we configure alerts for high memory, disk or CPU usage by
individual nodes and pods. See the [monitoring configuration for Raven
SAML2](https://gitlab.developers.cam.ac.uk/uis/devops/raven/infrastructure/-/blob/master/shibboleth_gke_monitoring.tf)
as an example (DevOps only).
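For an indication of what such an alert looks like, here is a sketch of a node
CPU alert using the `google_monitoring_alert_policy` resource. The threshold,
duration and aggregation below are hypothetical and notification channels are
omitted:

```tf
# Alert when any node's allocatable CPU utilisation stays above 90% for
# 15 minutes.
resource "google_monitoring_alert_policy" "node_cpu" {
  display_name = "GKE node CPU usage high"
  combiner     = "OR"

  conditions {
    display_name = "Node allocatable CPU utilisation > 90%"

    condition_threshold {
      filter          = "resource.type = \"k8s_node\" AND metric.type = \"kubernetes.io/node/cpu/allocatable_utilization\""
      comparison      = "COMPARISON_GT"
      threshold_value = 0.9
      duration        = "900s"

      aggregations {
        alignment_period   = "300s"
        per_series_aligner = "ALIGN_MEAN"
      }
    }
  }
}
```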
## Summary
In summary:
* We use kubernetes only when we cannot make use of other cloud-hosting
technologies.
* We have a standard terraform module for creating kubernetes clusters.
* Kubernetes resources are usually managed via terraform directly.
* If a required resource type is supported by the hashicorp kubernetes
provider, use the hashicorp provider's resource.
* Unsupported resources can use the generic `kubernetes_manifest` resource.
* We monitor node CPU, memory and disk usage and alert when any of these become
high for a sustained period.
* We are happy to use helm for third-party applications but have decided that it
is one extra layer of indirection we don't need for our own applications.
# Traffic ingress
Our boilerplate prefers the use of [Cloud Run](https://cloud.google.com/run) to
host web applications. Applications hosted via Cloud Run can use two different
services to connect traffic from the outside world to them: Cloud Load Balancers
and Domain Mappings. This page documents when and how to use both.
## Our Cloud Run module
We have a [standard
module](https://gitlab.developers.cam.ac.uk/uis/devops/infra/terraform/gcp-cloud-run-app)
which we use to configure applications hosted in Cloud Run. This module can be
used to configure both domain mapping and load balancer ingress. See [the
module's documentation for more
details](https://gitlab.developers.cam.ac.uk/uis/devops/infra/terraform/gcp-cloud-run-app/-/tree/master#ingress-style).
## Which ingress to use
This section helps you select which ingress style to use.
### No ingress
If no ingress style is specified and the application is marked as being public,
a URL will be generated for the application under the `.run.app` domain. This is
unlikely to be a human-friendly name but may suffice for test or development
instances.
### Domain mapping
The Cloud Run domain mapping ingress is simple to configure:
1. Verify [a DNS domain](./dns.md) for the application.
2. Specify the DNS domain via our standard module's `dns_names` variable.
3. Add DNS records for that domain according to the `dns_resource_records`
output from our standard module.
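To make steps 2 and 3 concrete, here is a hedged sketch of a module
invocation using the `dns_names` variable and `dns_resource_records` output
named above; the application configuration is elided and the exact interface
should be checked against the module documentation:

```tf
module "webapp" {
  source = "git::ssh://git@gitlab.developers.cam.ac.uk/uis/devops/infra/terraform/gcp-cloud-run-app.git"

  # ... application configuration elided ...

  # Step 2: the verified domain(s) which should be mapped to the service.
  dns_names = ["example.apps.cam.ac.uk"]
}

# Step 3: surface the records which must be added to the DNS zone.
output "dns_resource_records" {
  value = module.webapp.dns_resource_records
}
```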
The application will then be served from the DNS domain provided. There are some
restrictions surrounding domain mapping of which the most pressing is that
**applications hosted in europe-west2 (London) cannot use domain mapping**. _De
facto_ this restricts our more modern applications to using load balancing or
the "no ingress" configuration above.
### Load balancer
A Cloud Load Balancer is appropriate when there are complex needs. Such needs
include:
* Using non-managed TLS certificates. This is often the case if one is
transitioning a service from on-premises to Cloud.
* Using [Cloud Identity Aware Proxy](https://cloud.google.com/iap) to restrict
resources to particular identities.
* The need to shape or otherwise filter traffic via Cloud Armor rules or
backend weighting.
Our standard module comes with a basic Load Balancer configuration. For advanced
use you may find that you need to configure it yourself. Examples of manually
configured load balancers are:
* The [Raven Core IdP
configuration](https://gitlab.developers.cam.ac.uk/uis/devops/raven/infrastructure/-/blob/master/ravencore_load_balancer.tf)
(DevOps only) configures a basic load balancer. This was done ahead of support in our
standard module and so provides a "minimal" configuration example with no
fancy features.
* The [Raven Admin
API configuration](https://gitlab.developers.cam.ac.uk/uis/devops/raven/legacy/infrastructure/-/blob/master/admin_scripts.tf)
includes an example of configuring Cloud Identity Aware Proxy to restrict
certain resources to individual service accounts.
Load Balancers incur additional costs when used and so domain mapping should be
used in preference if possible.
## When **not** to use an ingress
Sometimes you will not need to use either a load balancer or a domain mapping
ingress. This is usually for services which are only ever called by resources
within the parent Google project. A typical example of this is a service which
is used in combination with [Cloud
Scheduler](https://cloud.google.com/scheduler) to perform actions at regular
intervals.
In these cases our standard module may not prove sufficient as it assumes the
application you are hosting is public.
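For instance, an internal-only Cloud Run service can be invoked by Cloud
Scheduler using an OIDC token rather than a public ingress. A hedged sketch,
with hypothetical names, URL and service account:

```tf
# Invoke a private Cloud Run service every hour, authenticating with an
# OIDC token so that no public ingress is required.
resource "google_cloud_scheduler_job" "tick" {
  name      = "webapp-hourly-tick"
  schedule  = "0 * * * *"
  time_zone = "Europe/London"

  http_target {
    http_method = "POST"
    uri         = "https://webapp-abc123-nw.a.run.app/tick"

    oidc_token {
      service_account_email = "scheduler@example-project.iam.gserviceaccount.com"
    }
  }
}
```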
## Summary
In summary:
* Ingresses should be used for externally available production applications or
internal applications which need to make use of Identity Aware Proxy
policies.
* Our standard Cloud Run module can configure domain mapping or load balancer
ingresses.
* Domain mapping ingresses are easy to use but are inflexible.
* Load balancer ingresses are flexible but may require additional configuration.
* Load balancer ingresses should be used when:
* the application is hosted in a region not supported by domain mappings,
* custom TLS certificates need to be used, or
* custom access and routing policies are required.
* Custom access policies for load balancers can be implemented via Cloud Armor
policies.
* There are cost implications with using load balancers which means their use
should be considered carefully.
```diff
@@ -47,6 +47,7 @@ nav:
   - deployment/dns.md
   - deployment/sql-instances.md
   - deployment/web-applications.md
+  - deployment/traffic-ingress.md
   - deployment/k8s-clusters.md
   - deployment/monitoring-and-alerting.md
   - deployment/continuous-deployment.md
```