Kubernetes Cluster Architecture
Deployment
All apps are deployed via Cloud Build, as defined in cloudbuild.yaml.
Shared resources
Terraform
First, the Cloud Build pipeline sets up some common infrastructure:
Using Terraform templates, it deploys a regional GKE cluster with:

- a dedicated VPC network/subnetwork in the `us-east1` region
- VPC-native networking with alias IPs, to optimize pod communication and provide some security benefits
- a minimal-privilege GKE node service account with access to Stackdriver logging/monitoring and Google Container Registry
- `system` and `sandbox` node pools, spread across 2 availability zones in the region
- GKE Sandbox enabled for the `sandbox` pool, which will be used for app deployments (with additional configuration for the pods, see below)
- Container-Optimized OS with the `containerd` runtime image for the nodes (this is required by GKE Sandbox, but we also use it for `system` nodes, to minimize the Docker exposure surface)
- Shielded VMs for all nodes
- Workload Identity for minimal-privilege app deployments
- the Network Policy cluster add-on for future Pod deployments
- Cluster Autoscaler for the `sandbox` pool, to gradually grow the number of nodes as more workloads are deployed to them
- the Regular release channel for cluster/node upgrades
GKE Config Connector
Then, it installs GKE Config Connector (via the shared/cnrm.sh script) into the `system` pool. Config Connector is a GKE cluster add-on that enables declarative deployment of GCP resources directly from Kubernetes templates.
This is highly convenient, as we can deploy a Pod, an associated minimal-privilege Service Account, and related GCP resources (e.g. Cloud SQL instances) within a single template.
Notice that Config Connector itself needs access to create these GCP resources, and as such is assigned a highly privileged Service Account with the required permissions via a custom IAM role in Terraform. However, we schedule its pods onto the `system` node pool, so it is physically separated from all "third-party apps" in the cluster.
If this becomes a big concern, it's possible to deploy Config Connector onto a separate project/cluster altogether and manage "target project" resources from there (or, alternatively, we could use Terraform to deploy these GCP resources, but that introduces its own complexities).
However, even the current setup may provide sufficient isolation, and is still better than the default Compute Engine Service Account identity typically used for GKE nodes and pods, since the workloads (apps) are now isolated by multiple layers (Workload Identity, containerization, GKE Sandbox where possible, physical nodes, and Network Policies in the future).
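For example, a Cloud SQL instance can be declared with a Config Connector resource type right next to the app's own Kubernetes objects. The snippet below is only a rough sketch with made-up names and settings, not a copy of an actual template in this repo:

```yaml
# Hypothetical sketch: a Cloud SQL instance declared declaratively via
# Config Connector (names and settings are illustrative only).
apiVersion: sql.cnrm.cloud.google.com/v1beta1
kind: SQLInstance
metadata:
  name: example-db
  namespace: example-app
spec:
  region: us-east1
  databaseVersion: POSTGRES_11
  settings:
    tier: db-g1-small
```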
It's also possible to protect GCP resources from deletion when an associated Config Connector resource is removed. This can be done via the `cnrm.cloud.google.com/deletion-policy: abandon` annotation on the resource. We use it for resources such as `SQLInstance`/`SQLDatabase`/`SQLUser`, `ComputeAddress`, `DNSManagedZone`, `ComputeDisk`, and `ComputeResourcePolicy`, as those generally need to "survive" any cluster/app re-deployments.
Since Config Connector relies on the project-global names of such resources, they are automatically "acquired" by it on re-deployment. This allows us to safely delete an entire namespace or even the cluster and re-create it later, so the cluster/namespaces can be treated as "dispensable" resources that can be completely and easily reproduced via Cloud Build when needed.
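As an illustration, a static IP address managed by Config Connector could carry the annotation like this (a sketch with hypothetical names; the actual templates may differ):

```yaml
# Hypothetical sketch: "abandon" keeps the underlying GCP address intact
# when the Kubernetes resource (or its namespace) is deleted.
apiVersion: compute.cnrm.cloud.google.com/v1beta1
kind: ComputeAddress
metadata:
  name: example-app-ip
  namespace: example-app
  annotations:
    cnrm.cloud.google.com/deletion-policy: abandon
spec:
  location: global
```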
Here is a comprehensive list of all possible GCP resources available via Config Connector.
Global Namespace in GKE
Finally, using Config Connector we set up the `global` namespace (see below) and the following GCP resources in it (shared/global.yaml):

- a Cloud Armor Security Policy, which firewalls all Ingress endpoints to the Broad CIDRs
- a Cloud DNS Managed Zone for `dsp-appsec.broadinstitute.org.`
- a GCE Resource Policy for disk snapshots (see below)
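To illustrate the first item, such a Cloud Armor policy could be expressed via the `ComputeSecurityPolicy` resource type roughly as follows. This is only a sketch: the names, priorities, and CIDR are placeholders (not the actual Broad ranges), and the real shared/global.yaml may be structured differently:

```yaml
# Hypothetical sketch of a Cloud Armor policy that allows only trusted CIDRs.
apiVersion: compute.cnrm.cloud.google.com/v1beta1
kind: ComputeSecurityPolicy
metadata:
  name: example-security-policy
  namespace: global
spec:
  rule:
  - action: allow
    priority: 1000
    description: Allow traffic from trusted CIDRs
    match:
      versionedExpr: SRC_IPS_V1
      config:
        srcIpRanges:
        - 203.0.113.0/24          # placeholder CIDR
  - action: deny(403)
    priority: 2147483647          # default rule: deny everything else
    match:
      versionedExpr: SRC_IPS_V1
      config:
        srcIpRanges:
        - "*"
```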
Apps
Next, Cloud Build applies both app-specific and some shared Kubernetes templates to deploy each app.
Some apps require patches to work properly with Docker/Kubernetes, so we build patched images and push them to GCR along the way. We use Kaniko Docker builder to speed up the builds through caching.
Each Kubernetes deployment is done via an associated `deploy.sh` script (e.g. one for CodeDx).

A quick overview of what that script should do:
Create a Kubernetes namespace for the app. Namespaces provide mainly logical/deployment separation, but also some security boundaries for the apps (notably, while Network Policy is not enabled by default, it can be configured to accept internal cluster traffic only from specific namespaces).
Namespaces are also very convenient for avoiding "spillover" of resources between the apps, as one namespace can be created/destroyed completely without affecting any others. This is particularly useful during the development of each app.
We use the shared/kube-apply.py script to replace the `PROJECT_ID` and `NAMESPACE` environment variables in the namespace.yaml template with app-specific values, and then apply it. This pattern is used throughout other resource deployments later as well.
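Assuming envsubst-style `${VAR}` placeholders (the actual namespace.yaml may differ), such a template might look roughly like this:

```yaml
# Hypothetical sketch of a namespace template whose placeholders are
# substituted by shared/kube-apply.py before applying.
apiVersion: v1
kind: Namespace
metadata:
  name: ${NAMESPACE}
  annotations:
    # Tells Config Connector which GCP project owns resources in this namespace.
    cnrm.cloud.google.com/project-id: ${PROJECT_ID}
```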
Generate Kubernetes secret(s) used by the app, unless they're "external" (e.g. a Slack token). This is done via the shared/gen-secret.sh script, which currently uses `/dev/urandom` to generate a random sequence of alphanumeric characters. However, it may be better to use a crypto library in the future instead.
The secrets are generated only once, when they don't exist yet; otherwise, the script doesn't overwrite them. This provides a simple way to rotate the secrets, should we need that, by just removing them in Kubernetes and re-running the latest build in Cloud Build.
For the manually created secrets, we'll provide a sample template to deploy them from your local shell.
Set up a Kubernetes volume for the app, if it stores some of its state on disk (e.g. DefectDojo and CodeDx).
This is done via the shared/volume.sh script, which:
- declares a regional GCE disk (which can be accessed from any of the 2 zones in the cluster, while also providing regional replication)
- sets up daily snapshots of disk content for disaster recovery
- waits for the disk to be created on GCE
- sets up Kubernetes PersistentVolume and PersistentVolumeClaim, where the latter can be used to associate the volume with a Pod.
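For that last point, the Kubernetes side of the volume might look roughly like the sketch below, assuming the regional GCE disk already exists (e.g. created by the script beforehand). The names and sizes are hypothetical, and the real volume.sh may configure this differently (for example via the PD CSI driver or zone affinity for regional disks):

```yaml
# Hypothetical sketch of a pre-provisioned PersistentVolume and its claim.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: example-app-data
spec:
  capacity:
    storage: 50Gi
  accessModes:
  - ReadWriteOnce
  gcePersistentDisk:
    pdName: example-app-data       # existing GCE disk
    fsType: ext4
  claimRef:                        # pre-bind to the claim below
    name: example-app-data
    namespace: example-app
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-app-data
  namespace: example-app
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: ""             # bind only to the pre-provisioned PV above
  resources:
    requests:
      storage: 50Gi
```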
If the app needs access to GCP services (e.g. Cloud SQL), apply the shared/service-account.yaml template to:
- create a Kubernetes Service Account (KSA) for its Pod
- create a Google Service Account (GSA) that will have access to GCP resources
- bind these 2 accounts via a Kubernetes annotation and the `iam.workloadIdentityUser` role.
Note that `service-account.yaml` is generic and doesn't grant access to GCP resources, so after applying it, the actual role binding has to be done via an app-specific template (but at least, that step is simple).
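To make the pattern concrete, the resulting resources could look roughly like the sketch below; all names are hypothetical, `PROJECT_ID` is a placeholder, and the actual service-account.yaml template may differ:

```yaml
# Hypothetical sketch of a KSA/GSA pair bound via Workload Identity.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: example-app                # KSA used by the app's Pod
  namespace: example-app
  annotations:
    # Points the KSA at the GSA it is allowed to impersonate.
    iam.gke.io/gcp-service-account: example-app@PROJECT_ID.iam.gserviceaccount.com
---
apiVersion: iam.cnrm.cloud.google.com/v1beta1
kind: IAMServiceAccount
metadata:
  name: example-app                # GSA that will be granted GCP roles
  namespace: example-app
spec:
  displayName: Example app service account
---
apiVersion: iam.cnrm.cloud.google.com/v1beta1
kind: IAMPolicyMember
metadata:
  name: example-app-workload-identity
  namespace: example-app
spec:
  # Lets the KSA act as the GSA (the Workload Identity binding).
  member: "serviceAccount:PROJECT_ID.svc.id.goog[example-app/example-app]"
  role: roles/iam.workloadIdentityUser
  resourceRef:
    kind: IAMServiceAccount
    name: example-app
```

An app-specific grant (e.g. `roles/cloudsql.client` for Cloud SQL) can then be expressed with another `IAMPolicyMember` that targets the project or the specific resource.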
Apply the app-specific template, which creates either a Deployment (for stateless apps) or a StatefulSet (if the app uses disk).
We also deploy any app-specific GCP resources (e.g. `SQLInstance`, `SQLUser`, etc.) via Config Connector resource types, and create a role binding for the service account (if one is in use), e.g. `roles/cloudsql.client` for Cloud SQL.
Finally, it's recommended to set `runtimeClassName: gvisor` for the Pod, to enable GKE Sandbox around it. However, some apps (e.g. DefectDojo) may fail to work properly with it. In that case, the app can still be deployed without it, but will need an extra configuration snippet for `nodeSelector` and `tolerations` (e.g. see how this is done in defectdojo/deployment.yaml).
For the next steps, this template should also expose a `containerPort` and a `readinessProbe`, both of which are used to reach the app's HTTP endpoint from Ingress and to mark it as healthy. Take a look, for example, at how all of this is done in codedx/deployment.yaml.
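As a rough sketch (not a copy of codedx/deployment.yaml), the relevant parts of such a template might look like this; the image, port, and probe path are hypothetical:

```yaml
# Hypothetical excerpt from an app's Deployment/StatefulSet Pod template.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app
  namespace: example-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      serviceAccountName: example-app     # KSA bound via Workload Identity
      runtimeClassName: gvisor            # enables GKE Sandbox for this Pod
      containers:
      - name: example-app
        image: gcr.io/PROJECT_ID/example-app:latest
        ports:
        - containerPort: 8080             # exposed to the Service/Ingress
        readinessProbe:
          httpGet:
            path: /health                 # hypothetical health endpoint
            port: 8080
```

If GKE Sandbox can't be used for a given app, the `runtimeClassName` line would be dropped and a `nodeSelector`/`tolerations` snippet added instead, as noted above.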
Call the shared/host.sh script, which:
- declares a global static IP for the service
- waits for the IP to be created
- sets up its DNS hostname record in the Managed Zone
- waits for DNS propagation by repeated host resolution
- deploys GKE Managed Certificate for the hostname
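The last item corresponds to a GKE `ManagedCertificate` resource. A minimal sketch, with a hypothetical hostname and assuming a recent API version, could look like:

```yaml
# Hypothetical sketch of a GKE managed certificate for the app's hostname.
apiVersion: networking.gke.io/v1
kind: ManagedCertificate
metadata:
  name: example-app
  namespace: example-app
spec:
  domains:
  - example-app.dsp-appsec.broadinstitute.org
```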
Finally, deploy shared/ingress.yaml, which sets up:

- a Backend Config for Cloud Armor
- a Kubernetes Service, which:
  - exposes internal port(s) to the load balancer
  - applies the Backend Config to those ports
  - sets up container-native load balancing via a Network Endpoint Group
- a GKE Ingress, which ties together all of the above using:
  - IP address binding
  - disallowing raw HTTP (to keep only HTTPS)
  - Managed Certificate binding
  - DNS hostname mapping via the Host header of the request
  - URL path mapping for each Service port
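Put together, a minimal sketch of these three pieces might look as follows. The annotations are standard GKE ones, while the names, port numbers, hostname, and API versions are assumptions rather than a copy of the actual ingress.yaml:

```yaml
# Hypothetical sketch of the Backend Config, Service, and Ingress trio.
apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
  name: example-app
  namespace: example-app
spec:
  securityPolicy:
    name: example-security-policy          # Cloud Armor policy from the global setup
---
apiVersion: v1
kind: Service
metadata:
  name: example-app
  namespace: example-app
  annotations:
    cloud.google.com/neg: '{"ingress": true}'                    # container-native LB via NEG
    cloud.google.com/backend-config: '{"default": "example-app"}'
spec:
  type: ClusterIP
  selector:
    app: example-app
  ports:
  - port: 80
    targetPort: 8080
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-app
  namespace: example-app
  annotations:
    kubernetes.io/ingress.global-static-ip-name: example-app-ip
    kubernetes.io/ingress.allow-http: "false"                    # HTTPS only
    networking.gke.io/managed-certificates: example-app
spec:
  rules:
  - host: example-app.dsp-appsec.broadinstitute.org
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: example-app
            port:
              number: 80
```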
Please note that `ingress.yaml` may need to be adjusted in an app-specific way (e.g. for Sdarq, which exposes multiple internal paths/ports and could use Cloud CDN in its Backend Config), in which case a customized copy of `ingress.yaml` would need to be referenced from `deploy.sh`.
Questions
appsec@broadinstitute.org