Slow CI performance on the GKE runners seems to cause python:tox tests to fail
👓 What did you see?
I've been debugging a failing job in uis/devops/lib/ucam-faas-python, and it seems to have been caused by a performance degradation on the GKE runners.
The failing python:tox job runs two service containers that the Python unit tests interact with. The tests were failing because they could not connect to the containers: https://gitlab.developers.cam.ac.uk/uis/devops/lib/ucam-faas-python/-/jobs/4030508
I've found that by configuring the services with a HEALTHCHECK_TCP_PORT variable, the Kubernetes executor delays starting the job until the services are listening on their ports, which allows the tests to succeed: https://gitlab.developers.cam.ac.uk/uis/devops/lib/ucam-faas-python/-/jobs/4047758
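
For reference, the service configuration looks roughly like the sketch below. The image, alias, command and port are illustrative assumptions; the real definitions are in the project's .gitlab-ci.yml.

```yaml
# Sketch only: the image, alias, command and port are assumptions,
# not the project's actual configuration.
python:tox:
  services:
    - name: gcr.io/google.com/cloudsdktool/google-cloud-cli:emulators
      alias: pubsub-emulator
      command: ["gcloud", "beta", "emulators", "pubsub", "start", "--host-port=0.0.0.0:8085"]
      variables:
        # The Kubernetes executor waits for this TCP port to accept
        # connections before starting the build container.
        HEALTHCHECK_TCP_PORT: "8085"
```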
However, the tests now take over 4 minutes in some cases (4 run in parallel), whereas they took about 1 minute on the previous runner/executor: https://gitlab.developers.cam.ac.uk/uis/devops/lib/ucam-faas-python/-/jobs/3668192
If you retry one of the passing jobs and watch it, you can see that the pubsub-emulator service takes a long time to start, perhaps a minute or two. Locally it starts in a few seconds.
I wonder whether the resource constraints configured on the jobs are too tight, or whether there's resource contention with other jobs.
✅ What did you expect to see?
The job passing as it did previously.
💻 Where does this happen?
CI, running on a runner with the gke-devops-general tag.
🔬 How do I recreate this?
Re-run one of the failing CI jobs linked above.
📚 Any additional information?
- Originally posted in Teams: https://teams.microsoft.com/l/message/19:691ded9e24ef49d0bca8b1a6162e2ebf@thread.tacv2/1768219267501?tenantId=49a50445-bdfa-4b79-ade3-547b4f3986e9&groupId=8b9ab893-3917-42bb-ba20-6cbd4bd2d304&parentMessageId=1768219267501&teamName=UIS_DevOps&channelName=%F0%9F%91%A5%20Cloud%20Team&createdTime=1768219267501
- We can try adjusting the Kubernetes executor resource variables to increase the job/service resources, as sketched below:
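
  A minimal sketch, assuming the default requests are the bottleneck. The variable names are the documented Kubernetes executor overwrite variables, but the values below are guesses and the runner configuration must allow these overwrites (e.g. cpu_request_overwrite_max_allowed) for them to take effect.

  ```yaml
  # Sketch only: values are guesses and the runner must permit these
  # overwrites in its config for them to have any effect.
  python:tox:
    variables:
      # Resources for the build container.
      KUBERNETES_CPU_REQUEST: "2"
      KUBERNETES_MEMORY_REQUEST: "2Gi"
      # Resources for the service containers (pubsub-emulator etc.).
      KUBERNETES_SERVICE_CPU_REQUEST: "1"
      KUBERNETES_SERVICE_MEMORY_REQUEST: "1Gi"
  ```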