
Slow CI performance on GKE runner seems to cause python:tox tests to fail

👓 What did you see?

I've been debugging a failing job in uis/devops/lib/ucam-faas-python and it seems to have been caused by a performance degradation on the GKE runners.

The failing python:tox job runs two service containers that the Python unit tests interact with. The tests were failing because they could not connect to the containers: https://gitlab.developers.cam.ac.uk/uis/devops/lib/ucam-faas-python/-/jobs/4030508

I've found that configuring each service with a HEALTHCHECK_TCP_PORT variable makes the Kubernetes executor delay the start of the job until the services are listening on their ports, and this allows the tests to succeed: https://gitlab.developers.cam.ac.uk/uis/devops/lib/ucam-faas-python/-/jobs/4047758
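
For reference, the change amounts to roughly the following. This is a minimal sketch assuming GitLab 14.5+ service-level variables; the image name, alias and port are illustrative placeholders rather than the project's actual values:

```yaml
python:tox:
  services:
    # Hypothetical emulator image and port, for illustration only.
    - name: registry.example.com/pubsub-emulator:latest
      alias: pubsub-emulator
      variables:
        # The runner probes this TCP port and delays the job until the
        # service is accepting connections.
        HEALTHCHECK_TCP_PORT: "8085"
```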

However, the tests now take over 4 minutes in some cases (4 jobs run in parallel), whereas they took about 1 minute on the previous runner/executor: https://gitlab.developers.cam.ac.uk/uis/devops/lib/ucam-faas-python/-/jobs/3668192

If you retry one of the passing jobs and watch it, you can see that the pubsub-emulator service takes a long time to start, perhaps a minute or two. Locally it starts in a few seconds.

I wonder whether the resource constraints configured on the jobs are too tight, or whether there is resource contention with other jobs?
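
If tight resource requests are the cause, one way to test that theory would be to overwrite the executor's resource requests from the job itself, assuming the runner configuration permits these overwrites. A hypothetical sketch with placeholder values:

```yaml
python:tox:
  variables:
    # Kubernetes executor overwrite variables; these only take effect if the
    # runner's *_overwrite_max_allowed settings allow it. Values below are
    # placeholders for experimentation, not recommendations.
    KUBERNETES_CPU_REQUEST: "1"
    KUBERNETES_MEMORY_REQUEST: "1Gi"
    KUBERNETES_SERVICE_CPU_REQUEST: "500m"
    KUBERNETES_SERVICE_MEMORY_REQUEST: "512Mi"
```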

What did you expect to see?

The job passing in roughly the time it took previously.

💻 Where does this happen?

CI, running on a runner with the gke-devops-general tag.

🔬 How do I recreate this?

Re-run one of the failing CI jobs linked above.

📚 Any additional information?