Slow CI performance on the GKE runners seems to cause python:tox tests to fail
👓 What did you see?
I've been debugging a failing job in uis/devops/lib/ucam-faas-python, and it seems to have been caused by a performance degradation on the GKE runners.
The failing python:tox job runs two service containers that the Python unit tests interact with. The tests were failing because they could not connect to the containers: https://gitlab.developers.cam.ac.uk/uis/devops/lib/ucam-faas-python/-/jobs/4030508
I've found that by configuring the services with a HEALTHCHECK_TCP_PORT variable, the Kubernetes executor delays starting the job until the services are listening on their ports, which allows the tests to succeed: https://gitlab.developers.cam.ac.uk/uis/devops/lib/ucam-faas-python/-/jobs/4047758
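
For reference, the service configuration looks roughly like the sketch below. The image, alias, command and port are illustrative assumptions; the real definitions are in the project's .gitlab-ci.yml.

```yaml
# Sketch only: the image, alias, command and port are assumptions,
# not the project's actual configuration.
python:tox:
  services:
    - name: gcr.io/google.com/cloudsdktool/google-cloud-cli:emulators
      alias: pubsub-emulator
      command: ["gcloud", "beta", "emulators", "pubsub", "start", "--host-port=0.0.0.0:8085"]
      variables:
        # The Kubernetes executor waits for this TCP port to accept
        # connections before starting the build container.
        HEALTHCHECK_TCP_PORT: "8085"
```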
However, the tests now take over 4 minutes in some cases (4 run in parallel), whereas they took about 1 minute on the previous runner/executor: https://gitlab.developers.cam.ac.uk/uis/devops/lib/ucam-faas-python/-/jobs/3668192
If you retry one of the passing jobs and watch it, you can see that the pubsub-emulator service takes a long time to start, perhaps a minute or two. Locally it starts in a few seconds.
I wonder whether the resource constraints configured on the jobs are too tight, or whether there's resource contention with other jobs.
✅ What did you expect to see?
The job passing as it did previously.
💻 Where does this happen?
CI, running on a runner with the gke-devops-general tag.
🔬 How do I recreate this?
Re-run one of the failing CI jobs linked above.
📚 Any additional information?
- Originally posted in Teams: https://teams.microsoft.com/l/message/19:691ded9e24ef49d0bca8b1a6162e2ebf@thread.tacv2/1768219267501?tenantId=49a50445-bdfa-4b79-ade3-547b4f3986e9&groupId=8b9ab893-3917-42bb-ba20-6cbd4bd2d304&parentMessageId=1768219267501&teamName=UIS_DevOps&channelName=%F0%9F%91%A5%20Cloud%20Team&createdTime=1768219267501
- We can try adjusting the Kubernetes executor resource variables to increase the job/service resources, as sketched below:
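
  A minimal sketch, assuming the default requests are the bottleneck. The variable names are the documented Kubernetes executor overwrite variables, but the values below are guesses and the runner configuration must allow these overwrites (e.g. cpu_request_overwrite_max_allowed) for them to take effect.

  ```yaml
  # Sketch only: values are guesses and the runner must permit these
  # overwrites in its config for them to have any effect.
  python:tox:
    variables:
      # Resources for the build container.
      KUBERNETES_CPU_REQUEST: "2"
      KUBERNETES_MEMORY_REQUEST: "2Gi"
      # Resources for the service containers (pubsub-emulator etc.).
      KUBERNETES_SERVICE_CPU_REQUEST: "1"
      KUBERNETES_SERVICE_MEMORY_REQUEST: "1Gi"
  ```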