- Jun 13, 2024
Dr Catherine Pitt authored
The cronjob script attempts one retry if jobs fail, as we sometimes see problems on the dom0s where they time out gathering facts. However, if some of the failed hosts have genuinely failed, this can cause trouble: when the retry run reaches a play with only one applicable host in the retried set and fails on that host (quite likely once we are into the specialist plays), ansible terminates the whole playbook, because 100% of the hosts in that play failed. The results are then confusing: other hosts that needed a retry but had not yet reached their failing play, probably because it comes much later in the playbook, may report green rather than red, as they had not hit a changed or failing task when the playbook ended. The ansible-site test stays green unless the ansible-xymon2027 machine happens to be one of the failing hosts, which is unlikely. This updates the cronjob to retry the failed hosts one at a time.
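A minimal sketch of the one-at-a-time retry, assuming ansible is configured to write failed hosts to a site.retry file (the real script's file names and flags may differ):

    # Hypothetical: give each failed host its own ansible-playbook run, so
    # a play failing on 100% of one host aborts only that host's run.
    if [ -s site.retry ]; then
        while IFS= read -r host; do
            ansible-playbook site.yml --limit "$host"
        done < site.retry
    fi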
-
- Dec 02, 2022
Dr Adam Thorn authored
This is already in the cron script; having it in oneoff also seems sensible and useful. It's needed on jammy and I don't know why it wasn't needed in the bionic instance...
-
Dr Adam Thorn authored
-
- Nov 29, 2022
Dr Adam Thorn authored
-
- Feb 07, 2022
Dr Adam Thorn authored
-
Dr Adam Thorn authored
The scripts make a point of exporting ANSIBLE_INVENTORY, which can then optionally be overridden when sourcing our config file. Except that it was being unconditionally overridden anyway, because we had hard-coded "-i inventory.py" on the ansible-playbook command line. Being able to set an alternate ANSIBLE_INVENTORY is useful for testing changes to these scripts without necessarily using the real live inventory.
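A minimal sketch of the intended pattern, with illustrative file names (the config file here is hypothetical):

    # Export a default inventory, let the sourced config override it, and
    # pass no -i flag so ansible-playbook honours ANSIBLE_INVENTORY.
    export ANSIBLE_INVENTORY="${ANSIBLE_INVENTORY:-./inventory.py}"
    . ./config.sh   # hypothetical config file; may re-export the variable
    ansible-playbook site.yml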
-
Dr Adam Thorn authored
We're seeing some particular tasks fail with "Unable to execute ssh command line on a controller due to: [Errno 12] Cannot allocate memory", in particular the tasks which configure all the VLAN interfaces on our dom0s and/or fileservers. They always run to completion when subsequently re-run using the ./oneoff script, though. Whilst it would be good to understand why those tasks seem to trigger unusual memory usage, I've not made any headway in doing so. Let's make the cronjob script re-run the playbook on the failed hosts for us, to avoid spurious tickets.
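A minimal sketch of that bulk re-run, assuming retry files are enabled so failed hosts land in site.retry (the cronjob's actual names may differ):

    # Hypothetical: if the main run fails, re-run once limited to the
    # hosts recorded in the retry file before raising any tickets.
    ansible-playbook site.yml || \
        ansible-playbook site.yml --limit @site.retry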
-
Dr Adam Thorn authored
NB: the change was made and deployed a couple of weeks ago, but I seem to have failed to commit it. See also e.g. https://gitlab.developers.cam.ac.uk/ch/co/ansibleconf/-/commit/5e06fc9d597e2550458adf9e77281e71f71691ae
-
- Jan 09, 2022
Dr Adam Thorn authored
Needed as of d7d386d2
-
- Jan 06, 2022
Dr Adam Thorn authored
Hmm, if only we had some mechanism for putting the right things on PATH, PYTHONPATH etc. rather than having to faff with those environment variables manually, and then forgetting to update them when we switch the version of ansible we're using...
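The sarcasm presumably gestures at something like a virtualenv; a minimal sketch under that assumption (paths and version pin are illustrative):

    # Hypothetical: a venv pins the interpreter and ansible version, so
    # no script needs its PATH or PYTHONPATH hand-edited per upgrade.
    python3 -m venv /opt/ansible-venv
    /opt/ansible-venv/bin/pip install 'ansible-core==2.16.*'
    . /opt/ansible-venv/bin/activate
    ansible-playbook --version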
-
Dr Adam Thorn authored
These ceased to be useful when we switched over to a custom xymon-aware callback
-
Dr Adam Thorn authored
-
- Apr 26, 2021
Dr Catherine Pitt authored
As recommended by @ajh221, eval the output of ssh-agent -k when killing the agent. Although it works without, this is tidier: the commands ssh-agent -k prints also unset SSH_AUTH_SOCK and SSH_AGENT_PID in the calling shell.
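A one-line sketch of the tidier form (the surrounding script is assumed):

    # Evaluating the output of ssh-agent -k kills the agent and unsets
    # SSH_AUTH_SOCK and SSH_AGENT_PID in the caller's environment.
    eval "$(ssh-agent -k)"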
-
Dr Catherine Pitt authored
Our ansibleconf repository has some tasks which require sshing from one client machine to another. This fails when run by ansible-xymon because the ssh key isn't available to the client machines. This change starts up an ssh-agent and loads the key into it before running ansible so that this type of task works. The agent is killed at the end.
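A minimal sketch of that lifecycle, assuming agent forwarding is what makes the key reachable from the client machines (key path and flags are illustrative):

    # Hypothetical outline: start an agent, load the key, run the playbook
    # with forwarding so client-to-client ssh can use it, then kill it.
    eval "$(ssh-agent -s)"
    ssh-add /etc/ansible/keys/id_xymon   # hypothetical key path
    ansible-playbook site.yml --ssh-extra-args='-o ForwardAgent=yes'
    eval "$(ssh-agent -k)"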
-
- Feb 12, 2021
Dr Catherine Pitt authored
Have extended the xymon callback plugin and site.yml to handle this instead, see https://gitlab.developers.cam.ac.uk/ch/co/ansibleconf/-/commit/3aa092bc8b9c955172175ce494c7646fd50bea50
-
- Feb 10, 2021
Dr Catherine Pitt authored
-
Dr Catherine Pitt authored
-
Dr Catherine Pitt authored
The live server was running 1.0-ch18 but there's no 1.0-ch18 or 1.0-ch17 tag in the repo.
-
Dr Catherine Pitt authored
Otherwise we may not notice when the playbook starts to fail halfway through.
-
- Nov 16, 2020
Dr Adam Thorn authored
-
- May 11, 2020
A.J. Hall authored
-
- Apr 18, 2019
Dr Adam Thorn authored
-
- Apr 17, 2019
Dr Adam Thorn authored
-
- Mar 21, 2019
Dr Adam Thorn authored
-
- Mar 20, 2019
Dr Adam Thorn authored
-
Dr Adam Thorn authored
-
- Nov 16, 2017
Dr Catherine Pitt authored
-
- Jun 06, 2017
Dr Catherine Pitt authored
This is so I don't keep leaving the main logfile root-owned when running oneoff, which then mucks up the main cron run, which runs as xymon-ansible.
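A minimal sketch of one way to restore ownership, with a hypothetical user/path (the commit doesn't show the actual fix):

    # Hypothetical: hand the logfile back to the cron user after a
    # root-run oneoff, so the xymon-ansible cron run can write to it.
    chown xymon-ansible:xymon-ansible /var/log/ansible/site.log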
-
- May 23, 2017
Dr Catherine Pitt authored
-
- May 22, 2017
Dr Catherine Pitt authored
A spurious '-l' parameter to ansible-playbook had been introduced when I tidied up.
-
- May 19, 2017
Dr Catherine Pitt authored
- Add args check to oneoff script (sketched below)
- Make scripts slightly more configurable
- Use -f on rm commands
- Ensure permissions are correct on homedir
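A minimal sketch of such an args check (usage text is illustrative):

    # Hypothetical: refuse to run oneoff without an explicit target,
    # rather than silently acting on the whole inventory.
    if [ $# -lt 1 ]; then
        echo "usage: $0 <host-or-group> [ansible-playbook args...]" >&2
        exit 1
    fi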
-
Dr Catherine Pitt authored
-