- Oct 23, 2024
-
-
Dr Catherine Pitt authored
-
- Sep 25, 2024
-
-
Dr Adam Thorn authored
Hmm. My newly-created backup was failing due to this dir missing, but it surely isn't the first backup that has been added when we've had the current version of the prepare script in place. Yet I see us calling mkdir in the collection of old prepare scripts, and there's no attempt to create it in the script that creates new backups ....
-
- Aug 21, 2024
-
-
Dr Adam Thorn authored
We had been looking for the string 'none', but we deliberately run a zfs command which returns a parseable numeric value, so we get the value 0 when no quota is set. We'd been doing this properly on thisquota but not parentquota.
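A minimal sketch of the check described in this message, written in Python for illustration rather than taken from the repository's scripts. It assumes the quota was read in parseable form (for example with something like `zfs get -Hp -o value quota <dataset>`), so an unset quota arrives as `0`. The function name `quota_is_set` is hypothetical.

```python
def quota_is_set(raw_value):
    """Return True if a parseable quota value represents a real limit.

    Assumption: the value comes from a zfs command run in parseable mode,
    so "no quota" is reported as 0 rather than the string 'none'.
    """
    value = raw_value.strip()
    # Anything non-numeric (including a stray 'none') is treated as "not set".
    return value.isdigit() and int(value) > 0
```

The same test would apply to both thisquota and parentquota, which is the point of the fix.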
-
- Jul 10, 2024
-
-
Owen Johnson authored
-
- Jun 10, 2024
-
-
Dr Adam Thorn authored
We were referencing a db field that doesn't exist.
-
- May 24, 2024
-
-
Dr. Frank Lee authored
-
Dr Catherine Pitt authored
Closes #6. When running psql commands to insert rows in the database, psql normally returns a message about what it did, e.g. "INSERT 0 1" if it inserted a row. This can be suppressed with -q. Several of the scripts use psql commands to get primary keys from the database, inserting the row if necessary. This can lead to the host id variable in the script being set to 'INSERT 0 1 <thehostid>', which causes problems when this variable is used in other SQL commands. This always used to work; I suspect the thing that changed is our upgrading to Postgres 16 on the backup servers, but I'm struggling to see how, as Postgres 13 seems to behave the same for me.
-
- May 07, 2024
-
-
Dr. Frank Lee authored
-
- May 01, 2024
-
-
Dr. Frank Lee authored
-
- Oct 23, 2023
-
-
Dr Adam Thorn authored
We had ended up, somehow, with a few hosts on one backup server which appeared twice in `host` - one with disabled=f and some backup tasks as expected, and one with disabled=t. I manually (necessarily) deleted the latter before adding this constraint on the live servers. (all servers listed in zfs_backup_server.conf have had the db table manually updated)
-
Dr Adam Thorn authored
I think the intention here was perhaps:
  - create new host, marked as disabled
  - finish setting up backups
  - once done, mark host as enabled
...except the script runs as "set -e", so if something goes wrong we just never get as far as enabling the host, which means not only do no backups run but no failure reports get sent to xymon so we don't even notice the failure. This is not good.
Given the entry of a row in the `host` table doesn't do much in and of itself, I see no reason why we shouldn't just mark the host as initially enabled. We won't try to actually perform a backup until a `backup_task` has been created. Perhaps this leads to a brief transient behaviour where xymon reports a backup as failing whilst the script is still running - but OTOH the xymon report for a new machine will always be red for "a while" until the first backup has actually run OK.
-
Dr Adam Thorn authored
We don't need to record the ssh host key in most cases given that we generally deploy signed ssh host keys, but I suspect we might have the occasional backup target where that doesn't apply (e.g. clusters?) Regardless, if we can't scan the host key the right behaviour is for the script to continue on and set up the backup. If the backup then fails due to the absent host key, we will be alerted and take suitable action. Right now, the failure mechanism is that we silently don't finish setting up the backup, the backup never gets enabled, and we don't realise we don't have a backup - eek.
-
- Sep 27, 2023
-
-
Dr Catherine Pitt authored
The Xymon test that reports on backup status runs every 45 minutes. But the status of an individual backup does not change very frequently - we try to back most things up a few times a day. This change makes the individual backup statuses valid for three hours, rather than the one hour they were previously. This is to avoid getting purple dots when we reboot a backup server and interrupt the 45 minute check, which then won't run again for another 45 minutes, causing a 90 minute gap between reports for some hosts and hence purple dots.
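The arithmetic behind this change can be sketched as follows. This is an illustrative Python snippet, not code from the repository; the function name `goes_purple` and the idea of counting missed runs are assumptions made for the example.

```python
CHECK_INTERVAL_MIN = 45  # the Xymon backup-status test runs every 45 minutes


def goes_purple(validity_min, missed_runs=1):
    """Return True if a status would expire before the next report arrives.

    Missing one run (e.g. because of a reboot) leaves a gap of two
    intervals between consecutive reports.
    """
    gap_min = CHECK_INTERVAL_MIN * (missed_runs + 1)
    return gap_min > validity_min
```

With the old one-hour validity a single missed run (a 90 minute gap) outlives the status; with three hours it does not.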
-
- Sep 08, 2023
-
-
Dr Catherine Pitt authored
For machines like nest-backup and cerebro-backup we have lots of backup tasks for the same host spread across several zpools, so move-machine-to-zpool.sh can't be used to migrate the contents of a failing zpool/disk. This adds a script to move an individual ZFS which is the target of a backup task to another zpool. It assumes all necessary parent ZFSes already exist on the target. If they don't it fails. It does not yet clean up the old ZFS as it's not had a lot of use.
-
- Sep 01, 2023
-
-
Dr Catherine Pitt authored
We have started putting extra configuration for sshing to a host in a file in the /etc/chem-zfs-backup-server/zfs-rsync.d/$HOSTNAME directory. This updates the backup migration script to copy that as well as the main config file for the machine. I've chosen to copy the entire directory to catch other files we might want to add in future. There is often an 'exclude' file in there that's autogenerated by the prepare scripts, but copying that won't do any damage; it's just redundant because it will be regenerated when the backup runs.
-
- Aug 30, 2023
-
-
Dr Adam Thorn authored
1. The export is done via set sharenfs, which means we shouldn't need to manually manage exports.
2. This part of the script does not work: it tries to unexport the old export, but does so by looking up the db record that we have already updated to refer to the new zpool.
-
- Aug 09, 2023
-
-
Dr Catherine Pitt authored
This adds a new config file which allows setting the command to use for 'rsync' and global options for that command. This is motivated by the need to use an alternative rsync command on Jammy machines, as the system one is too slow. The option for global rsync arguments was added as a way to add the '--trust-sender' flag to all backups to turn off certain checks that we suspect to be the cause of the slowdown, but it didn't help enough to fix the speed problem. Instead we are going to use our own package of an older rsync from before the checking code was added, which of course doesn't support --trust-sender so the global args are left blank.
-
- Jul 27, 2023
-
-
Dr Adam Thorn authored
Custom options need to be a file passed via -F because we want to specify options for both ssh and scp. They don't have a compatible set of CLI options but both take -F. This supersedes 53f5ba49; I had only deployed SSHOPTIONS for one host, which I've updated. This also removes the SSHPORT option, which had only been used in the config for one host, which I've updated.
-
Dr Adam Thorn authored
These are static files provided by our package, not config files.
-
Dr Adam Thorn authored
These are not config files, and we should not be modifying the package-provided versions of these files. I'm leaving symlinks behind to make sure we don't break all our existing backups though!
-
- Jul 24, 2023
-
-
Dr Adam Thorn authored
This could/should probably supersede the specific option for SSHPORT as I think usage of that is minimal or perhaps even zero, but we'd have to check if that's in use and make suitable updates to config files before removing it.
-
- May 31, 2023
-
-
Dr Adam Thorn authored
-
- May 30, 2023
-
-
Dr Adam Thorn authored
We had been raising a failure report if we had never seen a successful backup for a host. However, when we have a host with more than one backup task, we can have the situation where one backup is working OK but the other has never completed correctly. This led to a green report as we had a non-zero number of rows, but we require number_of_good_backups == number_of_tasks!
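The difference between the old and new conditions can be sketched like this. It is a hedged Python illustration only; the real check lives in the repository's own scripts and queries, and the function names here are hypothetical.

```python
def old_check_is_green(number_of_tasks, number_of_good_backups):
    # Old behaviour as described: any successful backup at all counted as OK.
    return number_of_good_backups > 0


def new_check_is_green(number_of_tasks, number_of_good_backups):
    # New behaviour: every backup task must have a good backup.
    return number_of_good_backups == number_of_tasks
```

For a host with two tasks and only one good backup, the old check reports green while the new one does not.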
-
- Dec 19, 2022
-
-
Dr Adam Thorn authored
This can happen for a number of reasons:
  - child ZFS has a quota bigger than its parent
  - we have over-provisioned such that the sum of quotas is bigger than the disk
  - an individual ZFS has a quota bigger than the disk
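The three cases listed in this message could be tested along these lines. This is a rough Python sketch under stated assumptions, not the repository's implementation: all sizes are plain byte counts, a quota of 0 means "no quota", only one level of parent and children is considered, and the name `quota_problems` is made up for the example.

```python
def quota_problems(disk_bytes, parent_quota, child_quotas):
    """Return a list of human-readable problems for one parent ZFS.

    child_quotas maps a (hypothetical) child dataset name to its quota.
    """
    problems = []
    for name, quota in child_quotas.items():
        if quota > disk_bytes:
            problems.append(f"{name}: quota exceeds disk size")
        # A parent quota of 0 is taken to mean "no quota set".
        if parent_quota > 0 and quota > parent_quota:
            problems.append(f"{name}: quota exceeds parent quota")
    # Over-provisioning: the quotas add up to more than the disk holds.
    if sum(child_quotas.values()) > disk_bytes:
        problems.append("sum of quotas exceeds disk size")
    return problems
```

An empty list means none of the three conditions was detected for that parent.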
-
Dr Adam Thorn authored
-
Dr Adam Thorn authored
-
Dr Adam Thorn authored
..and convert such things to human-friendly versions if required. This is to facilitate extra checks where we want to make numeric calculations involving quotas and other similar properties.
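A conversion from a raw byte count to a human-friendly string might look like the following. This is an illustrative Python sketch only; the unit suffixes, the one-decimal formatting and the name `human_readable` are assumptions, not the format the repository's scripts actually produce.

```python
def human_readable(num_bytes):
    """Format a byte count using binary (1024-based) units."""
    units = ["B", "K", "M", "G", "T", "P"]
    value = float(num_bytes)
    for unit in units:
        # Stop at the first unit where the value fits, or at the largest unit.
        if value < 1024 or unit == units[-1]:
            return f"{value:.1f}{unit}"
        value /= 1024
```

Keeping the numeric value for calculations and converting only at the point of display avoids having to parse strings such as "1.5K" back into numbers.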
-
Dr Adam Thorn authored
-
- Mar 10, 2022
-
-
Dr Adam Thorn authored
i.e. we can now simply delete from host where hostname='example.ch.private.cam.ac.uk'; without having to chase the foreign keys. I've made the equivalent change on our live backup servers with an ad hoc script.
-
- Mar 09, 2022
-
-
Dr Adam Thorn authored
-
Dr Adam Thorn authored
i.e. send the ZFS, update db records and copy the config files. The sql inserts should broadly mirror those done when setting up a new backup, though with field values matching those in the source database rather than just using the defaults. This script has so far only been tested for the case of moving a "simple" backup where a host has a single backup task and no special config. It's quite likely there'll be bugs to fix for other cases that we'll find in due course.
-
Dr Adam Thorn authored
I'm about to add a script to send a backup to a different backup server. It's thus probably best if the script names describe their functions in a little more detail
-
- Dec 20, 2021
-
-
Dr Adam Thorn authored
-
- Nov 17, 2021
-
-
Dr Adam Thorn authored
-
Dr Adam Thorn authored
-
Dr Adam Thorn authored
https://tickets.ch.cam.ac.uk/rt/Ticket/Display.html?id=211460 e.g. RT 211460, 211465. spri-musuem-rt-2025 had been set up with an adhoc script which was not properly tidying up after itself due to bailing early on a "set -e" error.
-
- Jul 14, 2021
-
-
Dr Adam Thorn authored
This will let us use zfs_target as the name of a subtest which in turn means we would be able to separately log and graph multiple backup targets associated with a single host. This change does not affect the current parsing performed when we input data into postgres: it uses non-anchored regexps to identify SpaceUsed etc so prepending extra text won't change anything
-
- Jul 09, 2021
-
-
Dr Adam Thorn authored
In some versions of backup_queue (I think just on splot4 now), we use backup_log.isrunning as part of the logic to determine if a task should be enqueued. The problem is that scheduler.pl makes three writes on the table:
  1) an insert when the task is queued (a trigger sets isrunning='t' here)
  2) an update to set started_processing when the task begins (a trigger sets isrunning='f' here!!!!)
  3) an update to set ended_processing when the task finishes (a trigger again sets isrunning='f' here)
Thus, being careful to only set isrunning='f' when a backup task is finished (i.e. when we set ended_processing=now() in scheduler.pl) seems sensible, and empirically does seem to lead to the right backup_queue without duplicates.
This commit will only affect new setups of backup servers; the change has been deployed to live servers with an ad hoc script I've run.
I think we only see this on splot4 because it has a very different definition of the backup_queue view to a) the one defined in this file, b) the one that's on all the other backup servers. If I just try to replace the view on splot4, though, any attempt to select from it just times out so there may be other relations on splot4 that need updating too.
NB the obvious thing missing on splot4 is WHERE ((backup_log.backup_task_id = a.backup_task_id) AND (backup_log.ended_processing IS NULL))) < 1)) which feels like a hack but nonetheless ensures in practice that we don't get duplicate queued tasks.
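The effect of the trigger change on the three writes made by scheduler.pl can be modelled with a small simulation. This is a Python sketch for illustration only: the real logic is a Postgres trigger on backup_log, and the event names and the function `isrunning_after` are hypothetical.

```python
def isrunning_after(events, fixed):
    """Replay the writes on one backup_log row and return isrunning.

    events is a sequence of "queued", "started" and "ended";
    fixed=False models the old trigger, fixed=True the corrected one.
    """
    isrunning = None
    for event in events:
        if event == "queued":
            isrunning = True       # insert: trigger sets isrunning='t'
        elif event == "started":
            if not fixed:
                isrunning = False  # old trigger cleared the flag too early
        elif event == "ended":
            isrunning = False      # ended_processing set: task is finished
    return isrunning
```

Under the old behaviour a task that has started but not finished already looks idle, which is how it could be queued a second time.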
-
- Jul 08, 2021
-
-
Dr Adam Thorn authored
We don't always need the role data, if the presumption is that we'll be doing a pg_restore in conjunction with an ansible role which creates all required roles. But, having a copy of the role data will never hurt! It also gives us a straightforward way of restoring a database to a standalone postgres instance without having to have provisioned a dedicated VM with the relevant ansible roles.
-
Dr Adam Thorn authored
At present we use myriad one-off per host scripts to do a pg_dump, and they all do (or probably should do) the same thing. In combination with setting options in the host's backup config file, I think this single script covers all our routine pg backups.
-