FAQ | This is a LIVE service | Changelog

  1. Oct 23, 2024
  2. Sep 25, 2024
  3. Jul 10, 2024
  4. Jun 10, 2024
  5. May 24, 2024
    • Add -q flag to psql calls to suppress 'INSERT 0 1' etc · 59d01a57
      Dr Catherine Pitt authored
      Closes #6
      
      When running psql commands to insert rows in the database, psql normally
      returns a message about what it did, e.g. "INSERT 0 1" if it inserted a
      row. This can be suppressed with -q. Several of the scripts use psql
      commands to get primary keys from the database, inserting the row if
      necessary, so the host id variable in the script can end up set to
      'INSERT 0 1 <thehostid>', which causes problems when that variable is
      used in other SQL commands.
      
      This always used to work; I suspect the thing that changed is our
      upgrading to Postgres 16 on the backup servers, but I'm struggling to
      see how as Postgres 13 seems to behave the same for me.
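      In shell terms the pattern looks something like this (a sketch with hypothetical database, table and column names; the -q flag is the fix):

```
# Hypothetical fetch-or-insert of a primary key, as the scripts do.
# Without -q, psql also prints the command status ("INSERT 0 1") on
# stdout, so $HOSTID would capture "INSERT 0 1 <thehostid>".
HOSTID=$(psql -qtA -d backups \
    -c "INSERT INTO host (hostname) VALUES ('$TARGET') RETURNING host_id")
```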
  6. May 07, 2024
  7. May 01, 2024
  8. Oct 23, 2023
    • add unique hostname constraint to host table · b902dac5
      Dr Adam Thorn authored
      We had ended up, somehow, with a few hosts on one backup server
      which appeared twice in `host` - one with disabled=f and some backup
      tasks as expected, and one with disabled=t. I manually (necessarily)
      deleted the latter before adding this constraint on the live servers.
      
      (all servers listed in zfs_backup_server.conf have had the db table
      manually updated)
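      The constraint is along these lines (sketch; hypothetical constraint and column names):

```
# Run after the duplicate rows have been deleted by hand, as above.
psql -q -d backups \
    -c "ALTER TABLE host ADD CONSTRAINT host_hostname_key UNIQUE (hostname)"
```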
    • don't mark hosts as disabled at the point they get created · 594be148
      Dr Adam Thorn authored
      I think the intention here was perhaps:
      
      - create new host, marked as disabled
      - finish setting up backups
      - once done, mark host as enabled
      
      ...except the script runs with "set -e", so if something goes wrong
      we just never get as far as enabling the host, which means not only
      do no backups run, but no failure reports get sent to xymon, so we
      don't even notice the failure. This is not good.
      
      Given that a row in the `host` table doesn't do much in and
      of itself, I see no reason why we shouldn't just mark the host as
      initially enabled. We won't try to actually perform a backup until
      a `backup_task` has been created. Perhaps this leads to a brief
      transient behaviour where xymon reports a backup as failing whilst
      the script is still running - but OTOH the xymon report for a new
      machine will always be red for "a while" until the first backup
      has actually run OK.
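      So host creation now amounts to (sketch, hypothetical names; note disabled='f' from the start):

```
psql -q -d backups \
    -c "INSERT INTO host (hostname, disabled) VALUES ('$TARGET', 'f')"
```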
    • don't bail (due to set -e) if ssh-keyscan fails · 20de2933
      Dr Adam Thorn authored
      We don't need to record the ssh host key in most cases given that we
      generally deploy signed ssh host keys, but I suspect we might have the
      occasional backup target where that doesn't apply (e.g. clusters?)
      
      Regardless, if we can't scan the host key the right behaviour is for
      the script to continue on and set up the backup. If the backup then
      fails due to the absent host key, we will be alerted and take suitable
      action. Right now, the failure mechanism is that we silently don't finish
      setting up the backup, the backup never gets enabled, and we don't
      realise we don't have a backup - eek.
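      The change boils down to tolerating a failed scan (self-contained sketch; the target host and output file are hypothetical):

```shell
#!/bin/sh
set -e
TARGET=backup-target.example.org
# If the keyscan fails (no DNS, host down, no sshd) just carry on: a
# later backup failure will alert us, whereas bailing out here under
# 'set -e' is silent.
ssh-keyscan -T 2 -H "$TARGET" >> known_hosts.tmp 2>/dev/null || true
echo "continuing setup for $TARGET"
```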
  9. Sep 08, 2023
    • Add script for moving a backup task rather than a host · 43774528
      Dr Catherine Pitt authored
      For machines like nest-backup and cerebro-backup we have lots of backup
      tasks for the same host spread across several zpools, so
      move-machine-to-zpool.sh can't be used to migrate the contents of a
      failing zpool/disk. This adds a script to move an individual ZFS which is
      the target of a backup task to another zpool.
      
      It assumes all necessary parent ZFSes already exist on the target. If
      they don't it fails.
      
      It does not yet clean up the old ZFS as it's not had a lot of use.
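      The core of the move is the usual snapshot/send/receive dance (hedged sketch; pool and dataset names are hypothetical, and the script's bookkeeping around the backup_task row is omitted):

```
# Assumes newpool/backup/myhost already exists on the target zpool.
zfs snapshot -r oldpool/backup/myhost/data@move
zfs send -R oldpool/backup/myhost/data@move | \
    zfs recv newpool/backup/myhost/data
# The old ZFS is deliberately left in place for now (see above).
```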
  10. Sep 01, 2023
    • send-backup-to-server.sh copies additional config · 0f5bbdc6
      Dr Catherine Pitt authored
      We have started putting extra configuration for sshing to a host in a
      file in the /etc/chem-zfs-backup-server/zfs-rsync.d/$HOSTNAME directory.
      This updates the backup migration script to copy that as well as the
      main config file for the machine. I've chosen to copy the entire
      directory to catch other files we might want to add in future. There is
      often an 'exclude' file in there that's autogenerated by the prepare
      scripts, but copying that won't do any damage; it's just redundant
      because it will be regenerated when the backup runs.
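      The idea, mocked up locally (hypothetical paths; the real script copies to the destination backup server rather than a local dst/):

```shell
#!/bin/sh
set -e
# Stand-in for /etc/chem-zfs-backup-server/zfs-rsync.d/$HOSTNAME
mkdir -p src/zfs-rsync.d/myhost dst/zfs-rsync.d
printf 'extra ssh config\n' > src/zfs-rsync.d/myhost/ssh_config
printf 'autogenerated\n'    > src/zfs-rsync.d/myhost/exclude
# Copy the whole directory, not just the main config file, so files we
# add in future travel too; copying the autogenerated 'exclude' is
# harmless because it is regenerated when the backup runs.
cp -a src/zfs-rsync.d/myhost dst/zfs-rsync.d/
```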
  11. Aug 30, 2023
  12. Aug 09, 2023
    • Allow setting of global rsync command and rsync args · 46533d5a
      Dr Catherine Pitt authored
      This adds a new config file which allows setting the command to use for
      'rsync' and global options for that command. This is motivated by the
      need to use an alternative rsync command on Jammy machines, as the
      system one is too slow.
      
      The option for global rsync arguments was added as a way to add the
      '--trust-sender' flag to all backups to turn off certain checks that we
      suspect to be the cause of the slowdown, but it didn't help enough to
      fix the speed problem. Instead we are going to use our own package of an
      older rsync from before the checking code was added, which of course
      doesn't support --trust-sender so the global args are left blank.
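      The new file might look like this (hypothetical path, variable names and package location):

```
# /etc/chem-zfs-backup-server/rsync.conf
RSYNC=/opt/chem-rsync/bin/rsync   # our own older rsync on Jammy machines
RSYNCARGS=""                      # would have held --trust-sender
```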
  13. Jul 27, 2023
    • Add ability to have a per-host ssh config file · 6fda699c
      Dr Adam Thorn authored
      Custom options need to go in a file passed via -F because we want
      to specify options for both ssh and scp. They don't have a compatible
      set of CLI options, but both take -F.
      
      This supersedes 53f5ba49; I had only deployed SSHOPTIONS for one host, which
      I've updated.
      
      This also removes the SSHPORT option, which had only been used in the config
      for one host which I've updated.
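      The per-host file can then carry any ssh_config options, shared by both tools (hypothetical file location, host and options):

```
# zfs-rsync.d/<host>/ssh_config, used as: ssh -F <file> / scp -F <file>
Host backup-target.example.org
    Port 2222
    User backup
```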
    • move zfs-rsync template config files out of /etc · 871511b2
      Dr Adam Thorn authored
      These are static files provided by our package, not config files.
    • move prepare scripts out of /etc · ced34c01
      Dr Adam Thorn authored
      These are not config files, and we should not be modifying the package-provided
      versions of these files. I'm leaving symlinks behind to make sure we don't break
      all our existing backups though!
  14. Jul 24, 2023
    • add option to specify SSHOPTIONS to rsync tasks · 53f5ba49
      Dr Adam Thorn authored
      This could/should probably supersede the specific option for SSHPORT
      as I think usage of that is minimal or perhaps even zero, but we'd
      have to check if that's in use and make suitable updates to config
      files before removing it.
  15. Mar 10, 2022
  16. Mar 09, 2022
  17. Dec 20, 2021
  18. Nov 17, 2021
  19. Jul 09, 2021
    • Partial fix for behaviour where we see multiple backups for one task running at once · 3941a9df
      Dr Adam Thorn authored
      In some versions of backup_queue (I think just on splot4 now), we use
      backup_log.isrunning as part of the logic to determine if a task should be
      enqueued. The problem is that scheduler.pl makes three writes to the table:
      
      1) an insert when the task is queued (a trigger sets isrunning='t' here)
      2) an update to set started_processing when the task begins (a trigger
         sets isrunning='f' here!!!!)
      3) an update to set ended_processing when the task finishes (a trigger
         again sets isrunning='f' here)
      
      Thus, being careful to only set isrunning='f' when a backup task is finished
      (i.e. when we set ended_processing=now() in scheduler.pl) seems sensible, and
      empirically does seem to lead to the right backup_queue without duplicates.
      
      This commit will only affect new setups of backup servers; the change has been
      deployed to live servers with an ad hoc script I've run.
      
      I think we only see this on splot4 because it has a very different definition of
      the backup_queue view to a) the one defined in this file, b) the one that's on
      all the other backup servers. If I just try to replace the view on splot4, though,
      any attempt to select from it just times out so there may be other relations on
      splot4 that need updating too.
      
      NB the obvious thing missing on splot4 is
      
      WHERE ((backup_log.backup_task_id = a.backup_task_id) AND (backup_log.ended_processing IS NULL))) < 1))
      
      which feels like a hack but nonetheless ensures in practice that we don't get
      duplicate queued tasks.
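      Concretely, the trigger behaviour we want is along these lines (hedged sketch; the real trigger and column names may differ):

```
-- Only clear isrunning when the task has actually finished, i.e. when
-- ended_processing is set - not when started_processing is set.
CREATE OR REPLACE FUNCTION backup_log_set_isrunning() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        NEW.isrunning := 't';
    ELSIF NEW.ended_processing IS NOT NULL THEN
        NEW.isrunning := 'f';
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;
```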
  20. Jul 08, 2021
  21. Jun 29, 2021
    • Add an outline script for moving a whole zpool · 7ae60f97
      Dr Catherine Pitt authored
      This came about because a disk has failed on nest-backup, which only has
      subdirectory backups of nest-filestore-0 and so move-machine.sh was not
      going to be helpful - it assumes all tasks for a machine are on the same
      zpool which isn't true there. In this case I did the move by hand, but
      have sketched out the steps in the script in the hope that next time we
      have to do this we'll do it by looking at the script and running bits by
      hand, then improve the script a bit, and continue until it's usable.
  22. Jun 18, 2021
  23. Jun 08, 2021
  24. May 12, 2021
    • Fix a bug in the move-machine script · 75db08dc
      Dr Catherine Pitt authored
      The generation of the command to unexport NFS filesystems could generate
      an invalid command. Leading spaces were not being stripped, and in cases
      where there is more than one backup target for a machine we need to
      unexport every target. Because we also had 'set -e' in operation at this
      point, the script would fail there and never clean up the moved ZFS. I
      don't mind if we fail to unexport; if that's subsequently a problem for
      removing the ZFS then the script will fail at that point.
      
      This change makes the script generate better exportfs -u commands and
      not exit if they fail.
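      A self-contained sketch of the generation pattern (mock query output, and echoing the commands instead of running exportfs):

```shell
#!/bin/sh
# Mock of the per-target query output, complete with the leading
# whitespace that used to end up in the generated command.
targets='  tank/backup/myhost
  tank/backup/myhost/scratch'
# 'read' strips the leading whitespace; one command per backup target.
cmds=$(echo "$targets" | while read -r t; do
    echo "exportfs -u client.example.org:/$t"
done)
echo "$cmds"
```

      In the real script each generated command is then allowed to fail without aborting the run (so 'set -e' can't kill the script there); a genuine problem will surface later if the ZFS can't be removed.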
  25. Apr 30, 2021
    • Make database connections short-lived · e40c1a55
      Dr Catherine Pitt authored
      The code used to open a database connection for each thread and leave
      them open for as long as the scheduler ran. This worked reasonably well
      until we moved to PostgreSQL 13 on Focal, although the scheduler would
      fail if the database was restarted because there was no logic to
      reconnect after a connection dropped.
      
      On Focal/PG13 the connection for the 'cron' thread steadily consumes
      memory until it has exhausted everything in the machine. This appears to
      be a Postgres change rather than a Perl DBI change: the problem can be
      reproduced by sitting in psql and running 'select * from backup_queue'
      repeatedly. Once or twice a minute an instance of this query will cause
      the connection to consume another MB of RAM which is not released until
      the database connection is closed. The cron thread runs that query every
      two seconds. My guess is it's something peculiar about the view that
      query selects from - the time interval thing is interesting.
      This needs more investigation.
      
      But in the meantime I'd like to have backup servers that don't endlessly
      gobble RAM, so this change makes the threads connect to the database
      only when they need to, and closes the connection afterwards. This
      should also make things work better over database restarts but that's
      not been carefully tested.
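      The leak can be reproduced without the scheduler (sketch, hypothetical database name): issue the query many times over one long-lived connection, as the cron thread does, and watch the backend's memory on Focal/PG13.

```
# 1000 iterations of the cron thread's query down a single connection.
yes 'SELECT * FROM backup_queue;' | head -n 1000 | psql -qAt -d backups >/dev/null
```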
  26. Dec 11, 2020
  27. Oct 06, 2020