- May 03, 2023
Dr Adam Thorn authored
tempfile is a Debian-ism, and jammy warns us that its use is deprecated. mktemp has been available on all our Debian/Ubuntu machines for a long time via coreutils (e.g. it was definitely in wheezy and trusty, and I'm pretty sure since before then too).
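
A minimal sketch of the swap (assuming the script only needs a throwaway file that it cleans up itself):

    # Old, Debian-specific and now deprecated:
    #   TMPFILE=$(tempfile)
    # Portable coreutils equivalent:
    TMPFILE=$(mktemp)                     # or e.g. mktemp -p /var/tmp backup.XXXXXX
    trap 'rm -f "$TMPFILE"' EXIT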
-
- Jan 05, 2023
Dr Catherine Pitt authored
The incremental backups on some cluster head nodes are growing quite large, and most of the churn is the uncompressed MySQL dumpfile which changes with every backup and can be over 1GB. This commit compresses that data, which reduces the size of the file by 90% on at least one machine.
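
A hedged sketch of the idea (the output path and dump options in the real prepare script may differ):

    # Pipe the dump straight through gzip so the large, fast-changing
    # plain-text dump never sits on disk uncompressed.
    mysqldump --all-databases --single-transaction \
        | gzip > /var/backups/mysql-all-databases.sql.gz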
-
- Dec 19, 2022
Dr Adam Thorn authored
This can happen for a number of reasons:
- a child ZFS has a quota bigger than its parent
- we have over-provisioned such that the sum of quotas is bigger than the disk
- an individual ZFS has a quota bigger than the disk
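
A rough sketch of the kind of check this adds, using machine-readable quota values (the pool/dataset names are placeholders, and the real check walks every dataset):

    pool=backup
    parent=backup/somehost
    pool_size=$(zpool list -Hp -o size "$pool")
    parent_quota=$(zfs get -Hp -o value quota "$parent")
    for child in $(zfs list -Hr -o name "$parent" | tail -n +2); do
        child_quota=$(zfs get -Hp -o value quota "$child")
        # NB a quota of 0 means "none" and needs special-casing in the real
        # check; the over-provisioning check similarly sums the child quotas
        # and compares the total against $pool_size.
        if [ "$child_quota" -gt "$parent_quota" ] || [ "$child_quota" -gt "$pool_size" ]; then
            echo "WARNING: quota on $child exceeds its parent's quota or the pool size"
        fi
    done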
-
Dr Adam Thorn authored
-
Dr Adam Thorn authored
-
Dr Adam Thorn authored
...and convert such things to human-friendly versions if required. This is to facilitate extra checks where we want to make numeric calculations involving quotas and other similar properties.
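
In practice that means doing the arithmetic on parseable byte values and only converting back at the end, roughly (dataset name is a placeholder):

    ds=backup/somehost
    quota_bytes=$(zfs get -Hp -o value quota "$ds")   # exact bytes, e.g. 107374182400
    used_bytes=$(zfs get -Hp -o value used "$ds")
    remaining=$(( quota_bytes - used_bytes ))         # numeric calculation on raw bytes
    numfmt --to=iec "$remaining"                      # human-friendly version, e.g. "42G"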
-
Dr Adam Thorn authored
-
- Oct 31, 2022
Dr Adam Thorn authored
-
- Oct 26, 2022
Dr Adam Thorn authored
-
- May 25, 2022
Dr Adam Thorn authored
As of b4b89ed3 we need the prepare script to exit zero on success. We only use the stdout from this command, not the return code, for determining if the target is a xen VM.
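
In other words the prepare script can keep a pattern like this (the detection command itself is a placeholder here, standing in for whatever the script already runs):

    # We only care what the check prints; don't let a non-zero exit status
    # propagate now that callers expect exit 0 on success.
    virt=$(xen_detection_command 2>/dev/null || true)   # placeholder command
    if echo "$virt" | grep -qi xen; then
        echo "target is a xen VM"
    fi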
-
- May 24, 2022
Dr Adam Thorn authored
This has been silently masking failures, which is a Bad Thing. The only errors I've spotted so far are ones hopefully fixed by ad91da90, which has the effect that we've been backing up more things than we needed (which is better than the opposite possibility, at least!)
-
Dr Adam Thorn authored
-
Dr Adam Thorn authored
c714a331 introduced an annoying bug here. If the filenames piped to xargs include an apostrophe, xargs will complain and stop. We have thus, in practice, been including all files that appear in the list after /usr/share/sounds/ubuntu/ringtones/Sam's Song.ogg !! Also, we've not been properly handling filenames with spaces in them, and perhaps other filenames too. We therefore null-terminate the filenames ourselves.

Because I want to get to the point where we can have a sensible return code from the prepare script, this commit also adds some set -e commands to try to ensure errors bubble up. This is a little tedious to achieve due to all the subshells in this script. For the same reason, we split off the "diff" command into a "set +e" block, because diff returns non-zero if differences are found and we do not consider that to be an error.
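
The pattern now looks roughly like this (the file-list variables and the command run over the list are illustrative):

    set -e
    # The list of candidate files arrives newline-separated; null-terminate
    # it ourselves so names like "Sam's Song.ogg" (apostrophes, spaces, ...)
    # can't derail xargs.
    tr '\n' '\0' < "$FILE_LIST" | xargs -0 -r stat --format='%n %Y' > "$NEW_STATE"

    # diff exits non-zero when the files differ; that is expected, not an
    # error, so keep it outside the set -e region.
    set +e
    diff "$OLD_STATE" "$NEW_STATE" > "$CHANGES"
    set -e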
-
Dr Adam Thorn authored
We think the intention of the old version of this block is "if mysql is running, dump the databases". The check has been buggy for a long, long time: it reads the contents of my.cnf ... which nowadays just has some !includedir directives. This led to setting SOCKET="" and, it turns out, [ -S "" ] returns true. The main intention of this part of the prepare script is to back up vaguely normal machines running a simple mysql database, such as small group webservers. We thus don't need to consider every eventuality.
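
For the simple case this block is aimed at, something along these lines is enough (a sketch, not the script's exact wording):

    # Ask the server directly whether it's up, rather than trying to parse
    # a socket path out of my.cnf:
    if mysqladmin ping >/dev/null 2>&1; then
        echo "mysql is running; dumping databases"
        # ...mysqldump | gzip as above...
    fi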
-
Dr Adam Thorn authored
-
Dr Adam Thorn authored
On servers we configure this via ansible. On workstations it is not having the desired effect because the stock logrotate.conf includes the line ... and so the simple grep thinks it's already configured! I'll add an ansible task in due course to actually configure this.
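
The eventual fix will presumably need an anchored check rather than a substring grep, e.g. (the directive shown is only a stand-in for the real setting):

    # A bare 'grep compress /etc/logrotate.conf' also matches the line that
    # is already present in the stock file, so anchor the check to the exact
    # uncommented directive we deploy. ("compress" is a stand-in here.)
    if ! grep -q '^compress$' /etc/logrotate.conf; then
        echo "logrotate.conf not yet configured the way we want"
    fi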
-
- Apr 25, 2022
Dr Adam Thorn authored
Now that we use pygrub, these dirs are populated by quite a lot of files that we don't want to back up but which are dynamically built by the kernel/related packages and so do not make it into the prepared "excludes" file.
-
- Mar 10, 2022
Dr Adam Thorn authored
i.e. we can now simply "delete from host where hostname='example.ch.private.cam.ac.uk';" without having to chase the foreign keys. I've made the equivalent change on our live backup servers with an ad hoc script.
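
The ad hoc change was presumably of this shape (the database, table and constraint names here are illustrative, not the real schema):

    # Re-point the child table's foreign key so deletes cascade from host.
    psql -d backups -c "
        ALTER TABLE backup_task
            DROP CONSTRAINT backup_task_host_id_fkey,
            ADD CONSTRAINT backup_task_host_id_fkey
                FOREIGN KEY (host_id) REFERENCES host (host_id)
                ON DELETE CASCADE;"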
-
- Mar 09, 2022
Dr Adam Thorn authored
-
Dr Adam Thorn authored
-
Dr Adam Thorn authored
i.e. send the ZFS, update db records and copy the config files. The sql inserts should broadly mirror those done when setting up a new backup, though with field values matching those in the source database rather than just using the defaults. This script has so far only been tested for the case of moving a "simple" backup where a host has a single backup task and no special config. It's quite likely there'll be bugs to fix for other cases that we'll find in due course.
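
In outline, the new script does something like the following (the dataset, destination and config paths are illustrative, and the SQL inserts are omitted):

    src_ds=backup/somehost.ch.private.cam.ac.uk      # placeholder dataset
    dest=other-backup-server.ch.private.cam.ac.uk    # placeholder destination

    # Send the ZFS (snapshot first so there is a consistent point to send)...
    zfs snapshot "${src_ds}@move"
    zfs send -R "${src_ds}@move" | ssh "$dest" zfs recv -F "$src_ds"

    # ...and copy the host's backup config files across.
    scp "/etc/backup-config/somehost"* "$dest:/etc/backup-config/"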
-
Dr Adam Thorn authored
I'm about to add a script to send a backup to a different backup server. It's thus probably best if the script names describe their functions in a little more detail
-
- Dec 20, 2021
Dr Adam Thorn authored
-
- Nov 17, 2021
Dr Adam Thorn authored
-
Dr Adam Thorn authored
-
Dr Adam Thorn authored
https://tickets.ch.cam.ac.uk/rt/Ticket/Display.html?id=211460 e.g. RT 211460, 211465. spri-musuem-rt-2025 had been set up with an ad hoc script which was not properly tidying up after itself because it bailed early on a "set -e" error.
-
- Jul 14, 2021
Dr Adam Thorn authored
This will let us use zfs_target as the name of a subtest, which in turn means we would be able to separately log and graph multiple backup targets associated with a single host. This change does not affect the current parsing performed when we input data into postgres: it uses non-anchored regexps to identify SpaceUsed etc., so prepending extra text won't change anything.
-
- Jul 09, 2021
Dr Adam Thorn authored
In some versions of backup_queue (I think just on splot4 now), we use backup_log.isrunning as part of the logic to determine if a task should be enqueued. The problem is that scheduler.pl makes three writes on the table:
1) an insert when the task is queued (a trigger sets isrunning='t' here)
2) an update to set started_processing when the task begins (a trigger sets isrunning='f' here!!!!)
3) an update to set ended_processing when the task finishes (a trigger again sets isrunning='f' here)
Thus, being careful to only set isrunning='f' when a backup task is finished (i.e. when we set ended_processing=now() in scheduler.pl) seems sensible, and empirically does seem to lead to the right backup_queue without duplicates.

This commit will only affect new setups of backup servers; the change has been deployed to live servers with an ad hoc script I've run.

I think we only see this on splot4 because it has a very different definition of the backup_queue view to a) the one defined in this file, b) the one that's on all the other backup servers. If I just try to replace the view on splot4, though, any attempt to select from it just times out, so there may be other relations on splot4 that need updating too. NB the obvious thing missing on splot4 is WHERE ((backup_log.backup_task_id = a.backup_task_id) AND (backup_log.ended_processing IS NULL))) < 1)) which feels like a hack but nonetheless ensures in practice that we don't get duplicate queued tasks.
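
A sketch of the trigger behaviour described above (the function/trigger names and the database are placeholders; the real definitions live in the schema file and the ad hoc deployment script):

    # The point: isrunning only goes back to false once ended_processing
    # has actually been set by scheduler.pl.
    psql -d backups -c "
        CREATE OR REPLACE FUNCTION backup_log_set_isrunning() RETURNS trigger AS \$fn\$
        BEGIN
            NEW.isrunning := (NEW.ended_processing IS NULL);
            RETURN NEW;
        END;
        \$fn\$ LANGUAGE plpgsql;
        DROP TRIGGER IF EXISTS backup_log_isrunning ON backup_log;
        CREATE TRIGGER backup_log_isrunning
            BEFORE INSERT OR UPDATE ON backup_log
            FOR EACH ROW EXECUTE PROCEDURE backup_log_set_isrunning();"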
-
- Jul 08, 2021
Dr Adam Thorn authored
We don't always need the role data, if the presumption is that we'll be doing a pg_restore in conjunction with an ansible role which creates all required roles. But, having a copy of the role data will never hurt! It also gives us a straightforward way of restoring a database to a standalone postgres instance without having to have provisioned a dedicated VM with the relevant ansible roles.
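
i.e. alongside the per-database dumps we now also do something like this (the output path is illustrative):

    # Roles (users, memberships, passwords) are cluster-wide, so per-database
    # pg_dump never captures them; dump them separately too.
    sudo -u postgres pg_dumpall --roles-only | gzip > /var/backups/postgres-roles.sql.gz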
-
Dr Adam Thorn authored
At present we use myriad one-off per-host scripts to do a pg_dump, and they all do (or probably should do) the same thing. In combination with setting options in the host's backup config file, I think this single script covers all our routine pg backups.
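
A hedged sketch of the shape of such a script (the variable names and the way options arrive from the backup config file are illustrative):

    # Hypothetical knobs set by the host's backup config file:
    PG_DUMP_DIR=${PG_DUMP_DIR:-/var/backups/postgres}
    PG_DATABASES=${PG_DATABASES:-$(sudo -u postgres psql -At -c \
        'SELECT datname FROM pg_database WHERE NOT datistemplate')}

    mkdir -p "$PG_DUMP_DIR"
    for db in $PG_DATABASES; do
        sudo -u postgres pg_dump --format=custom "$db" > "$PG_DUMP_DIR/$db.dump"
    done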
-
Dr Adam Thorn authored
Previously we were just passing the hostname. Adding extra args should not impact any existing script, but will let us write better/more maintainable/deduplicated PRE scripts.
-
- Jun 29, 2021
Dr Catherine Pitt authored
This came about because a disk has failed on nest-backup, which only has subdirectory backups of nest-filestore-0, and so move-machine.sh was not going to be helpful - it assumes all tasks for a machine are on the same zpool, which isn't true there. In this case I did the move by hand, but have sketched out the steps in the script in the hope that next time we have to do this we'll do it by looking at the script and running bits by hand, then improving the script a bit, and continuing until it's usable.
-
- Jun 18, 2021
Dr Adam Thorn authored
This is just a change to the packaging, not to the actual deployed contents of the package. This deb has quite a few conffiles which made unattended-upgrades flag the mistake when I tried to upgrade. We should now have the right list of conffiles: makedeb@08d12c3c
-
Dr Adam Thorn authored
I don't think there's a sensible default quota; the value for a workstation will be very different to a tiny VM, for example.
-
- Jun 15, 2021
Dr Catherine Pitt authored
prepare-nondebian does not work on RedHat machines running MySQL as the paths are different, so this provides a fixed version. prepare-nondebian has historically been used more widely than just RedHat, hence the decision to provide a RedHat-specific version rather than just edit it.
-
- Jun 08, 2021
Dr Adam Thorn authored
This is needed on focal if a client is to be able to access snapshots over NFS. From the docs I don't see why we didn't also need this option on xenial, but empirically, we need it on focal. (e.g. RT-207229)
-
- May 12, 2021
Dr Catherine Pitt authored
The generation of the command to unexport NFS filesystems could generate an invalid command. Leading spaces were not being stripped, and in cases where there is more than one backup target for a machine we need to unexport every target. Because we also had 'set -e' in operation at this point, the script would fail there and never clean up the moved ZFS. I don't mind if we fail to unexport; if that's subsequently a problem for removing the ZFS then the script will fail at that point. This change makes the script generate better exportfs -u commands and not exit if they fail.
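
i.e. the unexport step now follows this sort of pattern (how the target list is built is glossed over here; $NFS_TARGETS is a placeholder newline-separated list, and a '*' client spec is assumed, which may not match the real exports):

    # The earlier steps still run under set -e; unexporting is best-effort.
    echo "$NFS_TARGETS" | sed 's/^[[:space:]]*//' | while read -r target; do
        [ -n "$target" ] || continue
        exportfs -u "*:${target}" || true   # do not abort the cleanup on failure
    done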
-
- Apr 30, 2021
Dr Catherine Pitt authored
The code used to open a database connection for each thread and leave them open for as long as the scheduler ran. This worked reasonably well until we moved to PostgreSQL 13 on Focal, although the scheduler would fail if the database was restarted because there was no logic to reconnect after a connection dropped.

On Focal/PG13 the connection for the 'cron' thread steadily consumes memory until it has exhausted everything in the machine. This appears to be a Postgres change rather than a Perl DBI change: the problem can be reproduced by sitting in psql and running 'select * from backup_queue' repeatedly. Once or twice a minute an instance of this query will cause the connection to consume another MB of RAM which is not released until the database connection is closed. The cron thread runs that query every two seconds. My guess is it's something peculiar about the view that query selects from - the time interval thing is interesting. This needs more investigation.

But in the meantime I'd like to have backup servers that don't endlessly gobble RAM, so this change makes the threads connect to the database only when they need to, and closes the connection afterwards. This should also make things work better over database restarts but that's not been carefully tested.
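
For the record, the repro amounts to keeping one connection open and re-running the scheduler's query while watching the backend's RSS, e.g. (database name is a placeholder):

    # One long-lived connection re-running the query every two seconds,
    # just as the scheduler's cron thread does:
    while sleep 2; do echo 'select * from backup_queue;'; done | psql -d backups > /dev/null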
-
- Jan 18, 2021
Dr Catherine Pitt authored
-
- Jan 06, 2021
Dr Adam Thorn authored
As of focal a bunch of top-level dirs are symlinks (e.g. /lib -> /usr/lib) but the deb packages still deploy files to the symlink rather than the real dir. Thus if we just take the contents of all the *.list files we end up not excluding lots of files that are in fact provided by debs.
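
The fix therefore has to normalise the merged-/usr prefixes when building the excludes list, roughly (the output path is illustrative):

    # dpkg's *.list files record paths like /lib/x86_64-linux-gnu/libfoo.so,
    # but on focal /bin, /sbin and /lib* are symlinks into /usr, and the
    # backup sees the real paths; rewrite the merged-/usr prefixes so the
    # excludes actually match.
    cat /var/lib/dpkg/info/*.list \
        | sed -E 's,^/(bin|sbin|lib|lib32|lib64|libx32)/,/usr/\1/,' \
        | sort -u > /var/backups/excludes.debs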
-