FAQ | This is a LIVE service | Changelog

  1. Oct 23, 2024
  2. Sep 25, 2024
  3. Jul 10, 2024
  4. Jun 10, 2024
  5. May 24, 2024
    • Add -q flag to psql calls to suppress 'INSERT 0 1' etc · 59d01a57
      Dr Catherine Pitt authored
      Closes #6
      
      When running psql commands to insert rows in the database, psql normally
      returns a message about what it did, e.g. "INSERT 0 1" if it inserted a
      row. This can be suppressed with -q. Several of the scripts use psql
      commands to get primary keys from the database, inserting the row if
      necessary, so the host id variable in the script can end up set to
      'INSERT 0 1 <thehostid>', which causes problems when that variable is
      used in other SQL commands.
      
      This always used to work; I suspect the thing that changed is our
      upgrading to Postgres 16 on the backup servers, but I'm struggling to
      see how as Postgres 13 seems to behave the same for me.
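      In shell terms the pattern looks something like this (a sketch with hypothetical database, table and column names; the -q flag is the fix):

```
# Hypothetical fetch-or-insert of a primary key, as the scripts do.
# Without -q, psql also prints the command status ("INSERT 0 1") on
# stdout, so $HOSTID would capture "INSERT 0 1 <thehostid>".
HOSTID=$(psql -qtA -d backups \
    -c "INSERT INTO host (hostname) VALUES ('$TARGET') RETURNING host_id")
```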
  6. May 07, 2024
  7. May 01, 2024
  8. Oct 23, 2023
    • add unique hostname constraint to host table · b902dac5
      Dr Adam Thorn authored
      We had ended up, somehow, with a few hosts on one backup server
      which appeared twice in `host` - one with disabled=f and some backup
      tasks as expected, and one with disabled=t. I manually (necessarily)
      deleted the latter before adding this constraint on the live servers.
      
      (all servers listed in zfs_backup_server.conf have had the db table
      manually updated)
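      The constraint is along these lines (sketch; hypothetical constraint and column names):

```
# Run after the duplicate rows have been deleted by hand, as above.
psql -q -d backups \
    -c "ALTER TABLE host ADD CONSTRAINT host_hostname_key UNIQUE (hostname)"
```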
    • don't mark hosts as disabled at the point they get created · 594be148
      Dr Adam Thorn authored
      I think the intention here was perhaps:
      
      - create new host, marked as disabled
      - finish setting up backups
      - once done, mark host as enabled
      
      ...except the script runs with "set -e", so if something goes wrong
      we just never get as far as enabling the host, which means not only
      do no backups run, but no failure reports get sent to xymon, so we
      don't even notice the failure. This is not good.
      
      Given that a row in the `host` table doesn't do much in and
      of itself, I see no reason why we shouldn't just mark the host as
      initially enabled. We won't try to actually perform a backup until
      a `backup_task` has been created. Perhaps this leads to a brief
      transient behaviour where xymon reports a backup as failing whilst
      the script is still running - but OTOH the xymon report for a new
      machine will always be red for "a while" until the first backup
      has actually run OK.
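      So host creation now amounts to (sketch, hypothetical names; note disabled='f' from the start):

```
psql -q -d backups \
    -c "INSERT INTO host (hostname, disabled) VALUES ('$TARGET', 'f')"
```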
    • don't bail (due to set -e) if ssh-keyscan fails · 20de2933
      Dr Adam Thorn authored
      We don't need to record the ssh host key in most cases given that we
      generally deploy signed ssh host keys, but I suspect we might have the
      occasional backup target where that doesn't apply (e.g. clusters?)
      
      Regardless, if we can't scan the host key the right behaviour is for
      the script to continue on and set up the backup. If the backup then
      fails due to the absent host key, we will be alerted and take suitable
      action. Right now, the failure mechanism is that we silently don't finish
      setting up the backup, the backup never gets enabled, and we don't
      realise we don't have a backup - eek.
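      The change boils down to tolerating a failed scan (self-contained sketch; the target host and output file are hypothetical):

```shell
#!/bin/sh
set -e
TARGET=backup-target.example.org
# If the keyscan fails (no DNS, host down, no sshd) just carry on: a
# later backup failure will alert us, whereas bailing out here under
# 'set -e' is silent.
ssh-keyscan -T 2 -H "$TARGET" >> known_hosts.tmp 2>/dev/null || true
echo "continuing setup for $TARGET"
```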
  9. Sep 08, 2023
    • Add script for moving a backup task rather than a host · 43774528
      Dr Catherine Pitt authored
      For machines like nest-backup and cerebro-backup we have lots of backup
      tasks for the same host spread across several zpools, so
      move-machine-to-zpool.sh can't be used to migrate the contents of a
      failing zpool/disk. This adds a script to move an individual ZFS which is
      the target of a backup task to another zpool.
      
      It assumes all necessary parent ZFSes already exist on the target. If
      they don't it fails.
      
      It does not yet clean up the old ZFS as it's not had a lot of use.
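      The core of the move is the usual snapshot/send/receive dance (hedged sketch; pool and dataset names are hypothetical, and the script's bookkeeping around the backup_task row is omitted):

```
# Assumes newpool/backup/myhost already exists on the target zpool.
zfs snapshot -r oldpool/backup/myhost/data@move
zfs send -R oldpool/backup/myhost/data@move | \
    zfs recv newpool/backup/myhost/data
# The old ZFS is deliberately left in place for now (see above).
```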
  10. Sep 01, 2023
    • send-backup-to-server.sh copies additional config · 0f5bbdc6
      Dr Catherine Pitt authored
      We have started putting extra configuration for sshing to a host in a
      file in the /etc/chem-zfs-backup-server/zfs-rsync.d/$HOSTNAME directory.
      This updates the backup migration script to copy that as well as the
      main config file for the machine. I've chosen to copy the entire
      directory to catch other files we might want to add in future. There is
      often an 'exclude' file in there that's autogenerated by the prepare
      scripts, but copying that won't do any damage; it's just redundant
      because it will be regenerated when the backup runs.
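      The idea, mocked up locally (hypothetical paths; the real script copies to the destination backup server rather than a local dst/):

```shell
#!/bin/sh
set -e
# Stand-in for /etc/chem-zfs-backup-server/zfs-rsync.d/$HOSTNAME
mkdir -p src/zfs-rsync.d/myhost dst/zfs-rsync.d
printf 'extra ssh config\n' > src/zfs-rsync.d/myhost/ssh_config
printf 'autogenerated\n'    > src/zfs-rsync.d/myhost/exclude
# Copy the whole directory, not just the main config file, so files we
# add in future travel too; copying the autogenerated 'exclude' is
# harmless because it is regenerated when the backup runs.
cp -a src/zfs-rsync.d/myhost dst/zfs-rsync.d/
```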
  11. Aug 30, 2023
  12. Aug 09, 2023
    • Allow setting of global rsync command and rsync args · 46533d5a
      Dr Catherine Pitt authored
      This adds a new config file which allows setting the command to use for
      'rsync' and global options for that command. This is motivated by the
      need to use an alternative rsync command on Jammy machines, as the
      system one is too slow.
      
      The option for global rsync arguments was added as a way to add the
      '--trust-sender' flag to all backups to turn off certain checks that we
      suspect to be the cause of the slowdown, but it didn't help enough to
      fix the speed problem. Instead we are going to use our own package of an
      older rsync from before the checking code was added, which of course
      doesn't support --trust-sender so the global args are left blank.
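      The new file might look like this (hypothetical path, variable names and package location):

```
# /etc/chem-zfs-backup-server/rsync.conf
RSYNC=/opt/chem-rsync/bin/rsync   # our own older rsync on Jammy machines
RSYNCARGS=""                      # would have held --trust-sender
```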
  13. Jul 27, 2023
    • Add ability to have a per-host ssh config file · 6fda699c
      Dr Adam Thorn authored
      Custom options need to go in a file passed via -F because we want
      to specify options for both ssh and scp. They don't have a compatible
      set of CLI options, but both take -F.
      
      This supersedes 53f5ba49; I had only deployed SSHOPTIONS for one host, which
      I've updated.
      
      This also removes the SSHPORT option, which had only been used in the config
      for one host which I've updated.
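      The per-host file can then carry any ssh_config options, shared by both tools (hypothetical file location, host and options):

```
# zfs-rsync.d/<host>/ssh_config, used as: ssh -F <file> / scp -F <file>
Host backup-target.example.org
    Port 2222
    User backup
```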
    • move zfs-rsync template config files out of /etc · 871511b2
      Dr Adam Thorn authored
      These are static files provided by our package, not config files.
    • move prepare scripts out of /etc · ced34c01
      Dr Adam Thorn authored
      These are not config files, and we should not be modifying the package-provided
      versions of these files. I'm leaving symlinks behind to make sure we don't break
      all our existing backups though!
  14. Jul 24, 2023
    • add option to specify SSHOPTIONS to rsync tasks · 53f5ba49
      Dr Adam Thorn authored
      This could/should probably supersede the specific option for SSHPORT
      as I think usage of that is minimal or perhaps even zero, but we'd
      have to check if that's in use and make suitable updates to config
      files before removing it.
  15. Mar 10, 2022
  16. Mar 09, 2022
  17. Dec 20, 2021
  18. Nov 17, 2021
  19. Jul 09, 2021
    • Partial fix for behaviour where we see multiple backups for one task running at once · 3941a9df
      Dr Adam Thorn authored
      In some versions of backup_queue (I think just on splot4 now), we use
      backup_log.isrunning as part of the logic to determine if a task should be
      enqueued. The problem is that scheduler.pl makes three writes to the table:
      
      1) an insert when the task is queued (a trigger sets isrunning='t' here)
      2) an update to set started_processing when the task begins (a trigger
         sets isrunning='f' here!!!!)
      3) an update to set ended_processing when the task finishes (a trigger
         again sets isrunning='f' here)
      
      Thus, being careful to only set isrunning='f' when a backup task is finished
      (i.e. when we set ended_processing=now() in scheduler.pl) seems sensible, and
      empirically does seem to lead to the right backup_queue without duplicates.
      
      This commit will only affect new setups of backup servers; the change has been
      deployed to live servers with an ad hoc script I've run.
      
      I think we only see this on splot4 because it has a very different definition of
      the backup_queue view to a) the one defined in this file, b) the one that's on
      all the other backup servers. If I just try to replace the view on splot4, though,
      any attempt to select from it just times out so there may be other relations on
      splot4 that need updating too.
      
      NB the obvious thing missing on splot4 is
      
      WHERE ((backup_log.backup_task_id = a.backup_task_id) AND (backup_log.ended_processing IS NULL))) < 1))
      
      which feels like a hack but nonetheless ensures in practice that we don't get
      duplicate queued tasks.
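      Concretely, the trigger behaviour we want is along these lines (hedged sketch; the real trigger and column names may differ):

```
-- Only clear isrunning when the task has actually finished, i.e. when
-- ended_processing is set - not when started_processing is set.
CREATE OR REPLACE FUNCTION backup_log_set_isrunning() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        NEW.isrunning := 't';
    ELSIF NEW.ended_processing IS NOT NULL THEN
        NEW.isrunning := 'f';
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;
```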
  20. Jul 08, 2021
  21. Jun 29, 2021
    • Add an outline script for moving a whole zpool · 7ae60f97
      Dr Catherine Pitt authored
      This came about because a disk has failed on nest-backup, which only has
      subdirectory backups of nest-filestore-0 and so move-machine.sh was not
      going to be helpful - it assumes all tasks for a machine are on the same
      zpool which isn't true there. In this case I did the move by hand, but
      have sketched out the steps in the script in the hope that next time we
      have to do this we'll do it by looking at the script and running bits by
      hand, then improve the script a bit, and continue until it's usable.
  22. Jun 18, 2021
  23. Jun 08, 2021
  24. May 12, 2021
    • Fix a bug in the move-machine script · 75db08dc
      Dr Catherine Pitt authored
      The generation of the command to unexport NFS filesystems could generate
      an invalid command. Leading spaces were not being stripped, and in cases
      where there is more than one backup target for a machine we need to
      unexport every target. Because we also had 'set -e' in operation at this
      point, the script would fail there and never clean up the moved ZFS. I
      don't mind if we fail to unexport; if that's subsequently a problem for
      removing the ZFS then the script will fail at that point.
      
      This change makes the script generate better exportfs -u commands and
      not exit if they fail.
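      A self-contained sketch of the generation pattern (mock query output, and echoing the commands instead of running exportfs):

```shell
#!/bin/sh
# Mock of the per-target query output, complete with the leading
# whitespace that used to end up in the generated command.
targets='  tank/backup/myhost
  tank/backup/myhost/scratch'
# 'read' strips the leading whitespace; one command per backup target.
cmds=$(echo "$targets" | while read -r t; do
    echo "exportfs -u client.example.org:/$t"
done)
echo "$cmds"
```

      In the real script each generated command is then allowed to fail without aborting the run (so 'set -e' can't kill the script there); a genuine problem will surface later if the ZFS can't be removed.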
  25. Apr 30, 2021
    • Make database connections short-lived · e40c1a55
      Dr Catherine Pitt authored
      The code used to open a database connection for each thread and leave
      them open for as long as the scheduler ran. This worked reasonably well
      until we moved to PostgreSQL 13 on Focal, although the scheduler would
      fail if the database was restarted because there was no logic to
      reconnect after a connection dropped.
      
      On Focal/PG13 the connection for the 'cron' thread steadily consumes
      memory until it has exhausted everything in the machine. This appears to
      be a Postgres change rather than a Perl DBI change: the problem can be
      reproduced by sitting in psql and running 'select * from backup_queue'
      repeatedly. Once or twice a minute an instance of this query will cause
      the connection to consume another MB of RAM which is not released until
      the database connection is closed. The cron thread runs that query every
      two seconds. My guess is it's something peculiar about the view that
      query selects from - the time interval thing is interesting.
      This needs more investigation.
      
      But in the meantime I'd like to have backup servers that don't endlessly
      gobble RAM, so this change makes the threads connect to the database
      only when they need to, and closes the connection afterwards. This
      should also make things work better over database restarts but that's
      not been carefully tested.
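      The leak can be reproduced without the scheduler (sketch, hypothetical database name): issue the query many times over one long-lived connection, as the cron thread does, and watch the backend's memory on Focal/PG13.

```
# 1000 iterations of the cron thread's query down a single connection.
yes 'SELECT * FROM backup_queue;' | head -n 1000 | psql -qAt -d backups >/dev/null
```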
  26. Dec 11, 2020
  27. Oct 06, 2020