  1. Nov 17, 2021
  2. Jul 09, 2021
    • Partial fix for behaviour where we see multiple backups for one task running at once · 3941a9df
      Dr Adam Thorn authored
      In some versions of backup_queue (I think just on splot4 now), we use
      backup_log.isrunning as part of the logic to determine whether a task
      should be enqueued. The problem is that scheduler.pl makes three writes to
      backup_log:
      
      1) an insert when the task is queued (a trigger sets isrunning='t' here)
      2) an update to set started_processing when the task begins (a trigger
         sets isrunning='f' here!!!!)
      3) an update to set ended_processing when the task finishes (a trigger
         again sets isrunning='f' here)
      
      Thus it seems sensible to only set isrunning='f' when a backup task has
      actually finished (i.e. when scheduler.pl sets ended_processing=now()), and
      empirically that does lead to the right backup_queue contents, with no
      duplicates.
      
      This commit will only affect new setups of backup servers; the change has been
      deployed to live servers with an ad hoc script I've run.
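      
      For the record, a minimal sketch of the sort of change involved, assuming a
      hypothetical function/trigger name and database name (the real objects and
      the ad hoc script may differ):
      
        # "backup" is a hypothetical database name
        psql -X -d backup <<'SQL'
        -- only clear isrunning when ended_processing is set; queuing still
        -- marks the task as running, and setting started_processing no longer
        -- touches the flag
        CREATE OR REPLACE FUNCTION backup_log_set_isrunning() RETURNS trigger AS $$
        BEGIN
            IF TG_OP = 'INSERT' THEN
                NEW.isrunning := 't';      -- write 1: task queued
            ELSIF NEW.ended_processing IS NOT NULL THEN
                NEW.isrunning := 'f';      -- write 3: task finished
            END IF;
            RETURN NEW;                    -- write 2 leaves isrunning alone
        END;
        $$ LANGUAGE plpgsql;
        
        DROP TRIGGER IF EXISTS backup_log_isrunning ON backup_log;
        CREATE TRIGGER backup_log_isrunning
            BEFORE INSERT OR UPDATE ON backup_log
            FOR EACH ROW EXECUTE PROCEDURE backup_log_set_isrunning();
        SQL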
      
      I think we only see this on splot4 because its definition of the
      backup_queue view is very different from both a) the one defined in this
      file and b) the one on all the other backup servers. If I just try to
      replace the view on splot4, though, any attempt to select from it times
      out, so there may be other relations on splot4 that need updating too.
      
      NB the obvious thing missing on splot4 is
      
      WHERE ((backup_log.backup_task_id = a.backup_task_id) AND (backup_log.ended_processing IS NULL))) < 1))
      
      which feels like a hack but nonetheless ensures in practice that we don't get
      duplicate queued tasks.
  3. Jul 08, 2021
  4. Jun 29, 2021
    • Add an outline script for moving a whole zpool · 7ae60f97
      Dr Catherine Pitt authored
      This came about because a disk failed on nest-backup, which only has
      subdirectory backups of nest-filestore-0, so move-machine.sh was not going
      to be helpful: it assumes all tasks for a machine are on the same zpool,
      which isn't true there. In this case I did the move by hand, but I have
      sketched out the steps in the script in the hope that next time we have to
      do this we'll start from the script, run bits of it by hand, improve it a
      little, and keep going until it's usable.
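      
      Not the actual outline script, but roughly the shape of the replication
      step, with hypothetical pool names (oldpool, newpool); a move to another
      server would pipe the send through ssh:
      
        # snapshot every dataset on the failing pool and replicate it wholesale
        zfs snapshot -r oldpool@move
        zfs send -R oldpool@move | zfs receive -duF newpool
        # then repoint the backup tasks and exports at newpool before retiring oldpool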
  5. Jun 18, 2021
  6. Jun 08, 2021
  7. May 12, 2021
    • Fix a bug in the move-machine script · 75db08dc
      Dr Catherine Pitt authored
      The code that builds the command to unexport NFS filesystems could
      produce an invalid command. Leading spaces were not being stripped, and in cases
      where there is more than one backup target for a machine we need to
      unexport every target. Because we also had 'set -e' in operation at this
      point, the script would fail there and never clean up the moved ZFS. I
      don't mind if we fail to unexport; if that's subsequently a problem for
      removing the ZFS then the script will fail at that point.
      
      This change makes the script generate better exportfs -u commands and
      not exit if they fail.
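      
      Roughly the intended shape, with hypothetical variable names (TARGETS,
      CLIENT); the real script builds these from the database:
      
        # unexport every backup target for the machine; 'read' drops the
        # leading whitespace, and '|| true' keeps a failed unexport from
        # aborting the script under 'set -e'
        echo "$TARGETS" | while read -r target; do
            [ -n "$target" ] || continue
            exportfs -u "${CLIENT}:${target}" || true
        done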
  8. Apr 30, 2021
    • Make database connections short-lived · e40c1a55
      Dr Catherine Pitt authored
      The code used to open a database connection for each thread and leave
      them open for as long as the scheduler ran. This worked reasonably well
      until we moved to PostgreSQL 13 on Focal, although the scheduler would
      fail if the database was restarted because there was no logic to
      reconnect after a connection dropped.
      
      On Focal/PG13 the connection for the 'cron' thread steadily consumes
      memory until it has exhausted all the RAM in the machine. This appears to
      be a Postgres change rather than a Perl DBI change: the problem can be
      reproduced by sitting in psql and running 'select * from backup_queue'
      repeatedly. Once or twice a minute an instance of this query will cause
      the connection to consume another MB of RAM which is not released until
      the database connection is closed. The cron thread runs that query every
      two seconds. My guess is it's something peculiar about the view that
      query selects from - the time interval thing is interesting.
      This needs more investigation.
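      
      The reproduction amounts to something like the following (assuming a
      hypothetical local database name), watching the backend's RSS in top or ps:
      
        # one long-lived connection re-running the query every two seconds,
        # just as the cron thread does; the RAM only comes back when psql exits
        while true; do echo 'SELECT * FROM backup_queue;'; sleep 2; done \
            | psql -X -d backup >/dev/null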
      
      But in the meantime I'd like to have backup servers that don't endlessly
      gobble RAM, so this change makes the threads connect to the database
      only when they need to, and closes the connection afterwards. This
      should also make things work better over database restarts but that's
      not been carefully tested.
  9. Dec 11, 2020
  10. Oct 06, 2020
  11. Apr 07, 2020
  12. Dec 18, 2019
  13. Jul 30, 2019
  14. Jul 23, 2019
  15. Apr 23, 2019
  16. Oct 18, 2018
  17. May 22, 2017
  18. Mar 28, 2017
  19. Feb 14, 2017
  20. Jan 12, 2017
  21. Dec 08, 2016
  22. Sep 07, 2016
  23. Apr 12, 2016
  24. Apr 05, 2016
    • Logging enhancements and prepare-nondebian tweak · bd5e7e8d
      Dr Catherine Pitt authored
      Split logs for tasks with the same target machine but different target
      filesystems into separate logfiles, for easier debugging.
      
      The prepare-nondebian script now creates a fake excludes file for the
      machine if one doesn't already exist, because I keep forgetting to create
      it myself for Redhat machines and then the zfs-rsync script fails.
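      
      The tweak amounts to something like this, with a hypothetical excludes
      path:
      
        # create an empty excludes file if the machine doesn't already have
        # one, so the zfs-rsync script has something to hand to rsync
        excludes="/etc/backup/excludes/${machine}"
        [ -e "$excludes" ] || : > "$excludes"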
  25. Jan 15, 2016
  26. Jan 14, 2016
  27. Dec 15, 2015
  28. Sep 18, 2015
  29. Jul 02, 2015