    Ensure script exits zero when asked to start a resource, no matter what · 2652b6ac
    Dr Adam Thorn authored
    The resource script is called by heartbeat's ResourceManager. If a
    resource fails to start, the acquireresourcegroup() function there
    calls giveupresourcegroup to stop all resources. This might mean
    all the VMs get pushed back to the node they just came from, or,
    if the twin node is down, it might mean that all the VMs get
    stopped.
    
    We hit the latter condition when I ran "poweroff" on a dom0, and
    one of the resources failed to migrate. Once the resources had
    finished migrating off the node I turned off, the remaining node
    tried to start the missing resource - and failed, due to the
    various checks in place in the "start" block to prevent split
    brains and the like. ResourceManager thus proceeded to turn off
    all of our VMs.
    
    In our environment, we view the different parts of the resource group
    (i.e., our VMs) as independent resources: the failure of one of them
    does not mean the failure of the entire group, and thus shouldn't
    trigger a "CRIT" status in heartbeat. Instead, if we can't start
    a resource, heartbeat should move on - we rely on other external
    monitoring to catch the situation where an individual VM has failed.
    
    NB I've also changed the exit codes for the "migrate" block to zero.
    The "migrate" argument is only ever passed by a human, not heartbeat,
    and exists purely for our convenience. Nonetheless, having the
    exit codes be consistent with "start" seems like a good idea.
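
The behaviour described above can be sketched roughly as follows. This is a
minimal illustration, not the actual resource script: the function names
`attempt_action` and `resource_action`, the VM names, and the warning text
are all hypothetical, and the real "start" block contains split-brain and
other checks that are only stood in for here.

```shell
#!/bin/sh
# Hypothetical sketch; none of these names come from the real script.

# Stand-in for the real start/migrate logic. Assume it returns non-zero
# when a safety check (e.g. split-brain protection) refuses to proceed.
attempt_action() {
    [ "$1" != "broken-vm" ]
}

# Decide the status reported back to the caller (heartbeat or a human).
resource_action() {
    vm="$1"
    action="$2"
    case "$action" in
        start|migrate)
            if ! attempt_action "$vm" "$action"; then
                # Record the failure for humans and external monitoring...
                echo "WARNING: could not $action $vm" >&2
            fi
            # ...but always report success, so that one failed VM does
            # not cause the whole resource group to be given up.
            return 0
            ;;
        *)
            echo "usage: resource_action <vm> {start|migrate}" >&2
            return 1
            ;;
    esac
}
```

With this shape, `resource_action broken-vm start` prints a warning to
stderr and still returns zero, so detecting the failed VM is left to
whatever external monitoring is in place.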