Dr Adam Thorn authored
The resource script is called by heartbeat's ResourceManager. In the acquireresourcegroup() function therein, if a resource fails to start, ResourceManager calls giveupresourcegroup to stop all resources. This might mean all the VMs get pushed back to the node they just came from, or, if the twin node is down, that all the VMs get stopped.

We hit the latter condition when I ran "poweroff" on a dom0 and one of the resources failed to migrate. Once the resources had finished migrating off the node I turned off, the remaining node tried to start the missing resource - and failed, due to the various checks in the "start" block that guard against split-brain and the like. ResourceManager then proceeded to turn off all of our VMs.

In our environment, we view the different parts of the resource group (i.e., our VMs) as independent resources: the failure of one of them does not mean the failure of the entire group, and so shouldn't trigger a "CRIT" status in heartbeat. Instead, if we can't start a resource, heartbeat should move on - we rely on other external monitoring to catch the case where an individual VM has failed.

NB: I've also changed the exit codes in the "migrate" block to zero. The "migrate" argument is only ever passed by a human, never by heartbeat, and exists purely for our convenience. Nonetheless, keeping its exit codes consistent with "start" seems like a good idea.
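To make the shape of the change concrete, here is a minimal sketch of the relevant case blocks. The three helper functions (safe_to_start_vm, start_vm, migrate_vm) are hypothetical stand-ins for the real script's logic, not names from our script; the point being illustrated is the unconditional "exit 0" in both "start" and "migrate", which stops a single failed resource from propagating a failure back to ResourceManager.

    #!/bin/sh
    # Sketch only: these helpers are illustrative placeholders, not the
    # actual resource script's functions.
    safe_to_start_vm() { true; }   # stands in for the split-brain guards
    start_vm()         { true; }
    migrate_vm()       { true; }

    case "$1" in
    start)
        if safe_to_start_vm; then
            start_vm || logger "VM failed to start; external monitoring will catch it"
        else
            logger "refusing to start VM: safety checks failed"
        fi
        # Previously non-zero on failure, which made ResourceManager's
        # giveupresourcegroup stop every resource in the group.
        exit 0
        ;;
    migrate)
        # "migrate" is only ever invoked by hand, never by heartbeat.
        migrate_vm || logger "migration failed"
        exit 0   # consistent with "start"
        ;;
    esac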