| ============= |
| HSqueeze tool |
| ============= |
| |
| .. contents:: :depth: 4 |
| |
| This is a design document detailing the node-freeing scheduler, HSqueeze. |
| |
| |
| Current state and shortcomings |
| ============================== |
| |
| Externally-mirrored instances can be moved between nodes at low |
| cost. Therefore, it is attractive to free up nodes and power them down |
| at times of low usage, even for small periods of time, like nights or |
| weekends. |
| |
| Currently, the best way to find out a suitable set of nodes to shut down |
| is to use the property of our balancedness metric to move instances |
| away from drained nodes. So, one would manually drain more and more |
| nodes and see, if `hbal` could find a solution freeing up all those |
| drained nodes. |
| |
| |
| Proposed changes |
| ================ |
| |
| We propose the addition of a new htool command-line tool, called |
| `hsqueeze`, that aims at keeping resource usage at a constant high |
| level by evacuating and powering down nodes, or powering up nodes and |
| rebalancing, as appropriate. By default, only externally-mirrored |
| instances are moved, but options are provided to additionally take |
| DRBD instances (which can be moved without downtimes), or even all |
| instances into consideration. |
| |
| Tagging of standy nodes |
| ----------------------- |
| |
| Powering down nodes that are technically healthy effectively creates a |
| new node state: nodes on standby. To avoid further state |
| proliferation, and as this information is only used by `hsqueeze`, |
| this information is recorded in node tags. `hsqueeze` will assume |
| that offline nodes having a tag with prefix `htools:standby:` can |
| easily be powered on at any time. |
| |
| Minimum available resources |
| --------------------------- |
| |
| To keep the squeezed cluster functional, a minimal amount of resources |
| will be left available on every node. While the precise amount will |
| be specifiable via command-line options, a sensible default is chosen, |
| like enough resource to start an additional instance at standard |
| allocation on each node. If the available resources fall below this |
| limit, `hsqueeze` will, in fact, try to power on more nodes, till |
| enough resources are available, or all standy nodes are online. |
| |
| To avoid flapping behavior, a second, higher, amount of reserve |
| resources can be specified, and `hsqueeze` will only power down nodes, |
| if after the power down this higher amount of reserve resources is |
| still available. |
| |
| Computation of the set to free up |
| --------------------------------- |
| |
| To determine which nodes can be powered down, `hsqueeze` basically |
| follows the same algorithm as the manual process. It greedily goes |
| through all non-master nodes and tries if the algorithm used by `hbal` |
| would find a solution (with the appropriate move restriction) that |
| frees up the extended set of nodes to be drained, while keeping enough |
| resources free. Being based on the algorithm used by `hbal`, all |
| restrictions respected by `hbal`, in particular memory reservation |
| for N+1 redundancy, are also respected by `hsqueeze`. |
| The order in which the nodes are tried is choosen by a |
| suitable heuristics, like trying the nodes in order of increasing |
| number of instances; the hope is that this reduces the number of |
| instances that actually have to be moved. |
| |
| If the amount of free resources has fallen below the lower limit, |
| `hsqueeze` will determine the set of nodes to power up in a similar |
| way; it will hypothetically add more and more of the standby |
| nodes (in some suitable order) till the algorithm used by `hbal` will |
| finally balance the cluster in a way that enough resources are available, |
| or all standy nodes are online. |
| |
| |
| Instance moves and execution |
| ---------------------------- |
| |
| Once the final set of nodes to power down is determined, the instance |
| moves are determined by the algorithm used by `hbal`. If |
| requested by the `-X` option, the nodes freed up are drained, and the |
| instance moves are executed in the same way as `hbal` does. Finally, |
| those of the freed-up nodes that do not already have a |
| `htools:standby:` tag are tagged as `htools:standby:auto`, all free-up |
| nodes are marked as offline and powered down via the |
| :doc:`design-oob`. |
| |
| Similarly, if it is determined that nodes need to be added, then first |
| the nodes are powered up via the :doc:`design-oob`, then they're marked |
| as online and finally, |
| the cluster is balanced in the same way, as `hbal` would do. For the |
| newly powered up nodes, the `htools:standby:auto` tag, if present, is |
| removed, but no other tags are removed (including other |
| `htools:standby:` tags). |
| |
| |
| Design choices |
| ============== |
| |
| The proposed algorithm builds on top of the already present balancing |
| algorithm, instead of greedily packing nodes as full as possible. The |
| reason is, that in the end, a balanced cluster is needed anyway; |
| therefore, basing on the balancing algorithm reduces the number of |
| instance moves. Additionally, the final configuration will also |
| benefit from all improvements to the balancing algorithm, like taking |
| dynamic CPU data into account. |
| |
| We decided to have a separate program instead of adding an option to |
| `hbal` to keep the interfaces, especially that of `hbal`, cleaner. It is |
| not unlikely that, over time, additional `hsqueeze`-specific options |
| might be added, specifying, e.g., which nodes to prefer for |
| shutdown. With the approach of the `htools` of having a single binary |
| showing different behaviors, having an additional program also does not |
| introduce significant additional cost. |
| |
| We decided to have a whole prefix instead of a single tag reserved |
| for marking standby nodes (we consider all tags starting with |
| `htools:standby:` as serving only this purpose). This is not only in |
| accordance with the tag |
| reservations for other tools, but it also allows for further extension |
| (like specifying priorities on which nodes to power up first) without |
| changing name spaces. |