doc/design-hsqueeze.rst - ganeti - Git at Google

 =============
 HSqueeze tool
 =============

 .. contents:: :depth: 4

 This is a design document detailing the node-freeing scheduler, HSqueeze.


 Current state and shortcomings
 ==============================

 Externally-mirrored instances can be moved between nodes at low
 cost. Therefore, it is attractive to free up nodes and power them down
 at times of low usage, even for small periods of time, like nights or
 weekends.

 Currently, the best way to find out a suitable set of nodes to shut down
 is to use the property of our balancedness metric to move instances
 away from drained nodes. So, one would manually drain more and more
 nodes and see, if `hbal` could find a solution freeing up all those
 drained nodes.


 Proposed changes
 ================

 We propose the addition of a new htool command-line tool, called
 `hsqueeze`, that aims at keeping resource usage at a constant high
 level by evacuating and powering down nodes, or powering up nodes and
 rebalancing, as appropriate. By default, only externally-mirrored
 instances are moved, but options are provided to additionally take
 DRBD instances (which can be moved without downtimes), or even all
 instances into consideration.

 Tagging of standy nodes
 -----------------------

 Powering down nodes that are technically healthy effectively creates a
 new node state: nodes on standby. To avoid further state
 proliferation, and as this information is only used by `hsqueeze`,
 this information is recorded in node tags. `hsqueeze` will assume
 that offline nodes having a tag with prefix `htools:standby:` can
 easily be powered on at any time.

 Minimum available resources
 ---------------------------

 To keep the squeezed cluster functional, a minimal amount of resources
 will be left available on every node. While the precise amount will
 be specifiable via command-line options, a sensible default is chosen,
 like enough resource to start an additional instance at standard
 allocation on each node. If the available resources fall below this
 limit, `hsqueeze` will, in fact, try to power on more nodes, till
 enough resources are available, or all standy nodes are online.

 To avoid flapping behavior, a second, higher, amount of reserve
 resources can be specified, and `hsqueeze` will only power down nodes,
 if after the power down this higher amount of reserve resources is
 still available.

 Computation of the set to free up
 ---------------------------------

 To determine which nodes can be powered down, `hsqueeze` basically
 follows the same algorithm as the manual process. It greedily goes
 through all non-master nodes and tries if the algorithm used by `hbal`
 would find a solution (with the appropriate move restriction) that
 frees up the extended set of nodes to be drained, while keeping enough
 resources free. Being based on the algorithm used by `hbal`, all
 restrictions respected by `hbal`, in particular memory reservation
 for N+1 redundancy, are also respected by `hsqueeze`.
 The order in which the nodes are tried is choosen by a
 suitable heuristics, like trying the nodes in order of increasing
 number of instances; the hope is that this reduces the number of
 instances that actually have to be moved.

 If the amount of free resources has fallen below the lower limit,
 `hsqueeze` will determine the set of nodes to power up in a similar
 way; it will hypothetically add more and more of the standby
 nodes (in some suitable order) till the algorithm used by `hbal` will
 finally balance the cluster in a way that enough resources are available,
 or all standy nodes are online.


 Instance moves and execution
 ----------------------------

 Once the final set of nodes to power down is determined, the instance
 moves are determined by the algorithm used by `hbal`. If
 requested by the `-X` option, the nodes freed up are drained, and the
 instance moves are executed in the same way as `hbal` does. Finally,
 those of the freed-up nodes that do not already have a
 `htools:standby:` tag are tagged as `htools:standby:auto`, all free-up
 nodes are marked as offline and powered down via the
 :doc:`design-oob`.

 Similarly, if it is determined that nodes need to be added, then first
 the nodes are powered up via the :doc:`design-oob`, then they're marked
 as online and finally,
 the cluster is balanced in the same way, as `hbal` would do. For the
 newly powered up nodes, the `htools:standby:auto` tag, if present, is
 removed, but no other tags are removed (including other
 `htools:standby:` tags).


 Design choices
 ==============

 The proposed algorithm builds on top of the already present balancing
 algorithm, instead of greedily packing nodes as full as possible. The
 reason is, that in the end, a balanced cluster is needed anyway;
 therefore, basing on the balancing algorithm reduces the number of
 instance moves. Additionally, the final configuration will also
 benefit from all improvements to the balancing algorithm, like taking
 dynamic CPU data into account.

 We decided to have a separate program instead of adding an option to
 `hbal` to keep the interfaces, especially that of `hbal`, cleaner. It is
 not unlikely that, over time, additional `hsqueeze`-specific options
 might be added, specifying, e.g., which nodes to prefer for
 shutdown. With the approach of the `htools` of having a single binary
 showing different behaviors, having an additional program also does not
 introduce significant additional cost.

 We decided to have a whole prefix instead of a single tag reserved
 for marking standby nodes (we consider all tags starting with
 `htools:standby:` as serving only this purpose). This is not only in
 accordance with the tag
 reservations for other tools, but it also allows for further extension
 (like specifying priorities on which nodes to power up first) without
 changing name spaces.
	=============
	HSqueeze tool
	=============

	.. contents:: :depth: 4

	This is a design document detailing the node-freeing scheduler, HSqueeze.


	Current state and shortcomings
	==============================

	Externally-mirrored instances can be moved between nodes at low
	cost. Therefore, it is attractive to free up nodes and power them down
	at times of low usage, even for small periods of time, like nights or
	weekends.

	Currently, the best way to find out a suitable set of nodes to shut down
	is to use the property of our balancedness metric to move instances
	away from drained nodes. So, one would manually drain more and more
	nodes and see, if `hbal` could find a solution freeing up all those
	drained nodes.


	Proposed changes
	================

	We propose the addition of a new htool command-line tool, called
	`hsqueeze`, that aims at keeping resource usage at a constant high
	level by evacuating and powering down nodes, or powering up nodes and
	rebalancing, as appropriate. By default, only externally-mirrored
	instances are moved, but options are provided to additionally take
	DRBD instances (which can be moved without downtimes), or even all
	instances into consideration.

	Tagging of standy nodes
	-----------------------

	Powering down nodes that are technically healthy effectively creates a
	new node state: nodes on standby. To avoid further state
	proliferation, and as this information is only used by `hsqueeze`,
	this information is recorded in node tags. `hsqueeze` will assume
	that offline nodes having a tag with prefix `htools:standby:` can
	easily be powered on at any time.

	Minimum available resources
	---------------------------

	To keep the squeezed cluster functional, a minimal amount of resources
	will be left available on every node. While the precise amount will
	be specifiable via command-line options, a sensible default is chosen,
	like enough resource to start an additional instance at standard
	allocation on each node. If the available resources fall below this
	limit, `hsqueeze` will, in fact, try to power on more nodes, till
	enough resources are available, or all standy nodes are online.

	To avoid flapping behavior, a second, higher, amount of reserve
	resources can be specified, and `hsqueeze` will only power down nodes,
	if after the power down this higher amount of reserve resources is
	still available.

	Computation of the set to free up
	---------------------------------

	To determine which nodes can be powered down, `hsqueeze` basically
	follows the same algorithm as the manual process. It greedily goes
	through all non-master nodes and tries if the algorithm used by `hbal`
	would find a solution (with the appropriate move restriction) that
	frees up the extended set of nodes to be drained, while keeping enough
	resources free. Being based on the algorithm used by `hbal`, all
	restrictions respected by `hbal`, in particular memory reservation
	for N+1 redundancy, are also respected by `hsqueeze`.
	The order in which the nodes are tried is choosen by a
	suitable heuristics, like trying the nodes in order of increasing
	number of instances; the hope is that this reduces the number of
	instances that actually have to be moved.

	If the amount of free resources has fallen below the lower limit,
	`hsqueeze` will determine the set of nodes to power up in a similar
	way; it will hypothetically add more and more of the standby
	nodes (in some suitable order) till the algorithm used by `hbal` will
	finally balance the cluster in a way that enough resources are available,
	or all standy nodes are online.


	Instance moves and execution
	----------------------------

	Once the final set of nodes to power down is determined, the instance
	moves are determined by the algorithm used by `hbal`. If
	requested by the `-X` option, the nodes freed up are drained, and the
	instance moves are executed in the same way as `hbal` does. Finally,
	those of the freed-up nodes that do not already have a
	`htools:standby:` tag are tagged as `htools:standby:auto`, all free-up
	nodes are marked as offline and powered down via the
	:doc:`design-oob`.

	Similarly, if it is determined that nodes need to be added, then first
	the nodes are powered up via the :doc:`design-oob`, then they're marked
	as online and finally,
	the cluster is balanced in the same way, as `hbal` would do. For the
	newly powered up nodes, the `htools:standby:auto` tag, if present, is
	removed, but no other tags are removed (including other
	`htools:standby:` tags).


	Design choices
	==============

	The proposed algorithm builds on top of the already present balancing
	algorithm, instead of greedily packing nodes as full as possible. The
	reason is, that in the end, a balanced cluster is needed anyway;
	therefore, basing on the balancing algorithm reduces the number of
	instance moves. Additionally, the final configuration will also
	benefit from all improvements to the balancing algorithm, like taking
	dynamic CPU data into account.

	We decided to have a separate program instead of adding an option to
	`hbal` to keep the interfaces, especially that of `hbal`, cleaner. It is
	not unlikely that, over time, additional `hsqueeze`-specific options
	might be added, specifying, e.g., which nodes to prefer for
	shutdown. With the approach of the `htools` of having a single binary
	showing different behaviors, having an additional program also does not
	introduce significant additional cost.

	We decided to have a whole prefix instead of a single tag reserved
	for marking standby nodes (we consider all tags starting with
	`htools:standby:` as serving only this purpose). This is not only in
	accordance with the tag
	reservations for other tools, but it also allows for further extension
	(like specifying priorities on which nodes to power up first) without
	changing name spaces.