| ============ |
| HRoller tool |
| ============ |
| |
| .. contents:: :depth: 4 |
| |
| This is a design document detailing the cluster maintenance scheduler, |
| HRoller. |
| |
| |
| Current state and shortcomings |
| ============================== |
| |
To enable automating cluster-wide reboots, a new htool called HRoller
was added to Ganeti starting from version 2.7. This tool helps
parallelize cluster offline maintenances by calculating which sets of
nodes do not contain both the primary and the secondary of any DRBD
instance, and can thus be rebooted at the same time, when all
instances are down.
| |
| The way this is done is documented in the :manpage:`hroller(1)` manpage. |
| |
| We would now like to perform online maintenance on the cluster by |
| rebooting nodes after evacuating their primary instances (rolling |
| reboots). |
| |
| Proposed changes |
| ================ |
| |
| New options |
| ----------- |
| |
- HRoller should be able to operate on single nodegroups (``-G`` flag)
  or select its target nodes through some other means (e.g. via a tag
  or a regexp). (Note that individual node selection is already
  possible via the ``-O`` flag, which makes hroller ignore a node
  altogether.)
- HRoller should handle non-redundant instances: currently these are
  ignored, but there should be a way to choose between "it's OK to
  reboot a node when a non-redundant instance is on it" and "skip
  nodes with non-redundant instances". This will only be selectable
  globally, and not per instance.
- HRoller will make sure to keep any instance which is up in its
  current state, via live migrations, unless explicitly overridden.
  The algorithm that will be used to calculate the rolling reboot with
  live migrations is described below; any override of the instance
  status consideration will only be possible for the whole run, and
  not per instance.
| |
| |
| Calculating rolling maintenances |
| -------------------------------- |
| |
| In order to perform rolling maintenance we need to migrate instances off |
| the nodes before a reboot. How this can be done depends on the |
| instance's disk template and status: |
| |
| Down instances |
| ++++++++++++++ |
| |
If an instance was shut down when the maintenance started, it will be
taken into account to avoid rebooting its primary and secondary nodes
at the same time, but will *not* be considered a target for node
evacuation. This avoids needlessly moving its primary around, since
the instance won't suffer any downtime anyway.
| |
Note that a node with non-redundant instances will only ever be
considered good for a rolling reboot if these are down (or the status
check is overridden) *and* an explicit option to allow it is set.
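
The rule above combines two independent switches. As a minimal sketch,
assuming hypothetical option and helper names (this is only meant to
spell out the logic, not hroller's actual implementation):

.. code-block:: python

  def node_ok_for_reboot(non_redundant_up, allow_non_redundant,
                         ignore_status):
      """Decide whether a node with non-redundant instances may reboot.

      non_redundant_up: one boolean per non-redundant instance on the
        node, True if that instance is up.
      allow_non_redundant: hypothetical global option allowing reboots
        of nodes hosting non-redundant instances.
      ignore_status: hypothetical global override that skips the
        instance status check for the whole run.
      """
      if not non_redundant_up:
          return True  # no non-redundant instances at all
      if not allow_non_redundant:
          return False  # the explicit option is required in any case
      # With the option set, the instances must also all be down,
      # unless the status check is overridden for the whole run.
      return ignore_status or not any(non_redundant_up)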
| |
| DRBD |
| ++++ |
| |
All instances on a node must be migrated to their secondaries; the
node can then either be rebooted, or its secondaries can be evacuated
as well.
| |
Since currently doing a ``replace-disks`` on DRBD breaks redundancy,
it's not any safer than temporarily rebooting a node with secondaries
on it (citation needed). As such we'll implement just the
"migrate+reboot" mode for now, and focus on ``replace-disks`` later.
| |
| In order to do that we can use the following algorithm: |
| |
1) Compute node sets that don't contain both the primary and the
   secondary of any instance, and also don't contain the primary
   nodes of two instances that have the same node as secondary. These
   can be obtained by computing a coloring of the graph with nodes as
   vertices and an edge between two nodes if either condition
   prevents simultaneous maintenance. (This is the current algorithm
   of :manpage:`hroller(1)`, with the extension that the graph to be
   colored has additional edges between the primary nodes of two
   instances sharing their secondary node.) A sketch of this step is
   given after this list.
2) It is then possible to migrate in parallel the instances off all
   nodes in a set created at step 1, then reboot/perform maintenance
   on those nodes, and finally migrate their original primaries back.
   This allows the computation above to be reused for each following
   set without triggering N+1 failures, if none were present before.
   See below about the actual execution of the maintenance.
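
As an illustration of step 1, here is a minimal sketch of the conflict
graph and a greedy coloring; the input format is hypothetical, and
this only shows the idea, not hroller's actual implementation:

.. code-block:: python

  from itertools import combinations

  def reboot_sets(nodes, drbd_instances):
      """Partition nodes into sets that can be rebooted together.

      drbd_instances: iterable of (primary, secondary) node name pairs.
      """
      # Conflict graph: an edge means "cannot be rebooted together".
      adj = {node: set() for node in nodes}
      by_secondary = {}
      for pri, sec in drbd_instances:
          adj[pri].add(sec)  # primary vs. its own secondary
          adj[sec].add(pri)
          by_secondary.setdefault(sec, set()).add(pri)
      for primaries in by_secondary.values():
          for n1, n2 in combinations(primaries, 2):
              adj[n1].add(n2)  # two primaries sharing a secondary
              adj[n2].add(n1)
      # Greedy coloring, highest-degree nodes first; each color class
      # is one set of nodes that can be maintained simultaneously.
      color = {}
      for node in sorted(nodes, key=lambda n: -len(adj[n])):
          used = {color[n] for n in adj[node] if n in color}
          color[node] = next(c for c in range(len(nodes))
                             if c not in used)
      sets = {}
      for node, c in color.items():
          sets.setdefault(c, set()).add(node)
      return list(sets.values())

  # node3 is the shared secondary of the first two instances, so node1
  # and node2 must not be rebooted at the same time either.
  print(reboot_sets(["node1", "node2", "node3", "node4"],
                    [("node1", "node3"), ("node2", "node3"),
                     ("node4", "node2")]))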
| |
| Non-DRBD |
| ++++++++ |
| |
All non-DRBD disk templates that can be migrated have no "secondary"
concept. As such, instances can be migrated to any node (in the same
nodegroup). In order to do the job, we can either:

- Perform migrations on one node at a time, perform the maintenance on
  that node, and proceed (the node will then be targeted again to host
  instances automatically, as hail chooses targets for the instances
  among all nodes in a group). Nodes in different nodegroups can be
  handled in parallel.
- Perform migrations on one node at a time, but without waiting for the
  first node to come back before proceeding. This allows us to keep
  going, shrinking the available capacity, until no more capacity is
  left in the nodegroup; we then have to wait for some nodes to come
  back so that capacity is available again for the last few nodes.
- Pre-calculate sets of nodes that can be migrated together (probably
  with a greedy algorithm, as sketched below) and parallelize between
  them, using the migrate-back approach discussed for DRBD to perform
  the calculation only once.
| |
Note that for non-DRBD disks that still use local storage (e.g. RBD
and plain), redundancy might break anyway, and nothing except the
first algorithm might be safe. This would perhaps be a good reason to
consider better management of RBD pools, if those are implemented on
top of the nodes' storage rather than on dedicated storage machines.
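
The greedy pre-calculation could look roughly as follows; this is only
a sketch, and the ``has_capacity`` predicate (whether the remaining
nodes of the group can host the instances of a candidate set) is a
hypothetical placeholder for the real capacity check:

.. code-block:: python

  def greedy_reboot_sets(group_nodes, has_capacity):
      """Partition group_nodes into sets that can be drained together.

      has_capacity(down_set) must return True if the instances of all
      nodes in down_set fit on the remaining nodes of the group.
      """
      todo = list(group_nodes)
      sets = []
      while todo:
          current = []
          for node in list(todo):
              if has_capacity(current + [node]):
                  current.append(node)
                  todo.remove(node)
          if not current:
              # Not even a single node can be drained.
              raise ValueError("not enough spare capacity in the group")
          sets.append(current)
      return sets

  # Toy capacity model: at most half of the six nodes may be down at
  # any time; this yields two sets of three nodes each.
  print(greedy_reboot_sets(["n1", "n2", "n3", "n4", "n5", "n6"],
                           lambda down: len(down) <= 3))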
| |
| Full-Evacuation |
| +++++++++++++++ |
| |
If full evacuation of the nodes to be rebooted is desired, a simple
migration is not enough for the DRBD instances. To keep the number of
disk operations small, we restrict moves to
``migrate, replace-secondary``. That is, after migrating instances
out of the nodes to be rebooted, replacement secondaries are sought
for all instances whose current secondary is on one of the rebooted
nodes. This is done by a greedy algorithm, refining the initial
reboot partition if necessary.
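
A minimal sketch of that greedy step, with hypothetical data
structures (``secondaries`` maps each instance to its current
secondary node, and ``fits(node, inst)`` stands in for the real
capacity and placement checks):

.. code-block:: python

  def pick_new_secondaries(secondaries, reboot_set, all_nodes, fits):
      """Choose a replacement secondary for every instance whose
      secondary sits on a node about to be rebooted."""
      moves = {}
      for inst, sec in secondaries.items():
          if sec not in reboot_set:
              continue  # disks already live outside the reboot set
          for node in all_nodes:
              if node not in reboot_set and fits(node, inst):
                  moves[inst] = node  # replace-secondary target
                  break
          else:
              # No suitable target: the initial reboot partition has
              # to be refined (split) before this set can proceed.
              raise ValueError("cannot evacuate secondary of " + inst)
      return moves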
| |
| Future work |
| =========== |
| |
HRoller should become able to execute rolling maintenances, rather
than just calculate them. For this to succeed properly, one of the
following must happen:
| |
| - HRoller handles rolling maintenances that happen at the same time as |
| unrelated cluster jobs, and thus recalculates the maintenance at each |
| step |
- HRoller can selectively drain the cluster so that it can be sure
  only the rolling maintenance is going on
| |
The ``replace-disks`` functionality for DRBD nodes should be
implemented. Note that once we support a DRBD version that allows
multiple secondaries, this can be done safely, without losing
replication at any time, by adding a temporary secondary and dropping
the previous one only when the sync is finished.
| |
| Non-redundant (plain or file) instances should have a way to be moved |
| off as well via plain storage live migration or ``gnt-instance move`` |
| (which requires downtime). |
| |
If/when RBD pools can be managed inside Ganeti, care can be taken so
that the pool is also evacuated from a node before it's put into
maintenance. This is equivalent to evacuating DRBD secondaries.
| |
Master failovers during the maintenance should be performed by
HRoller. This requires RPC/RAPI support for master failover. HRoller
should also be modified to better support running on the master itself
and continuing on the new master.
| |
| .. vim: set textwidth=72 : |
| .. Local Variables: |
| .. mode: rst |
| .. fill-column: 72 |
| .. End: |