=================================
N+1 redundancy for shared storage
=================================

.. contents:: :depth: 4

This document describes how N+1 redundancy is achieved
for instances using shared storage.

Current state and shortcomings
==============================

For instances with DRBD as disk template, in case of failure
of their primary node, there is only one node where the instance
can be restarted immediately. Therefore, ``htools`` reserve enough
memory on that node to cope with the failure of a single node.

Instances using shared storage, however, can be restarted on any
node---implying that no memory has to be reserved on any particular
node. This has motivated the current state, where no memory is
reserved at all; as a consequence, even a large cluster can run
out of capacity.

Proposed changes
================

Definition of N+1 redundancy in the presence of shared storage
--------------------------------------------------------------

A cluster is considered N+1 redundant if, for every node, all
DRBD instances can be migrated out and then all shared-storage
instances can be relocated to a different node, without moving
instances on other nodes. This is precisely the operation carried
out after a node breaks. Obviously, simulating failure and evacuation
for every single node is an expensive operation.

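To make the definition concrete, here is a minimal, self-contained Python
sketch of the per-node failure simulation. It is purely illustrative and not
the actual ``htools`` implementation: the node and instance model below is
hypothetical, memory is the only resource considered, and DRBD handling is
omitted::

  from dataclasses import dataclass, field
  from typing import Dict, List

  @dataclass
  class Node:
      name: str
      free_mem: int                                   # spare memory, in MiB
      shared_insts: List[int] = field(default_factory=list)  # instance sizes

  def n_plus_1_redundant(nodes: List[Node]) -> bool:
      """For every node, check that all of its shared-storage instances can
      be placed on the remaining nodes without moving any other instance."""
      for failed in nodes:
          free: Dict[str, int] = {n.name: n.free_mem
                                  for n in nodes if n is not failed}
          # Greedy first-fit, largest instances first; the real check would
          # reuse the hail allocation logic instead.
          for mem in sorted(failed.shared_insts, reverse=True):
              target = next((k for k, v in free.items() if v >= mem), None)
              if target is None:
                  return False
              free[target] -= mem
      return True

  # Two 4 GiB instances on n1 cannot be evacuated onto nodes that only
  # have 3 GiB spare each, so this cluster is not N+1 redundant.
  nodes = [Node("n1", 1024, [4096, 4096]), Node("n2", 3072), Node("n3", 3072)]
  print(n_plus_1_redundant(nodes))                    # -> False
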
Basic Considerations
--------------------

For DRBD, keeping N+1 redundancy is affected by moving instances and
balancing the cluster. Moreover, taking it into account for balancing
can help :doc:`design-allocation-efficiency`. Hence, N+1 redundancy
for DRBD is to be taken into account for all choices affecting instance
location, including instance allocation and balancing.

Shared-storage instances, on the other hand, can be moved to any node
within the node group. So, in practice, this is mainly a question of
capacity planning, especially if most instances have the same size.
Nevertheless, the offcuts left when instances do not fill a node
entirely must not be ignored.

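As a purely hypothetical illustration of such offcuts: if every node has
100 GiB of memory available for instances and each instance needs 40 GiB,
then each node can host at most two instances and leaves a 20 GiB offcut,
so ten such nodes can hold only twenty instances even though the aggregate
free memory of 1000 GiB would naively suggest twenty-five.
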
Modifications to existing tools
-------------------------------

- ``hail`` will compute and rank possible allocations as usual. However,
  before returning a choice, it will filter out allocations that are
  not N+1 redundant (a sketch of this filtering step follows the list).

- Normal ``gnt-cluster verify`` will not be changed; in particular,
  it will still check for DRBD N+1 redundancy, but not for shared-storage
  N+1 redundancy. However, ``hcheck`` will verify shared-storage
  N+1 redundancy and report it if that fails.

- ``hbal`` will consider and rank moves as usual. However, before deciding
  on the next move, it will filter out those moves that lead from a
  shared-storage N+1 redundant configuration into one that isn't.

- ``hspace`` computing the capacity for DRBD instances will be unchanged;
  in particular, the options ``--accept-existing`` and ``--independent-groups``
  will continue to work. For shared-storage instances, however, it will
  strictly iterate over the same allocation step as ``hail`` does.

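As a rough illustration of the filtering step shared by ``hail`` and
``hbal``, the Python sketch below assumes a redundancy predicate like the one
sketched earlier and hypothetical ``(score, resulting_state)`` candidate
pairs; it is not the actual Haskell implementation::

  def best_allocation(candidates, is_redundant):
      """hail: drop every allocation whose resulting state is not N+1
      redundant, then pick the best remaining one (lower score is better)."""
      viable = [c for c in candidates if is_redundant(c[1])]
      return min(viable, key=lambda c: c[0]) if viable else None

  def move_allowed(currently_redundant, resulting_state, is_redundant):
      """hbal: reject a move only if it would take a currently N+1 redundant
      configuration into one that is not."""
      return not currently_redundant or is_redundant(resulting_state)
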
Other modifications related to opportunistic locking
----------------------------------------------------

To allow parallel instance creation, instance creation jobs can be instructed
to run with just whatever node locks are currently available. In this case, an
allocation has to be chosen from that restricted set of nodes. Currently, this
is achieved by sending ``hail`` a cluster description in which all other nodes
are marked offline; that works as long as only local properties are considered.
With global properties, however, the capacity of the cluster is materially
underestimated, causing spurious global N+1 failures.

Therefore, we conservatively extend the request format of ``hail`` by an
optional parameter ``restrict-to-nodes``. If that parameter is given, only
allocations on those nodes will be considered. This is an additional
restriction to the ones currently in place (e.g., the node must be online,
a particular group might have been requested). If opportunistic locking is
enabled, calls to the IAllocator will use this extension to signal which
nodes to restrict to, instead of marking all other nodes offline.

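To make the intended semantics concrete, the sketch below shows the new
optional parameter as a plain list of node names inside an abridged, partly
hypothetical allocation request, together with the corresponding node
filtering; it is not the exact IAllocator wire format::

  def eligible_nodes(nodes, request):
      """Apply the proposed optional restriction on top of the existing
      eligibility rules (node online, requested group, ...)."""
      allowed = set(request.get("restrict-to-nodes",
                                [n["name"] for n in nodes]))
      return [n for n in nodes
              if n["name"] in allowed and not n.get("offline", False)]

  # With opportunistic locking, only the nodes actually locked are listed,
  # instead of marking every other node offline in the cluster description.
  request = {
      "type": "allocate",
      "restrict-to-nodes": ["node2.example.com", "node5.example.com"],
      # ... remaining allocation parameters unchanged ...
  }
  nodes = [{"name": "node1.example.com"},
           {"name": "node2.example.com"},
           {"name": "node5.example.com"}]
  print([n["name"] for n in eligible_nodes(nodes, request)])
  # -> ['node2.example.com', 'node5.example.com']
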
It should be noted that this change introduces a race: two concurrent
allocations might bring the cluster over the global N+1 capacity limit.
As, however, the reason for opportunistic locking is an urgent need for
instances, this seems acceptable; Ganeti generally follows the guideline
that current problems are more important than future ones. Also, even with
that change, allocation is more careful than the current approach of
completely ignoring N+1 redundancy for shared storage.