 ==================================================
 HTools support for multiple storage units per node
 ==================================================

 .. contents:: :depth: 4

 This design document describes changes to hbal and related components (first
 and foremost LUXI) that will allow them to handle nodes that can't be considered
 monolithic in regard to disk layout, for example because they have multiple
 different storage units available.

 Current state and shortcomings
 ==============================

 Currently the htools assume that there is one storage unit per node and that it can
 be arbitrarily split among instances. This leads to problems in clusters
 where multiple storage units are present: There might be 10GB DRBD and 10GB
 plain storage available on a node, for a total of 20GB. If an instance that
 uses 15GB of a single type of storage is requested, it can't actually fit on
 the node, but the current implementation of hail doesn't notice this.

 This behaviour is clearly wrong, but the problem doesn't arise often in current
 setups, because instances currently only have a single
 storage type and users typically use node groups to differentiate between
 different node storage layouts.

 For the node show action, RAPI only returns

 * ``dfree``: The total amount of free disk space
 * ``dtotal``: The total amount of disk space

 which is insufficient for the same reasons.


 Proposed changes
 ================

 Definitions
 -----------

 * All disks have exactly one *desired storage unit*, which determines where and
   how the disk can be stored. If the disk is transferred, the desired storage
   unit remains unchanged. The desired storage unit includes specifics like the
   volume group in the case of LVM based storage.
 * A *storage unit* is a specific storage location on a specific node. Storage
   units have exactly one desired storage unit they can contain. A storage unit
   further has an identifier (containing the storage type, a key and possibly
   parameters), a total capacity, and a free capacity. A node cannot
   contain multiple storage units of the same desired storage unit.
 * For the purposes of this document a *disk* has a desired storage unit and a size.
 * A *disk can be moved* to a node, if there is at least one storage unit on
   that node which can contain the desired storage unit of the disk and if the
   free capacity is at least the size of the disk.
 * An *instance can be moved* to a node, if all its disks can be moved there
   one-by-one.
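
 The definitions above can be sketched as a small data model. This is a minimal
 illustration, not the HTools implementation: the names ``StorageUnit``,
 ``Disk``, ``disk_can_move`` and ``instance_can_move`` are hypothetical, and
 "can contain" is assumed to mean that the two identifiers are equal. Each disk
 is checked on its own, so two disks competing for the same storage unit are
 not accounted for.

 .. code-block:: python

      from dataclasses import dataclass
      from typing import List, Tuple

      # Assumed identifier layout: (storage type, storage key, extra parameters).
      SUnit = Tuple[str, str, Tuple]


      @dataclass(frozen=True)
      class StorageUnit:
          sunit: SUnit  # the desired storage unit this location can contain
          free: int     # free capacity
          total: int    # total capacity


      @dataclass(frozen=True)
      class Disk:
          desired: SUnit  # desired storage unit of the disk
          size: int       # same unit of measure as StorageUnit.free


      def disk_can_move(disk: Disk, node_units: List[StorageUnit]) -> bool:
          """True if some unit on the node matches the disk and has enough room."""
          return any(
              u.sunit == disk.desired and u.free >= disk.size for u in node_units
          )


      def instance_can_move(disks: List[Disk], node_units: List[StorageUnit]) -> bool:
          """True if every disk of the instance can be moved to the node."""
          return all(disk_can_move(d, node_units) for d in disks)

 With two units of 10240 free each (one ``drbd8``, one ``lvm-vg``), a disk of
 size 15360 is rejected by ``disk_can_move`` even though the node has 20480
 free in total.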

 LUXI and IAllocator protocol extension
 --------------------------------------

 The LUXI and IAllocator protocols are extended to include in the ``node``:

 * ``storage``: a list of objects (storage units), each with

   #. Storage unit (``sunit``), containing in order:

      #. storage type
      #. storage key (e.g. volume group name)
      #. extra parameters (e.g. flag for exclusive storage) as a list.

   #. Amount free in MiB (``free``)
   #. Amount total in MiB (``total``)

 .. code-block:: javascript

     {
       "storage": [
         { "sunit": ["drbd8", "xenvg", []]
         , "free": 2000
         , "total": 4000
         },
         { "sunit": ["file", "/path/to/storage1", []]
         , "free": 5000
         , "total": 10000
         },
         { "sunit": ["file", "/path/to/storage2", []]
         , "free": 1000
         , "total": 20000
         },
         { "sunit": ["lvm-vg", "xenssdvg", [false]]
         , "free": 1024
         , "total": 1024
         }
       ]
     }

 The example above describes a node with an LVM volume group mirrored over
 DRBD, two file storage directories (one half full, one mostly full), and a
 non-mirrored volume group.

 The storage type ``drbd8`` needs to be added in order to differentiate between
 mirrored storage and non-mirrored storage.
 For ``drbd8``, the storage key names the volume group used, and the storage
 unit takes no additional parameters.
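
 Reading the ``storage`` field from a node object could look as follows. This
 is a sketch under assumptions: the function name ``parse_node_storage`` and
 the tuple layout of the result are hypothetical, and only the keys ``sunit``,
 ``free`` and ``total`` described above are relied upon.

 .. code-block:: python

      import json


      def parse_node_storage(node_json):
          """Return [(sunit, free, total), ...] or None if "storage" is absent.

          sunit is returned as a hashable tuple (type, key, params).
          """
          node = json.loads(node_json)
          if "storage" not in node:
              # Callers can use None to fall back to the old behaviour.
              return None
          units = []
          for entry in node["storage"]:
              stype, skey, params = entry["sunit"]
              units.append(((stype, skey, tuple(params)), entry["free"], entry["total"]))
          return units

 Returning ``None`` rather than an empty list keeps "field missing" distinct
 from "node has no storage units".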

 Text protocol extension
 -----------------------

 The same field is optionally present in the HTools text protocol:

 * a new "storage" column is added to the node section, which is a
   semicolon-separated list of comma-separated fields in the order

   #. ``free``
   #. ``total``
   #. ``sunit``, which in itself contains

      #. the storage type
      #. the storage key
      #. extra arguments

 For example::

     2000,4000,drbd,xenvg;5000,10000,file,/path/to/storage1;1000,20000;
     [...]
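
 A parser for the column could be sketched as below. The function name
 ``parse_storage_column`` is hypothetical, and the sketch assumes that storage
 keys contain neither commas nor semicolons and that every field after the
 storage key is an extra argument; the document does not specify escaping.
 Entries with fewer than four fields are rejected rather than guessed at.

 .. code-block:: python

      def parse_storage_column(column):
          """Parse "free,total,type,key[,extra...];..." into a list of tuples."""
          units = []
          for entry in column.split(";"):
              if not entry:
                  # Tolerate a trailing (or doubled) semicolon.
                  continue
              fields = entry.split(",")
              if len(fields) < 4:
                  raise ValueError("incomplete storage entry: %r" % entry)
              free, total, stype, skey, *extra = fields
              units.append(((stype, skey, tuple(extra)), int(free), int(total)))
          return units


      # Only the two complete entries of the example above are used here.
      parsed = parse_storage_column(
          "2000,4000,drbd,xenvg;5000,10000,file,/path/to/storage1"
      )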

 Interpretation
 --------------

 ``hbal`` and ``hail`` will use this information only if available; if the data
 file doesn't contain the ``storage`` field, the old algorithm is used.

 If the node information contains the ``storage`` field, hbal and hail will
 assume that only the space compatible with the disk's requirements is
 available. For an instance to fit a node, all its disks need to fit there
 separately. For a disk to fit a node, a storage unit of the type of
 the disk needs to have enough free space to contain it. The total free storage
 is not taken into consideration.
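
 The fallback rule can be illustrated with a check that works directly on a
 node dictionary. This is a hedged sketch: ``instance_fits`` is a hypothetical
 name, the key ``dfree`` is assumed here to carry the aggregate free space, and
 the "old algorithm" is simplified to comparing the summed disk sizes against
 that aggregate.

 .. code-block:: python

      def instance_fits(node, disks):
          """node: dict with optional "storage"; disks: [(sunit_list, size_mib)]."""
          storage = node.get("storage")
          if storage is None:
              # Old behaviour (simplified): one pool that can be split arbitrarily.
              return sum(size for _, size in disks) <= node["dfree"]
          # New behaviour: each disk needs a matching unit with enough free
          # space; the aggregate free space is ignored.
          return all(
              any(u["sunit"] == sunit and u["free"] >= size for u in storage)
              for sunit, size in disks
          )


      disks = [(["lvm-vg", "xenvg", []], 15360)]
      old_style = {"dfree": 20480}
      new_style = {
          "dfree": 20480,
          "storage": [
              {"sunit": ["drbd8", "xenvg", []], "free": 10240, "total": 10240},
              {"sunit": ["lvm-vg", "xenvg", []], "free": 10240, "total": 10240},
          ],
      }

 Here ``instance_fits(old_style, disks)`` accepts the instance, while
 ``instance_fits(new_style, disks)`` rejects it, matching the 10GB plus 10GB
 scenario from the shortcomings section.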

 Ignoring the old information will in theory introduce a backwards
 incompatibility: If the total free storage is smaller than the sum of the
 free storage reported in the ``storage`` field, a previously illegal move will
 become legal.

 Balancing
 ---------

 In order to determine a storage location for an instance, we collect metrics
 analogous to the current total node free space metric, namely the standard
 deviation statistic of the free space per storage unit.

 The *standard deviation metric* of a desired storage unit is the sample standard
 deviation of the percentage of free space of the storage units compatible
 with it.

 The *full storage metric* is the average of the standard deviation metrics of the
 desired storage units.
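
 The two metrics can be sketched with the standard library. This is an
 illustration under assumptions, not the HTools code: ``full_storage_metric``
 is a hypothetical name, free space is expressed as a fraction rather than a
 percentage, ``statistics.stdev`` is used because the text says "sample"
 standard deviation (``statistics.pstdev`` would be the population form), and a
 desired storage unit present on fewer than two nodes is assumed to
 contribute 0.

 .. code-block:: python

      from collections import defaultdict
      from statistics import mean, stdev


      def full_storage_metric(nodes):
          """nodes: one list per node of (sunit, free, total) tuples.

          sunit must be hashable, e.g. ("lvm-vg", "xenvg", ()).
          """
          ratios = defaultdict(list)
          for units in nodes:
              for sunit, free, total in units:
                  if total > 0:
                      # Nodes lacking a given sunit simply add nothing for it.
                      ratios[sunit].append(free / total)
          per_unit = [stdev(r) if len(r) > 1 else 0.0 for r in ratios.values()]
          return mean(per_unit) if per_unit else 0.0

 Grouping by ``sunit`` means each desired storage unit is compared only across
 the nodes that actually provide it.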

 This metric is backwards compatible insofar as

 #. For a single storage unit per node it will have the same value.
 #. The weight of the storage versus the other metrics remains unchanged.

 Further, this retains the property that scarce resources with low totals will
 tend to have a bigger impact on the metric than those with large totals, because
 in the latter case the relative differences will not make for a large standard
 deviation.
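
 A small numeric illustration of this property, with made-up figures: the same
 absolute imbalance of 500 MiB between two nodes is applied once to a scarce
 resource (1000 MiB per node) and once to a plentiful one (100000 MiB per
 node). The helper name ``free_ratio_stdev`` is hypothetical.

 .. code-block:: python

      from statistics import stdev


      def free_ratio_stdev(units):
          """Sample standard deviation of free/total over (free, total) pairs."""
          return stdev([free / total for free, total in units])


      # Both pairs differ by 500 MiB of free space between the two nodes.
      scarce = [(250, 1000), (750, 1000)]
      plentiful = [(49750, 100000), (50250, 100000)]

      scarce_metric = free_ratio_stdev(scarce)
      plentiful_metric = free_ratio_stdev(plentiful)

 The free fractions are 0.25 and 0.75 for the scarce resource but 0.4975 and
 0.5025 for the plentiful one, so ``scarce_metric`` is a hundred times larger
 than ``plentiful_metric``.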

 Ignoring nodes that do not contain the desired storage unit additionally
 boosts the importance of the scarce desired storage units, because having more
 storage units of a desired storage unit will tend to make the standard
 deviation metric smaller.