==================
Partitioned Ganeti
==================

.. contents:: :depth: 4

Current state and shortcomings
==============================

Currently Ganeti can be used to easily share a node between multiple
virtual instances. While it is easy to share a node on a purely "best
effort" basis, it is much harder to reserve resources exclusively for
a particular instance. In particular, this has to be done manually for
CPUs and disk, is implemented for RAM under Xen but not under KVM, and
there is no provision for network-level QoS.

Proposed changes
================

We want to make it easy to partition a node between instances, giving
each exclusive use of hardware resources. While some sharing will
still need to happen (e.g. for operations that use the host domain, or
that use resources, like buses, which are unique or very scarce on
host systems), we will strive to keep contention to a minimum, but
won't try to avoid every possible source of it.

Exclusive use of disks
----------------------

``exclusive_storage`` is a new node parameter. When it is enabled,
Ganeti will allocate entire disks to instances. While similar schemes
could be devised for other storage back-ends, this design targets only
``plain`` and ``drbd``; the name is generic enough in case the feature
is later extended to other back-ends. The flag value should be
homogeneous within a node group; ``cluster-verify`` will report any
violation of this condition.

Ganeti will consider each physical volume in the destination volume
group as a host disk (for proper isolation, an administrator should
make sure that there aren't multiple PVs on the same physical
disk). When ``exclusive_storage`` is enabled in a node group, all PVs
in the node group must have the same size (within a certain margin, say
1%, defined through a new parameter). Ganeti will check this condition
when the ``exclusive_storage`` flag is set, whenever a new node is added
and as part of ``cluster-verify``.
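
As an illustration only, the uniformity check could look roughly like
the following sketch; the function and parameter names are
hypothetical, not the final implementation::

  def pv_sizes_homogeneous(pv_sizes, margin=0.01):
      """Check that all PV sizes in a node group are within ``margin``.

      @param pv_sizes: PV sizes (e.g. in MiB) of one node group
      @param margin: allowed relative deviation (1% in this design)

      """
      if not pv_sizes:
          return True
      smallest, biggest = min(pv_sizes), max(pv_sizes)
      return biggest - smallest <= smallest * margin
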

When creating a new disk for an instance, Ganeti will allocate the
minimum number of PVs to hold the disk, and those PVs will be excluded
from the pool of available PVs for further disk creations. The
underlying LV will be striped, when striping is allowed by the current
configuration. Ganeti will continue to track only the LVs, and query the
LVM layer to figure out which PVs are available and how much space is
free. However, creation, disk growing, and free-space reporting will
ignore any partially allocated PVs, so that PVs won't be shared between
instance disks.

For compatibility with the DRBD template and to take into account disk
variability, Ganeti will always subtract 2% (this will be a parameter)
from the PV space when calculating how many PVs are needed to allocate
an instance and when nodes report free space.
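
For illustration, the resulting arithmetic might look like the
following sketch; the function name and the exact rounding are
assumptions, and 2% is the default reserve mentioned above::

  import math

  def pvs_needed(disk_size, pv_size, reserved_fraction=0.02):
      """Minimum number of PVs needed to hold a disk of ``disk_size``.

      @param disk_size: requested disk size (same unit as pv_size)
      @param pv_size: size of a single PV
      @param reserved_fraction: fraction of each PV kept unused

      """
      usable = pv_size * (1.0 - reserved_fraction)
      return int(math.ceil(float(disk_size) / usable))

  # Example: a 500 GiB disk on 200 GiB PVs needs
  # ceil(500 / (200 * 0.98)) = ceil(2.55...) = 3 PVs.
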

The obvious target for this option is the ``plain`` disk template,
which doesn't provide redundancy. An administrator can still provide
resilience against disk failures by setting up RAID under the PVs, but
this is transparent to Ganeti.

Spindles as a resource
~~~~~~~~~~~~~~~~~~~~~~

When resources are dedicated and there are more spindles than instances
on a node, it is natural to assign more spindles to instances than what
is strictly needed. For this reason, we introduce a new resource:
spindles. A spindle is a PV in LVM. The number of spindles required for
a disk of an instance is specified together with the size. Specifying
the number of spindles is possible only when ``exclusive_storage`` is
enabled. It is an error to specify a number of spindles insufficient to
contain the requested disk size.
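
The error condition above could be checked along these lines; this is
a sketch reusing the same assumed 2% reserve, with hypothetical names::

  import math

  def check_spindles(disk_size, spindles, pv_size,
                     reserved_fraction=0.02):
      """Reject a spec whose spindles cannot hold the requested size."""
      usable = pv_size * (1.0 - reserved_fraction)
      needed = int(math.ceil(float(disk_size) / usable))
      if spindles < needed:
          raise ValueError("%s spindles cannot hold a disk of size %s"
                           " (at least %s needed)" %
                           (spindles, disk_size, needed))
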

When ``exclusive_storage`` is not enabled, spindles are not used in
free-space calculations, allocation algorithms, or policies. When it is
enabled, ``hspace``, ``hbal``, and allocators will use spindles instead
of disk size for their computations. For each node, the number of
spindles in every LVM volume group is recorded, and different volume
groups are accounted for separately in allocation and balancing.

There is already a concept of spindles in Ganeti. It's not related to
any actual spindle or volume count, but it's used in ``spindle_use`` to
measure the pressure of an instance on the storage system and in
``spindle_ratio`` to balance the I/O load on the nodes. When
``exclusive_storage`` is enabled, these parameters as currently defined
would no longer make sense, so their meaning will change as follows:

- ``spindle_use`` refers to the resource, hence to the actual spindles
  (PVs in LVM) used by an instance. The values specified in the
  instance policy specifications are compared to the run-time number of
  spindles used by an instance. The ``spindle_use`` back-end parameter
  will be ignored.
- ``spindle_ratio`` in instance policies and ``spindle_count`` in node
  parameters are ignored, as the exclusive assignment of PVs already
  implies a value of 1.0 for the first, and the second is replaced by
  the actual number of spindles.

When ``exclusive_storage`` is disabled, the existing spindle parameters
behave as before.

Dedicated CPUs
--------------

``vcpu_ratio`` can be used to tie the number of VCPUs to the number of
CPUs provided by the hardware. We need to take into account the CPU
usage of the hypervisor. For Xen, this means counting the number of
VCPUs assigned to ``Domain-0``.

For KVM, it's more difficult to limit the number of CPUs used by the
node OS. ``cgroups`` could be a solution to restrict the node OS to a
subset of the CPUs, leaving the others to instances and KVM processes.
For KVM, the number of CPUs reserved for the host system should also be
a hypervisor parameter (set at the node-group level).
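
As a sketch, and assuming a parameter holding the number of CPUs
reserved for the host (``Domain-0`` VCPUs on Xen, the node-OS cgroup on
KVM), the CPUs available for instances would be computed roughly as::

  def cpus_for_instances(node_cpus, host_cpus, vcpu_ratio=1.0):
      """VCPUs that can be handed out to instances on a node.

      @param node_cpus: physical CPUs available on the node
      @param host_cpus: CPUs reserved for the hypervisor/host OS
      @param vcpu_ratio: the existing ``vcpu_ratio`` parameter

      """
      return int((node_cpus - host_cpus) * vcpu_ratio)

  # Example: 32 CPUs with 2 reserved for Domain-0 and a ratio of 1.0
  # leave 30 VCPUs for exclusive assignment to instances.
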

Dedicated RAM
-------------

Instances should not compete for RAM. This is easily done on Xen, but it
is tricky on KVM.

Xen
~~~

Memory is already fully segregated under Xen, if sharing mechanisms
(transcendent memory, auto ballooning, etc.) are not in use.

KVM
~~~

Under KVM or LXC, memory is fully shared between the host system and
all the guests, and instances can even be swapped out by the host OS.

It's not clear if the problem can be solved by limiting the size of the
instances, so that there is plenty of room for the host OS.

We could implement segregation using cgroups to limit the memory used by
the host OS. This requires finishing the implementation of the memory
hypervisor status (set at the node group level) that changes how free
memory is computed under KVM systems. Then we have to add a way to
enforce this limit on the host system itself, rather than leaving it as
a calculation tool only.

Another problem for KVM is that we need to decide about the size of the
cgroup versus the size of the VM: some overhead will exist in any case,
because an instance and its encapsulating KVM process share the same
memory space. For KVM systems, the physical memory allocatable to
instances should therefore be computed by subtracting an overhead for
the KVM processes, whose value can either be statically configured or
set in a hypervisor status parameter.
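
An illustrative computation of the memory allocatable to instances on
a KVM node, with a hypothetical per-process overhead parameter, could
look like::

  def kvm_allocatable_memory(node_memory, host_memory, num_instances,
                             overhead_per_instance):
      """Memory left for instance RAM on a KVM node (all in MiB).

      @param node_memory: total physical memory of the node
      @param host_memory: memory reserved for the host OS (cgroup)
      @param num_instances: number of KVM processes on the node
      @param overhead_per_instance: estimated per-KVM-process overhead

      """
      return (node_memory - host_memory
              - num_instances * overhead_per_instance)
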

NUMA
~~~~

If instances are pinned to CPUs, and the amount of memory used for every
instance is proportionate to the number of VCPUs, NUMA shouldn't be a
problem, as the hypervisors allocate memory in the appropriate NUMA
node. Work is in progress in Xen and the Linux kernel to always allocate
memory correctly even without pinning. Therefore, we don't need to
address this problem specifically; it will be solved by future versions
of the hypervisors or by implementing CPU pinning.

Constrained instance sizes
--------------------------

In order to simplify allocation and resource provisioning we want to
limit the possible sizes of instances to a finite set of specifications,
defined at node-group level.

Currently it's possible to define an instance policy that limits the
minimum and maximum values for CPU, memory, and disk usage (and
spindles and any other resource, when implemented), independently of
each other. We extend the policy by allowing it to contain multiple
occurrences of such (minimum, maximum) specification pairs for the
instance resources. Each specification pair has a unique priority
associated with it (in other words, specifications are ordered), which
is used by ``hspace`` (see below). The standard specification doesn't
change: there is one for the whole cluster.

For example, a policy could be set up to allow instances with these
constraints:

- between 1 and 2 CPUs, 2 GB of RAM, and between 10 GB and 400 GB of
  disk space;
- 4 CPUs, 4 GB of RAM, and between 10 GB and 800 GB of disk space.

Then, an instance using 1 CPU, 2 GB of RAM, and 50 GB of disk would be
legal, as would an instance using 4 CPUs, 4 GB of RAM, and 20 GB of
disk, while an instance using 2 CPUs, 4 GB of RAM, and 40 GB of disk
would be illegal.
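
A minimal sketch of the resulting check, encoding the example above;
the data layout (a list of per-resource ``(min, max)`` bounds) is
illustrative and not the actual instance-policy format::

  SPECS = [
      # 1-2 CPUs, 2 GB RAM, 10-400 GB disk (memory and disk in MiB)
      {"cpus": (1, 2), "memory": (2048, 2048),
       "disk": (10240, 409600)},
      # 4 CPUs, 4 GB RAM, 10-800 GB disk
      {"cpus": (4, 4), "memory": (4096, 4096),
       "disk": (10240, 819200)},
  ]

  def instance_is_legal(instance, specs=SPECS):
      """Check whether an instance fits at least one specification."""
      return any(all(lo <= instance[res] <= hi
                     for res, (lo, hi) in spec.items())
                 for spec in specs)

  # True per the first specification above:
  instance_is_legal({"cpus": 1, "memory": 2048, "disk": 51200})
  # False (matches neither specification):
  instance_is_legal({"cpus": 2, "memory": 4096, "disk": 40960})
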

Ganeti will refuse to create (or modify) instances that violate instance
policy constraints, unless the flag ``--ignore-ipolicy`` is passed.

While the changes needed to check constraint violations are
straightforward, ``hspace`` behavior needs some adjustments for tiered
allocation. ``hspace`` will start to allocate instances using the
maximum specification with the highest priority, then it will try to
lower the most constrained resources (without breaking the policy)
before moving to the second highest priority, and so on.
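
Conceptually, and greatly simplified, the adjusted tiered-allocation
loop would look like the following sketch, where the two callbacks
stand in for the actual placement and shrinking logic of ``hspace``::

  def tiered_allocation(specs, try_allocate, shrink_most_constrained):
      """Sketch of the adjusted hspace tiered-allocation loop.

      @param specs: maximum specifications, ordered by priority
      @param try_allocate: places one instance of the given size,
          returning True on success
      @param shrink_most_constrained: returns a smaller size within
          the current specification, or None if none is possible

      """
      placed = 0
      for spec in specs:
          size = spec  # start from the maximum specification
          while size is not None:
              if try_allocate(size):
                  placed += 1
              else:
                  size = shrink_most_constrained(size, spec)
      return placed
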

For consistent results in capacity calculation, the specifications
inside a policy should be ordered so that the biggest specifications
have the highest priorities. Also, specifications should not overlap.
Ganeti will neither check nor enforce such constraints, though.

Implementation order
====================

We will implement this design in the following order:

- Exclusive use of disks (without spindles as a resource)
- Constrained instance sizes
- Spindles as a resource
- Dedicated CPU and memory

In this way, each step delivers new features that are immediately
useful. Spindles as a resource are not needed for correct capacity
calculation, as long as the allowed disk sizes are multiples of the
spindle size, so that item has been moved after constrained instance
sizes. If it turns out that it's easier to implement dedicated disks
with spindles as a resource, then we will do that.

Possible future enhancements
============================

This section briefly describes some enhancements to the current design.
They may require their own design document, and must be re-evaluated
when considered for implementation, as Ganeti and the hypervisors may
change substantially in the meantime.

Network bandwidth
-----------------

A new resource is introduced: network bandwidth. An administrator must
be able to assign some network bandwidth to the virtual interfaces of an
instance, and set limits in instance policies. Also, a list of the
physical network interfaces available for Ganeti use and their maximum
bandwidth must be kept at node-group or node level. This information
will be taken into account for allocation, balancing, and free-space
calculation.

An additional enhancement would be for Ganeti to enforce the values set
in the bandwidth resource. This can be done by configuring limits, for
example via openvswitch or normal QoS for bridging or routing. The
bandwidth resource represents the average bandwidth usage, so a few new
back-end parameters are needed to configure how bursts are handled
(they depend on the actual mechanism used to enforce the limit).

CPU pinning
-----------

In order to avoid unwarranted migrations between CPUs and to deal with
NUMA effectively, we may need CPU pinning. CPU scheduling is a complex
topic and still under active development in Xen and the Linux kernel,
so we won't try to outsmart their developers. If we need pinning, it's
more to have predictable performance than to get the maximum
performance (which is best left to the hypervisor), so we'll implement
a very simple algorithm that allocates CPUs when an instance is
assigned to a node (either when it's created or when it's moved) and
takes into account NUMA and maybe CPU multithreading. A more refined
version might also run when an instance is deleted, but that would
involve reassigning CPUs, which could be bad with NUMA.
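
A sketch of such a simple, NUMA-aware allocation step, run when an
instance is assigned to a node; the data structures are hypothetical::

  def pin_instance_cpus(free_cpus_by_node, vcpus):
      """Pick ``vcpus`` free physical CPUs for a new instance.

      Prefer a single NUMA node; spill over several nodes only when
      no single node has enough free CPUs left.

      @param free_cpus_by_node: dict mapping a NUMA node id to the
          list of its free CPU ids
      @param vcpus: number of CPUs to reserve for the instance

      """
      for _, cpus in sorted(free_cpus_by_node.items()):
          if len(cpus) >= vcpus:
              return [cpus.pop() for _ in range(vcpus)]
      chosen = []
      for _, cpus in sorted(free_cpus_by_node.items()):
          while cpus and len(chosen) < vcpus:
              chosen.append(cpus.pop())
      if len(chosen) < vcpus:
          raise ValueError("not enough free CPUs to pin the instance")
      return chosen
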

Overcommit for RAM and disks
----------------------------

Right now it is possible to assign more VCPUs to the instances running
on a node than there are CPUs available. This works because, on
average, CPU usage is normally well below 100%. There are ways to share
memory pages (e.g. KSM, transcendent memory) and disk blocks, so we
could add new parameters to overcommit memory and disks, similar to
``vcpu_ratio``.
.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: