doc/design-move-instance-improvements.rst - ganeti - Git at Google

 ========================================
 Instance move improvements
 ========================================

 .. contents:: :depth: 3

 Ganeti provides tools for moving instances within and between clusters. Through
 special export and import calls, a new instance is created with the disk data of
 the existing one.

 The tools work correctly and reliably, but depending on bandwidth and priority,
 an instance disk of considerable size requires a long time to transfer. The
 length of the transfer is inconvenient at best, but the problem becomes only
 worse if excessive locking causes a move operation to be delayed for a longer
 period of time, or to block other operations.

 The performance of moves is a complex topic, with available bandwidth,
 compression, and encryption all being candidates for choke points that bog down
 a transfer. Depending on the environment a move is performed in, tuning these
 can have significant performance benefits, but Ganeti does not expose many
 options needed for such tuning. The details of what to expose and what tradeoffs
 can be made will be presented in this document.

 Apart from existing functionality, some beneficial features can be introduced to
 help with instance moves. Zeroing empty space on instance disks can be useful
 for drastically improving the qualities of compression, effectively not needing
 to transfer unused disk space during moves. Compression itself can be improved
 by using different tools. The encryption used can be weakened or eliminated for
 certain moves. Using opportunistic locking during instance moves results in
 greater parallelization. As all of these approaches aim to tackle two different
 aspects of the problem, they do not exclude each other and will be presented
 independently.

 The performance of Ganeti moves
 ===============================

 In the current implementation, there are three possible factors limiting the
 speed of an instance move. The first is the network bandwidth, which Ganeti can
 exploit better by using compression. The second is the encryption, which is
 obligatory, and which can throttle an otherwise fast connection. The third is
 surprisingly the compression, which can cause the connection to be
 underutilized.

 Example 1: some numbers present during an intra-cluster instance move:

 * Network bandwidth: 105MB/s, courtesy of a gigabit switch

 * Encryption performance: 40MB/s, provided by OpenSSL

 * Compression performance: 22.3MB/s input, 7.1MB/s gzip compressed output

 As can be seen in this example, the obligatory encryption results in 62% of
 available bandwidth being wasted, while using compression further lowers the
 throughput to 55% of what the encryption would allow. The following sections
 will talk about these numbers in more detail, and suggest improvements and best
 practices.

 Encryption and Ganeti security
 ++++++++++++++++++++++++++++++

 Turning compression and encryption off would allow for an immediate improvement,
 and while that is possible for compression, there are good reasons why
 encryption is currently not a feature a user can disable.

 While it is impossible to secure instance data if an attacker gains SSH access
 to a node, the RAPI was designed to never allow user data to be accessed through
 it in case of being compromised. If moves could be performed unencrypted, this
 property would be broken. Instance moves can take place in environments which
 may be hostile, and where unencrypted traffic could be intercepted. As they can
 be instigated through the RAPI, an attacker could access all data on all
 instances in a cluster by moving them unencrypted and intercepting the data in
 flight. This is one of the few situations where the current speed of instance
 moves could be considered a perk.

 The performance of encryption can be increased by either using a less secure
 form of encryption, including no encryption, or using a faster encryption
 algorithm. The example listed above utilizes AES-256, one of the few ciphers
 that Ganeti deems secure enough to use. AES-128, also allowed by Ganeti's
 current settings, is weaker but 46% faster. A cipher that is not allowed due to
 its flaws, such as RC4, could offer a 208% increase in speed. On the other hand,
 using an OS capable of utilizing the AES_NI chip present on modern hardware
 can double the performance of AES, making it the best tradeoff between security
 and performance.

 Ganeti cannot and should not detect all the factors listed above, but should
 rather give its users some leeway in what to choose. A precedent already exists,
 as intra-cluster DRBD replication is already performed unencrypted, albeit on a
 separate VLAN. For intra-cluster moves, Ganeti should allow its users to set
 OpenSSL ciphers at will, while still enforcing high-security settings for moves
 between clusters.

 Thus, two settings will be introduced:

 * a cluster-level setting called ``--allow-cipher-bypassing``, a boolean that
   cannot be set over RAPI

 * a gnt-instance move setting called ``--ciphers-to-use``, bypassing the default
   cipher list with given ciphers, filtered to ensure no other OpenSSL options
   are passed in within

 This change will serve to address the issues with moving non-redundant instances
 within the cluster, while keeping Ganeti security at its current level.

 Compression
 +++++++++++

 Support for disk compression during instance moves was partially present before,
 but cleaned up and unified under the ``--compress`` option only as of Ganeti
 2.11. The only option offered by Ganeti is gzip with no options passed to it,
 resulting in a good compression ratio, but bad compression speed.

 As compression can affect the speed of instance moves significantly, it is
 worthwhile to explore alternatives. To test compression tool performance, an 8GB
 drive filled with data matching the expected usage patterns (taken from a
 workstation) was compressed by using various tools with various settings. The
 two top performers were ``lzop`` and, surprisingly, ``gzip``. The improvement in
 the performance of ``gzip`` was obtained by explicitly optimizing for speed
 rather than compression.

 * ``gzip -6``: 22.3MB/s in, 7.1MB/s out
 * ``gzip -1``: 44.1MB/s in, 15.1MB/s out
 * ``lzop``: 71.9MB/s in, 28.1MB/s out

 If encryption is the limiting factor, and as in the example, limits the
 bandwidth to 40MB/s, ``lzop`` allows for an effective 79% increase in transfer
 speed. The fast ``gzip`` would also prove to be beneficial, but much less than
 ``lzop``. It should also be noted that as a rule of thumb, tools with a lower
 compression ratio had a lesser workload, with ``lzop`` straining the CPU much
 less than any of the competitors.

 With the test results present here, it is clear that ``lzop`` would be a very
 worthwhile addition to the compression options present in Ganeti, yet the
 problem is that it is not available by default on all distributions, as the
 option's presence might imply. In general, Ganeti may know how to use several
 tools, and check for their presence, but should add some way of at least hinting
 at which tools are available.

 Additionally, the user might want to use a tool that Ganeti did not account for.
 Allowing the tool to be named is also helpful, both for cases when multiple
 custom tools are to be used, and for distinguishing between various tools in
 case of e.g. inter-cluster moves.

 To this end, the ``--compression-tools`` cluster parameter will be added to
 Ganeti. It contains a list of names of compression tools that can be supplied as
 the parameter of ``--compress``, and by default it contains all the tools
 Ganeti knows how to use. The user can change the list as desired, removing
 entries that are not or should not be available on the cluster, and adding
 custom tools.

 Every custom tool is identified by its name, and Ganeti expects the name to
 correspond to a script invoking the compression tool. Without arguments, the
 script compresses input on stdin, outputting it on stdout. With the -d argument,
 the script does the same, only while decompressing. The -h argument is used to
 check for the presence of the script, and in this case, only the error code is
 examined. This syntax matches the ``gzip`` syntax well, which should allow most
 compression tools to be adapted to it easily.

 Ganeti will not allow arbitrary parameters to be passed to a compression tool,
 and will restrict the names to contain only a small but assuredly safe subset of
 characters - alphanumeric values and dashes and underscores. This minimizes the
 risk of security issues that could arise from an attacker smuggling a malicious
 command through RAPI. Common variations, like the speed/compression tradeoff of
 ``gzip``, will be handled by aliases, e.g. ``gzip-fast`` or ``gzip-slow``.

 It should also be noted that for some purposes - e.g. the writing of OVF files,
 ``gzip`` is the only allowed means of compression, and an appropriate error
 message should be displayed if the user attempts to use one of the other
 provided tools.

 Zeroing instance disks
 ======================

 While compression lowers the amount of data sent, further reductions can be
 achieved by taking advantage of the structure of the disk - namely, sending only
 used disk sectors.

 There is no direct way to achieve this, as it would require that the
 move-instance tool is aware of the structure of the file system. Mounting the
 filesystem is not an option, primarily due to security issues. A disk primed to
 take advantage of a disk driver exploit could cause an attacker to breach
 instance isolation and gain control of a Ganeti node.

 An indirect way for this performance gain to be achieved is the zeroing of any
 hard disk space not in use. While this primarily means empty space, swap
 partitions can be zeroed as well.

 Sequences of zeroes can be compressed and thus transferred very efficiently, all
 without the host knowing that these are empty space. This approach can also be
 dangerous if a sparse disk is zeroed in this way, causing ballooning. As Ganeti
 does not seem to make special concessions for moving sparse disks, the only
 difference should be the disk space utilization on the current node.

 Zeroing approaches
 ++++++++++++++++++

 Zeroing is a feasible approach, but the node cannot perform it as it cannot
 mount the disk. Only virtualization-based options remain, and of those, using
 Ganeti's own virtualization capabilities makes the most sense. There are two
 ways of doing this - creating a new helper instance, temporary or persistent, or
 reusing the target instance.

 Both approaches have their disadvantages. Creating a new helper instance
 requires managing its lifecycle, taking special care to make sure no helper
 instance remains left over due to a failed operation. Even if this were to be
 taken care of, disks are not yet separate entities in Ganeti, making the
 temporary transfer of disks between instances hard to implement and even harder
 to make robust. The reuse can be done by modifying the OS running on the
 instance to perform the zeroing itself when notified via the new instance
 communication mechanism, but this approach is neither generic, nor particularly
 safe. There is no guarantee that the zeroing operation will not interfere with
 the normal operation of the instance, nor that it will be completed if a
 user-initiated shutdown occurs.

 A better solution can be found by combining the two approaches - re-using the
 virtualized environment, but with a specifically crafted OS image. With the
 instance shut down as it should be in preparation for the move, it can be
 extended with an additional disk with the OS image on it. By prepending the
 disk and changing some instance parameters, the instance can boot from it. The
 OS can be configured to perform the zeroing on startup, attempting to mount any
 partitions with a filesystem present, and creating and deleting a zero-filled
 file on them. After the zeroing is complete, the OS should shut down, and the
 master should note the shutdown and restore the instance to its previous state.

 Note that the requirements above are very similar to the notion of a helper VM
 suggested in the OS install document. Some potentially unsafe actions are
 performed within a virtualized environment, acting on disks that belong or will
 belong to the instance. The mechanisms used will thus be developed with both
 approaches in mind.

 Implementation
 ++++++++++++++

 There are two components to this solution - the Ganeti changes needed to boot
 the OS, and the OS image used for the zeroing. Due to the variety of filesystems
 and architectures that instances can use, no single ready-to-run disk image can
 satisfy the needs of all the Ganeti users. Instead, the instance-debootstrap
 scripts can be used to generate a zeroing-capable OS image. This might not be
 ideal, as there are lightweight distributions that take up less space and boot
 up more quickly. Generating those with the right set of drivers for the
 virtualization platform of choice is not easy. Thus we do not provide a script
 for this purpose, but the user is free to provide any OS image which performs
 the necessary steps: zero out all virtualization-provided devices on startup,
 shutdown immediately. The cluster-wide parameter controlling the image to be
 used would be called ``--zeroing-image``.

 The modifications to Ganeti code needed are minor. The zeroing functionality
 should be implemented as an extension of the instance export, and exposed as the
 ``--zero-free-space option``. Prior to beginning the export, the instance
 configuration is temporarily extended with a new read-only disk of sufficient
 size to host the zeroing image, and the changes necessary for the image to be
 used as the boot drive. The temporary nature of the configuration changes
 requires that they are not propagated to other nodes. While this would normally
 not be feasible with an instance using a disk template offering multi-node
 redundancy, experiments with the code have shown that the restriction on
 diverse disk templates can be bypassed to temporarily allow a plain
 disk-template disk to host the zeroing image. Given that one of the planned
 changes in Ganeti is to have instance disks as separate entities, with no
 restriction on templates, this assumption is useful rather than harmful by
 asserting the desired behavior. The image is dumped to the disk, and the
 instance is started up.

 Once the instance is started up, the zeroing will proceed until completion, when
 a self-initiated shutdown will occur. The instance-shutdown detection
 capabilities of 2.11 should prevent the watcher from restarting the instance
 once this happens, allowing the host to take it as a sign the zeroing was
 completed. Either way, the host waits until the instance is shut down, or a
 timeout has been reached and the instance is forcibly shut down. As the time
 needed to zero an instance is dependent on the size of the disk of the instance,
 the user can provide a fixed and a per-size timeout, recommended to be set to
 twice the maximum write speed of the device hosting the instance.

 Better progress monitoring can be implemented with the instance-host
 communication channel proposed by the OS install design document. The first
 version will most likely use only the shutdown detection, and will be improved
 to account for the available communication channel at a later time.

 After the shutdown, the temporary disk is destroyed and the instance
 configuration is reverted to its original state. The very same action is done if
 any error is encountered during the zeroing process. In the case that the
 zeroing is interrupted while the zero-filled file is being written, the file may
 remain on the disk of the instance. The script that performs the zeroing will be
 made to react to system signals by deleting the zero-filled file, but there is
 little else that can be done to recover.

 When to use zeroing
 +++++++++++++++++++

 The question of when it is useful to use zeroing is hard to answer because the
 effectiveness of the approach depends on many factors. All compression tools
 compress zeroes to almost nothingness, but compressing them takes time. If the
 time needed to compress zeroes were equal to zero, the approach would boil down
 to whether it is faster to zero unused space out, performing writes to disk, or
 to transfer it compressed. For the example used above, the average compression
 ratio, and write speeds of current disk drives, the answer would almost
 unanimously be yes.

 With a more realistic setup, where zeroes take time to compress, yet less time
 than ordinary data, the gains depend on the previously mentioned tradeoff and
 the free space available. Zeroing will definitely lessen the amount of bandwidth
 used, but it can lead to the connection being underutilized due to the time
 spent compressing data. It is up to the user to make these tradeoffs, but
 zeroing should be seen primarily as a means of further reducing the amount of
 data sent while increasing disk activity, with possible speed gains that should
 not be relied upon.

 In the future, the VM created for zeroing could also undertake other tasks
 related to the move, such as compression and encryption, and produce a stream
 of data rather than just modifying the disk. This would lessen the strain on
 the resources of the hypervisor, both disk I/O and CPU usage, and allow moves to
 obey the resource constraints placed on the instance being moved.

 Lock reduction
 ==============

 An instance move as executed by the move-instance tool consists of several
 preparatory RAPI calls, leading up to two long-lasting opcodes: OpCreateInstance
 and OpBackupExport. While OpBackupExport locks only the instance, the locks of
 OpCreateInstance require more attention.

 When executed, this opcode attempts to lock all nodes on which the instance may
 be created and obtain shared locks on the groups they belong to. In the case
 that an IAllocator is used, this means all nodes must be locked. Any operation
 that requires a node lock to be present can delay the move operation, and there
 is no shortage of these.

 The concept of opportunistic locking has been introduced to remedy exactly this
 situation, allowing the IAllocator to lock as many nodes as possible. Depending
 whether the allocation can be made on these nodes, the operation either proceeds
 as expected, or fails noting that it is temporarily infeasible. The failure case
 would change the semantics of the move-instance tool, which is expected to fail
 only if the move is impossible. To yield the benefits of opportunistic locking
 yet satisfy this constraint, the move-instance tool can be extended with the
 --opportunistic-tries and --opportunistic-try-delay options. A number of
 opportunistic instance creations are attempted, with a delay between attempts.
 The delay is slightly altered every time to avoid timing issues. Should all
 attempts fail, a normal instance creation is requested, which blocks until all
 the locks can be acquired.

 While it may seem excessive to grab so many node locks, the early release
 mechanism is used to make the situation less dire, releasing all nodes that were
 not chosen as candidates for allocation. This is taken to the extreme as all the
 locks acquired are released prior to the start of the transfer, barring the
 newly-acquired lock over the new instance. This works because all operations
 that alter the node in a way which could affect the transfer:

 * are prevented by the instance lock or instance presence, e.g. gnt-node remove,
   gnt-node evacuate,

 * do not interrupt the transfer, e.g. a PV on the node can be set as
   unallocatable, and the transfer still proceeds as expected,

 * do not care, e.g. a gnt-node powercycle explicitly ignores all locks.

 This invariant should be kept in mind, and perhaps verified through tests.

 All in all, there is very little space to reduce the number of locks used, and
 the only improvement that can be made is introducing opportunistic locking as an
 option of move-instance.

 Introduction of changes
 =======================

 All the changes noted will be implemented in Ganeti 2.12, in the way described
 in the previous chapters. They will be implemented as separate changes, first
 the lock reduction, then the instance zeroing, then the compression
 improvements, and finally the encryption changes.
	========================================
	Instance move improvements
	========================================

	.. contents:: :depth: 3

	Ganeti provides tools for moving instances within and between clusters. Through
	special export and import calls, a new instance is created with the disk data of
	the existing one.

	The tools work correctly and reliably, but depending on bandwidth and priority,
	an instance disk of considerable size requires a long time to transfer. The
	length of the transfer is inconvenient at best, but the problem becomes only
	worse if excessive locking causes a move operation to be delayed for a longer
	period of time, or to block other operations.

	The performance of moves is a complex topic, with available bandwidth,
	compression, and encryption all being candidates for choke points that bog down
	a transfer. Depending on the environment a move is performed in, tuning these
	can have significant performance benefits, but Ganeti does not expose many
	options needed for such tuning. The details of what to expose and what tradeoffs
	can be made will be presented in this document.

	Apart from existing functionality, some beneficial features can be introduced to
	help with instance moves. Zeroing empty space on instance disks can be useful
	for drastically improving the qualities of compression, effectively not needing
	to transfer unused disk space during moves. Compression itself can be improved
	by using different tools. The encryption used can be weakened or eliminated for
	certain moves. Using opportunistic locking during instance moves results in
	greater parallelization. As all of these approaches aim to tackle two different
	aspects of the problem, they do not exclude each other and will be presented
	independently.

	The performance of Ganeti moves
	===============================

	In the current implementation, there are three possible factors limiting the
	speed of an instance move. The first is the network bandwidth, which Ganeti can
	exploit better by using compression. The second is the encryption, which is
	obligatory, and which can throttle an otherwise fast connection. The third is
	surprisingly the compression, which can cause the connection to be
	underutilized.

	Example 1: some numbers present during an intra-cluster instance move:

	* Network bandwidth: 105MB/s, courtesy of a gigabit switch

	* Encryption performance: 40MB/s, provided by OpenSSL

	* Compression performance: 22.3MB/s input, 7.1MB/s gzip compressed output

	As can be seen in this example, the obligatory encryption results in 62% of
	available bandwidth being wasted, while using compression further lowers the
	throughput to 55% of what the encryption would allow. The following sections
	will talk about these numbers in more detail, and suggest improvements and best
	practices.

	Encryption and Ganeti security
	++++++++++++++++++++++++++++++

	Turning compression and encryption off would allow for an immediate improvement,
	and while that is possible for compression, there are good reasons why
	encryption is currently not a feature a user can disable.

	While it is impossible to secure instance data if an attacker gains SSH access
	to a node, the RAPI was designed to never allow user data to be accessed through
	it in case of being compromised. If moves could be performed unencrypted, this
	property would be broken. Instance moves can take place in environments which
	may be hostile, and where unencrypted traffic could be intercepted. As they can
	be instigated through the RAPI, an attacker could access all data on all
	instances in a cluster by moving them unencrypted and intercepting the data in
	flight. This is one of the few situations where the current speed of instance
	moves could be considered a perk.

	The performance of encryption can be increased by either using a less secure
	form of encryption, including no encryption, or using a faster encryption
	algorithm. The example listed above utilizes AES-256, one of the few ciphers
	that Ganeti deems secure enough to use. AES-128, also allowed by Ganeti's
	current settings, is weaker but 46% faster. A cipher that is not allowed due to
	its flaws, such as RC4, could offer a 208% increase in speed. On the other hand,
	using an OS capable of utilizing the AES_NI chip present on modern hardware
	can double the performance of AES, making it the best tradeoff between security
	and performance.

	Ganeti cannot and should not detect all the factors listed above, but should
	rather give its users some leeway in what to choose. A precedent already exists,
	as intra-cluster DRBD replication is already performed unencrypted, albeit on a
	separate VLAN. For intra-cluster moves, Ganeti should allow its users to set
	OpenSSL ciphers at will, while still enforcing high-security settings for moves
	between clusters.

	Thus, two settings will be introduced:

	* a cluster-level setting called ``--allow-cipher-bypassing``, a boolean that
	cannot be set over RAPI

	* a gnt-instance move setting called ``--ciphers-to-use``, bypassing the default
	cipher list with given ciphers, filtered to ensure no other OpenSSL options
	are passed in within

	This change will serve to address the issues with moving non-redundant instances
	within the cluster, while keeping Ganeti security at its current level.

	Compression
	+++++++++++

	Support for disk compression during instance moves was partially present before,
	but cleaned up and unified under the ``--compress`` option only as of Ganeti
	2.11. The only option offered by Ganeti is gzip with no options passed to it,
	resulting in a good compression ratio, but bad compression speed.

	As compression can affect the speed of instance moves significantly, it is
	worthwhile to explore alternatives. To test compression tool performance, an 8GB
	drive filled with data matching the expected usage patterns (taken from a
	workstation) was compressed by using various tools with various settings. The
	two top performers were ``lzop`` and, surprisingly, ``gzip``. The improvement in
	the performance of ``gzip`` was obtained by explicitly optimizing for speed
	rather than compression.

	* ``gzip -6``: 22.3MB/s in, 7.1MB/s out
	* ``gzip -1``: 44.1MB/s in, 15.1MB/s out
	* ``lzop``: 71.9MB/s in, 28.1MB/s out

	If encryption is the limiting factor, and as in the example, limits the
	bandwidth to 40MB/s, ``lzop`` allows for an effective 79% increase in transfer
	speed. The fast ``gzip`` would also prove to be beneficial, but much less than
	``lzop``. It should also be noted that as a rule of thumb, tools with a lower
	compression ratio had a lesser workload, with ``lzop`` straining the CPU much
	less than any of the competitors.

	With the test results present here, it is clear that ``lzop`` would be a very
	worthwhile addition to the compression options present in Ganeti, yet the
	problem is that it is not available by default on all distributions, as the
	option's presence might imply. In general, Ganeti may know how to use several
	tools, and check for their presence, but should add some way of at least hinting
	at which tools are available.

	Additionally, the user might want to use a tool that Ganeti did not account for.
	Allowing the tool to be named is also helpful, both for cases when multiple
	custom tools are to be used, and for distinguishing between various tools in
	case of e.g. inter-cluster moves.

	To this end, the ``--compression-tools`` cluster parameter will be added to
	Ganeti. It contains a list of names of compression tools that can be supplied as
	the parameter of ``--compress``, and by default it contains all the tools
	Ganeti knows how to use. The user can change the list as desired, removing
	entries that are not or should not be available on the cluster, and adding
	custom tools.

	Every custom tool is identified by its name, and Ganeti expects the name to
	correspond to a script invoking the compression tool. Without arguments, the
	script compresses input on stdin, outputting it on stdout. With the -d argument,
	the script does the same, only while decompressing. The -h argument is used to
	check for the presence of the script, and in this case, only the error code is
	examined. This syntax matches the ``gzip`` syntax well, which should allow most
	compression tools to be adapted to it easily.

	Ganeti will not allow arbitrary parameters to be passed to a compression tool,
	and will restrict the names to contain only a small but assuredly safe subset of
	characters - alphanumeric values and dashes and underscores. This minimizes the
	risk of security issues that could arise from an attacker smuggling a malicious
	command through RAPI. Common variations, like the speed/compression tradeoff of
	``gzip``, will be handled by aliases, e.g. ``gzip-fast`` or ``gzip-slow``.

	It should also be noted that for some purposes - e.g. the writing of OVF files,
	``gzip`` is the only allowed means of compression, and an appropriate error
	message should be displayed if the user attempts to use one of the other
	provided tools.

	Zeroing instance disks
	======================

	While compression lowers the amount of data sent, further reductions can be
	achieved by taking advantage of the structure of the disk - namely, sending only
	used disk sectors.

	There is no direct way to achieve this, as it would require that the
	move-instance tool is aware of the structure of the file system. Mounting the
	filesystem is not an option, primarily due to security issues. A disk primed to
	take advantage of a disk driver exploit could cause an attacker to breach
	instance isolation and gain control of a Ganeti node.

	An indirect way for this performance gain to be achieved is the zeroing of any
	hard disk space not in use. While this primarily means empty space, swap
	partitions can be zeroed as well.

	Sequences of zeroes can be compressed and thus transferred very efficiently, all
	without the host knowing that these are empty space. This approach can also be
	dangerous if a sparse disk is zeroed in this way, causing ballooning. As Ganeti
	does not seem to make special concessions for moving sparse disks, the only
	difference should be the disk space utilization on the current node.

	Zeroing approaches
	++++++++++++++++++

	Zeroing is a feasible approach, but the node cannot perform it as it cannot
	mount the disk. Only virtualization-based options remain, and of those, using
	Ganeti's own virtualization capabilities makes the most sense. There are two
	ways of doing this - creating a new helper instance, temporary or persistent, or
	reusing the target instance.

	Both approaches have their disadvantages. Creating a new helper instance
	requires managing its lifecycle, taking special care to make sure no helper
	instance remains left over due to a failed operation. Even if this were to be
	taken care of, disks are not yet separate entities in Ganeti, making the
	temporary transfer of disks between instances hard to implement and even harder
	to make robust. The reuse can be done by modifying the OS running on the
	instance to perform the zeroing itself when notified via the new instance
	communication mechanism, but this approach is neither generic, nor particularly
	safe. There is no guarantee that the zeroing operation will not interfere with
	the normal operation of the instance, nor that it will be completed if a
	user-initiated shutdown occurs.

	A better solution can be found by combining the two approaches - re-using the
	virtualized environment, but with a specifically crafted OS image. With the
	instance shut down as it should be in preparation for the move, it can be
	extended with an additional disk with the OS image on it. By prepending the
	disk and changing some instance parameters, the instance can boot from it. The
	OS can be configured to perform the zeroing on startup, attempting to mount any
	partitions with a filesystem present, and creating and deleting a zero-filled
	file on them. After the zeroing is complete, the OS should shut down, and the
	master should note the shutdown and restore the instance to its previous state.

	Note that the requirements above are very similar to the notion of a helper VM
	suggested in the OS install document. Some potentially unsafe actions are
	performed within a virtualized environment, acting on disks that belong or will
	belong to the instance. The mechanisms used will thus be developed with both
	approaches in mind.

	Implementation
	++++++++++++++

	There are two components to this solution - the Ganeti changes needed to boot
	the OS, and the OS image used for the zeroing. Due to the variety of filesystems
	and architectures that instances can use, no single ready-to-run disk image can
	satisfy the needs of all the Ganeti users. Instead, the instance-debootstrap
	scripts can be used to generate a zeroing-capable OS image. This might not be
	ideal, as there are lightweight distributions that take up less space and boot
	up more quickly. Generating those with the right set of drivers for the
	virtualization platform of choice is not easy. Thus we do not provide a script
	for this purpose, but the user is free to provide any OS image which performs
	the necessary steps: zero out all virtualization-provided devices on startup,
	shutdown immediately. The cluster-wide parameter controlling the image to be
	used would be called ``--zeroing-image``.

	The modifications to Ganeti code needed are minor. The zeroing functionality
	should be implemented as an extension of the instance export, and exposed as the
	``--zero-free-space option``. Prior to beginning the export, the instance
	configuration is temporarily extended with a new read-only disk of sufficient
	size to host the zeroing image, and the changes necessary for the image to be
	used as the boot drive. The temporary nature of the configuration changes
	requires that they are not propagated to other nodes. While this would normally
	not be feasible with an instance using a disk template offering multi-node
	redundancy, experiments with the code have shown that the restriction on
	diverse disk templates can be bypassed to temporarily allow a plain
	disk-template disk to host the zeroing image. Given that one of the planned
	changes in Ganeti is to have instance disks as separate entities, with no
	restriction on templates, this assumption is useful rather than harmful by
	asserting the desired behavior. The image is dumped to the disk, and the
	instance is started up.

	Once the instance is started up, the zeroing will proceed until completion, when
	a self-initiated shutdown will occur. The instance-shutdown detection
	capabilities of 2.11 should prevent the watcher from restarting the instance
	once this happens, allowing the host to take it as a sign the zeroing was
	completed. Either way, the host waits until the instance is shut down, or a
	timeout has been reached and the instance is forcibly shut down. As the time
	needed to zero an instance is dependent on the size of the disk of the instance,
	the user can provide a fixed and a per-size timeout, recommended to be set to
	twice the maximum write speed of the device hosting the instance.

	Better progress monitoring can be implemented with the instance-host
	communication channel proposed by the OS install design document. The first
	version will most likely use only the shutdown detection, and will be improved
	to account for the available communication channel at a later time.

	After the shutdown, the temporary disk is destroyed and the instance
	configuration is reverted to its original state. The very same action is done if
	any error is encountered during the zeroing process. In the case that the
	zeroing is interrupted while the zero-filled file is being written, the file may
	remain on the disk of the instance. The script that performs the zeroing will be
	made to react to system signals by deleting the zero-filled file, but there is
	little else that can be done to recover.

	When to use zeroing
	+++++++++++++++++++

	The question of when it is useful to use zeroing is hard to answer because the
	effectiveness of the approach depends on many factors. All compression tools
	compress zeroes to almost nothingness, but compressing them takes time. If the
	time needed to compress zeroes were equal to zero, the approach would boil down
	to whether it is faster to zero unused space out, performing writes to disk, or
	to transfer it compressed. For the example used above, the average compression
	ratio, and write speeds of current disk drives, the answer would almost
	unanimously be yes.

	With a more realistic setup, where zeroes take time to compress, yet less time
	than ordinary data, the gains depend on the previously mentioned tradeoff and
	the free space available. Zeroing will definitely lessen the amount of bandwidth
	used, but it can lead to the connection being underutilized due to the time
	spent compressing data. It is up to the user to make these tradeoffs, but
	zeroing should be seen primarily as a means of further reducing the amount of
	data sent while increasing disk activity, with possible speed gains that should
	not be relied upon.

	In the future, the VM created for zeroing could also undertake other tasks
	related to the move, such as compression and encryption, and produce a stream
	of data rather than just modifying the disk. This would lessen the strain on
	the resources of the hypervisor, both disk I/O and CPU usage, and allow moves to
	obey the resource constraints placed on the instance being moved.

	Lock reduction
	==============

	An instance move as executed by the move-instance tool consists of several
	preparatory RAPI calls, leading up to two long-lasting opcodes: OpCreateInstance
	and OpBackupExport. While OpBackupExport locks only the instance, the locks of
	OpCreateInstance require more attention.

	When executed, this opcode attempts to lock all nodes on which the instance may
	be created and obtain shared locks on the groups they belong to. In the case
	that an IAllocator is used, this means all nodes must be locked. Any operation
	that requires a node lock to be present can delay the move operation, and there
	is no shortage of these.

	The concept of opportunistic locking has been introduced to remedy exactly this
	situation, allowing the IAllocator to lock as many nodes as possible. Depending
	whether the allocation can be made on these nodes, the operation either proceeds
	as expected, or fails noting that it is temporarily infeasible. The failure case
	would change the semantics of the move-instance tool, which is expected to fail
	only if the move is impossible. To yield the benefits of opportunistic locking
	yet satisfy this constraint, the move-instance tool can be extended with the
	--opportunistic-tries and --opportunistic-try-delay options. A number of
	opportunistic instance creations are attempted, with a delay between attempts.
	The delay is slightly altered every time to avoid timing issues. Should all
	attempts fail, a normal instance creation is requested, which blocks until all
	the locks can be acquired.

	While it may seem excessive to grab so many node locks, the early release
	mechanism is used to make the situation less dire, releasing all nodes that were
	not chosen as candidates for allocation. This is taken to the extreme as all the
	locks acquired are released prior to the start of the transfer, barring the
	newly-acquired lock over the new instance. This works because all operations
	that alter the node in a way which could affect the transfer:

	* are prevented by the instance lock or instance presence, e.g. gnt-node remove,
	gnt-node evacuate,

	* do not interrupt the transfer, e.g. a PV on the node can be set as
	unallocatable, and the transfer still proceeds as expected,

	* do not care, e.g. a gnt-node powercycle explicitly ignores all locks.

	This invariant should be kept in mind, and perhaps verified through tests.

	All in all, there is very little space to reduce the number of locks used, and
	the only improvement that can be made is introducing opportunistic locking as an
	option of move-instance.

	Introduction of changes
	=======================

	All the changes noted will be implemented in Ganeti 2.12, in the way described
	in the previous chapters. They will be implemented as separate changes, first
	the lock reduction, then the instance zeroing, then the compression
	improvements, and finally the encryption changes.