 ===================================
 Removal of the Config Lock Overhead
 ===================================

 .. contents:: :depth: 4

 This is a design document detailing how the adverse effect of
 the config lock can be removed in an incremental way.

 Current state and shortcomings
 ==============================

 As a result of the :doc:`design-daemons`, the configuration is held
 in a process different from the processes carrying out the Ganeti
 jobs. Therefore, job processes have to contact WConfD in order to
 change the configuration. Of course, these modifications of the
 configuration need to be synchronised.

 The current form of synchronisation is via ``ConfigLock``. Exclusive
 possession of this lock guarantees that no one else modifies the
 configuration. In other words, the current procedure for a job to
 update the configuration is to

 - acquire the ``ConfigLock`` from WConfD,

 - read the configuration,

 - write the modified configuration, and

 - release ``ConfigLock``.
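
 The four steps above can be sketched with a toy, in-memory stand-in for
 WConfD. This is only an illustration under stated assumptions: the class
 ``ToyWConfD``, its method names and the ``disk_writes`` log are hypothetical
 and do not reflect Ganeti's actual RPC interface.

```python
import copy
import threading


class ToyWConfD:
    """Hypothetical in-memory stand-in for WConfD; not Ganeti's real API."""

    def __init__(self, config):
        self._config = config
        self._lock = threading.Lock()   # stands in for the ConfigLock
        self._owner = None
        self.disk_writes = []           # log of simulated synchronous writes

    def acquire_config_lock(self, job_id):
        self._lock.acquire()
        self._owner = job_id
        # Possession is confirmed only after the lock status is written.
        self.disk_writes.append("lock status")

    def read_config(self):
        # The whole configuration is transferred to the job.
        return copy.deepcopy(self._config)

    def write_config(self, job_id, new_config):
        if self._owner != job_id:
            raise RuntimeError("ConfigLock not held by caller")
        self._config = copy.deepcopy(new_config)
        self.disk_writes.append("configuration")

    def release_config_lock(self, job_id):
        if self._owner != job_id:
            raise RuntimeError("ConfigLock not held by caller")
        self._owner = None
        self.disk_writes.append("lock status")
        self._lock.release()


def update_config(wconfd, job_id, modify):
    """Acquire the lock, read, write the modified configuration, release."""
    wconfd.acquire_config_lock(job_id)
    try:
        config = wconfd.read_config()
        modify(config)
        wconfd.write_config(job_id, config)
    finally:
        wconfd.release_config_lock(job_id)


wconfd = ToyWConfD({"instances": {"inst1": {"status": "down"}}})
update_config(wconfd, "job-1",
              lambda cfg: cfg["instances"]["inst1"].update(status="up"))
```

 In this toy model a single update leaves three entries in ``disk_writes``,
 which is the kind of per-update cost the remainder of this document aims
 to reduce.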

 The current procedure has some drawbacks. These also affect the
 overall throughput of jobs in a Ganeti cluster.

 - At each configuration update, the whole configuration is
   transferred between the job and WConfD.

 - More importantly, however, jobs can only release the ``ConfigLock`` after
   the write; the write, in turn, is only confirmed once the configuration
   is written to disk. In particular, we can only have one update per
   configuration write. Also, the acquisition of the ``ConfigLock`` is only
   confirmed to the job once the new lock status is written to disk.

 Additional overhead is caused by the fact that reads are synchronised over
 a shared config lock. This used to make sense when the configuration was
 modifiable in the same process, to ensure consistent reads. With the new
 structure, all accesses to the configuration via WConfD are consistent
 anyway, and local modifications by other jobs do not happen.


 Proposed changes for an incremental improvement
 ===============================================

 Ideally, jobs would just send patches for the configuration to WConfD
 that are applied by means of atomically updating the respective ``IORef``.
 This, however, would require changing all of Ganeti's logical units in
 one big change. Therefore, we propose to keep the ``ConfigLock`` and,
 step by step, reduce its impact until it is eventually used only
 internally in the WConfD process.

 Unlocked Reads
 --------------

 In a first step, all configuration operations that are synchronised over
 a shared config lock, and therefore necessarily read-only, will instead
 use WConfD's ``readConfig`` to obtain a snapshot of the configuration.
 This will be done without modifying the locks. It is sound, as reads of
 a Haskell ``IORef`` always yield a consistent value. From that snapshot
 the required view is computed locally. This saves two lock-configuration
 write cycles per read and, additionally, does not block any concurrent
 modifications.
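
 As a rough illustration of why such snapshot reads need no lock, the
 following sketch mimics an ``IORef`` holding an immutable value. It is a
 Python analogy, not WConfD's Haskell implementation; the class
 ``ConfigRef`` and its methods are hypothetical names introduced here.

```python
import threading
from types import MappingProxyType


class ConfigRef:
    """Python analogue of an IORef holding an immutable configuration."""

    def __init__(self, config):
        # The proxy is a read-only view (shallow) of a private copy.
        self._value = MappingProxyType(dict(config))
        self._write_lock = threading.Lock()   # serialises writers only

    def read_config(self):
        # No lock: reading one reference yields either the old or the
        # new snapshot, never a mixture of the two.
        return self._value

    def modify(self, fn):
        with self._write_lock:
            new_value = fn(dict(self._value))
            # Writers swap in a fresh snapshot instead of mutating in place.
            self._value = MappingProxyType(new_value)


ref = ConfigRef({"cluster_name": "example", "serial_no": 1})
snapshot = ref.read_config()
ref.modify(lambda cfg: {**cfg, "serial_no": cfg["serial_no"] + 1})

# The required view is computed locally from the snapshot taken earlier.
old_serial = snapshot["serial_no"]
new_serial = ref.read_config()["serial_no"]
```

 A reader holding ``snapshot`` keeps seeing the state it obtained, while
 later readers see the updated one; neither blocks the writer.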

 In a second step, more specialised read functions will be added to ``WConfD``.
 This will reduce the traffic for reads.

 Cached Reads
 ------------

 As jobs synchronise with each other by means of regular locks, the parts
 of the configuration relevant for a job can only change while a job waits
 for new locks. So, if a job has a copy of the configuration and has not asked
 for locks afterwards, all read-only access can be done from that copy. While
 this will not affect the ``ConfigLock``, it saves traffic.
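
 A minimal sketch of such a job-side cache follows. The class
 ``CachedConfigReader``, the ``fetch_snapshot`` callable and the
 ``on_locks_requested`` hook are assumptions made for illustration; how a
 real job would learn that it has requested locks is not specified here.

```python
class CachedConfigReader:
    """Hypothetical job-side cache of the configuration snapshot."""

    def __init__(self, fetch_snapshot):
        self._fetch_snapshot = fetch_snapshot   # e.g. a read call to WConfD
        self._snapshot = None
        self.fetch_count = 0                    # counts round trips

    def config(self):
        if self._snapshot is None:
            self._snapshot = self._fetch_snapshot()
            self.fetch_count += 1
        return self._snapshot

    def on_locks_requested(self):
        # Relevant parts of the configuration may have changed while the
        # job was waiting for the locks, so the copy is dropped.
        self._snapshot = None


store = {"serial_no": 1}
reader = CachedConfigReader(lambda: dict(store))

first = reader.config()["serial_no"]
second = reader.config()["serial_no"]   # served from the local copy
store["serial_no"] = 2                  # another job changes the config
reader.on_locks_requested()             # this job asks for further locks
third = reader.config()["serial_no"]    # fetched again
```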

 Set-and-release action
 ----------------------

 A typical pattern is to change the configuration and afterwards release
 the ``ConfigLock``. To avoid unnecessary RPC call overhead, WConfD will offer
 a combined call. To make that call retryable, it will do nothing if the
 ``ConfigLock`` is not held by the caller; in the return value, it will indicate
 whether the config lock was held when the call was made.
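
 The retry semantics can be sketched as follows. The class
 ``SetAndReleaseWConfD`` and the method name ``write_config_and_unlock``
 are hypothetical and chosen only for this example.

```python
class SetAndReleaseWConfD:
    """Hypothetical toy model of the combined set-and-release call."""

    def __init__(self, config):
        self.config = config
        self.lock_owner = None

    def acquire_config_lock(self, job_id):
        if self.lock_owner is not None:
            return False
        self.lock_owner = job_id
        return True

    def write_config_and_unlock(self, job_id, new_config):
        """Return True if the caller held the ConfigLock at call time."""
        if self.lock_owner != job_id:
            return False            # nothing is done, so a retry is harmless
        self.config = dict(new_config)
        self.lock_owner = None
        return True


wconfd = SetAndReleaseWConfD({"serial_no": 1})
wconfd.acquire_config_lock("job-1")
first_call = wconfd.write_config_and_unlock("job-1", {"serial_no": 2})

# Another job changes the configuration in the meantime.
wconfd.acquire_config_lock("job-2")
wconfd.write_config_and_unlock("job-2", {"serial_no": 3})

# job-1 retries, for example because the first answer was lost.
retry = wconfd.write_config_and_unlock("job-1", {"serial_no": 2})
```

 The retried call reports that the lock was not held and leaves the
 newer configuration written by the other job untouched.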

 Short-lived ``ConfigLock``
 --------------------------

 For a lot of operations, the regular locks already ensure that only
 one job can modify a certain part of the configuration. For example,
 only jobs with an exclusive lock on an instance will modify that
 instance. Therefore, such a job can update that entity atomically,
 without relying on the configuration lock to achieve consistency.
 ``WConfD`` will provide such operations. To
 avoid interference with non-atomic operations that still take the
 config lock and write the configuration as a whole, this operation
 will only be carried out at times the config lock is not taken. To
 ensure this, the thread handling the request will take the config lock
 itself (hence no one else has it, if that succeeds) before the change
 and release it afterwards; both operations will be done without
 triggering a writeout of the lock status.

 Note that the thread handling the request has to take the lock in its
 own name and not in that of the requesting job. A writeout of the lock
 status can still happen, triggered by other requests. If the lock were
 taken in the name of the job and ``WConfD`` were restarted after the
 lock acquisition, the job would own a lock without knowing about it,
 and hence that lock would never get released.
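
 The following toy model sketches the idea. All names
 (``ShortLockWConfD``, ``INTERNAL_OWNER``, ``update_instance``, the
 ``persist`` flag) are assumptions for illustration. In particular, this
 sketch simply reports failure when the config lock is taken, whereas a
 real implementation might instead wait until the lock becomes free.

```python
class ShortLockWConfD:
    """Hypothetical toy model; names are not Ganeti's real API."""

    INTERNAL_OWNER = "wconfd-internal"   # the daemon's own lock identity

    def __init__(self, config):
        self.config = config
        self.lock_owner = None
        self.lock_status_writes = 0      # simulated lock status writeouts

    def acquire_config_lock(self, owner, persist=True):
        if self.lock_owner is not None:
            return False
        self.lock_owner = owner
        if persist:
            self.lock_status_writes += 1
        return True

    def release_config_lock(self, owner, persist=True):
        if self.lock_owner != owner:
            return False
        self.lock_owner = None
        if persist:
            self.lock_status_writes += 1
        return True

    def update_instance(self, name, fn):
        """Apply fn to one instance; False means the ConfigLock was taken."""
        # The lock is taken in the daemon's own name, without a writeout.
        if not self.acquire_config_lock(self.INTERNAL_OWNER, persist=False):
            return False                 # caller may retry later
        try:
            instances = dict(self.config["instances"])
            instances[name] = fn(instances[name])
            self.config = {**self.config, "instances": instances}
            return True
        finally:
            self.release_config_lock(self.INTERNAL_OWNER, persist=False)


wconfd = ShortLockWConfD({"instances": {"inst1": {"status": "down"}}})
applied = wconfd.update_instance("inst1", lambda inst: {**inst, "status": "up"})
writes_after_update = wconfd.lock_status_writes

# While a job holds the ConfigLock for a whole-configuration write,
# the atomic operation is not carried out.
wconfd.acquire_config_lock("job-7")
blocked = wconfd.update_instance("inst1", lambda inst: {**inst, "status": "down"})
```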


 Approaches considered, but not working
 ======================================

 Set-and-release action with asynchronous writes
 -----------------------------------------------

 Approach
 ~~~~~~~~

 A typical pattern is to change the configuration and afterwards release
 the ``ConfigLock``. To avoid unnecessary delay in this operation (the next
 modification of the configuration can already happen while the last change
 is written out), WConfD will offer a combined command that will

 - set the configuration to the specified value,

 - release the config lock,

 - and only then wait for the configuration write to finish; it will not
   wait for confirmation of the lock-release write.

 If jobs use this combined command instead of the sequential set followed
 by release, new configuration changes can come in during writeout of the
 current change; in particular, a writeout can contain more than one change.

 Problem
 ~~~~~~~

 This approach works fine as long as either ``WConfD`` can always do an ordered
 shutdown or the calling process dies as well. If, however, we allow random kill
 signals to be sent to individual daemons (e.g., by an out-of-memory killer),
 the following race occurs. A process can ask for a combined write-and-unlock
 operation; while the configuration is still being written out, the writeout of
 the updated lock status already finishes. Now, if ``WConfD`` forcefully gets
 killed at that very moment, a restarted ``WConfD`` will read the old
 configuration but the new lock status. This will make the calling process
 believe that its call, while it didn't get an answer, succeeded nevertheless,
 thus resulting in a wrong configuration state.
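
 The race can be replayed deterministically with a small simulation. The
 dictionaries ``disk``, ``memory`` and ``restarted`` are hypothetical
 stand-ins for the persisted state and the daemon's in-memory state; they
 are not Ganeti data structures.

```python
# Persisted state before the combined call: job-1 holds the ConfigLock.
disk = {"config": {"serial_no": 1}, "lock_owner": "job-1"}

# In-memory state of WConfD right after the combined call from job-1:
# the configuration is set and the lock is released.
memory = {"config": {"serial_no": 2}, "lock_owner": None}

# The writeout of the lock status finishes first ...
disk["lock_owner"] = memory["lock_owner"]
# ... and WConfD is killed before the configuration writeout completes.
memory = None

# A restarted WConfD recovers its state from disk.
restarted = {"config": dict(disk["config"]), "lock_owner": disk["lock_owner"]}

# job-1 never got an answer. It finds that it no longer holds the lock
# and concludes that its combined call succeeded.
job_believes_success = restarted["lock_owner"] != "job-1"
change_was_persisted = restarted["config"]["serial_no"] == 2
```

 Here ``job_believes_success`` is true although ``change_was_persisted``
 is false, which is exactly the inconsistent outcome described above.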