===================================
Design for virtual clusters support
===================================


Introduction
============

Currently there are two ways to test the Ganeti (including HTools) code
base:

- unittests, which run as a normal user using mocks and test small
  bits of the code
- QA/burnin/live-test, which require actual hardware (either physical
  or virtual) and build an actual cluster, with a one-to-one
  machine-to-node correspondence

The difference in time between these two is significant:

- the unittests run in about 1-2 minutes
- a so-called ‘quick’ QA (without burnin) runs in about an hour, and a
  full QA can take double that time

On the one hand, the unittests have a clear advantage: they are quick
to run and do not require many machines. On the other hand, only QA is
able to run end-to-end tests (including HTools, for example).

Ideally, we would have an intermediate step between these two extremes:
being able to test most, if not all, of Ganeti's functionality, but
without requiring actual hardware, full machine ownership or root
access.


Current situation
=================

Ganeti
------

It is possible, given a manually built ``config.data`` and
``_autoconf.py``, to run the masterd under the current user as a
single-node cluster master. However, the node daemon and related
functionality (cluster initialisation, master failover, etc.) are not
directly runnable in this model.

Also, masterd only works as the master of a single-node cluster, due to
our current “hostname” method of identifying nodes, which limits us to
at most one node daemon per machine, unless we use multiple name and IP
aliases.

HTools
------

In HTools the situation is better, since it doesn't have to deal with
actual machine management: all tools can use a custom LUXI path and can
even load RAPI data from the filesystem (so the RAPI backend can be
tested). Furthermore, both the ‘text’ backend for hbal/hspace and the
input files for hail are text-based and loaded from the filesystem.
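
For illustration, these existing interfaces can already be exercised
without a live cluster (the file names below are just examples)::

  # load the cluster state from a text dump instead of a live cluster
  hbal -t cluster.data
  # hail reads its allocation request from a file on the command line
  hail allocation-request.json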

Proposed changes
================

The end-goal is to have full support for “virtual clusters”, i.e. to be
able to run a “big” cluster (hundreds of virtual nodes and towards
thousands of virtual instances) on a single, reasonably powerful
machine, under a single user account and without any special
privileges.

This would have significant advantages:

- being able to test certain changes end-to-end, without requiring a
  complicated setup
- being better able to estimate Ganeti's behaviour and performance as
  the cluster size grows; this is something that we haven't been able
  to test reliably yet, and as such some scaling problems remain
  undiagnosed
- easier integration with external tools (and even with HTools)

``masterd``
-----------

As described above, ``masterd`` already works reasonably well in a
virtual setup, as it won't execute external programs and it shouldn't
directly read files from the local filesystem (or at least no
virtualisation-related ones, since the master node can be a
non-vm_capable node).

``noded``
---------

The node daemon executes many privileged operations, but they can be
split into a few general categories:

+---------------+-----------------------+------------------------------------+
|Category       |Description            |Solution                            |
+===============+=======================+====================================+
|disk operations|Disk creation and      |Use only diskless or file-based     |
|               |removal                |instances                           |
+---------------+-----------------------+------------------------------------+
|disk query     |Node disk total/free,  |Not supported currently, could use  |
|               |used in node listing   |file-based                          |
|               |and htools             |                                    |
+---------------+-----------------------+------------------------------------+
|hypervisor     |Instance start, stop   |Use the *fake* hypervisor           |
|operations     |and query              |                                    |
+---------------+-----------------------+------------------------------------+
|instance       |Bridge existence query |Unprivileged operation, can be used |
|networking     |                       |with an existing bridge at system   |
|               |                       |level or use NIC-less instances     |
+---------------+-----------------------+------------------------------------+
|instance OS    |OS add, OS rename,     |Only used with non-diskless         |
|operations     |export and import      |instances; could work with custom OS|
|               |                       |scripts that just ``dd`` without    |
|               |                       |mounting filesystems                |
+---------------+-----------------------+------------------------------------+
|node networking|IP address management  |Not supported; Ganeti will need to  |
|               |(master ip), IP query, |work without a master IP; for the IP|
|               |etc.                   |query operations the test machine   |
|               |                       |would need externally-configured IPs|
+---------------+-----------------------+------------------------------------+
|node add       |-                      |SSH command must be adjusted        |
+---------------+-----------------------+------------------------------------+
|node setup     |ssh, /etc/hosts, so on |Can already be disabled from the    |
|               |                       |cluster config                      |
+---------------+-----------------------+------------------------------------+
|master failover|start/stop the master  |Doable (as long as we use a single  |
|               |daemon                 |user), might get tricky w.r.t. paths|
|               |                       |to executables                      |
+---------------+-----------------------+------------------------------------+
|file upload    |Uploading of system    |The only issue could be with system |
|               |files, job queue files |files, which are not owned by the   |
|               |and ganeti config      |current user; internal ganeti files |
|               |                       |should be working fine              |
+---------------+-----------------------+------------------------------------+
|node oob       |Out-of-band commands   |Since these are user-defined, we can|
|               |                       |mock them easily                    |
+---------------+-----------------------+------------------------------------+
|node OS        |List the existing OSes |No special privileges needed, so    |
|discovery      |and their properties   |works fine as-is                    |
+---------------+-----------------------+------------------------------------+
|hooks          |Running hooks for given|No special privileges needed        |
|               |operations             |                                    |
+---------------+-----------------------+------------------------------------+
|iallocator     |Calling an iallocator  |No special privileges needed        |
|               |script                 |                                    |
+---------------+-----------------------+------------------------------------+
|export/import  |Exporting and importing|When exporting/importing file-based |
|               |instances              |instances, this should work, as the |
|               |                       |listening ports are dynamically     |
|               |                       |chosen                              |
+---------------+-----------------------+------------------------------------+
|hypervisor     |The validation of      |As long as the hypervisors don't    |
|validation     |hypervisor parameters  |call to privileged commands, it     |
|               |                       |should work                         |
+---------------+-----------------------+------------------------------------+
|node powercycle|The ability to power   |Privileged, so not supported, but   |
|               |cycle a node remotely  |anyway not very interesting for     |
|               |                       |testing                             |
+---------------+-----------------------+------------------------------------+

It seems that much of the functionality works as is, or could work with
small adjustments, even in a non-privileged setup. The bigger problem is
the actual use of multiple node daemons per machine.

Multiple ``noded`` per machine
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Currently Ganeti identifies nodes simply by their hostname. Since
changing this method would imply significant changes to how nodes are
tracked, the proposal is to create on the (single) test machine as
many IPs as there are virtual nodes, and to have each IP correspond
to a different name; this way, no changes are needed to the core RPC
library. Unfortunately this has the downside of requiring root rights
for setting up the extra IPs and hostnames.
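
For illustration only (the names and addresses are examples, not an
existing setup), this boils down to ``/etc/hosts`` entries along the
lines of::

  # one alias name and address per virtual node
  192.0.2.1   vnode1.example.com vnode1
  192.0.2.2   vnode2.example.com vnode2
  192.0.2.3   vnode3.example.com vnode3

The addresses themselves still have to be configured on an interface,
which is the root-requiring part mentioned above.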

An alternative option is to implement per-node IP/port support in
Ganeti (especially in the RPC layer), which would eliminate the need
for root rights. We expect that this will be implemented as a second
step of this design, but since the port is currently static it will
require changes in many places.

The only remaining problem is with sharing the ``localstatedir``
structure (lib, run, log) amongst the daemons, for which we propose to
introduce an environment variable (``GANETI_ROOTDIR``) acting as a
prefix for essentially all paths. An environment variable is easier to
transport through several levels of programs (shell scripts, Python,
etc.) than a command line parameter. In Python code this prefix will be
applied to all paths in ``constants.py``. Every virtual node will get
its own root directory. The rationale for this is two-fold:

- having two or more node daemons writing to the same directory might
  introduce artificial scenarios not existent in real life; currently
  noded either owns the entire ``/var/lib/ganeti`` directory or shares
  it with masterd, but never with another noded
- having separate directories allows cluster verify to correctly check
  the consistency of file upload operations; otherwise, as long as one
  node daemon wrote a file successfully, the results from all others
  are “lost”
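
As a minimal sketch of the idea (this is illustrative only, not
existing Ganeti code), the prefix handling in ``constants.py`` could
look roughly like this::

  # Hypothetical sketch only: apply an optional GANETI_ROOTDIR prefix
  # to all localstatedir-derived paths, once, at import time.
  import os

  _ROOTDIR = os.environ.get("GANETI_ROOTDIR", "")

  DATA_DIR = _ROOTDIR + "/var/lib/ganeti"
  RUN_DIR = _ROOTDIR + "/var/run/ganeti"
  LOG_DIR = _ROOTDIR + "/var/log/ganeti"

With an empty ``GANETI_ROOTDIR`` the paths stay exactly as they are
today, while each virtual node can point the variable at its own
directory tree.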

In case the use of an environment variable turns out to be too
difficult, a compile-time prefix path could be used instead. This would
then require one Ganeti installation per virtual node, but it might be
good enough.

``rapi``
--------

The RAPI daemon is not privileged and furthermore we only need one per
cluster, so it presents no issues.

``confd``
---------

``confd`` has somewhat the same issues as the node daemon regarding
multiple daemons per machine, but the per-address binding still works.

``ganeti-watcher``
------------------

Since the startup of daemons will be customised with per-IP binds, the
watcher either has to be modified so that it does not activate the
daemons, or the start-stop tool has to take this into account. Due to
the watcher's use of the hostname, it is recommended that the master
node be set to the machine hostname (this is also a requirement for
the master daemon).
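
A rough sketch of what such a customised start-up could look like (the
directory layout, node names and addresses are made up, and this is
not existing Ganeti tooling; ``GANETI_ROOTDIR`` is the variable
proposed above)::

  # Start one node daemon per virtual node, each with its own root
  # directory and its own bind address.
  import os
  import subprocess

  VIRTUAL_NODES = {
      "vnode1.example.com": "192.0.2.1",
      "vnode2.example.com": "192.0.2.2",
  }

  for name, address in VIRTUAL_NODES.items():
      env = dict(os.environ, GANETI_ROOTDIR="/srv/vcluster/" + name)
      # bind each daemon to its own address (assuming noded's
      # bind-address option, -b)
      subprocess.Popen(["ganeti-noded", "-b", address], env=env)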

CLI scripts
-----------

As long as the master node is set to the machine hostname, these should
work fine.

Cluster initialisation
----------------------

It is possible that the cluster initialisation procedure will be a bit
more involved (this has not been tried yet). A script will be used to
set up all necessary IP addresses and hostnames, as well as to create
the initial directory structure. Building ``config.data`` manually
should not be necessary.
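
The per-node part of such a setup script could look roughly like the
sketch below; the base directory, node names and addresses are purely
illustrative, and this is not an existing Ganeti script::

  # Illustrative sketch: create one localstatedir tree per virtual
  # node and add the corresponding IP aliases (the latter needs root).
  import os
  import subprocess

  BASE = "/srv/vcluster"
  SUBDIRS = ("var/lib/ganeti", "var/run/ganeti", "var/log/ganeti")
  NODES = {
      "vnode1.example.com": "192.0.2.1",
      "vnode2.example.com": "192.0.2.2",
  }

  for name, address in NODES.items():
      # one private localstatedir tree per virtual node
      for subdir in SUBDIRS:
          os.makedirs(os.path.join(BASE, name, subdir), exist_ok=True)
      # adding the alias addresses themselves requires root rights
      alias = address + "/32"
      subprocess.check_call(["ip", "addr", "add", alias, "dev", "lo"])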

Needed tools
============

With the above investigation results in mind, the only things we need
are:

- a tool to set up the per-virtual-node tree structure of
  ``localstatedir`` (with the help of ``ensure-dirs``) and to
  correctly set up the extra IPs/hostnames
- changes to the daemon startup tools, so that they correctly launch
  the daemons for each virtual node
- changes to ``constants.py`` to override the ``localstatedir`` path
- documentation for running such a virtual cluster
- and possibly small fixes to the node daemon backend functionality,
  to better separate privileged and non-privileged code

.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: