| ======================== |
| GlusterFS Ganeti support |
| ======================== |
| |
| This document describes the plan for adding GlusterFS support inside Ganeti. |
| |
| .. contents:: :depth: 4 |
| .. highlight:: shell-example |
| |
| Gluster overview |
| ================ |
| |
Gluster is a "brick translation" service that can turn a number of LVM logical
volumes or disks (so-called "bricks") into a unified "volume" that can be
mounted over the network through FUSE or NFS.
| |
This is a simplified view of which components are at play and how they
interconnect as data flows from the actual disks to the instances. The parts in
grey are available for Ganeti to use and are included for completeness, but are
not targeted for implementation at this stage.
| |
| .. digraph:: "gluster-ganeti-overview" |
| |
| graph [ spline=ortho ] |
| node [ shape=rect ] |
| |
| { |
| |
| node [ shape=none ] |
| _volume [ label=volume ] |
| |
| bricks -> translators -> _volume |
| _volume -> network [label=transport] |
| network -> instances |
| } |
| |
| { rank=same; brick1 [ shape=oval ] |
| brick2 [ shape=oval ] |
| brick3 [ shape=oval ] |
| bricks } |
| { rank=same; translators distribute } |
| { rank=same; volume [ shape=oval ] |
| _volume } |
| { rank=same; instances instanceA instanceB instanceC instanceD } |
| { rank=same; network FUSE NFS QEMUC QEMUD } |
| |
| { |
| node [ shape=oval ] |
| brick1 [ label=brick ] |
| brick2 [ label=brick ] |
| brick3 [ label=brick ] |
| } |
| |
| { |
| node [ shape=oval ] |
| volume |
| } |
| |
| brick1 -> distribute |
| brick2 -> distribute |
| brick3 -> distribute -> volume |
| volume -> FUSE [ label=<TCP<br/><font color="grey">UDP</font>> |
| color="black:grey" ] |
| |
| NFS [ color=grey fontcolor=grey ] |
| volume -> NFS [ label="TCP" color=grey fontcolor=grey ] |
| NFS -> mountpoint [ color=grey fontcolor=grey ] |
| |
| mountpoint [ shape=oval ] |
| |
| FUSE -> mountpoint |
| |
| instanceA [ label=instances ] |
| instanceB [ label=instances ] |
| |
| mountpoint -> instanceA |
| mountpoint -> instanceB |
| |
| QEMUC [ label=QEMU ] |
| QEMUD [ label=QEMU ] |
| |
| { |
| instanceC [ label=instances ] |
| instanceD [ label=instances ] |
| } |
| |
| volume -> QEMUC [ label=<TCP<br/><font color="grey">UDP</font>> |
| color="black:grey" ] |
| volume -> QEMUD [ label=<TCP<br/><font color="grey">UDP</font>> |
| color="black:grey" ] |
| QEMUC -> instanceC |
| QEMUD -> instanceD |
| |
brick:
  The unit of storage in Gluster. Typically a drive or LVM logical volume
  formatted with, for example, XFS.
| |
| distribute: |
| One of the translators in Gluster, it assigns files to bricks based on the |
| hash of their full path inside the volume. |
| |
| volume: |
| A filesystem you can mount on multiple machines; all machines see the same |
| directory tree and files. |
| |
| FUSE/NFS: |
| Gluster offers two ways to mount volumes: through FUSE or a custom NFS server |
| that is incompatible with other NFS servers. FUSE is more compatible with |
| other services running on the storage nodes; NFS gives better performance. |
| For now, FUSE is a priority. |
| |
QEMU:
  QEMU 1.3 can use Gluster volumes directly from userspace, without the need to
  mount anything (see the example below). Ganeti still needs kernelspace access
  at disk creation and OS install time.
| |
| transport: |
| FUSE and QEMU allow you to connect using TCP and UDP, whereas NFS only |
| supports TCP. Those protocols are called transports in Gluster. For now, TCP |
| is a priority. |
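
As an illustration of the QEMU userspace access and the transport notion
described above, QEMU addresses a file on a Gluster volume with a URI of the
form ``gluster[+transport]://host[:port]/volume/path``. The host, volume and
file names below are purely illustrative::

  $ qemu-system-x86_64 ... \
      -drive file=gluster+tcp://gluster.example.com/gv0/ganeti/example.0,if=virtio

Under this design, Ganeti could hand such a URI to the hypervisor for running
instances while still mounting the volume whenever kernelspace access is
required.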
| |
| It is the administrator's duty to set up the bricks, the translators and thus |
| the volume as they see fit. Ganeti will take care of connecting the instances to |
| a given volume. |
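
For example, a minimal distributed volume named ``gv0`` (the default volume
name used later in this document) could be created from two bricks as follows;
the host names and brick paths are purely illustrative::

  $ gluster volume create gv0 node1.example.com:/export/brick1 \
      node2.example.com:/export/brick1
  $ gluster volume start gv0

Ganeti only needs to know the resulting volume name and a host to connect to;
the arrangement of bricks and translators behind the volume is of no concern to
Ganeti.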
| |
| .. note:: |
| |
  The administrator must whitelist the Gluster mountpoint in
  ``/etc/ganeti/file-storage-paths`` for security reasons, so that Ganeti is
  allowed to modify the filesystem.
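
With the default mount directory of ``/var/run/ganeti/gluster``, that amounts
to something like the following (the exact path must match the
``--gluster-dir`` setting described below)::

  $ echo "/var/run/ganeti/gluster" >> /etc/ganeti/file-storage-paths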
| |
| Why not use a ``sharedfile`` disk template? |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
Gluster volumes `can` be used by Ganeti through the generic shared file disk
template. There are, however, a number of reasons why that is probably not a
good idea:
| |
| * Shared file, being a generic solution, cannot offer userspace access support. |
| * Even with userspace support, Ganeti still needs kernelspace access in order to |
| create disks and install OSes on them. Ganeti can manage the mounting for you |
| so that the Gluster servers only have as many connections as necessary. |
| * Experiments showed that you can't trust ``mount.glusterfs`` to give useful |
| return codes or error messages. Ganeti can work around its oddities so |
| administrators don't have to. |
* The shared file folder scheme (``../{instance.name}/disk{disk.id}``) does not
  work well with Gluster. The ``distribute`` translator distributes files
  across bricks, but directories need to be replicated on `all` bricks. As a
  result, if we have twelve hundred instances, that means twelve hundred
  folders being replicated on all bricks. This does not scale well.
* A dedicated disk template also frees up the shared file disk template to be
  used with a different, unsupported replication scheme together with Gluster.
  (Storage pools are the long-term solution for this, however.)
| |
So, while Gluster essentially `is` a shared file disk template, Ganeti can
provide better support for it than the generic one does.
| |
| Implementation strategy |
| ======================= |
| |
Working with GlusterFS in kernel space essentially boils down to the following
steps (a rough shell sketch follows the list):
| |
| 1. Ask FUSE to mount the Gluster volume. |
| 2. Check that the mount succeeded. |
| 3. Use files stored in the volume as instance disks, just like sharedfile does. |
| 4. When the instances are spun down, attempt unmounting the volume. If the |
| gluster connection is still required, the mountpoint is allowed to remain. |
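
A minimal sketch of these steps in shell, assuming the default mount directory
and illustrative server and volume names::

  $ mkdir -p /var/run/ganeti/gluster
  $ mount -t glusterfs gluster.example.com:/gv0 /var/run/ganeti/gluster
  # verify explicitly; mount.glusterfs exit codes cannot be trusted
  $ mountpoint -q /var/run/ganeti/gluster
  # ... instance disks are plain files under the mountpoint ...
  $ umount /var/run/ganeti/gluster

Ganeti performs the equivalent of these steps itself; the sketch only shows
what happens under the hood.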
| |
However, since mounting the volume is not strictly necessary when all that is
needed is userspace access, it is inappropriate for the Gluster storage class
to inherit from ``FileStorage``. The implementation should therefore resort to
composition rather than inheritance:
| |
| 1. Extract the ``FileStorage`` disk-facing logic into a ``FileDeviceHelper`` |
| class. |
| |
   * In order not to further inflate ``bdev.py``, ``FileStorage`` should join
     its helper functions in ``filestorage.py`` (thus reducing their
     visibility), and Gluster should get its own file, ``gluster.py``. Moving
     the other classes to their own files (as has been done in
     ``lib/hypervisor/``) is not addressed as part of this design.
| |
| 2. Use the ``FileDeviceHelper`` class to implement a ``GlusterStorage`` class in |
| much the same way. |
| 3. Add Gluster as a disk template that behaves like SharedFile in every way. |
4. Teach Ganeti what a ``GlusterVolume`` is and how to mount, unmount and
   reference one.

   * Before attempting a mount, we should check whether the volume is already
     mounted. Linux allows mounting partitions multiple times, but then you
     also have to unmount them as many times as you mounted them to actually
     free the resources; this also makes the output of commands such as
     ``mount`` less useful.
   * Every time the device could be released (after instance shutdown, OS
     installation scripts or file creation), a single unmount is attempted. If
     the device is still busy (e.g. from other instances, jobs or open
     administrator shells), the failure is ignored (see the sketch after this
     list).
| |
| 5. Modify ``GlusterStorage`` and customize the disk template behavior to fit |
| Gluster's needs. |
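
In shell terms, the mount and unmount behaviour described in step 4 is roughly
the following (same illustrative names as in the earlier sketch)::

  # mount only if the volume is not already mounted, to avoid stacked mounts
  $ mountpoint -q /var/run/ganeti/gluster || \
      mount -t glusterfs gluster.example.com:/gv0 /var/run/ganeti/gluster
  # best-effort unmount: if the volume is still in use, leave it mounted
  $ umount /var/run/ganeti/gluster 2>/dev/null || true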
| |
| Directory structure |
| ~~~~~~~~~~~~~~~~~~~ |
| |
In order to address the shortcomings of the generic shared file directory
scheme for instance disks, Gluster uses a different scheme for determining a
disk's logical id, and therefore its path on the file system.
| |
| The naming scheme is:: |
| |
| /ganeti/{instance.uuid}.{disk.id} |
| |
| ...bringing the actual path on a node's file system to:: |
| |
| /var/run/ganeti/gluster/ganeti/{instance.uuid}.{disk.id} |
| |
This means Ganeti only uses one directory on the Gluster volume (leaving the
rest of the volume free for other uses) and works better with how Gluster
distributes storage over its bricks.
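
For illustration, a node with three such disks mounted might see something like
the following (the instance UUIDs are made up)::

  $ ls -1 /var/run/ganeti/gluster/ganeti/
  4dfb9d51-96a3-44ba-8bb4-3a6e128e84f5.0
  4dfb9d51-96a3-44ba-8bb4-3a6e128e84f5.1
  a06c441d-6e28-4c42-9d73-5d4b7a0c9b6a.0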
| |
| Changes to the storage types system |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| Ganeti has a number of storage types that abstract over disk templates. This |
| matters mainly in terms of disk space reporting. Gluster support is improved by |
| a rethinking of how disk templates are assigned to storage types in Ganeti. |
| |
| This is the summary of the changes: |
| |
| +--------------+---------+---------+-------------------------------------------+ |
| | Disk | Current | New | Does it report storage information to... | |
| | template | storage | storage +-------------+----------------+------------+ |
| | | type | type | ``gnt-node | ``gnt-node | iallocator | |
| | | | | list`` | list-storage`` | | |
| +==============+=========+=========+=============+================+============+ |
| | File | File | File | Yes. | Yes. | Yes. | |
| +--------------+---------+---------+-------------+----------------+------------+ |
| | Shared file | File | Shared | No. | Yes. | No. | |
| +--------------+---------+ file | | | | |
| | Gluster (new)| N/A | (new) | | | | |
| +--------------+---------+---------+-------------+----------------+------------+ |
| | RBD (for | RBD | No. | No. | No. | |
| | reference) | | | | | |
| +--------------+-------------------+-------------+----------------+------------+ |
| |
Gluster and Shared File should, like RBD, not report storage information to
``gnt-node list`` or to IAllocators. Regrettably, the simplest way to achieve
this right now is to claim that storage reporting for the relevant storage type
is not implemented. An effort was made to claim that the shared storage type
did support disk reporting while refusing to provide any value, but it was not
successful (``hail`` does not support this combination).
| |
To do so without breaking the File disk template, a new storage type must be
added. Like RBD, it does not claim to support disk reporting. However, we can
still make an effort to report stats to ``gnt-node list-storage``.
| |
The rationale is simple. For shared file and Gluster storage, disk space is not
a function of any one node. If storage types with disk space reporting are
used, ``hail`` expects them to give useful numbers for allocation purposes, but
with a shared storage system disk balancing is no longer affected by
node-instance allocation. Moreover, it would be wasteful to mount a Gluster
volume on each node just to run ``statvfs()`` if no machine was actually
running Gluster-backed instances.
| |
As a result, Gluster support for ``gnt-node list-storage`` is necessarily
limited, and nodes on which Gluster is available but not in use will report
failures. Additionally, running ``gnt-node list`` will give an output like
this::
| |
| Node DTotal DFree MTotal MNode MFree Pinst Sinst |
| node1.example.com ? ? 744M 273M 477M 0 0 |
| node2.example.com ? ? 744M 273M 477M 0 0 |
| |
| This is expected and consistent with behaviour in RBD. |
| |
| An alternative would have been to report DTotal and DFree as 0 in order to allow |
| ``hail`` to ignore the disk information, but this incorrectly populates the |
| ``gnt-node list`` DTotal and DFree fields with 0s as well. |
| |
| New configuration switches |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
Configurable at the cluster and node group level (``gnt-cluster modify``,
``gnt-group modify`` and other commands that support the ``-D`` switch to edit
disk parameters):
| |
| ``gluster:host`` |
| The IP address or hostname of the Gluster server to connect to. In the default |
| deployment of Gluster, that is any machine that is hosting a brick. |
| |
| Default: ``"127.0.0.1"`` |
| |
| ``gluster:port`` |
  The port on which the Gluster server is listening.
| |
| Default: ``24007`` |
| |
| ``gluster:volume`` |
| The volume Ganeti should use. |
| |
| Default: ``"gv0"`` |
| |
| Configurable at the cluster level only (``gnt-cluster init``) and stored in |
| ssconf for all nodes to read (just like shared file): |
| |
| ``--gluster-dir`` |
| Where the Gluster volume should be mounted. |
| |
| Default: ``/var/run/ganeti/gluster`` |
| |
The default values work if all of the Ganeti nodes also host Gluster bricks.
This is possible, but `not` recommended, as it can cause the host to hardlock
due to memory deadlocks in the kernel (in much the same way as RBD).
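
As an illustration, using the switch and disk parameters proposed above (host
and volume names are examples only), the mount directory is chosen at cluster
initialization time and the remaining settings can be changed later::

  $ gnt-cluster init --gluster-dir=/var/run/ganeti/gluster ... \
      cluster.example.com
  $ gnt-cluster modify -D gluster:host=gluster.example.com,volume=gv0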
| |
| Future work |
| =========== |
| |
| In no particular order: |
| |
| * Support the UDP transport. |
| * Support mounting through NFS. |
* Filter ``gnt-node list`` so DTotal and DFree are not shown for RBD and shared
  file disk types, or otherwise report the disk storage values as "-" or some
  other special value to clearly distinguish them from the result of a
  communication failure between nodes.
| * Allow configuring the in-volume path Ganeti uses. |
| |
| .. vim: set textwidth=72 : |
| .. Local Variables: |
| .. mode: rst |
| .. fill-column: 72 |
| .. End: |