| ============================= |
| Ganeti shared storage support |
| ============================= |
| |
| This document describes the changes in Ganeti 2.3+ compared to the
| Ganeti 2.3 storage model. It also documents the ExtStorage Interface.
| |
| .. contents:: :depth: 4 |
| .. highlight:: shell-example |
| |
| Objective |
| ========= |
| |
| The aim is to introduce support for externally mirrored, shared storage. |
| This includes two distinct disk templates: |
| |
| - A shared filesystem containing instance disks as regular files,
|   typically residing on a networked or cluster filesystem (e.g. NFS,
|   AFS, Ceph, OCFS2, etc.).
| - Instance images being shared block devices, typically LUNs residing on |
| a SAN appliance. |
| |
| Background |
| ========== |
| |
| DRBD is currently the only shared storage backend supported by Ganeti. |
| DRBD offers the advantages of high availability while running on |
| commodity hardware at the cost of high network I/O for block-level |
| synchronization between hosts. DRBD's master-slave model has greatly |
| influenced Ganeti's design, primarily by introducing the concept of |
| primary and secondary nodes and thus defining an instance's “mobility |
| domain”. |
| |
| Although DRBD has many advantages, many sites choose to use networked
| storage appliances for virtual machine hosting, such as SAN and/or
| NAS, which provide shared storage without the administrative overhead
| of DRBD or the limitation of a 1:1 master-slave setup. Furthermore,
| new distributed filesystems such as Ceph are becoming viable
| alternatives to expensive storage appliances. Support for both modes
| of operation, i.e. a shared block storage and a shared file storage
| backend, would make Ganeti a robust choice for high-availability
| virtualization clusters.
| |
| Throughout this document, the term “externally mirrored storage” will
| refer to both modes of shared storage, indicating that Ganeti does not
| need to take care of mirroring the data from one host to another.
| |
| Use cases |
| ========= |
| |
| We consider the following use cases: |
| |
| - A virtualization cluster with FibreChannel shared storage, mapping at |
| least one LUN per instance, accessible by the whole cluster. |
| - A virtualization cluster with instance images stored as files on an |
| NFS server. |
| - A virtualization cluster storing instance images on a Ceph volume. |
| |
| Design Overview |
| =============== |
| |
| The design addresses the following procedures: |
| |
| - Refactoring of all code referring to constants.DTS_NET_MIRROR. |
| - Obsolescence of the primary-secondary concept for externally mirrored |
| storage. |
| - Introduction of a shared file storage disk template for use with networked |
| filesystems. |
| - Introduction of a shared block device disk template with device |
| adoption. |
| - Introduction of the External Storage Interface. |
| |
| Additionally, mid- to long-term goals include: |
| |
| - Support for external “storage pools”. |
| |
| Refactoring of all code referring to constants.DTS_NET_MIRROR |
| ============================================================= |
| |
| Currently, all storage-related decision-making depends on a number of |
| frozensets in lib/constants.py, typically constants.DTS_NET_MIRROR. |
| However, constants.DTS_NET_MIRROR is used to signify two different |
| attributes: |
| |
| - A storage device that is shared |
| - A storage device whose mirroring is supervised by Ganeti |
| |
| We propose the introduction of two new frozensets to ease |
| decision-making: |
| |
| - constants.DTS_EXT_MIRROR, holding externally mirrored disk templates |
| - constants.DTS_MIRRORED, being a union of constants.DTS_EXT_MIRROR and |
| DTS_NET_MIRROR. |
| |
| Additionally, DTS_NET_MIRROR will be renamed to DTS_INT_MIRROR to reflect |
| the status of the storage as internally mirrored by Ganeti. |
| |
| Thus, checks could be grouped into the following categories: |
| |
| - Mobility checks, i.e. whether an instance failover or migration is
|   possible, should check against constants.DTS_MIRRORED.
| - Syncing actions should be performed only for templates in
|   constants.DTS_INT_MIRROR (the renamed DTS_NET_MIRROR).
| |
| Obsolescence of the primary-secondary node model |
| ================================================ |
| |
| The primary-secondary node concept has primarily evolved through the
| use of DRBD. In a globally shared storage framework without the need
| for external sync (e.g. SAN, NAS, etc.), such a notion does not apply
| for the following reasons:
| |
| 1. Access to the storage does not necessarily imply different roles for |
| the nodes (e.g. primary vs secondary). |
| 2. The same storage is available to potentially more than 2 nodes.
|    Thus, an instance backed by a SAN LUN, for example, may actually
|    migrate to any of the other nodes and not just a pre-designated
|    failover node.
| |
| The proposed solution is to use the iallocator framework for run-time
| decision making during migration and failover, for instances with
| disk templates in constants.DTS_EXT_MIRROR. Modifications to
| gnt-instance and gnt-node will be required so that these operations
| accept a target node and/or an iallocator specification (example
| invocations are sketched after the list below). Modifications to the
| iallocator protocol will be required to address at least the
| following needs:
| |
| - Allocation tools must be able to distinguish between internal and |
| external storage |
| - Migration/failover decisions must take into account shared storage |
| availability |
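|
| For illustration, the modified operations might then be invoked as
| follows (hypothetical command lines; the exact option names are not
| part of this design)::
|
|   $ gnt-instance migrate -n node3.example.com instance1
|   $ gnt-instance failover --iallocator hail instance1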
| |
| Introduction of a shared file disk template |
| =========================================== |
| |
| Basic shared file storage support can be implemented by creating a new |
| disk template based on the existing FileStorage class, with only minor |
| modifications in lib/bdev.py. The shared file disk template relies on a |
| shared filesystem (e.g. NFS, AFS, Ceph, OCFS2 over SAN or DRBD) being |
| mounted on all nodes under the same path, where instance images will be |
| saved. |
| |
| A new cluster initialization option will be added to specify the
| mountpoint of the shared filesystem.
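|
| For illustration, cluster setup and instance creation could then look
| like this (hypothetical command lines; the option and template names
| are assumptions at this stage)::
|
|   $ gnt-cluster init --shared-file-storage-dir=/srv/shared cluster1
|   $ gnt-instance add -t sharedfile --disk=0:size=2G instance1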
| |
| The remainder of this document deals with shared block storage. |
| |
| Introduction of a shared block device template |
| ============================================== |
| |
| Basic shared block device support will be implemented with an additional |
| disk template. This disk template will not feature any kind of storage |
| control (provisioning, removal, resizing, etc.), but will instead rely |
| on the adoption of already-existing block devices (e.g. SAN LUNs, NBD |
| devices, remote iSCSI targets, etc.). |
| |
| The shared block device template will make the following assumptions: |
| |
| - The adopted block device has a consistent name across all nodes,
|   enforced e.g. via udev rules.
| - The device is available under the same path on all nodes of the
|   node group.
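|
| For illustration, adopting such a device could look like this (a
| hypothetical command line; the template name and adoption syntax are
| assumptions at this stage)::
|
|   $ gnt-instance add -t blockdev \
|       --disk=0:adopt=/dev/disk/by-id/scsi-lun1 instance1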
| |
| Introduction of the External Storage Interface |
| ============================================== |
| |
| Overview |
| -------- |
| |
| To extend the shared block storage template and give Ganeti the
| ability to control and manipulate external storage (provisioning,
| removal, growing, etc.) we need a more generic approach. The generic
| method for supporting external shared storage in Ganeti will be to
| have an ExtStorage provider for each type of external shared storage
| hardware. An ExtStorage provider is a set of files (executable
| scripts and text files) contained inside a directory named after the
| provider. This directory must be present on all nodes of a nodegroup
| (Ganeti does not replicate it) in order for the provider to be usable
| (valid) by Ganeti for this nodegroup. The external shared storage
| hardware should also be accessible from all nodes of this nodegroup.
| |
| An “ExtStorage provider” will have to provide the following methods: |
| |
| - Create a disk |
| - Remove a disk |
| - Grow a disk |
| - Attach a disk to a given node |
| - Detach a disk from a given node |
| - Set info (metadata) on a disk
| - Verify its supported parameters |
| |
| The proposed ExtStorage interface borrows heavily from the OS |
| interface and follows a one-script-per-function approach. An ExtStorage |
| provider is expected to provide the following scripts: |
| |
| - ``create`` |
| - ``remove`` |
| - ``grow`` |
| - ``attach`` |
| - ``detach`` |
| - ``setinfo`` |
| - ``verify`` |
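|
| On disk, a provider is thus simply a directory containing these
| executables, e.g. (the search path shown is illustrative)::
|
|   $ ls /srv/ganeti/extstorage/sample_provider1
|   attach  create  detach  grow  remove  setinfo  verify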
| |
| All scripts will be called with no arguments and will get their input
| via environment variables. A common set of variables will be exported
| for all commands, and some commands may get extra ones.
| |
| ``VOL_NAME``
|   The name of the volume. This name is unique within Ganeti, which
|   uses it to refer to a specific volume inside the external storage.
| ``VOL_SIZE``
|   The volume's size in mebibytes.
| ``VOL_NEW_SIZE``
|   Available only to the ``grow`` script. It declares the new size of
|   the volume after grow (in mebibytes).
| ``EXTP_name``
|   ExtStorage parameter, where `name` is the parameter name in
|   upper-case (same as the OS interface's ``OSP_*`` parameters).
| ``VOL_METADATA``
|   A string containing metadata to be set for the volume. This is
|   exported only to the ``setinfo`` script.
| |
| All scripts except ``attach`` should return 0 on success and non-zero
| on error, accompanied by an appropriate error message on stderr. On
| success, the ``attach`` script should print a string on stdout,
| namely the block device's full path after it has been successfully
| attached to the host node. On error it should return non-zero.
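|
| A minimal sketch of what a provider's ``attach`` script could look
| like, assuming a hypothetical ``vendor-tool`` command that maps the
| volume on the local node and a provider-specific device path::
|
|   #!/bin/sh
|   # Hypothetical 'attach' script: make the volume visible on this
|   # node, then print the resulting block device path on stdout.
|   set -e
|
|   vendor-tool map "$VOL_NAME" >&2     # tool output goes to stderr
|   echo "/dev/mapper/$VOL_NAME"        # stdout: the device's full path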
| |
| Implementation |
| -------------- |
| |
| To support the ExtStorage interface, we will introduce a new disk |
| template called `ext`. This template will implement the existing Ganeti |
| disk interface in `lib/bdev.py` (create, remove, attach, assemble, |
| shutdown, grow, setinfo), and will simultaneously pass control to the |
| external scripts to actually handle the above actions. The `ext` disk |
| template will act as a translation layer between the current Ganeti disk |
| interface and the ExtStorage providers. |
| |
| We will also introduce a new IDISK_PARAM called `IDISK_PROVIDER = |
| provider`, which will be used at the command line to select the desired |
| ExtStorage provider. This parameter will be valid only for the `ext`
| template, e.g.::
| |
| $ gnt-instance add -t ext --disk=0:size=2G,provider=sample_provider1 |
| |
| The ExtStorage interface will allow different disks of an instance to
| be created by different providers, e.g.::
| |
| $ gnt-instance add -t ext --disk=0:size=2G,provider=sample_provider1 \ |
| --disk=1:size=1G,provider=sample_provider2 \ |
| --disk=2:size=3G,provider=sample_provider1 |
| |
| Finally, the ExtStorage interface will support passing of parameters to |
| the ExtStorage provider. This will also be done per disk, from the |
| command line:: |
| |
| $ gnt-instance add -t ext --disk=0:size=1G,provider=sample_provider1,\ |
| param1=value1,param2=value2 |
| |
| The above parameters will be exported to the ExtStorage provider's
| scripts as the following environment variables:
| |
| - `EXTP_PARAM1 = str(value1)` |
| - `EXTP_PARAM2 = str(value2)` |
| |
| We will also introduce a new Ganeti client called `gnt-storage`,
| which will be used to diagnose ExtStorage providers and show
| information about them, similarly to the way `gnt-os diagnose` and
| `gnt-os info` handle OS definitions.
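|
| For illustration, the new client might be used as follows (the exact
| subcommand names are not finalized)::
|
|   $ gnt-storage diagnose
|   $ gnt-storage info sample_provider1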
| |
| Long-term shared storage goals |
| ============================== |
| |
| Storage pool handling |
| --------------------- |
| |
| A new cluster configuration attribute will be introduced, named |
| “storage_pools”, modeled as a dictionary mapping storage pools to |
| external storage providers (see below), e.g.:: |
| |
| { |
| "nas1": "foostore", |
| "nas2": "foostore", |
| "cloud1": "barcloud", |
| } |
| |
| Ganeti will not interpret the contents of this dictionary, although it |
| will provide methods for manipulating them under some basic constraints |
| (pool identifier uniqueness, driver existence). The manipulation of |
| storage pools will be performed by implementing new options to the |
| `gnt-cluster` command:: |
| |
| $ gnt-cluster modify --add-pool nas1 foostore |
| $ gnt-cluster modify --remove-pool nas1 # There must be no instances using |
| # the pool to remove it |
| |
| Furthermore, the storage pool mappings will be used to indicate which
| pools are available to which node groups, thus specifying the
| instances' “mobility domain”.
| |
| The pool in which to put a new instance's disk will be specified on
| the command line during `instance add`, as illustrated below. This
| will become possible by replacing the IDISK_PROVIDER parameter with a
| new one, called `IDISK_POOL = pool`. The cmdlib logic will then look
| at the cluster-level mapping dictionary to determine the ExtStorage
| provider for the given pool.
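|
| For illustration, an instance creation using a pool could then look
| like this (envisioned syntax)::
|
|   $ gnt-instance add -t ext --disk=0:size=2G,pool=nas1 instance1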
| |
| gnt-storage |
| ----------- |
| |
| The ``gnt-storage`` client can be extended to support pool management
| (creation/modification/deletion of pools, connection/disconnection of
| pools to nodegroups, etc.). It can also be extended to diagnose and
| provide information for internal disk templates, such as lvm and
| drbd.
| |
| .. vim: set textwidth=72 : |