====================================
 Synchronising htools to Ganeti 2.3
====================================

Ganeti 2.3 introduces a number of new features that change the cluster
internals significantly enough that the htools suite needs to be
updated accordingly in order to function correctly.

Shared storage support
======================

Currently, the htools algorithms presume a model where all of an
instance's resources are served from within the cluster, more
specifically from the nodes comprising the cluster. While this is
usual for memory and CPU, deployments which use shared storage will
invalidate this assumption for storage.

To account for this, we need to move some assumptions from being
implicit (and hardcoded) to being explicitly exported from Ganeti.


New instance parameters
-----------------------

It is presumed that Ganeti will export for all instances a new
``storage_type`` parameter that will denote either internal storage
(e.g. *plain* or *drbd*), or external storage.

Furthermore, a new ``storage_pool`` parameter will classify, for both
internal and external storage, the pool out of which the storage is
allocated. For internal storage, this will be either ``lvm`` (the pool
that provides space to both ``plain`` and ``drbd`` instances) or
``file`` (for file-storage-based instances). For external storage,
this will be the respective NAS/SAN/cloud storage that backs the
instance. Note that for htools, external storage pools are opaque; we
only care that they have an identifier, so that we can distinguish
between two different pools.

If these two parameters are not present, the instances will be
presumed to be ``internal/lvm``.
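
A minimal sketch of how htools could model these two parameters; the
type and function names used here (``InstStorage``, ``parseStorage``)
are hypothetical and only serve to illustrate the default::

  -- Hypothetical htools-side model of the new instance parameters.
  import Data.Maybe (fromMaybe)

  data StorageKind = Internal | External
    deriving (Eq, Show)

  data InstStorage = InstStorage
    { stKind :: StorageKind  -- from the ``storage_type`` parameter
    , stPool :: String       -- from the ``storage_pool`` parameter
    } deriving (Eq, Show)

  -- Instances lacking the new parameters default to internal/lvm.
  parseStorage :: Maybe String -> Maybe String -> InstStorage
  parseStorage mtype mpool =
    InstStorage (maybe Internal toKind mtype) (fromMaybe "lvm" mpool)
    where toKind "external" = External
          toKind _          = Internal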

New node parameters
-------------------

For each node, it is expected that Ganeti will export what storage
types it supports and which pools it has access to. Thus a classic 2.2
cluster will have all nodes supporting ``internal/lvm`` and/or
``internal/file``, whereas a new, shared-storage-only 2.3 cluster
could have ``external/my-nas`` storage.

Whatever the mechanism that Ganeti will use internally to configure
the associations between nodes and storage pools, we assume that two
node attributes will be available inside htools: the lists of internal
and external storage pools.
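
As an illustration only (the attribute names ``nodeIntPools``,
``nodeExtPools`` and the helper ``canHost`` are hypothetical), the
node side could then look like::

  -- Hypothetical node attributes: the pool lists visible to htools.
  data NodeStorage = NodeStorage
    { nodeIntPools :: [String]  -- internal pools, e.g. ["lvm", "file"]
    , nodeExtPools :: [String]  -- external pools, e.g. ["my-nas"]
    } deriving (Show)

  -- Can this node host storage allocated from the given pool?
  canHost :: NodeStorage -> Bool -> String -> Bool
  canHost node isExternal pool
    | isExternal = pool `elem` nodeExtPools node
    | otherwise  = pool `elem` nodeIntPools node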

External storage and instances
------------------------------

Currently, for an instance we allow one cheap move type: failover to
the current secondary, if it is a healthy node, and four other
“expensive” (as in, including data copies) moves that involve changing
either the secondary or the primary node or both.

In the presence of an external storage type, the following things will
change:

- the disk-based moves will be disallowed; this is already a feature
  in the algorithm, controlled by a boolean switch, so supporting
  external storage here will be trivial
- instead of the current single secondary node, the secondaries will
  become a list of potential secondaries, based on access to the
  instance's storage pool

Except for this, the basic move algorithm remains unchanged.
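
A sketch of the changed secondary handling (the types are again
hypothetical); the only point is that an externally-stored instance
carries a list of candidate secondaries instead of a single node::

  import Data.List (delete)

  type NodeIdx = Int

  -- Internal storage keeps its single secondary; external storage
  -- keeps all nodes with access to the instance's storage pool.
  data Secondaries = FixedSecondary NodeIdx
                   | PotentialSecondaries [NodeIdx]
                   deriving (Show)

  -- Nodes to consider as targets when moving an instance away from
  -- its current primary node.
  moveTargets :: NodeIdx -> Secondaries -> [NodeIdx]
  moveTargets _     (FixedSecondary snode)       = [snode]
  moveTargets pnode (PotentialSecondaries nodes) = delete pnode nodes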

External storage and nodes
--------------------------

Two separate areas will have to change for nodes and external storage.

First, when allocating instances (either as part of a move or a new
allocation), if the instance is using external storage, then the
internal disk metrics should be ignored (for both the primary and
secondary cases).

Second, the per-node metrics used in the cluster scoring must take
into account that nodes might not have internal storage at all, and
handle this as a well-balanced case (score 0).
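
For the second point, the disk part of the per-node score could be
guarded roughly as follows (hypothetical names; the real htools score
is a sum over several normalised metrics)::

  -- Hypothetical per-node disk data: shared-storage-only nodes have
  -- no internal storage at all.
  data NodeDisk = NoInternalDisk
                | InternalDisk Double Double  -- total, free (MiB)

  -- Fraction of internal disk in use; a node without internal
  -- storage is treated as perfectly balanced (score 0).
  diskScore :: NodeDisk -> Double
  diskScore NoInternalDisk = 0
  diskScore (InternalDisk total free)
    | total <= 0 = 0
    | otherwise  = (total - free) / total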

N+1 status
----------

Currently, computing the N+1 status of a node is simple:

- group the current secondary instances by their primary node, and
  compute the sum of each instance group memory
- choose the maximum sum, and check if it's smaller than the current
  available memory on this node
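
This check translates almost directly into code; a sketch with
hypothetical types, where each secondary instance is given as a
*(primary node, memory)* pair::

  import qualified Data.Map.Strict as M

  type NodeIdx = Int

  nPlusOneOk :: Double              -- free memory on this node
             -> [(NodeIdx, Double)] -- its secondary instances
             -> Bool
  nPlusOneOk freeMem secondaries =
    let perPrimary = M.fromListWith (+) secondaries
        worstCase  = if M.null perPrimary
                       then 0
                       else maximum (M.elems perPrimary)
    in worstCase <= freeMem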

In effect, computing the N+1 status is a per-node matter. However,
with shared storage, we don't have secondary nodes, just potential
secondaries. Thus computing the N+1 status will be a cluster-level
matter, and much more expensive.

A simple version of the N+1 check would be to verify, for each
instance having a given node as primary, that there is enough memory
in the cluster for its relocation. This means we would actually need
to run allocation checks, and update the cluster status from within
allocation on one node, while being careful not to recursively check
the N+1 status during this relocation, which would be too expensive.

However, the shared storage model has some properties that change the
rules of the computation. Broadly speaking (and ignoring hard
restrictions like tag-based exclusion and CPU limits), the exact
location of an instance in the cluster doesn't matter as long as
memory is available. This results in two changes:

- simply tracking the free memory buckets is enough, cluster-wide
- moving an instance from one node to another would not change the N+1
  status of any node, and only allocation needs to deal with N+1
  checks
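
A sketch of this cheap, memory-only check; note that the greedy
placement below is an assumption of this example, not part of the
design, and it deliberately ignores all other constraints::

  import Data.List (sortOn)
  import Data.Ord (Down (..))

  -- Greedy approximation: place the instances (largest first) into
  -- the node with the most free memory; memory is the only resource
  -- tracked, per the simplification described above.
  fitsInBuckets :: [Double]  -- free memory per surviving node
                -> [Double]  -- memory of the instances to relocate
                -> Bool
  fitsInBuckets buckets insts = go buckets (sortOn Down insts)
    where
      go _  []     = True
      go bs (i:is) =
        case sortOn Down bs of
          (b:rest) | b >= i -> go ((b - i) : rest) is
          _                 -> False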

Unfortunately, this very cheap solution fails in case of any other
exclusion or prevention factors.

TODO: find a solution for N+1 checks.


Node groups support
===================

The addition of node groups has a small impact on the actual
algorithms, which will simply operate at node group level instead of
cluster level, but it requires the addition of new algorithms for
inter-node group operations.

The following definitions will be used in the paragraphs below:

local group
  The local group refers to a node's own node group, or when speaking
  about an instance, the node group of its primary node

regular cluster
  A cluster composed of a single node group, or a pre-2.3 cluster

super cluster
  This term refers to a cluster which comprises multiple node groups,
  as opposed to a 2.2 and earlier cluster with a single node group

In all the operations below, it's assumed that Ganeti can gather the
entire super cluster state cheaply.


Balancing changes
-----------------

Balancing will move from cluster-level balancing to group
balancing. In order to achieve a reasonable improvement in a super
cluster, without needing to keep track of which groups have already
been balanced, the balancing algorithm will run as follows:

#. the cluster data is gathered
#. if this is a regular cluster, as opposed to a super cluster,
   balancing will proceed as before
#. otherwise, compute the cluster scores for all groups
#. choose the group with the worst score and see if we can improve it;
   if not, choose the next-worst group, and so on
#. once a group has been identified, run the balancing for it

Of course, explicit selection of a group will be allowed.
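
A sketch of the group selection step; the score values are assumed to
come from the existing htools cluster scoring (lower is better), and
``canImprove`` stands in for a trial balancing run::

  import Data.List (find, sortOn)

  type GroupIdx = Int

  -- Groups ordered from worst (highest) to best (lowest) score.
  groupsByWorstScore :: [(GroupIdx, Double)] -> [GroupIdx]
  groupsByWorstScore = map fst . sortOn (negate . snd)

  -- First group, worst score first, that a trial balancing run can
  -- actually improve, if any.
  firstImprovable :: (GroupIdx -> Bool) -> [(GroupIdx, Double)]
                  -> Maybe GroupIdx
  firstImprovable canImprove = find canImprove . groupsByWorstScore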

Super cluster operations
++++++++++++++++++++++++

Besides the regular group balancing, in a super cluster we have
additional operations.


Redistribution
^^^^^^^^^^^^^^

In a regular cluster, once we run out of resources (offline nodes
which can't be fully evacuated, N+1 failures, etc.) there is nothing
we can do unless nodes are added or instances are removed.

In a super cluster however, there might be resources available in
another group, so there is the possibility of relocating instances
between groups to re-establish N+1 success within each group.

One difficulty in the presence of both super clusters and shared
storage is that the move paths of instances are quite complicated;
basically an instance can move inside its local group, and to any
other group which has access to the same storage type and storage
pool pair. In effect, the super cluster is composed of multiple
‘partitions’, each containing one or more groups, but a node is
simultaneously present in multiple partitions, one for each storage
type and storage pool it supports. As such, for non-trivial clusters,
the interactions between the individual partitions are too complex to
assume we can compute a perfect solution: we might need to move some
instances using shared storage pool ‘A’ in order to free more memory
to accept an instance using local storage, which would in turn free
more VCPUs in a third partition, and so on. We'll therefore limit
ourselves to simple relocation steps within a single partition.

Algorithm:

#. read super cluster data, and exit if cluster doesn't allow
   inter-group moves
#. filter out any groups that are “alone” in their partition
   (i.e. no other group sharing at least one storage method)
#. determine list of healthy versus unhealthy groups:

   #. a group which contains offline nodes still hosting instances is
      definitely not healthy
   #. a group which has nodes failing N+1 is ‘weakly’ unhealthy

#. if either list is empty, exit (no work to do, or no way to fix
   problems)
#. for each unhealthy group:

   #. compute the instances that are causing the problems: all
      instances living on offline nodes, all instances living as
      secondary on N+1 failing nodes, all instances living as primaries
      on N+1 failing nodes (in this order)
   #. remove instances, one by one, until the source group is healthy
      again
   #. try to run a standard allocation procedure for each instance on
      all potential groups in its partition
   #. if all instances were relocated successfully, it means we have a
      solution for repairing the original group
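
The partition filtering and the health classification (steps 2 and 3
above) could be sketched as follows, with a “storage method”
represented as a *(storage type, storage pool)* pair and all names
hypothetical::

  import Data.List (intersect)

  type Group = String
  type StorageMethod = (String, String)  -- (type, pool)

  -- Step 2: keep only groups sharing at least one storage method with
  -- some other group; the rest cannot take part in inter-group moves.
  sharedGroups :: [(Group, [StorageMethod])]
               -> [(Group, [StorageMethod])]
  sharedGroups groups =
    [ grp | grp@(name, methods) <- groups
          , any (\(other, ms) -> other /= name
                                 && not (null (methods `intersect` ms)))
                groups ]

  -- Step 3: health classification of a group.
  data Health = Unhealthy  -- offline nodes still hosting instances
              | Weak       -- some nodes fail N+1
              | Healthy
              deriving (Eq, Show)

  classify :: Bool -> Bool -> Health
  classify hasOfflinePrimaries hasNPlusOneFailures
    | hasOfflinePrimaries = Unhealthy
    | hasNPlusOneFailures = Weak
    | otherwise           = Healthy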

Compression
^^^^^^^^^^^

In a super cluster which has had many instance reclamations, it is
possible that while none of the groups is empty, overall there is
enough empty capacity that an entire group could be removed.

The algorithm for “compressing” the super cluster is as follows:

#. read super cluster data
#. compute total *(memory, disk, cpu)*, and free *(memory, disk, cpu)*
   for the super cluster
#. compute per-group used and free *(memory, disk, cpu)*
#. select candidate groups for evacuation:

   #. they must be connected to other groups via a common storage type
      and pool
   #. they must have fewer used resources than the global free
      resources (minus their own free resources)

#. for each of these groups, try to relocate all its instances to
   connected peer groups
#. report the list of groups that could be evacuated, or if instructed
   so, perform the evacuation of the group with the largest free
   resources (i.e. in order to reclaim the most capacity)
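
The candidate check in step 4 reduces to a simple comparison of
resource vectors; a hedged sketch (the storage connectivity part is
checked separately)::

  -- Per-group resource totals, as (memory, disk, cpu).
  type Resources = (Double, Double, Double)

  rSub :: Resources -> Resources -> Resources
  rSub (a, b, c) (x, y, z) = (a - x, b - y, c - z)

  rFits :: Resources -> Resources -> Bool
  rFits (a, b, c) (x, y, z) = a <= x && b <= y && c <= z

  -- A group is an evacuation candidate if its used resources fit in
  -- the free resources of the rest of the super cluster.
  isCandidate :: Resources  -- global free resources
              -> Resources  -- the group's own free resources
              -> Resources  -- the group's used resources
              -> Bool
  isCandidate globalFree ownFree used =
    used `rFits` (globalFree `rSub` ownFree)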

Load balancing
^^^^^^^^^^^^^^

Assuming a super cluster using shared storage, where instance failover
is cheap, it should be possible to do a load-based balancing across
groups.

As opposed to the normal balancing, where we want to balance on all
node attributes, here we should look only at the load attributes; in
other words, compare the available (total) node capacity with the
(total) load generated by instances in a given group, compute such
scores for all groups, and check whether there are any outliers.

Once a reliable load-weighting method for groups exists, we can apply
a modified version of the cluster scoring method to score not
imbalances across nodes, but imbalances across groups, resulting in a
load-related score for the super cluster.
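
One possible shape for such a load-based group score, using a simple
standard deviation over per-group load ratios (a sketch, not the
final metric)::

  -- Ratio of instance-generated load to total node capacity in a
  -- group; capacity and load are assumed to be in the same unit.
  groupLoad :: Double -> Double -> Double
  groupLoad capacity load
    | capacity <= 0 = 0
    | otherwise     = load / capacity

  -- Spread of the per-group load ratios; a high value means some
  -- groups are outliers and inter-group moves could help.
  loadImbalance :: [Double] -> Double
  loadImbalance [] = 0
  loadImbalance xs = sqrt (sumSq / fromIntegral (length xs))
    where mean  = sum xs / fromIntegral (length xs)
          sumSq = sum [ (x - mean) ^ (2 :: Int) | x <- xs ]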

Allocation changes
------------------

It is important to keep the allocation method across groups internal
(in the Ganeti/IAllocator combination), instead of delegating it to an
external party (e.g. a RAPI client). For this, the IAllocator protocol
should be extended to provide proper group support.

For htools, the new algorithm will work as follows:

#. read/receive cluster data from Ganeti
#. filter out any groups that do not support the requested storage
   method
#. for remaining groups, try allocation and compute scores after
   allocation
#. sort valid allocation solutions accordingly and return the entire
   list to Ganeti

The rationale for returning the entire group list, and not only the
best choice, is that we have the list anyway, and Ganeti might have
other criteria (e.g. the best group might be busy or locked down), so
even if a group is the best choice from the point of view of
resources, it might not be the overall best one.
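
In code, the group-aware allocation could be shaped roughly as below;
``supports`` and ``tryAlloc`` are placeholders for the existing
per-group checks and allocation, and a lower score is better::

  import Data.List (sortOn)

  type Group = String

  -- Try the allocation in every group supporting the requested
  -- storage method and return *all* valid solutions, best first, so
  -- that Ganeti can apply its own additional criteria.
  allocateAcrossGroups :: (Group -> Bool)          -- storage supported?
                       -> (Group -> Maybe Double)  -- score, if it fits
                       -> [Group]
                       -> [(Group, Double)]
  allocateAcrossGroups supports tryAlloc groups =
    sortOn snd
      [ (g, score) | g <- groups, supports g
                   , Just score <- [tryAlloc g] ]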

Node evacuation changes
-----------------------

While the basic concept of the ``multi-evac`` iallocator mode remains
unchanged (it's a simple local group issue), when an evacuation fails
in a super cluster, resources might be available elsewhere in the
cluster for the evacuation.

The algorithm for computing this will be the same as the one for super
cluster compression and redistribution, except that the list of
instances is fixed to the ones living on the nodes to be evacuated.

If the inter-group relocation is successful, the result to Ganeti will
not be a local group evacuation target, but instead (for each
instance) a pair *(remote group, nodes)*. Ganeti itself will have to
decide (based on user input) whether to continue with inter-group
evacuation or not.
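
The per-instance result could be represented along these lines
(hypothetical types; the final decision stays with Ganeti)::

  type Group = String
  type Node  = String

  -- Either nodes inside the local group, or a remote group together
  -- with the target nodes inside that group.
  data EvacTarget = LocalNodes [Node]
                  | RemoteGroup Group [Node]
                  deriving (Show)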

If Ganeti doesn't provide complete cluster data, but just the local
group, the inter-group relocation won't be attempted.

.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: