| ========================= |
| Ganeti Maintenance Daemon |
| ========================= |
| |
| .. contents:: :depth: 4 |
| |
This design document outlines the implementation of a new Ganeti
daemon coordinating all maintenance operations on a cluster
(rebalancing, disk activation, ERROR_down handling, node repair
actions).
| |
| |
| Current state and shortcomings |
| ============================== |
| |
| With ``harep``, Ganeti has a basic mechanism for repairs of instances |
| in a cluster. The ``harep`` tool can fix a broken DRBD status, migrate, |
| failover, and reinstall instances. It is intended to be run regularly, |
| e.g., via a cron job. It will submit appropriate Ganeti jobs to take |
| action within the range allowed by instance tags and keep track |
of them by recording the job ids in appropriate tags.
| |
| Besides ``harep``, Ganeti offers no further support for repair automation. |
| While useful, this setup can be insufficient in some situations. |
| |
Failures in actual hardware, e.g., a physical disk, currently require
coordination around Ganeti: the hardware failure is detected on the node,
Ganeti needs to be told to evacuate the node, and, once this is done, some
other entity needs to coordinate the actual physical repair. Currently,
Ganeti provides no support for automatically preparing everything for a
hardware swap.
| |
| |
| Proposed changes |
| ================ |
| |
We propose the addition of a new daemon, called ``maintd``, that will
coordinate cluster balancing actions, instance repair actions, and the
work needed for hardware repairs of individual nodes. The information
about the work to be done will be obtained from a dedicated data collector
via the :doc:`design-monitoring-agent`.
| |
| Self-diagnose data collector |
| ---------------------------- |
| |
The monitoring daemon will get one additional dedicated data collector for
node health. The collector will call an external command supposed to carry
out any hardware-specific diagnosis for the node it is running on. That
command is configurable, but needs to be white-listed ahead of time by the
node. For convenience, the empty string will stand for a built-in diagnosis
that always reports that everything is OK; this will also be the default
value for this collector.
| |
Note that the self-diagnose data collector itself can, and usually will,
call separate diagnostic tools for separate subsystems. However, it always
has to provide a consolidated description of the overall health state
of the node.
| |
| Protocol |
| ~~~~~~~~ |
| |
| The collector script takes no arguments and is supposed to output the string |
| representation of a single JSON object where the individual fields have the |
| following meaning. Note that, if several things are broken on that node, the |
| self-diagnose collector script has to merge them into a single repair action. |
| |
| status |
| ...... |
| |
This is a JSON string with one of the values ``Ok``, ``live-repair``,
``evacuate``, or ``evacuate-failover``, indicating the overall need for
repair and the Ganeti actions to be taken. The meanings are as follows.

- ``Ok``: no action is needed.

- ``live-repair``: some action is needed that can be taken while instances
  continue to run on that node.

- ``evacuate``: it is necessary to evacuate and offline the node.

- ``evacuate-failover``: it is necessary to evacuate and offline the node
  without attempting live migrations.
| |
| command |
| ....... |
| |
If the status is ``live-repair``, a repair command can be specified.
This command will be executed as a repair action following the
:doc:`design-restricted-commands`, extended, however, to read information
on ``stdin``. The whole diagnose JSON object will be provided on ``stdin``
to those commands.
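
As an illustration, such a live-repair command could be a small script that
reads the diagnosis from ``stdin`` and acts on the subsystem-specific
details. The following is a minimal, hypothetical sketch, not part of this
design::

  #!/usr/bin/env python
  # Hypothetical live-repair command: the whole diagnose JSON object
  # arrives on stdin, as described above.
  import json
  import sys

  diagnosis = json.load(sys.stdin)
  details = diagnosis.get("details")  # opaque, collector-defined payload
  # ... carry out the subsystem-specific live repair based on `details` ...
  sys.exit(0)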
| |
| details |
| ....... |
| |
An opaque JSON value that the repair daemon will just pass through and
export. It is intended to contain information about the type of repair
that needs to be done after the respective Ganeti action is finished.
E.g., it might contain information about which piece of hardware is to be
swapped, once the node is fully evacuated and offlined.
| |
As two failures are considered different if the outputs of the script
encode different JSON objects, the collector script should ensure that,
as long as the hardware status does not change, its output is stable;
otherwise, several distinct events would be reported for the same failure.
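
To illustrate the protocol, a diagnosis requesting a live repair could look
as follows; the command name and the contents of ``details`` are made-up
examples, as both depend entirely on the cluster's repair tooling::

  {
    "status": "live-repair",
    "command": "remount-data-disk",
    "details": { "subsystem": "disk", "device": "/dev/sdb" }
  }

A healthy node might simply, and stably, report ``{"status": "Ok"}``.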
| |
| Security considerations |
| ~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| Command execution |
| ................. |
| |
Obviously, running arbitrary commands that are part of the configuration
poses a security risk. Note that an underlying design goal of Ganeti is
that an attacker, even with RAPI credentials, cannot obtain data from
within the instances. As monitoring, however, is configurable via RAPI,
we require the node to white-list the command using a mechanism similar
to the :doc:`design-restricted-commands`; in our case, the white-listing
directory will be ``/etc/ganeti/node-diagnose-commands``.
| |
| For the repair-commands, as mentioned, we extend the |
| :doc:`design-restricted-commands` by allowing input on ``stdin``. All other |
| restrictions, in particular the white-listing requirement, remain. The |
| white-listing directory will be ``/etc/ganeti/node-repair-commands``. |
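
Assuming the same convention as for restricted commands, namely that the
white-listing directory contains the approved executables themselves, a node
prepared for the hypothetical commands used as examples in this document
would look like::

  /etc/ganeti/node-diagnose-commands/
      node-diagnose        # the white-listed self-diagnose command
  /etc/ganeti/node-repair-commands/
      remount-data-disk    # a white-listed live-repair command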
| |
| Result forging |
| .............. |
| |
As the repair daemon will take real Ganeti actions based on the diagnosis
reported by the self-diagnose script through the monitoring daemon, we
need to verify the integrity of such reports to avoid denial-of-service by
fraudulent error reports. Therefore, the monitoring daemon will sign
the result with an HMAC signature using the cluster HMAC key, in the same
way as is done in the ``confd`` wire protocol (see :doc:`design-2.1`).
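
As a rough sketch of that signing step, assuming the ``confd``-style message
layout (``msg``, ``salt``, and ``hmac`` fields, with the HMAC computed over
the salt concatenated with the message), the monitoring daemon would wrap a
diagnosis roughly as follows; function and parameter names are illustrative
only::

  import hashlib
  import hmac
  import json

  def sign_diagnosis(cluster_hmac_key, diagnosis, salt):
      # cluster_hmac_key: byte content of the cluster HMAC key file
      msg = json.dumps(diagnosis)
      mac = hmac.new(cluster_hmac_key, (salt + msg).encode("utf-8"),
                     hashlib.sha1).hexdigest()
      return {"msg": msg, "salt": salt, "hmac": mac}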
| |
| Repair-event life cycle |
| ----------------------- |
| |
Once a repair event is detected, a unique identifier is assigned to it.
As long as the node-health collector returns the same output (as a JSON
object), this is still considered the same event.
This identifier can be used to cancel an observed event at any time; for
this purpose, appropriate command-line and RAPI endpoints will be provided.
Cancelling an event tells the repair daemon not to take any actions for this
event (despite them being requested) and to forget about it as soon as it is
no longer observed.
| |
Corresponding Ganeti actions will be initiated and the success or failure
of these Ganeti jobs monitored. All jobs submitted by the repair daemon
will carry the string ``gnt:daemon:maintd`` and the event identifier
in their reason trail, so that :doc:`design-optables` is possible.
Once a job fails, no further jobs will be submitted for this event
to avoid further damage; the repair action is considered failed in this case.
| |
Once all requested actions have succeeded, or one has failed, the node where
the event was observed will be tagged with a tag starting with
``maintd:repairready:`` or ``maintd:repairfailed:``, respectively, where the
event identifier is encoded in the rest of the tag. On the one hand, this tag
can be used as an additional verification whether a node is ready for a
specific repair. However, its main purpose is to provide a simple and uniform
interface to acknowledge an event. Once a ``maintd:repairready:`` tag is
removed, the maintenance daemon will forget about this event as soon as it is
no longer observed by any monitoring daemon. Removing a
``maintd:repairfailed:`` tag will make the maintenance daemon unconditionally
forget the event; note that, if the underlying problem is not fixed yet, this
provides an easy way of restarting a repair flow.
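
Acknowledging an event therefore only requires standard tag operations,
e.g. (node name and event identifier being placeholders)::

  # acknowledge a finished repair; maintd forgets the event once it is
  # no longer observed
  gnt-node remove-tags node1.example.com maintd:repairready:<event-id>

  # drop a failed event unconditionally, restarting the repair flow
  gnt-node remove-tags node1.example.com maintd:repairfailed:<event-id>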
| |
| |
| Repair daemon |
| ------------- |
| |
| The new daemon ``maintd`` will be running on the master node only. It will |
| verify the master status of its node by popular vote in the same way as all the |
| other master-only daemons. If started on a non-master node, it will exit |
| immediately with exit code ``exitNotmaster``, i.e., 11. |
| |
| External Reporting Protocol |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
Upon successful start, the daemon will bind to a port overridable on the
command line, by default 1816, on the master network device. There it will
serve the current repair state via HTTP. All queries will be HTTP GET
requests and all answers will be encoded in JSON format. Initially, the
following requests will be supported.
| |
| ``/`` |
| ..... |
| |
| Returns the list of supported protocol versions, initially just ``[1]``. |
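
For example, querying the daemon on the master node (host name made up)
would yield::

  $ curl http://master.example.com:1816/
  [1]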
| |
| ``/1/status`` |
| ............. |
| |
| Returns a list of all non-cleared incidents. Each incident is reported |
| as a JSON object with at least the following information. |
| |
| - ``uuid`` The unique identifier assigned to the event. |
| |
- ``node`` The UUID of the node on which the event was observed.
| |
- ``original`` The verbatim JSON object reported by the self-diagnose data
  collector.
| |
| - ``repair-status`` A string describing the progress made on this event so |
| far. It is one of the following. |
| |
  + ``noted`` The event has been observed, but no action has been taken yet.
| |
| + ``pending`` At least one job has been submitted in reaction to the event |
| and none of the submitted jobs has failed so far. |
| |
| + ``canceled`` The event has been canceled, i.e., ordered to be ignored, but |
| is still observed. |
| |
| + ``failed`` At least one of the submitted jobs has failed. To avoid further |
| damage, the repair daemon will not take any further action for this event. |
| |
| + ``completed`` All Ganeti actions associated with this event have been |
| completed successfully, including tagging the node. |
| |
- ``jobs`` The list of the numbers of Ganeti jobs submitted in response to
  this event.
| |
- ``tag`` The tag that either has been added to the node or, if the repair
  event is not yet finalized, will be added in case of success.
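
As an illustration, an answer for a cluster with a single incident being
worked on could look as follows; all identifiers and values are made up::

  [
    { "uuid": "26e535b8-7555-49ca-9b3c-803c3bbec29c",
      "node": "b1f8b0b9-82f4-4a51-9f92-7325fdb783a3",
      "original": { "status": "evacuate",
                    "details": { "subsystem": "disk", "device": "/dev/sdb" } },
      "repair-status": "pending",
      "jobs": [4711, 4712],
      "tag": "maintd:repairready:26e535b8-7555-49ca-9b3c-803c3bbec29c"
    }
  ]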
| |
| State |
| ~~~~~ |
| |
| As repairs, especially those involving physically swapping hardware, can take |
| a long time, the repair daemon needs to store its state persistently. As we |
| cannot exclude master-failovers during a repair cycle, it does so by storing |
| it as part of the Ganeti configuration. |
| |
| This will be done by adding a new top-level entry to the Ganeti configuration. |
| The SSConf will not be changed. |
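
The exact schema of that entry is left to the implementation; purely as an
illustration, it could store the incident list in a shape mirroring the
status report described above::

  "maintenance": {
    "incidents": [
      { "uuid": "26e535b8-7555-49ca-9b3c-803c3bbec29c",
        "node": "b1f8b0b9-82f4-4a51-9f92-7325fdb783a3",
        "repair-status": "pending",
        "jobs": [4711, 4712]
      }
    ]
  }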
| |
Superseding ``harep`` and implicit balancing
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| |
| To have a single point coordinating all repair actions, the new repair daemon |
| will also have the ability to take over the work currently done by ``harep``. |
To allow a smooth transition, ``maintd``, when carrying out ``harep``'s
duties, will add tags in precisely the same way as ``harep`` does.
As the new daemon will have to move instances, it will also have the ability
to balance the cluster in a way coordinated with the necessary evacuation
operations; dynamic load information can be taken into account.
| |
The questions of whether to do ``harep``'s work and whether to balance the
cluster, and if so using which strategy (e.g., taking dynamic load information
into account or not, allowing disk moves or not), are configurable via the
Ganeti configuration. The default will be to do neither of those tasks.
``harep`` will continue to exist unchanged as part of the ``htools``.
| |
| Mode of operation |
| ~~~~~~~~~~~~~~~~~ |
| |
The repair daemon will poll the monitoring daemons for
the value of the self-diagnose data collector at the same (configurable)
rate at which the monitoring daemon runs this collector; if load-based
balancing is enabled, it will also collect the load data needed.
| |
| Repair events will be exposed on the web status page as soon as observed. |
| The Ganeti jobs doing the actual maintenance will be submitted in rounds. |
| A new round will be started if all jobs of the old round have finished, and |
| there is an unhandled repair event or the cluster is unbalanced enough (provided |
| that autobalancing is enabled). |
| |
In each round, ``maintd`` will first determine the most invasive action for
each node; even though the self-diagnose collector sums up its observations
in a single action recommendation, a new, more invasive recommendation can be
issued before the handling of the first recommendation is finished. For all
nodes to be evacuated, the first evacuation task is scheduled, in a way that
these tasks do not conflict with each other. Then, for all instances on a
non-affected node that need ``harep``-style repair (if enabled), those jobs
are scheduled to the extent that they do not conflict with each other. Then,
on the remaining nodes that are not part of a failed repair event either, the
jobs of the first balancing step are scheduled. All those jobs of a round are
submitted at once. As they do not conflict, they will be able to run in
parallel.
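
The following self-contained sketch illustrates how a round partitions the
nodes; the data representation and all names are hypothetical::

  def plan_round(actions, failed):
      """Partition the nodes for one scheduling round.

      `actions` maps each node to the most invasive action currently
      recommended for it; `failed` is the set of nodes with a failed
      repair event.
      """
      evacuate = [n for n, a in actions.items()
                  if a in ("evacuate", "evacuate-failover")]
      # harep-style instance repairs happen on nodes not being evacuated
      unaffected = [n for n in actions if n not in evacuate]
      # balancing additionally avoids nodes with a failed repair event
      balance = [n for n in unaffected if n not in failed]
      return evacuate, unaffected, balance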