| ======= |
| Hotplug |
| ======= |
| |
| .. contents:: :depth: 4 |
| |
| This is a design document detailing the implementation of device |
| hotplugging in Ganeti. The logic used is hypervisor agnostic but still |
| the initial implementation will target the KVM hypervisor. The |
| implementation adds ``python-fdsend`` as a new dependency. In case |
| it is not installed hotplug will not be possible and the user will |
| be notified with a warning. |
| |
| |
| Current state and shortcomings |
| ============================== |
| |
| Currently, Ganeti supports addition/removal/modification of devices |
| (NICs, Disks) but the actual modification takes place only after |
| rebooting the instance. To this end an instance cannot change network, |
| get a new disk etc. without a hard reboot. |
| |
| Until now, in case of KVM hypervisor, code does not name devices nor |
| places them in specific PCI slots. Devices are appended in the KVM |
| command and Ganeti lets KVM decide where to place them. This means that |
| there is a possibility a device that resides in PCI slot 5, after a |
| reboot (due to another device removal) to be moved to another PCI slot |
| and probably get renamed too (due to udev rules, etc.). |
| |
| In order for a migration to succeed, the process on the target node |
| should be started with exactly the same machine version, CPU |
| architecture and PCI configuration with the running process. During |
| instance creation/startup ganeti creates a KVM runtime file with all the |
| necessary information to generate the KVM command. This runtime file is |
| used during instance migration to start a new identical KVM process. The |
| current format includes the fixed part of the final KVM command, a list |
| of NICs', and hvparams dict. It does not favor easy manipulations |
| concerning disks, because they are encapsulated in the fixed KVM |
| command. |
| |
| |
| Proposed changes |
| ================ |
| |
| For the case of the KVM hypervisor, QEMU exposes 32 PCI slots to the |
| instance. Disks and NICs occupy some of these slots. Recent versions of |
| QEMU have introduced monitor commands that allow addition/removal of PCI |
| devices. Devices are referenced based on their name or position on the |
| virtual PCI bus. To be able to use these commands, we need to be able to |
| assign each device a unique name. |
| |
| To keep track where each device is plugged into, we add the |
| ``pci`` slot to Disk and NIC objects, but we save it only in runtime |
| files, since it is hypervisor specific info. This is added for easy |
| object manipulation and is ensured not to be written back to the config. |
| |
| We propose to make use of QEMU 1.7 QMP commands so that |
| modifications to devices take effect instantly without the need for hard |
| reboot. The only change exposed to the end-user will be the addition of |
| a ``--hotplug`` option to the ``gnt-instance modify`` command. |
| |
| Upon hotplugging the PCI configuration of an instance is changed. |
| Runtime files should be updated correspondingly. Currently this is |
| impossible in case of disk hotplug because disks are included in command |
| line entry of the runtime file, contrary to NICs that are correctly |
| treated separately. We change the format of runtime files, we remove |
| disks from the fixed KVM command and create new entry containing them |
| only. KVM options concerning disk are generated during |
| ``_ExecuteKVMCommand()``, just like NICs. |
| |
| Design decisions |
| ================ |
| |
| Which should be each device ID? Currently KVM does not support arbitrary |
| IDs for devices; supported are only names starting with a letter, max 32 |
| chars length, and only including '.' '_' '-' special chars. |
| For debugging purposes and in order to be more informative, device will be |
| named after: <device type>-<part of uuid>-pci-<slot>. |
| |
| Who decides where to hotplug each device? As long as this is a |
| hypervisor specific matter, there is no point for the master node to |
| decide such a thing. Master node just has to request noded to hotplug a |
| device. To this end, hypervisor specific code should parse the current |
| PCI configuration (i.e. ``query-pci`` QMP command), find the first |
| available slot and hotplug the device. Having noded to decide where to |
| hotplug a device we ensure that no error will occur due to duplicate |
| slot assignment (if masterd keeps track of PCI reservations and noded |
| fails to return the PCI slot that the device was plugged into then next |
| hotplug will fail). |
| |
| Where should we keep track of devices' PCI slots? As already mentioned, |
| we must keep track of devices PCI slots to successfully migrate |
| instances. First option is to save this info to config data, which would |
| allow us to place each device at the same PCI slot after reboot. This |
| would require to make the hypervisor return the PCI slot chosen for each |
| device, and storing this information to config data. Additionally the |
| whole instance configuration should be returned with PCI slots filled |
| after instance start and each instance should keep track of current PCI |
| reservations. We decide not to go towards this direction in order to |
| keep it simple and do not add hypervisor specific info to configuration |
| data (``pci_reservations`` at instance level and ``pci`` at device |
| level). For the aforementioned reason, we decide to store this info only |
| in KVM runtime files. |
| |
| Where to place the devices upon instance startup? QEMU has by default 4 |
| pre-occupied PCI slots. So, hypervisor can use the remaining ones for |
| disks and NICs. Currently, PCI configuration is not preserved after |
| reboot. Each time an instance starts, KVM assigns PCI slots to devices |
| based on their ordering in Ganeti configuration, i.e. the second disk |
| will be placed after the first, the third NIC after the second, etc. |
| Since we decided that there is no need to keep track of devices PCI |
| slots, there is no need to change current functionality. |
| |
| How to deal with existing instances? Hotplug depends on runtime file |
| manipulation. It stores there pci info and every device the kvm process is |
| currently using. Existing files have no pci info in devices and have block |
| devices encapsulated inside kvm_cmd entry. Thus hotplugging of existing devices |
| will not be possible. Still migration and hotplugging of new devices will |
| succeed. The workaround will happen upon loading kvm runtime: if we detect old |
| style format we will add an empty list for block devices and upon saving kvm |
| runtime we will include this empty list as well. Switching entirely to new |
| format will happen upon instance reboot. |
| |
| |
| Configuration changes |
| --------------------- |
| |
| The ``NIC`` and ``Disk`` objects get one extra slot: ``pci``. It refers to |
| PCI slot that the device gets plugged into. |
| |
| In order to be able to live migrate successfully, runtime files should |
| be updated every time a live modification (hotplug) takes place. To this |
| end we change the format of runtime files. The KVM options referring to |
| instance's disks are no longer recorded as part of the KVM command line. |
| Disks are treated separately, just as we treat NICs right now. We insert |
| and remove entries to reflect the current PCI configuration. |
| |
| |
| Backend changes |
| --------------- |
| |
| Introduce one new RPC call: |
| |
| - hotplug_device(DEVICE_TYPE, ACTION, device, ...) |
| |
| where DEVICE_TYPE can be either NIC or Disk, and ACTION either REMOVE or ADD. |
| |
| Hypervisor changes |
| ------------------ |
| |
| We implement hotplug on top of the KVM hypervisor. We take advantage of |
| QEMU 1.7 QMP commands (``device_add``, ``device_del``, |
| ``blockdev-add``, ``netdev_add``, ``netdev_del``). Since ``drive_del`` |
| is not yet implemented in QMP we use the one of HMP. QEMU |
| refers to devices based on their id. We use ``uuid`` to name them |
| properly. If a device is about to be hotplugged we parse the output of |
| ``query-pci`` and find the occupied PCI slots. We choose the first |
| available and the whole device object is appended to the corresponding |
| entry in the runtime file. |
| |
| Concerning NIC handling, we build on the top of the existing logic |
| (first create a tap with _OpenTap() and then pass its file descriptor to |
| the KVM process). To this end we need to pass access rights to the |
| corresponding file descriptor over the QMP socket (UNIX domain |
| socket). The open file is passed as a socket-level control message |
| (SCM), using the ``fdsend`` python library. |
| |
| |
| User interface |
| -------------- |
| |
| The new ``--hotplug`` option to gnt-instance modify is introduced, which |
| forces live modifications. |
| |
| |
| Enabling hotplug |
| ++++++++++++++++ |
| |
| Hotplug will be optional during gnt-instance modify. For existing |
| instance, after installing a version that supports hotplugging we |
| have the restriction that hotplug will not be supported for existing |
| devices. The reason is that old runtime files lack of: |
| |
| 1. Device pci configuration info. |
| |
| 2. Separate block device entry. |
| |
| Hotplug will be supported only for KVM in the first implementation. For |
| all other hypervisors, backend will raise an Exception case hotplug is |
| requested. |
| |
| |
| NIC Hotplug |
| +++++++++++ |
| |
| The user can add/modify/remove NICs either with hotplugging or not. If a |
| NIC is to be added a tap is created first and configured properly with |
| kvm-vif-bridge script. Then the instance gets a new network interface. |
| Since there is no QEMU monitor command to modify a NIC, we modify a NIC |
| by temporary removing the existing one and adding a new with the new |
| configuration. When removing a NIC the corresponding tap gets removed as |
| well. |
| |
| :: |
| |
| gnt-instance modify --net add --hotplug test |
| gnt-instance modify --net 1:mac=aa:00:00:55:44:33 --hotplug test |
| gnt-instance modify --net 1:remove --hotplug test |
| |
| |
| Disk Hotplug |
| ++++++++++++ |
| |
| The user can add and remove disks with hotplugging or not. QEMU monitor |
| supports resizing of disks, however the initial implementation will |
| support only disk addition/deletion. |
| |
| :: |
| |
| gnt-instance modify --disk add:size=1G --hotplug test |
| gnt-instance modify --net 1:remove --hotplug test |
| |
| |
| Dealing with chroot and uid pool (and disks in general) |
| ------------------------------------------------------- |
| |
| The design so far covers all issues that arise without addressing the |
| case where the kvm process will not run with root privileges. |
| Specifically: |
| |
| - in case of chroot, the kvm process cannot see the newly created device |
| |
| - in case of uid pool security model, the kvm process is not allowed |
| to access the device |
| |
| For NIC hotplug we address this problem by using the ``getfd`` QMP |
| command and passing the file descriptor to the kvm process over the |
| monitor socket using SCM_RIGHTS. For disk hotplug and in case of uid |
| pool we can let the hypervisor code temporarily ``chown()`` the device |
| before the actual hotplug. Still this is insufficient in case of chroot. |
| In this case, we need to ``mknod()`` the device inside the chroot. Both |
| workarounds can be avoided, if we make use of the ``add-fd`` |
| QMP command, that was introduced in version 1.7. This command is the |
| equivalent of NICs' `get-fd`` for disks and will allow disk hotplug in |
| every case. So, if the QMP does not support the ``add-fd`` |
| command, we will not allow disk hotplug |
| and notify the user with the corresponding warning. |
| |
| .. vim: set textwidth=72 : |
| .. Local Variables: |
| .. mode: rst |
| .. fill-column: 72 |
| .. End: |