| Ganeti walk-through |
| =================== |
| |
| Documents Ganeti version |version| |
| |
| .. contents:: |
| |
| .. highlight:: shell-example |
| |
| Introduction |
| ------------ |
| |
This document serves as a more example-oriented guide to Ganeti; while
the administration guide takes a conceptual approach, here you will
find step-by-step examples of managing instances and the cluster.
| |
Our simulated, example cluster will have three machines, named
``node1``, ``node2`` and ``node3``. Note that in real life machines
will usually have FQDNs, but here we use short names for brevity. We
will use a secondary network for replication data, ``192.0.2.0/24``,
with each node's last octet equal to its index. The cluster name will
be ``example-cluster``. All nodes have the same simulated hardware
configuration: two disks of 750GB, 32GB of memory and 4 CPUs.
| |
| On this cluster, we will create up to seven instances, named |
| ``instance1`` to ``instance7``. |
| |
| |
| Cluster creation |
| ---------------- |
| |
| Follow the :doc:`install` document and prepare the nodes. Then it's time |
| to initialise the cluster:: |
| |
| $ gnt-cluster init -s %192.0.2.1% --enabled-hypervisors=xen-pvm %example-cluster% |
| $ |
| |
The creation went fine. Let's check that the one node we have is
functioning correctly::
| |
| $ gnt-node list |
| Node DTotal DFree MTotal MNode MFree Pinst Sinst |
| node1 1.3T 1.3T 32.0G 1.0G 30.5G 0 0 |
| $ gnt-cluster verify |
| Mon Oct 26 02:08:51 2009 * Verifying global settings |
| Mon Oct 26 02:08:51 2009 * Gathering data (1 nodes) |
| Mon Oct 26 02:08:52 2009 * Verifying node status |
| Mon Oct 26 02:08:52 2009 * Verifying instance status |
| Mon Oct 26 02:08:52 2009 * Verifying orphan volumes |
| Mon Oct 26 02:08:52 2009 * Verifying remaining instances |
| Mon Oct 26 02:08:52 2009 * Verifying N+1 Memory redundancy |
| Mon Oct 26 02:08:52 2009 * Other Notes |
| Mon Oct 26 02:08:52 2009 * Hooks Results |
| $ |
| |
| Since this proceeded correctly, let's add the other two nodes:: |
| |
| $ gnt-node add -s %192.0.2.2% %node2% |
| -- WARNING -- |
| Performing this operation is going to replace the ssh daemon keypair |
| on the target machine (node2) with the ones of the current one |
| and grant full intra-cluster ssh root access to/from it |
| |
| Unable to verify hostkey of host xen-devi-5.fra.corp.google.com: |
| f7:…. Do you want to accept it? |
| y/[n]/?: %y% |
| Mon Oct 26 02:11:53 2009 Authentication to node2 via public key failed, trying password |
| root password: |
| Mon Oct 26 02:11:54 2009 - INFO: Node will be a master candidate |
| $ gnt-node add -s %192.0.2.3% %node3% |
| -- WARNING -- |
| Performing this operation is going to replace the ssh daemon keypair |
| on the target machine (node3) with the ones of the current one |
| and grant full intra-cluster ssh root access to/from it |
| |
| … |
| Mon Oct 26 02:12:43 2009 - INFO: Node will be a master candidate |
| |
| Checking the cluster status again:: |
| |
| $ gnt-node list |
| Node DTotal DFree MTotal MNode MFree Pinst Sinst |
| node1 1.3T 1.3T 32.0G 1.0G 30.5G 0 0 |
| node2 1.3T 1.3T 32.0G 1.0G 30.5G 0 0 |
| node3 1.3T 1.3T 32.0G 1.0G 30.5G 0 0 |
| $ gnt-cluster verify |
| Mon Oct 26 02:15:14 2009 * Verifying global settings |
| Mon Oct 26 02:15:14 2009 * Gathering data (3 nodes) |
| Mon Oct 26 02:15:16 2009 * Verifying node status |
| Mon Oct 26 02:15:16 2009 * Verifying instance status |
| Mon Oct 26 02:15:16 2009 * Verifying orphan volumes |
| Mon Oct 26 02:15:16 2009 * Verifying remaining instances |
| Mon Oct 26 02:15:16 2009 * Verifying N+1 Memory redundancy |
| Mon Oct 26 02:15:16 2009 * Other Notes |
| Mon Oct 26 02:15:16 2009 * Hooks Results |
| $ |
| |
| And let's check that we have a valid OS:: |
| |
| $ gnt-os list |
| Name |
| debootstrap |
  $
| |
| Running a burn-in |
| ----------------- |
| |
Now that the cluster is created, it is time to check that the hardware
works correctly, that the hypervisor can actually create instances,
etc. This is done with the *burnin* tool (here using the
``debootstrap`` OS), as described in the admin guide. Similar output
lines are replaced with ``…`` in the log below::
| |
| $ /usr/lib/ganeti/tools/burnin -o debootstrap -p instance{1..5} |
| - Testing global parameters |
| - Creating instances |
| * instance instance1 |
| on node1, node2 |
| * instance instance2 |
| on node2, node3 |
| … |
| * instance instance5 |
| on node2, node3 |
| * Submitted job ID(s) 157, 158, 159, 160, 161 |
| waiting for job 157 for instance1 |
| … |
| waiting for job 161 for instance5 |
| - Replacing disks on the same nodes |
| * instance instance1 |
| run replace_on_secondary |
| run replace_on_primary |
| … |
| * instance instance5 |
| run replace_on_secondary |
| run replace_on_primary |
| * Submitted job ID(s) 162, 163, 164, 165, 166 |
| waiting for job 162 for instance1 |
| … |
| - Changing the secondary node |
| * instance instance1 |
| run replace_new_secondary node3 |
| * instance instance2 |
| run replace_new_secondary node1 |
| … |
| * instance instance5 |
| run replace_new_secondary node1 |
| * Submitted job ID(s) 167, 168, 169, 170, 171 |
| waiting for job 167 for instance1 |
| … |
| - Growing disks |
| * instance instance1 |
| increase disk/0 by 128 MB |
| … |
| * instance instance5 |
| increase disk/0 by 128 MB |
| * Submitted job ID(s) 173, 174, 175, 176, 177 |
| waiting for job 173 for instance1 |
| … |
| - Failing over instances |
| * instance instance1 |
| … |
| * instance instance5 |
| * Submitted job ID(s) 179, 180, 181, 182, 183 |
| waiting for job 179 for instance1 |
| … |
| - Migrating instances |
| * instance instance1 |
| migration and migration cleanup |
| … |
| * instance instance5 |
| migration and migration cleanup |
| * Submitted job ID(s) 184, 185, 186, 187, 188 |
| waiting for job 184 for instance1 |
| … |
| - Exporting and re-importing instances |
| * instance instance1 |
| export to node node3 |
| remove instance |
| import from node3 to node1, node2 |
| remove export |
| … |
| * instance instance5 |
| export to node node1 |
| remove instance |
| import from node1 to node2, node3 |
| remove export |
| * Submitted job ID(s) 196, 197, 198, 199, 200 |
| waiting for job 196 for instance1 |
| … |
| - Reinstalling instances |
| * instance instance1 |
| reinstall without passing the OS |
| reinstall specifying the OS |
| … |
| * instance instance5 |
| reinstall without passing the OS |
| reinstall specifying the OS |
| * Submitted job ID(s) 203, 204, 205, 206, 207 |
| waiting for job 203 for instance1 |
| … |
| - Rebooting instances |
| * instance instance1 |
| reboot with type 'hard' |
| reboot with type 'soft' |
| reboot with type 'full' |
| … |
| * instance instance5 |
| reboot with type 'hard' |
| reboot with type 'soft' |
| reboot with type 'full' |
| * Submitted job ID(s) 208, 209, 210, 211, 212 |
| waiting for job 208 for instance1 |
| … |
| - Adding and removing disks |
| * instance instance1 |
| adding a disk |
| removing last disk |
| … |
| * instance instance5 |
| adding a disk |
| removing last disk |
| * Submitted job ID(s) 213, 214, 215, 216, 217 |
| waiting for job 213 for instance1 |
| … |
| - Adding and removing NICs |
| * instance instance1 |
| adding a NIC |
| removing last NIC |
| … |
| * instance instance5 |
| adding a NIC |
| removing last NIC |
| * Submitted job ID(s) 218, 219, 220, 221, 222 |
| waiting for job 218 for instance1 |
| … |
| - Activating/deactivating disks |
| * instance instance1 |
| activate disks when online |
| activate disks when offline |
| deactivate disks (when offline) |
| … |
| * instance instance5 |
| activate disks when online |
| activate disks when offline |
| deactivate disks (when offline) |
| * Submitted job ID(s) 223, 224, 225, 226, 227 |
| waiting for job 223 for instance1 |
| … |
| - Stopping and starting instances |
| * instance instance1 |
| … |
| * instance instance5 |
| * Submitted job ID(s) 230, 231, 232, 233, 234 |
| waiting for job 230 for instance1 |
| … |
| - Removing instances |
| * instance instance1 |
| … |
| * instance instance5 |
| * Submitted job ID(s) 235, 236, 237, 238, 239 |
| waiting for job 235 for instance1 |
| … |
| $ |
| |
You can see from the above which operations the burn-in performs.
Ideally, the burn-in proceeds successfully through all the steps and
ends cleanly, without throwing errors.
| |
| Instance operations |
| ------------------- |
| |
| Creation |
| ++++++++ |
| |
At this point, Ganeti and the hardware seem to be functioning
correctly, so we'll follow up by creating the instances manually::
| |
| $ gnt-instance add -t drbd -o debootstrap -s %256m% %instance1% |
| Mon Oct 26 04:06:52 2009 - INFO: Selected nodes for instance instance1 via iallocator hail: node2, node3 |
| Mon Oct 26 04:06:53 2009 * creating instance disks... |
| Mon Oct 26 04:06:57 2009 adding instance instance1 to cluster config |
| Mon Oct 26 04:06:57 2009 - INFO: Waiting for instance instance1 to sync disks. |
| Mon Oct 26 04:06:57 2009 - INFO: - device disk/0: 20.00\% done, 4 estimated seconds remaining |
| Mon Oct 26 04:07:01 2009 - INFO: Instance instance1's disks are in sync. |
| Mon Oct 26 04:07:01 2009 creating os for instance instance1 on node node2 |
| Mon Oct 26 04:07:01 2009 * running the instance OS create scripts... |
| Mon Oct 26 04:07:14 2009 * starting instance... |
| $ gnt-instance add -t drbd -o debootstrap -s %256m% -n %node1%:%node2% %instance2% |
| Mon Oct 26 04:11:37 2009 * creating instance disks... |
| Mon Oct 26 04:11:40 2009 adding instance instance2 to cluster config |
| Mon Oct 26 04:11:41 2009 - INFO: Waiting for instance instance2 to sync disks. |
| Mon Oct 26 04:11:41 2009 - INFO: - device disk/0: 35.40\% done, 1 estimated seconds remaining |
| Mon Oct 26 04:11:42 2009 - INFO: - device disk/0: 58.50\% done, 1 estimated seconds remaining |
| Mon Oct 26 04:11:43 2009 - INFO: - device disk/0: 86.20\% done, 0 estimated seconds remaining |
| Mon Oct 26 04:11:44 2009 - INFO: - device disk/0: 92.40\% done, 0 estimated seconds remaining |
| Mon Oct 26 04:11:44 2009 - INFO: - device disk/0: 97.00\% done, 0 estimated seconds remaining |
| Mon Oct 26 04:11:44 2009 - INFO: Instance instance2's disks are in sync. |
| Mon Oct 26 04:11:44 2009 creating os for instance instance2 on node node1 |
| Mon Oct 26 04:11:44 2009 * running the instance OS create scripts... |
| Mon Oct 26 04:11:57 2009 * starting instance... |
| $ |
| |
The above shows one instance created via an iallocator script, and one
created with manual node assignment. The other three instances were
created the same way; now it's time to check them::
| |
| $ gnt-instance list |
| Instance Hypervisor OS Primary_node Status Memory |
| instance1 xen-pvm debootstrap node2 running 128M |
| instance2 xen-pvm debootstrap node1 running 128M |
| instance3 xen-pvm debootstrap node1 running 128M |
| instance4 xen-pvm debootstrap node3 running 128M |
| instance5 xen-pvm debootstrap node2 running 128M |
| |
| Accessing instances |
| +++++++++++++++++++ |
| |
| Accessing an instance's console is easy:: |
| |
| $ gnt-instance console %instance2% |
| [ 0.000000] Bootdata ok (command line is root=/dev/sda1 ro) |
| [ 0.000000] Linux version 2.6… |
| [ 0.000000] BIOS-provided physical RAM map: |
| [ 0.000000] Xen: 0000000000000000 - 0000000008800000 (usable) |
| [13138176.018071] Built 1 zonelists. Total pages: 34816 |
| [13138176.018074] Kernel command line: root=/dev/sda1 ro |
| [13138176.018694] Initializing CPU#0 |
| … |
| Checking file systems...fsck 1.41.3 (12-Oct-2008) |
| done. |
| Setting kernel variables (/etc/sysctl.conf)...done. |
| Mounting local filesystems...done. |
| Activating swapfile swap...done. |
| Setting up networking.... |
| Configuring network interfaces...done. |
| Setting console screen modes and fonts. |
| INIT: Entering runlevel: 2 |
| Starting enhanced syslogd: rsyslogd. |
| Starting periodic command scheduler: crond. |
| |
| Debian GNU/Linux 5.0 instance2 tty1 |
| |
| instance2 login: |
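
Once logged in, the instance's network must be configured before it is
reachable from the other machines. A minimal sketch for a Debian guest,
assuming a static address on the example primary network (the interface
file contents and the addresses below are illustrative assumptions)::

  instance2:~# cat >> /etc/network/interfaces <<EOF
  auto eth0
  iface eth0 inet static
      address 198.51.100.12
      netmask 255.255.255.0
      gateway 198.51.100.254
  EOF
  instance2:~# ifup eth0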
| |
After the network has been configured in this way on all instances, we
can check their connectivity::
| |
| $ fping %instance{1..5}% |
| instance1 is alive |
| instance2 is alive |
| instance3 is alive |
| instance4 is alive |
| instance5 is alive |
| $ |
| |
| Removal |
| +++++++ |
| |
| Removing unwanted instances is also easy:: |
| |
| $ gnt-instance remove %instance5% |
| This will remove the volumes of the instance instance5 (including |
| mirrors), thus removing all the data of the instance. Continue? |
| y/[n]/?: %y% |
| $ |
| |
| |
| Recovering from hardware failures |
| --------------------------------- |
| |
| Recovering from node failure |
| ++++++++++++++++++++++++++++ |
| |
| We are now left with four instances. Assume that at this point, node3, |
| which has one primary and one secondary instance, crashes:: |
| |
| $ gnt-node info %node3% |
| Node name: node3 |
| primary ip: 198.51.100.1 |
| secondary ip: 192.0.2.3 |
| master candidate: True |
| drained: False |
| offline: False |
| primary for instances: |
| - instance4 |
| secondary for instances: |
| - instance1 |
| $ fping %node3% |
| node3 is unreachable |
| |
At this point, the primary instance of that node (instance4) is down,
but the secondary instance (instance1) is not affected, except that it
has lost disk redundancy::
| |
| $ fping %instance{1,4}% |
| instance1 is alive |
| instance4 is unreachable |
| $ |
| |
If we try to check the status of instance4 via the instance info
command, it fails because it tries to contact node3, which is down::
| |
| $ gnt-instance info %instance4% |
| Failure: command execution error: |
| Error checking node node3: Connection failed (113: No route to host) |
| $ |
| |
So we need to mark node3 as *offline*, so that Ganeti won't talk to it
anymore::
| |
| $ gnt-node modify -O yes -f %node3% |
| Mon Oct 26 04:34:12 2009 - WARNING: Not enough master candidates (desired 10, new value will be 2) |
| Mon Oct 26 04:34:15 2009 - WARNING: Communication failure to node node3: Connection failed (113: No route to host) |
| Modified node node3 |
| - offline -> True |
| - master_candidate -> auto-demotion due to offline |
| $ |
| |
And now we can fail over the instance::
| |
| $ gnt-instance failover %instance4% |
| Failover will happen to image instance4. This requires a shutdown of |
| the instance. Continue? |
| y/[n]/?: %y% |
| Mon Oct 26 04:35:34 2009 * checking disk consistency between source and target |
| Failure: command execution error: |
| Disk disk/0 is degraded on target node, aborting failover. |
| $ gnt-instance failover --ignore-consistency %instance4% |
| Failover will happen to image instance4. This requires a shutdown of |
| the instance. Continue? |
| y/[n]/?: y |
| Mon Oct 26 04:35:47 2009 * checking disk consistency between source and target |
| Mon Oct 26 04:35:47 2009 * shutting down instance on source node |
| Mon Oct 26 04:35:47 2009 - WARNING: Could not shutdown instance instance4 on node node3. Proceeding anyway. Please make sure node node3 is down. Error details: Node is marked offline |
| Mon Oct 26 04:35:47 2009 * deactivating the instance's disks on source node |
| Mon Oct 26 04:35:47 2009 - WARNING: Could not shutdown block device disk/0 on node node3: Node is marked offline |
| Mon Oct 26 04:35:47 2009 * activating the instance's disks on target node |
| Mon Oct 26 04:35:47 2009 - WARNING: Could not prepare block device disk/0 on node node3 (is_primary=False, pass=1): Node is marked offline |
| Mon Oct 26 04:35:48 2009 * starting the instance on the target node |
| $ |
| |
Note that in our first attempt, Ganeti refused to do the failover
because it wasn't sure about the status of the instance's disks. Once
we pass the ``--ignore-consistency`` flag, the failover proceeds::
| |
| $ gnt-instance list |
| Instance Hypervisor OS Primary_node Status Memory |
| instance1 xen-pvm debootstrap node2 running 128M |
| instance2 xen-pvm debootstrap node1 running 128M |
| instance3 xen-pvm debootstrap node1 running 128M |
| instance4 xen-pvm debootstrap node1 running 128M |
| $ |
| |
| But at this point, both instance1 and instance4 are without disk |
| redundancy:: |
| |
| $ gnt-instance info %instance1% |
| Instance name: instance1 |
| UUID: 45173e82-d1fa-417c-8758-7d582ab7eef4 |
| Serial number: 2 |
| Creation time: 2009-10-26 04:06:57 |
| Modification time: 2009-10-26 04:07:14 |
| State: configured to be up, actual state is up |
| Nodes: |
| - primary: node2 |
| - secondaries: node3 |
| Operating system: debootstrap |
| Allocated network port: None |
| Hypervisor: xen-pvm |
| - root_path: default (/dev/sda1) |
| - kernel_args: default (ro) |
| - use_bootloader: default (False) |
| - bootloader_args: default () |
| - bootloader_path: default () |
| - kernel_path: default (/boot/vmlinuz-2.6-xenU) |
| - initrd_path: default () |
| Hardware: |
| - VCPUs: 1 |
| - maxmem: 256MiB |
    - minmem: 128MiB
| - NICs: |
| - nic/0: MAC: aa:00:00:78:da:63, IP: None, mode: bridged, link: xen-br0 |
| Disks: |
| - disk/0: drbd8, size 256M |
| access mode: rw |
| nodeA: node2, minor=0 |
| nodeB: node3, minor=0 |
| port: 11035 |
| auth key: 8e950e3cec6854b0181fbc3a6058657701f2d458 |
| on primary: /dev/drbd0 (147:0) in sync, status *DEGRADED* |
| child devices: |
| - child 0: lvm, size 256M |
| logical_id: xenvg/22459cf8-117d-4bea-a1aa-791667d07800.disk0_data |
| on primary: /dev/xenvg/22459cf8-117d-4bea-a1aa-791667d07800.disk0_data (254:0) |
| - child 1: lvm, size 128M |
| logical_id: xenvg/22459cf8-117d-4bea-a1aa-791667d07800.disk0_meta |
| on primary: /dev/xenvg/22459cf8-117d-4bea-a1aa-791667d07800.disk0_meta (254:1) |
| |
The output is similar for instance4. In order to recover from this, we
need to run the node evacuate command, which will change the current
secondary node to a new one (in this case, we only have two working
nodes, so all instances will end up on nodes one and two)::
| |
| $ gnt-node evacuate -I hail %node3% |
| Relocate instance(s) 'instance1','instance4' from node |
| node3 using iallocator hail? |
| y/[n]/?: %y% |
| Mon Oct 26 05:05:39 2009 - INFO: Selected new secondary for instance 'instance1': node1 |
| Mon Oct 26 05:05:40 2009 - INFO: Selected new secondary for instance 'instance4': node2 |
| Mon Oct 26 05:05:40 2009 Replacing disk(s) 0 for instance1 |
| Mon Oct 26 05:05:40 2009 STEP 1/6 Check device existence |
| Mon Oct 26 05:05:40 2009 - INFO: Checking disk/0 on node2 |
| Mon Oct 26 05:05:40 2009 - INFO: Checking volume groups |
| Mon Oct 26 05:05:40 2009 STEP 2/6 Check peer consistency |
| Mon Oct 26 05:05:40 2009 - INFO: Checking disk/0 consistency on node node2 |
| Mon Oct 26 05:05:40 2009 STEP 3/6 Allocate new storage |
| Mon Oct 26 05:05:40 2009 - INFO: Adding new local storage on node1 for disk/0 |
| Mon Oct 26 05:05:41 2009 STEP 4/6 Changing drbd configuration |
| Mon Oct 26 05:05:41 2009 - INFO: activating a new drbd on node1 for disk/0 |
| Mon Oct 26 05:05:42 2009 - INFO: Shutting down drbd for disk/0 on old node |
| Mon Oct 26 05:05:42 2009 - WARNING: Failed to shutdown drbd for disk/0 on oldnode: Node is marked offline |
| Mon Oct 26 05:05:42 2009 Hint: Please cleanup this device manually as soon as possible |
| Mon Oct 26 05:05:42 2009 - INFO: Detaching primary drbds from the network (=> standalone) |
| Mon Oct 26 05:05:42 2009 - INFO: Updating instance configuration |
| Mon Oct 26 05:05:45 2009 - INFO: Attaching primary drbds to new secondary (standalone => connected) |
| Mon Oct 26 05:05:46 2009 STEP 5/6 Sync devices |
| Mon Oct 26 05:05:46 2009 - INFO: Waiting for instance instance1 to sync disks. |
| Mon Oct 26 05:05:46 2009 - INFO: - device disk/0: 13.90\% done, 7 estimated seconds remaining |
| Mon Oct 26 05:05:53 2009 - INFO: Instance instance1's disks are in sync. |
| Mon Oct 26 05:05:53 2009 STEP 6/6 Removing old storage |
| Mon Oct 26 05:05:53 2009 - INFO: Remove logical volumes for 0 |
| Mon Oct 26 05:05:53 2009 - WARNING: Can't remove old LV: Node is marked offline |
| Mon Oct 26 05:05:53 2009 Hint: remove unused LVs manually |
| Mon Oct 26 05:05:53 2009 - WARNING: Can't remove old LV: Node is marked offline |
| Mon Oct 26 05:05:53 2009 Hint: remove unused LVs manually |
| Mon Oct 26 05:05:53 2009 Replacing disk(s) 0 for instance4 |
| Mon Oct 26 05:05:53 2009 STEP 1/6 Check device existence |
| Mon Oct 26 05:05:53 2009 - INFO: Checking disk/0 on node1 |
| Mon Oct 26 05:05:53 2009 - INFO: Checking volume groups |
| Mon Oct 26 05:05:53 2009 STEP 2/6 Check peer consistency |
| Mon Oct 26 05:05:53 2009 - INFO: Checking disk/0 consistency on node node1 |
| Mon Oct 26 05:05:54 2009 STEP 3/6 Allocate new storage |
| Mon Oct 26 05:05:54 2009 - INFO: Adding new local storage on node2 for disk/0 |
| Mon Oct 26 05:05:54 2009 STEP 4/6 Changing drbd configuration |
| Mon Oct 26 05:05:54 2009 - INFO: activating a new drbd on node2 for disk/0 |
| Mon Oct 26 05:05:55 2009 - INFO: Shutting down drbd for disk/0 on old node |
| Mon Oct 26 05:05:55 2009 - WARNING: Failed to shutdown drbd for disk/0 on oldnode: Node is marked offline |
| Mon Oct 26 05:05:55 2009 Hint: Please cleanup this device manually as soon as possible |
| Mon Oct 26 05:05:55 2009 - INFO: Detaching primary drbds from the network (=> standalone) |
| Mon Oct 26 05:05:55 2009 - INFO: Updating instance configuration |
| Mon Oct 26 05:05:55 2009 - INFO: Attaching primary drbds to new secondary (standalone => connected) |
| Mon Oct 26 05:05:56 2009 STEP 5/6 Sync devices |
| Mon Oct 26 05:05:56 2009 - INFO: Waiting for instance instance4 to sync disks. |
| Mon Oct 26 05:05:56 2009 - INFO: - device disk/0: 12.40\% done, 8 estimated seconds remaining |
| Mon Oct 26 05:06:04 2009 - INFO: Instance instance4's disks are in sync. |
| Mon Oct 26 05:06:04 2009 STEP 6/6 Removing old storage |
| Mon Oct 26 05:06:04 2009 - INFO: Remove logical volumes for 0 |
| Mon Oct 26 05:06:04 2009 - WARNING: Can't remove old LV: Node is marked offline |
| Mon Oct 26 05:06:04 2009 Hint: remove unused LVs manually |
| Mon Oct 26 05:06:04 2009 - WARNING: Can't remove old LV: Node is marked offline |
| Mon Oct 26 05:06:04 2009 Hint: remove unused LVs manually |
| $ |
| |
| And now node3 is completely free of instances and can be repaired:: |
| |
| $ gnt-node list |
| Node DTotal DFree MTotal MNode MFree Pinst Sinst |
| node1 1.3T 1.3T 32.0G 1.0G 30.2G 3 1 |
| node2 1.3T 1.3T 32.0G 1.0G 30.4G 1 3 |
| node3 ? ? ? ? ? 0 0 |
| |
| Re-adding a node to the cluster |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| Let's say node3 has been repaired and is now ready to be |
| reused. Re-adding it is simple:: |
| |
| $ gnt-node add --readd %node3% |
| The authenticity of host 'node3 (198.51.100.1)' can't be established. |
| RSA key fingerprint is 9f:2e:5a:2e:e0:bd:00:09:e4:5c:32:f2:27:57:7a:f4. |
| Are you sure you want to continue connecting (yes/no)? yes |
| Mon Oct 26 05:27:39 2009 - INFO: Readding a node, the offline/drained flags were reset |
| Mon Oct 26 05:27:39 2009 - INFO: Node will be a master candidate |
| |
| And it is now working again:: |
| |
| $ gnt-node list |
| Node DTotal DFree MTotal MNode MFree Pinst Sinst |
| node1 1.3T 1.3T 32.0G 1.0G 30.2G 3 1 |
| node2 1.3T 1.3T 32.0G 1.0G 30.4G 1 3 |
| node3 1.3T 1.3T 32.0G 1.0G 30.4G 0 0 |
| |
.. note:: If Ganeti has been built with the htools
   component enabled, you can shuffle the instances around to make
   better use of the nodes, as sketched below.
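
For example, a rebalancing run might look like this (a sketch, assuming
the ``hbal`` binary from the htools package is installed; ``-L``
contacts the master daemon via LUXI and ``-X`` also executes the
resulting jobs)::

  $ hbal -L -X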
| |
| Disk failures |
| +++++++++++++ |
| |
A disk failure is simpler than a full node failure. First, a single
disk failure should not cause data loss for any redundant instance;
only the performance of some instances might be reduced due to
increased network traffic.
| |
Let's take the cluster status from the above listing and check which
volumes are in use::
| |
| $ gnt-node volumes -o phys,instance %node2% |
| PhysDev Instance |
| /dev/sdb1 instance4 |
| /dev/sdb1 instance4 |
| /dev/sdb1 instance1 |
| /dev/sdb1 instance1 |
| /dev/sdb1 instance3 |
| /dev/sdb1 instance3 |
| /dev/sdb1 instance2 |
| /dev/sdb1 instance2 |
| $ |
| |
| You can see that all instances on node2 have logical volumes on |
| ``/dev/sdb1``. Let's simulate a disk failure on that disk:: |
| |
| $ ssh node2 |
| # on node2 |
| $ echo offline > /sys/block/sdb/device/state |
| $ vgs |
| /dev/sdb1: read failed after 0 of 4096 at 0: Input/output error |
| /dev/sdb1: read failed after 0 of 4096 at 750153695232: Input/output error |
| /dev/sdb1: read failed after 0 of 4096 at 0: Input/output error |
| Couldn't find device with uuid '954bJA-mNL0-7ydj-sdpW-nc2C-ZrCi-zFp91c'. |
| Couldn't find all physical volumes for volume group xenvg. |
| /dev/sdb1: read failed after 0 of 4096 at 0: Input/output error |
| /dev/sdb1: read failed after 0 of 4096 at 0: Input/output error |
| Couldn't find device with uuid '954bJA-mNL0-7ydj-sdpW-nc2C-ZrCi-zFp91c'. |
| Couldn't find all physical volumes for volume group xenvg. |
| Volume group xenvg not found |
| $ |
| |
At this point, the node is broken; if we examine instance2, we get
(simplified output shown)::
| |
| $ gnt-instance info %instance2% |
| Instance name: instance2 |
| State: configured to be up, actual state is up |
| Nodes: |
| - primary: node1 |
| - secondaries: node2 |
| Disks: |
| - disk/0: drbd8, size 256M |
| on primary: /dev/drbd0 (147:0) in sync, status ok |
| on secondary: /dev/drbd1 (147:1) in sync, status *DEGRADED* *MISSING DISK* |
| |
This instance has only its secondary on node2. Let's also verify an
instance whose primary is node2::
| |
| $ gnt-instance info %instance1% |
| Instance name: instance1 |
| State: configured to be up, actual state is up |
| Nodes: |
| - primary: node2 |
| - secondaries: node1 |
| Disks: |
| - disk/0: drbd8, size 256M |
| on primary: /dev/drbd0 (147:0) in sync, status *DEGRADED* *MISSING DISK* |
| on secondary: /dev/drbd3 (147:3) in sync, status ok |
| $ gnt-instance console %instance1% |
| |
| Debian GNU/Linux 5.0 instance1 tty1 |
| |
| instance1 login: root |
| Last login: Tue Oct 27 01:24:09 UTC 2009 on tty1 |
| instance1:~# date > test |
| instance1:~# sync |
| instance1:~# cat test |
| Tue Oct 27 01:25:20 UTC 2009 |
| instance1:~# dmesg|tail |
| [5439785.235448] NET: Registered protocol family 15 |
| [5439785.235489] 802.1Q VLAN Support v1.8 Ben Greear <greearb@candelatech.com> |
| [5439785.235495] All bugs added by David S. Miller <davem@redhat.com> |
| [5439785.235517] XENBUS: Device with no driver: device/console/0 |
| [5439785.236576] kjournald starting. Commit interval 5 seconds |
| [5439785.236588] EXT3-fs: mounted filesystem with ordered data mode. |
| [5439785.236625] VFS: Mounted root (ext3 filesystem) readonly. |
| [5439785.236663] Freeing unused kernel memory: 172k freed |
| [5439787.533779] EXT3 FS on sda1, internal journal |
| [5440655.065431] eth0: no IPv6 routers present |
| instance1:~# |
| |
| As you can see, the instance is running fine and doesn't see any disk |
| issues. It is now time to fix node2 and re-establish redundancy for the |
| involved instances. |
| |
.. note:: For Ganeti 2.0 we need to manually fix the volume group on
   node2 by running ``vgreduce --removemissing xenvg``.
| |
| :: |
| |
| $ gnt-node repair-storage %node2% lvm-vg %xenvg% |
| Mon Oct 26 18:14:03 2009 Repairing storage unit 'xenvg' on node2 ... |
| $ ssh %node2% vgs |
| VG #PV #LV #SN Attr VSize VFree |
| xenvg 1 8 0 wz--n- 673.84G 673.84G |
| $ |
| |
| This has removed the 'bad' disk from the volume group, which is now left |
| with only one PV. We can now replace the disks for the involved |
| instances:: |
| |
| $ for i in %instance{1..4}%; do gnt-instance replace-disks -a $i; done |
| Mon Oct 26 18:15:38 2009 Replacing disk(s) 0 for instance1 |
| Mon Oct 26 18:15:38 2009 STEP 1/6 Check device existence |
| Mon Oct 26 18:15:38 2009 - INFO: Checking disk/0 on node1 |
| Mon Oct 26 18:15:38 2009 - INFO: Checking disk/0 on node2 |
| Mon Oct 26 18:15:38 2009 - INFO: Checking volume groups |
| Mon Oct 26 18:15:38 2009 STEP 2/6 Check peer consistency |
| Mon Oct 26 18:15:38 2009 - INFO: Checking disk/0 consistency on node node1 |
| Mon Oct 26 18:15:39 2009 STEP 3/6 Allocate new storage |
| Mon Oct 26 18:15:39 2009 - INFO: Adding storage on node2 for disk/0 |
| Mon Oct 26 18:15:39 2009 STEP 4/6 Changing drbd configuration |
| Mon Oct 26 18:15:39 2009 - INFO: Detaching disk/0 drbd from local storage |
| Mon Oct 26 18:15:40 2009 - INFO: Renaming the old LVs on the target node |
| Mon Oct 26 18:15:40 2009 - INFO: Renaming the new LVs on the target node |
| Mon Oct 26 18:15:40 2009 - INFO: Adding new mirror component on node2 |
| Mon Oct 26 18:15:41 2009 STEP 5/6 Sync devices |
| Mon Oct 26 18:15:41 2009 - INFO: Waiting for instance instance1 to sync disks. |
| Mon Oct 26 18:15:41 2009 - INFO: - device disk/0: 12.40\% done, 9 estimated seconds remaining |
| Mon Oct 26 18:15:50 2009 - INFO: Instance instance1's disks are in sync. |
| Mon Oct 26 18:15:50 2009 STEP 6/6 Removing old storage |
| Mon Oct 26 18:15:50 2009 - INFO: Remove logical volumes for disk/0 |
| Mon Oct 26 18:15:52 2009 Replacing disk(s) 0 for instance2 |
| Mon Oct 26 18:15:52 2009 STEP 1/6 Check device existence |
| … |
| Mon Oct 26 18:16:01 2009 STEP 6/6 Removing old storage |
| Mon Oct 26 18:16:01 2009 - INFO: Remove logical volumes for disk/0 |
| Mon Oct 26 18:16:02 2009 Replacing disk(s) 0 for instance3 |
| Mon Oct 26 18:16:02 2009 STEP 1/6 Check device existence |
| … |
| Mon Oct 26 18:16:09 2009 STEP 6/6 Removing old storage |
| Mon Oct 26 18:16:09 2009 - INFO: Remove logical volumes for disk/0 |
| Mon Oct 26 18:16:10 2009 Replacing disk(s) 0 for instance4 |
| Mon Oct 26 18:16:10 2009 STEP 1/6 Check device existence |
| … |
| Mon Oct 26 18:16:18 2009 STEP 6/6 Removing old storage |
| Mon Oct 26 18:16:18 2009 - INFO: Remove logical volumes for disk/0 |
| $ |
| |
At this point, all instances should be healthy again.
| |
.. note:: Ganeti 2.0 doesn't have the ``-a`` option to replace-disks,
   so for it you have to run the loop twice, once over primary
   instances with argument ``-p`` and once over secondary instances
   with argument ``-s``, but otherwise the operations are similar::
| |
| $ gnt-instance replace-disks -p instance1 |
| … |
| $ for i in %instance{2..4}%; do gnt-instance replace-disks -s $i; done |
| |
| Common cluster problems |
| ----------------------- |
| |
There are a number of small issues that might appear on a cluster, and
which can be solved easily as long as the issue is properly identified.
For this exercise we will consider the case of node3, which was broken
previously and re-added to the cluster without reinstallation. Running
cluster verify on the cluster reports::
| |
| $ gnt-cluster verify |
| Mon Oct 26 18:30:08 2009 * Verifying global settings |
| Mon Oct 26 18:30:08 2009 * Gathering data (3 nodes) |
| Mon Oct 26 18:30:10 2009 * Verifying node status |
| Mon Oct 26 18:30:10 2009 - ERROR: node node3: unallocated drbd minor 0 is in use |
| Mon Oct 26 18:30:10 2009 - ERROR: node node3: unallocated drbd minor 1 is in use |
| Mon Oct 26 18:30:10 2009 * Verifying instance status |
| Mon Oct 26 18:30:10 2009 - ERROR: instance instance4: instance should not run on node node3 |
| Mon Oct 26 18:30:10 2009 * Verifying orphan volumes |
| Mon Oct 26 18:30:10 2009 - ERROR: node node3: volume 22459cf8-117d-4bea-a1aa-791667d07800.disk0_data is unknown |
| Mon Oct 26 18:30:10 2009 - ERROR: node node3: volume 1aaf4716-e57f-4101-a8d6-03af5da9dc50.disk0_data is unknown |
| Mon Oct 26 18:30:10 2009 - ERROR: node node3: volume 1aaf4716-e57f-4101-a8d6-03af5da9dc50.disk0_meta is unknown |
| Mon Oct 26 18:30:10 2009 - ERROR: node node3: volume 22459cf8-117d-4bea-a1aa-791667d07800.disk0_meta is unknown |
| Mon Oct 26 18:30:10 2009 * Verifying remaining instances |
| Mon Oct 26 18:30:10 2009 * Verifying N+1 Memory redundancy |
| Mon Oct 26 18:30:10 2009 * Other Notes |
| Mon Oct 26 18:30:10 2009 * Hooks Results |
| $ |
| |
| Instance status |
| +++++++++++++++ |
| |
As you can see, *instance4* has a copy running on node3, because we
forced the failover while node3 was down. This case is dangerous, as
the two copies of the instance share the same IP and MAC address,
wreaking havoc on the network environment and on anyone who tries to
use the instance.
| |
Ganeti doesn't directly handle this case. It is recommended to log on
to node3 and run::
| |
| $ xm destroy %instance4% |
| |
| Unallocated DRBD minors |
| +++++++++++++++++++++++ |
| |
| There are still unallocated DRBD minors on node3. Again, these are not |
| handled by Ganeti directly and need to be cleaned up via DRBD commands:: |
| |
| $ ssh %node3% |
| # on node 3 |
| $ drbdsetup /dev/drbd%0% down |
| $ drbdsetup /dev/drbd%1% down |
| $ |
| |
| Orphan volumes |
| ++++++++++++++ |
| |
At this point, the only remaining problem should be the so-called
*orphan* volumes. These can also result from an aborted disk-replace,
or a similar situation where Ganeti was not able to recover
automatically. Here you need to remove them manually via LVM commands::
| |
| $ ssh %node3% |
| # on node3 |
| $ lvremove %xenvg% |
| Do you really want to remove active logical volume "22459cf8-117d-4bea-a1aa-791667d07800.disk0_data"? [y/n]: %y% |
| Logical volume "22459cf8-117d-4bea-a1aa-791667d07800.disk0_data" successfully removed |
| Do you really want to remove active logical volume "22459cf8-117d-4bea-a1aa-791667d07800.disk0_meta"? [y/n]: %y% |
| Logical volume "22459cf8-117d-4bea-a1aa-791667d07800.disk0_meta" successfully removed |
| Do you really want to remove active logical volume "1aaf4716-e57f-4101-a8d6-03af5da9dc50.disk0_data"? [y/n]: %y% |
| Logical volume "1aaf4716-e57f-4101-a8d6-03af5da9dc50.disk0_data" successfully removed |
| Do you really want to remove active logical volume "1aaf4716-e57f-4101-a8d6-03af5da9dc50.disk0_meta"? [y/n]: %y% |
| Logical volume "1aaf4716-e57f-4101-a8d6-03af5da9dc50.disk0_meta" successfully removed |
  $
| |
| At this point cluster verify shouldn't complain anymore:: |
| |
| $ gnt-cluster verify |
| Mon Oct 26 18:37:51 2009 * Verifying global settings |
| Mon Oct 26 18:37:51 2009 * Gathering data (3 nodes) |
| Mon Oct 26 18:37:53 2009 * Verifying node status |
| Mon Oct 26 18:37:53 2009 * Verifying instance status |
| Mon Oct 26 18:37:53 2009 * Verifying orphan volumes |
| Mon Oct 26 18:37:53 2009 * Verifying remaining instances |
| Mon Oct 26 18:37:53 2009 * Verifying N+1 Memory redundancy |
| Mon Oct 26 18:37:53 2009 * Other Notes |
| Mon Oct 26 18:37:53 2009 * Hooks Results |
| $ |
| |
| N+1 errors |
| ++++++++++ |
| |
Since redundant instances in Ganeti have a primary/secondary model,
each node needs to set aside enough memory so that, if one of its peer
nodes fails, all the instances that have the failed node as primary and
this node as secondary can be failed over to it. More specifically, if
instance2 has node1 as primary and node2 as secondary (and node1 and
node2 do not have any other instances in this layout), then node2 must
have enough free memory so that if node1 fails, we can fail over
instance2 without any other operations (thus reducing the downtime
window). Let's increase the memory of the current instances to 4G, and
add three new instances: two on node2:node3 with 8GB of RAM each, and
one on node1:node2 with 12GB of RAM (numbers chosen so that we run out
of memory)::
| |
| $ gnt-instance modify -B memory=%4G% %instance1% |
| Modified instance instance1 |
| - be/maxmem -> 4096 |
| - be/minmem -> 4096 |
| Please don't forget that these parameters take effect only at the next start of the instance. |
| $ gnt-instance modify … |
| |
| $ gnt-instance add -t drbd -n %node2%:%node3% -s %512m% -B memory=%8G% -o %debootstrap% %instance5% |
| … |
| $ gnt-instance add -t drbd -n %node2%:%node3% -s %512m% -B memory=%8G% -o %debootstrap% %instance6% |
| … |
  $ gnt-instance add -t drbd -n %node1%:%node2% -s %512m% -B memory=%12G% -o %debootstrap% %instance7%
| $ gnt-instance reboot --all |
| The reboot will operate on 7 instances. |
| Do you want to continue? |
| Affected instances: |
| instance1 |
| instance2 |
| instance3 |
| instance4 |
| instance5 |
| instance6 |
| instance7 |
| y/[n]/?: %y% |
| Submitted jobs 677, 678, 679, 680, 681, 682, 683 |
| Waiting for job 677 for instance1... |
| Waiting for job 678 for instance2... |
| Waiting for job 679 for instance3... |
| Waiting for job 680 for instance4... |
| Waiting for job 681 for instance5... |
| Waiting for job 682 for instance6... |
| Waiting for job 683 for instance7... |
| $ |
| |
We rebooted the instances for the memory changes to take effect. Now
the cluster looks like::
| |
| $ gnt-node list |
| Node DTotal DFree MTotal MNode MFree Pinst Sinst |
| node1 1.3T 1.3T 32.0G 1.0G 6.5G 4 1 |
| node2 1.3T 1.3T 32.0G 1.0G 10.5G 3 4 |
| node3 1.3T 1.3T 32.0G 1.0G 30.5G 0 2 |
| $ gnt-cluster verify |
| Mon Oct 26 18:59:36 2009 * Verifying global settings |
| Mon Oct 26 18:59:36 2009 * Gathering data (3 nodes) |
| Mon Oct 26 18:59:37 2009 * Verifying node status |
| Mon Oct 26 18:59:37 2009 * Verifying instance status |
| Mon Oct 26 18:59:37 2009 * Verifying orphan volumes |
| Mon Oct 26 18:59:37 2009 * Verifying remaining instances |
| Mon Oct 26 18:59:37 2009 * Verifying N+1 Memory redundancy |
| Mon Oct 26 18:59:37 2009 - ERROR: node node2: not enough memory to accommodate instance failovers should node node1 fail |
| Mon Oct 26 18:59:37 2009 * Other Notes |
| Mon Oct 26 18:59:37 2009 * Hooks Results |
| $ |
| |
The cluster verify error above shows that, if node1 fails, node2 will
not have enough memory to fail over all of node1's primary instances.
To solve this, you have a number of options:
| |
- try to manually move instances around (but this can become
  complicated for any non-trivial cluster)
- try to reduce the minimum memory of some instances on the source node
  of the N+1 failure (in the example above ``node1``): this will allow
  them to start and be failed over/migrated with less than their
  maximum memory
- try to reduce the runtime/maximum memory of some instances on the
  destination node of the N+1 failure (in the example above ``node2``)
  to create additional available node memory (check the :doc:`admin`
  guide for what Ganeti will and won't automatically do with regard to
  instance runtime memory modification); a sketch of these two
  memory-based options follows this list
- if Ganeti has been built with the htools package enabled, you can run
  the ``hbal`` tool, which will try to compute an automated cluster
  solution that complies with the N+1 rule
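
As a sketch of the two memory-based options (the instance names and the
new sizes are purely illustrative)::

  $ gnt-instance modify -B minmem=%2G% %instance3%
  $ gnt-instance modify -B maxmem=%2G%,minmem=%2G% %instance2%

As with the earlier modifications, these only take effect at the next
start of each instance.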
| |
| Network issues |
| ++++++++++++++ |
| |
In case a node has problems with the network (usually the secondary
network, as problems with the primary network will render the node
unusable for Ganeti commands), it will show up in cluster verify as::
| |
| $ gnt-cluster verify |
| Mon Oct 26 19:07:19 2009 * Verifying global settings |
| Mon Oct 26 19:07:19 2009 * Gathering data (3 nodes) |
| Mon Oct 26 19:07:23 2009 * Verifying node status |
| Mon Oct 26 19:07:23 2009 - ERROR: node node1: tcp communication with node 'node3': failure using the secondary interface(s) |
| Mon Oct 26 19:07:23 2009 - ERROR: node node2: tcp communication with node 'node3': failure using the secondary interface(s) |
| Mon Oct 26 19:07:23 2009 - ERROR: node node3: tcp communication with node 'node1': failure using the secondary interface(s) |
| Mon Oct 26 19:07:23 2009 - ERROR: node node3: tcp communication with node 'node2': failure using the secondary interface(s) |
| Mon Oct 26 19:07:23 2009 - ERROR: node node3: tcp communication with node 'node3': failure using the secondary interface(s) |
| Mon Oct 26 19:07:23 2009 * Verifying instance status |
| Mon Oct 26 19:07:23 2009 * Verifying orphan volumes |
| Mon Oct 26 19:07:23 2009 * Verifying remaining instances |
| Mon Oct 26 19:07:23 2009 * Verifying N+1 Memory redundancy |
| Mon Oct 26 19:07:23 2009 * Other Notes |
| Mon Oct 26 19:07:23 2009 * Hooks Results |
| $ |
| |
This shows that both node1 and node2 have problems contacting node3
over the secondary network, and that node3 has the same problem
contacting them. From this output it can be deduced that, since node1
and node2 can communicate with each other, node3 is the one having
problems, and you need to investigate its network settings/connection.
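
To confirm which side is at fault, you can probe the secondary network
directly, for instance with ``fping``'s ``-S`` option to choose the
source address (a sketch run from node1, using its secondary IP as
source)::

  $ fping -S %192.0.2.1% %192.0.2.3%
  192.0.2.3 is unreachable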
| |
| Migration problems |
| ++++++++++++++++++ |
| |
| Since live migration can sometimes fail and leave the instance in an |
| inconsistent state, Ganeti provides a ``--cleanup`` argument to the |
| migrate command that does: |
| |
| - check on which node the instance is actually running (has the |
| command failed before or after the actual migration?) |
| - reconfigure the DRBD disks accordingly |
| |
| It is always safe to run this command as long as the instance has good |
| data on its primary node (i.e. not showing as degraded). If so, you can |
| simply run:: |
| |
| $ gnt-instance migrate --cleanup %instance1% |
| Instance instance1 will be recovered from a failed migration. Note |
| that the migration procedure (including cleanup) is **experimental** |
| in this version. This might impact the instance if anything goes |
| wrong. Continue? |
| y/[n]/?: %y% |
| Mon Oct 26 19:13:49 2009 Migrating instance instance1 |
| Mon Oct 26 19:13:49 2009 * checking where the instance actually runs (if this hangs, the hypervisor might be in a bad state) |
| Mon Oct 26 19:13:49 2009 * instance confirmed to be running on its primary node (node2) |
| Mon Oct 26 19:13:49 2009 * switching node node1 to secondary mode |
| Mon Oct 26 19:13:50 2009 * wait until resync is done |
| Mon Oct 26 19:13:50 2009 * changing into standalone mode |
| Mon Oct 26 19:13:50 2009 * changing disks into single-master mode |
| Mon Oct 26 19:13:50 2009 * wait until resync is done |
| Mon Oct 26 19:13:51 2009 * done |
| $ |
| |
In-use disks at instance shutdown
| +++++++++++++++++++++++++++++++++ |
| |
If you see something like the following when trying to shut down or
deactivate disks for an instance::
| |
| $ gnt-instance shutdown %instance1% |
| Mon Oct 26 19:16:23 2009 - WARNING: Could not shutdown block device disk/0 on node node2: drbd0: can't shutdown drbd device: /dev/drbd0: State change failed: (-12) Device is held open by someone\n |
| |
It most likely means that something is holding the underlying DRBD
device open. This can be bad if the instance is not running, as it
might mean there was concurrent access to the disks from both the node
and the instance, but not always (e.g. the partitions might merely have
been activated via ``kpartx``).
| |
To troubleshoot this issue you need to follow standard Linux practices,
and pay attention to the hypervisor being used (a combined example
follows this list):
| |
| - check if (in the above example) ``/dev/drbd0`` on node2 is being |
| mounted somewhere (``cat /proc/mounts``) |
| - check if the device is not being used by device mapper itself: |
| ``dmsetup ls`` and look for entries of the form ``drbd0pX``, and if so |
| remove them with either ``kpartx -d`` or ``dmsetup remove`` |
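
Combined into a quick check session, this could look like the following
sketch (device names as in the example above)::

  $ ssh %node2%
  # on node2
  $ grep /dev/drbd0 /proc/mounts
  $ dmsetup ls | grep drbd0
  $ kpartx -d /dev/drbd%0%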
| |
For Xen, check that the hypervisor itself is not using the disks::
| |
| $ xenstore-ls /local/domain/%0%/backend/vbd|grep -e "domain =" -e physical-device |
| domain = "instance2" |
| physical-device = "93:0" |
| domain = "instance3" |
| physical-device = "93:1" |
| domain = "instance4" |
| physical-device = "93:2" |
| $ |
| |
You can see in the above output that the node exports three disks to
three instances. The ``physical-device`` key is in hexadecimal
``major:minor`` format, and ``0x93`` (147 in decimal) is DRBD's major
number. Thus we can see from the above that instance2 has /dev/drbd0,
instance3 /dev/drbd1, and instance4 /dev/drbd2.
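
Converting such a pair by hand is easy with shell arithmetic, which
understands the ``0x`` prefix; a quick sketch for the ``93:2`` entry
above::

  $ echo $((0x93)) $((0x2))
  147 2

i.e. major 147 (DRBD) and minor 2, hence /dev/drbd2.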
| |
| LUXI version mismatch |
| +++++++++++++++++++++ |
| |
| LUXI is the protocol used for communication between clients and the |
| master daemon. Starting in Ganeti 2.3, the peers exchange their version |
| in each message. When they don't match, an error is raised:: |
| |
| $ gnt-node modify -O yes %node3% |
| Unhandled Ganeti error: LUXI version mismatch, server 2020000, request 2030000 |
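
The two numbers pack the Ganeti version as ``major*1000000 +
minor*10000 + revision``, so the message above shows a 2.2 server
talking to a 2.3 client. A quick way to decode such a value (a sketch,
in the same Python 2 style as the command below)::

  $ python -c 'a, r = divmod(2020000, 1000000); b, c = divmod(r, 10000); print a, b, c'
  2 2 0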
| |
Usually this means that server and client are from different Ganeti
versions, or that they import the Ganeti libraries from different paths
(e.g. an older version installed in another place). You can print the
import path for Ganeti's modules using the following command (note that
depending on your setup you might have to use an explicit version in
the Python command, e.g. ``python2.6``)::
| |
| python -c 'import ganeti; print ganeti.__file__' |
| |
| .. vim: set textwidth=72 : |
| .. Local Variables: |
| .. mode: rst |
| .. fill-column: 72 |
| .. End: |