This document provides troubleshooting tips for issues commonly encountered by Ganeti users.
If the instances you create are unreachable, the problem usually lies either in the network configuration of the nodes or in the network configuration inside the instances themselves. In the latter case, run gnt-instance console or connect to the VNC console to access the instance and configure its networking.

gnt-cluster init fails with the error message “Cluster IP already active”

Each Ganeti cluster of n nodes needs at least n+1 IP addresses: one IP address for each node, plus an extra IP address that represents the cluster itself and that is “floating” across the nodes. Specifically, the cluster address is active on a network interface of the master node, and it is migrated to the new master when the master node is failed over.
gnt-cluster init tries to take ownership of this IP address and assign it to the master of the cluster being initialized. Before taking ownership of the address, the command checks whether the address is already active, as it should not be. If the IP address is already active on some machine, the error message “Cluster IP already active” is triggered.
To resolve this error:
Make sure that the IP is actually the IP that needs to become the cluster IP, as opposed to the IP of the main network interface of one node.
If the IP address is in fact the correct IP, find the machine on which that IP ($MASTER_IP) is active, and the interface ($MASTER_NETDEV) to which it is assigned.
On said machine, execute the following command as root:
ip addr del "$MASTER_IP" dev "$MASTER_NETDEV"
Run gnt-cluster init on the master to initialize the cluster.
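Before running gnt-cluster init, you can check by hand whether the candidate cluster IP is already active. The sketch below simulates the output of ip -o addr show for illustration; the address 192.0.2.100 and the address list are placeholders, not real cluster values.

```shell
# MASTER_IP is a placeholder; substitute the address you intend to use
# as the cluster IP.
MASTER_IP=192.0.2.100
# Simulated list of locally active addresses; on a real node, obtain it with:
#   addrs=$(ip -o addr show | awk '{print $4}')
addrs="192.0.2.1/24
192.0.2.100/24"
if printf '%s\n' "$addrs" | grep -q "^${MASTER_IP}/"; then
  echo "Cluster IP already active on this host"
else
  echo "Cluster IP free"
fi
```

If the check reports the address as active, remove it on that machine with the ip addr del command described above before retrying gnt-cluster init.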
To exit the console at any time, type CTRL+]. This method works on both Xen and KVM.

If nothing is displayed on the console, try pressing Enter.
If the console doesn’t work, the instance probably has no getty running on its serial console. After adjusting the relevant configuration (inittab, /etc/init, or /etc/event.d, depending on the distribution), remember to restart init (you can use killall -HUP init) or reboot the instance to verify that your setup works.

If you cannot use the console, you can still change the configuration of an instance by running gnt-instance activate-disks and then mounting the disks to add the relevant configuration.

Because all instances are different, some of these steps might be redundant for your instance. If the instance is running LVM or some other peculiar configuration, you may need to take additional steps. The following commands work for a standard instance with a partition inside a block device:
1. Run gnt-instance activate-disks <instance-name> and note the path of the device (the path after the last “:”).
1. If the device is an image file, attach it to a loop device with losetup -f <file>, and then use losetup -a to find the correct device.
1. Run kpartx -a <device> (this is either the activate-disks device or the losetup device).
1. Run mount on /dev/mapper/<device>p<int>, substituting the appropriate values for the device name and the partition number you want to access.
1. When you are done, run umount.
1. Run kpartx -d <device>.
1. Run losetup -d /dev/loop<int> (if you used losetup before).
1. Run gnt-instance deactivate-disks (optional; you can skip this if you plan to start the instance again immediately).

All traffic in inter- and intra-cluster moves is transferred using socat and encrypted by default. In intra-cluster moves, the encryption tends to be the limiting factor for the speed of moves. Allowing more options for encryption, including no encryption at all, is a planned feature for 2.12.
If security is of no concern, one useful trick is to convert the instance to use a disk template with redundancy (e.g. DRBD). Failing the instance over and changing disk templates again can significantly outperform a standard instance move.
For moves in which the speed of the network connection is the problem, try using the --compress option provided by all operations performing instance moves in 2.10. This option can help reduce the amount of data sent over the network by compressing the instance image.
Unfortunately, this is a known issue that will be addressed in Ganeti 2.12 by using opportunistic locking in instance moves.
The move-instance tool uses RAPI, which requires the rapi.pem certificate file to be passed to it as an argument. If the instance is being moved between clusters, both of the clusters’ RAPI certificate files must be provided.
We recommend first checking the import/export scripts of the used OS image. Ganeti uses these scripts to perform moves and migrations, and the scripts often go untested prior to events such as moves and migrations.
Run gnt-cluster upgrade --to=2.xx on the master node.

If your cluster has at least 3 nodes using DRBD, the safest way to upgrade is node by node, upgrading the master node last:

Remove a node from the cluster:
gnt-node modify -D yes "$NODE"
hbal -L -X
gnt-node modify -O yes "$NODE"

Upgrade the node OS.

Re-add the node to the cluster:
gnt-node add --readd "$NODE"
If you have sufficient resources, you can also set up a new cluster with the new OS system, and then use the inter-cluster instance move to transfer the instances.
While we don’t recommend using different DRBD versions within a single node group for an extended period of time, Ganeti still works reliably with a non-homogeneous DRBD setup during the upgrade process. The safest way to upgrade DRBD versions is node by node, upgrading the master node last:
Remove a node from the cluster:
gnt-node modify -D yes "$NODE"
hbal -L -X
gnt-node modify -O yes "$NODE"
Upgrade the node OS.
Re-add the node to the cluster:
gnt-node add --readd "$NODE"
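The per-node procedure can also be written as a loop. This is only a sketch: the node names are placeholders, it must run on the master node, and the master itself should be upgraded last, after failing it over.

```shell
# Sketch: drain, upgrade, and re-add each non-master node in turn.
for NODE in node2.example.org node3.example.org; do
  gnt-node modify -D yes "$NODE"   # drain the node: no new instances land on it
  hbal -L -X                       # rebalance the cluster, moving instances away
  gnt-node modify -O yes "$NODE"   # mark the node offline
  # ... upgrade the node's OS or DRBD packages here ...
  gnt-node add --readd "$NODE"     # re-add the upgraded node to the cluster
done
```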
There is some confusion regarding the use of LVM snapshots in Ganeti. Ganeti does not support LVM snapshots in the sense that it doesn’t support creating minimal-size snapshots that persist, grow as needed, and can be used to restore a state. Support for this type of LVM snapshot is absent because of the slowdown users experience once significant changes accumulate and the snapshots grow too large.
Ganeti does use LVM snapshots to create a stable view of a volume currently in use which needs to be backed up, but the snapshot is deleted once a backup is made.
The gnt-backup export command can be used to export an instance to any node in the cluster. The backup contains the data and the configuration of the instance, and can be found in the /srv/ganeti/export/$instance directory.
Much like in the case of instance moves, export and import scripts are often to blame for strange behavior, especially if said scripts are untested beforehand. We recommend first checking the import/export scripts of the used OS image.
Ganeti doesn’t require any kind of special Ceph configuration. To deploy and configure RBD/Ceph:
Follow the deployment and configuration instructions on the Ceph website. Ganeti doesn’t use the Ceph Filesystem or MDSes, so you can skip these sections of Ceph’s instructions.
Once Ceph is up and running and you have configured a RADOS block device storage pool for Ganeti (named rbd by default), tell Ganeti to use RBD:

gnt-cluster modify --enabled-disk-templates rbd \
  --ipolicy-disk-templates rbd
Now that RBD is enabled, specify the pool Ganeti should use. The default value is rbd, so on a fresh cluster, this step is a no-op.
gnt-cluster modify -D rbd:pool=rbd
Configure Ceph on all Ganeti nodes by following the Installing RBD instructions of the Ganeti installation tutorial. For example:

# This will run the same command on all nodes in your cluster.
# NOTE: On Ganeti nodes, /etc/ceph/ceph.conf must at least enumerate the
# IP addresses and ports of all Ceph monitors.
dsh -Mf /var/lib/ganeti/ssconf_node_list \
  "apt-get update; apt-get install ceph-common; scp $HOSTNAME:/etc/ceph/ceph.conf /etc/ceph/"
Verify that all nodes can access Ceph. If this command completes on all nodes, you're good to go.
dsh -Mf /var/lib/ganeti/ssconf_node_list rbd list
If all has gone well up to this point, you can start your first RBD instance. For example:
gnt-instance add -t rbd -s 80G helloworld.example.org
If you’re using Ganeti 2.10 or newer with KVM, you can exploit KVM’s native support for Ceph and get a free performance boost by enabling userspace access:
gnt-cluster modify -D rbd:access=userspace
Note: Gluster configuration is very similar.
See the Ganeti OS installation redesign document for troubleshooting information.
Setting a root password or key for the OS you're installing is highly dependent on the OS scripts you use. Here’s an example related to the instance-debootstrap OS install scripts:
By default, instance-debootstrap resets the root password so that newly-created instances have an empty password. Therefore, you don't need any special steps to provide access to the instance.
If you need a password immediately for the instance as it’s being created, you have to use a hook. The complete instructions are in the README file of the instance-debootstrap package. The process can be summarized as follows:
1. Copy examples/hooks/defaultpasswords to $sysconfdir/ganeti/instance-debootstrap/hooks/.
1. Copy examples/hooks/confdata/defaultpasswords to $sysconfdir/ganeti/instance-debootstrap/hooks/confdata/.
1. Edit the copied confdata file; it takes one entry per line in the form username:password.
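Creating the confdata entries can be sketched as follows. To keep the sketch runnable anywhere, it defaults sysconfdir to a scratch directory; on a real node it would be the actual $sysconfdir (typically /etc), and root:changeme is a placeholder entry.

```shell
# On a real node, set sysconfdir=/etc (or your build's $sysconfdir) first.
sysconfdir="${sysconfdir:-$(mktemp -d)}"
confdata="$sysconfdir/ganeti/instance-debootstrap/hooks/confdata"
mkdir -p "$confdata"
# The defaultpasswords confdata file takes one "username:password" per line.
printf 'root:changeme\n' > "$confdata/defaultpasswords"
cat "$confdata/defaultpasswords"
```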
The kernel is specified in the hypervisor parameters and can be modified. For example:
gnt-cluster modify -H kvm:kernel_path=/boot/vmlinuz-2.6-kvmU,initrd_path=/boot/initrd-2.6-kvmU
When the master node fails to connect to the rest of the network because of a network failure, you simply need to fix the network. Once the network is fixed, the master recovers automatically.
In the case of a more serious failure, a master failover is needed. A master failover must be triggered manually. To perform a master failover:
Make sure that the original failed master won't start again while a new master is present, preferably by physically shutting down the node.
To upgrade one of the master candidates to the master, issue the following command on the machine you intend to be the new master:
gnt-cluster master-failover
Offline the old master so the new master doesn't try to communicate with it. Issue the following command:
gnt-node modify --offline yes oldmaster
If there were any DRBD instances on the old master node, they can be failed over by issuing the following commands:
gnt-node evacuate -s oldmaster
gnt-node evacuate -p oldmaster
Any instances using the plain disk template on the old master need to be recreated.
When a failed node (either a regular or master node) is repaired and ready to be added to the cluster, reinstall Ganeti on the node and then re-add it to the cluster using the following command:
gnt-node add --readd nodename
After re-adding a node, it's a good idea to run hbal --luxi --print-commands on the master node to obtain the list of commands to balance the cluster and populate the new node with instances. Running hbal --luxi --exec executes the commands directly.
If the master node fails in a two-node cluster, promoting the non-master node to master requires special handling. This is because a master needs to obtain a majority of votes from the other nodes to ensure there is no other master running, and in a cluster with only two nodes, neither node can obtain such a majority.

To fail over the master node in a two-node cluster:
Issue the following command on the new master:
gnt-cluster master-failover --no-voting
Manually start the master daemon with the following command:
ganeti-masterd --no-voting
Run the following command to ensure that the cluster is consistent again:
gnt-cluster redist-conf
Ganeti’s ganeti-watcher daemon makes sure that all instances marked as up are running. It also reactivates secondary DRBD block devices of instances on nodes that are rebooted. In sum, it tries to ensure that if a node is rebooted, all affected instances are eventually fixed and brought to their original state.
Ganeti doesn’t perform failovers automatically; they must be triggered manually. However, there are tools that help the administrator perform common related tasks:
gnt-node evacuate: moves instances away from a given node.
hbal: automates the task of distributing instances evenly across the nodes; it displays a list of recommended commands or performs them automatically.

DRBD disks can experience various error states, some of which Ganeti can recover from (semi-)automatically.
If an instance's DRBD disks have fallen out of sync (for example, after a node reboot), run gnt-instance activate-disks for the instance (even while the instance is running). DRBD automatically syncs the changed portions of the disks from the primary node to the secondary node.

If the disks on one of the nodes are broken, use gnt-instance replace-disks to recreate and resync those disks and return them to a fully functional and replicated state.

There is no single source of DRBD performance problems, and thus no single solution. There are a few good starting points for diagnosing such problems:
Check the DRBD-related disk parameters (resync-rate, protocol, and the dynamic-resync-related parameters).

The maximum size of DRBD devices which can be created through Ganeti is 4TB (see Issue 256: DRBD Volume above 4TB not working). To create larger DRBD devices, you must set up the devices manually and use them via the blockdev disk template. However, you will lose all automatic DRBD management performed by Ganeti, as well as the ability to fail over or migrate instances to their secondary node.
Ganeti can convert between DRBD and plain LVM volumes. To perform the conversion, use gnt-instance modify and provide the desired new disk template with the --disk-template command-line parameter. Refer to the gnt-instance man page for further details. Ganeti does not support any other disk template conversions.
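For example, a plain-to-DRBD conversion might look like the following sketch (instance and node names are placeholders; the instance must be stopped during the conversion, and -n selects the node that will hold the new DRBD secondary):

```shell
gnt-instance shutdown inst1.example.org
# Convert the stopped instance to DRBD, mirroring onto node2:
gnt-instance modify --disk-template drbd -n node2.example.org inst1.example.org
gnt-instance startup inst1.example.org

# Converting back to plain LVM later works the same way:
# gnt-instance modify --disk-template plain inst1.example.org
```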
Upgrades to the drbd8-utils package in Ubuntu have resulted in some DRBD 8.4 syntax requirements spilling into DRBD 8.3 tools as well. Ganeti's attempts to execute previously valid commands fail unless the compatibility executable is used instead of the one linked by default. A temporary way to resolve this:
mv /sbin/drbdsetup /sbin/drbdsetup84
ln -s /lib/drbd/drbdsetup-83 /sbin/drbdsetup
For more details, look at the forum topic: https://groups.google.com/forum/#!msg/ganeti/MkCNmzF6hu8/kTPOELyEkdsJ
The message Unhandled Ganeti error: Given cluster certificate does not match local key
can indicate that an old /var/lib/ganeti/server.pem
certificate (on the node to be added) still exists. If you ran a cluster cleanup script to wipe your cluster, make sure to run it on all nodes that shall be added to the new cluster.
Tips and tricks for debugging hail and hbal:
hbal accepts the -t option, which provides the cluster configuration in text format. You can generate a text-format description of the current live state of the cluster by running hbal -L -S config, which creates a file named config.original. hspace also accepts the -t option and therefore can be debugged in a way similar to hail.

When debugging gnt-instance add, use the -I option to specify your own instance allocator. This option can also be a script that saves the input and returns a constant output. The capacity of the live cluster can be checked in the same fashion with hspace -L.

Tips for gnt-* commands:
Here is some documentation that Ganeti users have provided: