doc/design-migration-speed-hbal.rst - ganeti - Git at Google

 ==================================
 Migration speed accounting in Hbal
 ==================================

 .. contents:: :depth: 2

 Hbal usually performs complex sequence of moves during cluster balancing in
 order to achieve local optimal cluster state. Unfortunately, each move may take
 significant amount of time. Thus, during the sequence of moves the situation on
 cluster may change (e.g., because of adding new instance or because of instance
 or node parameters change) and desired moves can become unprofitable.

 Usually disk moves become a bottleneck and require sufficient amount of time.
 :ref:`Instance move improvements <move-performance>` considers
 disk moves speed in more details. Currently, ``hbal`` has a ``--no-disk-moves``
 option preventing disk moves during cluster balancing in order to perform fast
 (but of course non optimal) balancing. It may be useful, but ideally we need to
 find a balance between optimal configuration and time to reach this
 configuration.

 Avoiding insignificant disk moves
 =================================

 Allowing only profitable enough disk moves may become a first step to reach
 a compromise between moves speed and optimal scoring. This can be implemented
 by introducing ``--avoid-disk-moves *FACTOR*`` option which will admit disk
 moves only if the gain in the cluster metrics is *FACTOR* times
 higher than the gain achievable by non disk moves.

 Avoiding insignificant long-time solutions
 ==========================================

 The next step is to estimate an amount of time required to perform a balancing
 step and to introduce a new term: ``long-time`` solution.

 ``--long-solution-threshold`` option will specify a duration in seconds.
 A solution exceeding the duration is a ``long-time`` solution by definition.

 With time estimations we will be able to filter Hbal's sequences and
 eliminating long-time solutions which don't lead to a sufficient cluster metric
 improvement. This can be done by the ``--avoid-long-solutions *FACTOR*`` option
 which will allow only long solutions, whose K/N metrics are more than *FACTOR*,
 where K is the number of times cluster metric has increased and N is an
 estimated time to perform this solution divided by the threshold.

 The default values for the new options are:

 ``--long-solution-threshold`` = 1000 seconds
 (all adequate solutions are not ``long-time``)

 ``--avoid-long-solutions`` = 0.0
 (no filtering by time estimations, feature disabled)

 Network bandwidth cluster tags
 ==============================

 The bandwidth between nodes, node groups and within whole cluster could be
 specified "by hand" with the cluster tags.

 Every node contains its own set of bandwidth tags. Tags from higher level are
 inherited by lower levels: nodegroups inherit cluster tags and nodes inherit
 nodegroups tags. Tags from lower level (if exist) override higher level tags
 with the same bandwidth prefixes: node tags override nodegroup tags as well as
 nodegroup tags override cluster tags.

 Below are some examples of using bandwidth tags:
 (The examples are provided using in the Htools Text backend format)

 1) single nodegroup with 5 nodes, all bandwidths symmetric.

 group-01|...|nic:100MBit|

 *no* node-level bandwidth tags (they are inherited by group tags)

 htools:bandwidth:nic
 htools:bandwidth:nic:100MBit::nic:100MBit::100


 2) 3 nodegroups, within each nodegroup symmetric bandwidths,
 between nodegroups different bandwidths.

 group-01|...|nic:1000MBit (overrides cluster tags)|
 group-02|...|nic:100MBit (overrides cluster tags)|
 group-03|...|inherited by cluster tags (nic:10MBit)|

 *no* node-level bandwidth tags (they are inherited by group tags)

 htools:bandwidth:nic
 nic:10MBit
 htools:bandwidth:nic:10MBit::nic:10MBit::10
 htools:bandwidth:nic:100MBit::nic:100MBit::100
 htools:bandwidth:nic:1000MBit::nic:1000MBit::1000
 htools:bandwidth:nic:10MBit::nic:100MBit::10
 htools:bandwidth:nic:10MBit::nic:1000MBit::10
 htools:bandwidth:nic:100MBit::nic:1000MBit::100


 Network bandwidth estimation
 ============================

 Balancing time can be estimated by dividing the amount of data to be moved by
 the current network bandwidth between the affected nodes.

 We propose to add a new data collector, that will gather information about
 network speed by sending a test file between between nodes. Accounting the
 time, we can estimate average network speed between nodegroups in a cluster.

 DataCollector implementation details
 ====================================

 The new bandwidth data collector introduces an ability to collect actual
 information about network speed between nodegroups in a cluster. We assume,
 that the network bandwidth between any nodes within one nodegroup is almost the
 same unlike the network speed between different nodegroups. So the proposed
 data collector will provide the most necessary data for time estimations.

 For sending packets *scp* utility will be used. The default size of file
 to send is 5Mbyte in order to obtain adequate values of the network speed.

 During *dcUpdate* every data collector sends a file with known size to a node
 from another nodegroup (chosen randomly) and measures time to perform it. MonD
 receives the response of *dcReport* from collectors, it fills nodegroup
 bandwidth map in nodes. The information collected will be used to estimate the
 time of balancing steps. In the case of existing network speed information for
 some node from bandwidth tags as well as from bandwidth data collector the last
 one will be chosen.
	==================================
	Migration speed accounting in Hbal
	==================================

	.. contents:: :depth: 2

	Hbal usually performs complex sequence of moves during cluster balancing in
	order to achieve local optimal cluster state. Unfortunately, each move may take
	significant amount of time. Thus, during the sequence of moves the situation on
	cluster may change (e.g., because of adding new instance or because of instance
	or node parameters change) and desired moves can become unprofitable.

	Usually disk moves become a bottleneck and require sufficient amount of time.
	:ref:`Instance move improvements <move-performance>` considers
	disk moves speed in more details. Currently, ``hbal`` has a ``--no-disk-moves``
	option preventing disk moves during cluster balancing in order to perform fast
	(but of course non optimal) balancing. It may be useful, but ideally we need to
	find a balance between optimal configuration and time to reach this
	configuration.

	Avoiding insignificant disk moves
	=================================

	Allowing only profitable enough disk moves may become a first step to reach
	a compromise between moves speed and optimal scoring. This can be implemented
	by introducing ``--avoid-disk-moves FACTOR`` option which will admit disk
	moves only if the gain in the cluster metrics is FACTOR times
	higher than the gain achievable by non disk moves.

	Avoiding insignificant long-time solutions
	==========================================

	The next step is to estimate an amount of time required to perform a balancing
	step and to introduce a new term: ``long-time`` solution.

	``--long-solution-threshold`` option will specify a duration in seconds.
	A solution exceeding the duration is a ``long-time`` solution by definition.

	With time estimations we will be able to filter Hbal's sequences and
	eliminating long-time solutions which don't lead to a sufficient cluster metric
	improvement. This can be done by the ``--avoid-long-solutions FACTOR`` option
	which will allow only long solutions, whose K/N metrics are more than FACTOR,
	where K is the number of times cluster metric has increased and N is an
	estimated time to perform this solution divided by the threshold.

	The default values for the new options are:

	``--long-solution-threshold`` = 1000 seconds
	(all adequate solutions are not ``long-time``)

	``--avoid-long-solutions`` = 0.0
	(no filtering by time estimations, feature disabled)

	Network bandwidth cluster tags
	==============================

	The bandwidth between nodes, node groups and within whole cluster could be
	specified "by hand" with the cluster tags.

	Every node contains its own set of bandwidth tags. Tags from higher level are
	inherited by lower levels: nodegroups inherit cluster tags and nodes inherit
	nodegroups tags. Tags from lower level (if exist) override higher level tags
	with the same bandwidth prefixes: node tags override nodegroup tags as well as
	nodegroup tags override cluster tags.

	Below are some examples of using bandwidth tags:
	(The examples are provided using in the Htools Text backend format)

	1) single nodegroup with 5 nodes, all bandwidths symmetric.

	group-01\|...\|nic:100MBit\|

	no node-level bandwidth tags (they are inherited by group tags)

	htools:bandwidth:nic
	htools:bandwidth:nic:100MBit::nic:100MBit::100


	2) 3 nodegroups, within each nodegroup symmetric bandwidths,
	between nodegroups different bandwidths.

	group-01\|...\|nic:1000MBit (overrides cluster tags)\|
	group-02\|...\|nic:100MBit (overrides cluster tags)\|
	group-03\|...\|inherited by cluster tags (nic:10MBit)\|

	no node-level bandwidth tags (they are inherited by group tags)

	htools:bandwidth:nic
	nic:10MBit
	htools:bandwidth:nic:10MBit::nic:10MBit::10
	htools:bandwidth:nic:100MBit::nic:100MBit::100
	htools:bandwidth:nic:1000MBit::nic:1000MBit::1000
	htools:bandwidth:nic:10MBit::nic:100MBit::10
	htools:bandwidth:nic:10MBit::nic:1000MBit::10
	htools:bandwidth:nic:100MBit::nic:1000MBit::100


	Network bandwidth estimation
	============================

	Balancing time can be estimated by dividing the amount of data to be moved by
	the current network bandwidth between the affected nodes.

	We propose to add a new data collector, that will gather information about
	network speed by sending a test file between between nodes. Accounting the
	time, we can estimate average network speed between nodegroups in a cluster.

	DataCollector implementation details
	====================================

	The new bandwidth data collector introduces an ability to collect actual
	information about network speed between nodegroups in a cluster. We assume,
	that the network bandwidth between any nodes within one nodegroup is almost the
	same unlike the network speed between different nodegroups. So the proposed
	data collector will provide the most necessary data for time estimations.

	For sending packets scp utility will be used. The default size of file
	to send is 5Mbyte in order to obtain adequate values of the network speed.

	During dcUpdate every data collector sends a file with known size to a node
	from another nodegroup (chosen randomly) and measures time to perform it. MonD
	receives the response of dcReport from collectors, it fills nodegroup
	bandwidth map in nodes. The information collected will be used to estimate the
	time of balancing steps. In the case of existing network speed information for
	some node from bandwidth tags as well as from bandwidth data collector the last
	one will be chosen.