blob: df053a6c50c4a48ba0c001f672de9f2e5b9ef346 [file] [log] [blame]
==================================
Migration speed accounting in Hbal
==================================
.. contents:: :depth: 2
Hbal usually performs complex sequence of moves during cluster balancing in
order to achieve local optimal cluster state. Unfortunately, each move may take
significant amount of time. Thus, during the sequence of moves the situation on
cluster may change (e.g., because of adding new instance or because of instance
or node parameters change) and desired moves can become unprofitable.
Usually disk moves become a bottleneck and require sufficient amount of time.
:ref:`Instance move improvements <move-performance>` considers
disk moves speed in more details. Currently, ``hbal`` has a ``--no-disk-moves``
option preventing disk moves during cluster balancing in order to perform fast
(but of course non optimal) balancing. It may be useful, but ideally we need to
find a balance between optimal configuration and time to reach this
configuration.
Avoiding insignificant disk moves
=================================
Allowing only profitable enough disk moves may become a first step to reach
a compromise between moves speed and optimal scoring. This can be implemented
by introducing ``--avoid-disk-moves *FACTOR*`` option which will admit disk
moves only if the gain in the cluster metrics is *FACTOR* times
higher than the gain achievable by non disk moves.
Avoiding insignificant long-time solutions
==========================================
The next step is to estimate an amount of time required to perform a balancing
step and to introduce a new term: ``long-time`` solution.
``--long-solution-threshold`` option will specify a duration in seconds.
A solution exceeding the duration is a ``long-time`` solution by definition.
With time estimations we will be able to filter Hbal's sequences and
eliminating long-time solutions which don't lead to a sufficient cluster metric
improvement. This can be done by the ``--avoid-long-solutions *FACTOR*`` option
which will allow only long solutions, whose K/N metrics are more than *FACTOR*,
where K is the number of times cluster metric has increased and N is an
estimated time to perform this solution divided by the threshold.
The default values for the new options are:
``--long-solution-threshold`` = 1000 seconds
(all adequate solutions are not ``long-time``)
``--avoid-long-solutions`` = 0.0
(no filtering by time estimations, feature disabled)
Network bandwidth cluster tags
==============================
The bandwidth between nodes, node groups and within whole cluster could be
specified "by hand" with the cluster tags.
Every node contains its own set of bandwidth tags. Tags from higher level are
inherited by lower levels: nodegroups inherit cluster tags and nodes inherit
nodegroups tags. Tags from lower level (if exist) override higher level tags
with the same bandwidth prefixes: node tags override nodegroup tags as well as
nodegroup tags override cluster tags.
Below are some examples of using bandwidth tags:
(The examples are provided using in the Htools Text backend format)
1) single nodegroup with 5 nodes, all bandwidths symmetric.
group-01|...|nic:100MBit|
*no* node-level bandwidth tags (they are inherited by group tags)
htools:bandwidth:nic
htools:bandwidth:nic:100MBit::nic:100MBit::100
2) 3 nodegroups, within each nodegroup symmetric bandwidths,
between nodegroups different bandwidths.
group-01|...|nic:1000MBit (overrides cluster tags)|
group-02|...|nic:100MBit (overrides cluster tags)|
group-03|...|inherited by cluster tags (nic:10MBit)|
*no* node-level bandwidth tags (they are inherited by group tags)
htools:bandwidth:nic
nic:10MBit
htools:bandwidth:nic:10MBit::nic:10MBit::10
htools:bandwidth:nic:100MBit::nic:100MBit::100
htools:bandwidth:nic:1000MBit::nic:1000MBit::1000
htools:bandwidth:nic:10MBit::nic:100MBit::10
htools:bandwidth:nic:10MBit::nic:1000MBit::10
htools:bandwidth:nic:100MBit::nic:1000MBit::100
Network bandwidth estimation
============================
Balancing time can be estimated by dividing the amount of data to be moved by
the current network bandwidth between the affected nodes.
We propose to add a new data collector, that will gather information about
network speed by sending a test file between between nodes. Accounting the
time, we can estimate average network speed between nodegroups in a cluster.
DataCollector implementation details
====================================
The new bandwidth data collector introduces an ability to collect actual
information about network speed between nodegroups in a cluster. We assume,
that the network bandwidth between any nodes within one nodegroup is almost the
same unlike the network speed between different nodegroups. So the proposed
data collector will provide the most necessary data for time estimations.
For sending packets *scp* utility will be used. The default size of file
to send is 5Mbyte in order to obtain adequate values of the network speed.
During *dcUpdate* every data collector sends a file with known size to a node
from another nodegroup (chosen randomly) and measures time to perform it. MonD
receives the response of *dcReport* from collectors, it fills nodegroup
bandwidth map in nodes. The information collected will be used to estimate the
time of balancing steps. In the case of existing network speed information for
some node from bandwidth tags as well as from bandwidth data collector the last
one will be chosen.