| ======================== |
| Performance tests for QA |
| ======================== |
| |
| .. contents:: :depth: 4 |
| |
| This design document describes performance tests to be added to QA in |
| order to measure performance changes over time. |
| |
| Current state and shortcomings |
| ============================== |
| |
| Currently, only functional QA tests are performed. Those tests verify |
| the correct behaviour of Ganeti in various configurations, but are not |
| designed to continuously monitor the performance of Ganeti. |
| |
The current QA tests don't execute multiple tasks/jobs in parallel.
Therefore, Ganeti's locking code receives hardly any testing, neither
functionally nor performance-wise.
| |
| On the plus side, Ganeti's QA code does already measure the runtime of |
| individual tests, which is leveraged in this design. |
| |
| Proposed changes |
| ================ |
| |
| The tests to be added in the context of this design document focus on |
| two areas: |
| |
| * Job queue performance. How does Ganeti handle a lot of submitted |
| jobs? |
| * Parallel job execution performance. How well does Ganeti |
| parallelize jobs? |
| |
| Jobs are submitted to the job queue in sequential order, but the |
| execution of the jobs runs in parallel. All job submissions must |
| complete within a reasonable timeout. |
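
The following sketch illustrates this pattern. It is not part of the
QA code; the use of ``gnt-debug delay --submit`` and the assumption
that the command prints just the job ID are for illustration only::

  import subprocess
  import time

  SUBMISSION_TIMEOUT = 10.0  # a "reasonable timeout" per submission

  def submit_delay_job():
    """Queue one job without waiting for its execution."""
    start = time.time()
    # With --submit, the command returns as soon as the job is
    # queued, printing the job ID (output format assumed here).
    out = subprocess.check_output(
      ["gnt-debug", "delay", "--submit", "0.1"],
      universal_newlines=True)
    assert time.time() - start < SUBMISSION_TIMEOUT
    return out.strip()

  # Submission happens sequentially, one job after another ...
  job_ids = [submit_delay_job() for _ in range(10)]
  # ... while Ganeti executes the queued jobs in parallel.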
| |
In order to make it easier to recognize performance-related tests,
all tests added in the context of this design get a description with
a "PERFORMANCE: " prefix.
| |
| Job queue performance |
| --------------------- |
| |
| Tests targeting the job queue should eliminate external factors (like |
| network/disk performance or hypervisor delays) as much as possible, so |
| they are designed to run in a vcluster QA environment. |
| |
| The following tests are added to the QA: |
| |
* Submit the maximum number of instance create jobs in parallel. As
  soon as a creation job succeeds, submit a removal job for this
  instance.
* Submit as many instance create jobs as there are nodes in the
  cluster in parallel (for non-redundant instances). Removal jobs
  as above.
* For the maximum number of instances in the cluster, submit modify
  jobs (modify hypervisor and backend parameters) in parallel.
* For the maximum number of instances in the cluster, submit stop,
  start, reboot and reinstall jobs in parallel.
* For the maximum number of instances in the cluster, submit multiple
  list and info jobs in parallel.
* For the maximum number of instances in the cluster, submit move
  jobs in parallel. While the move operations are running, get
  instance information using info jobs. Those jobs are required to
  return within a reasonably low timeout.
* For the maximum number of instances in the cluster, submit add-,
  remove- and list-tags jobs.
* Submit 200 ``gnt-debug delay`` jobs with a delay of 0.1 seconds. To
  speed up submission, perform multiple job submissions in parallel.
  Verify that submitting jobs doesn't significantly slow down during
  the process. Verify that querying cluster information over CLI and
  RAPI succeeds in a timely fashion with the delay jobs
  running/queued. A sketch of this test follows the list.
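
The delay-job test could look roughly like the following sketch. This
is hypothetical helper code, not the actual QA implementation; the 2x
degradation threshold and the query timeout are assumptions::

  import subprocess
  import time
  from concurrent.futures import ThreadPoolExecutor

  NUM_JOBS = 200

  def timed_submit(_):
    """Submit one 0.1s delay job, return the submission latency."""
    start = time.time()
    subprocess.check_output(["gnt-debug", "delay", "--submit", "0.1"])
    return time.time() - start

  # Submit from several threads to speed up queueing the 200 jobs.
  with ThreadPoolExecutor(max_workers=10) as pool:
    latencies = list(pool.map(timed_submit, range(NUM_JOBS)))

  # Submission must not slow down significantly as the queue fills.
  quarter = NUM_JOBS // 4
  early = sum(latencies[:quarter]) / quarter
  late = sum(latencies[-quarter:]) / quarter
  assert late < 2 * early, "job submission slowed down over time"

  # Cluster queries must still respond promptly over the CLI; a
  # similar check would be done over RAPI.
  start = time.time()
  subprocess.check_call(["gnt-cluster", "info"])
  assert time.time() - start < 10.0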
| |
| Parallel job execution performance |
| ---------------------------------- |
| |
Tests targeting the performance of parallel execution of "real" jobs
in close-to-production clusters should actually perform all
operations, such as creating disks and starting instances. This way,
real-world locking or waiting issues can be reproduced. Performing
all those operations does require quite some time though, so only a
smaller number of instances and parallel jobs can be tested
realistically.
| |
| The following tests are added to the QA: |
| |
* Submitting twice as many instance creation requests as there are
  nodes in the cluster, using DRBD as the disk template.
  The job parameters are chosen according to best practice for
  parallel instance creation, without running the risk of instance
  creations failing due to too many parallel attempts.
  As soon as a creation job succeeds, submit a removal job for this
  instance (see the sketch after this list).
* Submitting twice as many instance creation requests as there are
  nodes in the cluster, using plain as the disk template. As soon as
  a creation job succeeds, submit a removal job for this instance.
  This test can make better use of parallelism because only one node
  must be locked for an instance creation.
* Create an instance using DRBD. Fail it over, migrate it, change
  its secondary node, reboot it and reinstall it, while creating an
  additional instance in parallel with each of those operations.
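
To make the creation/removal pattern above more concrete, here is a
rough sketch. The command-line parameters, instance names and worker
pool are illustrative assumptions; the actual QA code would derive
them from the cluster configuration::

  import subprocess
  from concurrent.futures import ThreadPoolExecutor

  def create_and_remove(name):
    # Placeholder creation parameters; the design calls for
    # best-practice parameters for parallel creation.
    subprocess.check_call(["gnt-instance", "add", "-t", "drbd",
                           "-o", "debootstrap", "-s", "1G",
                           "-I", "hail", name])
    # As soon as the creation succeeds, remove the instance again
    # (-f is assumed here to skip the confirmation prompt).
    subprocess.check_call(["gnt-instance", "remove", "-f", name])

  node_count = 3  # the QA code would query the real node count
  names = ["perf-inst%d" % i for i in range(2 * node_count)]
  with ThreadPoolExecutor(max_workers=len(names)) as pool:
    # Twice as many creations as nodes, running in parallel.
    list(pool.map(create_and_remove, names))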
| |
| Future work |
| =========== |
| |
Based on the results of the tests listed above, additional tests can
be added to cover more real-world use cases. Also, specially crafted
performance tests modeling workloads requested by users can be added.
| |
Additionally, the correlation between job submission time and job
queue size could be analyzed. To that end, a snapshot of the job
queue could be taken before each job submission, so that the
submission time can be related to the jobs already in the queue.
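
One possible way to gather these measurements is sketched below,
assuming that the length of ``gnt-job list`` output is an adequate
snapshot of the queue::

  import subprocess
  import time

  samples = []  # (queue size, submission time in seconds) pairs

  for _ in range(50):
    # Snapshot the queue just before submitting.
    queue = subprocess.check_output(
      ["gnt-job", "list", "--no-headers"], universal_newlines=True)
    queue_size = len(queue.splitlines())
    start = time.time()
    subprocess.check_output(["gnt-debug", "delay", "--submit", "0.1"])
    samples.append((queue_size, time.time() - start))

  # "samples" can then be examined for a correlation between queue
  # size and submission latency, e.g. via a simple linear regression.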
| |
| .. vim: set textwidth=72 : |
| .. Local Variables: |
| .. mode: rst |
| .. fill-column: 72 |
| .. End: |