| ============ |
| Chained jobs |
| ============ |
| |
| .. contents:: :depth: 4 |
| |
| This is a design document about the innards of Ganeti's job processing. |
| Readers are advised to study previous design documents on the topic: |
| |
| - :ref:`Original job queue <jqueue-original-design>` |
| - :ref:`Job priorities <jqueue-job-priority-design>` |
| - :doc:`LU-generated jobs <design-lu-generated-jobs>` |
| |
| |
| Current state and shortcomings |
| ============================== |
| |
| Ever since the introduction of the job queue with Ganeti 2.0 there have |
| been situations where we wanted to run several jobs in a specific order. |
Due to the job queue's current design, such a guarantee cannot be
given. Jobs are run according to their priority, their ability to
acquire all necessary locks, and other factors.
| |
| One way to work around this limitation is to do some kind of job |
| grouping in the client code. Once all jobs of a group have finished, the |
| next group is submitted and waited for. There are different kinds of |
| clients for Ganeti, some of which don't share code (e.g. Python clients |
| vs. htools). This design proposes a solution which would be implemented |
| as part of the job queue in the master daemon. |
| |
| |
| Proposed changes |
| ================ |
| |
| With the implementation of :ref:`job priorities |
<jqueue-job-priority-design>` the processing code was re-architected
| and became a lot more versatile. It now returns jobs to the queue in |
| case the locks for an opcode can't be acquired, allowing other |
| jobs/opcodes to be run in the meantime. |
| |
| The proposal is to add a new, optional property to opcodes to define |
| dependencies on other jobs. Job X could define opcodes with a dependency |
| on the success of job Y and would only be run once job Y is finished. If |
| there's a dependency on success and job Y failed, job X would fail as |
| well. Since such dependencies would use job IDs, the jobs still need to |
| be submitted in the right order. |
| |
| .. pyassert:: |
| |
| # Update description below if finalized job status change |
| constants.JOBS_FINALIZED == frozenset([ |
| constants.JOB_STATUS_CANCELED, |
| constants.JOB_STATUS_SUCCESS, |
| constants.JOB_STATUS_ERROR, |
| ]) |
| |
The new attribute's value would be a list of two-element tuples. Each
tuple contains a job ID and a list of requested statuses for the job
depended upon. Only final statuses are accepted
(:pyeval:`utils.CommaJoin(constants.JOBS_FINALIZED)`). An empty list is
equivalent to specifying all final statuses (except
:pyeval:`constants.JOB_STATUS_CANCELED`, which is treated specially).
| An opcode runs only once all its dependency requirements have been |
| fulfilled. |
| |
| Any job referring to a cancelled job is also cancelled unless it |
| explicitly lists :pyeval:`constants.JOB_STATUS_CANCELED` as a requested |
| status. |
| |
If a referenced job cannot be found in the normal queue or the
archive, referring jobs fail, as the status of the referenced job
can't be determined.
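
These rules can be summarized in a short sketch. The function below
and its return values are illustrative only, not the actual queue
implementation::

  from ganeti import constants

  def _EvaluateDependency(dep_status, requested):
    """Decides an opcode's fate based on a job it depends on.

    @param dep_status: status of the referenced job, or C{None} if the
      job can be found neither in the queue nor in the archive
    @param requested: list of requested final statuses; an empty list
      means all final statuses except "canceled"
    @return: one of "wait", "continue", "cancel" or "fail"

    """
    if dep_status is None:
      # Status of referenced job can't be determined
      return "fail"

    if dep_status not in constants.JOBS_FINALIZED:
      # Dependency hasn't reached a final status yet
      return "wait"

    if not requested:
      requested = (constants.JOBS_FINALIZED -
                   frozenset([constants.JOB_STATUS_CANCELED]))

    if dep_status in requested:
      return "continue"

    if dep_status == constants.JOB_STATUS_CANCELED:
      # Cancellations propagate unless explicitly requested
      return "cancel"

    return "fail"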
| |
| With this change, clients can submit all wanted jobs in the right order |
| and proceed to wait for changes on all these jobs (see |
| ``cli.JobExecutor``). The master daemon will take care of executing them |
| in the right order, while still presenting the client with a simple |
| interface. |
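
A Python client could then look roughly like the following sketch. It
is based on ``cli.JobExecutor``, but the opcode arguments and the use
of the proposed ``depends`` attribute are illustrative only::

  from ganeti import cli
  from ganeti import opcodes

  je = cli.JobExecutor()

  # Each QueueJob call adds one job; most opcode arguments are omitted
  # for brevity
  je.QueueJob("migrate",
              opcodes.OpInstanceMigrate(instance_name="inst1"))

  # Hypothetical use of the proposed attribute: wait for the success
  # of the job submitted directly before this one (relative ID -1)
  je.QueueJob("setparams",
              opcodes.OpNodeSetParams(node_name="node1", offline=True,
                                      depends=[[-1, ["success"]]]))

  # Submit all queued jobs in one go and wait for their results
  for (success, result) in je.GetResults():
    print("%s: %s" % (success, result))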
| |
| Clients using the ``SubmitManyJobs`` interface can use relative job IDs |
| (negative integers) to refer to jobs in the same submission. |
| |
| .. highlight:: javascript |
| |
| Example data structures:: |
| |
  # First job
  {
    "job_id": "6151",
    "ops": [
      { "OP_ID": "OP_INSTANCE_REPLACE_DISKS", ..., },
      { "OP_ID": "OP_INSTANCE_FAILOVER", ..., },
      ],
  }

  # Second job, runs in parallel with first job
  {
    "job_id": "7687",
    "ops": [
      { "OP_ID": "OP_INSTANCE_MIGRATE", ..., },
      ],
  }

  # Third job, depending on success of previous jobs
  {
    "job_id": "9218",
    "ops": [
      { "OP_ID": "OP_NODE_SET_PARAMS",
        "depends": [
          [6151, ["success"]],
          [7687, ["success"]],
          ],
        "offline": True, },
      ],
  }
| |
| |
| Implementation details |
| ---------------------- |
| |
| Status while waiting for dependencies |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| Jobs waiting for dependencies are certainly not in the queue anymore and |
| therefore need to change their status from "queued". While waiting for |
| opcode locks the job is in the "waiting" status (the constant is named |
``JOB_STATUS_WAITLOCK``, but the actual value is ``waiting``). There
are the following possibilities:
| |
| #. Introduce a new status, e.g. "waitdeps". |
| |
| Pro: |
| |
| - Clients know for sure a job is waiting for dependencies, not locks |
| |
| Con: |
| |
| - Code and tests would have to be updated/extended for the new status |
| - List of possible state transitions certainly wouldn't get simpler |
| - Breaks backwards compatibility, older clients might get confused |
| |
| #. Use existing "waiting" status. |
| |
| Pro: |
| |
| - No client changes necessary, less code churn (note that there are |
| clients which don't live in Ganeti core) |
| - Clients don't need to know the difference between waiting for a job |
| and waiting for a lock; it doesn't make a difference |
| - Fewer state transitions (see commit ``5fd6b69479c0``, which removed |
| many state transitions and disk writes) |
| |
| Con: |
| |
| - Not immediately visible what a job is waiting for, but it's the |
| same issue with locks; this is the reason why the lock monitor |
| (``gnt-debug locks``) was introduced; job dependencies can be shown |
| as "locks" in the monitor |
| |
| Based on these arguments, the proposal is to do the following: |
| |
| - Rename ``JOB_STATUS_WAITLOCK`` constant to ``JOB_STATUS_WAITING`` to |
reflect its actual meaning: the job is waiting for something
| - While waiting for dependencies and locks, jobs are in the "waiting" |
| status |
| - Export dependency information in lock monitor; example output:: |
| |
| Name Mode Owner Pending |
| job/27491 - - success:job/34709,job/21459 |
| job/21459 - - success,error:job/14513 |
| |
| |
| Cost of deserialization |
| ~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| To determine the status of a dependency job the job queue must have |
| access to its data structure. Other queue operations already do this, |
| e.g. archiving, watching a job's progress and querying jobs. |
| |
| Initially (Ganeti 2.0/2.1) the job queue shared the job objects |
| in memory and protected them using locks. Ganeti 2.2 (see :doc:`design |
| document <design-2.2>`) changed the queue to read and deserialize jobs |
| from disk. This significantly reduced locking and code complexity. |
| Nowadays inotify is used to wait for changes on job files when watching |
| a job's progress. |
| |
| Reading from disk and deserializing certainly has some cost associated |
| with it, but it's a significantly simpler architecture than |
| synchronizing in memory with locks. At the stage where dependencies are |
| evaluated the queue lock is held in shared mode, so different workers |
| can read at the same time (deliberately ignoring CPython's interpreter |
| lock). |
| |
| It is expected that the majority of executed jobs won't use |
| dependencies and therefore won't be affected. |
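
A rough sketch of such a disk-based status lookup follows. The paths
and the on-disk format are simplified for illustration; in particular,
the real queue derives a job's overall status from its opcodes rather
than storing it in a single field.

.. code-block:: python

  import json
  import os.path

  QUEUE_DIR = "/var/lib/ganeti/queue"

  def _GetDependencyStatus(job_id):
    """Returns the referenced job's status, or C{None} if not found."""
    for dirname in [QUEUE_DIR, os.path.join(QUEUE_DIR, "archive")]:
      path = os.path.join(dirname, "job-%s" % job_id)
      try:
        fd = open(path)
      except IOError:
        # Try the next location
        continue
      try:
        data = json.load(fd)
      finally:
        fd.close()
      # Assumed field name, see note above
      return data.get("status")

    # Neither in the queue nor in the archive; depending jobs must fail
    return None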
| |
| |
| Other discussed solutions |
| ========================= |
| |
| Job-level attribute |
| ------------------- |
| |
At first glance it might seem better to put dependencies on previous
jobs at the job level. However, it turns out that the ability to mark
only a single opcode in a job as depending on another job can be
useful as well. The code complexity in the job queue is equivalent, if
not simpler.
| |
| Since opcodes are guaranteed to run in order, clients can just define |
| the dependency on the first opcode. |
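
For example, a whole job can be held back by attaching the dependency
to its first opcode only::

  # Sketch: only the first opcode carries the dependency, yet the
  # whole job is deferred because opcodes run strictly in order
  {
    "ops": [
      { "OP_ID": "OP_INSTANCE_SHUTDOWN",
        "depends": [[12345, ["success"]]], ..., },
      { "OP_ID": "OP_INSTANCE_REMOVE", ..., },
      ],
  }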
| |
| Another reason for the choice of an opcode-level attribute is that the |
| current LUXI interface for submitting jobs is a bit restricted and would |
| need to be changed to allow the addition of job-level attributes, |
| potentially requiring changes in all LUXI clients and/or breaking |
| backwards compatibility. |
| |
| |
| Client-side logic |
| ----------------- |
| |
| There's at least one implementation of a batched job executor twisted |
| into the ``burnin`` tool's code. While certainly possible, a client-side |
| solution should be avoided due to the different clients already in use. |
| For one, the :doc:`remote API <rapi>` client shouldn't import |
non-standard modules. htools are written in Haskell and can't use
Python modules. A batched job executor contains a fair amount of
logic. Even if cleanly abstracted in a (Python) library, sharing code
between different clients is difficult, if not impossible.
| |
| |
| .. vim: set textwidth=72 : |
| .. Local Variables: |
| .. mode: rst |
| .. fill-column: 72 |
| .. End: |