Zetaris Cluster Management

Understand how your Zetaris Cluster is designed, and describes its fault tolerant nature

When deploying a clustered Zetaris deployment, whether it be on-premises or in the cloud, your Zetaris cluster will have differing fault tolerance behaviours based on the node or executor that has been compromised.

simple_controller_worker_executor

Zetaris cluster

In your base Zetaris cluster you will have:

Controller:

  • Only one active/primary controller per cluster
  • Coordinates, at query run-time, the allocation of jobs across executors

Worker:

  • Indefinite amount
  • Houses the compute resources that executors are able to utilise

Executor:

  • Can deploy multiple executors per worker. Usually depends on the number of cores/threads accessible in the worker node.
  • Processes jobs given by the controller
  • By default, 1 executor will run 1 thread, though this can be modified
  • Resilient Distributed Datasets (RDD)

Fault Tolerance

In the event of a failure at the worker node:

  • In-memory data will be lost, so any executors in computation, will fail.
  • However, because lineage of partitions is logged, we recreate those lost datasets in the failed worker node, across to an active node.
  • The diagram below shows a classic query execution plan, whereby each partition is traced back to its origin, without have to write each partition to disk.
    zetaris_rdd_lineage

In the event of a failure at the controller node:

  • Cluster wide failure, as the controller is the central point of all query executions. To prevent this, we would advise configuring for High Availability.