Retry Policy
The Retry Policy feature in Exosphere handles transient failures in your workflow nodes. When a node execution fails, the retry policy determines whether and when to retry it, based on a configurable strategy.
Overview
Retry policies are configured at the graph level and apply to all nodes within that graph. When a node fails with an error, the state manager automatically creates a retry state with a calculated delay before the next execution attempt.
Configuration
Retry policies are defined in your graph template configuration:
{
  "secrets": {
    "api_key": "your-api-key"
  },
  "nodes": [
    {
      "node_name": "MyNode",
      "namespace": "MyProject",
      "identifier": "my_node",
      "inputs": {
        "data": "initial"
      },
      "next_nodes": []
    }
  ],
  "retry_policy": {
    "max_retries": 3,
    "strategy": "EXPONENTIAL",
    "backoff_factor": 2000,
    "exponent": 2,
    "max_delay": 3600000
  }
}
Parameters
max_retries
- Type: int
- Default: 3
- Description: The maximum number of retry attempts before giving up
- Constraints: Must be >= 0
strategy
- Type: string
- Default: "EXPONENTIAL"
- Description: The retry strategy to use for calculating delays
- Options: See Retry Strategies below
backoff_factor
- Type: int (milliseconds)
- Default: 2000 (2 seconds)
- Description: The base delay factor in milliseconds
- Constraints: Must be > 0
exponent
- Type: int
- Default: 2
- Description: The exponent used for exponential strategies
- Constraints: Must be > 0
max_delay
- Type: int | null (milliseconds)
- Default: null (no maximum delay)
- Description: The maximum delay in milliseconds that any retry attempt can have. When set, all calculated delays are capped at this value using the _cap function
- Constraints: Must be > 0 when not null
- Example: 3600000 (1 hour) would cap all delays to a maximum of 1 hour
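The constraints above can be checked mechanically. The following is a minimal sketch using only the standard library. The `RetryPolicySketch` class is a hypothetical stand-in for illustration and is not Exosphere's actual model; it only mirrors the documented parameter names, defaults, and constraints.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RetryPolicySketch:
    """Hypothetical stand-in mirroring the documented parameters and defaults."""

    max_retries: int = 3
    strategy: str = "EXPONENTIAL"
    backoff_factor: int = 2000        # milliseconds
    exponent: int = 2
    max_delay: Optional[int] = None   # milliseconds; None means no cap

    def __post_init__(self) -> None:
        # Enforce the constraints listed in the parameter reference.
        if self.max_retries < 0:
            raise ValueError("max_retries must be >= 0")
        if self.backoff_factor <= 0:
            raise ValueError("backoff_factor must be > 0")
        if self.exponent <= 0:
            raise ValueError("exponent must be > 0")
        if self.max_delay is not None and self.max_delay <= 0:
            raise ValueError("max_delay must be > 0 when not null")
```

Constructing `RetryPolicySketch(backoff_factor=0)` raises a `ValueError`, while `RetryPolicySketch()` yields the documented defaults.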
Retry Strategies
Exosphere supports three main categories of retry strategies, each with jitter variants to prevent thundering herd problems.
Exponential Strategies
Exponential strategies increase the delay exponentially with each retry attempt.
EXPONENTIAL
Standard exponential backoff without jitter.
Formula: backoff_factor * (exponent ^ (retry_count - 1))
Example:
- Retry 1: 2000ms (2 seconds)
- Retry 2: 4000ms (4 seconds)
- Retry 3: 8000ms (8 seconds)
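As a quick check of the formula, here is a small sketch that reproduces the example values. The function name `exponential_delay` is made up for illustration, and the defaults are the documented ones (`backoff_factor` of 2000 and `exponent` of 2).

```python
def exponential_delay(retry_count: int, backoff_factor: int = 2000, exponent: int = 2) -> int:
    """Delay in ms for a 1-based retry attempt, per the EXPONENTIAL formula."""
    return backoff_factor * exponent ** (retry_count - 1)


for attempt in range(1, 4):
    print(f"Retry {attempt}: {exponential_delay(attempt)}ms")
# Retry 1: 2000ms
# Retry 2: 4000ms
# Retry 3: 8000ms
```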
EXPONENTIAL_FULL_JITTER
Exponential backoff with full jitter (random delay between 0 and calculated delay).
Formula: random(0, backoff_factor * (exponent ^ (retry_count - 1)))
Note: random(a, b) denotes a uniform random draw over the inclusive range [a, b].
Example:
- Retry 1: 0-2000ms (random)
- Retry 2: 0-4000ms (random)
- Retry 3: 0-8000ms (random)
EXPONENTIAL_EQUAL_JITTER
Exponential backoff with equal jitter (random delay between half the calculated delay and the full calculated delay).
Formula: (backoff_factor * (exponent ^ (retry_count - 1))) / 2 + random(0, (backoff_factor * (exponent ^ (retry_count - 1))) / 2)
Note: random(a, b) denotes a uniform random draw over the inclusive range [a, b].
Example:
- Retry 1: 1000-2000ms (random)
- Retry 2: 2000-4000ms (random)
- Retry 3: 4000-8000ms (random)
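The two jitter variants can be sketched as small helpers applied to an already calculated delay. This is an illustration, not Exosphere's implementation: the helper names are hypothetical, and the use of integer division and `random.randint` (which is inclusive on both ends) is an assumption made here to match the inclusive range described in the note.

```python
import random


def full_jitter(delay: int) -> int:
    """Uniform draw from the inclusive range [0, delay]."""
    return random.randint(0, delay)


def equal_jitter(delay: int) -> int:
    """Keep half the delay, then add a uniform draw from [0, delay // 2]."""
    half = delay // 2  # assumption: integer milliseconds, rounded down
    return half + random.randint(0, half)


# Retry 3 with backoff_factor=2000 and exponent=2 gives a calculated delay of 8000 ms.
calculated = 2000 * 2 ** (3 - 1)
samples_full = [full_jitter(calculated) for _ in range(1000)]
samples_equal = [equal_jitter(calculated) for _ in range(1000)]

print(min(samples_full) >= 0, max(samples_full) <= 8000)
print(min(samples_equal) >= 4000, max(samples_equal) <= 8000)
```

Full jitter spreads retries over the whole interval, while equal jitter guarantees a minimum wait of half the calculated delay.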
Linear Strategies
Linear strategies increase the delay linearly with each retry attempt.
LINEAR
Standard linear backoff without jitter.
Formula: backoff_factor * retry_count
Example:
- Retry 1: 2000ms (2 seconds)
- Retry 2: 4000ms (4 seconds)
- Retry 3: 6000ms (6 seconds)
LINEAR_FULL_JITTER
Linear backoff with full jitter.
Formula: random(0, backoff_factor * retry_count)
Note: random(a, b) denotes a uniform random draw over the inclusive range [a, b].
Example:
- Retry 1: 0-2000ms (random)
- Retry 2: 0-4000ms (random)
- Retry 3: 0-6000ms (random)
LINEAR_EQUAL_JITTER
Linear backoff with equal jitter.
Formula: (backoff_factor * retry_count) / 2 + random(0, (backoff_factor * retry_count) / 2)
Note: random(a, b) denotes a uniform random draw over the inclusive range [a, b].
Example:
- Retry 1: 1000-2000ms (random)
- Retry 2: 2000-4000ms (random)
- Retry 3: 3000-6000ms (random)
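The ranges in the LINEAR_EQUAL_JITTER example follow directly from the formula: the lower bound is half the linear delay and the upper bound is the full linear delay. A minimal sketch, with a hypothetical function name and integer division assumed:

```python
from typing import Tuple


def linear_equal_jitter_bounds(retry_count: int, backoff_factor: int = 2000) -> Tuple[int, int]:
    """Return (lowest, highest) possible delay in ms for a 1-based retry attempt."""
    delay = backoff_factor * retry_count  # LINEAR base delay
    half = delay // 2
    return half, half + half              # random(0, half) adds between 0 and half


for attempt in range(1, 4):
    low, high = linear_equal_jitter_bounds(attempt)
    print(f"Retry {attempt}: {low}-{high}ms")
# Retry 1: 1000-2000ms
# Retry 2: 2000-4000ms
# Retry 3: 3000-6000ms
```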
Fixed Strategies
Fixed strategies use a constant delay for all retry attempts.
FIXED
Standard fixed delay without jitter.
Formula: backoff_factor
Example:
- Retry 1: 2000ms (2 seconds)
- Retry 2: 2000ms (2 seconds)
- Retry 3: 2000ms (2 seconds)
FIXED_FULL_JITTER
Fixed delay with full jitter.
Formula: random(0, backoff_factor)
Note: random(a, b) denotes a uniform random draw over the inclusive range [a, b].
Example:
- Retry 1: 0-2000ms (random)
- Retry 2: 0-2000ms (random)
- Retry 3: 0-2000ms (random)
FIXED_EQUAL_JITTER
Fixed delay with equal jitter.
Formula: backoff_factor / 2 + random(0, backoff_factor / 2)
Note: random(a, b) denotes a uniform random draw over the inclusive range [a, b].
Example:
- Retry 1: 1000-2000ms (random)
- Retry 2: 1000-2000ms (random)
- Retry 3: 1000-2000ms (random)
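Taken together, the nine strategies are three base formulas combined with three jitter modes. The sketch below shows one way to express that as a single function. It is an illustration under stated assumptions, not the state manager's code: `compute_delay` and `VALID_STRATEGIES` are hypothetical names, delays are integer milliseconds, and capping via `max_delay` is left out.

```python
import random

# The nine documented strategy names, built from family and jitter suffix.
VALID_STRATEGIES = {
    f"{family}{suffix}"
    for family in ("EXPONENTIAL", "LINEAR", "FIXED")
    for suffix in ("", "_FULL_JITTER", "_EQUAL_JITTER")
}


def compute_delay(strategy: str, retry_count: int,
                  backoff_factor: int = 2000, exponent: int = 2) -> int:
    """Return a delay in ms for a 1-based retry_count under the named strategy."""
    if strategy not in VALID_STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy}")

    # Step 1: base delay from the strategy family.
    if strategy.startswith("EXPONENTIAL"):
        delay = backoff_factor * exponent ** (retry_count - 1)
    elif strategy.startswith("LINEAR"):
        delay = backoff_factor * retry_count
    else:  # FIXED
        delay = backoff_factor

    # Step 2: optional jitter on top of the base delay.
    if strategy.endswith("_FULL_JITTER"):
        return random.randint(0, delay)
    if strategy.endswith("_EQUAL_JITTER"):
        half = delay // 2
        return half + random.randint(0, half)
    return delay


print(compute_delay("EXPONENTIAL", 3))  # 8000
print(compute_delay("LINEAR", 3))       # 6000
print(compute_delay("FIXED", 3))        # 2000
```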
Delay Capping
The retry policy includes a built-in delay capping mechanism through the _cap function and max_delay parameter. This ensures that retry delays never exceed a specified maximum value, even with aggressive exponential backoff strategies.
How Delay Capping Works
The _cap function is applied to all calculated delays:
def _cap(value: int) -> int:
    # self is assumed to be the retry policy object from the enclosing method
    if self.max_delay is not None:
        return min(value, self.max_delay)
    return value
Behavior:
- If max_delay is set, all calculated delays are capped at this value
- If max_delay is null (default), no capping is applied
- The capping is applied after all strategy calculations.
Example with Delay Capping
Consider an exponential strategy with backoff_factor: 2000, exponent: 2, and max_delay: 10000:
With capping:
- Retry 1: 2000ms
- Retry 2: 4000ms
- Retry 3: 8000ms
- Retry 4: 10000ms (calculated as 16000ms, capped at max_delay)
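The capped schedule can be reproduced with a short sketch that prints the calculated and the capped delay side by side. The function name is hypothetical; the logic is the exponential formula followed by the same `min` check that `_cap` performs.

```python
from typing import Optional


def capped_exponential_delay(retry_count: int, backoff_factor: int = 2000,
                             exponent: int = 2, max_delay: Optional[int] = 10000) -> int:
    """Exponential delay in ms, capped at max_delay when max_delay is not None."""
    delay = backoff_factor * exponent ** (retry_count - 1)
    return delay if max_delay is None else min(delay, max_delay)


for attempt in range(1, 5):
    calculated = 2000 * 2 ** (attempt - 1)
    print(f"Retry {attempt}: calculated {calculated}ms, applied {capped_exponential_delay(attempt)}ms")
# Retry 1: calculated 2000ms, applied 2000ms
# Retry 2: calculated 4000ms, applied 4000ms
# Retry 3: calculated 8000ms, applied 8000ms
# Retry 4: calculated 16000ms, applied 10000ms
```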
When to Use Delay Capping
- Long-running workflows: Prevent excessive delays that could impact overall workflow completion time
- User-facing applications: Ensure retries don't create unacceptable wait times
- Resource management: Control resource consumption by limiting retry delays
- Predictable behavior: Create more predictable retry patterns for monitoring and alerting
Usage Examples
Basic Exponential Retry
{
  "retry_policy": {
    "max_retries": 3,
    "strategy": "EXPONENTIAL",
    "backoff_factor": 1000,
    "exponent": 2
  }
}
Aggressive Retry with Jitter
{
  "retry_policy": {
    "max_retries": 5,
    "strategy": "EXPONENTIAL_FULL_JITTER",
    "backoff_factor": 500,
    "exponent": 3
  }
}
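To get a feel for how quickly an exponent of 3 grows, the sketch below computes the upper bound of each retry delay for this configuration and the worst-case total wait. With full jitter, each actual delay is a random value between 0 and the listed bound. The function name is hypothetical, and since this configuration sets no max_delay, no capping is modeled.

```python
from typing import List


def worst_case_delays(max_retries: int, backoff_factor: int, exponent: int) -> List[int]:
    """Upper bound in ms of each retry delay for an exponential strategy."""
    return [backoff_factor * exponent ** (n - 1) for n in range(1, max_retries + 1)]


delays = worst_case_delays(max_retries=5, backoff_factor=500, exponent=3)
print(delays)       # [500, 1500, 4500, 13500, 40500]
print(sum(delays))  # 60500
```

In the worst case the five retries add roughly a minute of waiting in total, most of it before the last attempt.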
Conservative Linear Retry
Fixed Retry for Rate Limiting
Exponential Retry with Delay Capping
{
  "retry_policy": {
    "max_retries": 5,
    "strategy": "EXPONENTIAL",
    "backoff_factor": 2000,
    "exponent": 2,
    "max_delay": 30000
  }
}
Conservative Retry with Maximum Delay
{
  "retry_policy": {
    "max_retries": 3,
    "strategy": "EXPONENTIAL_FULL_JITTER",
    "backoff_factor": 1000,
    "exponent": 3,
    "max_delay": 60000
  }
}
When Retries Are Triggered
Retries are automatically triggered when:
- A node execution fails with an error
- The current retry count is less than max_retries
- The state status is QUEUED
The retry mechanism:
- Creates a new state with retry_count incremented by 1
- Sets enqueue_after to the current time plus the calculated delay
- Sets the original state status to ERRORED with the error message
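The conditions and steps above can be sketched as follows. This is a simplified model for illustration and not the state manager's code. `StateSketch` and `maybe_retry` are hypothetical names, timestamps are assumed to be epoch milliseconds, and the status given to the new state ("CREATED") is an assumption, since this page does not name it. The sketch also does not model what happens when the conditions are not met.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class StateSketch:
    status: str
    retry_count: int = 0
    enqueue_after: int = 0        # epoch milliseconds (assumption)
    error: Optional[str] = None


def maybe_retry(state: StateSketch, error: str, max_retries: int,
                delay_ms: int, now_ms: Optional[int] = None) -> Optional[StateSketch]:
    """Return the new retry state, or None if the retry conditions are not met."""
    if state.status != "QUEUED" or state.retry_count >= max_retries:
        return None
    if now_ms is None:
        now_ms = int(time.time() * 1000)

    # New state: retry_count incremented by 1, scheduled after the calculated delay.
    retry_state = StateSketch(
        status="CREATED",         # assumption: the new state's status is not documented here
        retry_count=state.retry_count + 1,
        enqueue_after=now_ms + delay_ms,
    )

    # The original state is marked ERRORED and keeps the error message.
    state.status = "ERRORED"
    state.error = error
    return retry_state
```

For example, a QUEUED state with `retry_count=1`, a 4000 ms delay, and `now_ms=1000` produces a new state with `retry_count=2` and `enqueue_after=5000`.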
Best Practices
Choose the Right Strategy
- EXPONENTIAL: Best for most transient failures (network issues, temporary service unavailability)
- LINEAR: Good for predictable, consistent delays
- FIXED: Useful for rate limiting scenarios
Use Jitter for High Concurrency
- FULL_JITTER: Best for high concurrency to prevent thundering herd
- EQUAL_JITTER: Good balance between predictability and randomization
- No Jitter: Use only when you need deterministic behavior
Set Appropriate Limits
- max_retries: Consider the nature of your failures and downstream dependencies
- backoff_factor: Balance between responsiveness and resource usage
- exponent: Higher values create more aggressive backoff
- max_delay: Set a reasonable maximum delay to prevent excessive wait times, especially for exponential strategies
Monitor Retry Patterns
- Track retry counts in your monitoring system
- Set up alerts for graphs with high retry rates
- Analyze retry patterns to identify systemic issues
Limitations
- Retry policies apply to all nodes in a graph uniformly
- Individual node-level retry policies are not supported
- Retry delays are calculated in milliseconds
- Maximum delay can be capped using the max_delay parameter (recommended for long-running workflows)
Error Handling
If a retry policy configuration is invalid:
- The graph template validation will fail
- An error will be returned during graph creation
- The graph will not be saved until the configuration is corrected
Model-Based Configuration
With the new Exosphere Python SDK, you can define retry policies using Pydantic models for better type safety and validation:
from exospherehost import StateManager, GraphNodeModel, RetryPolicyModel, RetryStrategyEnum

# Define retry policy using model
retry_policy = RetryPolicyModel(
    max_retries=5,
    strategy=RetryStrategyEnum.EXPONENTIAL_FULL_JITTER,
    backoff_factor=1000,
    exponent=2,
    max_delay=30000
)

async def create_graph_with_retry_policy():
    state_manager = StateManager(namespace="MyProject")
    graph_nodes = [
        GraphNodeModel(
            node_name="ResilientNode",
            namespace="MyProject",
            identifier="resilient_node",
            inputs={"data": "initial"},
            next_nodes=[]
        )
    ]
    # Apply retry policy to the entire graph
    result = await state_manager.upsert_graph(
        graph_name="resilient-workflow",
        graph_nodes=graph_nodes,
        secrets={"api_key": "your-key"},
        retry_policy=retry_policy
    )
    return result
Benefits of Model-Based Approach:
- Type Safety: Pydantic validation catches configuration errors early
- IDE Support: Better autocomplete and error detection
- Documentation: Built-in field descriptions and validation rules
- Consistency: Standardized parameter names and types
Available Retry Strategies:
- RetryStrategyEnum.EXPONENTIAL: Pure exponential backoff
- RetryStrategyEnum.EXPONENTIAL_FULL_JITTER: Exponential with full randomization
- RetryStrategyEnum.EXPONENTIAL_EQUAL_JITTER: Exponential with 50% randomization
- RetryStrategyEnum.LINEAR: Linear backoff
- RetryStrategyEnum.LINEAR_FULL_JITTER: Linear with full randomization
- RetryStrategyEnum.LINEAR_EQUAL_JITTER: Linear with 50% randomization
- RetryStrategyEnum.FIXED: Fixed delay
- RetryStrategyEnum.FIXED_FULL_JITTER: Fixed with full randomization
- RetryStrategyEnum.FIXED_EQUAL_JITTER: Fixed with 50% randomization
Example Configurations:
# High-concurrency scenario (recommended)
retry_policy = RetryPolicyModel(
    max_retries=3,
    strategy=RetryStrategyEnum.EXPONENTIAL_FULL_JITTER,
    backoff_factor=1000,
    exponent=2,
    max_delay=30000
)

# Predictable timing requirements
retry_policy = RetryPolicyModel(
    max_retries=5,
    strategy=RetryStrategyEnum.LINEAR,
    backoff_factor=2000,
    exponent=1  # Not used for LINEAR
)

# Rate limiting scenarios
retry_policy = RetryPolicyModel(
    max_retries=10,
    strategy=RetryStrategyEnum.FIXED,
    backoff_factor=5000,  # 5 second fixed delay
    max_delay=5000
)
Integration with Signals
Retry policies work alongside Exosphere's signaling system:
- Nodes can still raise PruneSignal to stop retries immediately
- Nodes can raise ReQueueAfterSignal to re-queue after some time. This will not mark nodes as failures.
- When a node is re-queued using ReQueueAfterSignal, the retry_count is not incremented. The existing count is carried over to the new state.
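The difference in bookkeeping between a failure-driven retry and a signal-driven re-queue can be illustrated with a small model. This sketch does not use the Exosphere SDK: `StateSketch`, `after_failure`, and `after_requeue` are hypothetical names, and timestamps are assumed to be epoch milliseconds.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class StateSketch:
    retry_count: int
    enqueue_after: int  # epoch milliseconds (assumption)


def after_failure(state: StateSketch, delay_ms: int, now_ms: int) -> StateSketch:
    """A failed execution consumes one retry: retry_count goes up by 1."""
    return replace(state, retry_count=state.retry_count + 1, enqueue_after=now_ms + delay_ms)


def after_requeue(state: StateSketch, requeue_ms: int, now_ms: int) -> StateSketch:
    """A re-queue is not a failure: retry_count is carried over unchanged."""
    return replace(state, enqueue_after=now_ms + requeue_ms)


state = StateSketch(retry_count=1, enqueue_after=0)
print(after_failure(state, delay_ms=4000, now_ms=1000).retry_count)    # 2
print(after_requeue(state, requeue_ms=60000, now_ms=1000).retry_count)  # 1
```

Because re-queues leave the count untouched, a node that re-queues itself many times still has its full remaining retry budget for real failures.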