NCL Protocol (#4734)

# NCL Protocol Documentation

The NCL (NATS Client Library) Protocol manages reliable bidirectional communication between compute nodes and orchestrators in the Bacalhau network. It provides ordered async message delivery, connection health monitoring, and automatic recovery from failures.

## Table of Contents

1. [Definitions & Key Concepts](#definitions--key-concepts)
2. [Architecture Overview](#architecture-overview)
3. [Message Sequencing](#message-sequencing)
4. [Connection Lifecycle](#connection-lifecycle)
5. [Message Contracts](#message-contracts)
6. [Communication Flows](#communication-flows)
7. [Component Dependencies](#component-dependencies)
8. [Configuration](#configuration)
9. [Glossary](#glossary)

## Definitions & Key Concepts

### Events and Messages

- **Event**: An immutable record of a state change in the local system
- **Message**: A communication packet sent between nodes derived from events
- **Sequence Number**: A monotonically increasing identifier for ordering events and messages

### Node Information

- **Node ID**: Unique identifier for each compute node
- **Resources**: Computational resources like CPU, Memory, GPU
- **Available Capacity**: Currently free resources on a node
- **Queue Used Capacity**: Resources allocated to queued jobs

### Connection States

- **Disconnected**: No active connection, no message processing
- **Connecting**: Attempting to establish connection
- **Connected**: Active message processing and health monitoring

Transitions between states occur based on:

- Successful/failed handshakes
- Missing heartbeats
- Network failures
- Explicit disconnection

## Architecture Overview

The protocol consists of two main planes:

### Control Plane

- Handles connection establishment and health monitoring
- Manages periodic heartbeats and node info updates
- Maintains connection state and health metrics
- Handles checkpointing for recovery

### Data Plane

- Provides reliable, ordered message delivery
- Manages event watching and dispatching
- Tracks message sequences for both sides
- Handles recovery from network failures

### NATS Subject Structure

```
bacalhau.global.compute.<nodeID>.in.msgs  - Messages to compute node
bacalhau.global.compute.<nodeID>.out.msgs - Messages from compute node
bacalhau.global.compute.<nodeID>.out.ctrl - Control messages from compute
bacalhau.global.compute.*.out.ctrl        - Global control channel
```

## Message Sequencing

### Overview

The NCL protocol integrates with a local event watcher system to decouple event processing from message delivery. Each node maintains its own ordered ledger of events that the protocol watches and selectively publishes. This decoupling provides several benefits:

- Clean separation between business logic and message transport
- Reliable local event ordering
- Simple checkpointing and recovery
- Built-in replay capabilities

### Event Flow Architecture

```
Local Event Store       NCL Protocol              Remote Node
┌──────────────┐    ┌─────────────────────┐    ┌──────────────┐
│              │    │  1. Watch Events    │    │              │
│  Ordered     │◄───┤  2. Filter Relevant │    │              │
│  Event       │    │  3. Create Messages │───►│   Receive    │
│  Ledger      │    │  4. Track Sequences │    │   Process    │
│              │    │  5. Checkpoint      │    │              │
└──────────────┘    └─────────────────────┘    └──────────────┘
```

### Key Components

1. **Event Store**
   - Maintains ordered sequence of all local events
   - Each event has unique monotonic sequence number
   - Supports seeking and replay from any position

2. **Event Watcher**
   - Watches event store for new entries
   - Filters events relevant for transport
   - Supports resuming from checkpoint

3. **Message Dispatcher**
   - Creates messages from events
   - Manages reliable delivery
   - Tracks publish acknowledgments

## Connection Lifecycle

### Initial Connection

1. **Handshake**
   - Compute node initiates connection by sending HandshakeRequest
   - Includes node info, start time, and last processed sequence number
   - Orchestrator validates request and accepts/rejects connection
   - On acceptance, orchestrator creates dedicated data plane for node

2. **Data Plane Setup**
   - Both sides establish message subscriptions
   - Create ordered publishers for reliable delivery
   - Initialize event watchers and dispatchers
   - Set up sequence tracking

### Ongoing Communication

1. **Health Monitoring**
   - Compute nodes send periodic heartbeats
   - Include current capacity and last processed sequence
   - Orchestrator tracks node health and connection state
   - Missing heartbeats trigger disconnection

2. **Node Info Updates**
   - Compute nodes send updates when configuration changes
   - Includes updated capacity, features, labels
   - Orchestrator maintains current node state

3. **Message Flow**
   - Data flows through separate control/data subjects
   - Messages include sequence numbers for ordering
   - Both sides track processed sequences
   - Failed deliveries trigger automatic recovery

## Message Contracts

### Handshake Messages

```typescript
// Request sent by compute node to initiate connection
HandshakeRequest {
    NodeInfo: models.NodeInfo
    StartTime: Time
    LastOrchestratorSeqNum: uint64
}

// Response from orchestrator
HandshakeResponse {
    Accepted: boolean
    Reason: string // Only set if not accepted
    LastComputeSeqNum: uint64
}
```

### Heartbeat Messages

```typescript
// Periodic heartbeat from compute node
HeartbeatRequest {
    NodeID: string
    AvailableCapacity: Resources
    QueueUsedCapacity: Resources
    LastOrchestratorSeqNum: uint64
}

// Acknowledgment from orchestrator
HeartbeatResponse {
    LastComputeSeqNum: uint64
}
```

### Node Info Update Messages

```typescript
// Node info update notification
UpdateNodeInfoRequest {
    NodeInfo: NodeInfo // Same structure as in HandshakeRequest
}

UpdateNodeInfoResponse {
    Accepted: boolean
    Reason: string // Only set if not accepted
}
```

## Communication Flows

### Initial Connection and Handshake

The following sequence shows the initial connection establishment between compute node and orchestrator:

```mermaid
sequenceDiagram
    participant C as Compute Node
    participant O as Orchestrator

    Note over C,O: Connection Establishment
    C->>O: HandshakeRequest(NodeInfo, StartTime, LastSeqNum)

    Note over O: Validate Node
    alt Valid Node
        O->>O: Create Data Plane
        O->>O: Setup Message Handlers
        O-->>C: HandshakeResponse(Accepted=true, LastSeqNum)

        Note over C: Setup Data Plane
        C->>C: Start Control Plane
        C->>C: Initialize Data Plane

        Note over C,O: Begin Regular Communication
        C->>O: Initial Heartbeat
        O-->>C: HeartbeatResponse
    else Invalid Node
        O-->>C: HandshakeResponse(Accepted=false, Reason)
        Note over C: Retry with backoff
    end
```

### Regular Operation Flow

The following sequence shows the ongoing communication pattern between compute node and orchestrator, including periodic health checks and configuration updates:

```mermaid
sequenceDiagram
    participant C as Compute Node
    participant O as Orchestrator

    rect rgb(200, 230, 200)
        Note over C,O: Periodic Health Monitoring
        loop Every HeartbeatInterval
            C->>O: HeartbeatRequest(NodeID, Capacity, LastSeqNum)
            O-->>C: HeartbeatResponse()
        end
    end

    rect rgb(230, 200, 200)
        Note over C,O: Node Info Updates
        C->>C: Detect Config Change
        C->>O: UpdateNodeInfoRequest(NewNodeInfo)
        O-->>C: UpdateNodeInfoResponse(Accepted)
    end

    rect rgb(200, 200, 230)
        Note over C,O: Data Plane Messages
        O->>C: Execution Messages (with SeqNum)
        C->>O: Result Messages (with SeqNum)
        Note over C,O: Both track sequence numbers
    end
```

During regular operation:

- Heartbeats occur every HeartbeatInterval (default 15s)
- Configuration changes trigger immediate updates
- Data plane messages flow continuously in both directions
- Both sides maintain sequence tracking and acknowledgments

### Failure Recovery Flow

The protocol provides comprehensive failure recovery through several mechanisms:

```mermaid
sequenceDiagram
    participant C as Compute Node
    participant O as Orchestrator

    rect rgb(240, 200, 200)
        Note over C,O: Network Failure
        C->>O: HeartbeatRequest
        x--xO: Connection Lost

        Note over C: Detect Missing Response
        C->>C: Mark Disconnected
        C->>C: Stop Data Plane

        Note over O: Detect Missing Heartbeats
        O->>O: Mark Node Disconnected
        O->>O: Cleanup Node Resources
    end

    rect rgb(200, 240, 200)
        Note over C,O: Recovery
        loop Until Connected
            Note over C: Exponential Backoff
            C->>O: HandshakeRequest(LastSeqNum)
            O-->>C: HandshakeResponse(Accepted)
        end

        Note over C,O: Resume from Last Checkpoint
        Note over C: Restart Data Plane
        Note over O: Recreate Node Resources
    end
```

#### Failure Detection

- Missing heartbeats beyond threshold
- NATS connection failures
- Message publish failures

#### Recovery Process

1. Both sides independently detect failure
2. Clean up existing resources
3. Compute node initiates reconnection
4. Resume from last checkpoint:
   - Load last checkpoint sequence
   - Resume event watching
   - Rebuild publish state
   - Resend pending messages
5. Continue normal operation

This process ensures:

- No events are lost
- Messages remain ordered
- Efficient recovery
- At-least-once delivery

## Component Dependencies

### Compute Node Components:

```
ConnectionManager
├── ControlPlane
│   ├── NodeInfoProvider
│   │   └── Monitors node state changes
│   ├── MessageHandler
│   │   └── Processes control messages
│   └── Checkpointer
│       └── Saves progress state
└── DataPlane
    ├── LogStreamServer
    │   └── Handles job output streaming
    ├── MessageHandler
    │   └── Processes execution messages
    ├── MessageCreator
    │   └── Formats outgoing messages
    └── EventStore
        └── Tracks execution events
```

### Orchestrator Components:

```
ComputeManager
├── NodeManager
│   ├── Tracks node states
│   └── Manages node lifecycle
├── MessageHandler
│   └── Processes node messages
├── MessageCreatorFactory
│   └── Creates per-node message handlers
└── DataPlane (per node)
    ├── Subscriber
    │   └── Handles incoming messages
    ├── Publisher
    │   └── Sends ordered messages
    └── Dispatcher
        └── Watches and sends events
```

## Configuration

### Connection Management

- `HeartbeatInterval`: How often compute nodes send heartbeats (default: 15s)
- `HeartbeatMissFactor`: Number of missed heartbeats before disconnection (default: 5)
- `NodeInfoUpdateInterval`: How often node info updates are checked (default: 60s)
- `RequestTimeout`: Timeout for individual requests (default: 10s)

### Recovery Settings

- `ReconnectInterval`: Base interval between reconnection attempts (default: 10s)
- `BaseRetryInterval`: Initial retry delay after failure (default: 5s)
- `MaxRetryInterval`: Maximum retry delay (default: 5m)

### Data Plane Settings

- `CheckpointInterval`: How often sequence progress is saved (default: 30s)

## Glossary

- **Checkpoint**: A saved position in the event sequence used for recovery
- **Handshake**: Initial connection protocol between compute node and orchestrator
- **Heartbeat**: Periodic health check message from compute node to orchestrator
- **Node Info**: Current state and capabilities of a compute node
- **Sequence Number**: Monotonically increasing identifier used for message ordering

## Summary by CodeRabbit

## Release Notes

- **New Features**
  - Introduced new message types for transport operations.
  - Added a `HealthTracker` for monitoring connection health.
  - Implemented `ControlPlane` and `DataPlane` for managing node operations and message flow.
  - Created a `ConnectionManager` for handling compute node connections.
  - Added a factory structure for creating `NCLMessageCreator` instances with improved event filtering.
  - Enhanced the `BuildVersionInfo` struct with a `Copy` method for duplicating instances.
  - Introduced a mock implementation for testing `ControlPlane` and `Checkpointer`.
  - Added a comprehensive README for the NCL protocol.

- **Deprecations**
  - Added deprecation notice for the `ResourceUpdateInterval` field in the `Heartbeat` struct.

- **Bug Fixes**
  - Enhanced error handling and logging in various components.

- **Refactor**
  - Updated imports and modified several method signatures to reflect new package structures.
  - Streamlined watcher and connection management processes.
  - Adjusted checkpointing behavior and logging levels for improved clarity.

- **Chores**
  - Removed legacy NCL-related components and tests, consolidating functionality under the new transport layer.
  - Added comprehensive test suites for various components to ensure robust functionality.
  - Introduced utility functions for managing NATS server in a testing environment.

---------

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
bacalhau-project · Dec 11, 2024 · 83bc0a4
1 parent f49e65a
commit 83bc0a4
Showing 81 changed files with 5,457 additions and 977 deletions.
diff --git a/.cspell/custom-dictionary.txt b/.cspell/custom-dictionary.txt
@@ -20,6 +20,7 @@ boltdb
 booga
 boxo
 bprotocol
+nclprotocol
 BRSNW
 BUCKETNAME
 buildx
@@ -441,4 +442,6 @@ tlsca
 Lenf
 traefik
 bprotocolcompute
-bprotocolorchestrator
+bprotocolorchestrator
+nclprotocolcompute
+ncltest
diff --git a/go.mod b/go.mod
@@ -21,6 +21,7 @@ require (
 	github.com/ghodss/yaml v1.0.0
 	github.com/go-playground/validator/v10 v10.16.0
 	github.com/golang-jwt/jwt v3.2.2+incompatible
+	github.com/google/go-cmp v0.6.0
 	github.com/google/uuid v1.6.0
 	github.com/gorilla/websocket v1.5.1
 	github.com/hashicorp/go-retryablehttp v0.7.7

diff --git a/pkg/compute/watchers/ncl_message_creator.go b/pkg/compute/watchers/ncl_message_creator.go
@@ -8,7 +8,7 @@ import (
 	"github.com/bacalhau-project/bacalhau/pkg/lib/watcher"
 	"github.com/bacalhau-project/bacalhau/pkg/models"
 	"github.com/bacalhau-project/bacalhau/pkg/models/messages"
-	"github.com/bacalhau-project/bacalhau/pkg/transport"
+	"github.com/bacalhau-project/bacalhau/pkg/transport/nclprotocol"
 )
 
 type NCLMessageCreator struct {
@@ -67,4 +67,4 @@ func (d *NCLMessageCreator) CreateMessage(event watcher.Event) (*envelope.Messag
 }
 
 // compile-time check that NCLMessageCreator implements dispatcher.MessageCreator
-var _ transport.MessageCreator = &NCLMessageCreator{}
+var _ nclprotocol.MessageCreator = &NCLMessageCreator{}
diff --git a/pkg/config/defaults.go b/pkg/config/defaults.go
@@ -47,9 +47,8 @@ var Default = types.Bacalhau{
 		Enabled:       false,
 		Orchestrators: []string{"nats://127.0.0.1:4222"},
 		Heartbeat: types.Heartbeat{
-			InfoUpdateInterval:     types.Minute,
-			ResourceUpdateInterval: 30 * types.Second,
-			Interval:               15 * types.Second,
+			InfoUpdateInterval: types.Minute,
+			Interval:           15 * types.Second,
 		},
 		AllocatedCapacity: types.ResourceScaler{
 			CPU:    "70%",

diff --git a/pkg/config/migrate.go b/pkg/config/migrate.go
@@ -38,9 +38,8 @@ func MigrateV1(in v1types.BacalhauConfig) (types.Bacalhau, error) {
 			}),
 			Orchestrators: in.Node.Network.Orchestrators,
 			Heartbeat: types.Heartbeat{
-				Interval:               types.Duration(in.Node.Compute.ControlPlaneSettings.HeartbeatFrequency),
-				ResourceUpdateInterval: types.Duration(in.Node.Compute.ControlPlaneSettings.ResourceUpdateFrequency),
-				InfoUpdateInterval:     types.Duration(in.Node.Compute.ControlPlaneSettings.InfoUpdateFrequency),
+				Interval:           types.Duration(in.Node.Compute.ControlPlaneSettings.HeartbeatFrequency),
+				InfoUpdateInterval: types.Duration(in.Node.Compute.ControlPlaneSettings.InfoUpdateFrequency),
 			},
 			AllowListedLocalPaths: in.Node.AllowListedLocalPaths,
 			Auth:                  types.ComputeAuth{Token: in.Node.Network.AuthSecret},

diff --git a/pkg/config/types/compute.go b/pkg/config/types/compute.go
@@ -31,7 +31,7 @@ type ComputeTLS struct {
 type Heartbeat struct {
 	// InfoUpdateInterval specifies the time between updates of non-resource information to the orchestrator.
 	InfoUpdateInterval Duration `yaml:"InfoUpdateInterval,omitempty" json:"InfoUpdateInterval,omitempty"`
-	// ResourceUpdateInterval specifies the time between updates of resource information to the orchestrator.
+	// Deprecated: use Interval instead
 	ResourceUpdateInterval Duration `yaml:"ResourceUpdateInterval,omitempty" json:"ResourceUpdateInterval,omitempty"`
 	// Interval specifies the time between heartbeat signals sent to the orchestrator.
 	Interval Duration `yaml:"Interval,omitempty" json:"Interval,omitempty"`

diff --git a/pkg/lib/validate/general.go b/pkg/lib/validate/general.go
@@ -11,8 +11,12 @@ func NotNil(value any, msg string, args ...any) error {
 
 	// Use reflection to handle cases where value is a nil pointer wrapped in an interface
 	val := reflect.ValueOf(value)
-	if val.Kind() == reflect.Ptr && val.IsNil() {
-		return createError(msg, args...)
+	switch val.Kind() {
+	case reflect.Ptr, reflect.Interface, reflect.Map, reflect.Slice, reflect.Func:
+		if val.IsNil() {
+			return createError(msg, args...)
+		}
+	default:
 	}
 	return nil
 }
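The widened kind switch matters because a nil map, slice, or func passed through an `any` parameter is a typed nil: the interface value itself is not equal to `nil`, so only reflection can detect it. The sketch below reproduces that logic in a hypothetical stand-in called `isNil`; it assumes, as the existing `NilValue` test suggests, that the real function checks `value == nil` before using reflection. A nil interface variable such as `var d Doer` arrives as an untyped nil `any`, so it is caught by the plain comparison rather than by the `reflect.Interface` case.

```go
package main

import (
	"fmt"
	"reflect"
)

// isNil is a hypothetical stand-in that mirrors the kind switch in the diff.
func isNil(value any) bool {
	if value == nil {
		// Untyped nil, including nil interface variables converted to any.
		return true
	}
	val := reflect.ValueOf(value)
	switch val.Kind() {
	case reflect.Ptr, reflect.Interface, reflect.Map, reflect.Slice, reflect.Func:
		// IsNil is only legal for these (and a few other) kinds.
		return val.IsNil()
	default:
		// Ints, strings, structs and so on can never be nil.
		return false
	}
}

func main() {
	var m map[string]int
	var s []int
	var f func()

	// A typed nil stored in an interface is not equal to nil.
	fmt.Println(any(m) == nil, isNil(m)) // false true
	fmt.Println(any(s) == nil, isNil(s)) // false true
	fmt.Println(any(f) == nil, isNil(f)) // false true

	// Empty but allocated values are not nil.
	fmt.Println(isNil(make([]int, 0)), isNil(map[string]int{}), isNil(42)) // false false false
}
```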

diff --git a/pkg/lib/validate/general_test.go b/pkg/lib/validate/general_test.go
@@ -4,6 +4,14 @@ package validate
 
 import "testing"
 
+type doer struct{}
+
+func (d doer) Do() {}
+
+type Doer interface {
+	Do()
+}
+
 // TestIsNotNil tests the NotNil function for various scenarios.
 func TestIsNotNil(t *testing.T) {
 	t.Run("NilValue", func(t *testing.T) {
@@ -35,4 +43,78 @@ func TestIsNotNil(t *testing.T) {
 			t.Errorf("NotNil failed: unexpected error for non-nil pointer")
 		}
 	})
+
+	t.Run("NilFunc", func(t *testing.T) {
+		var nilFunc func()
+		err := NotNil(nilFunc, "value should not be nil")
+		if err == nil {
+			t.Errorf("NotNil failed: expected error for nil func")
+		}
+	})
+
+	t.Run("NonNilFunc", func(t *testing.T) {
+		nonNilFunc := func() {}
+		err := NotNil(nonNilFunc, "value should not be nil")
+		if err != nil {
+			t.Errorf("NotNil failed: unexpected error for non-nil func")
+		}
+	})
+
+	t.Run("NilSlice", func(t *testing.T) {
+		var nilSlice []int
+		err := NotNil(nilSlice, "value should not be nil")
+		if err == nil {
+			t.Errorf("NotNil failed: expected error for nil slice")
+		}
+	})
+
+	t.Run("NonNilSlice", func(t *testing.T) {
+		nonNilSlice := make([]int, 0)
+		err := NotNil(nonNilSlice, "value should not be nil")
+		if err != nil {
+			t.Errorf("NotNil failed: unexpected error for non-nil slice")
+		}
+	})
+
+	t.Run("NilMap", func(t *testing.T) {
+		var nilMap map[string]int
+		err := NotNil(nilMap, "value should not be nil")
+		if err == nil {
+			t.Errorf("NotNil failed: expected error for nil map")
+		}
+	})
+
+	t.Run("NonNilMap", func(t *testing.T) {
+		nonNilMap := make(map[string]int)
+		err := NotNil(nonNilMap, "value should not be nil")
+		if err != nil {
+			t.Errorf("NotNil failed: unexpected error for non-nil map")
+		}
+	})
+
+	t.Run("NilInterface", func(t *testing.T) {
+		var nilInterface Doer
+		err := NotNil(nilInterface, "value should not be nil")
+		if err == nil {
+			t.Errorf("NotNil failed: expected error for nil interface")
+		}
+	})
+
+	t.Run("NonNilInterface", func(t *testing.T) {
+		var nonNilInterface Doer = doer{}
+		err := NotNil(nonNilInterface, "value should not be nil")
+		if err != nil {
+			t.Errorf("NotNil failed: unexpected error for non-nil interface")
+		}
+	})
+
+	t.Run("FormattedMessage", func(t *testing.T) {
+		err := NotNil(nil, "value %s should not be nil", "test")
+		if err == nil {
+			t.Errorf("NotNil failed: expected error for nil value with formatted message")
+		}
+		if err.Error() != "value test should not be nil" {
+			t.Errorf("NotNil failed: unexpected error message, got %q", err.Error())
+		}
+	})
 }
diff --git a/pkg/models/buildversion.go b/pkg/models/buildversion.go
@@ -14,3 +14,12 @@ type BuildVersionInfo struct {
 	GOOS       string    `json:"GOOS" example:"linux"`
 	GOARCH     string    `json:"GOARCH" example:"amd64"`
 }
+
+func (b *BuildVersionInfo) Copy() *BuildVersionInfo {
+	if b == nil {
+		return nil
+	}
+	newB := new(BuildVersionInfo)
+	*newB = *b
+	return newB
+}
diff --git a/pkg/models/messages/constants.go b/pkg/models/messages/constants.go
@@ -9,4 +9,12 @@ const (
 	BidResultMessageType    = "BidResult"
 	RunResultMessageType    = "RunResult"
 	ComputeErrorMessageType = "ComputeError"
+
+	HandshakeRequestMessageType      = "transport.HandshakeRequest"
+	HeartbeatRequestMessageType      = "transport.HeartbeatRequest"
+	NodeInfoUpdateRequestMessageType = "transport.UpdateNodeInfoRequest"
+
+	HandshakeResponseType      = "transport.HandshakeResponse"
+	HeartbeatResponseType      = "transport.HeartbeatResponse"
+	NodeInfoUpdateResponseType = "transport.UpdateNodeInfoResponse"
 )
diff --git a/pkg/models/node_info.go b/pkg/models/node_info.go
@@ -4,8 +4,11 @@ package models
 import (
 	"context"
 	"fmt"
+	"slices"
 	"strings"
 
+	"github.com/google/go-cmp/cmp"
+	"github.com/google/go-cmp/cmp/cmpopts"
 	"golang.org/x/exp/maps"
 )
 
@@ -107,6 +110,43 @@ func (n NodeInfo) IsComputeNode() bool {
 	return n.NodeType == NodeTypeCompute
 }
 
+// Copy returns a deep copy of the NodeInfo
+func (n *NodeInfo) Copy() *NodeInfo {
+	if n == nil {
+		return nil
+	}
+	cpy := new(NodeInfo)
+	*cpy = *n
+
+	// Deep copy maps
+	cpy.Labels = maps.Clone(n.Labels)
+	cpy.SupportedProtocols = slices.Clone(n.SupportedProtocols)
+	cpy.ComputeNodeInfo = copyOrZero(n.ComputeNodeInfo.Copy())
+	cpy.BacalhauVersion = copyOrZero(n.BacalhauVersion.Copy())
+	return cpy
+}
+
+// HasStaticConfigChanged returns true if the static/configuration aspects of this node
+// have changed compared to other. It ignores dynamic operational fields like queue capacity
+// and execution counts that change frequently during normal operation.
+func (n NodeInfo) HasStaticConfigChanged(other NodeInfo) bool {
+	// Define which fields to ignore in the comparison
+	opts := []cmp.Option{
+		cmpopts.IgnoreFields(ComputeNodeInfo{},
+			"QueueUsedCapacity",
+			"AvailableCapacity",
+			"RunningExecutions",
+			"EnqueuedExecutions",
+		),
+		// Ignore ordering in slices
+		cmpopts.SortSlices(func(a, b string) bool { return a < b }),
+		cmpopts.SortSlices(func(a, b Protocol) bool { return string(a) < string(b) }),
+		cmpopts.SortSlices(func(a, b GPU) bool { return a.Less(b) }), // Sort GPUs by all fields for stable comparison
+	}
+
+	return !cmp.Equal(n, other, opts...)
+}
+
 // ComputeNodeInfo contains metadata about the current state and abilities of a compute node. Compute Nodes share
 // this state with Requester nodes by including it in the NodeInfo they share across the network.
 type ComputeNodeInfo struct {
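`HasStaticConfigChanged` relies on go-cmp to skip the fast-changing fields and to compare slices regardless of order. The same idea can be sketched with the standard library alone, which is what the example below does; it is a swapped-in technique, not the method used in the commit. The `nodeConfig` type and its field names are hypothetical, trimmed-down stand-ins for `NodeInfo`.

```go
package main

import (
	"fmt"
	"slices"
)

// nodeConfig is a hypothetical, reduced stand-in for NodeInfo.
type nodeConfig struct {
	Engines            []string // static configuration
	Protocols          []string // static configuration
	RunningExecutions  int      // dynamic, ignored by the comparison
	EnqueuedExecutions int      // dynamic, ignored by the comparison
}

// sortedCopy returns a sorted clone so the caller's slice is left untouched.
func sortedCopy(in []string) []string {
	out := slices.Clone(in)
	slices.Sort(out)
	return out
}

// hasStaticConfigChanged compares only the static fields and ignores
// element order, approximating IgnoreFields plus SortSlices from go-cmp.
func hasStaticConfigChanged(a, b nodeConfig) bool {
	return !slices.Equal(sortedCopy(a.Engines), sortedCopy(b.Engines)) ||
		!slices.Equal(sortedCopy(a.Protocols), sortedCopy(b.Protocols))
}

func main() {
	before := nodeConfig{
		Engines:           []string{"docker", "wasm"},
		Protocols:         []string{"ncl", "bprotocol"},
		RunningExecutions: 1,
	}
	// Same static configuration, different order and different load.
	reordered := nodeConfig{
		Engines:           []string{"wasm", "docker"},
		Protocols:         []string{"bprotocol", "ncl"},
		RunningExecutions: 7,
	}
	// One engine removed: a genuine configuration change.
	reduced := nodeConfig{
		Engines:   []string{"docker"},
		Protocols: []string{"ncl", "bprotocol"},
	}

	fmt.Println(hasStaticConfigChanged(before, reordered)) // false
	fmt.Println(hasStaticConfigChanged(before, reduced))   // true
}
```

The hand-written version has to name every static field explicitly, so a newly added field is silently ignored until the function is updated. The go-cmp approach in the diff has the opposite default: new fields are compared unless they are listed in `IgnoreFields`.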
@@ -120,3 +160,22 @@ type ComputeNodeInfo struct {
 	RunningExecutions  int       `json:"RunningExecutions"`
 	EnqueuedExecutions int       `json:"EnqueuedExecutions"`
 }
+
+// Copy returns a deep copy of the ComputeNodeInfo
+func (c *ComputeNodeInfo) Copy() *ComputeNodeInfo {
+	if c == nil {
+		return nil
+	}
+	cpy := new(ComputeNodeInfo)
+	*cpy = *c
+
+	// Deep copy slices
+	cpy.ExecutionEngines = slices.Clone(c.ExecutionEngines)
+	cpy.Publishers = slices.Clone(c.Publishers)
+	cpy.StorageSources = slices.Clone(c.StorageSources)
+	cpy.MaxCapacity = copyOrZero(c.MaxCapacity.Copy())
+	cpy.QueueUsedCapacity = copyOrZero(c.QueueUsedCapacity.Copy())
+	cpy.AvailableCapacity = copyOrZero(c.AvailableCapacity.Copy())
+	cpy.MaxJobRequirements = copyOrZero(c.MaxJobRequirements.Copy())
+	return cpy
+}
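Both `Copy` methods follow the same pattern: assign the struct by value, then clone every map and slice so the copy no longer shares backing storage with the original. The sketch below shows why the extra cloning is needed, using a hypothetical `nodeState` type and the standard library `maps` and `slices` packages (the diff itself imports `maps` from `golang.org/x/exp`). The `copyOrZero` helper is not shown in this part of the diff, so the generic version here is only a guess at its behavior: dereference the pointer, or return the zero value when it is nil.

```go
package main

import (
	"fmt"
	"maps"
	"slices"
)

// nodeState is a hypothetical stand-in for NodeInfo / ComputeNodeInfo.
type nodeState struct {
	Labels  map[string]string
	Engines []string
}

// Copy returns a deep copy and is safe to call on a nil receiver.
func (n *nodeState) Copy() *nodeState {
	if n == nil {
		return nil
	}
	cpy := new(nodeState)
	*cpy = *n // copies the map and slice headers only
	cpy.Labels = maps.Clone(n.Labels)
	cpy.Engines = slices.Clone(n.Engines)
	return cpy
}

// copyOrZero is an ASSUMED shape for the helper used in the diff.
func copyOrZero[T any](p *T) T {
	if p == nil {
		var zero T
		return zero
	}
	return *p
}

func main() {
	orig := &nodeState{
		Labels:  map[string]string{"zone": "eu"},
		Engines: []string{"docker"},
	}
	shallow := *orig    // plain struct assignment
	deep := orig.Copy() // struct assignment plus cloned map and slice

	// Mutate the original after both copies were taken.
	orig.Labels["zone"] = "us"
	orig.Engines[0] = "wasm"

	fmt.Println(shallow.Labels["zone"], shallow.Engines[0]) // us wasm
	fmt.Println(deep.Labels["zone"], deep.Engines[0])       // eu docker

	var missing *nodeState
	fmt.Println(missing.Copy() == nil)              // true
	fmt.Println(copyOrZero(missing).Engines == nil) // true
}
```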