Address review comments

influxdata · Nov 4, 2024 · ef7b24b
1 parent dc6ff19
commit ef7b24b
Showing 1 changed file with 46 additions and 27 deletions.
diff --git a/docs/specs/tsd-008-partial-write-error-handling.md b/docs/specs/tsd-008-partial-write-error-handling.md
@@ -11,43 +11,62 @@ output plugins, write, error, output model, metric, buffer
 
 ## Overview
 
-When output plugins serialize metric and/or send them to the service endpoint,
-single metrics might cause errors e.g. by not being serializable or by being
-rejected by the service on the server side.
-Currently, an output is only able to accept or reject the complete batch of
-metrics it receives from the output model. This causes issues if only a subset
-of metrics in the batch fails as the output has no way of telling the model
-(and in turn the output buffer) which metrics failed but can only accept or
-reject the whole batch. As a consequence, outputs need to "accept" the batch
-to avoid a requeueing of the batch for the next flush interval. This distorts
-statistics of accepted metrics and causes misleading log messages.
-Even worse, for outputs ending-up with partial writes, e.g. only the first half
-of the metrics can be written to the service, there is no way of only accepting
-the written metrics so they need to internally buffer the remaining ones.
+The output model wrapping each output plugin buffers metrics to be able to batch
+those metrics for more efficient sending. In each flush cycle, the model
+collects a batch of metrics and hands it over to the output plugin for writing
+through the `Write` method. Currently, if writing succeeds (i.e. no error is
+returned), _all metrics of the batch_ are removed from the buffer and are marked
+as __accepted__ both in terms of statistics as well as in tracking-metric terms.
+If writing fails (i.e. any error is returned), _all metrics of the batch_ are
+__kept__ in the buffer for requeueing them in the next write cycle.
+
+Issues arise when an output plugin can write only some of the metrics in a
+batch to its service endpoint, e.g. due to metrics not being serializable or if
+metrics are selectively rejected by the service on the server side. This might
+happen when reaching submission limits, violating service constraints e.g.
+by out-of-order sends, or due to invalid characters in the serialized metric.
+In those cases, an output currently is only able to accept or reject the
+_complete batch of metrics_ as there is no mechanism to inform the model (and
+in turn the buffer) that only _some_ of the metrics in the batch were failing.
+
+As a consequence, outputs often _accept_ the batch to avoid a requeueing of the
+failing metrics for the next flush interval. This distorts statistics of
+accepted metrics and causes misleading log messages claiming that all metrics
+were written successfully. Even worse, for outputs ending up with partial
+writes, e.g. only the first half of the metrics can be written to the service,
+there is no way of telling the model to selectively accept the actually written
+metrics. In turn, those outputs must internally buffer the remaining, unwritten
+metrics, leading to a duplication of buffering logic and adding to code
+complexity.
 
 This specification aims at defining the handling of partially successful writes
-and introduces the concept of a special _write error_ type. That error type
-must reflect partial writes and partial serialization to overcome the
-aforementioned issues.
+and introduces the concept of a special _partial write error_ type to reflect
+partial writes and partial serialization, overcoming the aforementioned issues
+and limitations.
 
-To do so, the error must contain a list of successfully
-written metrics, which must be marked as __accepted__ and must be removed from
-the buffer. The error must contain a list of metrics fatally failed to be
-written or serialized and cannot be retried, which must be marked as
-__rejected__ and must be removed from the buffer.
+To do so, the _partial write error_ type must contain a list of successfully
+written metrics. These metrics must be marked as __accepted__, both in terms of
+statistics as well as in terms of metric tracking, and must be removed from the
+buffer. Furthermore, the error must contain a list of metrics that cannot be
+sent or serialized and cannot be retried. These metrics must be marked as
+__rejected__, both in terms of statistics as well as in terms of metric
+tracking, and must be removed from the buffer.
 
 The error may contain a list of metrics not-yet written to be __kept__ for the
 next write cycle. Those metrics must not be marked and must be kept in the
-buffer. If the error does not contain the list, the list must be inferred using
-the accept and reject lists and the metrics in the batch.
+buffer. If the error does not contain the list of not-yet written metrics, this
+list must be inferred using the accept and reject lists mentioned above.
 
-All metric lists should be communicated as indices into the batch to be able
-to handle tracking metrics correctly.
+To allow the model and the buffer to correctly handle tracking metrics ending up
+in the buffer and output, the tracking information must be preserved during
+communication between the output plugin, the model and the buffer through the
+specified error. To do so, all metric lists should be communicated as indices
+into the batch.
 
 For backward compatibility and simplicity output plugins can return a `nil`
 error to indicate that __all__ metrics of the batch are __accepted__. Similarly,
-returing an error _not_ being a _write error_ indicates that __all__ metrics of
-the batch should be __kept__ in the buffer for the next write cycle.
+returning an error _not_ being a _partial write error_ indicates that __all__
+metrics of the batch should be __kept__ in the buffer for the next write cycle.
 
 ## Related Issues