ingest/pipeline: Create functional producer for BufferedStorageBackend #5412

sreuland · 2024-08-06T16:53:21Z

What problem does your feature solve?

BufferedStorageBackend provides ledger close meta(LCM) individually per GetLedger, but there is no more efficient way to participate as a streaming producer of LCM's.

What would you like to see?

Follow design proposal on Functional Processors Lib
Provide a ‘Producer’ function for BufferedStorage Backend.
The function will be used as ‘producer’ operator in a pipeline, emitting tx-meta LCM over a callback fn and acts as a closures to encapsulate private instance of BufferedStorageBackend and avoid any unintended side effects.

// the returned channel will either return one error or it will be closed when the publishing is finished for bounded range
// if it's unbounded, then chan would never be closed.

PublishFromBufferedStorageBackend(ledgerRange ledgerbackend.Range,
                                            bufferedConfig ledgerbackend.BufferedStorageBackendConfig,
                                            dataStoreConfig datastore.DataStoreConfig 
                                            ctx context.Context, 
                                            callback func (xdr.LedgerCloseMeta) error) chan error
    
}

The method will return immediate, creating an async worker routine in background to continue processing.

it uses whatever prepared ledger range was configured for the backend.
it goes into a loop of invoking internal LedgerBacked.GetLedger to iterate each LedgerCloseMeta retrieved for the given prepared ledger range.
It invokes the callback func per each LedgerCloseMeta.
If the range was unbounded then it runs infinite until ctx is canceled.

Visualization of where the producer function fits in the larger CDP design for data transformation pipeline:

Relates to:

What alternatives are there?

new streaming ingestion app use cases would have to implement the same locally.

The text was updated successfully, but these errors were encountered:

tamirms · 2024-08-09T21:16:09Z

I think there are some subtleties to this interface change that we should consider:

If captive-core terminates unexpectedly this interface does not allow us to communicate that error to the caller
What happens if you call Publish() and while that go routine is running you call GetLedger() concurrently?
What happens if you call Publish()multiple times before the go routine finishes running?
What happens if you call Publish() and then call PrepareRange() before the go routine finishes running?

Given these potential issues, I think for the MVP we should avoid changing the LedgerBackend interface. In the future, as we see more uses of the ingestion library we can come up with some helper functions which will reduce boilerplate.

sreuland · 2024-08-20T19:27:19Z

If captive-core terminates unexpectedly this interface does not allow us to communicate that error to the caller

publishing can return a channel to propagate completion status to the caller, an error is sent on channel and then closed or if no errors and finished publishing for requested range, then channel is closed.

What happens if you call Publish() and while that go routine is running you call GetLedger() concurrently?

What happens if you call Publish()multiple times before the go routine finishes running?

What happens if you call Publish() and then call PrepareRange() before the go routine finishes running?

yes, to avoid these re-entrancy problems with LegerBackend instance needed to drive publishing, I think can skip adding the notion of publishing on to LedgerBackend, instead go with functional closures to encapsulate a private instance of LedgerBackend for publishing concerns, internally the closures can iterate on GetLedger(), to avoid any side effects and edge cases related to the underlying backend, the net change for new proposal would be:

// this will create a private instance of CaptiveStellarCore using NewCaptive() 
PublishFromCaptiveCore(ledgerRange ledgerbackend.Range,
                                            captiveCoreConfig ledgerbackend.CaptiveCoreConfig, 
                                            ctx context.Context, 
                                            callback func (xdr.LedgerCloseMeta) error) chan error

// this will create a private instance of BufferedStorageBackend using NewBufferedStorageBackend()
PublishFromBufferedStorageBackend(ledgerRange ledgerbackend.Range,
                                            bufferedConfig ledgerbackend.BufferedStorageBackendConfig,
                                            dataStoreConfig datastore.DataStoreConfig 
                                            ctx context.Context, 
                                            callback func (xdr.LedgerCloseMeta) error) chan error
    
}

Given these potential issues, I think for the MVP we should avoid changing the LedgerBackend interface. In the future, as we see more uses of the ingestion library we can come up with some helper functions which will reduce boilerplate.

I think if we can provide this sdk mechanism up front for automating the streaming of ledger tx-meta it will be worthwhile for demonstrating the DX during the MVP timeframe as it lowers resistance for app development(DX) to adopt CDP approach of transforming network data to derived models in a pipeline. Apps avoid investing in boilerplate(ledgerbackend, GetLedger iteration, etc) and they get stellar tx-meta 'source of origin' operator(publisher) to use in their pipeline out-of-box.

sreuland · 2024-09-16T21:11:38Z

@chowbao , @tamirms @urvisavla

are there any known BufferedStorage settings based on benchmarks that we feel good to provide as a default constant in the sdk, i.e. clients can use it for sanity check/reference and to quickly get moving initially, tuning later if they need:

buffer_size = ?  
num_workers = ?      
retry_limit = ?      
retry_wait = "30s"

could encapsulate functionally in sdk as func DefaultBufferedStorageBackendConfig() *BufferedStorageBackendConfig

urvisavla · 2024-09-16T21:36:58Z

@chowbao , @tamirms @urvisavla

are there any known BufferedStorage settings based on benchmarks that we feel good to provide as a default constant in the sdk, i.e. clients can use it for sanity check/reference and to quickly get moving initially, tuning later if they need:
buffer_size = ?  
num_workers = ?      
retry_limit = ?      
retry_wait = "30s"

Here's a summary of the recommended configuration for buffer size and number of workers based on my analysis:

For a small number of ledgers_per_file (1 LedgersPerFile):
buffer_size: 100 to 500 and num_workers: 5

For a medium number of ledgers_per_file (100 LedgersPerFile):
buffer_size: 10 and num_workers: 5

For a large number of ledgers_per_file (1000 LedgersPerFile):
buffer_size: 10 and num_workers: 1 to 2

You can find the detailed numbers and results here. One thing to note is that these tests were run on my local machine so actual times may vary depending on hardware but the relative config recommendation should remain same.

As for retry_limit and retry_wait, these values aren't dependent on other parameters so imo a reasonable values of retry limit=3 to 5 and retry_wait=30s should be good.

Let me know if you need any additional info. Thanks!

…ew feedback on api best practice

…k asserts

…istent on ci runs

…cancel

sreuland added feature request cdp-horizon-scrum labels Aug 6, 2024

sreuland added this to Platform Scrum Aug 6, 2024

github-project-automation bot moved this to Backlog in Platform Scrum Aug 6, 2024

sreuland added the cdp-processors-lib label Aug 20, 2024

sreuland added this to the platform sprint 50 milestone Aug 30, 2024

sreuland mentioned this issue Sep 3, 2024

ingest/pipeline: Create functional producer for CaptiveStellarCore LedgerBackend #5451

Open

sreuland changed the title ~~ingest/pipeline: Update LedgerBackend to support functional pipeline producer~~ ingest/pipeline: Create functional producer for BufferedStorageBackend Sep 3, 2024

sreuland self-assigned this Sep 6, 2024

sreuland modified the milestones: platform sprint 50, platform sprint 51 Sep 10, 2024

sreuland added a commit to sreuland/go that referenced this issue Sep 17, 2024

stellar#5412: functional producer for bufferedstorage backend

1b3f07e

sreuland mentioned this issue Sep 17, 2024

ingest/ledgerbackend: Create functional producer for BufferedStorageBackend #5462

Merged

7 tasks

sreuland added a commit to sreuland/go that referenced this issue Sep 20, 2024

stellar#5412: updated changelog to reflect new function feature

a26ff3d

sreuland added a commit to sreuland/go that referenced this issue Sep 25, 2024

stellar#5412: moved the producer fn into new cdp package under ingest

cd7bb34

sreuland added a commit to sreuland/go that referenced this issue Sep 25, 2024

stellar#5412: forgot to include new files on last commit

3414d24

sreuland added a commit to sreuland/go that referenced this issue Sep 25, 2024

stellar#5412: moved PublisherConfig to cdp package

68f7f43

sreuland added a commit to sreuland/go that referenced this issue Sep 26, 2024

stellar#5412: review feedback

4876f2e

sreuland added a commit to sreuland/go that referenced this issue Sep 26, 2024

stellar#5412: review feedback on loop logic

94285c0

sreuland added a commit to sreuland/go that referenced this issue Sep 27, 2024

stellar#5412: converted producer function to sync signature, per revi…

8c3f694

…ew feedback on api best practice

sreuland added a commit to sreuland/go that referenced this issue Sep 27, 2024

stellar#5412: add unit test to assert caller ctx cancellation outcome

f1e9d27

sreuland added a commit to sreuland/go that referenced this issue Sep 30, 2024

stellar#5412: fixed unit test for producer caller cancel ctx

952d4d6

sreuland added a commit to sreuland/go that referenced this issue Sep 30, 2024

stellar#5412: fixed unit test for producer get ledger error case

4447f68

sreuland added a commit to sreuland/go that referenced this issue Sep 30, 2024

stellar#5412: fixed unit test for producer get ledger error case, moc…

03cd4b6

…k asserts

sreuland added a commit to sreuland/go that referenced this issue Sep 30, 2024

stellar#5412: included changelog on new ingest/cdp package

d25a506

sreuland added a commit to sreuland/go that referenced this issue Oct 3, 2024

stellar#5412: renamed the producer fn to ApplyLedgerMetadata

1b1244b

sreuland closed this as completed in #5462 Oct 3, 2024

github-project-automation bot moved this from Needs Review to Done in Platform Scrum Oct 3, 2024

sreuland added a commit to sreuland/go that referenced this issue Oct 8, 2024

stellar#5412: fixed producer test for caller ctx cancel to be consistent

9d06613

sreuland added a commit to sreuland/go that referenced this issue Oct 8, 2024

stellar#5412: triaging producer test for caller ctx cancel to be cons…

dd0f2e9

…istent on ci runs

sreuland added a commit to sreuland/go that referenced this issue Oct 8, 2024

stellar#5412: review feedback on producer test, where to run the ctx …

37eb785

…cancel

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ingest/pipeline: Create functional producer for BufferedStorageBackend #5412

ingest/pipeline: Create functional producer for BufferedStorageBackend #5412

sreuland commented Aug 6, 2024 •

edited

Loading

tamirms commented Aug 9, 2024

sreuland commented Aug 20, 2024

sreuland commented Sep 16, 2024 •

edited

Loading

urvisavla commented Sep 16, 2024 •

edited

Loading

ingest/pipeline: Create functional producer for BufferedStorageBackend #5412

ingest/pipeline: Create functional producer for BufferedStorageBackend #5412

Comments

sreuland commented Aug 6, 2024 • edited Loading

What problem does your feature solve?

What would you like to see?

What alternatives are there?

tamirms commented Aug 9, 2024

sreuland commented Aug 20, 2024

sreuland commented Sep 16, 2024 • edited Loading

urvisavla commented Sep 16, 2024 • edited Loading

sreuland commented Aug 6, 2024 •

edited

Loading

sreuland commented Sep 16, 2024 •

edited

Loading

urvisavla commented Sep 16, 2024 •

edited

Loading