Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add time range parameters to sync script #483

Merged
merged 20 commits into from
Jan 10, 2025
Merged

Add time range parameters to sync script #483

merged 20 commits into from
Jan 10, 2025

Conversation

fhenneke
Copy link
Collaborator

@fhenneke fhenneke commented Jan 7, 2025

This PR adds start and end times to the sync data script.

The command line arguments start_time and end_time are added to the sync_data script. The script is called as, e.g.,

python -m src.data_sync.sync_data --sync-table order_data --start-time 2025-12-30 --end-time 2025-01-07

Only data between start_time and end_time is computed. This data is then upserted into the corresponding table of the month. I.e. if a row was not in the table already, it is inserted into the table. Rows from the new data replace rows of the old data if it exists. Old data which was not recomputed stays as is.

The code is structured as follows:

  1. Arguments are parsed with appropriate default values.
  2. The full time range is partitioned into monthly ranges.
  3. Block ranges and months are computed from those time ranges.
  4. Essentially the old code is used for computing data for those block ranges.
  5. Data is written to the database.

The convention for the stat time to be inclusive and for the end time to be exclusive is used. This way the two ranges (2024-12-30, 2025-01-02), (2025-01-02, 2025-01-07) would give the same result as the range (2024-12-30, 2025-01-07). Though some overlap is required in cases end_time is beyond the last finalized block.
If no argument is supplied, the start of the month and the start of the next month are used as default for start_time and end_time, respectively, to compute data for the full month, until the last finalized block.

The previous month is not automatically recomputed on the first of the next month. Instead, one can use a time range which contains whatever time window for which data needs to be recomputed.

Potential TODOs

  • Change data-jobs workflows to have (overlapping) time ranges.
  • Add tests
  • Add documentation
  • Add functionality to drop a table instead of upserting. This will get rid of wrong data not corresponding to any valid order or batch.

@fhenneke fhenneke changed the title [Draft] Add time range parameters to sync script Add time range parameters to sync script Jan 8, 2025
@fhenneke fhenneke marked this pull request as ready for review January 8, 2025 15:37
src/data_sync/common.py Outdated Show resolved Hide resolved
from src.logger import set_log

log = set_log(__name__)


def compute_time_range(
start_time: datetime, end_time: datetime
) -> list[tuple[datetime, datetime]]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't there used to be a type AccountingPeriod? If that class has been abolished, it might not hurt to add a basic declaration:

AccountingPeriod = tuple[datetime, datetime]

that can be passed around the project for better readability.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I am still not sure how to reasonably merge the two code bases of solver-rewards and dune-sync (v1).

Accounting period seems to be tailored to rewards with defaulting to weeks and having pretty printing for dune tables. The accounting period also has some implicit constraints on having a time of 00:00:00 in some parts of the code.

Making the concept of accounting periods consistent throughout the code would definitely help.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically, I block number ranges would be the most appropriate thing. This was the benefit of the old AccountingPeriod class; it provided a bijection between timestamps and block numbers so computations could be performed on the more natural type.

Anyway, I suppose this is just a side note. Feel free to ignore.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is an ongoing internal debate on what the most natural concept for accounting periods is. Block ranges sound nice when considering how backend data is structured. But they become a bit unnatural if multiple chains are considered. For simplicity, we want to use time based periods everywhere, across chains and projects.


async def sync_data_to_db( # pylint: disable=too-many-arguments
async def sync_data_to_db( # pylint: disable=too-many-arguments, too-many-locals
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Looks like this method has become quite overloaded...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I simplified the code (but not the responsibilities of the function) a bit. At least I do not need to exclude the pylint check anymore.

Copy link
Contributor

@bram-vdberg bram-vdberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!


start_block = find_block_with_timestamp(node, start_time.timestamp())
if latest_block_time < end_time:
end_block = int(latest_block["number"])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add a log here to let us know if this happens that the script is not actually using the end time because it's still in the future?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have an opinion on where this processing for time ranges should take place?

On the one hand, most of the logging and checking can also happen during initialization in ScriptArgs. That might also remove the implicit changing of end times due to the restriction to finalized blocks.

On the other hand, a node would be required to do the checking for latest block. And it would be mixing abstractions. The accounting, on a high level, should be time based. That this is internally split into monthly ranges and blocks is a bit of an implementation detail.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure to be honest. Since this requires actual implementation details and not checking whether or not the args are valid it feels cleaner to leave this here. For the sake of serializing the args any end time > start time would be valid, so it seems to me like this is the right place. But if moving this to the initialization makes it more maintainable then that could be a good argument to do this there instead.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now I just added a log message.

Copy link

socket-security bot commented Jan 9, 2025

New and removed dependencies detected. Learn more about Socket for GitHub ↗︎

Package New capabilities Transitives Size Publisher
pypi/[email protected] environment, unsafe 0 1.67 MB protobuf-packages
pypi/[email protected] environment, eval, filesystem, network, shell, unsafe 0 1.67 MB piro

🚮 Removed packages: pypi/[email protected], pypi/[email protected]

View full report↗︎

@fhenneke
Copy link
Collaborator Author

fhenneke commented Jan 9, 2025

I recompiled dependencies after adding python-dateutils. This removed some dependencies.

src/data_sync/common.py Outdated Show resolved Hide resolved

log = set_log(__name__)


def compute_time_range(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Name of function is a bit confusing as its input is actually the time range, but tbh i don't have any good suggestion for the name

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I renamed to partition_time_range and added a comment at the call site to make it a bit clearer.

Overall, this splitting of time ranges is a bit confusing since it is leaking implementation details of chunking into the rest of the code. The current main has that problem too, though.

fhenneke and others added 4 commits January 10, 2025 10:01
Co-authored-by: Haris Angelidakis <[email protected]>
to make it easier to understand what is going on with time ranges
@fhenneke fhenneke merged commit e383645 into main Jan 10, 2025
6 checks passed
@fhenneke fhenneke deleted the sync_time_range branch January 10, 2025 15:29
@github-actions github-actions bot locked and limited conversation to collaborators Jan 10, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants