Improve ITCH parser and writer #8

Open
vgreg opened this issue Apr 24, 2024 · 2 comments
vgreg commented Apr 24, 2024

See if we can speed up the parser.

vgreg commented Apr 24, 2024

Potential ideas for speedup:

  • Avoid converting to a string for comparison in the long if... chain.
  • Reorder the message checks in that if chain by frequency, most frequent first.
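
A rough sketch of both ideas, not the project's actual code. It assumes `raw` is a single message whose first byte is the ITCH message type; the function names and returned labels are hypothetical, and the ordering of the checks is a guess that should be replaced with counts measured on real files.

```python
def classify_str(raw: bytes) -> str:
    """Baseline: decode the type byte to str before every comparison."""
    msg_type = raw[0:1].decode("ascii")
    if msg_type == "S":
        return "system_event"
    elif msg_type == "R":
        return "stock_directory"
    elif msg_type == "A":
        return "add_order"
    elif msg_type == "D":
        return "order_delete"
    return "other"


def classify_bytes(raw: bytes) -> str:
    """Compare bytes directly (no decode) and test likely-frequent types first."""
    msg_type = raw[0:1]  # still bytes, nothing is converted to str
    # Assumed ordering: order messages are expected to dominate a trading day.
    if msg_type == b"A":
        return "add_order"
    elif msg_type == b"D":
        return "order_delete"
    # Assumed rare: administrative messages.
    elif msg_type == b"S":
        return "system_event"
    elif msg_type == b"R":
        return "stock_directory"
    return "other"
```

Both functions return the same label for the same input; only the cost of reaching it differs. Whether the difference is noticeable would need to be measured (for example with `timeit`) on a real file.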

New improved output formats:

  • As (rich) markdown
  • As JSON
  • As an Arrow-ready format (one layout common to all messages, for storage in Parquet files)

The markdown output is for interactive work and for the CLI.

The JSON output is to simplify development and debugging.

The Arrow format is to be able to store historical files in Parquet format for easy searching and extraction. That way, when we want to look at a subset of stocks on a given day, we can easily query the messages related to those symbols and days and process them.
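
To make the "common for all messages" idea concrete, here is a minimal sketch using only the standard library. The column names in `COLUMNS` and the dict-based message representation are assumptions for illustration, not the existing data model. Every message is mapped onto the same set of keys, with `None` where a field does not apply, so a list of rows has a uniform shape.

```python
import json

# Hypothetical shared column set; every message type maps onto these keys.
COLUMNS = ("timestamp", "msg_type", "stock", "order_ref", "side", "shares", "price")


def to_row(message: dict) -> dict:
    """Return a row with every column present, using None for absent fields."""
    return {col: message.get(col) for col in COLUMNS}


def to_json(message: dict) -> str:
    """Serialize the uniform row as JSON, handy for debugging."""
    return json.dumps(to_row(message), sort_keys=True)


# Illustrative values only, not taken from a real file.
add_order = {"timestamp": 1, "msg_type": "A", "stock": "AAPL",
             "order_ref": 42, "side": "B", "shares": 100, "price": 1500000}
order_delete = {"timestamp": 2, "msg_type": "D", "order_ref": 42}

rows = [to_row(add_order), to_row(order_delete)]
```

Because all rows share the same keys, a list like `rows` should be straightforward to hand to a columnar library. With pyarrow, for instance, `pyarrow.Table.from_pylist(rows)` followed by `pyarrow.parquet.write_table(...)` is one possible route, though an explicit schema would likely be needed for columns that start with `None`.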

vgreg commented Apr 24, 2024

The overall parser architecture should be overhauled. The current approach is highly inefficient because it forces all messages to be stored in memory.

A more modern way to read large files like this would be to use a generator that yields one message at a time and can filter automatically as it goes:
https://realpython.com/introduction-to-python-generators/
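
A minimal sketch of such a generator, under stated assumptions: each message in the file is preceded by a 2-byte big-endian length, and the first byte of each message is its type. `iter_messages`, `frame`, and the `wanted` parameter are hypothetical names, not part of the current code.

```python
import io
import struct
from typing import BinaryIO, Iterator, Optional, Set


def iter_messages(stream: BinaryIO,
                  wanted: Optional[Set[bytes]] = None) -> Iterator[bytes]:
    """Yield raw messages one at a time, optionally filtered by type byte.

    Assumes each message is framed by a 2-byte big-endian length prefix.
    """
    while True:
        header = stream.read(2)
        if len(header) < 2:
            return  # end of file
        (size,) = struct.unpack(">H", header)
        payload = stream.read(size)
        if len(payload) < size:
            return  # truncated trailing message, stop quietly
        if wanted is None or payload[0:1] in wanted:
            yield payload


def frame(payload: bytes) -> bytes:
    """Helper for the demo: prepend the assumed length prefix."""
    return struct.pack(">H", len(payload)) + payload


# Demo on an in-memory stream with made-up payloads.
demo = io.BytesIO(frame(b"Sxx") + frame(b"Ayyyy") + frame(b"Dz"))
adds = list(iter_messages(demo, wanted={b"A"}))
```

Only one message is held in memory at a time, and messages that fail the filter are never decoded further. The same loop works on a real file opened with `open(path, "rb")`.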

It would also decouple two important aspects of the message parser: reading and writing. The "in-memory" representation is currently at the message level, but the code around it is very messy. We could have many readers, one for each file type: at minimum a binary ITCH reader, but potentially also Parquet, JSON, etc.

We could also have many writers, one for each file type.

The formatting logic could be defined at the message level.
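
One way this split could look, as a sketch only: the message class owns its formatting, readers are generators that yield messages, and writers consume any iterable of messages. `AddOrder`, its fields, and the reader and writer function names are hypothetical and do not reflect the current code.

```python
import io
import json
from dataclasses import asdict, dataclass
from typing import Iterable, Iterator, TextIO


@dataclass
class AddOrder:
    """Hypothetical message type; formatting lives on the message itself."""
    timestamp: int
    stock: str
    shares: int
    price: int

    def to_json(self) -> str:
        return json.dumps(asdict(self))

    def to_markdown(self) -> str:
        return f"| {self.timestamp} | {self.stock} | {self.shares} | {self.price} |"


def read_json_lines(stream: TextIO) -> Iterator[AddOrder]:
    """Reader: one JSON object per line in, one message out at a time."""
    for line in stream:
        if line.strip():
            yield AddOrder(**json.loads(line))


def write_markdown(messages: Iterable[AddOrder], out: TextIO) -> None:
    """Writer: knows the destination, delegates formatting to the message."""
    for message in messages:
        out.write(message.to_markdown() + "\n")


def write_json_lines(messages: Iterable[AddOrder], out: TextIO) -> None:
    for message in messages:
        out.write(message.to_json() + "\n")


# Any reader can be paired with any writer.
source = io.StringIO('{"timestamp": 1, "stock": "AAPL", "shares": 100, "price": 1500000}\n')
sink = io.StringIO()
write_markdown(read_json_lines(source), sink)
```

Since readers are generators, a pipeline like this streams messages from source to sink without holding the whole file in memory. A binary ITCH reader or a Parquet writer would slot in as additional functions with the same shape.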
