
Adding multiprocessing #148

Merged
merged 4 commits into from
Mar 4, 2022

Conversation

casabre
Contributor

@casabre casabre commented Mar 1, 2022

First implementation of multiprocessing for review

@casabre
Contributor Author

casabre commented Mar 1, 2022

@dixudx which formatting tool are you using? I would adapt my changes to this because my VS Code formatted with Black...

@casabre
Contributor Author

casabre commented Mar 1, 2022

I observed a speed-up of approximately 43% for a `runSavedQueryByID` call with 82 entries. However, the parsing takes quite a long time compared to the `requests.get` round-trip time.

  • Time with old implementation for the described call ~145 seconds
  • Time with multiprocessing for the described call ~83 seconds
  • The requests.get call takes ~1.5 seconds
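As an aside for readers, the per-entry parsing fan-out discussed here can be sketched with a `multiprocessing.Pool`. This is a minimal illustration, not the library's actual code: `parse_entry`, `parse_entries`, and the entry shape are all hypothetical stand-ins.

```python
from multiprocessing import Pool

def parse_entry(raw):
    # Hypothetical stand-in for the expensive per-entry parsing step.
    return {"id": raw["id"], "title": raw["title"].strip()}

def parse_entries(raw_entries, processes=4):
    # Fan the per-entry work out over a pool of worker processes.
    with Pool(processes=processes) as pool:
        return pool.map(parse_entry, raw_entries)

if __name__ == "__main__":
    entries = [{"id": i, "title": f"  item {i}  "} for i in range(82)]
    parsed = parse_entries(entries)
    print(parsed[0])  # {'id': 0, 'title': 'item 0'}
```

Note that `parse_entry` must be picklable (defined at module level) for the pool to dispatch it to worker processes.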

Owner

@dixudx dixudx left a comment


Looks good to me.
Would you please format the code? Thanks

@dixudx
Owner

dixudx commented Mar 2, 2022

> which formatting tool are you using? I would adapt my changes to this because my VS Code formatted with Black...

@casabre Please run tox -e pycodestyle. Refer to the testing guide for details.

@dixudx
Owner

dixudx commented Mar 2, 2022

> I observed a speed-up of approximately 43% for a `runSavedQueryByID` call with 82 entries. However, the parsing takes quite a long time in comparison to the `requests.get` round-trip time.
>
> * Time with old implementation for the described call: `~145 seconds`
> * Time with multiprocessing for the described call: `~83 seconds`
> * The `requests.get` call takes `~1.5 seconds`

Really HUGE improvements. 👍🏻👍🏻👍🏻

I am wondering whether we could have a benchmark graph on our README.

@casabre
Contributor Author

casabre commented Mar 2, 2022

> Really HUGE improvements. 👍🏻👍🏻👍🏻

Thanks a lot, but I dug deeper because I was wondering about the long overall processing time. I re-used the previously mentioned scenario; the query now contains 83 entries.

| Action | Processing time | Comment |
| --- | --- | --- |
| `requests.get` | ~1.5 - 2 seconds | Biased by VPN |
| `xmltodict.parse` | ~0.1 seconds | |
| process query dict | ~85 - 90 seconds | Process spawning takes ~5 seconds; consecutive runs reuse the instance before garbage collection, which speeds things up |

I am wondering where the processing time is lost: I used 8 cores, which should give at least an 8x speed-up, not ~2x. Breaking the old implementation's result down into per-item times, I end up with ~2 seconds per work item. With 8 processes at a constant ~2 seconds per item, I would expect roughly 20 seconds overall, plus some scheduling overhead.
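The back-of-the-envelope estimate above can be written out with the numbers from this thread (83 items, ~2 s each, 8 workers):

```python
import math

items = 83
per_item_seconds = 2.0
workers = 8

# Serial cost: every item parsed one after another.
serial = items * per_item_seconds                      # 166.0 s, near the observed ~145 s
# Ideal parallel cost: items split into waves of 8, each wave ~2 s.
ideal = math.ceil(items / workers) * per_item_seconds  # 22.0 s with perfect scheduling
print(serial, ideal)  # 166.0 22.0
```

The ~22 s ideal versus the measured ~85-90 s is the gap the comment is asking about.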

I don't know the details of the underlying implementation, but is each work item queried separately in the mapping phase? → #149 could help a lot even without multiprocessing, because we could parse during the awaited `requests.get` phase.

> I am wondering whether we could have a benchmark graph on our README.

Sure, we can do that. I can prepare a Jupyter notebook. Do you have a reasonable setup? Currently I am working from home via a VPN connection, which biases the results 😉

@casabre
Contributor Author

casabre commented Mar 2, 2022

Furthermore, I don't know whether JSON parsing is faster than XML parsing. With ujson there is at least a fast JSON implementation available.

Edit: no significant improvements with JSON.
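A quick way to run the comparison mentioned here is a small `timeit` harness. This sketch uses only the stdlib `json` parser; `ujson.loads` can be passed in as a drop-in swap if it is installed. The payload shape is a made-up approximation of a paged query response.

```python
import json
import timeit

# Build a synthetic payload roughly shaped like a paged query response.
payload = json.dumps([{"id": i, "value": "x" * 50} for i in range(1000)])

def bench(loads, number=20):
    # Time repeated parses of the same payload; pass ujson.loads here
    # to compare against the stdlib parser.
    return timeit.timeit(lambda: loads(payload), number=number)

stdlib_seconds = bench(json.loads)
print(f"stdlib json: {stdlib_seconds:.4f}s for 20 parses")
```

If both parse times are dwarfed by the ~85-90 s dict-processing step above, the parser choice hardly matters, which matches the "no significant improvements" finding.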

@casabre
Contributor Author

casabre commented Mar 2, 2022

I also found a spot in the lowest `base.py` layer that was asking for an MP change 😉. The speed improvement now goes from a factor of ~2 to ~6.

| Change | Overall time |
| --- | --- |
| Initial | ~145 seconds |
| ProcessPool for `_get_paged_resources` | ~80 - 90 seconds |
| ProcessPool for `_get_paged_resources` + ThreadPool for `__initializeFromRaw` | ~24 seconds |
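The two-level layout described above (a process pool for the heavy page parsing, a thread pool for the cheap per-item initialisation) can be sketched roughly like this. Only the names `_get_paged_resources` and `__initializeFromRaw` come from the thread; `parse_page`, `initialize_item`, and `load_all` are hypothetical stand-ins with stub bodies.

```python
from multiprocessing import Pool
from multiprocessing.pool import ThreadPool

def parse_page(page):
    # CPU-bound stand-in for the per-page work in _get_paged_resources.
    return [x * 2 for x in page]

def initialize_item(raw):
    # Lightweight stand-in for the per-item __initializeFromRaw step.
    return {"value": raw}

def load_all(pages, processes=2, threads=4):
    # Process pool for the CPU-heavy page parsing...
    with Pool(processes=processes) as ppool:
        parsed_pages = ppool.map(parse_page, pages)
    # ...then a thread pool for the cheap per-item initialisation,
    # where process-spawn overhead would not pay off.
    flat = [raw for page in parsed_pages for raw in page]
    with ThreadPool(threads) as tpool:
        return tpool.map(initialize_item, flat)

if __name__ == "__main__":
    print(load_all([[1, 2], [3, 4]]))
    # [{'value': 2}, {'value': 4}, {'value': 6}, {'value': 8}]
```

Splitting the work this way matches the measurements in the table: processes where the work is CPU-bound, threads where it is light enough that spawn cost would dominate.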

I was wondering why looping over the OrderedDict items takes that long, as there is no complex data filtering/mapping applied.

@casabre casabre changed the title Adding multiprocessing to _get_paged_resources Adding multiprocessing Mar 3, 2022
Owner

@dixudx dixudx left a comment


Looks good to me. Thanks for such a HUGE improvement.

@casabre
Contributor Author

casabre commented Apr 5, 2022

@dixudx when are you planning a new release? I would need it quite soon, because Git cloning won't be an option in the future :).

@dixudx
Owner

dixudx commented Apr 6, 2022

@casabre Sure. Already shipped with 0.8.0. You can install with pip now.

@dixudx dixudx mentioned this pull request Apr 9, 2022