
Adding multiprocessing #148

Merged
merged 4 commits into from
Mar 4, 2022

Conversation

casabre
Contributor

@casabre casabre commented Mar 1, 2022

First implementation of multiprocessing for review

@casabre
Contributor Author

casabre commented Mar 1, 2022

@dixudx which formatting tool are you using? I would adapt my changes to this because my VS Code formatted with Black...

@casabre
Contributor Author

casabre commented Mar 1, 2022

I observed a speed-up of approximately 43% for a `runSavedQueryByID` call with 82 entries. However, the parsing takes quite a long time compared to the `requests.get` round-trip time.

  • Time with old implementation for the described call ~145 seconds
  • Time with multiprocessing for the described call ~83 seconds
  • The requests.get call takes ~1.5 seconds
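As an aside for readers, the per-entry parsing fan-out discussed here can be sketched with a `multiprocessing.Pool`. This is a minimal illustration, not the library's actual code: `parse_entry`, `parse_entries`, and the entry shape are all hypothetical stand-ins.

```python
from multiprocessing import Pool

def parse_entry(raw):
    # Hypothetical stand-in for the expensive per-entry parsing step.
    return {"id": raw["id"], "title": raw["title"].strip()}

def parse_entries(raw_entries, processes=4):
    # Fan the per-entry work out over a pool of worker processes.
    with Pool(processes=processes) as pool:
        return pool.map(parse_entry, raw_entries)

if __name__ == "__main__":
    entries = [{"id": i, "title": f"  item {i}  "} for i in range(82)]
    parsed = parse_entries(entries)
    print(parsed[0])  # {'id': 0, 'title': 'item 0'}
```

Note that `parse_entry` must be picklable (defined at module level) for the pool to dispatch it to worker processes.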

Owner

@dixudx dixudx left a comment


Looks good to me.
Would you please format the code? Thanks

@dixudx
Owner

dixudx commented Mar 2, 2022

> which formatting tool are you using? I would adapt my changes to this because my VS Code formatted with Black...

@casabre Please run tox -e pycodestyle. Refer to the testing guide for details.

@dixudx
Owner

dixudx commented Mar 2, 2022

> I observed a speed-up of approximately 43% for a `runSavedQueryByID` call with 82 entries. However, the parsing takes quite a long time in comparison to the `requests.get` round-trip time.
>
> * Time with old implementation for the described call: `~145 seconds`
> * Time with multiprocessing for the described call: `~83 seconds`
> * The `requests.get` call takes `~1.5 seconds`

Really HUGE improvements. 👍🏻👍🏻👍🏻

I am wondering whether we could have a benchmark graph on our README.

@casabre
Contributor Author

casabre commented Mar 2, 2022

> Really HUGE improvements. 👍🏻👍🏻👍🏻

Thanks a lot, but I dug deeper because I was wondering about the long overall processing time. I re-used the previously mentioned scenario; the query now contains 83 entries.

| Action | Processing time | Comment |
| --- | --- | --- |
| `requests.get` | ~1.5 - 2 seconds | Biased by VPN |
| `xmltodict.parse` | ~0.1 seconds | |
| process query dict | ~85 - 90 seconds | Process spawning takes ~5 seconds; consecutive runs reuse the instance before garbage collection, which speeds things up |

I am wondering where the processing time is lost: I used 8 cores, which should give at least an 8x speed-up, not ~2x. Breaking the old implementation's result down into per-item times, I end up with ~2 seconds per work item. With 8 processes at a constant ~2 seconds per item, I would expect roughly 20 seconds overall, plus some scheduling overhead.
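The back-of-the-envelope estimate above can be written out with the numbers from this thread (83 items, ~2 s each, 8 workers):

```python
import math

items = 83
per_item_seconds = 2.0
workers = 8

# Serial cost: every item parsed one after another.
serial = items * per_item_seconds                      # 166.0 s, near the observed ~145 s
# Ideal parallel cost: items split into waves of 8, each wave ~2 s.
ideal = math.ceil(items / workers) * per_item_seconds  # 22.0 s with perfect scheduling
print(serial, ideal)  # 166.0 22.0
```

The ~22 s ideal versus the measured ~85-90 s is the gap the comment is asking about.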

I don't know the details of the underlying implementation, but is each work item queried separately in the mapping phase? → #149 could help a lot even without multiprocessing, because we could parse during the awaited `requests.get` phase.

> I am wondering whether we could have a benchmark graph on our README.

Sure, we can do that. I can prepare a Jupyter notebook. Do you have a reasonable setup? Currently I am working from home via a VPN connection, which biases the results 😉

@casabre
Contributor Author

casabre commented Mar 2, 2022

Furthermore, I don't know whether JSON parsing is faster than XML parsing. With ujson there is at least a fast JSON implementation available.

Edit: no significant improvements with JSON.
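A quick way to run the comparison mentioned here is a small `timeit` harness. This sketch uses only the stdlib `json` parser; `ujson.loads` can be passed in as a drop-in swap if it is installed. The payload shape is a made-up approximation of a paged query response.

```python
import json
import timeit

# Build a synthetic payload roughly shaped like a paged query response.
payload = json.dumps([{"id": i, "value": "x" * 50} for i in range(1000)])

def bench(loads, number=20):
    # Time repeated parses of the same payload; pass ujson.loads here
    # to compare against the stdlib parser.
    return timeit.timeit(lambda: loads(payload), number=number)

stdlib_seconds = bench(json.loads)
print(f"stdlib json: {stdlib_seconds:.4f}s for 20 parses")
```

If both parse times are dwarfed by the ~85-90 s dict-processing step above, the parser choice hardly matters, which matches the "no significant improvements" finding.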

@casabre
Contributor Author

casabre commented Mar 2, 2022

I also found a spot in the lowest `base.py` layer that was asking for an MP change 😉. The speed improvement now goes from a factor of ~2 to ~6.

| Change | Overall time |
| --- | --- |
| Initial | ~145 seconds |
| ProcessPool for `_get_paged_resources` | ~80 - 90 seconds |
| ProcessPool for `_get_paged_resources` + ThreadPool for `__initializeFromRaw` | ~24 seconds |
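The two-level layout described above (a process pool for the heavy page parsing, a thread pool for the cheap per-item initialisation) can be sketched roughly like this. Only the names `_get_paged_resources` and `__initializeFromRaw` come from the thread; `parse_page`, `initialize_item`, and `load_all` are hypothetical stand-ins with stub bodies.

```python
from multiprocessing import Pool
from multiprocessing.pool import ThreadPool

def parse_page(page):
    # CPU-bound stand-in for the per-page work in _get_paged_resources.
    return [x * 2 for x in page]

def initialize_item(raw):
    # Lightweight stand-in for the per-item __initializeFromRaw step.
    return {"value": raw}

def load_all(pages, processes=2, threads=4):
    # Process pool for the CPU-heavy page parsing...
    with Pool(processes=processes) as ppool:
        parsed_pages = ppool.map(parse_page, pages)
    # ...then a thread pool for the cheap per-item initialisation,
    # where process-spawn overhead would not pay off.
    flat = [raw for page in parsed_pages for raw in page]
    with ThreadPool(threads) as tpool:
        return tpool.map(initialize_item, flat)

if __name__ == "__main__":
    print(load_all([[1, 2], [3, 4]]))
    # [{'value': 2}, {'value': 4}, {'value': 6}, {'value': 8}]
```

Splitting the work this way matches the measurements in the table: processes where the work is CPU-bound, threads where it is light enough that spawn cost would dominate.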

I was wondering why looping over the OrderedDict items takes that long, as there is no complex data filtering/mapping applied.

@casabre casabre changed the title Adding multiprocessing to _get_paged_resources Adding multiprocessing Mar 3, 2022
Owner

@dixudx dixudx left a comment


Looks good to me. Thanks for such a HUGE improvement.

@casabre
Contributor Author

casabre commented Apr 5, 2022

@dixudx when are you planning a new release? I would need it quite soon, because Git cloning won't be an option in the future :).

@dixudx
Owner

dixudx commented Apr 6, 2022

@casabre Sure. Already shipped with 0.8.0. You can install with pip now.

@dixudx dixudx mentioned this pull request Apr 9, 2022