
Parallelize update script #115

Open · wants to merge 3 commits into gh-pages
Conversation

@dflupu commented Apr 30, 2022

This addresses ethereum/solidity#12421

It does so by adding the workerpool node module, moving a bunch of code into a separate worker file and running that code in the worker pool. It also adds a --max-worker-processes flag to configure the number of workers.

Edit: I initially missed a pool.terminate() call. Adding it required a bit of refactoring: as the code was, there was no way to know when all of the async branches had finished. This last commit makes the diff quite a bit harder to read, but that was hard to avoid since the script is a bit dated and mixes callbacks with async/await.
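
For readers unfamiliar with workerpool, the wiring described above looks roughly like this. This is a minimal sketch rather than the PR's exact code; the batch handling and the worker's return value are illustrative:

// --- update.js (sketch) --- dispatch batches of files to a pool of worker processes
const workerpool = require('workerpool')

async function processInPool (values, batchSize, maxWorkerProcesses) {
  // maxWorkerProcesses comes from the --max-worker-processes flag
  const pool = workerpool.pool('./update-worker.js', { maxWorkers: maxWorkerProcesses })
  const batchPromises = []
  for (let i = 0; i < values.length; i += batchSize) {
    // 'workerMain' is the function name registered by the worker file
    batchPromises.push(pool.exec('workerMain', [values.slice(i, i + batchSize)]))
  }
  const results = await Promise.all(batchPromises)
  await pool.terminate()
  return results.flat()
}

// --- update-worker.js (sketch) --- register the function the pool calls by name
const workerpool = require('workerpool')
workerpool.worker({
  workerMain: async function (batch) {
    // hash each binary in the batch and return the resulting list entries
    return batch.map(file => ({ path: file }))
  }
})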

@axic requested a review from cameel June 1, 2022 11:28
@cameel requested review from matheusaaguiar and r0qs November 3, 2022 14:19
@r0qs (Member) left a comment

Hi, thank you for your contribution. I only have some comments :)

Comment on lines +14 to +15
const readFile = util.promisify(fs.readFile)
const readdir = util.promisify(fs.readdir)

Why not use fs.promises instead?

E.g.

const { readdir, readFile } = require('node:fs/promises')
// ...
const files = await readdir(path)

)
const pool = workerpool.pool('./update-worker.js', {
  maxWorkers: options.maxWorkerProcesses,
  // Threads would be ideal, but do not work as expected due to an issue with the swarmhash module
@r0qs (Member) commented Nov 3, 2022

What issue? Could you provide more information about the problem you hit with the thread workerType?
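
For reference, workerpool selects its backend via the workerType option, so a thread-based variant would only differ in the pool options. A sketch, assuming the swarmhash problem were resolved (the maxWorkers value is illustrative):

const workerpool = require('workerpool')

// 'process' runs the worker file in child processes (what this PR does);
// 'thread' would run it on worker_threads instead, avoiding per-process startup cost.
const pool = workerpool.pool('./update-worker.js', {
  maxWorkers: 4,
  workerType: 'thread' // other values: 'auto', 'web', 'process'
})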

pool.terminate()
}

main();

Remove the extra semicolon. Also, please remember to run npm run lint -- --fix

Suggested change
main();
main()

const latestRelease = parsedList
  .slice()
  .reverse()
  .filter(function (listEntry) { return listEntry.prerelease === null })
@r0qs (Member) commented Nov 3, 2022

It seems that you changed the behavior of the processDir function. Some entries that do not have a prerelease now contain the prerelease field set to null in the final list.json instead of not having the field at all. The prerelease field must not be set in the parsedList when there is no prerelease.

For example, this one:

    {
      "path": "soljson-v0.1.1+commit.6ff4cd6.js",
      "version": "0.1.1",
      "prerelease": null,
      "build": "commit.6ff4cd6",
      "longVersion": "0.1.1+commit.6ff4cd6",
      "keccak256": "0xd8b8c64f4e9de41e6604e6ac30274eff5b80f831f8534f0ad85ec0aff466bb25",
      "sha256": "0xfc54135edfe07d835216f87fa6c3b9049e1415c2825027b926daad018e09269f",
      "urls": [
        "bzzr://8f3c028825a1b72645f46920b67dca9432a87fc37a8940a2b2ce1dd6ddc2e29b",
        "dweb:/ipfs/QmPPGxsMtQSEUt9hn9VAtrtBTa1u9S5KF1myw78epNNFkx"
      ]
    },

should be without the prerelease field:

    {
      "path": "soljson-v0.1.1+commit.6ff4cd6.js",
      "version": "0.1.1",
      "build": "commit.6ff4cd6",
      "longVersion": "0.1.1+commit.6ff4cd6",
      "keccak256": "0xd8b8c64f4e9de41e6604e6ac30274eff5b80f831f8534f0ad85ec0aff466bb25",
      "sha256": "0xfc54135edfe07d835216f87fa6c3b9049e1415c2825027b926daad018e09269f",
      "urls": [
        "bzzr://8f3c028825a1b72645f46920b67dca9432a87fc37a8940a2b2ce1dd6ddc2e29b",
        "dweb:/ipfs/QmPPGxsMtQSEUt9hn9VAtrtBTa1u9S5KF1myw78epNNFkx"
      ]
    },
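
One way to preserve the old shape is to attach the field only when a prerelease is actually present. A sketch; makeListEntry and the exact field set are illustrative, not the script's real helper:

// Build a list entry without a prerelease key when there is no prerelease,
// so list.json keeps its previous shape.
function makeListEntry (parsed) {
  const entry = {
    path: parsed.path,
    version: parsed.version,
    build: parsed.build,
    longVersion: parsed.longVersion
  }
  if (parsed.prerelease !== null && parsed.prerelease !== undefined) {
    entry.prerelease = parsed.prerelease
  }
  return entry
}

With that, the filter in the diff above would check for the field's absence (e.g. listEntry.prerelease === undefined) rather than comparing against null.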

const batchPromises = []

for (let i = 0; i < values.length; i += batchSize) {
  batchPromises.push(pool.exec('workerMain', [dir, values.slice(i, i + batchSize), oldList]))
@r0qs (Member) commented Nov 3, 2022

Did you run any profiling to see how your changes improve the performance of the current code? The execution time of your implementation does not show any significant improvement on my machine:
Current implementation: 702.21s user 16.80s system 106% cpu 11:15.54 total
PR implementation: 670.77s user 17.92s system 104% cpu 11:01.69 total

Maybe we could also move some of the I/O operations to the workers instead of only moving the hash operations? For example, we could have a worker per binary directory and then multiple child workers per directory (each one reading/processing a subset of the files in that directory). The main worker thread for each directory would then aggregate the child/file workers' results and write the respective list.[json,js,txt] files, or something like that. What do you think?
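
Something along these lines, purely as an illustration; processFiles is a hypothetical worker function and the aggregation/writing step is only indicated by a comment:

const { readdir } = require('node:fs/promises')

// Sketch: the files of one directory are read and hashed inside the workers,
// and the caller only aggregates the returned entries.
async function processDir (dir, pool, batchSize) {
  const files = await readdir(dir)
  const batches = []
  for (let i = 0; i < files.length; i += batchSize) {
    batches.push(pool.exec('processFiles', [dir, files.slice(i, i + batchSize)]))
  }
  const entries = (await Promise.all(batches)).flat()
  // aggregate here and write this directory's list.json / list.js / list.txt
  return entries
}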

I ran a quick profile here, which seems to hint that your code spends most of its time waiting for file descriptors to become ready. Not sure if this is actually the case, though:

PR code:

[C++]:
   ticks  total  nonlib   name
   1987   71.3%   93.9%  epoll_pwait@@GLIBC_2.6
     21    0.8%    1.0%  __write@@GLIBC_2.2.5
     14    0.5%    0.7%  fwrite@@GLIBC_2.2.5
     12    0.4%    0.6%  recvmsg@@GLIBC_2.2.5
     11    0.4%    0.5%  std::basic_ostream<char, std::char_traits<char> >& std::__ostream_insert<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*, long)
      9    0.3%    0.4%  std::ostream::sentry::sentry(std::ostream&)
      8    0.3%    0.4%  __libc_malloc@@GLIBC_2.2.5
      
      
[Summary]:
   ticks  total  nonlib   name
     37    1.3%    1.7%  JavaScript
   2078   74.5%   98.3%  C++
     72    2.6%    3.4%  GC
    673   24.1%          Shared libraries

Current code:

[C++]:
   ticks  total  nonlib   name
  18358    2.9%    4.9%  epoll_pwait@@GLIBC_2.6
    901    0.1%    0.2%  __dynamic_cast
    793    0.1%    0.2%  __cxxabiv1::__vmi_class_type_info::__do_dyncast(long, __cxxabiv1::__class_type_info::__sub_kind, __cxxabiv1::__class_type_info const*, void const*, __cxxabiv1::__class_type_info const*, void const*, __cxxabiv1::__class_type_info::__dyncast_result&) const
    711    0.1%    0.2%  __libc_malloc@@GLIBC_2.2.5

[Summary]:
   ticks  total  nonlib   name
   1923    0.3%    0.5%  JavaScript
  24553    3.9%    6.5%  C++
  12993    2.1%    3.5%  GC
  256700   40.5%          Shared libraries
  349903   55.3%          Unaccounted

I think it would be beneficial to have some benchmark showing the performance improvement of your implementation.
