Performance notes are wrong #6

slonka · 2015-06-19T14:36:16Z

Hello,

If it took 30 minutes to process a 9.1 GB file, the throughput was about 5.06 MB/s
(9.1 GB ≈ 9100 MB, and 9100 MB / (30 * 60 s) ≈ 5.06 MB/s).
5400 rpm disks have 40 MB/s read/write throughput, so they are not the bottleneck. To speed things up you can use lbzip2, which is multi-threaded (it helped me a lot).
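
For anyone who wants to re-check the figure, here is a minimal sketch of the same calculation. The function name `throughput_mb_per_s` and the `mb_per_gb` parameter are only illustrative; the second call assumes 1 GB = 1024 MB instead of 1000 MB, which changes the result only slightly.

```python
def throughput_mb_per_s(size_gb: float, minutes: float, mb_per_gb: int = 1000) -> float:
    """Average throughput in MB/s when size_gb is processed in the given minutes."""
    return size_gb * mb_per_gb / (minutes * 60)


# Figures from the comment above: a 9.1 GB file processed in 30 minutes.
decimal = throughput_mb_per_s(9.1, 30)        # assumes 1 GB = 1000 MB
binary = throughput_mb_per_s(9.1, 30, 1024)   # assumes 1 GB = 1024 MB

print(f"{decimal:.2f} MB/s (1000 MB per GB)")
print(f"{binary:.2f} MB/s (1024 MB per GB)")
```

Either way the result is roughly 5 MB/s, well below the 40 MB/s quoted for the disk.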

Best regards

mirkonasato · 2015-06-19T22:21:15Z

Well, when creating a db, neo4j is not simply writing data sequentially to the disk, so I wouldn't expect it to reach the max throughput. In my tests the disk made a huge difference, so I called it the "critical factor" (not the "bottleneck"). But thanks for suggesting lbzip2, I will add it to the README.

slonka · 2015-06-20T08:06:51Z

I only described what I thought was wrong with the description of the first step of the import process, which is read -> regexp -> write (creating the intermediate XML file). The second part is still running (7 hours so far, and it has only imported 70M links). I have no idea how you managed to do it in only 10 minutes.

I've run jvisualvm, iotop and htop, and discovered that at the beginning the process is mostly running read / write operations (org.neo4j.io.fs.StoreFileChannel.write / read). It creates 50K links every 3 seconds, and at that pace the whole thing would take 1 hour and 40 minutes. After a while it starts to run more MuninnPageCache operations (flushAtIORatio, parkUntilEvictionRequired) and slows down significantly.
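
As a rough sanity check on those numbers, the sketch below recomputes the rates from the figures quoted in this comment. It assumes the 7 hours and 70M links describe the second step only, so the observed rate is an average over the run so far; the variable names are illustrative.

```python
# Initial pace: 50K links every 3 seconds.
initial_rate = 50_000 / 3                      # links per second

# Projected duration at that pace: 1 hour 40 minutes.
projected_seconds = 1 * 3600 + 40 * 60

# Total number of links implied by that projection.
implied_total_links = initial_rate * projected_seconds

# Observed so far: 70M links in 7 hours (average over the whole run).
observed_rate = 70_000_000 / (7 * 3600)        # links per second

slowdown = initial_rate / observed_rate

print(f"initial rate:  {initial_rate:,.0f} links/s")
print(f"observed rate: {observed_rate:,.0f} links/s")
print(f"implied total: {implied_total_links:,.0f} links")
print(f"slowdown:      {slowdown:.1f}x")
```

Under these assumptions the average pace is about six times slower than the initial one.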

In the first part of the operation the CPU usage was maxed out (95-100% on 4 cores) and the read and write throughput were 10 MB/s and 5 MB/s respectively. Now, in the second part, the CPU usage is really low (around 10%) and the write throughput is around 10 MB/s.

iostat shows that the CPU is mostly waiting on IO or idle:

avg-cpu: %user %nice %system %iowait %steal %idle
5,10 0,00 1,58 43,36 0,00 49,96

I think there is something wrong with the caching mechanism in neo4j. What do you think?

mirkonasato · 2015-06-21T22:06:58Z

Fair point, I didn't realise you were talking about the first step only.

Yes, the second part, i.e. creating the graph db, is where having an SSD really helps. I haven't investigated much, but I guess it must be doing a lot of random access operations.
