This repository has been archived by the owner on Dec 1, 2023. It is now read-only.

Performance notes are wrong #6

Open
slonka opened this issue Jun 19, 2015 · 3 comments

Comments

@slonka

slonka commented Jun 19, 2015

Hello,

if it took 30 minutes to process a 9.1 GB file, the throughput was about 5.06 MB/s
(9.1 GB ≈ 9100 MB; 9100 MB / (30 * 60 s) ≈ 5.06 MB/s).
5400 RPM disks have around 40 MB/s read/write throughput, so they are not the bottleneck. To speed things up you can use lbzip2, which is multi-threaded (it helped me a lot).
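The arithmetic above can be checked with a minimal sketch. The function name `throughput_mb_per_s` and its `mb_per_gb` parameter are made up for illustration; the only inputs taken from the thread are the 9.1 GB file size and the 30 minute run time. Both the decimal (1000 MB) and binary (1024 MB) readings of "GB" are shown, since the result differs slightly between them.

```python
def throughput_mb_per_s(size_gb: float, minutes: float, mb_per_gb: int = 1000) -> float:
    """Average throughput in MB/s for processing `size_gb` in `minutes`."""
    return size_gb * mb_per_gb / (minutes * 60)


# Figures quoted in the thread: a 9.1 GB file processed in 30 minutes.
decimal = throughput_mb_per_s(9.1, 30)            # assuming 1 GB = 1000 MB
binary = throughput_mb_per_s(9.1, 30, 1024)       # assuming 1 GB = 1024 MB

print(f"{decimal:.2f} MB/s (decimal GB), {binary:.2f} MB/s (binary GB)")
```

Either way the result is roughly 5 MB/s, well below the 40 MB/s figure quoted for the disk.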

Best regards

@mirkonasato
Owner

Well, when creating a db, neo4j is not simply writing data sequentially to the disk, so I wouldn't expect it to reach the max throughput. In my tests the disk made a huge difference, so I called it the "critical factor" (not "bottleneck"). But thanks for suggesting lbzip2, will add it to the README.

@slonka
Author

slonka commented Jun 20, 2015

I only described what I thought was wrong with the description of the first step of the import process, which is read -> regexp -> write (creating the intermediate XML file). The second part is still running (7 hours so far, and it has only imported 70M links). I have no idea how you managed to do it in only 10 minutes.

I've run jvisualvm, iotop and htop, and discovered that at the beginning the process is mostly running read / write operations (org.neo4j.io.fs.StoreFileChannel.write / read). It creates about 50K links every 3 seconds, and at that pace the whole thing would take 1 hour and 40 minutes. After a while it starts to run more MuninnPageCache operations (flushAtIORatio, parkUntilEvictionRequired) and slows down significantly.
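As a rough sketch of the estimate above: the 50K links per 3 seconds rate and the 7 hours / 70M links figures are from this thread, while the total of about 100M links is only inferred from the "1 hour and 40 minutes" estimate and is an assumption here. The helper name `eta_seconds` is hypothetical.

```python
def eta_seconds(total_items: int, items_per_batch: int, seconds_per_batch: float) -> float:
    """Time needed to process `total_items` at a steady batch rate."""
    return total_items * seconds_per_batch / items_per_batch


# Assumed total: ~100M links, inferred from the 1 h 40 min estimate.
TOTAL_LINKS = 100_000_000

initial_eta = eta_seconds(TOTAL_LINKS, 50_000, 3)   # early phase: 50K links / 3 s
initial_rate = 50_000 / 3                           # links per second
observed_rate = 70_000_000 / (7 * 3600)             # 70M links after 7 hours

print(f"ETA at the initial pace: {initial_eta / 60:.0f} minutes")
print(f"Initial rate:  {initial_rate:,.0f} links/s")
print(f"Observed rate: {observed_rate:,.0f} links/s")
print(f"Slowdown: {initial_rate / observed_rate:.1f}x")
```

Under these assumptions the average pace over the 7 hours is about six times slower than the pace seen at the start.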

In the first part of the operation the CPU usage was maxed out (95-100% on 4 cores) and the read and write throughput were 10 MB/s and 5 MB/s respectively. Now, in the second part, the CPU usage is really low (around 10%) and the write throughput is around 10 MB/s.

iostat shows that the CPU is mostly waiting on IO or idle:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5,10    0,00    1,58   43,36    0,00   49,96
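To put a number on "mostly waiting on IO or idle", here is a small sketch that parses the two lines above. It assumes the decimal commas come from the system locale and converts them to dots; `parse_avg_cpu` is a made-up helper, not part of iostat or any library.

```python
def parse_avg_cpu(header_line: str, value_line: str) -> dict[str, float]:
    """Map iostat's avg-cpu column names to their values."""
    names = header_line.split()[1:]  # drop the leading "avg-cpu:" label
    # Assumption: decimal commas are a locale artifact, so "43,36" means 43.36.
    numbers = [float(v.replace(",", ".")) for v in value_line.split()]
    return dict(zip(names, numbers))


header = "avg-cpu: %user %nice %system %iowait %steal %idle"
values = "5,10 0,00 1,58 43,36 0,00 49,96"

cpu = parse_avg_cpu(header, values)
waiting_or_idle = cpu["%iowait"] + cpu["%idle"]
print(f"iowait + idle = {waiting_or_idle:.2f}%")
```

With the sample above, iowait and idle together account for a little over 93% of CPU time.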

I think there is something wrong with the caching mechanism in neo4j. What do you think?

@mirkonasato
Owner

Fair point, I didn't realise you were talking about the first step only.

Yes, the second part, i.e. creating the graph db, is where having an SSD really helps. I haven't really investigated much, but I guess it must be doing a lot of random access operations.
