Package bigcsvreader offers a multi-threaded approach to reading a large CSV file, in order to reduce the time spent reading and processing it.
It spawns multiple goroutines, each reading a piece of the file.
Rows that are read are put into channels, one channel per spawned goroutine, so the processing of those rows can be parallelized as well.
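The idea described above can be sketched with the standard library alone. This is a minimal illustration, not the package's actual implementation or API: the names chunkBounds, readChunks and countRows are hypothetical, the input is held in memory instead of being read from a file, parse errors simply end a chunk, and rows are assumed never to contain embedded newlines.

```go
package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
	"sync"
)

// chunkBounds splits data into at most n byte ranges, each ending right
// after a '\n', so that no row is cut in half.
// Assumption: rows never contain embedded newlines.
func chunkBounds(data []byte, n int) [][2]int {
	var bounds [][2]int
	start := 0
	for i := 1; i <= n && start < len(data); i++ {
		end := len(data)
		if i < n {
			end = len(data) * i / n
			if end < start {
				end = start
			}
			// Move forward to the end of the current line.
			if j := bytes.IndexByte(data[end:], '\n'); j >= 0 {
				end += j + 1
			} else {
				end = len(data)
			}
		}
		if end > start {
			bounds = append(bounds, [2]int{start, end})
		}
		start = end
	}
	return bounds
}

// readChunks starts one goroutine per chunk and returns one channel per
// goroutine. Each channel is closed once its chunk has been parsed.
func readChunks(data []byte, n int) []<-chan []string {
	bounds := chunkBounds(data, n)
	chans := make([]<-chan []string, 0, len(bounds))
	for _, b := range bounds {
		ch := make(chan []string, 16)
		chans = append(chans, ch)
		go func(chunk []byte) {
			defer close(ch)
			r := csv.NewReader(bytes.NewReader(chunk))
			for {
				row, err := r.Read()
				if err != nil { // io.EOF or a parse error ends this chunk
					return
				}
				ch <- row
			}
		}(data[b[0]:b[1]])
	}
	return chans
}

// countRows consumes every channel in its own goroutine, which is where
// per-row processing would run in parallel.
func countRows(data []byte, n int) int {
	var (
		wg    sync.WaitGroup
		mu    sync.Mutex
		total int
	)
	for _, ch := range readChunks(data, n) {
		wg.Add(1)
		go func(ch <-chan []string) {
			defer wg.Done()
			for range ch {
				mu.Lock()
				total++
				mu.Unlock()
			}
		}(ch)
	}
	wg.Wait()
	return total
}

func main() {
	data := []byte("a,b\nc,d\ne,f\ng,h\n")
	fmt.Println(countRows(data, 2)) // prints 4
}
```

Because every chunk has its own channel, a consumer can attach one processing goroutine to each channel without any further coordination between readers.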
$ go get -u github.com/actforgood/bigcsvreader
Please refer to this example.
go test -race -timeout=15m -benchmem -benchtime=2x -bench .
goos: darwin
goarch: amd64
pkg: github.com/actforgood/bigcsvreader
cpu: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
Benchmark50000Rows_50Mb_withBigCsvReader-8 2 8101972370 ns/op 61740600 B/op 100267 allocs/op
Benchmark50000Rows_50Mb_withGoCsvReaderReadAll-8 2 67070393110 ns/op 68507768 B/op 100043 allocs/op
Benchmark50000Rows_50Mb_withGoCsvReaderReadOneByOneAndReuseRecord-8 2 69045793069 ns/op 57606112 B/op 50018 allocs/op
Benchmark50000Rows_50Mb_withGoCsvReaderReadOneByOneProcessParalell-8 2 8286623971 ns/op 61607272 B/op 100037 allocs/op
The benchmarks use a file of ~50 MB and simulate 1ms of processing for each row.
bigcsvreader was launched with 8 goroutines.
The other benchmarks use Go's encoding/csv package directly.
As you can see, bigcsvreader reads and processes all rows in ~8s.
The Go standard csv package reads and processes all rows in ~67s (sequentially).
Reading with the Go standard csv package and processing the rows in parallel takes about as long as bigcsvreader (so this strategy is a good alternative to this package).
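A rough sketch of that alternative strategy, using only the standard library, might look as follows. The function names processParallel and countProcessed and the worker count are assumptions made for illustration; the benchmark's own code may be organized differently.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"strings"
	"sync"
	"sync/atomic"
)

// processParallel reads rows sequentially with encoding/csv and fans them
// out to `workers` goroutines. process is a hypothetical per-row callback
// and must be safe for concurrent use.
func processParallel(r io.Reader, workers int, process func(row []string)) error {
	rows := make(chan []string, workers)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for row := range rows {
				process(row)
			}
		}()
	}

	// ReuseRecord is left at its default (false), so every row is a fresh
	// slice that can safely be handed to another goroutine.
	cr := csv.NewReader(r)
	var readErr error
	for {
		row, err := cr.Read()
		if err == io.EOF {
			break
		}
		if err != nil {
			readErr = err
			break
		}
		rows <- row
	}
	close(rows)
	wg.Wait()
	return readErr
}

// countProcessed runs processParallel over csvData and reports how many
// rows reached the callback.
func countProcessed(csvData string, workers int) (int64, error) {
	var processed int64
	err := processParallel(strings.NewReader(csvData), workers, func(row []string) {
		atomic.AddInt64(&processed, 1)
	})
	return processed, err
}

func main() {
	n, err := countProcessed("1,a\n2,b\n3,c\n", 2)
	fmt.Println(n, err) // prints: 3 <nil>
}
```

Reading stays sequential here; only the per-row work is spread across goroutines, which pays off when processing dominates the total time.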
The ReadAll API has the disadvantage of keeping all rows in memory.
Reading rows one by one with the ReuseRecord flag set has the advantage of fewer allocations, but the cost of reading rows sequentially.
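To make the ReuseRecord trade-off concrete, here is a small sketch using encoding/csv. The helper name firstColumn is hypothetical. With ReuseRecord set, Read may return a slice that shares its backing array with the previous call's result, so a row should be consumed (or copied) before the next Read, which is why this mode fits sequential processing best.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"strings"
)

// firstColumn reads rows one by one with ReuseRecord enabled and collects
// the first field of every row.
func firstColumn(data string) ([]string, error) {
	r := csv.NewReader(strings.NewReader(data))
	r.ReuseRecord = true // Read may reuse the same backing slice on each call

	var out []string
	for {
		row, err := r.Read()
		if err == io.EOF {
			return out, nil
		}
		if err != nil {
			return nil, err
		}
		// Keep only the field value (an immutable string), never the row
		// slice itself: the next Read call may overwrite it.
		out = append(out, row[0])
	}
}

func main() {
	col, err := firstColumn("1,a\n2,b\n3,c\n")
	fmt.Println(col, err) // prints: [1 2 3] <nil>
}
```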
Note: It's a coincidence that the timing of the parallelized version was roughly equal to the sequential timing divided by the number of started goroutines. You should not take this as a rule.
Below are some process stats captured with the Unix top command while running each benchmark.
Bench | %CPU | MEM |
---|---|---|
Benchmark50000Rows_50Mb_withBigCsvReader | 21.6 | 8156K |
Benchmark50000Rows_50Mb_withGoCsvReaderReadAll | 5.3 | 67M |
Benchmark50000Rows_50Mb_withGoCsvReaderReadOneByOneAndReuseRecord | 10.1 | 5704K |
(!) Known issue: This package does not work as expected with multiline columns.
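The snippet below illustrates what a multiline column looks like and why it is awkward for any reader that splits a file into pieces. It assumes, without confirmation from the package's source, that the pieces are delimited at newline characters. The helper name rowsAndNewlines is hypothetical.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// rowsAndNewlines returns the number of records encoding/csv finds in data
// and the number of newline characters data contains.
func rowsAndNewlines(data string) (int, int, error) {
	records, err := csv.NewReader(strings.NewReader(data)).ReadAll()
	if err != nil {
		return 0, 0, err
	}
	return len(records), strings.Count(data, "\n"), nil
}

func main() {
	// The second record has a quoted field that spans two lines.
	data := "id,note\n1,\"line one\nline two\"\n"
	fmt.Println(rowsAndNewlines(data)) // prints: 2 3 <nil>
}
```

The file holds two records but three newlines, so a newline alone does not reliably mark the end of a row. A piece that begins at the newline inside the quoted field would start in the middle of a record.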
This package is released under an MIT license. See LICENSE.