Whole CSV binary documents can be decoded with decode/1,2.
decode/1 assumes default RFC4180-style options, that is:
- Fields are separated by commas.
- Fields are optionally enclosed in double quotes.
- Double quotes in enclosed fields are quoted by another double quote.
decode/2 allows using custom options:
#{separator => Separator, % $, (default), $; or $\t
  enclosure => Enclosure, % $" (default), $' or 'undefined'
  quote => Quote}         % $" (default), $', $\\ or 'undefined'
Restrictions for option combinations:
- If Enclosure is undefined (i.e., no enclosing), Quote must also be undefined.
- If Enclosure is $", Quote can be $" or $\\.
- If Enclosure is $', Quote can be $' or $\\.
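The restrictions above can be written down as a small check. This is only a sketch: the module decode_opts_check and the function valid/2 are hypothetical names and not part of hnc_csv, and the check covers only the Enclosure/Quote combinations listed above.

```erlang
-module(decode_opts_check).
-export([valid/2]).

%% Hypothetical helper, not part of hnc_csv: returns true if the given
%% Enclosure/Quote pair is one of the combinations listed above.
-spec valid(Enclosure :: char() | undefined,
            Quote :: char() | undefined) -> boolean().
valid(undefined, undefined) ->
    %% No enclosing, hence no quoting either.
    true;
valid($", Quote) ->
    Quote =:= $" orelse Quote =:= $\\;
valid($', Quote) ->
    Quote =:= $' orelse Quote =:= $\\;
valid(_, _) ->
    false.
```

For example, valid($", $\\) is accepted by this check, while valid(undefined, $") is rejected.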
Lines are separated by \r, \n or \r\n. Empty lines are ignored by the decoder.
The result of decoding is a list of CSV lines, which are in turn lists of CSV fields, which are in turn binaries representing the field values.
Assume the following CSV data:
a,b,c
"d,d","e""e","f
f"
In an Erlang binary, this will look like:
1> CsvBinary = <<"a,b,c\r\n\"d,d\",\"e\"\"e\",\"f\r\nf\"\r\n">>.
<<"a,b,c\r\n\"d,d\",\"e\"\"e\",\"f\r\nf\"\r\n">>
Decoded with decode/1, this will become:
2> hnc_csv:decode(CsvBinary).
[[<<"a">>,<<"b">>,<<"c">>],
[<<"d,d">>,<<"e\"e">>,<<"f\r\nf">>]]
hnc_csv provides the functions decode_fold/3,4, decode_filter/2,3, decode_map/2,3, decode_filtermap/2,3 and decode_foreach/2,3, which allow decoding and processing decoded lines in one operation, much like the lists functions foldl/3, filter/2, map/2, filtermap/2 and foreach/2.
In fact, decode/1,2 is implemented via decode_fold/3,4.
Those functions take a provider as their first parameter. A provider here means a 0-arity function which, when called, returns either a tuple where the first element is a chunk of binary data and the second is a new provider function for the next chunk of data, or the atom end_of_data to indicate that the provider has delivered all data.
hnc_csv comes with two convenience functions, get_binary_provider/1,2 and get_file_provider/1,2, which return providers for binaries or files, respectively.
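To make the provider contract concrete, the following sketch shows what a consumer of a provider might look like. The module provider_drain and the function drain/1 are hypothetical and not part of hnc_csv; they only rely on the contract described above.

```erlang
-module(provider_drain).
-export([drain/1]).

%% A provider as described above: a 0-arity function returning either
%% a chunk plus the next provider, or the atom end_of_data.
-type provider() :: fun(() -> {binary(), provider()} | end_of_data).

%% Hypothetical helper, not part of hnc_csv: calls the provider until
%% it reports end_of_data and returns all delivered chunks in order.
-spec drain(provider()) -> [binary()].
drain(Provider) when is_function(Provider, 0) ->
    case Provider() of
        end_of_data ->
            [];
        {Chunk, Next} when is_binary(Chunk), is_function(Next, 0) ->
            [Chunk | drain(Next)]
    end.
```

Such a helper can be handy for checking a custom provider in isolation before passing it to one of the decode functions.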
The following is an implementation of a provider which delivers data taken from a given list of binaries:
-module(example_provider).
-export([get_list_provider/1]).

get_list_provider(L) ->
    fun() -> list_provider(L) end.

list_provider([]) ->
    end_of_data;
list_provider([Bin|More]) when is_binary(Bin) ->
    {Bin, fun() -> list_provider(More) end}.
get_list_provider/1 creates the initial provider, which is a call to list_provider/1 wrapped in a 0-arity function.
list_provider/1 is the actual implementation of the provider. It returns end_of_data when the list given as argument is exhausted. Otherwise, it returns a tuple with the head element of the list as its first element and, as its second element, a 0-arity function wrapping a call to itself with the tail of the list.
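Along the same lines, a provider could also deliver a single binary in pieces of a fixed maximum size. The following is a minimal sketch of that idea. The module chunk_provider and its functions are hypothetical names chosen for illustration and say nothing about how the built-in providers work.

```erlang
-module(chunk_provider).
-export([get_chunk_provider/2]).

%% Hypothetical example, not part of hnc_csv: returns a provider that
%% delivers Bin in pieces of at most Size bytes.
get_chunk_provider(Bin, Size)
  when is_binary(Bin), is_integer(Size), Size > 0 ->
    fun() -> chunk_provider(Bin, Size) end.

chunk_provider(<<>>, _Size) ->
    end_of_data;
chunk_provider(Bin, Size) when byte_size(Bin) =< Size ->
    %% Last chunk: the following provider only reports end_of_data.
    {Bin, fun() -> end_of_data end};
chunk_provider(Bin, Size) ->
    %% Split off the first Size bytes and defer the rest.
    <<Chunk:Size/binary, Rest/binary>> = Bin,
    {Chunk, fun() -> chunk_provider(Rest, Size) end}.
```

Note that chunk boundaries may fall in the middle of a field or a line ending, which is the situation the chunked list example in this section also exercises.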
This provider can then be used as follows, for example to count the lines and fields in the CSV data which the provider delivers:
1> Provider = example_provider:get_list_provider([<<"a,b">>, <<",c\r">>,
                                                  <<"\nd,">>, <<"e,f">>,
                                                  <<"\r\n">>]).
#Fun<example_provider.0.64990923>
2> hnc_csv:decode_fold(Provider,
                       fun(Line, {LCnt, FCnt}) -> {LCnt+1, FCnt+length(Line)} end,
                       {0, 0}).
{2,6}
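The same kind of fold can also collect the decoded lines themselves. The sketch below uses decode_fold/3 with the argument order shown in the example above. It assumes that the fold function is called with the lines in document order, and it builds a one-chunk provider inline instead of relying on a convenience function. The module name fold_collect is hypothetical.

```erlang
-module(fold_collect).
-export([decode_via_fold/1]).

%% Sketch: collect all decoded lines of a CSV binary with decode_fold/3.
%% Assumption: lines are passed to the fold function in document order,
%% so the accumulator has to be reversed at the end.
decode_via_fold(CsvBinary) when is_binary(CsvBinary) ->
    %% One-chunk provider following the provider contract: deliver the
    %% whole binary first, then end_of_data.
    Provider = fun() -> {CsvBinary, fun() -> end_of_data end} end,
    Reversed = hnc_csv:decode_fold(Provider,
                                   fun(Line, Acc) -> [Line | Acc] end,
                                   []),
    lists:reverse(Reversed).
```

Prepending to the accumulator and reversing once at the end is the usual Erlang idiom for building a list in a fold.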
For more complex scenarios than what the built-in functions provide for, the functions decode_init/0,1,2, decode_add_data/2, decode_next_line/1 and decode_flush/1 can be used together to decode and process CSV documents.
- decode_init/0,1,2 creates a decoder state to be used in the other functions listed above.
- decode_add_data/2 adds another chunk of unprocessed data to the state and returns an updated state.
- decode_next_line/1 decodes and returns the next line, together with an updated state. If the data in the state is exhausted, the atom end_of_data is returned instead of a line.
- decode_flush/1 returns any as yet unfinished line in the given state, together with any yet unprocessed data. If there is no unfinished line in the state, the atom undefined is returned instead of a line.
In fact, decode_fold/4 is implemented using those functions.
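How these functions might fit together is sketched below for a list of binary chunks. The exact argument order and return shapes are assumptions here, since the text above does not spell them out: decode_add_data/2 is assumed to take the data first and the state second, decode_next_line/1 is assumed to return {Line | end_of_data, State}, and decode_flush/1 is assumed to return {Line | undefined, Rest}. The module lowlevel_decode is a hypothetical name.

```erlang
-module(lowlevel_decode).
-export([decode_chunks/1]).

%% Sketch only. Assumed shapes (not confirmed by the text above):
%%   hnc_csv:decode_add_data(Data, State) -> State1
%%   hnc_csv:decode_next_line(State)      -> {Line | end_of_data, State1}
%%   hnc_csv:decode_flush(State)          -> {Line | undefined, Rest}
decode_chunks(Chunks) when is_list(Chunks) ->
    feed(Chunks, hnc_csv:decode_init(), []).

feed([], State, Acc) ->
    %% No more input: pick up a last line without a line ending, if any.
    case hnc_csv:decode_flush(State) of
        {undefined, _Rest} -> lists:reverse(Acc);
        {Line, _Rest} -> lists:reverse([Line | Acc])
    end;
feed([Chunk | More], State0, Acc0) ->
    State1 = hnc_csv:decode_add_data(Chunk, State0),
    {State2, Acc1} = take_lines(State1, Acc0),
    feed(More, State2, Acc1).

%% Pull all lines that are complete with the data added so far.
take_lines(State0, Acc) ->
    case hnc_csv:decode_next_line(State0) of
        {end_of_data, State1} -> {State1, Acc};
        {Line, State1} -> take_lines(State1, [Line | Acc])
    end.
```

A loop like this leaves room for things the built-in functions do not cover, such as stopping early or interleaving decoding with other work between chunks.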
CSV documents can be encoded with encode/1,2.
encode/1 assumes default RFC4180-style options, that is:
- Fields are separated by commas.
- Fields are optionally enclosed in double quotes.
- Double quotes in enclosed fields are quoted by another double quote.
- Lines are separated by \r\n.
encode/2 allows using custom options:
#{separator => Separator,    % $, (default), $; or $\t
  enclosure => Enclosure,    % $" (default), $' or 'undefined'
  quote => Quote,            % $" (default), $', $\\ or 'undefined'
  enclose => Enclose,        % 'optionally' (default), 'never' or 'always'
  end_of_line => EndOfLine}  % <<"\r\n">> (default), <<"\n">> or <<"\r">>
Restrictions for option combinations:
- If Enclose is never (i.e., no enclosing), both Enclosure and Quote must be undefined.
- If Enclose is optionally or always, Enclosure and Quote must not be undefined.
- If Enclosure is $", Quote can be $" or $\\.
- If Enclosure is $', Quote can be $' or $\\.
The input for encoding is a list of CSV lines, which are in turn lists of CSV fields, which are in turn binaries representing the field values.
The result is a CSV binary document.
Assume the following CSV structure:
1> Csv = [[<<"a">>,<<"b">>,<<"c">>],[<<"d,d">>,<<"e\"e">>,<<"f\r\nf">>]].
Encoded with encode/1, this will become:
2> hnc_csv:encode(Csv).
<<"a,b,c\r\n\"d,d\",\"e\"\"e\",\"f\r\nf\"\r\n">>
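Since the encoded binary is the same one used in the decoding example, encoding and decoding with default options should round-trip for this structure. The following sketch checks that using only encode/1 and decode/1. The module roundtrip_example is a hypothetical name.

```erlang
-module(roundtrip_example).
-export([roundtrip/0]).

%% Encodes the example structure with encode/1, decodes the result again
%% with decode/1 (both with default options) and compares it with the
%% original structure.
roundtrip() ->
    Csv = [[<<"a">>, <<"b">>, <<"c">>],
           [<<"d,d">>, <<"e\"e">>, <<"f\r\nf">>]],
    Encoded = hnc_csv:encode(Csv),
    Decoded = hnc_csv:decode(Encoded),
    Decoded =:= Csv.
```

Whether a round trip holds for other inputs or for custom option combinations is not stated here and would need to be checked separately.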
- Maria Scott (Maria-12648430)
- Jan Uhlig (juhlig)