ETL Utilities for Clojure. This library began as a set of functions for working with data on disk, such as database dumps and log files. It has since grown to include other utilities.
IO and File utilities.
Returns a Reader or an InputStream, respectively, that will read from the given string.
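As a rough illustration of the idea, a string can be wrapped in a Reader or an InputStream with plain Java interop. The function names below are hypothetical and the UTF-8 encoding is an assumption; this is a sketch, not the library's own code.

```clojure
(import '(java.io StringReader ByteArrayInputStream))

;; Hypothetical helper names, chosen for illustration only.
(defn string-reader-sketch
  "Returns a java.io.Reader that reads from the string s."
  [^String s]
  (StringReader. s))

(defn string-input-stream-sketch
  "Returns a java.io.InputStream over the bytes of s.
  Assumption: the string is encoded as UTF-8."
  [^String s]
  (ByteArrayInputStream. (.getBytes s "UTF-8")))

(slurp (string-reader-sketch "hello"))        ;; => "hello"
(slurp (string-input-stream-sketch "hello"))  ;; => "hello"
```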
Reads a fixed-length string.
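One way such a fixed-length read could look is sketched below. The name `read-fixed-length-sketch` and the choice to return a shorter string when the input ends early are assumptions made for this example, not documented library behavior.

```clojure
(import '(java.io Reader StringReader))

;; Hypothetical helper: reads up to n characters from a Reader.
(defn read-fixed-length-sketch
  "Reads n characters from rdr and returns them as a string.
  Assumption: if the input ends early, the shorter string is returned."
  [^Reader rdr n]
  (let [buf (char-array n)]
    (loop [off 0]
      (if (< off n)
        ;; A single .read call may return fewer chars than requested,
        ;; so keep reading until the buffer is full or input ends.
        (let [cnt (.read rdr buf (int off) (int (- n off)))]
          (if (neg? cnt)
            (String. ^chars buf (int 0) (int off))
            (recur (+ off cnt))))
        (String. ^chars buf (int 0) (int off))))))

(read-fixed-length-sketch (StringReader. "abcdef") 3) ;; => "abc"
```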
Changes the permissions on a file by shelling out to the chmod command.
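Shelling out to chmod can be sketched with `clojure.java.shell`. The function name and the decision to throw on a non-zero exit code are assumptions for illustration; the library's actual signature may differ.

```clojure
(require '[clojure.java.shell :as shell])

;; Hypothetical helper; requires a chmod binary on the PATH.
(defn chmod-sketch
  "Runs `chmod perms path`. Returns true on success and throws
  if chmod exits with a non-zero status (an assumed design choice)."
  [perms path]
  (let [{:keys [exit err]} (shell/sh "chmod" (str perms) (str path))]
    (when-not (zero? exit)
      (throw (RuntimeException. (str "chmod failed: " err))))
    true))

;; Usage: (chmod-sketch "600" "/tmp/some-file.txt")
```

`clojure.java.shell/sh` uses futures internally, so a standalone script may want to call `(shutdown-agents)` before exiting.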
Creates the given directory, just returning true if the given directory already exists (as opposed to throwing an exception).
Tests if a file exists.
Establishes a symlink for a file.
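The existence test and the symlink can be approximated with the JDK alone. The names below are hypothetical, and using `java.nio.file.Files` (rather than, say, shelling out to `ln`) is an assumption; the source does not say how the library does it.

```clojure
(import '(java.io File)
        '(java.nio.file Files)
        '(java.nio.file.attribute FileAttribute))

;; Hypothetical helpers, not the library's API.
(defn exists-sketch?
  "True if something exists at path."
  [path]
  (.exists (File. (str path))))

(defn symlink-sketch
  "Creates a symbolic link at `link` pointing to `target`."
  [target link]
  (Files/createSymbolicLink
   (.toPath (File. (str link)))
   (.toPath (File. (str target)))
   ;; createSymbolicLink is variadic in Java; pass an empty array.
   (make-array FileAttribute 0)))
```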
freeze uses Java serialization to turn an object into a byte array. thaw does the opposite: it takes a byte array and deserializes it back into an object.
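The round trip through Java serialization can be sketched as follows. The `-sketch` names are hypothetical stand-ins; only standard `java.io` classes are used.

```clojure
(import '(java.io ByteArrayOutputStream ByteArrayInputStream
                  ObjectOutputStream ObjectInputStream))

;; Hypothetical stand-ins for freeze and thaw.
(defn freeze-sketch
  "Serializes obj (which must be java.io.Serializable) to a byte array."
  [obj]
  (let [baos (ByteArrayOutputStream.)]
    (with-open [oos (ObjectOutputStream. baos)]
      (.writeObject oos obj))
    (.toByteArray baos)))

(defn thaw-sketch
  "Deserializes an object from a byte array produced by freeze-sketch."
  [^bytes bs]
  (with-open [ois (ObjectInputStream. (ByteArrayInputStream. bs))]
    (.readObject ois)))

(thaw-sketch (freeze-sketch {:a 1 :b [1 2 3]})) ;; => {:a 1, :b [1 2 3]}
```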
Uses Java serialization to write an object to the given file, truncating if it exists.
Deserializes a serialized object from a file.
Ensures a directory path exists (recursively), doing nothing if it already exists.
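A minimal sketch of the "create if missing" behavior, assuming `java.io.File` is sufficient; the function name is hypothetical.

```clojure
(import '(java.io File))

;; Hypothetical helper, not the library's API.
(defn mkdirs-sketch
  "Ensures the directory at path exists, creating missing parents.
  Returns true if the directory already existed or was created."
  [path]
  (let [dir (File. (str path))]
    (or (.isDirectory dir)
        (.mkdirs dir))))
```

Calling it twice on the same path returns true both times, since the second call finds the directory already in place.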
Compresses a string, returning the compressed bytes.
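The source does not state which compression format is used. The sketch below assumes gzip and UTF-8 purely for illustration, with hypothetical function names, and adds a matching decompression helper to show the round trip.

```clojure
(import '(java.io ByteArrayOutputStream ByteArrayInputStream)
        '(java.util.zip GZIPOutputStream GZIPInputStream))

;; Assumptions: gzip format, UTF-8 encoding, hypothetical names.
(defn gzip-string-sketch
  "Compresses the string s and returns the gzip bytes."
  [^String s]
  (let [baos (ByteArrayOutputStream.)]
    (with-open [gz (GZIPOutputStream. baos)]
      (.write gz (.getBytes s "UTF-8")))
    (.toByteArray baos)))

(defn gunzip-string-sketch
  "Decompresses gzip bytes back into a string."
  [^bytes bs]
  (with-open [gz (GZIPInputStream. (ByteArrayInputStream. bs))]
    (slurp gz :encoding "UTF-8")))
```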
This can be used in divide and conquer scenarios where you want to process different segments of a single file in parallel. It takes an input file name and a desired block size. Block boundaries will be close to the desired size: the size is used as a seek position, and any line remnant present at that position is read, so that each block ends cleanly at a line boundary.
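The seek-then-finish-the-line approach described above can be sketched like this. The function name and the `[start end]` return shape are assumptions for this example, not the library's actual API.

```clojure
(import '(java.io RandomAccessFile))

;; Hypothetical helper illustrating line-aligned block boundaries.
(defn file-blocks-sketch
  "Returns a vector of [start end] byte offsets (end exclusive) that
  cover the file, each block ending at a line boundary.
  block-size is assumed to be a positive number of bytes."
  [file-name block-size]
  (with-open [raf (RandomAccessFile. (str file-name) "r")]
    (let [len (.length raf)]
      (loop [start 0
             blocks []]
        (if (>= start len)
          blocks
          (let [target (min len (+ start block-size))]
            ;; Jump to the desired boundary...
            (.seek raf target)
            ;; ...then consume the rest of the line found there.
            (when (< target len)
              (.readLine raf))
            (let [end (.getFilePointer raf)]
              (recur end (conj blocks [start end])))))))))
```

For a 15-byte file containing `aaaa\nbbbb\ncccc\n` and a block size of 6, this yields `[[0 10] [10 15]]`: the first seek lands inside the second line, so that line is finished before the block is closed.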
Returns a lazy sequence of lines from a RandomAccessFile up to a given limit. If a line spans the limit, the entire line will be returned, so that a valid line is always returned.
Returns a sequence of lines from the file across the given starting and ending positions.
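Reading lines up to a byte limit, and reading a start/end range on top of that, might look like the sketch below. The names are hypothetical. Note that `RandomAccessFile.readLine` maps each byte to one character, so this sketch is only faithful for single-byte encodings.

```clojure
(import '(java.io RandomAccessFile))

;; Hypothetical helpers, not the library's API.
(defn lines-up-to-sketch
  "Lazy sequence of lines from raf, starting at its current position.
  A line is emitted whenever it starts before `limit`, so a line that
  spans the limit is still returned whole."
  [^RandomAccessFile raf limit]
  (lazy-seq
   (when (< (.getFilePointer raf) limit)
     (when-let [line (.readLine raf)]
       (cons line (lines-up-to-sketch raf limit))))))

(defn lines-in-range-sketch
  "Returns the lines of file-name between byte offsets start and end."
  [file-name start end]
  (with-open [raf (RandomAccessFile. (str file-name) "r")]
    (.seek raf start)
    ;; Realize the sequence before the file is closed.
    (doall (lines-up-to-sketch raf end))))
```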
lang/make-periodic-invoker can be used to easily create ‘progress’ indicators or bars:
    (let [total    1000
          period   100
          progress (lang/make-periodic-invoker
                    period
                    (fn [val & [is-done]]
                      (if (= is-done :done)
                        (printf "All Done! %d\n" val)
                        (printf "So far we did %d, we are %3.2f%% complete.\n"
                                val
                                (* 100.0 (/ val 1.0 total))))))]
      (dotimes [ii total]
        ;; do some work / processing here
        (progress))
      (progress :final :done))
Produces the following output:
    So far we did 100, we are 10.00% complete.
    So far we did 200, we are 20.00% complete.
    So far we did 300, we are 30.00% complete.
    So far we did 400, we are 40.00% complete.
    So far we did 500, we are 50.00% complete.
    So far we did 600, we are 60.00% complete.
    So far we did 700, we are 70.00% complete.
    So far we did 800, we are 80.00% complete.
    So far we did 900, we are 90.00% complete.
    So far we did 1000, we are 100.00% complete.
    All Done! 1000
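To make the behavior in this example concrete, here is one possible way to build such an invoker. It is inferred only from the example and its output, not from the library source: the name `periodic-invoker-sketch` and the handling of `:final` are assumptions.

```clojure
;; Hypothetical re-creation of the behavior shown in the example.
(defn periodic-invoker-sketch
  "Returns a function that counts its calls and invokes (f count)
  on every `period`-th call. Calling it with :final as the first
  argument invokes f with the current count plus the remaining args."
  [period f]
  (let [counter (atom 0)]
    (fn [& args]
      (if (= :final (first args))
        (apply f @counter (rest args))
        (let [n (swap! counter inc)]
          (when (zero? (mod n period))
            (f n)))))))
```

Keeping the counter in an atom means the returned function can be called from several threads without losing counts, although the order of the printed messages is then not guaranteed.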
Module for working with line-oriented data files in-situ on disk. These tools allow you to create (somewhat) arbitrary indexes into a file and walk through the indexed values.
Given the tab-delimited file file.txt:

    99	line with larger key
    1	is is the second line
    2	this is a line
    3	this is another line
    99	duplicated line for key
We can create an index on the id column:
(index-file! "file.txt" ".file.txt.id-idx" #(first (.split % "\t")))
That index can then be used to read groups of records from the file with the same key values:
(record-blocks-via-index "file.txt" ".file.txt.id-idx")
    (["1\tis is the second line"]
     ["2\tthis is a line"]
     ["3\tthis is another line"]
     ["99\tline with larger key"
      "99\tduplicated line for key"])
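For intuition, the grouping that the index provides can be mimicked in memory with core functions. This is only an analogue of the result shape: the real functions work against the file on disk, and both the function name and the string-based key ordering used here are assumptions.

```clojure
(require '[clojure.string :as str])

;; The sample file's lines, held in memory for this sketch.
(def sample-lines
  ["99\tline with larger key"
   "1\tis is the second line"
   "2\tthis is a line"
   "3\tthis is another line"
   "99\tduplicated line for key"])

;; Hypothetical in-memory analogue of reading record blocks by key.
(defn record-blocks-sketch
  "Sorts lines by key-fn (as strings) and groups lines sharing a key."
  [key-fn lines]
  (->> lines
       (sort-by key-fn)       ; stable sort keeps file order within a key
       (partition-by key-fn)
       (map vec)))

(record-blocks-sketch #(first (str/split % #"\t")) sample-lines)
```

With the sample lines this produces the same four groups as the output above, with both `99` records in the last group.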
clj-etl-utils is available via Clojars.
UTF and BOM: http://unicode.org/faq/utf_bom.html
How to pick a random sample from a list
US Census Tigerline Data: Zip Codes
This code is covered under the same license as Clojure.
Kyle Burton
Paul Santa Clara
Tim Visher