ETL Utilities for Clojure. This library began with functions that worked with data on disk, such as database dumps and log files, at least that was the original purpose of the library, it has since grown to include other utilities.
NB: this library was created in 2010, since then many of the functions and utilities here are availabile in other libraries (eg: camel-snake-kebab).
IO and File utilities.
Returns a Reader or an InputStream, respectively, that will read from the given string.
Reads a fixed-length string.
Changes the permissions on a file by shelling out to the chmod
command.
Creates the given directory, just returning true if the given directory already exists (as opposed to throwing an exception).
Tests if a file exists.
Establishes a symlink for a file.
freeze invokes the java serialization and returns a byte array. Thaw does the opposite: takes a byte array and deserializes it.
Uses Java serialization to write an object to the given file, truncating if it exists.
Deserializes a serialized object from a file.
Ensures a directory path exists (recursively), doing nothing if it already exists.
Compress a string, returning the bytes.
This can be used in divide and conquer scenarios where you want to process different segments of a single file in parallel. It takes an input file name and a desired block size. Block boundaries will be close to the desired size - the size is used as a seek position, any line remnant present at that position is read, such that a given block will end cleanly at a line boundary.
Returns a lazy sequence of lines from a RandomAccessFile up to a given limit. If a line spans the limit, the entire line will be returned, so that a valid line is always returned.
Returns a sequence of lines from the file across the given starting and ending positions.
Semi-structured Text parsing library. The library uses a command set to extract portions of a text based document. Commands are instructions such as: start, end forward-char, backward-char, forward-past, rewind-to and so on. The landmarks can be either literal or logical (regular expressions, 'types', etc). This extractor can often succeed in situations where there is mixed media (javascript embedded in html), or inconsisten structure (the outputs of pdf to text). It should be able to operate on a document as long as there are identifiable sets of landmarks.
;; this example extracts the last line from the document (string)
(lp/extract
(lp/make-parser "This is some text. There is some in the middle.\nThere is some towards the end, but it is not this.\nA few sentences in all.")
[[:end nil]
[:rewind-to "\n"]]
[:end nil])
lang/make-periodic-invoker
can be used to easily create 'progress' indicators or bars
(let [total 1000
period 100
started-at-ms (.getTime (java.util.Date.))
progress (make-periodic-invoker
period
(fn [status counter]
(let [elapsed-ms (- (.getTime (java.util.Date.)) started-at-ms)
elapsed-secs (/ elapsed-ms 1000)
num-remaining (- total counter)
rate-per-sec (/ counter elapsed-secs)
eta-secs (/ num-remaining rate-per-sec)]
(cond
(= status :final)
(printf "All Done! %d processed in %d seconds at %3.2f/s\n" counter (long elapsed-secs) (double rate-per-sec))
:else
(printf "So far we've completed %d of %d, we are %3.2f%% complete at %3.2f per second, we should be done in %d seconds.\n"
counter
total
(double (* 100.0 (/ counter total 1.0)))
(double rate-per-sec)
(long eta-secs))))))]
(dotimes [ii total]
;; do some work / processing here
(Thread/sleep ^long (rand-nth [0 1 2 3 4 5]))
(progress))
(progress :final)
:done)
Produces the following output:
So far we've completed 100 of 1000, we are 10.00% complete at 273.22 per second, we should be done in 3 seconds.
So far we've completed 200 of 1000, we are 20.00% complete at 294.55 per second, we should be done in 2 seconds.
So far we've completed 300 of 1000, we are 30.00% complete at 300.30 per second, we should be done in 2 seconds.
So far we've completed 400 of 1000, we are 40.00% complete at 298.73 per second, we should be done in 2 seconds.
So far we've completed 500 of 1000, we are 50.00% complete at 305.06 per second, we should be done in 1 seconds.
So far we've completed 600 of 1000, we are 60.00% complete at 308.01 per second, we should be done in 1 seconds.
So far we've completed 700 of 1000, we are 70.00% complete at 306.35 per second, we should be done in 0 seconds.
So far we've completed 800 of 1000, we are 80.00% complete at 308.76 per second, we should be done in 0 seconds.
So far we've completed 900 of 1000, we are 90.00% complete at 303.85 per second, we should be done in 0 seconds.
So far we've completed 1000 of 1000, we are 100.00% complete at 303.67 per second, we should be done in 0 seconds.
All Done! 1000 processed in 3 seconds at 303.58/s
Static reference data and lookup tables
- USA State Names and Abbreviations
- USA Airport Codes
- USA Phone Number Area Codes
- ISO Country Codes
Ready made Regular expressions to match items like numbers, iso country codes, USA postal codes, USA state names, USA phone numbers, ipv4 adresses, etc.
Sequence utilities, eg: make-stream-sampler
, make-reservoir-sampler
, reservoir-sample-seq
.
Text manipulation helper functions, eg: md5->string
, string->md5
, sha1->string
, string->sha1
, substr
that supports negative indexes to read from the right hand side, human-readable-byte-count
, canonical-phone-number
, encode-base64
, decode-base64
Module for working with line-oriented data files in-situ on disk. These tools allow you to create (somewhat) arbitrary indexes into a file and efficiently find lines based on the indexed values.
Given the tab delimited file file.txt
:
99 line with larger key
1 is is the second line
2 this is a line
3 this is another line
99 duplicated line for key
We can create an index on the id
column id:
(index-file! "file.txt" ".file.txt.id-idx" #(first (.split % "\t" 2)))
That index can then be used to read groups of records from the file with the same key values:
(record-blocks-via-index "file.txt" ".file.txt.id-idx")
( [ "1\tis is the second line" ]
[ "2\tthis is a line" ]
[ "3\tthis is another line" ]
[ "99\tline with larger key"
"99\tduplicated line for key" ] )
clj-etl-utils
is available via Clojars:
https://clojars.org/com.github.kyleburton/clj-etl-utils
UTF and BOM
http://unicode.org/faq/utf\_bom.html
How to pick a random sample from a list
US Census Tigerline Data: Zip Codes
This code is covered under the same as Clojure.
Kyle Burton [email protected]
Paul Santa Clara [email protected]
Tim Visher [email protected]