Skip to content

Latest commit

 

History

History
130 lines (100 loc) · 4.77 KB

README.md

File metadata and controls

130 lines (100 loc) · 4.77 KB

clj-rabin

A Clojure implementation of a rolling Rabin hash + chunker.

Hacking

Bring up a REPL:

lein with-profile +dev 

Use the CDC notebook to try chunking a dataset:

(comment
  (def chunks
    (atom nil))

  (load-dataset! "data/natural_images" chunks)

  (let [rows->long (fn [r] (into {} (map (fn [[k v]] [k (long v)])) r))
        stats-all (chunk-ds->agg-stats @chunks)
        stats-cdc (-> @chunks (ds/unique-by-column :sha256) chunk-ds->agg-stats)
        stats-per-file (->> @chunks
                            (rd/group-by-column-agg :file {:block-count (rd/count-distinct :sha256)})
                            (rd/aggregate {:avg-blocks-per-file (rd/mean :block-count)}))

        ; maps containing aggregate vals
        all (first (map rows->long (ds/rows stats-all)))
        cdc (first (map rows->long (ds/rows stats-cdc)))
        blocks (first (map rows->long (ds/rows stats-per-file)))
        reduced-bytes (- (:total-bytes all)
                         (:total-bytes cdc))]
    {:all all
     :cdc cdc
     :blocks blocks
     :diff {:reduced-bytes reduced-bytes
            :reduced-percent (->> (:total-bytes all)
                                  (/ reduced-bytes)
                                  (* 100)
                                  double)}}))

@chunks is a tech.ml.dataset with all the chunks from the data loaded. Each chunk also gets a SHA-256 hash to ensure that the block is actually unique:

{:all {:total-bytes 359403192, :total-blocks 36628, :avg-block-size-bytes 9812},
 :cdc {:total-bytes 179670613, :total-blocks 18311, :avg-block-size-bytes 9812},
 :blocks {:avg-blocks-per-file 2},
 :diff {:reduced-bytes 179732579, :reduced-percent 50.00862068025261}}
Datasets tested

Rabin parameter overrides are shown, otherwise assume defaults from clj-rabin.hash/default-ctx.

Audio

Overall: terrible ratios

Million song dataset

Audio codec info:

Input #0, mp3, from 'MP3-Example/Blues/Blues-TRADWSG128F4259317.mp3':
  Duration: 00:00:30.04, start: 0.025057, bitrate: 96 kb/s
  Stream #0:0: Audio: mp3, 44100 Hz, stereo, fltp, 96 kb/s
    Metadata:
      encoder         : LAME3.99r
    Side data:
      replaygain: track gain - -5.900000, track peak - unknown, album gain - unknown, album peak - unknown, 
{:all {:total-bytes 544824272, :total-blocks 977662, :avg-block-size-bytes 557},
 :cdc {:total-bytes 543861899, :total-blocks 43428, :avg-block-size-bytes 12523},
 :blocks {:avg-blocks-per-file 651},
 :diff {:reduced-bytes 962373, :reduced-percent 0.1766391567811795}}

Indian Music Raga

Audio codec info:

Input #0, wav, from 'raga/asavari02.wav':
  Duration: 00:03:46.82, bitrate: 705 kb/s
  Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 44100 Hz, 1 channels, s16, 705 kb/s
{:all {:total-bytes 1105687312, :total-blocks 73031, :avg-block-size-bytes 15139},
 :cdc {:total-bytes 1105687295, :total-blocks 73014, :avg-block-size-bytes 15143},
 :blocks {:avg-blocks-per-file 890},
 :diff {:reduced-bytes 17, :reduced-percent 1.537505207439696E-6}}
Images

Overall: awesome ratios on images

Natural images (jpeg)

{:all {:total-bytes 359403192, :total-blocks 36628, :avg-block-size-bytes 9812},
 :cdc {:total-bytes 179670613, :total-blocks 18311, :avg-block-size-bytes 9812},
 :blocks {:avg-blocks-per-file 2},
 :diff {:reduced-bytes 179732579, :reduced-percent 50.00862068025261}}

Ripe and unripe tomatoes (jpeg)

{:all {:total-bytes 122864035, :total-blocks 151613, :avg-block-size-bytes 810},
 :cdc {:total-bytes 52075246, :total-blocks 29899, :avg-block-size-bytes 1741},
 :blocks {:avg-blocks-per-file 428},
 :diff {:reduced-bytes 70788789, :reduced-percent 57.6155495788495}}

Resources / Credits