Find the hashed-string needle in the haystack of text from Project Gutenberg
- Big Dataset - 12 GB Gzip archive containing most English TXT files from Project Gutenberg
- magnet:?xt=urn:btih:992f2d347a182e48a6aa9c5a2bfc39b665b2c469&dn=books.tar.gz
- Small Dataset - 680 MB ISO containing a smaller subset of eBooks from Project Gutenberg
- magnet:?xt=urn:btih:35476861704f8088828dd1a0918cc6fee714ac3b&dn=smolbooks.iso
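Either way, the data has to be unpacked before it can be searched. The tarball extracts with plain tar (a minimal sketch; the books/ output directory is our own choice), and the ISOs can be loop-mounted as shown further below:
$ mkdir -p books && tar -xzf books.tar.gz -C books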
Find the substring in the set of books from Project Gutenberg whose MD5 hash matches the given hash.
We typically search the 700 MB PG2003-08.ISO dataset, available at https://www.gutenberg.org/ebooks/11220:
$ wget https://www.gutenberg.org/files/11220/PG2003-08.ISO
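The ISO then needs to be mounted and its contents copied somewhere writable before the files can be rearranged. A minimal sketch, assuming a Linux host; the /mnt/pgiso mount point and the /local/directory working copy are our own choices:
$ sudo mkdir -p /mnt/pgiso && mkdir -p /local/directory
$ sudo mount -o loop PG2003-08.ISO /mnt/pgiso
$ cp -r /mnt/pgiso/. /local/directory
$ sudo umount /mnt/pgiso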
These next commands pull all the text files out of their subdirectories into the top of /local/directory, then delete the now-empty directories.
$ find /local/directory -type f -iname "*.txt" -exec mv '{}' /local/directory ';'
$ find /local/directory -type d -empty -delete
$ tar --remove-files -cvzf archive.tar.gz /local/directory/*.txt
or, if you have pigz (a parallel implementation of gzip) and pv installed:
$ tar --remove-files --use-compress-program="pigz --best --recursive | pv" -cf archive.tar.gz /local/directory/*.txt
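Since --remove-files deletes the originals, you may later want to stream the archived text straight back to STDOUT rather than re-extract it to disk. A sketch, assuming GNU tar; it plugs into the pipe-based search used below (gutenhash_streamer.py is defined in the next section):
$ tar -xzf archive.tar.gz --to-stdout | python3 gutenhash_streamer.py c29c193a59d27ed00448b204579a0874 8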
To generate a target hash for testing, choose a random 19-character substring from one of the .txt files:
$ function generateMD5() {
    # pick a random byte offset into file $1 (RANDOM is only 15 bits, so combine two)
    offset=$(( (RANDOM * 32768 + RANDOM) % $(wc -c < "$1") ))
    # read $2 bytes starting at that offset
    substr=$(dd if="$1" bs=1 skip=$offset count=$2 2>/dev/null)
    echo -ne "file: $1\noffset: $offset\nsubstring: '$substr'\nhash: "
    # quote "$substr" so internal whitespace is hashed verbatim;
    # 'md5' is the macOS tool -- on Linux, pipe to md5sum instead
    echo -n "$substr" | md5
}
$ generateMD5 /home/pi/data/01/etext02.txt 19
file: /home/pi/data/01/etext02.txt
offset: 72
substring: 'blah blah blah blah'
hash: d41d8cd98f00b204e9800998ecf8427e
File: gutenhash_streamer.py
import hashlib
import sys

target = sys.argv[1]       # target MD5, as a hex string
strlen = int(sys.argv[2])  # length of the substring to find

# read all of stdin as raw bytes (hashlib.md5() requires bytes)
file_contents = sys.stdin.buffer.read()

# a substring of length strlen can start no later than len - strlen
last_possible_index = len(file_contents) - strlen + 1
for i in range(last_possible_index):
    substr = file_contents[i:i + strlen]
    md5sum = hashlib.md5(substr).hexdigest()
    if md5sum == target:
        print("found!: '{}'".format(substr.decode(errors="replace")))
        sys.exit(0)
Print a text file with `cat` and pipe its STDOUT to gutenhash_streamer.py's STDIN:
$ cat u/etext09/somefile.txt | python3 gutenhash_streamer.py c29c193a59d27ed00448b204579a0874 8
found!: '34567890'
Use the `find` command to cat every text file to STDOUT and pipe the combined stream to gutenhash_streamer.py's STDIN:
$ find u/ -iname "*.txt" -exec cat {} \; | python3 gutenhash_streamer.py c29c193a59d27ed00448b204579a0874 8
found!: '34567890'
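To sanity-check a hit by hand, hash the candidate substring yourself and compare against the target (a sketch using md5sum; macOS ships md5 instead):
$ echo -n '34567890' | md5sum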
This trivial streaming Python solution:
1. imports the *hashlib* and *sys* libraries; *sys* supplies the command-line arguments, the *stdin* handle, and process exit
2. parses the target hash (as a string) and the substring length (as an int)
3. reads all of *stdin* into the *file_contents* variable in memory, as bytes, since *hashlib.md5()* requires bytes
4. loops over every possible substring start index, hashes each candidate with *md5()*, tests it against the *target* hash, and exits on the first match
Your challenge is to make this performant and parallelizable!
The substring length and target file will change during the competition.
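One possible starting point, offered purely as a sketch (the file name gutenhash_parallel.py, the chunking scheme, and the worker count are our own choices, not part of the challenge): split the input into chunks that overlap by strlen - 1 bytes, so no substring straddling a boundary is missed, and fan the chunks out to a multiprocessing pool.
File: gutenhash_parallel.py
import hashlib
import multiprocessing
import sys

def search_chunk(args):
    # scan one chunk for a substring whose MD5 matches the target
    chunk, target, strlen = args
    for i in range(len(chunk) - strlen + 1):
        substr = chunk[i:i + strlen]
        if hashlib.md5(substr).hexdigest() == target:
            return substr
    return None

def main():
    target = sys.argv[1]
    strlen = int(sys.argv[2])
    data = sys.stdin.buffer.read()

    nprocs = multiprocessing.cpu_count()
    chunk_size = max(len(data) // nprocs, strlen)
    # overlap consecutive chunks by strlen - 1 bytes so a substring
    # that straddles a chunk boundary is still fully inside one chunk
    jobs = [(data[start:start + chunk_size + strlen - 1], target, strlen)
            for start in range(0, len(data), chunk_size)]

    with multiprocessing.Pool(nprocs) as pool:
        for result in pool.imap_unordered(search_chunk, jobs):
            if result is not None:
                print("found!: '{}'".format(result.decode(errors="replace")))
                pool.terminate()
                return

if __name__ == "__main__":
    main()
It drops into the same pipelines as the streaming version:
$ find u/ -iname "*.txt" -exec cat {} \; | python3 gutenhash_parallel.py c29c193a59d27ed00448b204579a0874 8
Note that passing byte slices to the workers copies data between processes; sharing offsets into a memory-mapped file would cut that overhead, and distributing work across machines follows the same overlap idea.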