Implementation of "Explore-by-Example: An Automatic Query Steering Framework for Interactive Data Exploration"[1]
Independent study project by Alexia Lou, under the supervision of Professor Rachel Pottinger at the University of British Columbia
- start the program:
to run from runnable .jar file: double click explore-by-example.jar
to run from source .jar file: extract files from explore-by-example-src.jar, import the project into Eclipse, and run it
- dataset
choose where to import the dataset from (i.e. a local file or a MySQL database server)
- for the complete PhotoObjectAll dataset, please send a request using this page
- the dataset I used is attached separately
- for datasets smaller than 5 GB:
- please log in to this page (see login.txt)
- select "DR9" under Context
- click [submit] to send query
- the retrieved dataset will be saved in "MyDB"
- to download the dataset: select "MyDB", then the table to download; in the table details on the right, click [download], then go to "Output"
- wait for the data to be imported into the program, after which a window will be displayed with samples retrieved from the database
- select relevant samples, click [confirm] [^1]
- after the program finishes retrieving the next set of samples, another sample selection window will be displayed, along with a window showing various information about the model trained on the currently labeled samples
- press [exit] at any time to exit the program; the labeled samples and the trained model will be saved automatically in the directory where the program is saved.
####Acceptable Input Format:
- local file format: Arff, C4.5, CSV, libsvm, svm light, Binary serialized instances, XRFF
- remote database server: MySQL
####Parameter Input:
- cluster base: number of clusters to divide the dataset into at the first iteration
- cluster growth factor: factor by which the number of clusters grows at each iteration
- FN penalty: number of extra samples retrieved around each false negative data object (used when the total number of relevant objects discovered during the discovery phase <= the number of false negative predictions made at the current iteration)
- FN distance: minimum distance from the false negative cluster center to retrieve samples from (used when the total number of relevant objects discovered during the discovery phase > the number of false negative predictions made at the current iteration)
- max boundaries: maximum number of samples retrieved around each boundary
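The parameter rules above can be sketched in a few lines of Java. This is only an illustration, not the project's actual code: the class and method names are hypothetical, and the multiplicative growth of the cluster count (base times growth factor raised to the iteration index, starting at 0) is an assumption read from the word "factor". The choice between FN penalty and FN distance follows the two conditions stated in the list.

```java
// Hypothetical helper; names are illustrative and not taken from the project's source.
class ParameterSketch {

    enum FnStrategy { FN_PENALTY, FN_DISTANCE }

    // Assumption: the cluster count grows multiplicatively,
    // so iteration 0 uses exactly clusterBase clusters.
    static int clusterCount(int clusterBase, double growthFactor, int iteration) {
        return (int) Math.round(clusterBase * Math.pow(growthFactor, iteration));
    }

    // Mirrors the conditions in the parameter list:
    // relevant objects discovered <= false negatives  -> use FN penalty
    // relevant objects discovered >  false negatives  -> use FN distance
    static FnStrategy chooseFnStrategy(int relevantDiscovered, int falseNegatives) {
        return relevantDiscovered <= falseNegatives
                ? FnStrategy.FN_PENALTY
                : FnStrategy.FN_DISTANCE;
    }

    public static void main(String[] args) {
        // Print the assumed cluster count for the first few iterations.
        for (int i = 0; i < 4; i++) {
            System.out.println("iteration " + i + ": " + clusterCount(3, 2.0, i) + " clusters");
        }
        System.out.println(chooseFnStrategy(2, 5));
        System.out.println(chooseFnStrategy(8, 5));
    }
}
```

If the program instead grows the cluster count additively, only `clusterCount` would need to change.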
- at each iteration:
- all samples labeled so far
- prediction stats of current classifier
- decision tree built based on current training data
- when exiting the program (stored in the same directory as the program files):
- trained classifier (output.model)
- training set labeled by user (train.csv)
The implementation is based on the optimized version of each space exploration phase
"Error, not in CLASSPATH?":
please refer to this web page for detailed information -
Known cause(s) of exception:
- if the original dataset has an attribute named "class" or "cluster"
- may throw an exception for datasets containing non-numeric data
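Both known causes can be checked before loading a file. The following is a minimal sketch for a CSV input, not part of the program itself: the class and method names are hypothetical, and it assumes a plain comma-separated file without quoted fields, with attribute names compared case-insensitively.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical pre-check; not part of the project's source.
class DatasetPrecheck {

    // Attribute names listed above as a known cause of exceptions.
    static final List<String> RESERVED = Arrays.asList("class", "cluster");

    // Returns the header columns whose name collides with a reserved name.
    // Assumption: simple comma-separated header, no quoted fields.
    static List<String> reservedColumns(String headerLine) {
        List<String> hits = new ArrayList<>();
        for (String raw : headerLine.split(",")) {
            String name = raw.trim();
            if (RESERVED.contains(name.toLowerCase(Locale.ROOT))) {
                hits.add(name);
            }
        }
        return hits;
    }

    // Returns the zero-based indices of cells in one data row
    // that do not parse as a number (empty cells count as non-numeric).
    static List<Integer> nonNumericColumns(String dataLine) {
        List<Integer> bad = new ArrayList<>();
        String[] cells = dataLine.split(",", -1); // -1 keeps trailing empty cells
        for (int i = 0; i < cells.length; i++) {
            try {
                Double.parseDouble(cells[i].trim());
            } catch (NumberFormatException e) {
                bad.add(i);
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        System.out.println(reservedColumns("objID,ra,dec,class"));
        System.out.println(nonNumericColumns("1237,336.57,abc,4"));
    }
}
```

Renaming any column reported by `reservedColumns`, and dropping or encoding the columns reported by `nonNumericColumns`, before importing should avoid both cases.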
[1]. Dimitriadou, Kyriaki, Olga Papaemmanouil, and Yanlei Diao. "Explore-by-Example: An Automatic Query Steering Framework for Interactive Data Exploration." (2014).
[^1] query used to select Relevant_Samples.csv from PhotoObjAllTop3000.csv:
SELECT DISTINCT * FROM PhotoObjAllTop3000
WHERE (
(rowc between 1165.711 and 1199.086 and not (colc between 344.8247 and 399.4847) and not (ra between 336.566699604645 and 336.572752552481))
or (colc between 344.8247 and 399.4847 and not (rowc between 1165.711 and 1199.086) and not (ra between 336.566699604645 and 336.572752552481))
or (ra between 336.566699604645 and 336.572752552481 and not(rowc between 1165.711 and 1199.086) and not (colc between 344.8247 and 399.4847)))
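
Read as a predicate, the query keeps a row when exactly one of the three range conditions on rowc, colc and ra holds. The Java sketch below restates that logic for a single row; the class and method names are hypothetical, and it ignores SQL NULL handling.

```java
// Hypothetical restatement of the footnote query as a Java predicate.
class RelevanceQuerySketch {

    // SQL BETWEEN is inclusive on both ends.
    static boolean between(double value, double low, double high) {
        return value >= low && value <= high;
    }

    // True when exactly one of the three ranges from the query matches.
    static boolean isRelevant(double rowc, double colc, double ra) {
        boolean inRowc = between(rowc, 1165.711, 1199.086);
        boolean inColc = between(colc, 344.8247, 399.4847);
        boolean inRa = between(ra, 336.566699604645, 336.572752552481);
        int matches = (inRowc ? 1 : 0) + (inColc ? 1 : 0) + (inRa ? 1 : 0);
        return matches == 1;
    }

    public static void main(String[] args) {
        // Only rowc is in range here, so the row counts as relevant.
        System.out.println(isRelevant(1170.0, 100.0, 10.0));
        // rowc and colc are both in range, so the row is rejected.
        System.out.println(isRelevant(1170.0, 350.0, 10.0));
    }
}
```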