phd-thesis-iii

Done

Implement mRMR to reduce dataset - done (JMIM, JMI, mRMR using scikit-learn)

Implement Ensemble - done (using scikit-learn; both a soft ensemble with accuracy-based weights and a hard ensemble with one vote per classifier)
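
To make the difference between the two ensemble schemes concrete, here is a minimal, library-free sketch. It is an illustration only, not the repository's code: the function names, the three classifiers and all numbers are made up. scikit-learn offers the same idea through `VotingClassifier` with `voting="hard"` or `voting="soft"` and a `weights` argument.

```python
from collections import Counter

def hard_vote(predictions):
    """Majority vote: each classifier contributes exactly one vote per sample."""
    # predictions: one list of predicted labels per classifier, all the same length
    return [Counter(sample).most_common(1)[0][0] for sample in zip(*predictions)]

def soft_vote(probabilities, accuracies):
    """Average class probabilities, weighting each classifier by its accuracy."""
    total = sum(accuracies)
    weights = [a / total for a in accuracies]
    n_samples = len(probabilities[0])
    n_classes = len(probabilities[0][0])
    result = []
    for i in range(n_samples):
        scores = [
            sum(w * clf[i][c] for w, clf in zip(weights, probabilities))
            for c in range(n_classes)
        ]
        result.append(max(range(n_classes), key=scores.__getitem__))
    return result

# Three hypothetical classifiers, two samples, two classes (illustrative values).
labels = [[0, 1], [1, 1], [0, 0]]
probas = [
    [[0.90, 0.10], [0.45, 0.55]],
    [[0.40, 0.60], [0.45, 0.55]],
    [[0.60, 0.40], [0.95, 0.05]],
]
accuracies = [0.80, 0.70, 0.60]

hard = hard_vote(labels)
soft = soft_vote(probas, accuracies)
print(hard, soft)
```

On the second sample the two schemes disagree: two classifiers vote for class 1 with low confidence, while the third is very confident about class 0, so the weighted probability average picks class 0.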

Not Done

Implement Treelet Clustering on reduced dataset

Use reliability based error measure for training and validation

Partially Done

Implement SVM | LDA | ANT | ANT Miner | BAT | BAT Miner on the reduced dataset

PSO has not yet been applied for feature selection. Classifiers used so far are SVM | ExtraTree | RF | MLP | AdaBoost | Decision Tree

Produce Results and Fine Tune

Fine tuning is still in progress

Progress Tracking

Oct 2016

Planned how to implement the agenda outlined above.
Tried to use the mRMR implementation from https://github.com/nlhepler/mrmr, but it required many updates and data type conversions, and the results it produced were still not reliable: it seemed to return a sequential list of consecutive vectors from the left of the matrix.

Nov 2016

Finally found https://github.com/danielhomola/mifs, an embarrassingly parallel implementation of mRMR, and it worked for me.
Now implementing Treelet & ANT with the produced subset.
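
For reference, the greedy mRMR criterion itself can be sketched in a few lines. This is not the mifs implementation: it is a minimal stdlib version that assumes features have already been discretised (gene expression values are continuous, so a real pipeline needs binning or a continuous mutual information estimator). The function names and the toy data are hypothetical.

```python
from collections import Counter
from math import log

def mutual_information(a, b):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((c / n) * log(c * n / (pa[x] * pb[y])) for (x, y), c in pab.items())

def mrmr(columns, target, k):
    """Greedy mRMR: pick k column indices maximising relevance minus mean redundancy."""
    relevance = [mutual_information(col, target) for col in columns]
    # Start with the single most relevant feature.
    selected = [max(range(len(columns)), key=relevance.__getitem__)]
    while len(selected) < min(k, len(columns)):
        def score(j):
            redundancy = sum(
                mutual_information(columns[j], columns[s]) for s in selected
            ) / len(selected)
            return relevance[j] - redundancy
        candidates = [j for j in range(len(columns)) if j not in selected]
        selected.append(max(candidates, key=score))
    return selected

# Toy data: column 1 duplicates column 0, column 2 is equally relevant but
# less redundant, column 3 is unrelated to the target.
target = [0, 0, 0, 0, 1, 1, 1, 1]
columns = [
    [0, 0, 0, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 1, 1, 1, 1],
    [0, 1, 0, 1, 0, 1, 0, 1],
]
chosen = mrmr(columns, target, 2)
print(sorted(chosen))
```

The duplicate column is never picked alongside its twin, because its redundancy with the already selected feature cancels out its relevance.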

Dec 2016

Started using Python's scikit-learn to produce results on CPUs (using multiple cores). If needed, we will use the GPU again.
Implemented email notifications to get notified when long-running tasks complete.
Explored and used mRMR, JMI and JMIM based feature selection.
Explored and shortlisted SVM, ExtraTree, RandomForest, MLP, AdaBoost and Decision Trees.
Started listing the results in a Google Sheet to be able to identify trends or patterns.
Keeping trained classifiers and datasets as pickles so we can reload them when needed and move ahead with ensemble creation.
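
The email notification for long-running tasks mentioned above can be done with the standard library alone. The sketch below is an assumption about how such a helper might look, not the contents of the repository's own email scripts; the addresses, host and port are placeholders.

```python
import smtplib
from email.message import EmailMessage

def build_notification(task_name, elapsed_seconds, sender, recipient):
    """Create the message announcing that a task has finished."""
    msg = EmailMessage()
    msg["Subject"] = f"Task finished: {task_name}"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(f"{task_name} completed in {elapsed_seconds:.1f} seconds.")
    return msg

def send_notification(msg, host="localhost", port=25):
    """Send the message. Placeholder SMTP settings; a real server
    usually needs TLS and a login before send_message()."""
    with smtplib.SMTP(host, port) as server:
        server.send_message(msg)

# Build (but do not send) a notification for a hypothetical training run.
msg = build_notification("svm_k-fold", 5400.0, "jobs@example.org", "me@example.org")
print(msg["Subject"])
```

Calling `send_notification(msg)` at the end of a training script is enough to be told when it is done.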

Jan 2017

Replanned the following pathway

Applying FS multiple times: first reduce from several thousand features to 250, then from 250 to 200, 150, 100, 50 and 10 best features, to see which subset size works best for us.
Exploring whether a GA can be used to tune the classifier parameters; not yet sure how to apply it.
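
The staged reduction described above (250, then 200, 150, 100, 50 and 10 features) can be sketched as a loop in which each stage only sees the survivors of the previous one. The selector here is a deliberately simple stand-in (absolute Pearson correlation with the target), not mRMR or JMI; all names and the toy data are hypothetical.

```python
def abs_correlation(xs, ys):
    """Absolute Pearson correlation; 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return abs(cov) / (vx * vy) ** 0.5 if vx and vy else 0.0

def select_top(columns, target, indices, k):
    """Stand-in selector: keep the k candidates most correlated with the target."""
    ranked = sorted(
        indices, key=lambda j: abs_correlation(columns[j], target), reverse=True
    )
    return ranked[:k]

def staged_selection(columns, target, sizes, selector=select_top):
    """Apply the selector repeatedly; sizes must be decreasing,
    e.g. [250, 200, 150, 100, 50, 10]."""
    survivors = list(range(len(columns)))
    stages = {}
    for k in sizes:
        survivors = selector(columns, target, survivors, k)
        stages[k] = list(survivors)
    return stages

# Tiny example with four features and stages of size 3, 2 and 1.
target = [0, 0, 0, 1, 1, 1]
columns = [
    [0, 0, 0, 1, 1, 1],
    [1, 2, 3, 4, 5, 6],
    [1, 0, 1, 0, 1, 0],
    [5, 5, 5, 5, 5, 5],
]
stages = staged_selection(columns, target, [3, 2, 1])
print(stages)
```

Passing a different `selector` (for example one that wraps an mRMR routine) keeps the staging logic unchanged, and the stored subsets can then be compared by classifier accuracy.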

Replanned to do the following tasks along the way

Retrain the models for the other datasets; so far we have done so only on one dataset (DataSet A)
Decide on the ensemble scheme, i.e. simple voting, weighted average, or even an SVM/ANN on top of the individual classifiers
Calculate overall ensemble performance
Use LOOCV instead of 10-fold CV
Try out different forms of error functions, including Mattia's reliability parameter
List all parameter values for each classifier in each setting and decide which parameters to tune and over what range
See if treelets/meta-genes on top of mRMR can improve results further
See if bagging/boosting can help further
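
Since the plan above switches between 10-fold CV and LOOCV, it may help to see that LOOCV is simply k-fold with k equal to the number of samples. The sketch below generates index splits only; it does not shuffle or stratify, which a real run would normally want. scikit-learn provides `KFold` and `LeaveOneOut` in `sklearn.model_selection` for the same purpose. The sample count of 25 is arbitrary.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous, near-equal folds."""
    # The first (n_samples % k) folds get one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size

def leave_one_out_splits(n_samples):
    """LOOCV: every sample is the test set exactly once."""
    return k_fold_splits(n_samples, n_samples)

ten_fold = list(k_fold_splits(25, 10))
loo = list(leave_one_out_splits(25))
print(len(ten_fold), [len(test) for _, test in ten_fold])
print(len(loo))
```

With few samples, LOOCV trains on almost all of the data in every round but needs as many model fits as there are samples, which is the main cost trade-off against 10-fold CV.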

Mar 2017

Ran into issues when applying feature selection because some columns contain missing values. Wrote a program to find the cause of the crash, which is how this was discovered. Suggested approaches for now are:

remove such features altogether
impute values using the average of the other vectors, so that the imputed value does not influence the selection
impute values using a novel technique: find the most correlated vector considering all available features and copy over the value from that most similar vector

Also figuring out how to find the best parameter values for the classifiers
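
The three approaches listed above can be sketched as follows, with rows as sample vectors and NaN marking a missing value. This is a minimal illustration under assumptions: the third function uses mean squared distance over the shared features as the similarity measure, which is a simpler stand-in for the correlation-based similarity described, and every feature is assumed to have at least one observed value.

```python
from math import isnan

def drop_incomplete_features(rows):
    """Approach 1: remove every feature (column) that has a missing value."""
    keep = [j for j in range(len(rows[0])) if not any(isnan(r[j]) for r in rows)]
    return [[r[j] for j in keep] for r in rows]

def impute_mean(rows):
    """Approach 2: fill a gap with the feature's mean over the other samples."""
    out = [list(r) for r in rows]
    for j in range(len(rows[0])):
        present = [r[j] for r in rows if not isnan(r[j])]
        mean = sum(present) / len(present)
        for r in out:
            if isnan(r[j]):
                r[j] = mean
    return out

def impute_nearest(rows):
    """Approach 3: copy the value from the most similar sample that has it."""
    def distance(a, b):
        # Compare only features observed in both samples.
        shared = [(x - y) ** 2 for x, y in zip(a, b) if not isnan(x) and not isnan(y)]
        return sum(shared) / len(shared) if shared else float("inf")
    out = [list(r) for r in rows]
    for i, r in enumerate(rows):
        for j, v in enumerate(r):
            if isnan(v):
                donors = [d for k, d in enumerate(rows) if k != i and not isnan(d[j])]
                out[i][j] = min(donors, key=lambda d: distance(r, d))[j]
    return out

NAN = float("nan")
rows = [
    [1.0, 2.0, 3.0],
    [1.1, NAN, 3.1],
    [9.0, 8.0, 7.0],
]
print(drop_incomplete_features(rows))
print(impute_mean(rows)[1])
print(impute_nearest(rows)[1])
```

In this toy case the mean fills the gap with a value halfway between two very different samples, while the nearest-sample approach copies from the row that actually resembles the incomplete one.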

Jun 2017

Started on Dataset B and also restarted Dataset A, to include both normalized and unnormalized data.
Figuring out which of LOOCV and k-fold CV suits us best. According to Mattia's thesis LOOCV is the best, but it turns out to give worse accuracy in our case.
Dumping all results into unformatted text files; structuring this data will be a challenge later.
The individual results are not very encouraging in terms of accuracy, though the running time is now quite affordable.
Also started creating ensembles to see if that helps in getting better results.

Jul 2017

Even the ensembles are not improving accuracy much. It seems we need to revisit our strategy.
Grouped the generated results into folders
Writing a parser to parse and extract the results
Dumping the data into a database
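
The parse-and-store step above could be sketched as below. The line format is purely hypothetical (the actual result files are unformatted text and their layout is not documented here), the accuracy values are made-up placeholders, and SQLite is an assumed choice of database.

```python
import re
import sqlite3

# Assumed line format, e.g. "classifier=SVM fs=JMI features=50 accuracy=0.61"
LINE = re.compile(
    r"classifier=(?P<classifier>\S+)\s+fs=(?P<fs>\S+)\s+"
    r"features=(?P<features>\d+)\s+accuracy=(?P<accuracy>[0-9.]+)"
)

def parse_results(text):
    """Return one dict per line matching the assumed format; skip everything else."""
    rows = []
    for line in text.splitlines():
        m = LINE.search(line)
        if m:
            rows.append({
                "classifier": m["classifier"],
                "fs": m["fs"],
                "features": int(m["features"]),
                "accuracy": float(m["accuracy"]),
            })
    return rows

def store(rows, conn):
    """Insert parsed rows into a results table, creating it if needed."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results "
        "(classifier TEXT, fs TEXT, features INTEGER, accuracy REAL)"
    )
    conn.executemany(
        "INSERT INTO results VALUES (:classifier, :fs, :features, :accuracy)", rows
    )
    conn.commit()

# Placeholder input; the numbers are not real results.
sample = """\
classifier=SVM fs=JMI features=50 accuracy=0.61
some unrelated log line
classifier=RF fs=mRMR features=10 accuracy=0.58
"""
conn = sqlite3.connect(":memory:")
store(parse_results(sample), conn)
best = conn.execute(
    "SELECT classifier, accuracy FROM results ORDER BY accuracy DESC LIMIT 1"
).fetchone()
print(best)
```

Once the rows are in a table, comparisons across classifiers, feature selection methods and subset sizes become simple queries instead of manual searches through text files.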

Aug 2017

Repopulated the DB from the results files
Preparing the results for a meeting with the advisor

Repository files (10 commits)
10-JMI.joblib.pkl
100-JMI.joblib.pkl
150-JMI.joblib.pkl
200-JMI.joblib.pkl
250-JMI.joblib.pkl
50-JMI.joblib.pkl
AdaBoost_k-fold.joblib.pkl
AdaBoost_k-fold.py
CorrByParts.py
DataSetBGSE24417MAQCIITraining_data.joblib.pkl
DataSetBGSE24417MAQCIITraining_targets.joblib.pkl
DataSetBGSE24417MAQCIIValidation_data.joblib.pkl
DataSetBGSE24417MAQCIIValidation_targets.joblib.pkl
DataSetLoaderLib.py
DataSetLoaderLib.pyc
DatasetA_ValidationClasses.joblib.pkl
DatasetA_ValidationClasses.joblib.pkl.backup
ExtraTreeClassifier_k-fold.py
ExtraTreesClassifier_k-fold.joblib.pkl
GlobalUtils.py
GlobalUtils.pyc
LungCancer-Harvard2.zip
MLP_k-fold.joblib.pkl
MLP_k-fold.py
MeanClassifier.py
README.md
SVM.joblib.pkl
SelectSubsetmRMR.py
SimilarityCalculator.py
cel_loader.py
datasetB10-MRMR.joblib.pkl
divide.txt
divide2.txt
divide_dataset.py
dnn_kfold_(datasetA).py
dt_k-fold.joblib.pkl
dt_k-fold.py
emailSender.py
feature_selection_datasetB.py
findcol.py
genetic.txt
geneticParameter.py
genetic_feature_selection.py
identify_empty.py
log.txt
mrmr2nd.py
multithreaded_mrmr.py
objs.pickle.filepart
output.txt
parse_dataset_results.py
randomForest.py
randomForest_k-fold.py
ref_1_distances.py
ref_1_utils.py
ref_2_pca.py
ref_3_angle.py
ref_4_mi.py
ref_4_mifs.py
replaceNaN.py
rf_k-fold.py
selected_indices.joblib.pkl
selected_indices_MRMR.joblib.pkl
selected_indicesv2.joblib.pkl
sendemail.py
sendemail.pyc
svm.txt
svm_k-fold.py
svm_parameter_selection.py
test_all(test_datasetB).py
test_all(test_datasetB_NP).py
test_all.py
test_all_kfold_(datasetA).py
test_all_kfold_(datasetA_NP).py
test_all_kfold_(datasetB).py
test_all_kfold_(datasetB_NP).py
test_all_loo_(datasetA).py
test_all_loo_(datasetA_NP).py
test_all_loo_(datasetB).py
test_all_loo_(datasetB_NP).py

JavedZahoor/phd-thesis-iii
