Implement Ensemble - done (using scikit-learn; both a soft ensemble with accuracy-based weights and a hard ensemble with 1 vote per classifier)
PSO not yet applied for Feature Selection. Classifiers used so far: SVM | ExtraTree | RF | MLP | AdaBoost | Decision Tree
Fine Tuning still in progress
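
The two ensemble schemes noted above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it assumes scikit-learn's `VotingClassifier`, uses synthetic data from `make_classification` in place of the real datasets, includes only three of the listed classifiers, and takes the mean 5-fold cross-validated accuracy as the "accuracy based" weight (how the weights were really computed is not stated in these notes).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption); the real datasets are not part of these notes.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

base_models = [
    ("svm", SVC(probability=True, random_state=0)),  # probability=True is needed for soft voting
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("dt", DecisionTreeClassifier(random_state=0)),
]

# Assumed weighting rule: mean cross-validated accuracy on the training split.
weights = [
    cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy").mean()
    for _, model in base_models
]

# Hard ensemble: every classifier casts one equal vote.
hard = VotingClassifier(estimators=base_models, voting="hard")
# Soft ensemble: predicted class probabilities are averaged using the weights.
soft = VotingClassifier(estimators=base_models, voting="soft", weights=weights)

hard_acc = hard.fit(X_train, y_train).score(X_test, y_test)
soft_acc = soft.fit(X_train, y_train).score(X_test, y_test)
print(f"hard voting accuracy: {hard_acc:.3f}")
print(f"soft voting accuracy: {soft_acc:.3f}")
```

`VotingClassifier` clones the base estimators before fitting, so the same `base_models` list can be reused for both ensembles.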
- Planned out how to implement the agenda/strategy outlined above.
- Tried to implement mRMR from https://github.com/nlhepler/mrmr, but it needed lots of updates and data type conversions, and the end results were still not reliable: it seemed to just return a sequential list of consecutive vectors from the left of the matrix.
- Finally found https://github.com/danielhomola/mifs, an embarrassingly parallel implementation of mRMR, and it worked for me.
- Now implementing Treelet & ANT with the produced subset
- Started using scikit-learn to produce results on CPUs (using multiple cores via Python). If needed we will use GPU again.
- Implemented Email notifications to get notified when long running tasks get completed.
- Explored and used mRMR, JMI and JMIM based Feature Selection
- Explored and shortlisted SVM, ExtraTree, RandomForest, MLP, AdaBoost and Decision Trees
- Started listing the results in a Google Sheet to be able to identify trends or patterns
- Keeping trained classifiers and datasets as pickles so we could reload them when needed and move ahead with Ensemble creation
- Applying FS multiple times to reduce from several thousand features to 250 features and then from 250 to 200, 150, 100, 50 and 10 best features to see what works best for us
- Exploring to see if GA can be used to tune parameters for the classifiers - not exactly sure how to use it though
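
The staged feature reduction in the list above (several thousand features to 250, then 250 to 200, 150, 100, 50 and 10) might look roughly like the sketch below. Note the substitution: it ranks features with scikit-learn's univariate `mutual_info_classif` via `SelectKBest`, which is NOT mRMR/JMI/JMIM (it ignores redundancy between features) and is not the `mifs` package used in these notes. The data is synthetic, with 500 features to keep the run short, and the helper name `top_k` is hypothetical. Each smaller subset is drawn from the 250-feature subset, which is one reading of the note.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic stand-in (assumption): 500 features instead of several thousand.
X, y = make_classification(
    n_samples=120, n_features=500, n_informative=20, random_state=0
)


def mi_score(X_, y_):
    """Univariate mutual information between each feature and the labels."""
    return mutual_info_classif(X_, y_, random_state=0)


def top_k(X_, y_, k):
    """Return the column indices of the k highest-scoring features."""
    selector = SelectKBest(score_func=mi_score, k=k).fit(X_, y_)
    return selector.get_support(indices=True)


# First pass: all features down to 250.
first_pass = top_k(X, y, 250)
X_250 = X[:, first_pass]

# Later passes: each smaller subset is selected from the 250-feature subset.
subsets = {250: first_pass}
for k in (200, 150, 100, 50, 10):
    local_idx = top_k(X_250, y, k)
    subsets[k] = first_pass[local_idx]  # map back to original column indices

for k, idx in subsets.items():
    print(k, "features selected, first few original indices:", idx[:5])
```

Keeping the original column indices for every subset makes it possible to retrain the classifiers on each subset and compare which size works best.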
Replan to do the following tasks along the way
- Retrain the models for other datasets - so far we have done so only on 1 dataset (DataSet A)
- Decide on Ensemble Scheme i.e. simple voting or weighted avg or even SVM/ANN on individual classifiers
- Calculate overall Ensemble performance
- Use LOOCV instead of 10-fold CV
- Try out different forms of error functions, including Mattia's reliability parameter
- List all parameter values for each classifier for each setting and decide which parameters to tune and in what range.
- See if treelet/meta genes on top of mrmr can improve further
- See if bagging/boosting can help further
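
For the LOOCV versus 10-fold CV item in the list above, a side-by-side comparison could be set up along these lines. This is only a sketch under assumptions: synthetic data, a single decision tree standing in for the shortlisted classifiers, and plain accuracy as the score. It makes no claim about which scheme scores higher.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneOut, StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=60, n_features=15, random_state=0)
clf = DecisionTreeClassifier(random_state=0)

# LOOCV: one fold per sample, so each fold's accuracy is either 0 or 1.
loo_scores = cross_val_score(clf, X, y, cv=LeaveOneOut(), scoring="accuracy")

# 10-fold CV, stratified so every fold keeps the class balance.
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
kfold_scores = cross_val_score(clf, X, y, cv=kfold, scoring="accuracy")

print(f"LOOCV:   mean accuracy {loo_scores.mean():.3f} over {len(loo_scores)} folds")
print(f"10-fold: mean accuracy {kfold_scores.mean():.3f} over {len(kfold_scores)} folds")
```

LOOCV trains one model per sample, so its cost grows with dataset size, while 10-fold CV always trains ten models.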
Ran into issues when applying Feature Selection due to missing values in the columns. Wrote a program to find the reason for the crash and realized this. Suggested approaches for now are...
- remove such features altogether
- impute values using avg of other vectors so the imputed value doesn't impact the selection
- impute values using a novel technique: find the most correlated vector considering all available features and copy over the value from that most similar vector
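
The second and third imputation options above can be sketched with NumPy as follows. The sketch rests on assumptions: "vector" is read as a sample (a row of the data matrix), missing values are encoded as NaN, similarity is the Pearson correlation over the features both samples have observed, and both function names are hypothetical.

```python
import numpy as np


def impute_column_mean(X):
    """Replace each NaN with the mean of the observed values in its column."""
    X_out = X.copy()
    col_means = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X_out[rows, cols] = col_means[cols]
    return X_out


def impute_from_most_correlated(X):
    """Replace each NaN with the value from the most correlated donor row.

    A donor must have the missing feature observed. Correlation is computed
    only over features observed in both rows. If no donor qualifies, the
    value stays NaN.
    """
    X_out = X.copy()
    missing = np.isnan(X)
    for i, j in zip(*np.where(missing)):
        best_row, best_corr = None, -np.inf
        for r in range(X.shape[0]):
            if r == i or missing[r, j]:
                continue
            shared = ~missing[i] & ~missing[r]
            if shared.sum() < 2:
                continue  # correlation needs at least two shared features
            a, b = X[i, shared], X[r, shared]
            if a.std() == 0 or b.std() == 0:
                continue  # correlation is undefined for constant vectors
            corr = np.corrcoef(a, b)[0, 1]
            if corr > best_corr:
                best_row, best_corr = r, corr
        if best_row is not None:
            X_out[i, j] = X[best_row, j]
    return X_out


# Toy matrix: row 0 is missing its last feature.
X = np.array([
    [1.0, 2.0, 3.0, np.nan],
    [1.1, 2.1, 3.2, 4.0],
    [3.0, 2.0, 1.0, 0.5],
    [2.0, 4.0, 6.0, 8.0],
])
print(impute_column_mean(X)[0, 3])           # mean of 4.0, 0.5 and 8.0
print(impute_from_most_correlated(X)[0, 3])  # copied from row 3, the best-correlated donor
```

In the toy matrix, row 3 is an exact multiple of row 0 on the shared features, so it is chosen as the donor even though row 1 is closer in absolute value. That is a property of correlation-based similarity worth keeping in mind when judging this option.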
Also figuring out a way to find the best parameter values for the classifiers
- Started off with Dataset B and also restarted with Dataset A to include both normalized and unnormalized data
- Figuring out which of LOOCV and K-fold CV is best suited for us. As per Mattia's thesis, LOOCV is the best, but it turns out to give worse accuracy in our case.
- Dumping all results in unformatted text files; structuring this data later will be a challenge
- The individual results are not very encouraging in terms of accuracy, though the run time is pretty affordable now.
- Also started creating Ensembles to see if that helps in getting better results
- Even Ensembles are not proving to be very helpful with better accuracy. Seems like we need to revisit our strategy.
- Grouped the generated results into folders
- Writing a parser to parse and extract results
- Dumping the data into a database
- Repopulated DB from the results files
- Preparing the results for the meeting with Advisor
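
The parse-and-load steps in the list above could be sketched as below, using only the Python standard library. Everything specific here is an assumption: the layout of the real result files is not shown in these notes, so the `classifier=... features=... accuracy=...` line format, the table schema, and the function names are all hypothetical, and the sample lines contain made-up placeholder numbers, not actual results.

```python
import re
import sqlite3

# Hypothetical line format (assumption); adapt the pattern to the real files.
LINE_RE = re.compile(
    r"classifier=(?P<classifier>\S+)\s+"
    r"features=(?P<features>\d+)\s+"
    r"accuracy=(?P<accuracy>[0-9.]+)"
)


def parse_results(lines):
    """Yield (classifier, n_features, accuracy) for each line matching LINE_RE."""
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            yield (
                match["classifier"],
                int(match["features"]),
                float(match["accuracy"]),
            )


def load_into_db(conn, rows):
    """Create the results table if needed and insert the parsed rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results "
        "(classifier TEXT, n_features INTEGER, accuracy REAL)"
    )
    conn.executemany("INSERT INTO results VALUES (?, ?, ?)", rows)
    conn.commit()


# Placeholder lines with made-up numbers, only to exercise the parser.
sample_lines = [
    "classifier=SVM features=250 accuracy=0.5",
    "some unrelated log line",
    "classifier=RF features=50 accuracy=0.25",
]

conn = sqlite3.connect(":memory:")  # in-memory DB for the sketch
load_into_db(conn, list(parse_results(sample_lines)))
rows = conn.execute(
    "SELECT classifier, n_features, accuracy FROM results ORDER BY n_features"
).fetchall()
print(rows)
```

Lines that do not match the pattern are skipped silently, so it may be worth counting them when repopulating the database from the real files.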