ML_for_learner

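This project aims to implement a mini, scikit-learn-like machine learning library using numpy. Each topic is accompanied by a blog post explaining the theory, and some features also come with a notebook that walks through the implementation details. The project's original goal is to give people learning these algorithms a one-stop path from theory to implementation.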

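My knowledge is limited and I have no prior Python development experience, so the library is currently still a loosely organized collection of code. If you find any errors or bugs in the blog posts, notebooks, or code, or even just a passage that reads awkwardly, feel free to contact me or open an issue on the project page. Thank you.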

QQ: 435248055   |   WeChat: QQ435248055   |   Blog


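Click an algorithm name to open the corresponding blog post on its theory. The notebook shows how to implement the algorithm step by step, and code is the modularized source file.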

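Note: unless stated otherwise, every model accepts data as numpy.ndarray; some also accept a List or nested List. Other data formats are not guaranteed to work for now. Because Python type hints do not yet support numpy, the accepted types are not annotated in the code (thanks to WeChat user @Stream for pointing this out).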
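
As an illustration of the input convention in the note above, here is a minimal sketch of how list, nested-list, and numpy.ndarray inputs could be normalized before a model uses them. The helper name `as_2d_array` is hypothetical and is not taken from this repository; treating a flat list as a single feature column is likewise an assumption made for the example.

```python
import numpy as np


def as_2d_array(X) -> np.ndarray:
    """Hypothetical helper: coerce a list, nested list, or ndarray to a 2-D float array."""
    X = np.asarray(X, dtype=float)  # does not copy if X is already a float ndarray
    if X.ndim == 1:
        # Assumption for this sketch: a flat sequence is one feature column.
        X = X.reshape(-1, 1)
    if X.ndim != 2:
        raise ValueError(f"expected 1-D or 2-D input, got {X.ndim}-D")
    return X


if __name__ == "__main__":
    print(as_2d_array([1, 2, 3]).shape)
    print(as_2d_array([[1, 2], [3, 4]]).shape)
```

Calling a helper like this at the top of `fit` and `predict` would keep the rest of a model working purely on ndarrays.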

Supervised learning

| Class | Algorithm | Implementation | Code |
| --- | --- | --- | --- |
| Generalized Linear Models | Linear Regression | notebook | code |
| | Logistic regression | notebook | code |
| Nearest Neighbors | Nearest Neighbors Classification | notebook | code |
| Naive Bayes | Gaussian Naive Bayes | notebook | code |
| Support Vector Machine | SVC | notebook | code |
| Decision Trees | ID3 Classification | notebook | code |
| | ID3 Regression | notebook | code |
| | CART Classification | notebook | code |
| | CART Regression | notebook | code |
| Ensemble methods | Random Forests Classification | notebook | code |
| | Random Forests Regression | notebook | code |
| | AdaBoosting Classification | notebook | code |

Unsupervised learning

| Class | Algorithm | Implementation | Code |
| --- | --- | --- | --- |
| Gaussian mixture models | Gaussian Mixture | notebook | code |
| Clustering | K-means | notebook | code |
| | DBSCAN | notebook | code |
| Association Rules | Apriori | notebook | |
| Collaborative Filtering | User-based | notebook | |
| | Item-based | notebook | |
| | LFM | notebook | |

Model selection and evaluation

| Class | Approach | Code |
| --- | --- | --- |
| Model Selection | Dataset Split | code |
| | K-Fold | code |
| | Stratified K-Fold | code |
| Metrics | Accuracy | code |
| | Log loss | code |
| | F1-score | code |
| | AUC | code |
| | Explained Variance | code |
| | Mean Absolute Error | code |
| | Mean Squared Error | code |
| | R Square | code |
| | Euclidean Distances | code |

Preprocessing data

| Class | Algorithm | Implementation | Code |
| --- | --- | --- | --- |
| Feature Scaling | StandardScaler | | code |
| | MinMaxScaler | | code |
| Unsupervised dimensionality reduction | PCA | notebook | code |
| | SVD | notebook | code |
| Supervised dimensionality reduction | Linear Discriminant Analysis | notebook | code |
| Text Feature | Count Feature | | code |
| | TF-IDF | | code |

Known Issues

Code reuse across the library is low.

The random forest implementation is not parallelized.

The LDA code is missing some functionality.

The K-Fold code uses np.append(), which is inefficient.
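
For the np.append() issue in K-Fold, one possible direction is sketched below. np.append() copies the whole array on every call, so growing an index array element by element costs quadratic time. The sketch instead splits the index array once with np.array_split() and builds each training set with a single np.concatenate(). The function name `k_fold_indices` and its parameters are hypothetical and do not reflect the repository's actual K-Fold interface.

```python
import numpy as np


def k_fold_indices(n_samples, n_splits=5, shuffle=False, seed=None):
    """Yield (train_idx, test_idx) pairs. Hypothetical helper, not the repository's API."""
    if n_splits < 2 or n_splits > n_samples:
        raise ValueError("n_splits must be between 2 and n_samples")
    indices = np.arange(n_samples)
    if shuffle:
        # Shuffle once up front so the folds are random but still disjoint.
        np.random.default_rng(seed).shuffle(indices)
    # One split into n_splits nearly equal chunks; no repeated np.append().
    folds = np.array_split(indices, n_splits)
    for i in range(n_splits):
        test_idx = folds[i]
        # A single concatenate per fold over all chunks except the i-th.
        train_idx = np.concatenate(folds[:i] + folds[i + 1:])
        yield train_idx, test_idx


if __name__ == "__main__":
    for train_idx, test_idx in k_fold_indices(10, n_splits=3):
        print(train_idx, test_idx)
```

Each fold still copies the training indices once, but the total work is one concatenate per fold rather than one array copy per appended element.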