This repository is a Python implementation of linear regression from scratch. If you are interested in linear regression, you can check out this video explaining the basic ideas behind it and also dissecting the inner workings of this code base.
To use this repository, first install the required libraries. Change ("cd") into the project folder and run the following command: pip install -r requirements.txt
To train the linear regression model, run main.py. The loss (error) is printed in the terminal as the model trains. Once training is done, a graph of the loss appears. To test your model, run test.py. This program plots both the original data and the fitted line so you can see how well the line fits. After you close the graph, you can enter input values to get predictions for unseen data.
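For readers who want the idea behind the training step before opening the code, here is a minimal sketch of fitting a line by batch gradient descent on the mean squared error. It is not taken from main.py: the function name `train`, its arguments, and the choice of mean squared error are assumptions made for illustration.

```python
def train(xs, ys, learning_rate=0.01, epochs=1000):
    """Fit y = a*x + b by batch gradient descent (illustrative sketch)."""
    a, b = 0.0, 0.0          # start from a flat line through the origin
    n = len(xs)
    losses = []              # one loss value per epoch, e.g. for plotting
    for _ in range(epochs):
        # Signed error of the current line on every sample.
        errors = [(a * x + b) - y for x, y in zip(xs, ys)]
        # Mean squared error (assumed loss; the repository may use another).
        losses.append(sum(e * e for e in errors) / n)
        # Gradients of the mean squared error with respect to a and b.
        grad_a = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        # Step against the gradient.
        a -= learning_rate * grad_a
        b -= learning_rate * grad_b
    return a, b, losses


# Noise-free toy points that lie exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
a, b, losses = train(xs, ys, learning_rate=0.05, epochs=2000)
```

On these toy points the returned `a` and `b` approach 2 and 1, and `losses` shrinks from epoch to epoch, which is the kind of curve the loss graph shows.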
Both main.py and test.py are configurable. In both cases, you can specify the CSV file with the training/testing data and the corresponding column names for the input and output/label values. main.py also lets you change the learning rate and the number of epochs.
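As a rough illustration of what picking a CSV file and two column names amounts to, the sketch below reads one input column and one label column by name. The helpers `read_columns` and `load_columns` are hypothetical and are not the functions used in main.py or test.py; it also assumes the CSV has a header row and numeric values.

```python
import csv


def read_columns(lines, x_column, y_column):
    """Return two lists of floats taken from the named columns.

    `lines` is any iterable of CSV text lines, header row first.
    """
    xs, ys = [], []
    for row in csv.DictReader(lines):
        xs.append(float(row[x_column]))
        ys.append(float(row[y_column]))
    return xs, ys


def load_columns(path, x_column, y_column):
    """Same as read_columns, but reads the lines from a file on disk."""
    with open(path, newline="") as f:
        return read_columns(f, x_column, y_column)


# Hypothetical usage with two of the boston.csv columns:
# xs, ys = load_columns("boston.csv", "RM", "MEDV")
```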
This repository has two built-in datasets you can train on: ice-cream.csv and boston.csv.
ice-cream.csv is a very simple dataset with artificial data on the day's temperature and the number of ice creams sold. The data was generated from a line with a slope of 5.5 and a y-intercept of -52.0, plus random noise between -20 and 20. You can compare these values to the trained model's parameters.
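Data of this kind can be reproduced with a few lines of Python. The sketch below is not the script that produced ice-cream.csv: the function name, the temperature range, the uniform noise distribution, and the seed are all assumptions. Only the slope, the intercept, and the noise bounds come from the description above.

```python
import random


def generate_ice_cream_data(n, slope=5.5, intercept=-52.0, noise=20.0, seed=0):
    """Return n (temperature, sales) pairs scattered around a line."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    rows = []
    for _ in range(n):
        # Assumed temperature range; the real file may cover another one.
        temperature = rng.uniform(10.0, 40.0)
        # Point on the line plus uniform noise in [-noise, noise] (assumed).
        sales = slope * temperature + intercept + rng.uniform(-noise, noise)
        rows.append((temperature, sales))
    return rows


rows = generate_ice_cream_data(100)
```

A model trained on such data should end up with a slope near 5.5 and an intercept near -52.0, though the noise means the match will not be exact.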
boston.csv is a collection of real-life data about housing in the Boston area. Meaning of the columns:
- CRIM - per capita crime rate by town
- ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS - proportion of non-retail business acres per town.
- CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
- NOX - nitric oxides concentration (parts per 10 million)
- RM - average number of rooms per dwelling
- AGE - proportion of owner-occupied units built prior to 1940
- DIS - weighted distances to five Boston employment centres
- RAD - index of accessibility to radial highways
- TAX - full-value property-tax rate per $10,000
- PTRATIO - pupil-teacher ratio by town
- B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT - % lower status of the population
- MEDV - Median value of owner-occupied homes in $1000's
After the model is trained, its parameters are saved in the model.txt file. The first number is the slope (parameter "a") and the second is the y-intercept (parameter "b"). test.py then reads this file.
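The save file can be pictured with the short sketch below. The exact layout of model.txt is not specified here, so the sketch assumes the two numbers are separated by whitespace (a space or a newline). The helper names `save_model`, `load_model`, and `predict` are hypothetical.

```python
import os
import tempfile


def save_model(path, a, b):
    """Write slope a and intercept b, one number per line (assumed layout)."""
    with open(path, "w") as f:
        f.write(f"{a}\n{b}\n")


def load_model(path):
    """Read the two whitespace-separated numbers back as floats."""
    with open(path) as f:
        a, b = (float(value) for value in f.read().split())
    return a, b


def predict(x, a, b):
    """Evaluate the fitted line y = a*x + b at x."""
    return a * x + b


# Round trip through a temporary file so the real model.txt is untouched.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.txt")
    save_model(path, 5.5, -52.0)
    a, b = load_model(path)
```

With the parameters loaded, `predict(x, a, b)` gives the model's output for a new input value.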