Upload a CSV and it will clean your data, sort it into clusters, and send it back with a new column containing the cluster labels.
For example, this:
Item         Type
Apples       Fruit
Bananas      Fruit
Cat          Fren
Kodiak bear  Fren
Will get sent back to you as this:
Item         Type   Cluster
Apples       Fruit  Cluster 0
Bananas      Fruit  Cluster 0
Cat          Fren   Cluster 1
Kodiak bear  Fren   Cluster 1
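To make the shape of the result concrete, here is a small sketch that appends a Cluster column to the example above using only the standard library. The labels are hard-coded for illustration (an assumption standing in for whatever the service computes), and `add_cluster_column` is a hypothetical helper, not part of pydacc.

```python
import csv
import io

def add_cluster_column(csv_text, labels):
    """Return csv_text with a 'Cluster' column appended, one label per data row."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    if len(data) != len(labels):
        raise ValueError("need exactly one label per data row")
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header + ["Cluster"])
    for row, label in zip(data, labels):
        writer.writerow(row + [f"Cluster {label}"])
    return out.getvalue()

# the example table from above, written as CSV text
example = "Item,Type\nApples,Fruit\nBananas,Fruit\nCat,Fren\nKodiak bear,Fren\n"
# illustrative labels only; the real labels come from the clustering model
labeled = add_cluster_column(example, [0, 0, 1, 1])
print(labeled)
```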
Demo and docs: https://pydacc-production.up.railway.app
Note: the demo runs on Railway's free plan, so it might not always be available.
Clusters your data based on how many groups you want. This is set by the k parameter.
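A request for fixed-k clustering could be assembled roughly like this. It is a sketch under assumptions: the form field is taken to be named `k` (as in the description above), the endpoint path `clustering` is a guess by analogy with `auto-clustering`, and `build_clustering_fields` is a hypothetical helper. Check the hosted docs for the real parameter list.

```python
def build_clustering_fields(csv_name, csv_bytes, k):
    """Build multipart fields for a fixed-k clustering request (hypothetical helper)."""
    if not isinstance(k, int) or k < 1:
        raise ValueError("k must be a positive integer")
    return [
        # the uploaded CSV, sent under the path_to_csv field
        ("path_to_csv", (csv_name, csv_bytes, "text/csv")),
        # number of clusters to create (field name assumed to be 'k')
        ("k", (None, str(k))),
        ("output_format", (None, "csv")),
    ]

fields = build_clustering_fields("example.csv", b"Item,Type\nApples,Fruit\n", k=2)

# Sending it would look like this (the endpoint path is an assumption):
# import requests
# response = requests.post(
#     "https://pydacc-production.up.railway.app/clustering", files=fields
# )
```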
Same as clustering, but the number of groups is decided automatically.
Note: There is a bug with how Swagger (docs) creates the request so trying it in the docs may not work :(
Links to relevant issues:
import requests
import pandas as pd
import io
import os
url = 'https://pydacc-production.up.railway.app'
input_csv = 'example.csv' # in the repo
endpoint = 'auto-clustering'
output_file_name = 'cluster_model'
output_file_path = None # save csv to custom path
files = [
    ('path_to_csv', (os.path.basename(input_csv), open(input_csv, 'rb'), 'text/csv')),
    ('column_drop_threshold', (None, '0.99')),
    ('file_name', (None, 'cluster_model')),
    ('drop_columns', (None, 'category')),
    ('drop_columns', (None, 'account_code')),
    ('categorical_columns', (None, 'city')),
    ('numerical_columns', (None, 'TotalOrderCount')),
    ('numerical_columns', (None, 'TotalOrderValue')),
    ('numerical_columns', (None, 'outstanding_debt')),
    ('numerical_columns', (None, 'TotalReturnedValue')),
    ('numerical_columns', (None, 'TotalReturnedQty')),
    ('numerical_columns', (None, 'TotalStock')),
    ('ignore_features', (None, 'id')),
    ('ignore_features', (None, 'name')),
    ('output_format', (None, 'csv')),
]
response = requests.post(f'{url}/{endpoint}', files=files)
# create dataframe
df = pd.read_csv(io.StringIO(response.text))
# save file
if output_file_path:
    df.to_csv(f'{output_file_path}/{output_file_name}.csv', index=False)
else:
    df.to_csv(f'{output_file_name}.csv', index=False)
# print dataframe
print(df)
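The repeated keys in `files` (for example the several `numerical_columns` entries) are how a list-valued parameter is sent as multipart form data. A small helper can build that list from a dictionary. This is only a sketch: `to_multipart_fields` is a hypothetical name, and the field names are the ones from the example above.

```python
def to_multipart_fields(params):
    """Expand {name: value or list of values} into requests-style multipart tuples."""
    fields = []
    for name, value in params.items():
        values = value if isinstance(value, (list, tuple)) else [value]
        for item in values:
            # (None, text) marks a plain form field rather than a file upload
            fields.append((name, (None, str(item))))
    return fields

params = {
    "column_drop_threshold": 0.99,
    "drop_columns": ["category", "account_code"],
    "numerical_columns": ["TotalOrderCount", "TotalOrderValue"],
    "output_format": "csv",
}
fields = to_multipart_fields(params)
print(fields)
```

The file entry still has to be added separately, for instance by prepending the `path_to_csv` tuple to the returned list before passing it to `requests.post`.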
curl -X 'POST' \
'https://pydacc-production.up.railway.app/auto-clustering' \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-F 'path_to_csv=@example.csv;type=text/csv' \
-F 'column_drop_threshold=0.99' \
-F 'file_name=cluster_model' \
-F 'drop_columns=category' \
-F 'drop_columns=account_code' \
-F 'categorical_columns=city' \
-F 'numerical_columns=TotalOrderCount' \
-F 'numerical_columns=TotalOrderValue' \
-F 'numerical_columns=outstanding_debt' \
-F 'numerical_columns=TotalReturnedValue' \
-F 'numerical_columns=TotalReturnedQty' \
-F 'numerical_columns=TotalStock' \
-F 'ignore_features=id' \
-F 'ignore_features=name' \
-F 'output_format=csv'
Clone:
git clone https://github.com/batmanscode/pydacc.git
Then run the server with:
uvicorn api:app --reload
And optionally --host 0.0.0.0 --port 8080.
Additional info
- pydacc/ contains the Python library.
- api.py is a FastAPI wrapper that makes the functions in pydacc/ available as an API.
- See FastAPI docs for more info about the server.
Railway has a good free tier. I've created a template for this project; you can create an account and deploy this yourself by using the button below!