Most online shoppers provide only the bare minimum of information needed when signing up as a new user or making a transaction on a site (i.e. credit card details, delivery address, etc.). They do not provide their age, gender, or other personal details when registering as a new customer, or they simply purchase their items as a 'Guest' user.
- Transaction data: saved as a database file (test_data.db.zip) and a JSON file (test_data.json.zip).
- Steps in data preparation/processing:
- Stage 1:
- unzipping the database file using an encrypted password
- using sqlite3 to connect to the database file (test_data.db) (see the sketch after this list)
- finding the table name within the database file (customers)
- writing SQL queries to answer the Stage 1 SQL questions on the 'customers' table
- Stage 2 & 3:
- unzipping the JSON file and converting it to a pandas DataFrame for data preprocessing
- creating new features for k-means unsupervised learning to predict gender (2 clusters)
- Stage 4: documentation
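A minimal sketch of the Stage 1 setup, assuming the archive uses standard zip encryption (an AES-encrypted archive would need a library such as pyzipper) and a placeholder password:

```python
import sqlite3
import zipfile

# Placeholder password for illustration; the real password is supplied separately.
ZIP_PASSWORD = b"example_password"

# Extract test_data.db from the password-protected archive.
with zipfile.ZipFile("test_data.db.zip") as zf:
    zf.extractall(pwd=ZIP_PASSWORD)

# Connect to the extracted SQLite database.
conn = sqlite3.connect("test_data.db")

# List the tables to confirm the 'customers' table is present.
tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table';"
).fetchall()
print(tables)  # expected: [('customers',)]
```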
- Stage 1:
Using test_data.db, write SQL queries to answer the following questions (an example query is sketched after the list):
- What was the total revenue ($) for customers who have paid by credit card?
- What % of customers who have purchased female items have paid by credit card?
- What was the average revenue for customers who used either iOS, Android or Desktop?
- Which customers should be targeted by an email campaign promoting a new men's luxury brand?
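To illustrate the first question, a hedged sketch of the revenue query, assuming hypothetical column names such as payment_method and revenue (the actual 'customers' schema may differ):

```python
import sqlite3

conn = sqlite3.connect("test_data.db")

# Total revenue ($) for customers who paid by credit card.
# 'payment_method' and 'revenue' are assumed column names; adjust
# them to match the actual 'customers' table schema.
query = """
SELECT SUM(revenue) AS total_revenue
FROM customers
WHERE payment_method = 'credit_card';
"""
print(conn.execute(query).fetchone())
```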
Data preprocessing and cleaning (two columns are intentionally corrupted, to be identified and fixed)
- detail of the process is within the code
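A minimal sketch of how the corrupted columns might be surfaced, assuming the extracted JSON loads into a flat table; the exact corruption and the fixes applied are documented in the code:

```python
import pandas as pd

# Load the extracted JSON transactions into a DataFrame.
df = pd.read_json("test_data.json")

# Inspect dtypes, summary statistics and null counts; corrupted columns
# typically stand out as unexpected types, impossible values or excess nulls.
print(df.dtypes)
print(df.describe(include="all"))
print(df.isna().sum())
```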
BUILD - Building a model to predict gender using the provided and engineered features
- there is no gender flag in the data --> unsupervised learning (k-means)
- detail of the process is within the code
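A minimal sketch of the clustering step, assuming the engineered features have already been assembled into numeric columns (the feature names below are placeholders, not the actual columns):

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Load the preprocessed transactions (see the cleaning sketch above).
df = pd.read_json("test_data.json")

# Placeholder feature names; the real engineered features are built in the code.
features = df[["num_female_items", "num_male_items", "total_revenue"]]

# Scale the features so no single column dominates the distance metric.
X = StandardScaler().fit_transform(features)

# Two clusters act as a proxy for gender, since no label exists in the data.
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
df["gender_cluster"] = kmeans.fit_predict(X)
```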
Documenting the process, findings, and code in a reproducible document that can be understood by a business user and answers the following:
- How did you clean the data and what was wrong with it?
- Which features did you use as-is, and which ones did you engineer from the given ones? What do they mean in the real world?
- What does the output look like, and how accurate is the prediction when compared against data with labelled flags?
- What other features and variables can you think of that would make this process more robust? Can you recommend the top 5 features you would seek out, apart from the ones given here?
- Summarize your findings in an executive summary.
- A PDF version of the report is attached in this repo.