UNIwise ML workshop

The purpose of this workshop is to give you hands-on experience with how we at UNIwise use data for machine learning. You will learn about the types of data we handle and the challenges we face, and ponder how to overcome them.

Getting Started

  • Clone the repo with git clone git@github.com:UNIwise/ml-workshop.git or download the repo.
  • Check that you have Python installed by running python --version in your favourite terminal.
    If you do not have it installed, install it before continuing.
  • Navigate to the root of the project and run pip install -r requirements.txt to install the required packages.
  • Run the example to make sure everything works:
    • Run python example/visualize.py to visualize the data.
    • Run python example/predict.py to predict & visualize outliers.

Now you're ready to go! To begin you should add your code to the predict_outliers function in predict.py, but feel free to add other functions or play with the visualization if you wish.

To predict with your implementation, run python <difficulty>/predict.py, e.g. python hard/predict.py to predict outliers on the hard dataset using your model.

To visualize different variables, run python <difficulty>/visualize.py <status_key>, e.g. python hard/visualize.py FACIAL_RECOGNITION to visualize the facial recognition variable for your predictions on the hard dataset.
For the datasets with only one variable (easy and example) just run python <difficulty>/visualize.py.

Outlier Detection

The challenge of this workshop is to detect students acting suspiciously during an exam, i.e. outlier detection. In outlier detection we have no labels or 'ground truths' to guide either the training of a model or its evaluation. Note therefore that outliers are not necessarily cheaters; their data simply looks different from that of their peers. At UNIwise we never call students cheaters, but we provide insights so that institutions can draw their own conclusions.
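As a minimal sketch of what "looks different from their peers" can mean without labels, the snippet below scores each student's value by its distance from the group median, scaled by the median absolute deviation (MAD). The function names, the made-up counts and the threshold of 3.5 are assumptions for illustration only; this is not the model supplied in the example folder.

```python
from statistics import median


def robust_z_scores(values):
    """Return a modified z-score per value, based on the median and the MAD."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # At least half the values equal the median, so the MAD gives no
        # usable scale; this sketch then scores everything as 0.
        return [0.0 for _ in values]
    # 0.6745 rescales the MAD so that scores are comparable to ordinary
    # z-scores when the data is roughly normally distributed.
    return [0.6745 * (v - med) / mad for v in values]


def flag_outliers(value_by_user, threshold=3.5):
    """Map user_id -> True when the value is unusually far from the median."""
    user_ids = list(value_by_user)
    scores = robust_z_scores([value_by_user[u] for u in user_ids])
    return {u: abs(s) > threshold for u, s in zip(user_ids, scores)}


# Made-up final CHARACTERS_TYPED counts for six students.
final_counts = {0: 7534, 1: 7120, 2: 6980, 3: 7410, 4: 7250, 5: 21000}
flags = flag_outliers(final_counts)
print(flags)  # only user 5 is flagged
```

The median and MAD are used here instead of the mean and standard deviation because a single extreme student would otherwise inflate the very scale they are measured against. A flag only says "different from peers", in line with the note above about not calling anyone a cheater.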

Datasets

This repository contains three different datasets of increasing complexity and one used as an example, where we have supplied a simple model for outlier detection. We recommend starting with the easy dataset, but depending on your level of experience and/or fighting spirit, feel free to start with any dataset. Note that while results are cool, we would also love to talk with you about your considerations, potential problems you see with your methods, or how you would handle hypothetical issues we might suggest.

Example

In the example dataset we have generated data for 10 students doing a 3-hour exam.
The data is structured in the following manner:

| time | user_id | status_key | value |
|---|---|---|---|
| 2023-02-10 12:00:00 | 0 | CHARACTERS_TYPED | 6543 |
| 2023-02-10 12:05:00 | 0 | CHARACTERS_TYPED | 7534 |
| 2023-02-10 12:00:00 | 1 | CHARACTERS_TYPED | 3213 |

With columns:
time: The timestamp of the datapoint.
user_id: A unique student identifier.
status_key: The variable that the value column refers to. In Example status_key is always CHARACTERS_TYPED, representing the cumulative character count up to this time.
value: The value status_key had at this time.
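The long format above (one row per student, variable and timestamp) usually needs reshaping into one time series per student before modelling. The sketch below does that with the standard library only. It assumes the data is available as comma-separated text with the four columns shown; the actual file format and location in the repo may differ, and the helper names are hypothetical.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime

# Assumed CSV rendering of the sample rows shown above.
SAMPLE = """time,user_id,status_key,value
2023-02-10 12:00:00,0,CHARACTERS_TYPED,6543
2023-02-10 12:05:00,0,CHARACTERS_TYPED,7534
2023-02-10 12:00:00,1,CHARACTERS_TYPED,3213
"""


def load_series(text):
    """Return {user_id: [(timestamp, value), ...]} sorted by timestamp."""
    series = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        timestamp = datetime.strptime(row["time"], "%Y-%m-%d %H:%M:%S")
        series[int(row["user_id"])].append((timestamp, float(row["value"])))
    for points in series.values():
        points.sort()
    return dict(series)


def typing_increments(points):
    """Characters added between consecutive observations of a cumulative count."""
    return [later - earlier for (_, earlier), (_, later) in zip(points, points[1:])]


series = load_series(SAMPLE)
print(typing_increments(series[0]))  # [991.0]
```

Because CHARACTERS_TYPED is a running count, the differences between consecutive observations are often easier to compare across students than the raw totals.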

Easy

This dataset is similar to the example dataset in that it only contains one measurement, CHARACTERS_TYPED, but this dataset contains data for more students.
The data is structured in the following manner:

| time | user_id | status_key | value |
|---|---|---|---|
| 2023-02-10 12:00:00 | 0 | CHARACTERS_TYPED | 6543 |
| 2023-02-10 12:05:00 | 0 | CHARACTERS_TYPED | 7534 |
| 2023-02-10 12:00:00 | 1 | CHARACTERS_TYPED | 3213 |

With columns:
time: The timestamp of the datapoint.
user_id: A unique student identifier.
status_key: The variable that the value column refers to. In Easy status_key is always CHARACTERS_TYPED, representing the cumulative character count up to this time.
value: The value status_key had at this time.

Medium

In this dataset status_key can take multiple values. Students can now be considered outliers based on a single variable or on a combination of several. The data is structured in the following manner:

| time | user_id | status_key | value |
|---|---|---|---|
| 2023-02-10 12:00:00 | 0 | CHARACTERS_TYPED | 6543 |
| 2023-02-10 12:05:00 | 0 | CHARACTERS_TYPED | 7534 |
| 2023-02-10 12:00:00 | 0 | FACIAL_RECOGNITION | 95 |
| 2023-02-10 12:05:00 | 0 | FACIAL_RECOGNITION | 93 |
| 2023-02-10 12:00:00 | 0 | VOICE_DETECTION | 4 |
| 2023-02-10 12:05:00 | 0 | VOICE_DETECTION | 2 |
| 2023-02-10 12:00:00 | 1 | CHARACTERS_TYPED | 3213 |

With columns:
time: The timestamp of the datapoint.
user_id: A unique student identifier.
status_key: The variable that the value column refers to. CHARACTERS_TYPED is the cumulative character count up to this time, FACIAL_RECOGNITION is the percentage match between the student and a reference image of the student according to facial recognition, and VOICE_DETECTION is the number of spoken sentences detected since the last observation.
value: The value status_key had at this time.

Here you will have to be creative and use the new variables to decide whether more students should be considered outliers.
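One possible starting point, sketched below, is to summarize each variable into a single number per student and then check each variable separately for students far from the group median. The aggregation choices (largest CHARACTERS_TYPED, mean FACIAL_RECOGNITION, summed VOICE_DETECTION), the made-up observations and the 3.5 threshold are all assumptions for illustration, not the workshop's reference solution.

```python
from collections import defaultdict
from statistics import mean, median

KEYS = ("CHARACTERS_TYPED", "FACIAL_RECOGNITION", "VOICE_DETECTION")

# Assumed way to summarize each variable into one number per student.
AGGREGATORS = {
    "CHARACTERS_TYPED": max,     # running count: take the largest value seen
    "FACIAL_RECOGNITION": mean,  # percentage match: average over the exam
    "VOICE_DETECTION": sum,      # count since last observation: exam total
}

# Made-up samples per student, listed in the order of KEYS.
observations = {
    0: ([3000, 6500], [95, 93], [4, 2]),
    1: ([2800, 6100], [95, 96], [3, 4]),
    2: ([3100, 6900], [94, 95], [2, 3]),
    3: ([2900, 6300], [96, 40], [5, 3]),
    4: ([3200, 6700], [93, 94], [3, 3]),
}
# Flatten into long-format (user_id, status_key, value) rows like the dataset.
rows = [
    (user, key, value)
    for user, samples in observations.items()
    for key, values in zip(KEYS, samples)
    for value in values
]


def student_features(rows):
    """Return {user_id: {status_key: aggregated value}}."""
    grouped = defaultdict(lambda: defaultdict(list))
    for user, key, value in rows:
        grouped[user][key].append(value)
    return {
        user: {key: AGGREGATORS[key](values) for key, values in per_key.items()}
        for user, per_key in grouped.items()
    }


def most_unusual_variable(features):
    """Return {user_id: (status_key, score)} for each student's most extreme variable."""
    best = {user: (None, 0.0) for user in features}
    for key in KEYS:
        values = {u: f[key] for u, f in features.items() if key in f}
        med = median(values.values())
        mad = median(abs(v - med) for v in values.values())
        if mad == 0:
            continue  # no usable spread for this variable
        for user, value in values.items():
            score = abs(0.6745 * (value - med) / mad)
            if score > best[user][1]:
                best[user] = (key, score)
    return best


scores = most_unusual_variable(student_features(rows))
flagged = {user: key for user, (key, score) in scores.items() if score > 3.5}
print(flagged)  # {3: 'FACIAL_RECOGNITION'}
```

Scoring variables one at a time keeps track of which variable triggered a flag, but it cannot catch students who are only unusual in a combination of variables; a multivariate method would be needed for that.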

Hard

This dataset contains the same variables as the medium dataset, but this time students from multiple exams are included, and the exams were held at different times. This means that comparing variables across exams will pose some problems. A CHARACTERS_TYPED value that is an outlier at a given time in one exam may be perfectly normal at the equivalent time in a different exam.

| time | user_id | exam_id | status_key | value |
|---|---|---|---|---|
| 2023-02-10 12:00:00 | 0 | 0 | CHARACTERS_TYPED | 6543 |
| 2023-02-10 12:05:00 | 0 | 0 | CHARACTERS_TYPED | 7534 |
| 2023-02-10 12:00:00 | 0 | 0 | FACIAL_RECOGNITION | 95 |
| 2023-02-10 12:05:00 | 0 | 0 | FACIAL_RECOGNITION | 93 |
| 2023-02-10 12:00:00 | 0 | 0 | VOICE_DETECTION | 4 |
| 2023-02-10 12:05:00 | 0 | 0 | VOICE_DETECTION | 2 |
| 2023-02-10 12:00:00 | 1 | 0 | CHARACTERS_TYPED | 3213 |
| 2023-02-10 12:00:00 | 0 | 1 | CHARACTERS_TYPED | 3654 |

With columns:
time: The timestamp of the datapoint.
user_id: A student identifier that is unique within a given exam.
exam_id: A unique exam identifier.
status_key: The variable that the value column refers to. CHARACTERS_TYPED is the cumulative character count up to this time, FACIAL_RECOGNITION is the percentage match between the student and a reference image of the student according to facial recognition, and VOICE_DETECTION is the number of spoken sentences detected since the last observation.
value: The value status_key had at this time.
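To make values comparable across exams, one option is to replace the wall-clock time with minutes since each exam's first observation and to express each value relative to what is typical in the same exam at the same elapsed time. The sketch below does this for CHARACTERS_TYPED on made-up rows; treating the earliest timestamp as the exam start and using the per-exam median as "typical" are assumptions, as is the function name.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median

FMT = "%Y-%m-%d %H:%M:%S"

# Made-up CHARACTERS_TYPED rows: (time, user_id, exam_id, value).
rows = [
    ("2023-02-10 12:00:00", 0, 0, 600), ("2023-02-10 12:05:00", 0, 0, 1200),
    ("2023-02-10 12:00:00", 1, 0, 500), ("2023-02-10 12:05:00", 1, 0, 1000),
    ("2023-02-10 12:00:00", 2, 0, 550), ("2023-02-10 12:05:00", 2, 0, 1100),
    ("2023-03-01 09:00:00", 0, 1, 60), ("2023-03-01 09:05:00", 0, 1, 120),
    ("2023-03-01 09:00:00", 1, 1, 50), ("2023-03-01 09:05:00", 1, 1, 100),
    ("2023-03-01 09:00:00", 2, 1, 55), ("2023-03-01 09:05:00", 2, 1, 440),
]


def relative_to_exam(rows):
    """Return {(exam_id, user_id, elapsed_minutes): value / exam median at that time}."""
    parsed = [(datetime.strptime(t, FMT), user, exam, value)
              for t, user, exam, value in rows]

    # Assumption: an exam starts at its earliest observed timestamp.
    exam_start = {}
    for t, _, exam, _ in parsed:
        exam_start[exam] = min(t, exam_start.get(exam, t))

    # Group values by exam and by minutes elapsed since that exam's start.
    groups = defaultdict(list)
    for t, user, exam, value in parsed:
        elapsed = (t - exam_start[exam]).total_seconds() / 60
        groups[(exam, elapsed)].append((user, value))

    relative = {}
    for (exam, elapsed), members in groups.items():
        typical = median(value for _, value in members)
        if typical == 0:
            continue  # avoid dividing by zero, e.g. before anyone has typed
        for user, value in members:
            relative[(exam, user, elapsed)] = value / typical
    return relative


relative = relative_to_exam(rows)
print(round(relative[(1, 2, 5.0)], 2))  # 3.67
```

In this made-up data, user 2 in exam 1 has typed fewer characters after five minutes than anyone in exam 0, yet stands out clearly within their own exam. Note also that the lookup key includes exam_id, since user_id alone does not identify a student across exams.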

Questions to Ponder

We do not expect you to solve these questions, but they are important things to consider if one were to deploy a model for finding outlier students in real exams.

  • How would you handle students starting late or ending early?
  • What if some students were given extra time?
  • What if not all exams have the same status_keys available?
  • How would you explain to an invigilator why a student is considered an outlier?
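As one hedged illustration of the first question, a feature can be measured against each student's own active period instead of the exam clock, so that a late starter is not penalized for a lower total. The sketch below computes characters typed per active minute; the function name and the data are made up, and this is only one of many possible answers.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def chars_per_active_minute(points):
    """Average typing rate between a student's own first and last observation.

    `points` is a list of (time_string, cumulative_character_count) tuples.
    Returns None when there are fewer than two distinct timestamps.
    """
    if len(points) < 2:
        return None
    ordered = sorted((datetime.strptime(t, FMT), v) for t, v in points)
    (start, first), (end, last) = ordered[0], ordered[-1]
    minutes = (end - start).total_seconds() / 60
    if minutes == 0:
        return None
    return (last - first) / minutes


# Made-up data: one student starts on time, the other 30 minutes late.
on_time = [("2023-02-10 12:00:00", 0), ("2023-02-10 13:00:00", 3000)]
late = [("2023-02-10 12:30:00", 0), ("2023-02-10 13:00:00", 1500)]

print(chars_per_active_minute(on_time))  # 50.0
print(chars_per_active_minute(late))     # 50.0
```

The late starter's total is half that of the other student, but the rate is identical. Whether a rate is a fair basis for comparison (for example, with extra time or long pauses) is exactly the kind of consideration these questions are meant to raise.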