This repository contains the code for running the frame-based DriveCLIP framework. The summary results on different distracted driving datasets are:

[Figures: summary_results, zero-shot_results]


| Dataset | Modality | # of Classes | Link | # of training samples |
|---|---|---|---|---|
| DMD | RGB | 10 | link | 78917 (per fold) |
| StateFarm | RGB | 10 | link | 20204 (per fold) |
| SynDD1 | RGB | 18 | link | 4404 (1fps, dashboard) |
| SAM-DD | RGB | 10 | link | 36839 (28 drivers) |


  1. The frame-based CLIP model was trained to predict the distracted driving action in each frame. At inference time, the model assigns each frame a predicted action label and a confidence score (the probability of that predicted label).

For training on the SynDD1 dataset, we used 8 distracted action classes. The class IDs and class labels are as follows:

0 "driver is adjusting his or her hair while driving a car"
1 "driver is drinking water from a bottle while driving a car"
2 "driver is eating while driving a car"
3 "driver is picking something from the floor while driving a car"
4 "driver is reaching behind to the backseat while driving a car"
5 "driver is singing a song with music and smiling while driving"
6 "driver is talking to the phone on hand while driving a car"
7 "driver is yawning while driving a car"

Therefore, the model can predict only these 8 classes. There are also several variants of the model weights; the model based on the ViT-L/14 backbone gives the best results.
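As an illustration of how a per-frame label and confidence score relate to the 8 classes, here is a minimal sketch. It assumes the model yields one raw score (logit) per class prompt for a frame; the function names `softmax` and `predict_frame` are hypothetical and are not taken from this repo.

```python
import math

# Class prompts copied from the list above; the class ID is the list index.
CLASS_PROMPTS = [
    "driver is adjusting his or her hair while driving a car",
    "driver is drinking water from a bottle while driving a car",
    "driver is eating while driving a car",
    "driver is picking something from the floor while driving a car",
    "driver is reaching behind to the backseat while driving a car",
    "driver is singing a song with music and smiling while driving",
    "driver is talking to the phone on hand while driving a car",
    "driver is yawning while driving a car",
]


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_frame(logits):
    """Hypothetical helper: return (class_id, label, confidence, probs)."""
    if len(logits) != len(CLASS_PROMPTS):
        raise ValueError("expected one score per class prompt")
    probs = softmax(logits)
    class_id = max(range(len(probs)), key=probs.__getitem__)
    return class_id, CLASS_PROMPTS[class_id], probs[class_id], probs
```

The confidence score here is simply the largest softmax probability, so a frame showing an action outside these 8 classes would still receive one of the 8 labels.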

  1. The data folder structure (SynDD1) should look like this, where the folders [0-7] are the action classes of the SynDD1 dataset:
      ├── syn1fps_dash
      │   ├── 0
      │   ├── 1
      │   ├── 2
      │   ├── 3
      │   ├── 4
      │   ├── 5
      │   ├── 6
      │   └── 7
      ├── syn5fps_dash
      │   ├── 0
      │   ├── 1
      │   ├── 2
      │   ├── 3
      │   ├── 4
      │   ├── 5
      │   ├── 6
      │   └── 7
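A folder layout like the one above can be turned into (frame path, class ID) pairs with a short directory walk. The sketch below is an assumption about how such a layout might be read, not the repo's own loader; `list_labeled_frames` and the accepted image suffixes are hypothetical.

```python
from pathlib import Path

# Assumed frame formats; adjust to whatever the extraction step writes.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def list_labeled_frames(root):
    """Return a list of (frame_path, class_id) for a folder such as syn1fps_dash."""
    samples = []
    for class_dir in sorted(Path(root).iterdir()):
        # Only numeric folder names (0-7) are treated as class folders.
        if not class_dir.is_dir() or not class_dir.name.isdigit():
            continue
        class_id = int(class_dir.name)
        for frame in sorted(class_dir.rglob("*")):
            if frame.is_file() and frame.suffix.lower() in IMAGE_SUFFIXES:
                samples.append((frame, class_id))
    return samples
```

For example, `list_labeled_frames("syn1fps_dash")` would collect every image under the folders `0` to `7` and ignore anything else.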
  1. The subject split profiles are saved in the driverprofile folder. For SynDD1, see subject_splitting_profile.json; for StateFarm, see driver_img_list.csv.
"fold0": {
    "train": [
    "val": [
    "test": [

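A split profile with this shape can be read as follows. This is a sketch under assumptions: the contents of the `train`, `val`, and `test` lists are not shown above, so they are treated here as plain lists of subject identifiers, and both `load_fold_split` and `check_disjoint` are hypothetical helpers.

```python
import json


def load_fold_split(profile_path, fold="fold0"):
    """Read one fold from a split profile and return its train/val/test lists."""
    with open(profile_path, "r", encoding="utf-8") as f:
        profile = json.load(f)
    fold_entry = profile[fold]
    return {part: list(fold_entry[part]) for part in ("train", "val", "test")}


def check_disjoint(split):
    """Return True if no subject appears in more than one partition."""
    train, val, test = (set(split[part]) for part in ("train", "val", "test"))
    return not (train & val or train & test or val & test)
```

Checking that the partitions are disjoint is a cheap guard against the same driver leaking from the training set into validation or test.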
Steps to run inference on a video file:

  1. Download the video and save it in .MP4 format in a video/ folder.

  2. Create a conda environment and install the dependencies listed in requirements.txt.

  3. Open the inference script and specify the CLIP backbone (model_name) and the FPS (default: 1 FPS). Then run the Python file from the terminal with the command: `python --video video_path --frame Extracted_frame_directory_path`. For example: `python --video video/Dashboard_user_id_13522_NoAudio_5.MP4 --frame frame_folder`

  4. Prediction results are stored in a .json file named frame_prediction.json. The file has the following format:

     ``` { frame_01_path: {prediction label, [list of prediction prob. scores]}, 
           frame_02_path: {prediction label, [list of prediction prob. scores]}, ... } ```
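Once the predictions are written, they can be summarized per label. The sketch below assumes each entry in frame_prediction.json is stored as a two-element list `[label, [probabilities]]` keyed by frame path; the exact serialization is not specified above, so adapt the unpacking if the file differs. All function names are hypothetical.

```python
import json
from collections import Counter


def load_predictions(path="frame_prediction.json"):
    """Load the prediction file into a dict of frame_path -> [label, probs]."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def count_labels(predictions):
    """Count how many frames were assigned each predicted label."""
    return Counter(label for label, _probs in predictions.values())


def low_confidence_frames(predictions, threshold=0.5):
    """Return frame paths whose highest probability is below the threshold."""
    return sorted(
        frame for frame, (_label, probs) in predictions.items()
        if max(probs) < threshold
    )
```

Listing low-confidence frames is one simple way to spot segments of the video where the per-frame prediction may be unreliable.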

Steps to run frame-based experiments:

  1. Set up the conda environment and install the dependencies from the requirements.txt file of the main CLIP repo.
  2. Upload the .mp4 video files into the synvid folder (see synvid/video_list.csv) and run the frame-extraction script to extract frames at different FPS. Store the frames in the [0-7] folder structure shown above.
  3. Check the driverprofile folder to prepare the driver split.
  4. Feed the frames to the CLIP model.
  5. Run the run files.
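Extracting frames at a lower rate than the video's native frame rate comes down to choosing which frame indices to keep. The following sketch shows one way to compute them, assuming evenly spaced sampling; `frame_indices_to_keep` is a hypothetical helper and is not the repo's extraction script.

```python
def frame_indices_to_keep(total_frames, video_fps, target_fps=1.0):
    """Return the indices of frames to save when sampling at target_fps."""
    if video_fps <= 0 or target_fps <= 0:
        raise ValueError("frame rates must be positive")
    # Number of source frames between two saved frames (never below 1).
    step = max(video_fps / target_fps, 1.0)
    indices = []
    i = 0
    while True:
        index = int(i * step)  # multiply instead of accumulating to limit drift
        if index >= total_frames:
            break
        indices.append(index)
        i += 1
    return indices
```

For a 3-second clip at 30 FPS, sampling at 1 FPS keeps one frame per second of video, while sampling at 5 FPS keeps every sixth frame.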

Repo for model comparison: Link

Repo for VideoCLIP: Link