
SEMPI: A Database for Understanding Social Engagement in Video-Mediated Multiparty Interaction

Maksim Siniukov* · Yufeng Yin* · Eli Fast · Yingshan Qi · Aarav Monga · Audrey Kim · Mohammad Soleymani
University of Southern California
*Equal Contribution

We present a database for automatic understanding of Social Engagement in MultiParty Interaction (SEMPI). Social engagement is an important social signal characterizing the level of participation of an interlocutor in a conversation; it involves maintaining attention and establishing connection and rapport. Machine understanding of social engagement can enable an autonomous agent to better gauge the state of human participation and involvement and to select optimal actions in human-machine social interaction. Recently, video-mediated interaction platforms, e.g., Zoom, have become very popular. The ease of use and increased accessibility of video calls have made them a preferred medium for multiparty conversations, including support groups and group therapy sessions. To create this dataset, we first collected a set of publicly available video calls posted on YouTube. We then segmented the videos by speech turn and cropped them to generate single-participant videos. We developed a questionnaire for assessing the level of social engagement of listeners in a conversation, probing the nonverbal behaviors relevant to social engagement, including back-channeling, gaze, and expressions. Using Prolific, a crowd-sourcing platform, we had 3,505 videos of 76 listeners annotated by three raters each, reaching a moderate to high inter-rater agreement of 0.693. This resulted in a database with engagement scores aggregated across annotators. We developed a baseline multimodal pipeline using state-of-the-art pre-trained models to track the level of engagement, achieving a CCC score of 0.454. The results demonstrate the utility of the database for future applications in video-mediated human-machine interaction and human-human social skill assessment.

Download the data

Download the labels and the extracted features from here. For the video and audio files, fill in the form. Extract the files to 'data/'. Use the Python script 'data/engagement/code/crop_face.py' to obtain cropped face images and extract frames from the videos. Training the models requires both the HuBERT and InceptionI3D model weights, which are included in the download. For licensing details, please refer to the HuBERT and InceptionI3D model licenses.
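
For reference, a minimal sketch of what the face-cropping step looks like with dlib and the bundled 68-point landmark model is shown below; the frame path is a placeholder, and the actual crop_face.py may handle detection, padding, and alignment differently.

# Minimal illustration of face cropping with dlib; the provided
# data/engagement/code/crop_face.py may differ in details.
import cv2
import dlib

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor(
    "data/engagement/code/shape_predictor_68_face_landmarks.dat")

frame = cv2.imread("frame_0001.jpg")            # placeholder frame path
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

for i, rect in enumerate(detector(gray, 1)):    # detected face boxes
    landmarks = predictor(gray, rect)           # 68 landmarks, usable for alignment
    x0, y0 = max(rect.left(), 0), max(rect.top(), 0)
    cv2.imwrite(f"face_{i}.jpg", frame[y0:rect.bottom(), x0:rect.right()])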

Download pretrained models

Download the engagement model weights from here. Put the weights from 'model_weights/pretrained_models' into 'code/checkpoints/engagement/'. Put the weights with the corresponding names from 'model_weights/' into 'code/checkpoints/hubert_base_ls960.pt', 'code/checkpoints/hubert_large_ll60k.pt', 'code/checkpoints/rgb_imagenet.pt', and 'data/engagement/code/shape_predictor_68_face_landmarks.dat'.

After that, you should have a folder structure like the one below:

code
├── requirements.txt
├── data.py
├── metr.py
├── solver_base.py               
├── videotransforms.py       
├── solver_data_process.py      
├── main.py                      # Script for model training
├── eval.py                      # Script for model evaluation
├── checkpoints/
│   ├── hubert_base_ls960.pt
│   ├── hubert_large_ll60k.pt
│   ├── rgb_imagenet.pt
│   ├── shape_predictor_68_face_landmarks.dat
│   └── engagement/              # Pretrained model weights
│       ├── model_ccc_fold_0.pt
│       ...
│       └── model_ccc_fold_4.pt
└── models/
    ├── hubert.py                # Implementation of the HuBERT model
    ├── multimodal.py            # Multimodal model
    └── pytorch_i3d.py           # I3D model for video analysis
data
└── engagement/
    ├── video                    # Video clips
    ├── code/                    # Scripts for feature extraction
    │   ├── crop_face.py         # Crops face regions from videos
    │   ├── get_audio.py         # Extracts audio from videos
    │   ├── get_text.py          # Converts speech to text
    │   ├── get_video.py         # Extracts and aligns frames
    │   ├── utils.py            
    │   └── shape_predictor_68_face_landmarks.dat  # Facial landmark detection model
    ├── frame                    # Extracted video frames
    ├── aligned_frame            # Aligned frames
    ├── audio                    # Extracted audio
    ├── featopenface             # Extracted facial Action Units (AUs) using OpenFace
    ├── text                     # Extracted text features from speech
    ├── annotations_raw_clf_reg.csv  # Raw engagement scores from annotators
    ├── label_0402_fold_0        # Regression labels for fold 0
    ...
    ├── label_0402_fold_4        # Regression labels for fold 4
    └── additional_labels        # Additional labels for classification
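
To confirm that the weights and data ended up where the code expects them, a quick sanity check along these lines (paths taken from the listing above) can help.

# Check that the checkpoints and key data files from the listing above exist.
from pathlib import Path

expected = [
    "code/checkpoints/hubert_base_ls960.pt",
    "code/checkpoints/hubert_large_ll60k.pt",
    "code/checkpoints/rgb_imagenet.pt",
    "code/checkpoints/shape_predictor_68_face_landmarks.dat",
    "data/engagement/code/shape_predictor_68_face_landmarks.dat",
    "data/engagement/annotations_raw_clf_reg.csv",
] + [f"code/checkpoints/engagement/model_ccc_fold_{k}.pt" for k in range(5)]

missing = [p for p in expected if not Path(p).exists()]
print("All files in place." if not missing else f"Missing: {missing}")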

Set up environment for training

Create a conda environment with PyTorch matching your CUDA version, then install fairseq and the other dependencies.

cd code
conda create --name eng_env python=3.9
conda activate eng_env
pip install torch==2.2.2+cu121 torchvision==0.17.2+cu121 torchaudio==2.2.2+cu121 -f https://download.pytorch.org/whl/torch_stable.html  # Install PyTorch for your CUDA version 
pip install -r requirements.txt
git clone https://github.com/pytorch/fairseq # Install fairseq
cd fairseq
pip install --editable ./
cd ..
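
Before training, it is worth confirming that PyTorch sees the GPU and that fairseq imports cleanly; a quick check could look like this (assuming the installation steps above completed without errors).

# Verify the environment: CUDA-enabled PyTorch and an importable fairseq.
import torch
import fairseq

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("fairseq:", fairseq.__version__)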

Model training

Use the following command to train and evaluate the model (shown here for fold 0):

python main.py --ckpt_name label_0402_fold_0 --device 1 --model_freeze part --targettype ccc --num_labels 1 --label_root label_0402_3REG_1 --activation_fn tanh --name label_0402_fold_0 --extra_dropout 1 --kfolds 1 --hidden_size 32 --weight_decay 0.01 --expnum 8 --openfacefeat 1 --openfacefeat_extramlp 1 --openfacefeat_extramlp_dim 64
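
The --targettype ccc flag refers to the concordance correlation coefficient used to score the regression output (presumably computed in metr.py). For reference, the standard definition of CCC is sketched below; the repository's own implementation may differ in numerical details.

# Standard concordance correlation coefficient (CCC) between targets and predictions.
import numpy as np

def ccc(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    mu_t, mu_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()
    cov = ((y_true - mu_t) * (y_pred - mu_p)).mean()
    return 2 * cov / (var_t + var_p + (mu_t - mu_p) ** 2)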

Model evaluation

Use the following command to evaluate the model (shown here for fold 0):

python eval.py --ckpt_name label_0402_fold_0 --device 1 --model_freeze part --targettype ccc --num_labels 1 --label_root label_0402_3REG_1 --activation_fn tanh --name label_0402_fold_0 --extra_dropout 1 --kfolds 1 --hidden_size 32 --weight_decay 0.01 --expnum 8 --openfacefeat 1 --openfacefeat_extramlp 1 --openfacefeat_extramlp_dim 64
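
To evaluate all five folds in one pass, a small wrapper such as the one below can be used; it assumes eval.py accepts the same flags for every fold, with only --ckpt_name and --name changing per fold.

# Run eval.py for each of the five folds, mirroring the single-fold command above.
import subprocess

base = ("python eval.py --device 1 --model_freeze part --targettype ccc "
        "--num_labels 1 --label_root label_0402_3REG_1 --activation_fn tanh "
        "--extra_dropout 1 --kfolds 1 --hidden_size 32 --weight_decay 0.01 "
        "--expnum 8 --openfacefeat 1 --openfacefeat_extramlp 1 "
        "--openfacefeat_extramlp_dim 64")

for k in range(5):
    fold = f"label_0402_fold_{k}"
    subprocess.run(f"{base} --ckpt_name {fold} --name {fold}", shell=True, check=True)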

License

SEMPI is available under a USC Research License.

Third-party components may be subject to their own licenses. Please contact their respective authors to obtain them.

Citation

@inproceedings{2024SEMPI,
  author    = {Siniukov, Maksim and Yin, Yufeng and Fast, Eli and Qi, Yingshan and Monga, Aarav and Kim, Audrey and Soleymani, Mohammad},
  title     = {SEMPI: A Database for Understanding Social Engagement in Video-Mediated Multiparty Interaction},
  booktitle = {Proceedings of the 26th International Conference on Multimodal Interaction},
  month     = {July},
  year      = {2024},
  doi       = {10.1145/3678957.3685752}
}
