Health Coaching Dialogue Corpus

Health coaching has been recognized as a successful approach for encouraging health behavior changes by having a professional provide evidence-based interventions, setting realistic goals, and encouraging goal adherence. However, health coaching is expensive, time-intensive, and unavailable around the clock. Therefore, we aim to build a health coaching dialogue system that converses with the patients via text messages and helps them to set Specific, Measurable, Attainable, Realistic, and Time-bound (S.M.A.R.T.) goals.

We conducted two rounds of data collection, in which we recruited patients who exchanged SMS messages with certified health coaches for several weeks. Each week the coach conversed with the patients to set, follow up on, and evaluate S.M.A.R.T. goals. We share the data from the second round, as allowed by the IRB approval (protocol 2016-0862, University of Illinois at Chicago), for those patients who explicitly consented to the release. The dataset is thoroughly de-identified. We regret that we cannot release the data from the first round of data collection.

Note: Round 3 dataset is currently being prepared / under de-identification check.

This corpus has been referenced in the following papers:

Human-Human Health Coaching via Text Messages: Corpus, Annotation, and Analysis. SIGDIAL 2020.
Goal Summarization for Human-Human Health Coaching Dialogues. FLAIRS 2020.
Summarizing Behavioral Change Goals from SMS Exchanges to Support Health Coaches. SIGDIAL 2021.
Towards Enhancing Health Coaching Dialogue in Low-Resource Settings. COLING 2022.
Modeling Low-Resource Health Coaching Dialogues via Neuro-Symbolic Goal Summarization and Text-Units-Text Generation. COLING 2024.

Dataset (Round 2)

In the second round, we recruited 30 patients who participated in the study for eight weeks; they are numbered from 01 to 30. The release contains data for 26 patients, because four patients (ids: 01, 02, 10, and 15) did not consent to the release of their data. We include the data in its entirety, except for messages at the end of the study that pertain to the logistics of an in-person exit interview. Additionally, the data for two patients (ids: 06 and 13) contains much less information, since they did not complete the study. Please note that the models in our papers were evaluated on data from 28 round-two patients: we excluded patients 06 and 13 and used all the rest of the data.
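
The patient counts above can be checked with a few set operations. This is only a minimal sketch: the variable names are illustrative and are not part of the release.

```python
# Patient IDs as described in the text (illustrative names, not part of the release).
recruited = set(range(1, 31))   # patients 01..30
no_consent = {1, 2, 10, 15}     # did not consent to the release of their data
incomplete = {6, 13}            # did not complete the study

released = recruited - no_consent          # patients included in this release
released_complete = released - incomplete  # released patients who finished the study
modeled = recruited - incomplete           # patients used to evaluate the models in the papers

print(len(released), len(released_complete), len(modeled))  # 26 24 28
```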

This repository contains two data folders:

|--/human_labeled
| |--patient5_smart.xml
| |--patient5_phases.xml
| |--patient7_smart.xml
| |--patient7_phases.xml

The human_labeled folder contains the dialogue metatext for patients #5 and #7, with SMART tags and goal phases labeled manually (ground truth) using GATE. The stage labels (i.e., the stage for each week, which is either goal_setting or goal_action) are also included in the *phases.xml files.

|--/raw_with_predicted_labels
| |--patient3.csv
| |--patient4.csv
| |--patient5.csv
| |--    ...

The raw_with_predicted_labels folder contains the raw dialogue texts for patients #3 to #30, excluding #10 and #15, one CSV file per patient, together with the model-predicted SMART tags, phases, and dialogue act labels for reference. The columns are:

id: unique ID of the utterances in the order of conversation sequence.
speaker: the interlocutor of the utterance, which is either coach or patient.
utterance: the de-identified content of the utterances.
time: the time stamp of the utterance.
smart: the predicted sequential labels of the SMART tags for each token of an utterance. 'O' indicates that the corresponding token has no SMART tag. See a detailed explanation below.
tokens: the tokenized utterance, a list of tokens used for smart attribute and phase prediction.
phases: the predicted goal phases indicated by the content of utterance. For example, "Great job on your steps yesterday you walked over 2,500! Keep it up today!" is predicted as a follow-up phase.
act: the predicted dialogue act based on the content of utterance.
stage: indicates the stage for each week, which is either goal_setting or goal_action. The goal_setting stage also indicates the beginning of each week. Note this column is manually labeled.
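
The column layout above could be read and split into weeks roughly as follows. This is a sketch under stated assumptions, not the authors' loading code: the sample rows are made up to mimic the layout (they are not corpus data), the list-valued columns are assumed to be stored as Python-style list literals, and the time, phases, and act fields are left blank because their value formats are not specified here. The helper names `load_rows` and `split_into_weeks` are hypothetical.

```python
import ast
import csv
import io

# Made-up rows that only mimic the column layout; NOT real corpus data.
# Assumption: `smart` and `tokens` are stored as Python-style list literals.
SAMPLE_CSV = """\
id,speaker,utterance,time,smart,tokens,phases,act,stage
1,coach,Any goal this week,,"['O', 'O', 'O', 'O']","['Any', 'goal', 'this', 'week']",,,goal_setting
2,patient,walk 5 days,,"['SA', 'MDNum', 'MDNum']","['walk', '5', 'days']",,,goal_setting
3,coach,How is walking going,,"['O', 'O', 'O', 'O']","['How', 'is', 'walking', 'going']",,,goal_action
4,coach,Ready for a new goal,,"['O', 'O', 'O', 'O', 'O']","['Ready', 'for', 'a', 'new', 'goal']",,,goal_setting
5,patient,jog 40min,,"['SA', 'MQDur']","['jog', '40min']",,,goal_setting
"""


def load_rows(handle):
    """Read utterance rows and parse the list-valued columns."""
    rows = []
    for row in csv.DictReader(handle):
        row["tokens"] = ast.literal_eval(row["tokens"])
        row["smart"] = ast.literal_eval(row["smart"])
        rows.append(row)
    return rows


def split_into_weeks(rows):
    """Start a new week whenever the stage switches back to goal_setting."""
    weeks = []
    previous_stage = None
    for row in rows:
        starts_week = row["stage"] == "goal_setting" and previous_stage != "goal_setting"
        if starts_week or not weeks:
            weeks.append([])
        weeks[-1].append(row)
        previous_stage = row["stage"]
    return weeks


rows = load_rows(io.StringIO(SAMPLE_CSV))
weeks = split_into_weeks(rows)
print([len(week) for week in weeks])  # [3, 2]
```

For a real file, the `io.StringIO(SAMPLE_CSV)` handle would be replaced by an opened patient CSV; whether `ast.literal_eval` is the right parser depends on how the lists are actually serialized.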

The SMART tags, with their abbreviations and examples, are:

SA: 'specificity-activity', activity, e.g., walk, jog
ST: 'specificity-time', time, e.g., between 6 and 8, during lunch time
SL: 'specificity-location', location, e.g., at work, in the house
MQAmt: 'measurability-number', amount, e.g., 7700 steps, 5 flights
MQDist: 'measurability-distance', distance, e.g., 2 miles, 3 blocks, from bus stop to walmart
MQDur: 'measurability-duration', duration, e.g., 40min
MDName: 'measurability-name', day names, e.g., fri, monday, mon wed
MDNum: 'measurability-other', day numbers, e.g., 5 days, every day
MRep: 'measurability-repetition', repetition, e.g., twice a day, once a day, daily
AS: 'attainability-score', attainability, e.g., 10, 9, 8

For details regarding how these labels are defined and predicted please refer to the papers listed above.
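
As an illustration of how the per-token SMART labels might be turned into goal attributes, the sketch below groups consecutive tokens that share the same tag. It assumes the labels are plain tag names without B-/I- prefixes, which should be checked against the released files; the function name `extract_spans` and the example utterance are hypothetical and assembled from the examples in the tag list.

```python
# Tag abbreviations and their long names, copied from the list above.
SMART_TAGS = {
    "SA": "specificity-activity",
    "ST": "specificity-time",
    "SL": "specificity-location",
    "MQAmt": "measurability-number",
    "MQDist": "measurability-distance",
    "MQDur": "measurability-duration",
    "MDName": "measurability-name",
    "MDNum": "measurability-other",
    "MRep": "measurability-repetition",
    "AS": "attainability-score",
}


def extract_spans(tokens, labels):
    """Group consecutive tokens with the same non-'O' label into (tag, text) spans.

    Assumption: labels are plain tag names (no B-/I- prefixes).
    """
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    spans = []
    current_tag, current_tokens = None, []
    for token, label in zip(tokens, labels):
        if label != current_tag:
            # Close the span that just ended, unless it was untagged.
            if current_tag not in (None, "O"):
                spans.append((current_tag, " ".join(current_tokens)))
            current_tag, current_tokens = label, []
        current_tokens.append(token)
    if current_tag not in (None, "O"):
        spans.append((current_tag, " ".join(current_tokens)))
    return spans


# Made-up example built from the tag examples above; not corpus data.
tokens = ["walk", "7700", "steps", "at", "work", "on", "monday"]
labels = ["SA", "MQAmt", "MQAmt", "SL", "SL", "O", "MDName"]
for tag, text in extract_spans(tokens, labels):
    print(f"{SMART_TAGS[tag]}: {text}")
```

Note that this grouping merges two adjacent spans of the same tag into one, which is a limitation of any labeling scheme without explicit span boundaries.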

New (2024/12): Limited patient demographic information (race/gender) is now available in HC_r2_demographics.csv.

License

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

Citation

@inproceedings{gupta-etal-2020-human,
    title = "Human-Human Health Coaching via Text Messages: Corpus, Annotation, and Analysis",
    author = "Gupta, Itika  and
      Di Eugenio, Barbara  and
      Ziebart, Brian  and
      Baiju, Aiswarya  and
      Liu, Bing  and
      Gerber, Ben  and
      Sharp, Lisa  and
      Nabulsi, Nadia  and
      Smart, Mary",
    booktitle = "Proceedings of the 21th Annual Meeting of the Special Interest Group on Discourse and Dialogue",
    month = jul,
    year = "2020",
    address = "1st virtual meeting",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.sigdial-1.30",
    pages = "246--256",
}
@inproceedings{zhou-etal-2024-modeling,
    title = "Modeling Low-Resource Health Coaching Dialogues via Neuro-Symbolic Goal Summarization and Text-Units-Text Generation",
    author = "Zhou, Yue  and
      Di Eugenio, Barbara  and
      Ziebart, Brian  and
      Sharp, Lisa  and
      Liu, Bing  and
      Agadakos, Nikolaos",
    editor = "Calzolari, Nicoletta  and
      Kan, Min-Yen  and
      Hoste, Veronique  and
      Lenci, Alessandro  and
      Sakti, Sakriani  and
      Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.1005",
    pages = "11498--11509",
    abstract = "Health coaching helps patients achieve personalized and lifestyle-related goals, effectively managing chronic conditions and alleviating mental health issues. It is particularly beneficial, however cost-prohibitive, for low-socioeconomic status populations due to its highly personalized and labor-intensive nature. In this paper, we propose a neuro-symbolic goal summarizer to support health coaches in keeping track of the goals and a text-units-text dialogue generation model that converses with patients and helps them create and accomplish specific goals for physical activities. Our models outperform previous state-of-the-art while eliminating the need for predefined schema and corresponding annotation. We also propose a new health coaching dataset extending previous work and a metric to measure the unconventionality of the patient{'}s response based on data difficulty, facilitating potential coach alerts during deployment.",
}

Acknowledgment

This data was collected under award # 1838770 SCH: INT: The Virtual Assistant Health Coach: Learning to Autonomously Improve Health Behaviors, by the US National Science Foundation.
