
uic-nlp-lab/virtualcoachdata


Health Coaching Dialogue Corpus

Health coaching has been recognized as a successful approach for encouraging health behavior changes by having a professional provide evidence-based interventions, setting realistic goals, and encouraging goal adherence. However, health coaching is expensive, time-intensive, and unavailable around the clock. Therefore, we aim to build a health coaching dialogue system that converses with the patients via text messages and helps them to set Specific, Measurable, Attainable, Realistic, and Time-bound (S.M.A.R.T.) goals.

We conducted two rounds of data collection, in which we recruited patients who exchanged SMS messages with certified health coaches for several weeks. Each week, the coach conversed with the patients to set, follow up on, and evaluate S.M.A.R.T. goals. We share the data from the second round, as allowed by the IRB approval (protocol 2016-0862, University of Illinois at Chicago), for those patients who explicitly consented to the release. The dataset is thoroughly de-identified. We regret that we cannot release the data from the first round of data collection.

Note: The Round 3 dataset is currently being prepared and is undergoing a de-identification check.

This corpus has been referenced in the papers listed in the Citation section below.

Dataset (Round 2)

In the second round, we recruited 30 patients who participated in the study for eight weeks; they are numbered from 01 to 30. The release contains data for 26 patients, because four patients (ids: 01, 02, 10, and 15) did not consent to the release of their data. We include the data in its entirety, except for messages at the end of the study that pertain to the logistics of an in-person exit interview. Additionally, the data for two patients (ids: 06 and 13) contains much less information, since they did not complete the study. Please note that the models in our papers were evaluated on data from 28 round-two patients: we excluded patients 06 and 13 and used all the rest of the data.
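The two patient counts above (26 released, 28 used for evaluation) come from different exclusion sets, which is easy to misread. The following is a minimal sketch of that arithmetic, using only the patient ids stated in the paragraph; the variable names are illustrative and nothing here reads the actual data.

```python
# Patient ids as described in the text; this does not touch the data files.
all_patients = set(range(1, 31))      # patients 01..30
no_consent = {1, 2, 10, 15}           # did not consent to the release
incomplete = {6, 13}                  # did not complete the study

released = all_patients - no_consent   # patients whose data is in this repository
evaluated = all_patients - incomplete  # patients used to evaluate the models in the papers

print(len(released), len(evaluated))   # 26 28
print(sorted(released & incomplete))   # [6, 13]: released, but with much less data
```

Anyone trying to reproduce the setup described in the papers from the released files alone would therefore be working with a different patient set than the one used for evaluation.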

This repository contains two folders:

|--/human_labeled
| |--patient5_smart.xml
| |--patient5_phases.xml
| |--patient7_smart.xml
| |--patient7_phases.xml

The human_labeled folder contains the dialogue metatext for patients #5 and #7, with SMART tags and goal phases labeled manually (ground truth) using GATE. The stage labels (i.e., the stage for each week, which is either goal_setting or goal_action) are also included in the *phases.xml files.

|--/raw_with_predict_labels
| |--patient3.csv
| |--patient4.csv
| |--patient5.csv
| |--    ...

The raw_with_predict_labels folder contains all the raw dialogue texts for patients #3 to #30, excluding #10 and #15, together with the model-predicted SMART tags, phases, and dialogue act labels, provided for reference. The columns are:

  • id: unique ID of the utterances in the order of conversation sequence.
  • speaker: the interlocutor of the utterance, which is either coach or patient.
  • utterance: the de-identified content of the utterances.
  • time: the time stamp of the utterance.
  • smart: the predicted sequential labels of the SMART tags for each token of an utterance. 'O' indicates that the corresponding token has no SMART tag. See a detailed explanation below.
  • tokens: the tokenized utterance, a list of tokens used for smart attribute and phase prediction.
  • phases: the predicted goal phases indicated by the content of utterance. For example, "Great job on your steps yesterday you walked over 2,500! Keep it up today!" is predicted as a follow-up phase.
  • act: the predicted dialogue act based on the content of utterance.
  • stage: indicates the stage for each week, which is either goal_setting or goal_action. The goal_setting stage also indicates the beginning of each week. Note this column is manually labeled.
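Since the goal_setting stage marks the beginning of each week, the stage column can be used to cut a dialogue into weekly segments. The sketch below is one possible way to do that, under stated assumptions: every row carries a stage value, the file has a header row with the column names listed above, and a new week starts whenever stage switches to goal_setting. The sample rows are invented for illustration and are not taken from the corpus; split_into_weeks is a hypothetical helper, not part of this repository.

```python
import csv
import io

# Invented stand-in for one patient CSV; only the columns needed here are included.
SAMPLE = """id,speaker,utterance,stage
1,coach,What goal would you like to set this week?,goal_setting
2,patient,Walk 5000 steps a day,goal_setting
3,coach,How did the walk go today?,goal_action
4,coach,Ready to set a new goal?,goal_setting
5,patient,Yes 6000 steps,goal_setting
"""

def split_into_weeks(rows):
    """Group rows into weeks; a week starts when `stage` switches to goal_setting."""
    weeks = []
    previous_stage = None
    for row in rows:
        stage = row["stage"]
        if stage == "goal_setting" and previous_stage != "goal_setting":
            weeks.append([])
        if not weeks:
            # The file starts in goal_action: keep those rows in a first segment.
            weeks.append([])
        weeks[-1].append(row)
        previous_stage = stage
    return weeks

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
weeks = split_into_weeks(rows)
print([[row["id"] for row in week] for week in weeks])
```

To run this on a real file, replace io.StringIO(SAMPLE) with an opened file handle (for example open(path, newline="", encoding="utf-8")), after checking that the header names match.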

For the SMART tags used and corresponding abbreviation/examples:

  • SA: 'specificity-activity', activity, e.g., walk, jog
  • ST: 'specificity-time', time, e.g., between 6 and 8, during lunch time
  • SL: 'specificity-location', location, e.g., at work, in the house
  • MQAmt: 'measurability-number', amount, e.g., 7700 steps, 5 flights
  • MQDist: 'measurability-distance', distance, e.g., 2 miles, 3 blocks, from bus stop to walmart
  • MQDur: 'measurability-duration', duration, e.g., 40min
  • MDName: 'measurability-name', day names, e.g., fri, monday, mon wed
  • MDNum: 'measurability-other', day numbers, e.g., 5 days, every day
  • MRep: 'measurability-repetition', repetition, e.g., twice a day, once a day, daily
  • AS: 'attainability-score', attainability, e.g., 10, 9, 8

For details regarding how these labels are defined and predicted, please refer to the papers listed in the Citation section below.
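Because the smart column holds one label per token, the tagged phrases can be recovered by pairing it with the tokens column and merging runs of identical labels. The following is a minimal sketch under two assumptions that should be checked against the actual files: that both cells are stored as Python-style list literals, and that labels are the bare abbreviations above (plus 'O') without B-/I- prefixes. The example cells are invented, and parse_list and extract_spans are hypothetical helpers.

```python
import ast

# Abbreviation -> full name, copied from the list above.
SMART_TAGS = {
    "SA": "specificity-activity",
    "ST": "specificity-time",
    "SL": "specificity-location",
    "MQAmt": "measurability-number",
    "MQDist": "measurability-distance",
    "MQDur": "measurability-duration",
    "MDName": "measurability-name",
    "MDNum": "measurability-other",
    "MRep": "measurability-repetition",
    "AS": "attainability-score",
}

def parse_list(cell):
    """Parse a list-valued CSV cell (assumed to be a Python list literal)."""
    return ast.literal_eval(cell)

def extract_spans(tokens, labels):
    """Merge consecutive tokens that share the same non-'O' label into spans."""
    spans = []
    current_tag, current_tokens = None, []
    for token, label in zip(tokens, labels):
        if label != current_tag:
            if current_tag not in (None, "O"):
                spans.append((current_tag, " ".join(current_tokens)))
            current_tag, current_tokens = label, []
        current_tokens.append(token)
    if current_tag not in (None, "O"):
        spans.append((current_tag, " ".join(current_tokens)))
    return spans

# Invented example cells, in the assumed storage format.
tokens = parse_list("['I', 'will', 'walk', '7700', 'steps', 'at', 'work']")
labels = parse_list("['O', 'O', 'SA', 'MQAmt', 'MQAmt', 'SL', 'SL']")

spans = extract_spans(tokens, labels)
for tag, text in spans:
    print(f"{SMART_TAGS[tag]}: {text}")
```

Note that merging by identical label cannot separate two adjacent phrases that carry the same tag; if the files turn out to use B-/I- prefixes, the boundary logic would need to be adapted accordingly.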

New (2024/12): Limited patient demographic information (race/gender) is now available.

License

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

Citation

@inproceedings{gupta-etal-2020-human,
    title = "Human-Human Health Coaching via Text Messages: Corpus, Annotation, and Analysis",
    author = "Gupta, Itika  and
      Di Eugenio, Barbara  and
      Ziebart, Brian  and
      Baiju, Aiswarya  and
      Liu, Bing  and
      Gerber, Ben  and
      Sharp, Lisa  and
      Nabulsi, Nadia  and
      Smart, Mary",
    booktitle = "Proceedings of the 21th Annual Meeting of the Special Interest Group on Discourse and Dialogue",
    month = jul,
    year = "2020",
    address = "1st virtual meeting",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.sigdial-1.30",
    pages = "246--256",
}
@inproceedings{zhou-etal-2024-modeling,
    title = "Modeling Low-Resource Health Coaching Dialogues via Neuro-Symbolic Goal Summarization and Text-Units-Text Generation",
    author = "Zhou, Yue  and
      Di Eugenio, Barbara  and
      Ziebart, Brian  and
      Sharp, Lisa  and
      Liu, Bing  and
      Agadakos, Nikolaos",
    editor = "Calzolari, Nicoletta  and
      Kan, Min-Yen  and
      Hoste, Veronique  and
      Lenci, Alessandro  and
      Sakti, Sakriani  and
      Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.1005",
    pages = "11498--11509",
    abstract = "Health coaching helps patients achieve personalized and lifestyle-related goals, effectively managing chronic conditions and alleviating mental health issues. It is particularly beneficial, however cost-prohibitive, for low-socioeconomic status populations due to its highly personalized and labor-intensive nature. In this paper, we propose a neuro-symbolic goal summarizer to support health coaches in keeping track of the goals and a text-units-text dialogue generation model that converses with patients and helps them create and accomplish specific goals for physical activities. Our models outperform previous state-of-the-art while eliminating the need for predefined schema and corresponding annotation. We also propose a new health coaching dataset extending previous work and a metric to measure the unconventionality of the patient{'}s response based on data difficulty, facilitating potential coach alerts during deployment.",
}

Acknowledgment

This data was collected under award #1838770, "SCH: INT: The Virtual Assistant Health Coach: Learning to Autonomously Improve Health Behaviors", from the US National Science Foundation.
