The goal of this project is to understand how software engineers comprehend computer programs.
Do they use beacons or references in programs to ease comprehension? Are there critical parts of a program that they tend to focus on and spend more time on?
We also investigate how state-of-the-art generative models such as GPT perform at identifying such beacons.
conda create -n program-comprehension python=3.8.11
conda env update --file env.yml --prune
conda activate program-comprehension
or
PIP_EXISTS_ACTION=w conda env create -f env.yml
Programs used in the behavioral experiments were sourced from the following repositories:
https://github.com/githubhuyang/refactory
https://github.com/jkoppel/QuixBugs
First, get model output information for each stimulus:
For each problem, this mode generates a torch pickle (.pkl) containing a dict mapping tokens to tensors.
Path: ./experiments/custom-anonym
python comprehend/model_outputs.py \
--model_names santa-coder \
--number_of_records -1 \
--dataset_name custom-anonym \
--dataset_path ./data \
--infer_interval 1 \
--expt_dir ./experiments \
--mode 1
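The mode-1 output can be consumed downstream with torch.load. The sketch below illustrates the on-disk format described above (a dict mapping tokens to tensors); the token names, tensor sizes, and the example filename are illustrative assumptions, not the script's actual output.

```python
import torch

# Illustrative stand-in for one problem's mode-1 output: a dict
# mapping tokens to tensors (token names and shapes are assumptions).
example = {"def": torch.zeros(4), "main": torch.ones(4)}
torch.save(example, "example_problem.pkl")

# Loading mirrors how a downstream script would consume a real file
# from ./experiments/custom-anonym.
loaded = torch.load("example_problem.pkl")
for token, tensor in loaded.items():
    print(token, tuple(tensor.shape))
```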
This mode generates a CSV for each problem.
Path: ./experiments/custom-anonym
python comprehend/model_outputs.py \
--model_names santa-coder \
--number_of_records -1 \
--dataset_name custom-anonym \
--dataset_path ./data \
--infer_interval 1 \
--expt_dir ./experiments \
--mode 2
Next, align the model output data with participant responses, available as Qualtrics data (which needs to be placed in ./data).
python comprehend/prepare_dataset.py \
--responses_path "data/code-comprehend_March 13, 2023_10.00.xlsx" \
--token_wise_ll_support_path experiments/custom-anonym \
--token_wise_representations_path experiments/custom-anonym \
--out_path experiments/results
Finally, analyze the prepared data by training models:
python comprehend/analyze.py \
--dataset_path experiments/results
Example arguments for comprehend/model_outputs.py:
[
"--model_names", "codeberta-small",
"--number_of_records", "-1",
"--infer_interval", "2",
"--dataset_name", "custom-anonym",
"--dataset_path", "./data",
]
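This args array matches the shape used by a VS Code launch configuration. A minimal sketch of a full entry (the "name" and "program" values are assumptions):

```json
{
    "name": "model_outputs (custom-anonym)",
    "type": "python",
    "request": "launch",
    "program": "comprehend/model_outputs.py",
    "args": [
        "--model_names", "codeberta-small",
        "--number_of_records", "-1",
        "--infer_interval", "2",
        "--dataset_name", "custom-anonym",
        "--dataset_path", "./data"
    ]
}
```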