Merge pull request #6 from nlpaueb/vkougia_overall_upgrading
Update
vasilikikou authored Oct 11, 2019
2 parents 72ecff3 + b082a19 commit 6a90ed4
Showing 11 changed files with 387 additions and 527 deletions.
36 changes: 23 additions & 13 deletions SiVL19/README.md
@@ -1,7 +1,20 @@
A Survey on Biomedical Image Captioning
=================

Implementation of the baseline and evaluation methods described in the [paper](https://arxiv.org/abs/1905.13302).
Code to download and preprocess the datasets, run the baselines and evaluate
the results as described in the paper
[A Survey on Biomedical Image Captioning](https://www.aclweb.org/anthology/W19-1803).

> V. Kougia, J. Pavlopoulos and I. Androutsopoulos, "A Survey on Biomedical Image Captioning".
> Proceedings of the Workshop on Shortcomings in Vision and Language of the Annual Conference
> of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT 2019), Minneapolis, USA, 2019.

## Dependencies ##
To use this code you will need to install Python 3.6 and the packages listed in the requirements.txt file. To install them, run:
```shell
pip install -r requirements.txt
```
To use the MS COCO evaluation script (*coco_evaluation.py*), follow the instructions described [here](https://github.com/salaniz/pycocoevalcap) to install the library.
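If the linked instructions still apply, installation may be as simple as the command below; this is an assumption based on the library's packaging, so defer to the linked repository if it differs:
```shell
pip install pycocoevalcap
```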

## Datasets ##

@@ -10,7 +23,7 @@
available, so to download it you need to follow the instructions described [here]
in the Participant registration section. Then, you can run the corresponding script that uses the downloaded *csv*
file.
For each dataset, a folder is created that contains the images and the data *tsv* files with
the following format: *image_name <\t> caption*. All data files, as well as the result files, should follow this format.
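For illustration, such a file can be loaded with pandas exactly as the evaluation script below does. The folder and file names here are hypothetical (*train_images.tsv* is the name used by the vocabulary script further down):

```python
import pandas as pd

# Each row of the tsv file is: image_name <tab> caption (no header row).
data = pd.read_csv("iu_xray/train_images.tsv", sep="\t", header=None,
                   names=["image_ids", "captions"])
# Map each image name to its caption.
captions = dict(zip(data.image_ids, data.captions))
```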


```shell
...
```

@@ -30,16 +43,13 @@
the demo script *sivl_run_me.ipynb*.

## Evaluation ##

The evaluation with the WMS can be performed as shown in *sivl_run_me.ipynb*.
To evaluate with the BLEU 1-4, METEOR and ROUGE measures, we used the [MS COCO caption evaluation code](https://github.com/tylin/coco-caption).
After you clone that code and install its requirements, run the following commands to move our
two scripts into the coco-caption folder and perform the evaluation.
To run the main script *mscoco_main_eval.py*, give as arguments the path to the dataset folder that contains
the *json* files and the dataset name.
The evaluation with the WMD and the MS COCO captioning measures can be performed as shown in *sivl_run_me.ipynb*.
You can either use the *compute_wmd* and *compute_scores* methods for the WMD and MS COCO evaluations respectively (as shown in *sivl_run_me.ipynb*),
or run the main methods of the scripts, providing the necessary arguments as shown below:
```shell
git clone https://github.com/tylin/coco-caption.git
mv bio_image_caption/SiVL19/mscoco_main_eval.py coco-caption/
mv bio_image_caption/SiVL19/bio_eval.py coco-caption/
python coco-caption/mscoco_main_eval.py /dataset_folder dataset_name
```
```shell
# For the WMD evaluation:
python wmd_evaluation.py path_to_gold_captions/gold.tsv path_to_results/results.tsv path_to_embeddings/emb.bin

# For the MSCOCO evaluation:
python coco_evaluation.py path_to_gold_captions/gold.tsv path_to_results/results.tsv
```
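The *wmd_evaluation.py* script itself is not shown in this commit. Purely as an illustration of the measure, a Word Mover's Distance between a gold and a generated caption can be computed with gensim roughly as follows; this is a sketch with a hypothetical embeddings path (mirroring the *emb.bin* argument above), not the repository's code:

```python
from gensim.models import KeyedVectors

# Load word embeddings in word2vec binary format (hypothetical path).
# gensim's wmdistance additionally requires the pyemd package.
emb = KeyedVectors.load_word2vec_format("path_to_embeddings/emb.bin", binary=True)

gold = "normal chest x-ray no acute findings".split()
generated = "normal chest x-ray".split()

# Lower distance means the generated caption is closer to the gold one.
print(emb.wmdistance(gold, generated))
```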
89 changes: 0 additions & 89 deletions SiVL19/bio_eval.py

This file was deleted.

79 changes: 79 additions & 0 deletions SiVL19/coco_evaluation.py
@@ -0,0 +1,79 @@
import re
import argparse
import pandas as pd
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge

parser = argparse.ArgumentParser(description="Takes as arguments a file with the gold captions and "
                                             "a file with the generated ones and computes "
                                             "BLEU 1-4, METEOR and ROUGE-L measures")
parser.add_argument("gold", help="Path to tsv file with gold captions")
parser.add_argument("generated", help="Path to tsv file with generated captions")


def preprocess_captions(images_captions):
    """
    :param images_captions: Dictionary with image ids as keys and captions as values
    :return: Dictionary with the processed captions as values
    """

    # Clean for BioASQ
    bioclean = lambda t: re.sub('[.,?;*!%^&_+():-\[\]{}]', '',
                                t.replace('"', '').replace('/', '').replace('\\', '')
                                .replace("'", '').strip().lower())
    pr_captions = {}
    # Apply bioclean to the data
    for image in images_captions:
        # Save caption into a list to match the MSCOCO format
        pr_captions[image] = [bioclean(images_captions[image])]

    return pr_captions


def compute_scores(gts, res):
    """
    Performs the MS COCO evaluation using the Python 3 implementation (https://github.com/salaniz/pycocoevalcap)
    :param gts: Dictionary with the image ids and their gold captions
    :param res: Dictionary with the image ids and their generated captions
    :print: Evaluation score (the mean of the scores of all the instances) for each measure
    """

    # Preprocess captions
    gts = preprocess_captions(gts)
    res = preprocess_captions(res)

    # Set up scorers
    scorers = [
        (Bleu(4), ["Bleu_1", "Bleu_2", "Bleu_3", "Bleu_4"]),
        (Meteor(), "METEOR"),
        (Rouge(), "ROUGE_L")
    ]

    # Compute score for each metric
    for scorer, method in scorers:
        print("Computing", scorer.method(), "...")
        score, scores = scorer.compute_score(gts, res)
        if isinstance(method, list):
            for sc, m in zip(score, method):
                print("%s : %0.3f" % (m, sc))
        else:
            print("%s : %0.3f" % (method, score))


if __name__ == "__main__":

    args = parser.parse_args()
    gold_path = args.gold
    results_path = args.generated

    # Load data
    gts_data = pd.read_csv(gold_path, sep="\t", header=None, names=["image_ids", "captions"])
    gts = dict(zip(gts_data.image_ids, gts_data.captions))

    res_data = pd.read_csv(results_path, sep="\t", header=None, names=["image_ids", "captions"])
    res = dict(zip(res_data.image_ids, res_data.captions))

    compute_scores(gts, res)
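A quick way to sanity-check the script from Python is sketched below. The image name and captions are hypothetical; the printed preprocessing result follows from the bioclean rules above:

```python
from coco_evaluation import preprocess_captions, compute_scores

gts = {"img1.png": "Normal chest X-ray, no acute findings."}
res = {"img1.png": "normal chest x-ray"}

# bioclean lowercases, strips punctuation and wraps each caption in a list.
print(preprocess_captions(gts))  # {'img1.png': ['normal chest x-ray no acute findings']}

# Prints BLEU 1-4, METEOR and ROUGE_L (METEOR additionally requires Java).
compute_scores(gts, res)
```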
86 changes: 0 additions & 86 deletions SiVL19/create_json_files.py

This file was deleted.

68 changes: 40 additions & 28 deletions SiVL19/create_vocabulary.py
@@ -2,44 +2,56 @@
import os


def create_vocabulary(filepath, results_path):
    """
    Creates vocabulary of unique words and computes statistics for the train captions
    :param filepath: The path to the train data tsv file with the form: "image \t caption"
    :param results_path: The folder in which to save the vocabulary file
    :return: The average caption length
    """

    total_words = []
    pr_captions = []
    # Clean for BioASQ
    bioclean = lambda t: re.sub('[.,?;*!%^&_+():-\[\]{}]', '',
                                t.replace('"', '').replace('/', '').replace('\\', '')
                                .replace("'", '').strip().lower()).split()

    # Read data
    with open(filepath, "r") as file:

        for line in file:
            line = line.replace("\n", "").split("\t")

            # Apply bioclean to the caption
            tokens = bioclean(line[1])
            for token in tokens:
                total_words.append(token)
            caption = " ".join(tokens)
            pr_captions.append(caption)

    print("Total number of captions is", len(pr_captions))

    # Find the unique captions in the train data
    unique_captions = set(pr_captions)
    print("Total number of unique captions is", len(unique_captions))

    # Compute the mean caption length
    mean_length = len(total_words)/len(pr_captions)
    print("The average caption length is", mean_length, "words")

    # Create vocabulary of unique words
    vocabulary = set(total_words)
    print("Unique words are", len(vocabulary))
    # Save vocabulary file to dataset folder
    with open(os.path.join(results_path, "vocabulary.txt"), 'w') as output_file:
        for word in vocabulary:
            output_file.write(word)
            output_file.write("\n")

    return mean_length
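For example, after downloading a dataset, the vocabulary could be built along these lines. The folder name is hypothetical (*train_images.tsv* is the file name the pre-update version of this script looked for), and this assumes you run from the folder containing the script:

```python
from create_vocabulary import create_vocabulary

# Writes vocabulary.txt into the dataset folder and returns the mean caption length.
avg_len = create_vocabulary("iu_xray/train_images.tsv", "iu_xray")
print("Average caption length:", avg_len)
```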
