Non-reproducible MSRVTT results - I get R@1 accuracy less than 1% #51

lennartmoritz opened this issue Apr 21, 2024 · 2 comments

lennartmoritz commented Apr 21, 2024

I am trying to verify/reproduce your paper's validation results without training the model myself, and I expected 42.6% R@1 accuracy on MSR-VTT.

But when I follow the instructions from TRAIN_AND_VALIDATE.md (I only ran eval.sh, no training), I get results that are no better than random guessing, at about 0.1% R@1 accuracy. See my out.log here:

Eval Epoch: 0, eval Video-Text Retrieval under MSRVTT test data
2024-04-21,14:07:56 | INFO | MSRVTT sim matrix size: 1000, 1000
2024-04-21,15:02:43 | INFO | Length-T: 1000, Length-V:1000
2024-04-21,15:02:47 | INFO | MSRVTT Text-to-Video:
2024-04-21,15:02:53 | INFO | >>> R@1: 0.0 - R@5: 0.6 - R@10: 0.8 - Median R: 516.0 - Mean R: 518.7
2024-04-21,15:03:00 | INFO | MSRVTT Video-to-Text:
2024-04-21,15:03:03 | INFO | >>> V2T$R@1: 0.1 - V2T$R@5: 0.6 - V2T$R@10: 0.8 - V2T$Median R: 491.0 - V2T$Mean R: 498.2

What I need:

Please tell me how I can select your final model for the eval script so that it reproduces the results you published.

What I suspect is wrong:

Well, I guess the issue is that I am trying to evaluate the untrained model here instead of your trained version.
Maybe I misunderstood the instructions, and the pretrained weights I downloaded are not the same as your fully trained model described in the paper.

I have also tried to obtain your final model by running my eval_msrvtt.sh script with the TRANSFORMERS_OFFLINE=0 environment variable and an empty cache_dir, in the hope of downloading the fully trained version. Strangely enough, this leads to slightly different results in my out.log:

2024-04-19,13:59:28 | INFO | downloading https://huggingface.co/laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K/resolve/main/tokenizer_config.json to /raid/1moritz/models/languagebind/cache_dir/tmpctkzbg3u
2024-04-19,13:59:29 | INFO | downloading https://huggingface.co/laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K/resolve/main/vocab.json to /raid/1moritz/models/languagebind/cache_dir/tmp6_ww7ayw
2024-04-19,13:59:29 | INFO | downloading https://huggingface.co/laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K/resolve/main/merges.txt to /raid/1moritz/models/languagebind/cache_dir/tmp3g7ehptb
2024-04-19,13:59:30 | INFO | downloading https://huggingface.co/laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K/resolve/main/tokenizer.json to /raid/1moritz/models/languagebind/cache_dir/tmp4h042saq
2024-04-19,13:59:31 | INFO | downloading https://huggingface.co/laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K/resolve/main/special_tokens_map.json to /raid/1moritz/models/languagebind/cache_dir/tmp0exqanes
2024-04-19,13:59:31 | INFO | {'vl_ret': [{'msrvtt': <torch.utils.data.dataloader.DataLoader object at 0x7f9015f066b0>}]})
2024-04-19,13:59:31 | INFO |
Eval Epoch: 0, eval Video-Text Retrieval under MSRVTT test data
2024-04-19,14:06:35 | INFO | MSRVTT sim matrix size: 1000, 1000
2024-04-19,14:06:35 | INFO | Length-T: 1000, Length-V:1000
2024-04-19,14:06:35 | INFO | MSRVTT Text-to-Video:
2024-04-19,14:06:35 | INFO | >>> R@1: 0.0 - R@5: 0.4 - R@10: 0.7 - Median R: 511.0 - Mean R: 505.5
2024-04-19,14:06:35 | INFO | MSRVTT Video-to-Text:
2024-04-19,14:06:35 | INFO | >>> V2T$R@1: 0.2 - V2T$R@5: 0.6 - V2T$R@10: 0.9 - V2T$Median R: 500.0 - V2T$Mean R: 504.9

How to reproduce:

I follow TRAIN_AND_VALIDATE.md.

  1. Download the cache of pretrained weights from your Google Drive and specify CACHE_DIR.
  2. Download MSRVTT from the source you mentioned in TRAIN_AND_VALIDATE.md.
  3. Change the data_root here.
  4. Make minimal changes to eval.sh, save it as eval_msrvtt.sh, and execute the script.

This is my eval_msrvtt.sh:

CACHE_DIR="/raid/1moritz/models/languagebind/cache_dir"
RESUME="video_language.pt"
ANNOTATION="path/to/data"
# this script is for 640 total batch_size (n(16) GPUs * batch_size(10) * accum_freq(4))
cd /srv/home/1moritz/Repositories/LanguageBind
# TORCH_DISTRIBUTED_DEBUG=DETAIL HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 torchrun --nnodes=$HOST_NUM --node_rank=$INDEX --nproc_per_node $HOST_GPU_NUM --master_addr $CHIEF_IP \
TORCH_DISTRIBUTED_DEBUG=DETAIL HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 torchrun --nproc_per_node 1 \
    -m main  \
    --train-data ${ANNOTATION} \
    --train-num-samples 3020000 \
    --clip-type "vl" --add-time-attn \
    --lock-text --lock-image --text-type "polish_mplug" \
    --init-temp 0.07 --learn-temp \
    --model "ViT-L-14" --cache-dir ${CACHE_DIR} \
    --convert_to_lora --lora_r 16 \
    --lr 1e-4 --coef-lr 1 \
    --beta1 0.9 --beta2 0.98 --wd 0.2 --eps 1e-6 \
    --num-frames 8 --force-patch-dropout 0.3 \
    --epochs 16 --batch-size 10 --accum-freq 4 --warmup 2000 \
    --precision "amp" --workers 10 --video-decode-backend "imgs" \
    --save-frequency 1 --log-every-n-steps 20 --report-to "tensorboard" --resume "latest" \
    --do_eval \
    --val_vl_ret_data "msrvtt"
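
One more observation on my own script (just a guess, not verified): the RESUME variable pointing at video_language.pt is defined at the top but never passed to main, and --resume "latest" only picks up a checkpoint from the current run's own log directory. If no such checkpoint exists, the evaluation would run on the locked pretrained towers with freshly initialized LoRA and temporal-attention weights, which would fit the near-random numbers above. Replacing --resume "latest" with --resume ${RESUME} (with the correct path to the downloaded checkpoint) might be the missing step, assuming video_language.pt is the released fine-tuned model.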

e1four15f commented Apr 30, 2024

Hi @lennartmoritz, I'm currently using this model for my project and I'm having the same issue with eval_msrvtt.sh.

I wrote my own script for model evaluation. Unfortunately, the FT models do not show the expected results, but the Large models are fine (LanguageBind_Video, LanguageBind_Audio).

You can try running my script; it gave me around 41.50 R@1, 65.80 R@5, 75.50 R@10.

from collections import defaultdict

import torch
import pandas as pd
import numpy as np
from more_itertools import chunked
from tqdm.auto import tqdm

from languagebind import LanguageBind, to_device, transform_dict, LanguageBindImageTokenizer


def compute_metrics(x):
    # Rank the candidates for each query by descending similarity and locate
    # the rank of the ground-truth (diagonal) candidate in each row.
    sx = np.sort(-x, axis=1)
    d = np.diag(-x)
    d = d[:, np.newaxis]
    ind = sx - d
    ind = np.where(ind == 0)
    ind = ind[1]  # zero-based rank of the correct match for each query
    metrics = {}
    metrics['R1'] = float(np.sum(ind == 0)) * 100 / len(ind)
    metrics['R5'] = float(np.sum(ind < 5)) * 100 / len(ind)
    metrics['R10'] = float(np.sum(ind < 10)) * 100 / len(ind)
    metrics['MR'] = np.median(ind) + 1
    metrics["MedianR"] = metrics['MR']
    metrics["MeanR"] = np.mean(ind) + 1
    # metrics["cols"] = [int(i) for i in list(ind)]
    return metrics
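
# A quick sanity check of the metric above (illustrative, not from the original
# script): an identity similarity matrix ranks every ground-truth pair first,
# so compute_metrics(np.eye(5)) gives
# {'R1': 100.0, 'R5': 100.0, 'R10': 100.0, 'MR': 1.0, 'MedianR': 1.0, 'MeanR': 1.0}.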


def main():
    device = torch.device('cuda:0')
    clip_type = {
        'video': 'LanguageBind_Video',  # also 'LanguageBind_Video_FT'
        'audio': 'LanguageBind_Audio',  # also 'LanguageBind_Audio_FT'
        # 'image': 'LanguageBind_Image',
        # 'thermal': 'LanguageBind_Thermal',
        # 'depth': 'LanguageBind_Depth',
    }

    model = LanguageBind(clip_type=clip_type, cache_dir='./cache_dir').to(device)
    model.eval()

    tokenizer = LanguageBindImageTokenizer.from_pretrained('lb203/LanguageBind_Image', cache_dir='./cache_dir/tokenizer_cache_dir')
    modality_transform = {c: transform_dict[c](model.modality_config[c]) for c in clip_type.keys()}

    df = pd.read_csv('../data/MSRVTT/MSRVTT_JSFUSION_test.csv')

    language_data = df['sentence'].values.tolist()
    video_data = df['video_id'].apply(lambda x: f'../data/MSRVTT/videos/all/{x}.mp4').values.tolist()

    def embed(x: list[list], dtypes: list[str]) -> dict:
        inputs = {}
        for data, dtype in zip(x, dtypes):
            if dtype == 'language':
                inputs['language'] = to_device(tokenizer(data, max_length=77, padding='max_length', truncation=True, return_tensors='pt'), device)
            elif dtype in ['image', 'video', 'audio', 'depth', 'thermal']:
                inputs[dtype] = to_device(modality_transform[dtype](data), device)
            else:
                raise ValueError(f'Unsupported dtype: {dtype}')

        with torch.no_grad():
            embeddings = model(inputs)

        embeddings = {k: v.detach().cpu().numpy() for k, v in embeddings.items()}
        return embeddings


    batch_size = 16
    results = defaultdict(lambda: np.empty((0, 768)))  # one empty (0, 768) array per modality, grown batch by batch
    for batch in tqdm(list(zip(
            chunked(language_data, batch_size),
            chunked(video_data, batch_size)
        ))):
        embeddings = embed(
            batch,
            dtypes=['language', 'video']
        )
        results['language'] = np.concatenate([results['language'], embeddings['language']])
        results['video'] = np.concatenate([results['video'], embeddings['video']])

    video = results['video']
    language = results['language']

    np.save('experiments/MSR-VTT_test_video_embeddings.npy', video)
    np.save('experiments/MSR-VTT_test_language_embeddings.npy', language)

    sim_matrix = video @ language.T
    print('VT', compute_metrics(sim_matrix))
    print('TV', compute_metrics(sim_matrix.T))


if __name__ == '__main__':
    main()
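
A caveat on the similarity computation at the end: I am assuming the model returns unit-normalized embeddings, so the raw dot product video @ language.T already behaves like a cosine similarity. If that assumption does not hold for some checkpoint, normalizing the rows before building the matrix is a cheap safeguard:

    # L2-normalize rows; a no-op if the embeddings are already unit vectors
    video = video / np.linalg.norm(video, axis=1, keepdims=True)
    language = language / np.linalg.norm(language, axis=1, keepdims=True)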

lennartmoritz (Author) commented

Hey @e1four15f, thank you for your code example. In the meantime, I wrote a similar script to yours based on the inference example script from the repo. But I've noticed that it is considerably slower than the eval script was. I suspect it has to do with the batch sizes used. Have you found a way to select a batch size for inference with your script?
