
Question about frame_difference metric calculation #31

Open
yankee624 opened this issue Aug 29, 2024 · 4 comments

@yankee624 commented Aug 29, 2024

Hi, thanks for the great work!

I have a question about the following line:

to_append_num_frames = min(next_turn_num_frames, turn_num_frames - 1) # avoid bias. current as center, two equal left/right side

This line caps frame_difference at a certain value. (The comment says it is to 'avoid bias', but I don't quite understand that; I would really appreciate a more detailed explanation.)
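To make sure I'm reading it correctly, here is a minimal sketch of how I understand the cap (the surrounding evaluation logic is paraphrased by me, so everything except the quoted line is my assumption, not the repo's actual code):

```python
# My paraphrase of the cap; only the quoted line is taken from the repo.
def capped_frame_diff(reply_delay: int, turn_num_frames: int, next_turn_num_frames: int) -> int:
    # reply_delay: how many frames after the ground-truth reply frame the model actually replies
    # (it can be large if the model keeps silent well past the current turn).
    to_append_num_frames = min(next_turn_num_frames, turn_num_frames - 1)  # the quoted line
    # The measured frame difference can never exceed the number of appended frames.
    return min(reply_delay, to_append_num_frames)

print(capped_frame_diff(reply_delay=10, turn_num_frames=3, next_turn_num_frames=5))  # 2
```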

Specifically, when the model fails to reply (i.e., output EOS) before the current turn ends:

  • next_turn_num_frames: this tests whether the model can reply within the next turn, but not within the next-next turn or any later turn. The model might still be able to reply in a later turn, so I don't understand why the maximum is set to the next turn.

  • number of frames in current turn - 1: I also don't understand this. In the extreme case where the current turn has only 1 frame, it simply forces frame_diff to 0.

    • For example, let's assume the following frames and responses:
      [a] [a] [b] "b appeared" [c] "c appeared" [c] [c] [d] "d appeared"
      where [k] is a frame with content k, and responses are shown in quotes.
      Turn 1: [a] [a] [b] "b appeared" (num frames = 3)
      Turn 2: [c] "c appeared" (num frames = 1)
      Turn 3: [c] [c] [d] "d appeared" (num frames = 3)

    Let's say we're currently in Turn 2. Then, even if the model fails to reply at the first (and only) frame [c], frame_diff becomes 0 because number of frames in current turn - 1 = 0. I think we also have to consider whether the model can reply within the frames of Turn 3 (a short sketch of this case follows below).
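Plugging the example numbers into the quoted line (just the arithmetic, under my reading above):

```python
# Turn 2 of the example: a single frame [c] in the current turn, three frames in Turn 3.
turn_num_frames = 1
next_turn_num_frames = 3
to_append_num_frames = min(next_turn_num_frames, turn_num_frames - 1)
print(to_append_num_frames)  # 0, so frame_diff is clamped to 0 no matter when the model replies
```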

@chenjoya (Collaborator)

Hi, 'avoiding the bias' means preventing the model from exploiting a simple behavior, e.g. always predicting right after the ground-truth timestamp. The motivation is to give an equal chance to (1) predicting in advance and (2) predicting afterwards.

The case you provided is correct; the problem does exist. Many thanks for the careful checking! Maybe we can make the evaluation temporal boundary the span between the last response and the next response, with no min() operation here.
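Roughly like this (just a sketch of the idea, not a tested patch):

```python
next_turn_num_frames = 3  # example value
# current evaluation:
# to_append_num_frames = min(next_turn_num_frames, turn_num_frames - 1)
# possible change discussed here: let the boundary run up to the next response, with no extra cap.
to_append_num_frames = next_turn_num_frames
```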

BTW, do you have any other advice on how to improve this? I can update the results in a new arXiv version.

@yankee624 (Author) commented Aug 30, 2024

Thank you for the reply!
I think 'avoiding the bias' may make sense at training time, where we can train the model to produce early and late responses equally. But the frame_diff metric is computed at evaluation time, where the model has already been trained to respond in a particular way, and we are forcefully capping the maximum frame_diff, which seemed a bit unnatural to me. As you said, using to_append_num_frames = next_turn_num_frames without the min() operation seems better (although this still caps frame_diff at next_turn_num_frames... it may also be helpful to report something like "frame_diff relative to turn_num_frames"; see the rough sketch below).
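Something like this, purely illustrative (the name and normalization are made up by me):

```python
# Report the delay relative to the turn length, so the same absolute frame_diff
# is judged differently in short turns vs. long turns.
def relative_frame_diff(frame_diff: int, turn_num_frames: int) -> float:
    return frame_diff / max(turn_num_frames, 1)

print(relative_frame_diff(2, 3))   # ~0.67 for a 3-frame turn
print(relative_frame_diff(2, 30))  # ~0.07 for a 30-frame turn
```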

Another thought: I can't think of a scenario where an early response is useful. For example, as shown in Figure 4 of the paper, the user might query something like "Remind me when a yellow card appears"; in that case an early response is useless, since the yellow card hasn't appeared yet, and only a response after the yellow card occurs is useful. So I thought only just-in-time or late responses should be considered in the frame_diff calculation. But I'm not sure about this; if there is a scenario where an early response can also be useful, please point it out.

@chenjoya (Collaborator) commented Sep 5, 2024

Hi, many thanks for the discussion! I think limiting by next_turn_num_frames is reasonable, since the next turn may contain new context that requires the model to produce output. Therefore, it is not suitable to calculate frame_diff using frames from turns after the next one.

However, I agree with you that we can also evaluate model performance without the min(). I will update the results here first and add a table in the supplementary material to reflect that. For now, I think the metric is still fair for comparing ablations, since all variants use the same evaluation.

An early response can also be useful. For example, when a dangerous situation is about to occur, it would be better for the model to report in advance rather than with a delay.

@yankee624 (Author)

Thank you so much for the discussion! I really appreciate your help.
Still, for the early-response scenario, I think the model should report at the moment when "a dangerous situation is about to occur" (when some dangerous signals are captured), not earlier than that. An earlier response would mean the model is reporting even when there is no signal of danger yet. So I thought only "just-in-time" and "late" responses are meaningful, not "early" ones.
I will think more about the scenarios. Thank you again for the discussion!
