GPT4 performance does not seem to be as bad as reported in the paper #4

wangyu-ustc · 2024-10-21T15:13:10Z

Hi authors, I gave GPT-4o the following prompt:

Below is a conversation between two people: James and John. The conversation takes place over multiple days and the date of each conversation is wriiten at the beginning of the conversation.

DATE: 8:57 pm on 7 November, 2022
CONVERSATION:
James said, "Hey John! Guess what? Me and my family are currently on the road trip! We`ve already visited my friends Josh and Mark and had such a great time!"
 and shared a photo of a group of people posing for a picture.
John said, "Hey James! That sounds awesome! I had a super fun weekend - I worked with a game developer on a project and it was great to see my ideas come to life. It was an incredible experience! "
 and shared a photo of a wooden board game with four pieces of wood and four pieces of colored balls.
James said, "That sounds amazing. What was the project you worked on?"

John said, "I collaborated with a game developer to create an online board game - it's a fun and unique experience!"

James said, " I can imagine how proud you must feel seeing your ideas come to life in a game. Has it been released for others to try yet?"

John said, "We're about to release a demo soon so others can try it out. Can't wait for their feedback and suggestions."

James said, "Can't wait to try it. Keep me posted when it's out - I wanna support you and give my thoughts."

John said, "Appreciate your support. I'll definitely let you know when it's out and I'm really excited to hear your thoughts."

James said, "By the way, we did one good thing on the way to Mark and Josh."
... (Here I omitted a lot of text)



Based on the above conversations, write short answers for each of the following questions in a few words. 
Write the answers in the form of a json dictionary where each entry contains the question number as "key" and the short answer as "value". 
Use single-quote characters for named entities and double-quote characters for enclosing json elements. Answer with exact words from the conversations whenever possible.

0: What did James lose progress on due to a power outage?
1: What games were played at the gaming tournament organized by John on 31 October, 2022?
2: What was the purpose of the gaming tournament organized by John on 31 October, 2022?
3: What decision did James and Samantha make on 31 October, 2022?
4: Where did James and Samantha decide to live together on 31 October, 2022?
5: Why did James and Samantha choose an apartment near McGee's bar?
6: What game is John hooked on playing on 5 November, 2022?
7: What did John suggest James practice before playing FIFA 23 together?
8: What project did John work on with a game developer by 7 November, 2022?
9: What is the name of John's cousin's dog?

And I got these outputs:

{
  "0": "a game",
  "1": "Fortnite, Overwatch, Apex Legends",
  "2": "raising money for a children's hospital",
  "3": "move in together",
  "4": "an apartment not far from McGee's bar",
  "5": "they love spending time there",
  "6": "FIFA 23",
  "7": "a little first",
  "8": "an online board game",
  "9": "Luna"
}

The ground-truth answers are:

{
  "0": "a game",
  "1": "Fortnite, Overwatch, Apex Legends",
  "2": "To raise money for a children's hospital",
  "3": "To move in together",
  "4": "In an apartment not far from McGee's bar",
  "5": "They love spending time together at the bar",
  "6": "FIFA 23",
  "7": "Control with a gamepad and timing",
  "8": "An online board game",
  "9": "Luna"
}

It looks like GPT-4o reaches at least 90% accuracy here, so I would expect the F1 score to be high as well. In the paper, however, GPT-4-turbo usually scores under 30% F1. Am I doing something wrong? Or is this not how the models are evaluated in the paper? Thanks!
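
For reference, here is a minimal sketch of how a token-level F1 could be computed on the ten answers above. It assumes SQuAD-style normalization (lowercasing, stripping punctuation and the articles a/an/the) and a simple per-question average; the helper names `normalize` and `token_f1` are my own, and the paper's actual evaluation script may well differ.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and articles, split on whitespace.

    This SQuAD-style normalization is an assumption, not the paper's method.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, truth: str) -> float:
    """Harmonic mean of token precision and recall between two short answers."""
    pred, gold = normalize(prediction), normalize(truth)
    if not pred or not gold:
        return float(pred == gold)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


# Model outputs and ground-truth answers copied from this issue.
predictions = {
    "0": "a game",
    "1": "Fortnite, Overwatch, Apex Legends",
    "2": "raising money for a children's hospital",
    "3": "move in together",
    "4": "an apartment not far from McGee's bar",
    "5": "they love spending time there",
    "6": "FIFA 23",
    "7": "a little first",
    "8": "an online board game",
    "9": "Luna",
}
ground_truth = {
    "0": "a game",
    "1": "Fortnite, Overwatch, Apex Legends",
    "2": "To raise money for a children's hospital",
    "3": "To move in together",
    "4": "In an apartment not far from McGee's bar",
    "5": "They love spending time together at the bar",
    "6": "FIFA 23",
    "7": "Control with a gamepad and timing",
    "8": "An online board game",
    "9": "Luna",
}

scores = {k: token_f1(predictions[k], ground_truth[k]) for k in ground_truth}
mean_f1 = sum(scores.values()) / len(scores)

for key, value in scores.items():
    print(f"Q{key}: F1 = {value:.3f}")
print(f"Mean F1 = {mean_f1:.3f}")
```

Under these assumptions the mean F1 over the ten questions comes out to about 0.82, with question 7 scoring 0 and the paraphrased answers (2, 3, 4, 5) scoring between roughly 0.67 and 0.92.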
