diff --git a/index.html b/index.html index 84e62ab..80c838c 100644 --- a/index.html +++ b/index.html @@ -110,6 +110,22 @@ .cost_text { margin-bottom: -5px; } + footer p { + color: var(--gray-500, #6B7280); + + /* text-xs/font-medium */ + font-family: Inter; + font-size: 12px; + font-style: normal; + font-weight: 500; + line-height: 16px; /* 133.333% */ + + margin: 0; + + text-align: center; + + margin-bottom: 24px; + } h2 { color: var(--gray-900, #111827); @@ -195,7 +211,7 @@ #lenny { position: absolute; right: 30%; - top: 26em; + top: 47.5%; } @media screen and (max-width: 600px) { .feature_cards { @@ -249,29 +265,13 @@ gtag('config', 'G-S0F5Y25KSC'); - -

How's GPT-4V Doing?

A collection of experiments measuring the performance of GPT-4 Vision.

+

Percentages measure how many of our tests passed.

Made with ❤️ by the team at Roboflow.

Last updated November 14, 2023.

Learn about our methodology. @@ -283,13 +283,12 @@

How's GPT-4V Doing?

Response Time

-

Over the last 1 day - , the average response time was 1.0ms.

+

Over the last 1 day, the average response time was 1.0ms.

+

This number only accounts for requests made by this application.

-

100.0%

-

Uptime

+

1.0 ms

@@ -299,8 +298,6 @@

Response Time

Zero-Shot Classification

-

Validate GPT-4V's ability to classify objects.

-

Learn more about this test.

@@ -310,16 +307,15 @@

Zero-Shot Classification

-

In this test, we evaluate GPT-4V's ability to classify an object.

Prompt

-                            What is in the image? Return the class of the object in the image. Here are the classes: fruit, bowl. You can only return one class from that list.
+                            What is in the image? Return the class of the object in the image. Here are the classes: Toyota Camry, Tesla Model 3. You can only return one class from that list.
                         

Image

- -

Answer

+ +

Result

-                            
+                            Toyota Camry
                         
@@ -327,8 +323,6 @@

Answer

Counting

-

Validate GPT-4V's ability to count objects.

-

Learn more about this test.

@@ -338,15 +332,15 @@

Counting

-

In this test, we evaluate GPT-4V's ability to count objects.

Prompt

                             Count the fruit in the image. Return a single number.
                         

Image

+

Result

-                            
+                            10
                         
@@ -354,8 +348,6 @@

Image

Document OCR

-

Validate GPT-4V's ability to read document text.

-

Learn more about this test.

@@ -371,8 +363,9 @@

Prompt

Image

+

Result

-                            
+                            I was thinking earlier today that I have gone through, to use the lingo, eras of listening to each of Swift's Eras. Meta indeed. I started listening to Ms. Swift's music after hearing the Midnights album. A few weeks after hearing the album for the first time, I found myself playing various songs on repeat. I listened to the album in order multiple times.
                         
@@ -381,7 +374,6 @@

Image

Handwriting OCR

Validate GPT-4V's ability to read handwriting.

-

Learn more about this test.

@@ -397,9 +389,9 @@

Prompt

Image

-

Image

+

Result

-                            
+                            The words of songs on the album have been echoing in my head all week. "Fades into the grey of my day old tea."
                         
@@ -412,13 +404,16 @@

Methodology

Every day, we run a set of tests to evaluate how GPT-4 Vision (GPT-4V) performs over time.

These tests are designed to monitor core features of GPT-4V.

-

Each test runs the same prompt and image through GPT-4V and compares the answer to a human-written answer.

+

Each test runs the same prompt and image through GPT-4V and compares the returned result to a human-written reference result.

While making this website, we experimented with prompts and chose the prompt that gave the most accurate results.

Tests are run at 1am PT every day. This site is updated when all tests are complete.

If a line is red, it means the test failed that day; if a line is green, the test passed.
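The daily pass/fail comparison described above can be sketched as a minimal harness. This is an illustration only: `ask_gpt4v` is a hypothetical stand-in for the real API call, and the image path is illustrative, not the site's actual file.

```python
def run_test(ask_gpt4v, prompt: str, image_path: str, reference: str) -> bool:
    """Send one fixed prompt/image pair to the model and compare its answer
    to a human-written reference. The comparison is an exact string match,
    so any deviation from the reference counts as a failed test that day."""
    answer = ask_gpt4v(prompt, image_path)
    return answer.strip() == reference.strip()

# Example with a stubbed model that answers correctly:
stub = lambda prompt, image_path: "10"
print(run_test(stub, "Count the fruit in the image. Return a single number.",
               "images/fruit.jpeg", "10"))  # True
```

Because the match is exact, prompts that pin down the output format (for example, "Return a single number.") make the tests far less brittle than free-form questions.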

+
+

This project is not affiliated with OpenAI.

+
\ No newline at end of file diff --git a/results/2023-11-14.json b/results/2023-11-14.json index b4078e5..c9d192e 100644 --- a/results/2023-11-14.json +++ b/results/2023-11-14.json @@ -1 +1 @@ -{"zero_shot_classification": [false], "count_fruit": [false], "request_times": [1.356907844543457, 1.4885571002960205, 8.274598121643066, 4.83161997795105], "document_ocr": [false], "handwriting_ocr": [false]} \ No newline at end of file +{"zero_shot_classification": [true], "count_fruit": [true], "request_times": [2.079479694366455, 1.3974571228027344, 12.67345905303955, 3.7373461723327637], "document_ocr": [true], "handwriting_ocr": [true]} \ No newline at end of file diff --git a/template.html b/template.html index 851d1c1..59cb85b 100644 --- a/template.html +++ b/template.html @@ -211,7 +211,7 @@ #lenny { position: absolute; right: 30%; - top: 26em; + top: 47.5%; } @media screen and (max-width: 600px) { .feature_cards { @@ -284,11 +284,11 @@

How's GPT-4V Doing?

Response Time

Over the last {{results['day_count']}} day{% if results['day_count'] > 1 %}s{% endif %}, the average response time was {{results['avg_response_time']}}ms.

+

This number only accounts for requests made by this application.

-

100.0%

-

Uptime

+

{{results['avg_response_time']}} ms

@@ -309,13 +309,13 @@

Zero-Shot Classification

Prompt

-                            What is in the image? Return the class of the object in the image. Here are the classes: fruit, bowl. You can only return one class from that list.
+                            What is in the image? Return the class of the object in the image. Here are the classes: Toyota Camry, Tesla Model 3. You can only return one class from that list.
                         

Image

- -

Answer

+ +

Result

-                            {{ results['zero_shot_classification_answer'] }}
+                            {{ results['zero_shot_result'] }}
                         
@@ -338,8 +338,9 @@

Prompt

Image

+

Result

-                            {{ results['count_fruit_answer'] }}
+                            {{ results['count_result'] }}
                         
@@ -362,8 +363,9 @@

Prompt

Image

+

Result

-                            {{ results['document_ocr_answer'] }}
+                            {{ results['document_ocr_result'] }}
                         
@@ -387,9 +389,9 @@

Prompt

Image

-

Image

+

Result

-                            {{ results['handwriting_ocr_image'] }}
+                            {{ results['handwriting_result'] }}
                         
@@ -402,7 +404,7 @@

Methodology

Every day, we run a set of tests to evaluate how GPT-4 Vision (GPT-4V) performs over time.

These tests are designed to monitor core features of GPT-4V.

-

Each test runs the same prompt and image through GPT-4V and compares the answer to a human-written answer.

+

Each test runs the same prompt and image through GPT-4V and compares the returned result to a human-written reference result.

While making this website, we experimented with prompts and chose the prompt that gave the most accurate results.

Tests are run at 1am PT every day. This site is updated when all tests are complete.

If a line is red, it means the test failed that day; if a line is green, the test passed.

diff --git a/web.py b/web.py index 60c9525..83c46b0 100644 --- a/web.py +++ b/web.py @@ -78,16 +78,20 @@ def predict( return response.choices[0].message.content, inference_time def zero_shot_classification(): + classes = ["Tesla Model 3", "Toyota Camry"] + base_model = GPT4V( ontology=CaptionOntology({"Tesla Model 3": "Tesla Model 3", "Toyota Camry": "Toyota Camry"}), api_key=os.environ["OPENAI_API_KEY"], ) - result, inference_time = base_model.predict("images/car.jpeg", classes=["Tesla Model 3", "Toyota Camry"]) + result, inference_time = base_model.predict("images/car.jpeg", classes=classes) return ( - result == sv.Classifications(class_id=np.array([0]), confidence=np.array([1])), + # class_id 1 maps to Tesla Model 3 + result == sv.Classifications(class_id=np.array([1]), confidence=np.array([1])), inference_time, + classes[result.class_id[0]], ) @@ -104,7 +108,7 @@ def count_fruit(): prompt="Count the fruit in the image. Return a single number.", ) - return result == "6", inference_time + return result == "10", inference_time, result def document_ocr(): base_model = GPT4V( @@ -116,10 +120,10 @@ "images/swift.png", classes=[], result_serialization="text", - prompt="Read the text in the image." + prompt="Read the text in the image. Return only the text, with punctuation." ) - return result == "I was thinking earlier today that I have gone through, to use the lingo, eras of listening to each of Swift's Eras. Meta indeed. I started listening to Ms. Swift's music after hearing the Midnights album. A few weeks after hearing the album for the first time, I found myself playing various songs on repeat. I listened to the album in order multiple times.", inference_time + return result == "I was thinking earlier today that I have gone through, to use the lingo, eras of listening to each of Swift's Eras. Meta indeed. I started listening to Ms. Swift's music after hearing the Midnights album. 
A few weeks after hearing the album for the first time, I found myself playing various songs on repeat. I listened to the album in order multiple times.", inference_time, result def handwriting_ocr(): @@ -132,18 +136,18 @@ "images/ocr.jpeg", classes=[], result_serialization="text", - prompt="Read the text in the image." + prompt="Read the text in the image. Return only the text, with punctuation." ) - return result == 'The words of songs on the album have been echoing in my head all week. "Fades into the grey of my day old tea."', inference_time + return result == 'The words of songs on the album have been echoing in my head all week. "Fades into the grey of my day old tea."', inference_time, result results = {"zero_shot_classification": [], "count_fruit": [], "request_times": [], "document_ocr": [], "handwriting_ocr": []} -zero_shot, inference_time = zero_shot_classification() -count_fruit, count_inference_time = count_fruit() -document_ocr, ocr_inference_time = document_ocr() -handwriting_ocr, handwriting_inference_time = handwriting_ocr() +zero_shot, inference_time, zero_shot_result = zero_shot_classification() +count_fruit, count_inference_time, count_result = count_fruit() +document_ocr, ocr_inference_time, ocr_result = document_ocr() +handwriting_ocr, handwriting_inference_time, handwriting_result = handwriting_ocr() results["zero_shot_classification"].append(zero_shot) results["count_fruit"].append(count_fruit) @@ -176,27 +180,30 @@ og_results = results.copy() +print(results, "rrr") + results = { - k: ",".join([str(1 if v else 0) for v in value]) for k, value in results.items() + k: ",".join([str(1 if v[0] else 0) for v in value]) for k, value in results.items() } +print(results, "rrr") results["zero_shot_classification_success_rate"] = ( - sum([1 if i else 0 for i in og_results["zero_shot_classification"]]) + sum([1 if i[0] else 0 for i in og_results["zero_shot_classification"]]) / 
len(og_results["zero_shot_classification"]) * 100 ) results["count_fruit_success_rate"] = ( - sum([1 if i else 0 for i in og_results["count_fruit"]]) + sum([1 if i[0] else 0 for i in og_results["count_fruit"]]) / len(og_results["count_fruit"]) * 100 ) results["document_ocr_success_rate"] = ( - sum([1 if i else 0 for i in og_results["document_ocr"]]) + sum([1 if i[0] else 0 for i in og_results["document_ocr"]]) / len(og_results["document_ocr"]) * 100 ) results["handwriting_ocr_success_rate"] = ( - sum([1 if i else 0 for i in og_results["handwriting_ocr"]]) + sum([1 if i[0] else 0 for i in og_results["handwriting_ocr"]]) / len(og_results["handwriting_ocr"]) * 100 ) @@ -206,6 +213,11 @@ def handwriting_ocr(): results["document_ocr_length"] = len(og_results["document_ocr"]) results["handwriting_ocr_length"] = len(og_results["handwriting_ocr"]) +results["document_ocr_result"] = ocr_result +results["handwriting_result"] = handwriting_result +results["zero_shot_result"] = zero_shot_result +results["count_result"] = count_result + results["avg_response_time"] = round(sum([float(i) for i in results["request_times"]]) / len( results["request_times"] ), 2)
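The aggregation in web.py above reduces to a few small steps. Here is a minimal sketch, assuming each test's history is a list of `(passed, inference_time, raw_result)` tuples, the shape the new three-value returns suggest; the helper names are illustrative, not the module's actual functions:

```python
def success_rate(entries):
    """Percent of runs whose pass/fail flag (first tuple element) is True."""
    return sum(1 for e in entries if e[0]) / len(entries) * 100

def pass_fail_line(entries):
    """Serialize a history as the comma-joined "1,0" string the chart consumes."""
    return ",".join("1" if e[0] else "0" for e in entries)

history = [
    (True, 1.4, "Toyota Camry"),
    (False, 2.1, "Tesla Model 3"),
    (True, 1.9, "Toyota Camry"),
]
print(round(success_rate(history), 1))  # 66.7
print(pass_fail_line(history))          # 1,0,1

# Average response time, rounded to two decimals as in web.py:
request_times = [2.08, 1.40, 12.67, 3.74]
print(round(sum(request_times) / len(request_times), 2))  # 4.97
```

Keeping the raw booleans in `og_results` while overwriting `results` with the serialized strings, as the diff does, is what lets the template receive both the sparkline string and the numeric success rate.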