Yiming Zuo* · Karhan Kayan* · Maggie Wang · Kevin Jeon · Jia Deng · Thomas L. Griffiths
(*equal contribution)
If you use our benchmark or data in your work, please cite our 3DV paper:
```bibtex
@article{zuo2025towards,
  title={Towards Foundation Models for 3D Vision: How Close Are We?},
  author={Zuo, Yiming and Kayan, Karhan and Wang, Maggie and Jeon, Kevin and Deng, Jia and Griffiths, Thomas L},
  journal={International Conference on 3D Vision (3DV)},
  year={2025}
}
```
Download the pre-generated image and question pairs from this Google Drive link. We also provide instructions on how to generate the benchmark from the original datasets here.
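If you prefer to script the download, below is a minimal sketch using the `gdown` package. The folder URL and output directory are placeholders, not part of our release; substitute the actual Google Drive link above.

```python
# Minimal download sketch (assumes `pip install gdown`).
# The URL below is a placeholder; replace it with the Google Drive folder linked above.
import gdown

gdown.download_folder(
    url="https://drive.google.com/drive/folders/<FOLDER_ID>",  # placeholder
    output="benchmark_data",  # local directory for the image/question pairs
    quiet=False,
)
```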
To evaluate LLMs on the relative depth task, go to the `LLM_evaluations/relative_depth` folder and run `python generate_gpt4v_response.py` (replace the placeholder with your own OpenAI API key). This saves the GPT responses in JSON format. We also provide the raw responses we collected from GPT-4V, GPT-4o, and Gemini in this Google Drive link. Then run `python evaluate_gpt4v_response.py`, which computes the accuracy.
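As a rough illustration of what the evaluation step does, the sketch below scores saved responses against ground-truth labels. The JSON filename and the `answer` / `gt_answer` fields are assumptions made for this example; the actual schema is defined by `generate_gpt4v_response.py` and `evaluate_gpt4v_response.py`.

```python
# Illustrative accuracy computation over saved model responses.
# NOTE: the filename and field names are assumptions for this sketch;
# consult evaluate_gpt4v_response.py for the real schema.
import json

with open("gpt4v_responses.json") as f:  # hypothetical filename
    responses = json.load(f)

correct = 0
for item in responses:
    # Assume each item stores the model's parsed answer and the ground truth.
    if item["answer"].strip().lower() == item["gt_answer"].strip().lower():
        correct += 1

accuracy = correct / len(responses)
print(f"Accuracy: {accuracy:.3f}")
```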
To score the human study results for the relative depth task, go to the `Human_Study/relative_depth` folder and run `python evaluate_mturk.py`.
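Human responses from MTurk are typically aggregated per question before accuracy is computed. The sketch below shows one common way to do this (majority vote); the CSV filename and column names are assumptions for illustration, not the script's actual interface.

```python
# Illustrative majority-vote aggregation of MTurk annotations.
# NOTE: the CSV filename and column names are hypothetical; see
# evaluate_mturk.py for how the released results are actually processed.
import csv
from collections import Counter, defaultdict

votes = defaultdict(list)
gt = {}
with open("mturk_results.csv") as f:  # hypothetical filename
    for row in csv.DictReader(f):
        votes[row["question_id"]].append(row["worker_answer"])
        gt[row["question_id"]] = row["gt_answer"]

correct = sum(
    Counter(answers).most_common(1)[0][0] == gt[qid]
    for qid, answers in votes.items()
)
print(f"Human accuracy (majority vote): {correct / len(votes):.3f}")
```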
To evaluate LLMs on the CLEVR VQA task, go to the `LLM_evaluations/clevr_vqa` folder and run `python generate_gpt_response.py` (replace the placeholder with your own OpenAI API key). This saves the GPT responses in JSON format. We also provide the raw responses we collected from GPT-4V, GPT-4o, Gemini, and the specialized model MDETR in this Google Drive link. Then run `python compute_acc.py`, which computes the accuracy.
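Comparing a free-form model answer against a CLEVR ground-truth answer usually requires some light normalization before exact matching. The helper below is a hypothetical illustration of that idea, not the logic used in `compute_acc.py`.

```python
# Hypothetical normalization of a free-form answer before exact matching.
# compute_acc.py defines the actual matching rules used in the benchmark.
import re

WORD_TO_DIGIT = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9", "ten": "10",
}

def normalize(answer: str) -> str:
    """Lowercase, strip punctuation, and map number words to digits."""
    answer = re.sub(r"[^a-z0-9 ]", "", answer.strip().lower())
    return WORD_TO_DIGIT.get(answer, answer)

assert normalize("Three.") == "3"
assert normalize(" yes ") == "yes"
```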
To score the human study results for the CLEVR VQA task, go to the `Human_Study/clevr_vqa` folder and run `python evaluate_mturk.py`.
To evaluate LLMs on the relative camera pose task, go to the `LLM_evaluations/relative_camera_pose` folder and run `python generate_gpt4v_response.py` (replace the placeholder with your own OpenAI API key). This saves the GPT responses in JSON format. We also provide the raw responses we collected from GPT-4V, GPT-4o, and Gemini in this Google Drive link. Then run `python evaluate_gpt4v_response.py`, which computes the accuracy.
To score the human study results for the relative camera pose task, go to the `Human_Study/relative_camera_pose` folder and run `python evaluate_mturk.py`.