This repo is a forked version of this repository
- Multilingual Support
- Improve Quality of extract_essentials (Summary)
- Keyword based curation
demo.mov
Generate a comprehensive review from an arXiv paper, then turn it into a blog post. That is the goal of this project, and it comes with a set of tools to accomplish it. If you are curious how to build your own paper-reviewing blog, check out AI Paper Reviewer, which is powered by this project and auto-generates blog posts for the Hugging Face Daily Papers. The video above is a demo of it.
As of Dec. 2024, this project also powers the AI Paper Reviewer for NeurIPS 2024 web page.
At a high level, there are two Python scripts, collect.py and convert.py.
- collect.py: Collect and generate reviews for a given arXiv ID.
- convert.py: Convert the collected reviews into a blog post. The blog post follows a fixed design template.
# mandatory
$ export GEMINI_API_KEY="..."
# optional, only if you want to use Upstage's Document Parse
$ export UPSTAGE_API_KEY="..."
# optional, only if you want to upload images to R2
$ export R2_ACCESS_KEY_ID="..."
$ export R2_SECRET_ACCESS_KEY="..."
$ export R2_S3_ENDPOINT_URL="..."
$ export R2_DOMAIN_NAME="..."
# install dependencies
$ pip install -r requirements.txt
# poppler is required to convert pdf to images
# for Ubuntu, use apt install poppler-utils
$ brew install poppler
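The environment variables above can be checked before starting a run. The following is a minimal sketch and not part of the project: the helper name `missing_env_vars` and its parameters are hypothetical, and it assumes (as the setup comments state) that the Upstage key is only needed when Upstage is used and the R2 variables only when images are uploaded to R2.

```python
import os

# Variable names are taken from the setup commands above.
REQUIRED = ["GEMINI_API_KEY"]
UPSTAGE = ["UPSTAGE_API_KEY"]
R2 = [
    "R2_ACCESS_KEY_ID",
    "R2_SECRET_ACCESS_KEY",
    "R2_S3_ENDPOINT_URL",
    "R2_DOMAIN_NAME",
]


def missing_env_vars(use_upstage=False, upload_r2=False, env=None):
    """Return the names of needed variables that are unset or empty.

    Hypothetical helper for illustration only; the project itself may
    validate its configuration differently.
    """
    env = os.environ if env is None else env
    needed = list(REQUIRED)
    if use_upstage:
        needed += UPSTAGE
    if upload_r2:
        needed += R2
    return [name for name in needed if not env.get(name)]
```

Calling `missing_env_vars(use_upstage=True)` before invoking collect.py with --use-upstage would list any keys that still need to be exported. The `env` argument exists only so the check can be exercised with a plain dictionary instead of the real environment.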
To collect and generate reviews for a given arXiv ID, run the collect.py script with the following options:
$ python collect.py --help
usage: collect.py [-h] [--arxiv-id ARXIV_ID] [--workers WORKERS]
[--use-upstage] [--stop-at-no-html]
[--known-affiliations-path KNOWN_AFFILIATIONS_PATH]
[--known-categories-path KNOWN_CATEGORIES_PATH]
[--lang LOCALE]
options:
-h, --help show this help message and exit
--arxiv-id ARXIV_ID arXiv ID
--workers WORKERS Number of workers
--use-upstage Use Upstage to extract figures from images
--stop-at-no-html Stop if no HTML is found
--known-affiliations-path KNOWN_AFFILIATIONS_PATH
Path to known affiliations
--known-categories-path KNOWN_CATEGORIES_PATH
Path to known categories
--lang LOCALE locale
To minimize cost, run the collect.py script with the --stop-at-no-html option. This ensures the workflow runs only on papers that have a dedicated HTML page (arXiv's experimental HTML page).
$ python collect.py --arxiv-id "..." --stop-at-no-html
If you want to run the workflow on a paper that does not have a dedicated HTML page, the visual information (i.e. figures, tables, charts) must be extracted from page images of the paper. In this case, the --use-upstage option, as in the following command, gives the best results.
$ python collect.py --arxiv-id "..." --use-upstage
If you are not an Upstage user, or if you don't want to be charged for the Upstage APIs, you can omit the --use-upstage option. In this case, the script uses Gemini to extract the visual information from the page images of the paper. However, this approach is best effort and is not recommended if you care about how accurately visual information is parsed, because Gemini is not optimized to determine the coordinates of visual elements in an image.
$ python collect.py --arxiv-id "..."
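The three invocations above differ only in their flags. As a rough sketch, the choice can be wrapped in a small helper that builds the argument list. The function `build_collect_command` and its `mode` labels are hypothetical names introduced here for illustration; only the flags themselves come from the help output above.

```python
import sys


def build_collect_command(arxiv_id, mode="html-only", workers=None, lang=None):
    """Build an argv list for collect.py.

    mode (hypothetical labels):
      "html-only": pass --stop-at-no-html, the lowest-cost option
      "upstage":   pass --use-upstage for papers without an HTML page
      "gemini":    no extra flag; Gemini parses the page images (best effort)
    """
    flags = {
        "html-only": ["--stop-at-no-html"],
        "upstage": ["--use-upstage"],
        "gemini": [],
    }
    if mode not in flags:
        raise ValueError(f"unknown mode: {mode!r}")

    cmd = [sys.executable, "collect.py", "--arxiv-id", arxiv_id, *flags[mode]]
    if workers is not None:
        cmd += ["--workers", str(workers)]
    if lang is not None:
        cmd += ["--lang", lang]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root. Building a list rather than a shell string avoids quoting problems with the arXiv ID.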
To convert the collected reviews into a blog post, run the convert.py script with the following options:
$ python convert.py --help
usage: convert.py [-h] [--arxiv-id ARXIV_ID] [--template TEMPLATE] [--hf-daily-papers-date-tag HF_DAILY_PAPERS_DATE_TAG] [--upload-images-r2]
[--stop-at-no-html]
options:
-h, --help show this help message and exit
--arxiv-id ARXIV_ID arXiv ID
--template TEMPLATE Template file
--hf-daily-papers-date-tag HF_DAILY_PAPERS_DATE_TAG
--upload-images-r2 Cloudflare R2 to upload images
--stop-at-no-html Stop if no HTML is found
The blog post follows a fixed design template. To customize the design, modify the template yourself.
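Since convert.py works on the reviews that collect.py produces, the two steps are typically run back to back for the same arXiv ID. The sketch below chains them with `subprocess`. It is an illustration rather than part of the project: `run_pipeline` and its `runner` parameter are hypothetical, and it assumes both scripts are run from the repository root and exit with a non-zero status on failure.

```python
import subprocess
import sys


def run_pipeline(arxiv_id, upload_images_r2=False, runner=subprocess.run):
    """Run collect.py, then convert.py, for one arXiv ID.

    `runner` defaults to subprocess.run and is injectable so the
    command lines can be inspected without executing anything.
    """
    collect = [sys.executable, "collect.py", "--arxiv-id", arxiv_id,
               "--stop-at-no-html"]
    convert = [sys.executable, "convert.py", "--arxiv-id", arxiv_id]
    if upload_images_r2:
        convert.append("--upload-images-r2")

    for cmd in (collect, convert):
        # check=True raises CalledProcessError on a non-zero exit,
        # so convert.py is not started if collect.py fails.
        runner(cmd, check=True)
    return collect, convert
```

Passing `upload_images_r2=True` only makes sense when the R2 environment variables are set, since that flag uploads images to Cloudflare R2.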