MacOS M1 Chip Failed to use #1197

Closed
rhythmcao opened this issue Dec 5, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@rhythmcao

Description of the bug | 错误描述

Hello Xiaomeng,

Thanks very much for providing this wonderful tool "MinerU" which really helps us a lot. All my teammates acknowledge that it is the most powerful, versatile and easy to use PDF parsing tool.

Unfortunately, I ran into a problem when using it on a Mac with an M1 chip running macOS Sonoma 14.6.1 (on Windows, we integrated it into our project without trouble). As of Dec 5th, 2024, the latest version of magic-pdf is 0.10.5. I installed it following the official tutorial at https://github.com/opendatalab/MinerU (in a separate conda environment with Python 3.10; pip install and the model download both completed without errors), but running magic-pdf -h fails with the following traceback:

  File "/Users/rhythmcao/miniforge3/envs/mineru-mac/lib/python3.10/site-packages/unimernet/datasets/data_utils.py", line 15, in <module>
    import decord
  File "/Users/rhythmcao/miniforge3/envs/mineru-mac/lib/python3.10/site-packages/decord/__init__.py", line 4, in <module>
    from ._ffi.runtime_ctypes import TypeCode
  File "/Users/rhythmcao/miniforge3/envs/mineru-mac/lib/python3.10/site-packages/decord/_ffi/runtime_ctypes.py", line 8, in <module>
    from .base import _LIB, check_call
  File "/Users/rhythmcao/miniforge3/envs/mineru-mac/lib/python3.10/site-packages/decord/_ffi/base.py", line 47, in <module>
    _LIB, _LIB_NAME = _load_lib()
  File "/Users/rhythmcao/miniforge3/envs/mineru-mac/lib/python3.10/site-packages/decord/_ffi/base.py", line 39, in _load_lib
    lib = ctypes.CDLL(lib_path[0], ctypes.RTLD_GLOBAL)
  File "/Users/rhythmcao/miniforge3/envs/mineru-mac/lib/python3.10/ctypes/__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: dlopen(/Users/rhythmcao/miniforge3/envs/mineru-mac/lib/python3.10/site-packages/decord/libdecord.dylib, 0x000A): Symbol not found: _CGLGetCurrentContext
  Referenced from: <E8598F3E-7CA7-39DF-82C8-3B35178979A6> /Users/rhythmcao/miniforge3/envs/mineru-mac/lib/python3.10/site-packages/decord/.dylibs/libavfilter.8.44.100.dylib
  Expected in:     <9A0872F4-B2BE-34EB-A7D2-16467518EBD0> /System/Library/Frameworks/OpenGL.framework/Versions/A/OpenGL

According to the error information, I checked the following folders:

  1. "~/miniforge3/envs/mineru-mac/lib/python3.10/site-packages/decord/": all libs such as "libdecord.dylib" and ".dylibs/libavfilter.8.44.100.dylib" do exist.
  2. "/System/Library/Frameworks/OpenGL.framework/Versions/A/": Under the folder "A/", I only find 3 sub-folders and no "OpenGL" entry:
    "Libraries Resources _CodeSignature"

Thus, I guess the problem comes from OpenGL support on MacOS M1? But I am not sure what the underlying cause is (forgive me, I am not familiar with computer vision or the OpenGL library). I would really appreciate it if you could provide some help.
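A minimal sketch for checking the guess above, offered as an aid rather than a confirmed diagnosis. On macOS 11 and later, system libraries are generally served from the dyld shared cache, so the absence of an "OpenGL" file on disk does not by itself show that the framework is missing. The helper below (`probe_library` is a hypothetical name, not part of MinerU or decord) reports both whether a path exists on disk and whether the dynamic loader can actually open it.

```python
import ctypes
import os

# Path copied from the "Expected in:" line of the traceback above.
OPENGL_BINARY = "/System/Library/Frameworks/OpenGL.framework/Versions/A/OpenGL"


def probe_library(path):
    """Report whether `path` exists on disk and whether the loader can open it."""
    on_disk = os.path.exists(path)
    try:
        ctypes.CDLL(path)
        loadable, error = True, None
    except OSError as exc:
        # dlopen failed; keep the loader's message for inspection.
        loadable, error = False, str(exc)
    return {"path": path, "on_disk": on_disk, "loadable": loadable, "error": error}


if __name__ == "__main__":
    print(probe_library(OPENGL_BINARY))
```

If the result shows `on_disk` false but `loadable` true, the framework itself is present and the failure is more likely to come from how the loader is being directed to it.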

By the way, I also found another resolved issue, 273, about using MinerU on MacOS Sonoma. Luckily, I succeeded with version "0.6.2b1":

pip install magic-pdf[full]==0.6.2b1
pip install detectron2 --extra-index-url https://myhloli.github.io/wheels/ 
pip install torch==2.3.1 torchvision==0.18.1 torchtext==0.18.0
magic-pdf --version

which gives me exactly the version 0.6.2b1.
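A small sketch for confirming which versions actually ended up in the active environment, using only the standard library. The distribution names in the loop are taken from the pip commands above; `installed_version` is a hypothetical helper name.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    for name in ("magic-pdf", "torch", "torchvision", "torchtext", "detectron2"):
        print(name, installed_version(name) or "not installed")
```

Running this inside the conda environment makes it easy to see whether a later install silently changed one of the pinned versions.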

However, version 0.6.2b1 seems to be deprecated, and its shell command is incompatible with that of the latest version 0.10.5. I would therefore much prefer to use the latest version 0.10.5 for its more advanced support.

If MinerU 0.10.5 does not support MacOS M1, could you provide more detailed API documentation for the old magic-pdf pdf-command, so that I can use it in the same way as the new command magic-pdf -p pdf_path -o output_folder -m auto? For example:

  • Can I directly modify the json file ~/magic-pdf.json to change the model directory or enable table-config, as in version 0.10.5? magic-pdf pdf-command requires the argument --model PATH (which model?).
  • How do I set the output folder when using magic-pdf pdf-command?
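As a starting point for the first question, here is a hedged sketch for inspecting ~/magic-pdf.json and preparing an edited copy in memory. It does not answer whether version 0.6.2b1 honors the file; that is not confirmed anywhere in this thread. The key names in the usage note ("models-dir", "table-config") are assumptions for illustration, so check the keys the script prints against your own file before relying on them.

```python
import json
from pathlib import Path

# Location mentioned in the question above.
CONFIG_PATH = Path.home() / "magic-pdf.json"


def load_config(path):
    """Read a JSON config file into a dict."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def with_overrides(config, overrides):
    """Return a copy of `config` with top-level keys replaced by `overrides`."""
    merged = dict(config)
    merged.update(overrides)
    return merged


if __name__ == "__main__":
    if CONFIG_PATH.exists():
        config = load_config(CONFIG_PATH)
        # List the keys that the installed version actually wrote.
        print(sorted(config))
    else:
        print(f"{CONFIG_PATH} not found")
```

For example, `with_overrides(config, {"models-dir": "/path/to/models"})` returns an updated dict without touching the file on disk; writing it back with `json.dump` is left out on purpose so nothing is overwritten by accident.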

Thanks a lot.

How to reproduce the bug | 如何复现

Just follow the official guide on MacOS M1:

conda create -n MinerU python=3.10
conda activate MinerU
pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com

And the models have been downloaded successfully.

Operating system | 操作系统

MacOS

Python version | Python 版本

3.10

Software version | 软件版本 (magic-pdf --version)

0.10.x

Device mode | 设备模式

cpu

@rhythmcao rhythmcao added the bug Something isn't working label Dec 5, 2024
@myhloli
Collaborator

myhloli commented Dec 5, 2024

It runs fine for me on an M4 Mac with macOS 15.1.1. My guess is that an unstable PyPI mirror caused some dependencies to fail during installation. I recommend recreating the conda environment and installing from a mirror source:

conda create -n MinerU python=3.10
conda activate MinerU
pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com -i https://mirrors.aliyun.com/pypi/simple

@rhythmcao
Author

Finally, I succeeded and got this output 🤗! MacOS does not support NVIDIA, so I guess the problem is solved.

image

The problem seems to have been caused by my incorrectly set environment variable DYLD_LIBRARY_PATH, not by MinerU at all. When I follow the official advice and add the path:

export DYLD_LIBRARY_PATH=$DYLD_LIBRARY_PATH:/System/Library/Frameworks/OpenGL.framework/Versions/A/Libraries

or simply type unset DYLD_LIBRARY_PATH, the error is gone!
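For anyone hitting the same traceback, a quick way to see whether a stray loader override is set before launching magic-pdf is sketched below. `dyld_overrides` is a hypothetical helper name; it simply lists every environment variable whose name starts with `DYLD_`.

```python
import os


def dyld_overrides(environ=None):
    """Return the DYLD_* variables present in the given environment mapping."""
    environ = os.environ if environ is None else environ
    return {key: value for key, value in environ.items() if key.startswith("DYLD_")}


if __name__ == "__main__":
    found = dyld_overrides()
    if found:
        for key, value in found.items():
            print(f"{key}={value}")
    else:
        print("No DYLD_* variables are set")
```

If the script prints a DYLD_LIBRARY_PATH pointing somewhere unexpected, that is a candidate for the kind of misconfiguration described above.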

Thanks again for this wonderful tool. I will close this issue.

@samqin123

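Sharing a lesson from my own mistake: I use VS Code and did not notice that the Python environment selected in the bottom-right corner was 3.9. Because of that, installing with pip install pdf[all] or [all-cpu] never went smoothly, and since Python 3.9 does not support detectron2, every approach I tried ended in an error. Then I suddenly noticed the 3.9, switched to 3.10, and it ran immediately. So always confirm first that your environment is Python 3.10; it is easy to check with python --version.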
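The same check can be done from inside the interpreter that the editor actually selected, which avoids being misled by a different `python` on the shell PATH. This is a minimal sketch; `is_required_python` is a hypothetical helper, and the 3.10 requirement is the one stated in this thread.

```python
import sys

REQUIRED = (3, 10)


def is_required_python(version_info=None, required=REQUIRED):
    """Return True when the major.minor version matches `required`."""
    version_info = sys.version_info if version_info is None else version_info
    return tuple(version_info[:2]) == tuple(required)


if __name__ == "__main__":
    print("Interpreter:", sys.executable)
    print("Version:", ".".join(str(part) for part in sys.version_info[:3]))
    print("Matches 3.10:", is_required_python())
```

Printing `sys.executable` also shows which environment the interpreter belongs to.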

@flight505

I tried to follow the workaround from this issue, but I can't get it working. To use MinerU in my current project I need it integrated and running on Apple silicon; MPS support would be nice. For now I will stick with marker with an LLM, since it runs on MPS and the run time per scientific paper is less than 1 minute on my MacBook. If you have a better solution for the script, please share it.

import os
import platform
import shutil
import subprocess
import time

from termcolor import cprint

# Constants
PDF_DIR = "pdfs"
OUTPUT_DIR = "markdown_output"

def setup_mac_environment():
    """Setup environment variables for M-series Macs"""
    if platform.system() == "Darwin" and platform.machine() == "arm64":
        cprint("M-series Mac detected, configuring OpenGL paths...", "cyan")
        opengl_path = "/System/Library/Frameworks/OpenGL.framework/Versions/A/Libraries"
        current_path = os.environ.get("DYLD_LIBRARY_PATH", "")
        if opengl_path not in current_path:
            os.environ["DYLD_LIBRARY_PATH"] = f"{current_path}:{opengl_path}" if current_path else opengl_path

def convert_pdf_to_markdown(pdf_path, output_dir):
    try:
        cprint(f"\nStarting conversion of {pdf_path}...", "yellow")
        
        # Get base filename without extension
        base_name = os.path.splitext(os.path.basename(pdf_path))[0]
        temp_output_dir = os.path.join(output_dir, base_name)
        final_output_path = os.path.join(output_dir, f"{base_name}.md")
        
        # Remove existing output
        if os.path.exists(temp_output_dir):
            shutil.rmtree(temp_output_dir)
        if os.path.exists(final_output_path):
            os.remove(final_output_path)
        
        # Convert PDF using magic-pdf
        cprint("Converting with text mode...", "cyan")
        result = subprocess.run(
            ["magic-pdf", "-p", pdf_path, "-o", output_dir, "-m", "txt"],
            capture_output=True,
            text=True,
        )

        if result.returncode != 0:
            raise Exception(f"Command failed with error: {result.stderr}")
        
        # Wait a bit for file system
        time.sleep(1)
        
        # Check for output in the expected directory structure
        txt_dir = os.path.join(temp_output_dir, "txt")
        if not os.path.exists(txt_dir):
            raise Exception("Output directory structure not found")
        
        # Combine all text files into one markdown file
        with open(final_output_path, "w", encoding="utf-8") as outfile:
            # Write metadata
            outfile.write(f"# {base_name}\n\n")
            
            # Process text files in order
            text_files = sorted([f for f in os.listdir(txt_dir) if f.endswith(".txt")])
            for text_file in text_files:
                with open(os.path.join(txt_dir, text_file), "r", encoding="utf-8") as infile:
                    outfile.write(infile.read() + "\n\n")
        
        # Clean up temporary directory
        shutil.rmtree(temp_output_dir)
        
        final_size = os.path.getsize(final_output_path)
        cprint(f"\nSuccessfully converted {pdf_path} to {final_output_path} (Size: {final_size/1024:.2f}KB)", "green")
        return True
    except Exception as e:
        cprint(f"\nError converting {pdf_path}: {str(e)}", "red")
        return False

def main():
    try:
        # Setup environment for M-series Macs
        setup_mac_environment()
        
        # Create output directory if it doesn't exist
        os.makedirs(OUTPUT_DIR, exist_ok=True)
        
        # Get list of PDF files
        pdf_files = [f for f in os.listdir(PDF_DIR) if f.endswith('.pdf')]
        
        if not pdf_files:
            cprint("No PDF files found in the pdfs directory!", "red")
            return
        
        cprint(f"Found {len(pdf_files)} PDF files to convert", "cyan")
        
        # Process each PDF
        success_count = 0
        for i, pdf_file in enumerate(pdf_files, 1):
            cprint(f"\nProcessing file {i}/{len(pdf_files)}", "cyan")
            pdf_path = os.path.join(PDF_DIR, pdf_file)
            if convert_pdf_to_markdown(pdf_path, OUTPUT_DIR):
                success_count += 1
        
        cprint(f"\nConversion complete! Successfully converted {success_count}/{len(pdf_files)} files", "green")
        
    except Exception as e:
        cprint(f"\nAn error occurred: {str(e)}", "red")

if __name__ == "__main__":
    main() 
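One possible variation on the script above, offered as a sketch rather than a verified fix. It follows the earlier finding in this thread that unsetting DYLD_LIBRARY_PATH made the error disappear: instead of adding the OpenGL path, it launches magic-pdf with every `DYLD_*` variable removed from the child environment. It also searches the output folder recursively for Markdown files rather than expecting `.txt` files in a fixed sub-folder; that the tool writes `.md` files somewhere under the output folder is an assumption here. The helper names (`env_without_dyld`, `find_markdown`, `run_magic_pdf`) are hypothetical; the `-p`, `-o` and `-m` flags are the ones already used in this thread.

```python
import os
import subprocess
from pathlib import Path


def env_without_dyld(environ=None):
    """Return a copy of the environment with all DYLD_* variables dropped."""
    environ = os.environ if environ is None else environ
    return {key: value for key, value in environ.items() if not key.startswith("DYLD_")}


def find_markdown(output_dir):
    """Return every .md file under `output_dir`, sorted by path."""
    return sorted(Path(output_dir).rglob("*.md"))


def run_magic_pdf(pdf_path, output_dir, mode="txt"):
    """Run magic-pdf in a cleaned environment and return the Markdown files found."""
    result = subprocess.run(
        ["magic-pdf", "-p", str(pdf_path), "-o", str(output_dir), "-m", mode],
        capture_output=True,
        text=True,
        env=env_without_dyld(),
    )
    if result.returncode != 0:
        raise RuntimeError(f"magic-pdf failed: {result.stderr}")
    return find_markdown(output_dir)
```

Passing `env=` to `subprocess.run` keeps the change local to the child process, so the parent script's own environment is left untouched.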
