Agent fails with ERR_DLOPEN_FAILED on Windows #331

Open
aryanjassal opened this issue Nov 18, 2024 · 10 comments
Labels
bug Something isn't working

Comments

@aryanjassal
Member

aryanjassal commented Nov 18, 2024

Describe the bug

On Windows, on a fresh install where the executable is downloaded directly from GitHub, the first run works perfectly. However, after closing the agent and trying to open it again, an ERR_DLOPEN_FAILED error is raised. Note that the process is being run in a VirtualBox VM.

PS C:\Users\vboxuser\Downloads> .\pk.exe
pkg/prelude/bootstrap.js:2255
    return ancestor.dlopen.apply(process, args);
                           ^

Error: The specified module could not be found.
C:\Users\vboxuser\AppData\Local\Temp\pkg\037eb9aea39dce95eb334f7338fdc7c24733ae226b64dab4cec852f7af796b54\@matrixai\quic-win32-x64\node.napi.node
    at process.dlopen (pkg/prelude/bootstrap.js:2255:28)
    at Module._extensions..node (node:internal/modules/cjs/loader:1473:18)
    at Module.load (node:internal/modules/cjs/loader:1207:32)
    at Module._load (node:internal/modules/cjs/loader:1023:12)
    at Module.require (node:internal/modules/cjs/loader:1235:19)
    at Module.require (pkg/prelude/bootstrap.js:1851:31)
    at require (node:internal/modules/helpers:176:18)
    at requireBinding (C:\snapshot\Polykey-CLI\dist\polykey.js:2116:7277)
    at C:\snapshot\Polykey-CLI\dist\polykey.js:2116:7520
    at C:\snapshot\Polykey-CLI\dist\polykey.js:2:315 {
  code: 'ERR_DLOPEN_FAILED'
}

Node.js v20.11.1

I have verified that all the files exist where they should be. The path C:\Users\...\node.napi.node is an actual binary which exists and can be opened. This happened after I attempted to run a node while one was already running using an alternative node path.

No matter how I reset the state (removing temp files, clearing the cache, reinstalling the binary, reinstalling Node), nothing works.
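
One way to narrow this down is to separate "the file is missing" from "the loader rejected the file". On Windows, "The specified module could not be found" can also be reported when a DLL that the addon depends on cannot be resolved, even though the .node file itself is present. The sketch below is a hypothetical helper (`tryLoadNative` is not part of Polykey or pkg) that checks for the file and then calls `process.dlopen` on it directly, recording whatever error comes back.

```javascript
const fs = require("node:fs");

// Hypothetical diagnostic helper, not part of Polykey or pkg.
// Reports whether the addon file exists and whether the loader accepts it.
function tryLoadNative(addonPath) {
  const result = {
    path: addonPath,
    exists: fs.existsSync(addonPath),
    loaded: false,
    code: null,
    message: null,
  };
  try {
    // process.dlopen fills in `exports` on the object it is given.
    const target = { exports: {} };
    process.dlopen(target, addonPath);
    result.loaded = true;
  } catch (err) {
    // Keep the code and message so they can be compared with the pkg stack trace.
    result.code = err.code ?? null;
    result.message = err.message;
  }
  return result;
}
```

Calling `tryLoadNative` with the path printed in the stack trace, from a plain `node` process of the same version, would show whether the failure reproduces outside the packaged executable. If `exists` is true and `loaded` is false, the dependencies of the addon are the next thing to inspect.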

To Reproduce

  1. Run executable
  2. Create a new node
  3. Try creating another node from a new window while the original node is still running
  4. See error
  5. Close first node
  6. See this error when trying to restart it

Expected behavior

Polykey should launch as expected every time

Screenshots

Platform

  • Device: Dell Precision 3480
  • OS: Windows 11
  • Version: 0.14.0

Additional context

  • When I closed the first node after seeing the error messages pop up for the second node, I got an error message like Error: unknown memory.
  • This VM is a fresh VM with basically nothing on it, so I had to install node to run it.
  • The networking settings are correctly set in the VM: I can ping the VM and it can ping me. However, we never got to the actual networking step because the agent fails on startup, so that's not really relevant.
  • We attempted to clone the source and try running that, but it was still failing.
  • Removing the node.napi.node file gave us the expected MODULE_NOT_FOUND error, and the file is also not locked as we can still open and read the file.
  • The Polykey state was not removed, so I can provide access to it if necessary.

Notify maintainers

This is a general bug, so there is no one maintainer.

@aryanjassal @tegefaulkes

@aryanjassal aryanjassal added the bug Something isn't working label Nov 18, 2024

linear bot commented Nov 18, 2024

@CMCDragonkai
Member

Probably a new native library failed to be bundled. Also isn't the CI supposed to test basic functionality cross platform to prevent regression? This should be caught by the CI - please check @tegefaulkes @aryanjassal

@aryanjassal
Member Author

Probably a new native library failed to be bundled. Also isn't the CI supposed to test basic functionality cross platform to prevent regression?

Well, we only check if the help text shows up fine. Which it did. The first agent ran just fine, but the issues came when running another agent, or running the same agent again. So, a bundling issue seems implausible. I will look into this further.

A way to ensure cross-platform compatibility is to run the entire test suite on each platform instead of just printing the help text. Maybe @brynblack can help with that.
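
As a step between "print the help text" and "run the entire test suite", a CI smoke test could launch the executable more than once and check each run for the loader error. This is a minimal sketch under assumptions: `runRepeatedly` is a hypothetical helper, and it only covers sequential launches, not the concurrent second agent described in the reproduction steps.

```javascript
const { spawnSync } = require("node:child_process");

// Hypothetical smoke-test helper: runs the same command several times in a row
// and records the exit status and whether stderr mentions ERR_DLOPEN_FAILED.
function runRepeatedly(command, args, times = 2) {
  const runs = [];
  for (let attempt = 1; attempt <= times; attempt++) {
    const child = spawnSync(command, args, { encoding: "utf8" });
    runs.push({
      attempt,
      status: child.status, // null if the process could not be spawned
      dlopenFailed: /ERR_DLOPEN_FAILED/.test(child.stderr ?? ""),
    });
  }
  return runs;
}
```

A CI job could call `runRepeatedly` with the path to the built executable and whichever arguments exercise the native modules, then fail if any run has a non-zero `status` or `dlopenFailed` set. The exact arguments are left open here because the help text alone did not trigger the problem.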

@CMCDragonkai
Copy link
Member

pkg unpacks and extracts its native binaries to a temporary location. When that error occurs, check if that path still exists and interrogate its contents. This may be a misnomer.

Member Author

pkg unpacks and extracts its native binaries to a temporary location. When that error occurs, check if that path still exists and interrogate its contents. This may be a misnomer.

Yes, we did that. pkg had unpacked its binaries under ~\AppData\Local\Temp. I inspected the contents, and every file was present at the expected path. I located the offending file and opened it in a text editor. The contents were binary gibberish, as expected, but this showed that the file is present in the correct location, is readable, and has rw permissions, too.
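
Opening the file in a text editor shows that it is readable, but not that it is intact. A stronger check is to record its size and hash and compare them with the same file extracted on a machine where the agent starts correctly. The helper below is a hypothetical sketch (`describeFile` is an illustrative name) that uses only Node's built-in `fs` and `crypto` modules.

```javascript
const fs = require("node:fs");
const crypto = require("node:crypto");

// Hypothetical helper: summarises a file so two extractions can be compared.
function describeFile(filePath) {
  const stat = fs.statSync(filePath);
  const sha256 = crypto
    .createHash("sha256")
    .update(fs.readFileSync(filePath))
    .digest("hex");
  return {
    path: filePath,
    size: stat.size, // size in bytes
    modified: stat.mtime.toISOString(), // last modification time
    sha256,
  };
}
```

Running `describeFile` on the extracted node.napi.node before and after the failing launch would also show whether the file changed between runs.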

@CMCDragonkai
Member

Did you try that in the midst of a second execution when the process failed?

Member Author

The process fails literally instantly, so I can't do this. Moreover, the error message contains the actual path I followed to locate the relevant binary, so I cannot follow the path before the error message gets printed.

Member

Yes, I mean follow the path after the error is printed.

Member

And you'd want to make sure no state changes occurred after the fact. While doing it, use strace or the Windows equivalent to check. Also sanity-check with ChatGPT whether there are known issues with loading the same DLL.

@CMCDragonkai
Member

You can always attach a trace to the process as it is running - on Windows it should be possible too.

See this chatgpt transcript (learn to do this when investigating problems):


Prompt:

I need the equivalent of strace for windows. How can I use it, and what is it?

Response:

The equivalent of strace for Windows is Process Monitor (Procmon), a powerful system monitoring tool developed by Microsoft (previously by Sysinternals). Procmon allows you to trace system calls and other interactions between user applications and the operating system, similar to how strace works on Linux.

What is Process Monitor?

Process Monitor is a real-time monitoring tool for Windows that tracks:

  • File system activity: File read/write operations, file creation, deletion, etc.
  • Registry activity: Access, modification, and querying of the Windows Registry.
  • Network activity: Though not as detailed as file or registry operations, it can provide insight into process-related network operations.
  • Process and thread activity: Process creation, termination, and related information.

How to Use Process Monitor

  1. Download Process Monitor:

  2. Launch Process Monitor:

    • Run Procmon.exe. You may need administrator privileges to monitor system-wide activity.
  3. Set Filters (Optional):

    • Click on the Filter menu and configure filters to focus on specific processes, file paths, registry keys, or types of events. For example:
      • Trace only a specific process by name: Process Name is <process_name>.
      • Trace only file or registry operations: Include only File System Activity or Registry Activity.
  4. Start Monitoring:

    • Once you’ve configured your filters, click the Capture Events button (magnifying glass icon) to start capturing system calls and related events.
    • You can stop capturing at any time by clicking the same button.
  5. Analyze the Results:

    • Use the search, filter, and highlight features to analyze the captured events.
    • Click on an event to see detailed information, such as the system call name, parameters, return values, and more.
  6. Export Data (Optional):

    • If you need to share or analyze the logs later, you can save them as .PML files for use in Procmon or export them as CSV files.

Examples of Use Cases

  1. Debugging Application Issues:

    • Identify why an application fails to launch or behaves incorrectly.
    • Track missing files or incorrect registry values.
  2. Analyzing Malware Behavior:

    • Monitor suspicious processes to see what files, registry keys, or network resources they are accessing.
  3. Performance Tuning:

    • Trace bottlenecks by analyzing which files or registry keys are accessed frequently.
  4. Reverse Engineering:

    • Observe how an application interacts with the operating system.

Key Differences Between strace and Procmon

Feature        | strace (Linux)        | Procmon (Windows)
Primary Focus  | System call tracing   | File, registry, and process activity
Granularity    | Detailed system calls | Broader OS-level activities
Output Format  | Text-based logs       | GUI with filters and visualizations

For lightweight alternatives to Procmon, you can explore Sysmon or DebugView from the Sysinternals suite, but Procmon remains the most versatile and closest to strace.

Prompt:

I need to attach to a node process to trace what is happening prior to ERR_DLOPEN_FAILED.

Response:

Tracing what is happening in a Node.js process before encountering an ERR_DLOPEN_FAILED error requires a combination of debugging and system monitoring tools. Here’s a step-by-step approach to identify the root cause using tools available on your operating system.


Understanding ERR_DLOPEN_FAILED

This error typically occurs when Node.js fails to dynamically load a native module. Common causes include:

  • Missing or incorrect file paths.
  • Architecture or ABI mismatch.
  • Missing dependencies (e.g., shared libraries).
  • Permission issues.

Steps to Trace the Error

1. Enable Node.js Debugging

Run the Node.js process with additional debugging options to capture internal activity.

  • Use the --trace-warnings flag to get detailed stack traces for warnings:

    node --trace-warnings app.js
  • Enable verbose logging for native module loading:

    NODE_DEBUG=module node app.js

    This will log information about how modules are resolved and loaded, which can help pinpoint the problem.
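
The `VAR=value command` form above is shell syntax for Unix-like systems. A platform-neutral alternative is to launch the target from a small Node script that sets `NODE_DEBUG` in the child's environment. This is a minimal sketch: `runWithModuleDebug` is a hypothetical helper, and whether a pkg-built executable honours `NODE_DEBUG` is an assumption that would need checking.

```javascript
const { spawnSync } = require("node:child_process");

// Hypothetical helper: runs a command with NODE_DEBUG=module set for the child
// only, and returns its exit status together with the captured stderr.
function runWithModuleDebug(command, args) {
  const child = spawnSync(command, args, {
    env: { ...process.env, NODE_DEBUG: "module" },
    encoding: "utf8",
  });
  return { status: child.status, stderr: child.stderr ?? "" };
}
```

The module-resolution log is written to stderr, so the returned `stderr` string can be saved to a file and searched for the path of the failing .node module.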

2. Attach a Debugger

Node.js has built-in debugging capabilities. You can use the --inspect flag to attach a debugger:

  • Start the process with:

    node --inspect-brk app.js

    This will pause execution at the beginning of the script, allowing you to inspect the environment.

  • Connect to the debugger using a compatible tool:

    • Chrome DevTools: Open chrome://inspect in Chrome.
    • VSCode: Use the "Attach to Node.js" configuration.

Place breakpoints near the code triggering the ERR_DLOPEN_FAILED error to inspect variables and stack traces.

3. Use Process Monitor (Windows) or strace (Linux)

If you suspect system-level issues, trace the process's system calls.

  • On Linux: Use strace to monitor file system and library calls:

    strace -f -e open,openat,stat,access,lstat node app.js

    This will show attempts to open files or libraries, revealing missing dependencies or incorrect paths.

  • On Windows: Use Process Monitor (Procmon):

    • Filter for your node.exe process.
    • Look for File Not Found or Access Denied entries related to .node files or shared libraries.

4. Check for Missing Shared Libraries

Native modules may depend on system-level shared libraries. Use tools to verify dependencies:

  • On Linux: Use ldd to inspect dependencies of the problematic module:

    ldd path/to/module.node

    Look for missing libraries marked as not found.

  • On macOS: Use otool to inspect dependencies:

    otool -L path/to/module.node
  • On Windows: Use Dependency Walker or ldd-like tools like lucasg/Dependencies to check missing DLLs.
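
Once a dependency tool has listed the DLL names a module imports, a quick script can check which of them are actually reachable. The sketch below is a simplification under stated assumptions: it only looks in the module's own directory and the directories on `PATH`, which is a subset of the real Windows DLL search order, and both function names are hypothetical.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Hypothetical helper: the addon's own directory followed by every PATH entry.
// This is a simplified stand-in for the Windows DLL search order.
function defaultSearchDirs(addonPath, env = process.env) {
  const pathDirs = (env.PATH ?? env.Path ?? "")
    .split(path.delimiter)
    .filter(Boolean);
  return [path.dirname(addonPath), ...pathDirs];
}

// Hypothetical helper: maps each DLL name to the first directory hit, or null.
function locateDlls(dllNames, searchDirs) {
  const report = {};
  for (const name of dllNames) {
    const hit = searchDirs
      .map((dir) => path.join(dir, name))
      .find((candidate) => fs.existsSync(candidate));
    report[name] = hit ?? null;
  }
  return report;
}
```

Any name that maps to `null` is a candidate for the missing dependency. The list of names has to come from the dependency tool; none are assumed here.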

5. Verify the Node.js ABI Compatibility

Node.js native modules are compiled against a specific Node.js ABI version. Ensure the module matches your Node.js version:

  • Find the Node.js ABI version:
    node -p "process.versions.modules"
  • Compare it with the compiled module's ABI version. If they differ, recompile the module:
    npm rebuild
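
The same information can be read from inside the process, which is useful when the runtime is embedded in a packaged executable. The failing file in this issue is named node.napi.node, which suggests a Node-API addon; Node-API is designed to stay ABI-stable across Node.js versions, so the Node-API version may matter more here than the module ABI number. A minimal sketch (`runtimeFingerprint` is a hypothetical name):

```javascript
// Hypothetical helper: collects the runtime details a native addon must match.
function runtimeFingerprint() {
  return {
    node: process.version, // Node.js release string
    abi: process.versions.modules, // NODE_MODULE_VERSION of this runtime
    napi: process.versions.napi, // highest Node-API version supported
    platform: process.platform, // for example "win32"
    arch: process.arch, // for example "x64"
  };
}

console.log(runtimeFingerprint());
```

The `platform` and `arch` values can be compared with the package name in the failing path (`quic-win32-x64`).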

6. Use GDB for Low-Level Debugging (Advanced)

If debugging native code, attach gdb to the running process:

  • Start your process:
    node app.js
  • Attach gdb to the running process:
    gdb -p $(pgrep -f node)
  • Set breakpoints in the loading code or shared libraries.

7. Check Permissions

Ensure the .node file and any dependencies have the correct permissions:

  • Verify ownership and access:
    ls -l path/to/module.node

8. Review Your Build Environment

If the module is custom-built, ensure the build environment matches your runtime environment:

  • Check node-gyp logs for errors during the build:
    npm rebuild --verbose

Combining Techniques

For comprehensive debugging:

  1. Use --trace-warnings and NODE_DEBUG=module for application-level logs.
  2. Use strace or Procmon to monitor system calls.
  3. Inspect native module dependencies with ldd or Dependency Walker.
  4. Recompile or rebuild the module to ensure compatibility.

This multi-level approach will allow you to pinpoint the exact issue causing the ERR_DLOPEN_FAILED error. Let me know if you'd like help interpreting specific traces or outputs!

Prompt:

No this is happening to Windows, so I want to know all the system calls that occurred prior to the process crashing. So that's why I want to attach strace here.

Response:

To achieve the functionality of strace on Windows, where you can capture system calls leading up to a crash (e.g., ERR_DLOPEN_FAILED), you can use the following tools and approaches:


1. Process Monitor (Procmon)

Procmon is the closest equivalent to strace on Windows. It allows you to monitor all file, registry, and process/thread activity.

Steps to Use Procmon:

  1. Download and Launch Procmon:

  2. Set Up Filters for the Node.js Process:

    • Click on the Filter menu and set:
      Process Name is node.exe
      
    • Add additional filters for:
      • File System Activity (Path contains .node if debugging native modules).
      • Registry Activity (e.g., RegQueryKey for environment variables).
  3. Reproduce the Error:

    • Start capturing by clicking the magnifying glass icon.
    • Reproduce the process crash (ERR_DLOPEN_FAILED).
  4. Analyze the Logs:

    • Look for failed system calls (e.g., ACCESS DENIED, FILE NOT FOUND) or other anomalies leading up to the crash.
    • Save the logs for deeper inspection or sharing (File > Save).

2. Windows Performance Recorder and Analyzer (WPR/WPA)

Windows Performance Recorder (WPR) and Windows Performance Analyzer (WPA) can provide detailed tracing of system calls and performance metrics.

Steps to Use WPR:

  1. Install Windows Performance Toolkit:

  2. Record Traces:

    • Open Command Prompt as Administrator.
    • Start recording:
      wpr -start GeneralProfile -start CPU
    • Reproduce the error.
    • Stop recording:
      wpr -stop output.etl
  3. Analyze Traces with WPA:

    • Open output.etl in Windows Performance Analyzer.
    • Focus on:
      • CPU Activity.
      • File I/O and Registry traces.
    • Drill into the Node.js process to identify anomalies.

3. Debugging Tools for Windows (WinDbg)

If you want to trace low-level system calls and debug the process crash directly, you can use WinDbg.

Steps to Use WinDbg:

  1. Install WinDbg:

    • Part of the Windows 10 SDK.
  2. Attach to the Process:

    • Launch your Node.js process:
      node app.js
    • Open WinDbg and attach to the process:
      File > Attach to Process > Select node.exe
      
  3. Break on Module Loads:

    • WinDbg does not trace system calls the way strace does, but it can stop each time a module (DLL) is loaded. Enable that with:
      sxe ld
    • Resume with g and note the last module load attempted before the crash.
  4. Analyze Crash Dumps (Optional):

    • If the process crashes, generate a crash dump and analyze it for detailed insights:
      .dump /ma crash.dmp

4. Sysinternals Tools

a. DebugView

Use DebugView for debugging output if the application emits debug logs via OutputDebugString.

b. ProcDump

Capture a memory dump when the process crashes:

procdump -e -w node.exe dump.dmp

Analyze the dump in WinDbg or a similar tool.


Combined Workflow for Best Results:

  1. Use Procmon for high-level system activity tracing leading up to the crash.
  2. Use WPR/WPA or WinDbg for low-level system call tracing and debugging.
  3. Analyze dependencies with tools like Dependency Walker or lucasg/Dependencies.

Let me know if you need guidance on setting up or interpreting results from any of these tools!

Prompt:

Yea so how would I attach procmon to the node process? Assume I'm running the entire program via powershell admin.

Response:

To attach Procmon to your Node.js process while running it via an elevated PowerShell session, follow these steps:


1. Launch Process Monitor with Administrator Privileges

  1. Start Procmon:

    • Open it by running Procmon.exe directly from your Sysinternals folder.
  2. When prompted, ensure Procmon runs with administrator privileges (this is necessary for tracing system-level activity).


2. Apply a Filter for the Node.js Process

Procmon will capture everything by default, which can be overwhelming. Narrow the scope to just your Node.js process:

  1. In Procmon, click Filter > Filter... to open the filter configuration.

  2. Add the following filter conditions:

    • Process Name is node.exe, then Include.
    • Optionally, add specific paths if you are debugging .node native modules or specific files:
      • Path contains .node, then Include.
  3. Click Add, then Apply to activate the filters.


3. Run Your Node.js Script via PowerShell

  1. In your elevated PowerShell session, run the Node.js script as you normally would. For example:

    node app.js

    If you're troubleshooting, you can also add the --trace-warnings flag or set the NODE_DEBUG environment variable to module (in PowerShell: $env:NODE_DEBUG = "module").

  2. Procmon will automatically pick up the Node.js process and start capturing its events based on the filters you applied.


4. Monitor and Capture the Events

  1. Observe the live stream of system calls in Procmon.

    • Look for:
      • FILE NOT FOUND (missing files or modules).
      • ACCESS DENIED (permission issues).
      • Registry operations (e.g., environment variables being queried).
  2. Pause logging after the error occurs to avoid unnecessary data capture:

    • Click the magnifying glass icon or press Ctrl + E to stop capturing events.

5. Save and Analyze the Log

  1. Save the filtered log:

    • Go to File > Save and save it as a .PML file for later analysis or sharing.
  2. Analyze the sequence of events leading up to the error:

    • Look for failed system calls (Result = FILE NOT FOUND, ACCESS DENIED, etc.).
    • Focus on .node file operations, registry reads, and process threads.
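
If the capture is also saved as CSV (File > Save with the CSV format selected), the failed lookups can be pulled out with a short script instead of scrolling through the GUI. This is a sketch under assumptions: it expects the header row to contain columns named Operation, Path and Result, it treats every result other than SUCCESS as a failure, and it does not handle fields with embedded newlines. Both function names are hypothetical. For the packaged binary in this issue, the Procmon filter would need the process name pk.exe rather than node.exe.

```javascript
// Splits one CSV line into fields, honouring double-quoted fields and "" escapes.
function parseCsvLine(line) {
  const fields = [];
  let current = "";
  let inQuotes = false;
  for (let i = 0; i < line.length; i++) {
    const ch = line[i];
    if (inQuotes) {
      if (ch === '"' && line[i + 1] === '"') {
        current += '"'; // escaped quote inside a quoted field
        i++;
      } else if (ch === '"') {
        inQuotes = false;
      } else {
        current += ch;
      }
    } else if (ch === '"') {
      inQuotes = true;
    } else if (ch === ",") {
      fields.push(current);
      current = "";
    } else {
      current += ch;
    }
  }
  fields.push(current);
  return fields;
}

// Hypothetical helper: returns non-SUCCESS events whose path ends in .node or .dll.
function failedLoads(csvText) {
  const lines = csvText
    .replace(/^\uFEFF/, "") // drop a byte-order mark if present
    .split(/\r?\n/)
    .filter((line) => line.length > 0);
  const header = parseCsvLine(lines[0] ?? "");
  const opIdx = header.indexOf("Operation");
  const pathIdx = header.indexOf("Path");
  const resultIdx = header.indexOf("Result");
  return lines
    .slice(1)
    .map(parseCsvLine)
    .filter(
      (row) =>
        /\.(node|dll)$/i.test(row[pathIdx] ?? "") && row[resultIdx] !== "SUCCESS",
    )
    .map((row) => ({
      operation: row[opIdx],
      path: row[pathIdx],
      result: row[resultIdx],
    }));
}
```

Passing the file contents, read with `fs.readFileSync(csvPath, "utf8")`, to `failedLoads` gives a short list of module and DLL lookups that did not succeed, in capture order.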

6. Additional Tips

  • If Node.js is launched dynamically or spawns child processes:

    • Open Tools > Process Tree, select the parent process, and choose Include Subtree to add its child processes to the filter.
    • Alternatively, add Parent PID to the display columns to trace spawned processes.
  • Use Procmon’s Highlighting feature:

    • Highlight events like FILE NOT FOUND or ACCESS DENIED for quick identification.

Example Workflow Summary

  1. Start Procmon.
  2. Filter for Process Name is node.exe.
  3. Run node app.js in an elevated PowerShell session.
  4. Observe and capture the system calls/events leading to ERR_DLOPEN_FAILED.
  5. Save the log and analyze the failure.

This approach ensures a clear view of what system-level interactions Node.js attempted before encountering the error. Let me know if you need help interpreting specific logs!
