[Bug] When submitting a spark job in Yarn cluster mode, an error occurs that the resource file cannot be found. #6771

Open
BohanZhang0222 opened this issue Oct 22, 2024 · 1 comment
Labels: kind:bug, priority:major

Code of Conduct

Search before asking

  • I have searched in the issues and found no similar issues.

Describe the bug

I use the Kyuubi batch API v2 to submit Spark jobs in YARN cluster mode.
When the node that receives the API call is not the node that submits the Spark job, the submission fails with an error saying that the resource file cannot be found. My analysis is that when I call the API, Kyuubi places the uploaded resource file in a local directory, and this directory is not shared among the Kyuubi nodes. As a result, when the batch is scheduled for submission on another node, the resource file cannot be found there.
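
A minimal sketch of the failure mode described above, written as a self-contained simulation. It is not Kyuubi's code: the `Node` class, the method names, the directory layout, and the idea that only a relative path is recorded with the batch are all assumptions made for illustration. Each "node" gets its own directory to stand in for a local disk.

```python
import tempfile
from pathlib import Path


class Node:
    """Hypothetical stand-in for one Kyuubi server node with its own local disk."""

    def __init__(self, name, local_root):
        self.name = name
        self.work_dir = Path(local_root) / name / "work"
        self.work_dir.mkdir(parents=True, exist_ok=True)

    def accept_upload(self, batch_id, filename, data):
        # The node that receives the API call writes the resource to its local work dir.
        target = self.work_dir / batch_id / filename
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(data)
        # Modeling assumption: the batch record keeps a path that only makes
        # sense on the disk of the node that stored the file.
        return Path(batch_id) / filename

    def submit(self, relative_path):
        # The submitting node resolves the recorded path against its OWN local disk.
        resource = self.work_dir / relative_path
        if not resource.exists():
            return f"{self.name}: resource not found"
        return f"{self.name}: submitted {resource.name}"


def simulate():
    with tempfile.TemporaryDirectory() as root:
        node_a = Node("node-a", root)
        node_b = Node("node-b", root)
        recorded = node_a.accept_upload("batch-1", "app.jar", b"placeholder bytes")
        # Same node succeeds; a different node cannot see the file.
        return node_a.submit(recorded), node_b.submit(recorded)


if __name__ == "__main__":
    same_node, other_node = simulate()
    print(same_node)
    print(other_node)
```

In this model the submission only succeeds when the batch happens to be picked up by the node that stored the upload, which matches the symptom reported here.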

Affects Version(s)

1.9.1

Kyuubi Server Log Output

No response

Kyuubi Engine Log Output

No response

Kyuubi Server Configurations

kyuubi.batch.impl.version=2
kyuubi.batch.submitter.enabled=true

Kyuubi Engine Configurations

No response

Additional context

The solution I tried:
Kyuubi has an environment variable, kyuubi_work_dir, and I changed this directory to point to shared storage.
But this failed.
The problem I encountered is that job submissions fail occasionally. The Spark submission log is visible on the Kyuubi server, and the corresponding batch ID can be found in the database, but the submission does not succeed and no YARN application ID is obtained. The batch state in Kyuubi changes from PENDING to ERROR very quickly.

Calling the locallog interface of the batch returned no useful error content. (Because this was an incident in the production environment, the change has been rolled back and no screenshots can be taken.) However, the locallog output mentions the path of the detailed error log, which is a log file in the username subdirectory of the Kyuubi work directory (the shared directory configured in the environment variable).

When I opened this log file, I found that its content described a different job. At that point I realized that sharing the work directory across nodes may have caused conflicts between jobs.

I suspected that the uploaded resource files might also conflict, so I executed the following query.
[image]

This confirms that a shared directory causes resource file and log conflicts across nodes.
However, I cannot confirm whether this is the cause of the occasional submission failures.
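
The kind of conflict described above can be sketched in general terms as follows. This is only an illustration of the hazard, not a reproduction of Kyuubi's behavior: how Kyuubi actually names its log and resource files is not shown in this issue, so the `write_log` helper, the `submit.log` file name, and the per-node subdirectory layout are all hypothetical.

```python
import tempfile
from pathlib import Path


def write_log(work_dir, user, log_name, text):
    """Write a log under <work_dir>/<user>/<log_name>, overwriting any existing file."""
    path = Path(work_dir) / user / log_name
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text)
    return path


def demo():
    with tempfile.TemporaryDirectory() as root:
        shared = Path(root)

        # Assumption: two nodes derive the same file name for different jobs
        # and both write into the same shared work directory.
        p1 = write_log(shared, "alice", "submit.log", "job from node-a")
        p2 = write_log(shared, "alice", "submit.log", "job from node-b")
        # The second write replaces the first, so node-a's log now describes
        # node-b's job.
        collided = p1 == p2 and p1.read_text() == "job from node-b"

        # Hypothetical layout: each node writes under its own subdirectory
        # of the shared storage, so the paths no longer overlap.
        q1 = write_log(shared / "node-a", "alice", "submit.log", "job from node-a")
        q2 = write_log(shared / "node-b", "alice", "submit.log", "job from node-b")
        isolated = (
            q1.read_text() == "job from node-a"
            and q2.read_text() == "job from node-b"
        )
        return collided, isolated


if __name__ == "__main__":
    print(demo())
```

Under these assumptions, any two writers that compute the same path in a shared directory will silently overwrite each other, which would be consistent with a log file that describes another job.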

Are you willing to submit PR?

  • Yes. I would be willing to submit a PR with guidance from the Kyuubi community to fix.
  • No. I cannot submit a PR at this time.
BohanZhang0222 added the kind:bug and priority:major labels on Oct 22, 2024

Hello @BohanZhang0222,
Thanks for finding the time to report the issue!
We really appreciate the community's efforts to improve Apache Kyuubi.
