Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better timeouts #129

Open
wants to merge 10 commits into
base: main
Choose a base branch
from
Open

Better timeouts #129

wants to merge 10 commits into from

Conversation

jlamypoirier
Copy link
Collaborator

@jlamypoirier jlamypoirier commented Jan 24, 2025

✨ Description

Fix: #122, #76

Set a custom timeout for slow operations (data setup, checkpoint save/load/export) using the well-hidden add_ephemeral method. Using a single "long timeout" for simplicity, the alternative would be to have 4-5 separate timeouts, which would be more powerful and actually simpler to implement but also more annoying to configure.

🔍 Type of change

Select all that apply:

  • 🐛 Bug fix (non-breaking change that addresses a specific issue)
  • 🚀 New feature (non-breaking change that adds functionality)
  • ⚠️ Breaking change (a change that could affect existing functionality)
  • 📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
  • 🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
  • 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
  • 📝 Documentation change (updates documentation, including new content or typo fixes)
  • 🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)



@config_class()
class GPTTestSlowDatasetConfig(GPTSampledDatasetConfig):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why can't this be moved to a module in the tests?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because it has to be run in a separate process (for distributed) which doesn't import the test file.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, so you're calling the fast-llm CLI in a subprocess in the tests.

@jlamypoirier jlamypoirier marked this pull request as ready for review January 24, 2025 23:00
@jlamypoirier jlamypoirier requested a review from tscholak January 24, 2025 23:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[feat] Use a different timeout dor data loading and checkpoints
2 participants