
Slow Dataset Preprocessing due to CPU affinity (?) issues #118

Open
mgolub2 opened this issue May 2, 2024 · 5 comments
Labels: bug (Something isn't working), help wanted (Extra attention is needed)

Comments


mgolub2 commented May 2, 2024

🐛 Bug

I'm attempting to train a model using litgpt and the openwebtext dataset. I launch the run as usual following the litgpt examples, and the dataset preprocessing starts:
[screenshot]
However, the workers are all running on a single core!
[screenshot]
Checking the PID, the affinity appears to be set incorrectly (?):
[screenshot]

I don't know why that would be, though. This is fairly new behavior; running the prepare_data portion of openwebtext was quite fast a few weeks ago.
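
For reference, a minimal psutil sketch for confirming the affinity of the worker processes (assuming psutil is installed and the workers show up as pt_main_thread, as they do here):

```python
import psutil

# Print the set of CPUs each preprocessing worker is allowed to run on.
for proc in psutil.process_iter(["pid", "name"]):
    if proc.info["name"] == "pt_main_thread":
        print(proc.info["pid"], proc.cpu_affinity())
```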

To Reproduce

litgpt pretrain --config config_hub/pretrain/tinyllama

Expected behavior

Data preparation should use multiple cores

Environment

  • PyTorch Version (e.g., 1.0): 2.4.0a0+gitc82fcb7 (cxx11 abi build)
  • OS (e.g., Linux): Rocky Linux 9.3
  • How you installed PyTorch (conda, pip, source): source
  • Build command you used (if compiling from source): python setup.py develop
  • Python version: 3.11
  • CUDA/cuDNN version: 12.4
  • GPU models and configuration: 2x 4090s
  • Any other relevant information: You Rock!

Additional context

I'm thinking this is possibly some weird interplay between threading, affinity, and the cxx11 ABI? Will test in a more normal configuration soon.

mgolub2 added the bug (Something isn't working) and help wanted (Extra attention is needed) labels on May 2, 2024

github-actions bot commented May 2, 2024

Hi! Thanks for your contribution, great first issue!

mgolub2 (Author) commented May 2, 2024

> I'm thinking this is possibly some weird interplay between threading, affinity, and the cxx11 ABI? Will test in a more normal configuration soon.

Wow, it might actually be cxx11 abi related (edit: or python 3.9 vs 3.11?) - running the pretraining example in a non cxx11 environment uses all the cores:

[screenshots]

I double-checked that tokenizers and transformers were both on the same version. They were not, but fixing that changed nothing; both are now on 0.19.1 and 4.40.1 respectively.

tchaton (Collaborator) commented May 7, 2024

That's odd indeed. I recommend using Lightning Studio to prepare your dataset.

mgolub2 (Author) commented May 15, 2024

I can paper over the issue by setting the CPU affinity manually (fish shell incoming): sudo pgrep pt_main_thread | while read -l pid; taskset --all-tasks -p ffff $pid; end
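
Roughly the same workaround from Python instead of fish/taskset (a sketch, again assuming psutil and that the workers are named pt_main_thread; like the taskset version, it likely needs to run as root):

```python
import psutil

# Re-allow every preprocessing worker to run on all CPUs.
all_cpus = list(range(psutil.cpu_count()))
for proc in psutil.process_iter(["pid", "name"]):
    if proc.info["name"] == "pt_main_thread":
        proc.cpu_affinity(all_cpus)
```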

> That's odd indeed. I recommend using Lightning Studio to prepare your dataset.

I'd rather not, thanks.

tchaton (Collaborator) commented May 15, 2024

Hey @mgolub2, PyTorch Geometric supports CPU affinity mapping for their dataloader: https://github.com/pyg-team/pytorch_geometric/blob/e9648df16dcb6dde0e09b5736b1b2da5d68db2ad/docs/source/advanced/cpu_affinity.rst#L80.

If you are interested, you could take inspiration from it and contribute native CPU affinity support to litdata, so you don't need to hack around it.
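
To illustrate the idea (not litdata's or PyG's actual API, just a sketch of per-worker pinning via a PyTorch worker_init_fn, Linux only):

```python
import os

def pin_worker_to_core(worker_id: int) -> None:
    # Assumes one core per worker, starting at core 0.
    os.sched_setaffinity(0, {worker_id})

# loader = DataLoader(dataset, num_workers=8, worker_init_fn=pin_worker_to_core)
```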

> I'd rather not, thanks.
The main advantage is that data file manipulations are faster in the cloud, so it scales better. Plus, you get a free machine to use whenever you need it.
