
Memory leak. #69

Open
chao-camect opened this issue Apr 8, 2024 · 29 comments
Labels: aitce, bug (Something isn't working)

@chao-camect

I have been training with TensorFlow + Keras on NVIDIA GPUs for a while.
Recently I experimented with an A770. With some effort, I finally got it working, except that there is a memory leak.
The same code works fine on an NVIDIA 3090: it uses about 8 GB of memory, very stably.
With the A770, it starts at 8 GB and grows very quickly until it is killed because of OOM.
I used tracemalloc to see where the leak is. No luck. So it's not in the Python code.
I haven't had time to dig into more details...
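
(For context on the tracemalloc check above: roughly, one takes a snapshot before training, another after a few epochs, and compares them. The sketch below is only illustrative; train_one_epoch() is a hypothetical placeholder for one epoch of model.fit(), not a function from the script later in this thread.)

import tracemalloc

tracemalloc.start(25)                      # keep up to 25 frames per allocation
baseline = tracemalloc.take_snapshot()

for epoch in range(5):
    train_one_epoch()                      # hypothetical: run one epoch of model.fit()
    snapshot = tracemalloc.take_snapshot()
    # Top Python-level allocation growth since the baseline; if the leak were in
    # Python code it should show up here.
    for stat in snapshot.compare_to(baseline, 'lineno')[:10]:
        print(stat)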

@chao-camect
Author

More details: I am using tensorflow 2.15.1, running on Ubuntu 22.04. I can reproduce it on both kernel 6.5 and 6.7.

@huiyan2021
Contributor

Hi @chao-camect, thanks for reporting this issue. Could you share a small reproducer so that we can investigate?

@chao-camect
Author

chao-camect commented Apr 9, 2024

I think you should be able to reproduce it by training any model... It must be leaking in some common operations.
Anyway, here is a simple program to reproduce it:

import os

from absl import app
from absl import flags
import tensorflow as tf

from tensorflow.keras import applications
from tensorflow.keras import callbacks
from tensorflow.keras import layers
from tensorflow.keras import optimizers
from tensorflow.keras.preprocessing.image import ImageDataGenerator

FLAGS = flags.FLAGS
flags.DEFINE_string('data_dir', '', '')
flags.DEFINE_string('output_dir', 'output/dogs_vs_cats_mobilenet_v2', '')
flags.DEFINE_integer('batch_size', 20, '')
flags.DEFINE_integer('width', 300, '')
flags.DEFINE_integer('height', 300, '')
flags.DEFINE_integer('epochs', 10000, '')
flags.DEFINE_integer('max_chpt_to_keep', 10, '')
flags.DEFINE_boolean('use_data_augmentation', True, '')
flags.DEFINE_boolean('debug', False, '')
flags.DEFINE_string('test_image_dir', '', '')

def count_jpgs(dirStr: str) -> int:
    n = 0
    for name in os.listdir(dirStr):
        full_name = os.path.join(dirStr, name)
        if os.path.isfile(full_name):
            ext = os.path.splitext(name)[1]
            if ext == '.jpg' or ext == '.jpeg':
                n += 1
        elif os.path.isdir(full_name):
            n += count_jpgs(full_name)
    return n

def main(argv):
    i = layers.Input([FLAGS.height, FLAGS.width, 3], dtype = tf.uint8)
    x = tf.cast(i, tf.float32)
    x = applications.mobilenet_v2.preprocess_input(x)
    base = applications.MobileNetV2(classes=2, include_top=False)
    base.trainable = False
    x = base(x)
    x = layers.GlobalAveragePooling2D()(x)
    o = layers.Dense(2, activation='softmax', name='output')(x)
    model = tf.keras.Model(inputs=[i], outputs=[o], name='dogs_vs_cats_mobilenet_v2')
    model.compile(
        loss='categorical_crossentropy',
        optimizer=optimizers.RMSprop(learning_rate=2e-5, rho=0.618),
        metrics=['acc'])
    if FLAGS.debug:
        model.summary()

    if FLAGS.use_data_augmentation:
        train_datagen = ImageDataGenerator(
            rotation_range=40,
            width_shift_range=0.2,
            height_shift_range=0.2,
            shear_range=0.2,
            zoom_range=0.2,
            horizontal_flip=True)
    else:
        train_datagen = ImageDataGenerator()
    train_generator = train_datagen.flow_from_directory(
        os.path.join(FLAGS.data_dir, 'train'),
        target_size=(FLAGS.height, FLAGS.width),
        batch_size=FLAGS.batch_size,
        class_mode='categorical')

    num_train_jpgs = count_jpgs(os.path.join(FLAGS.data_dir, 'train'))
    print('Total %d training images.' % num_train_jpgs)

    tb_callback = callbacks.TensorBoard(log_dir=os.path.join(FLAGS.output_dir, 'tb_logs'))
    model.fit(
        x=train_generator,
        steps_per_epoch=num_train_jpgs // FLAGS.batch_size,
        epochs=FLAGS.epochs,
        callbacks=[tb_callback])
    model.save(os.path.join(FLAGS.output_dir, 'saved_model'))

app.run(main)

To run it, download data from [Kaggle](https://www.kaggle.com/c/dogs-vs-cats/data).

unzip train.zip
cd train/
mkdir dogs cats
mv dog.* dogs/
mv cat.* cats/
python3 train.py --data_dir=xxxx
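
If the leak really is in common operations rather than the input pipeline, a smaller variant that feeds the same MobileNetV2-based model random data should also reproduce it without the Kaggle download. This is only an illustrative sketch based on the script above, not a verified reproducer:

import numpy as np
import tensorflow as tf
from tensorflow.keras import applications, layers

# Same model structure as the script above, minus the flags and data pipeline.
i = layers.Input([300, 300, 3], dtype=tf.uint8)
x = tf.cast(i, tf.float32)
x = applications.mobilenet_v2.preprocess_input(x)
base = applications.MobileNetV2(include_top=False)
base.trainable = False
x = base(x)
x = layers.GlobalAveragePooling2D()(x)
o = layers.Dense(2, activation='softmax')(x)
model = tf.keras.Model(inputs=[i], outputs=[o])
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['acc'])

def random_batches(batch_size=20):
    # Endless stream of random uint8 "images" and one-hot labels.
    while True:
        images = np.random.randint(0, 256, size=(batch_size, 300, 300, 3), dtype=np.uint8)
        labels = tf.keras.utils.to_categorical(np.random.randint(0, 2, size=batch_size), 2)
        yield images, labels

model.fit(random_batches(), steps_per_epoch=1250, epochs=10000)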

@chao-camect
Author

More context: I run it inside Docker. I installed the dependencies inside Docker using the following script:

wget -qO - https://repositories.intel.com/gpu/intel-graphics.key | sudo gpg --dearmor --output /usr/share/keyrings/intel-graphics.gpg    
echo "deb [arch=amd64 signed-by=/usr/share/keyrings/intel-graphics.gpg] https://repositories.intel.com/gpu/ubuntu jammy/lts/2350 unified" | sudo tee /etc/apt/sources.list.d/intel-gpu-jammy.list
wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB | sudo gpg --dearmor --output /usr/share/keyrings/oneapi-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main" | sudo tee /etc/apt/sources.list.d/oneAPI.list
sudo apt update
sudo apt install -y intel-opencl-icd intel-level-zero-gpu level-zero \
    intel-media-va-driver-non-free libmfx1 libmfxgen1 libvpl2 libegl-mesa0 libegl1-mesa \
    libegl1-mesa-dev libgbm1 libgl1-mesa-dev libgl1-mesa-dri libglapi-mesa libgles2-mesa-dev \
    libglx-mesa0 libigdgmm12 libxatracker2 mesa-va-drivers mesa-vdpau-drivers mesa-vulkan-drivers \
    va-driver-all vainfo hwinfo clinfo libigc-dev intel-igc-cm libigdfcl-dev libigfxcmrt-dev \
    level-zero-dev intel-oneapi-runtime-dpcpp-cpp intel-oneapi-runtime-mkl xpu-smi

@huiyan2021
Contributor

Could you also run this environment check script and upload the result here? Thanks!

https://github.com/intel/intel-extension-for-tensorflow/blob/main/tools/python/env_check.py

@chao-camect
Author

The tool doesn't support TensorFlow 2.15.

Check Environment for Intel(R) Extension for TensorFlow*...

100% [................................................................................] 6116 / 6116
Check Python
Traceback (most recent call last):
  File "/home/chao/projects/intel-extension-for-tensorflow/tools/python/env_check.py", line 138, in <module>
    itex_version = check_python()
  File "/home/chao/projects/intel-extension-for-tensorflow/tools/python/env_check.py", line 39, in check_python
    elif python_minor_version < config['python_version']['min_python_version'][itex_version]:
KeyError: '2.15.0.0'

@Disty0

Disty0 commented Apr 9, 2024

I suspect this is an issue with a common Intel library, because the same issue happens on IPEX as well:

intel/intel-extension-for-pytorch#476

@chao-camect
Author

@Disty0 Thanks for linking to the other issue. I agree with you. For me, it leaks 3-4 MB every second. It must be some common operation. I don't get how it could have evaded Intel's own engineers...
I tried LD_PRELOAD=libtcmalloc_minimal.so.4 from intel/intel-extension-for-pytorch#476; it didn't help.

@huiyan2021
Contributor

huiyan2021 commented Apr 10, 2024

Hi @chao-camect, I am running the training script on an Arc A770 with the Docker image that we published: https://intel.github.io/intel-extension-for-tensorflow/latest/docs/install/install_for_xpu.html#get-docker-container-from-dockerhub

Total 25000 training images.
Epoch 1/10000
2024-04-10 04:00:56.653123: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:117] Plugin optimizer for device_type XPU is enabled.
1250/1250 [==============================] - 158s 125ms/step - loss: 0.3988 - acc: 0.8320
Epoch 2/10000
1250/1250 [==============================] - 155s 124ms/step - loss: 0.1587 - acc: 0.9555
Epoch 3/10000
1250/1250 [==============================] - 154s 123ms/step - loss: 0.1136 - acc: 0.9612
Epoch 4/10000
1250/1250 [==============================] - 154s 123ms/step - loss: 0.0992 - acc: 0.9624
Epoch 5/10000
1250/1250 [==============================] - 154s 123ms/step - loss: 0.0925 - acc: 0.96426
Epoch 6/10000
1250/1250 [==============================] - 154s 123ms/step - loss: 0.0883 - acc: 0.9677
Epoch 7/10000
1250/1250 [==============================] - 154s 123ms/step - loss: 0.0856 - acc: 0.9670
Epoch 8/10000
1250/1250 [==============================] - 154s 123ms/step - loss: 0.0821 - acc: 0.9693
Epoch 9/10000
1250/1250 [==============================] - 154s 123ms/step - loss: 0.0794 - acc: 0.9702
Epoch 10/10000
1250/1250 [==============================] - 156s 125ms/step - loss: 0.0765 - acc: 0.9698
Epoch 11/10000
1250/1250 [==============================] - 154s 123ms/step - loss: 0.0793 - acc: 0.9694
Epoch 12/10000
1250/1250 [==============================] - 154s 123ms/step - loss: 0.0746 - acc: 0.9712
Epoch 13/10000
1250/1250 [==============================] - 154s 123ms/step - loss: 0.0758 - acc: 0.9712
Epoch 14/10000
1250/1250 [==============================] - 154s 123ms/step - loss: 0.0756 - acc: 0.9711
Epoch 15/10000
1250/1250 [==============================] - 154s 124ms/step - loss: 0.0703 - acc: 0.9729
Epoch 16/10000
1250/1250 [==============================] - 155s 124ms/step - loss: 0.0739 - acc: 0.9710
Epoch 17/10000
1250/1250 [==============================] - 155s 124ms/step - loss: 0.0725 - acc: 0.9708
Epoch 18/10000
535/1250 [===========>..................] - ETA: 1:28 - loss: 0.0699 - acc: 0.9733

GPU memory used (MiB) stays stable at 4222:
[screenshot: GPU memory usage]

You should be able to run https://github.com/intel/intel-extension-for-tensorflow/blob/main/tools/python/env_check.py now; you also need to clone https://github.com/intel/intel-extension-for-tensorflow/blob/main/tools/python/config.json. Could you try again and upload the result here?

@chao-camect
Author

It's CPU memory, not GPU.
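
(One way to quantify that host-side growth is to log the process RSS after every epoch. The sketch below is illustrative only; it uses psutil, which is an extra dependency, and RSSLogger is just an example name, not part of the script above.)

import os
import psutil
from tensorflow.keras import callbacks

class RSSLogger(callbacks.Callback):
    """Prints the resident set size of this process after every epoch."""

    def __init__(self):
        super().__init__()
        self._proc = psutil.Process(os.getpid())

    def on_epoch_end(self, epoch, logs=None):
        rss_mib = self._proc.memory_info().rss / (1024 * 1024)
        print(f'epoch {epoch}: host RSS = {rss_mib:.0f} MiB')

# Usage: model.fit(..., callbacks=[tb_callback, RSSLogger()])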

@huiyan2021
Contributor

> More context: I run it inside Docker. I installed the dependencies inside Docker using the following script: [...]

Did you install by following the steps here? https://intel.github.io/intel-extension-for-tensorflow/latest/docs/install/experimental/install_for_arc_gpu.html

@huiyan2021
Contributor

> It's CPU memory, not GPU.

How many iterations had you trained when you hit the OOM?

@huiyan2021
Contributor

@chao-camect we observed host memory usage increasing during training; the developer team is looking into it and will post here when there are any updates. Thanks!

@chao-camect
Author

Thanks for the prompt response!

@huiyan2021
Contributor

huiyan2021 commented Apr 11, 2024

Hi @chao-camect

  1. Upgrade the driver to the latest version both on the host and in Docker, as follows:
wget -qO - https://repositories.intel.com/gpu/intel-graphics.key | \
  sudo gpg --dearmor --output /usr/share/keyrings/intel-graphics.gpg
echo "deb [arch=amd64,i386 signed-by=/usr/share/keyrings/intel-graphics.gpg] https://repositories.intel.com/gpu/ubuntu jammy client" | \
  sudo tee /etc/apt/sources.list.d/intel-gpu-jammy.list
sudo apt update
sudo apt upgrade
  2. Check the driver version:
    [screenshot: installed driver package versions]

  3. We also submitted a PR fixing the issue; build the main branch of Intel Extension for TensorFlow from source, or wait for our weekly build: https://intel.github.io/intel-extension-for-tensorflow/latest/get_started.html#install-for-xpu-weekly

This is the memory usage trend I measured on an Arc A770 using the latest build:
[chart: memory usage on Arc A770]

This is the one on an A100:
[chart: memory usage on A100]

Let us know whether you can reproduce the result or not, thanks!

@chao-camect
Author

Thanks. When will the weekly build be ready? I see that the latest version is 20240329.

@huiyan2021
Contributor

It should be out this week; I will let you know when it's ready.

@xiguiw
Contributor

xiguiw commented Apr 17, 2024

intel_extension_for_tensorflow_lib_weekly 2.15.0.0.2.dev20240415
intel_extension_for_tensorflow_weekly 2.15.0.0.dev20240415
tensorflow 2.15.1
tensorflow-estimator 2.15.0
tensorflow-io-gcs-filesystem 0.36.0

@xiguiw
Contributor

xiguiw commented Apr 17, 2024

@chao-camect
The weekly build is ready. Please help to try it, thanks!

Please uninstall your previous intel-extension-for-tensorflow package first. The package names are different.

The install command is:
pip install --upgrade intel-extension-for-tensorflow-weekly[xpu] -f https://developer.intel.com/itex-whl-weekly

A couple of things to be noted:

  1. Upgrade your driver as @huiyan2021 mentioned.
    Both the host (KMD driver) and Docker (UMD driver) need to be upgraded if you use Docker.
ii  intel-level-zero-gpu                       1.3.28202.39-821~22.04                  amd64        Intel(R) Graphics Compute Runtime for oneAPI Level Zero.
ii  level-zero                                 1.15.8-820~22.04                        amd64        Intel(R) Graphics Compute Runtime for oneAPI Level Zero.
ii  level-zero-dev                             1.15.8-820~22.04                        amd64        Intel(R) Graphics Compute Runtime for oneAPI Level Zero.
  2. The oneAPI version is 2024.1.

FYI, here is the memory used with the weekly build, running the training script you provided for 5 epochs:

[chart: memory usage with weekly build 20240415]

Good luck!

@xiguiw added the bug (Something isn't working) and aitce labels on Apr 17, 2024
@chao-camect
Author

No. It's still leaking, just slower, as you can see from your own graph...
I have a bigger program that leaks more obviously. If necessary, I can see whether I can separate that part out for your testing purposes. However, this is really not my job.
I suggest you guys spend some time on testing. You need to compare training on Arc with NVIDIA.

@xiguiw
Contributor

xiguiw commented Apr 17, 2024

@chao-camect
Thanks for trying the weekly build.

Yes, we compared training on NVIDIA; this is the result on an A100, running the script you provided us:
[chart: memory usage on A100]

From the pattern, memory usage increases between epochs but stays stable within an epoch. The A100 behaves similarly.

We suppose the memory leak is related to a specific workload (or operator/kernel).
Could you help check with tcmalloc by doing the following:

  1. Disable TensorBoard / disable the callback.
  2. Put the training command in run.sh.
  3. Run the following commands and report the results:
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4 HEAPCHECK=normal run.sh 

pprof run.sh "xxxx.heap" --stack --lines --text

We checked your example for 1 epoch and found a 72-byte leak from ITEX. This weekly build fixes it.

Note that the oneDNN primitive cache and the kernel cache from queue.submit consume memory; the behavior looks like a memory leak but it actually is NOT. Ignore such 'leaks' on Python objects and TensorFlow...

@xiguiw
Contributor

xiguiw commented Apr 17, 2024

@chao-camect

It would be nice if you could separate that part from your 'bigger program' for us to test.

BTW, did you update the driver? Which driver version are you using now?

@chao-camect
Author

It'll take some time before I can get a minimized version for you to test with. Do you test the extension with a set of common models regularly? I don't think there is anything special in my model.
Anyway, the weekly build does help a bit. It was leaking about 4 MB per second; it's about 1 MB per second now.
Level Zero versions:

ii  intel-level-zero-gpu                            1.3.28202.51-821~22.04                  amd64        Intel(R) Graphics Compute Runtime for oneAPI Level Zero.
ii  level-zero                                      1.16.14-821~22.04                       amd64        Intel(R) Graphics Compute Runtime for oneAPI Level Zero.
ii  level-zero-dev                                  1.16.14-821~22.04                       amd64        Intel(R) Graphics Compute Runtime for oneAPI Level Zero.

@xiguiw
Contributor

xiguiw commented Apr 19, 2024

The driver version is OK.

Of course we check models regularly.
So far we have not found a memory leak for this case; we will add more models to check for this issue.

Meanwhile, if you can get a minimized version for us, that would be ideal. Thanks!

@huiyan2021
Contributor

Hi @chao-camect, could you try the environment variable below on your side and let us know if the memory leak still exists? Thanks!

export ITEX_LAYOUT_OPT=0
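
(Presumably the same thing can be done from inside the training script, assuming the variable only needs to be set before TensorFlow and the ITEX plugin are imported; this is an illustrative sketch, not an ITEX-documented pattern.)

import os
os.environ['ITEX_LAYOUT_OPT'] = '0'   # must be set before TensorFlow/ITEX loads

import tensorflow as tf               # import TF only after the variable is set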

@chao-camect
Author

Looks like it did the trick.
I'll let it run for longer and report back.

@chao-camect
Author

chao-camect commented Apr 28, 2024

I believe the memory leak is gone with ITEX_LAYOUT_OPT=0.
I assume setting ITEX_LAYOUT_OPT=0 has performance impact?

@huiyan2021
Contributor

> I believe the memory leak is gone with ITEX_LAYOUT_OPT=0.

Thanks for the confirmation!

> I assume setting ITEX_LAYOUT_OPT=0 has performance impact?

It depends on the model; you can measure the performance impact on your own model (a simple way to do that is sketched at the end of this comment).

Also, our fix is WIP... will let you know as soon as it works...
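
(A purely illustrative way to measure that impact: time each epoch with a small Keras callback and compare runs with and without ITEX_LAYOUT_OPT=0. EpochTimer below is just an example name, not part of ITEX.)

import time
from tensorflow.keras import callbacks

class EpochTimer(callbacks.Callback):
    """Prints wall-clock seconds per epoch, for an A/B comparison of the flag."""

    def on_epoch_begin(self, epoch, logs=None):
        self._start = time.time()

    def on_epoch_end(self, epoch, logs=None):
        print(f'epoch {epoch}: {time.time() - self._start:.1f} s')

# Run model.fit(..., callbacks=[EpochTimer()]) once with ITEX_LAYOUT_OPT=0 and once
# without, and compare the per-epoch times.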

@huiyan2021
Contributor

@chao-camect, please try our latest weekly build to see if it works for your case, thanks!

pip install --upgrade intel-extension-for-tensorflow-weekly[xpu] -f https://developer.intel.com/itex-whl-weekly
