-
Even if you can't use a newer version in production, could you check locally whether the regression is also present in newer versions? Specifically in the
-
Hi all,
Training is much slower after upgrading an old container from MXNet 1.1 to 1.4.1. Because of how my org handles our builds, I can't update MXNet past 1.4, so I'd just like to ask whether there are better approaches to debugging this issue.
I have two containers:
- a py2 container with MXNet 1.1
- a py3 container with MXNet 1.4.1, built with MKL-DNN enabled
I have observed significant performance regressions in the py3 / MXNet 1.4.1 container.
I am using the code in this repo as a 'minimal reproducible example': https://github.com/opringle/multivariate_time_series_forecasting
I used the profiler in each version to capture the second training batch of the second epoch in both containers, roughly along the lines of the sketch below.
I capture that point in the run because the docs recommend not profiling the first batch.
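A minimal sketch of that profiling setup with the 1.4.x profiler API (the 1.1 container uses the older `profiler_set_config` / `profiler_set_state` names); the stand-in network, data, and output filename below are illustrative placeholders for the model in the linked repo:

```python
import mxnet as mx

# Profiler configuration (MXNet 1.2+ API; the 1.1 container uses the older
# profiler_set_config / profiler_set_state function names).
mx.profiler.set_config(profile_all=True, aggregate_stats=True,
                       filename='profile_epoch1_batch1.json')

# Tiny stand-in network and data so the sketch runs on its own; the real
# measurements use the model from the linked repo.
net = mx.gluon.nn.HybridSequential()
with net.name_scope():
    net.add(mx.gluon.nn.Conv2D(16, kernel_size=3))
    net.add(mx.gluon.nn.Dense(1))
net.initialize()
trainer = mx.gluon.Trainer(net.collect_params(), 'adam')
loss_fn = mx.gluon.loss.L2Loss()
batches = [(mx.nd.random.uniform(shape=(32, 1, 24, 24)),
            mx.nd.random.uniform(shape=(32, 1))) for _ in range(3)]

for epoch in range(2):
    for batch_idx, (x, y) in enumerate(batches):
        # Only profile the second batch of the second epoch.
        profiled = (epoch == 1 and batch_idx == 1)
        if profiled:
            mx.nd.waitall()                 # drain any queued async work first
            mx.profiler.set_state('run')
        with mx.autograd.record():
            loss = loss_fn(net(x), y)
        loss.backward()
        trainer.step(x.shape[0])
        if profiled:
            mx.nd.waitall()                 # make sure the batch really finished
            mx.profiler.set_state('stop')
            print(mx.profiler.dumps())      # aggregated per-op timings
```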
This is the profiler output for the py2 / MXNet 1.1 container, sorted by total op time:
This is the output for the py3 / MXNet 1.4.1 container:
Some ops, such as backward_Convolution, are significantly slower in 1.4.1. My machine's CPU is a 6-core Intel i7.
Does anyone know how to determine the root cause for this? Is this issue related to MKL-DNN somehow?
Also, when I run the same example with the same containers on a machine with an Intel Xeon CPU (c5 instance on AWS), the opposite occurs: the py3-1.4.1 container is much faster per batch (1s difference) than the py2-1.1 container.
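In case it is MKL-DNN related, one way to see which primitives MKL-DNN actually executes, and to rule out OpenMP thread oversubscription, is to set the standard MKL-DNN / OpenMP environment variables before MXNet loads. This is only a sketch, assuming the 1.4.1 build reads these variables (KMP_AFFINITY applies only if the build uses Intel OpenMP):

```python
import os

# These have to be in the environment before the MXNet / MKL-DNN libraries
# initialize, so set them at the very top of the script (or export them in
# the shell before launching Python).
os.environ['MKLDNN_VERBOSE'] = '1'   # MKL-DNN logs each primitive it runs, with timings
os.environ['OMP_NUM_THREADS'] = '6'  # match the number of physical cores (6-core i7 here)
os.environ['KMP_AFFINITY'] = 'granularity=fine,compact,1,0'  # Intel OpenMP thread pinning

import mxnet as mx

# A single MKL-DNN-eligible op; with MKLDNN_VERBOSE=1 the convolution should
# show up in the log if the MKL-DNN backend is actually handling it.
x = mx.nd.random.uniform(shape=(32, 3, 64, 64))
w = mx.nd.random.uniform(shape=(16, 3, 3, 3))
y = mx.nd.Convolution(data=x, weight=w, no_bias=True, kernel=(3, 3), num_filter=16)
y.wait_to_read()
```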