Identifying training bottlenecks and system resource under-utilization with Amazon SageMaker Debugger

 Identifying training bottlenecks and system resource under-utilization with Amazon SageMaker Debugger

At AWS re:Invent 2020, AWS released the profiling functionality for Amazon SageMaker Debugger. In this post, we expand on the importance of profiling deep neural network (DNN) training, review some of the common performance bottlenecks you might encounter, and demonstrate how to use the profiling feature in Debugger to detect such bottlenecks.

In the context of DNN training, performance profiling refers to the art of analyzing the manner in which your training application is utilizing your training resources. Training resources are expensive, and your goal should always be to maximize their utilization. This is particularly true of your GPUs, which are typically the most expensive system resource in deep learning training tasks. Through performance profiling, we seek to answer questions such as:

  • To what degree are we utilizing our CPU, GPU, network, and memory resources? Can we increase their utilization, and if so, how?
  • What is our current speed of training, as measured, for example, by the training throughput, or the number of training iterations per second? Can we increase the throughout, and if so, how?
  • Are there any performance bottlenecks that are preventing us from increasing the training throughput, and if so, what are they?
  • Are we using the most ideal training instance types? Might a different choice of instance type speed up our training, or be more cost-effective?

Performance analysis is an integral step of performance optimization, in which we seek to increase system utilization and increase throughput. A typical strategy for performance optimization is to iterate the following two steps until you’re satisfied with the system utilization and throughput:

  • Profiling the training performance to identify bottlenecks in the pipeline and under-utilized resources
  • Addressing bottlenecks to increase resource utilization

Effective profiling analysis and optimization can lead to meaningful savings in time and cost. If you’re content with 50% GPU utilization, you’re wasting your (your company’s) money. Not to mention that you could probably be delivering your product much sooner. It’s essential that you have strong tools for profiling performance, and that you incorporate performance analysis and optimization as an integral part of your team’s development cycle.

That’s where the newly announced profiling capability of Debugger comes in.

Debugger is a feature of Amazon SageMaker training that makes it easy to train machine learning (ML) models faster by capturing real-time metrics such as learning gradients and weights. This provides  transparency into the training process, so you can correct anomalies such as losses, overfitting, and overtraining. Debugger provides built-in rules to easily analyze emitted data, including tensors th


Source - Continue Reading:


Related post