CONTINUER: Maintaining Distributed DNN Services During Edge Failures. (arXiv:2206.05267v1 [cs.DC])

Partitioning and deploying Deep Neural Networks (DNNs) across edge nodes may
be used to meet performance objectives of applications. However, the failure of
a single node may result in cascading failures that will adversely impact the
delivery of the service and will result in failure to meet specific objectives.
The impact of these failures needs to be minimised at runtime. Three techniques
are explored in this paper, namely repartitioning, early-exit and
skip-connection. When an edge node fails, the repartitioning technique
repartitions and redeploys the DNN, thus avoiding the failed node. The
early-exit technique allows a request to exit early, before it reaches the
failed node. The skip-connection technique dynamically routes the request
around the failed node. This paper leverages the trade-offs in accuracy,
end-to-end latency and downtime between these techniques to select the best
one for user-defined objectives (accuracy, latency and downtime thresholds)
when an edge node fails.
To this end, CONTINUER is developed. Two key activities of the framework are
estimating the accuracy and latency when using the techniques for distributed
DNNs and selecting the best technique. It is demonstrated on a lab-based
experimental testbed that CONTINUER estimates accuracy and latency when using
the techniques with an average error of no more than 0.28% and 13.06%,
respectively, and selects a suitable technique with an overhead of no more
than 16.82 milliseconds and an accuracy of up to 99.86%.
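
The selection step described above can be illustrated with a minimal sketch.
It is not CONTINUER's implementation: the names Estimate, Objectives and
select_technique are hypothetical, the numbers are made up for illustration,
and the tie-breaking rule (highest estimated accuracy, then lowest latency
among the techniques that satisfy every threshold) is an assumption that the
abstract does not specify.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Estimate:
    """Hypothetical per-technique estimates produced after a node failure."""
    accuracy: float     # estimated accuracy, in percent
    latency_ms: float   # estimated end-to-end latency, in milliseconds
    downtime_ms: float  # estimated downtime while switching, in milliseconds


@dataclass(frozen=True)
class Objectives:
    """User-defined thresholds (names are assumptions for this sketch)."""
    min_accuracy: float
    max_latency_ms: float
    max_downtime_ms: float


def select_technique(estimates: Dict[str, Estimate],
                     objectives: Objectives) -> Optional[str]:
    """Return the name of a technique meeting all thresholds, or None."""
    feasible = {
        name: e for name, e in estimates.items()
        if e.accuracy >= objectives.min_accuracy
        and e.latency_ms <= objectives.max_latency_ms
        and e.downtime_ms <= objectives.max_downtime_ms
    }
    if not feasible:
        return None
    # Assumed preference: highest accuracy first, then lowest latency.
    return max(feasible,
               key=lambda n: (feasible[n].accuracy, -feasible[n].latency_ms))


# Illustrative, made-up estimates; these are not results from the paper.
estimates = {
    "repartitioning": Estimate(accuracy=92.0, latency_ms=40.0, downtime_ms=5000.0),
    "early-exit": Estimate(accuracy=85.0, latency_ms=20.0, downtime_ms=10.0),
    "skip-connection": Estimate(accuracy=89.0, latency_ms=30.0, downtime_ms=15.0),
}

strict_downtime = Objectives(min_accuracy=80.0, max_latency_ms=50.0,
                             max_downtime_ms=100.0)
print(select_technique(estimates, strict_downtime))
```

With these made-up values, a tight downtime threshold rules out
repartitioning even though its estimated accuracy is the highest, which is
the kind of trade-off the framework is described as weighing.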


