# A Survey on Parallel and Distributed Deep Learning

Paper Type: Free Essay | Subject: Computer Science | Wordcount: 6024 words | Published: 29th Oct 2021

## Abstract

Deep learning is considered one of the most remarkable machine learning techniques of recent years. It has achieved great success in many applications, such as image analysis, speech recognition, and text understanding. It uses supervised and unsupervised methods to learn multi-level representations and features in hierarchical architectures for classification and pattern recognition tasks. Recent developments in parallel and distributed technologies have enabled the processing of big data using deep learning. Although big data offers great opportunities for a wide range of areas including e-commerce, industrial control, and smart medicine, it poses many challenging issues in data mining and information processing due to its large volume, wide variety, and high velocity. Deep learning has played a major role in big data applications over the past few years.

In this paper, we review emerging research on deep learning models that use parallel and distributed tools, resources, algorithms, and techniques. Furthermore, we point out the remaining challenges of parallel and distributed deep learning and discuss future topics.

Index Terms - Deep learning, Parallel, Distributed, High-performance computing, GPU

## I. INTRODUCTION

Machine learning, and deep learning in particular, is rapidly taking over a range of aspects of our daily lives. At the core of deep learning lies the Deep Neural Network (DNN), a construct inspired by the interconnected nature of the human brain. When properly trained, the expressiveness of DNNs provides accurate solutions to problems previously thought unsolvable, simply by analyzing large amounts of data. Deep learning has been successfully applied in a multitude of fields, ranging from image classification, speech recognition, and medical diagnosis to autonomous driving and defeating human players in complex games. As datasets grow in size and DNNs grow in complexity, the computational intensity and memory requirements of deep learning increase proportionately. Training a DNN to today's competitive accuracy requires a high-performance computing cluster. To exploit such systems, different aspects of DNN training and inference (evaluation) have been modified to increase concurrency.

Parallel processors such as GPUs have played an important role in the practical implementation of deep neural networks. The computations that arise when training and using deep neural networks lend themselves naturally to efficient parallel implementations. The efficiency of these implementations enables researchers to explore networks of significantly higher capacity and to train them on larger datasets. This has greatly improved the accuracy of tasks such as speech recognition and object identification, among others. A recognized limitation of the GPU approach is that the training speedup is small when the model does not fit in GPU memory. To use a GPU effectively, researchers often reduce the size of the data or parameters so that CPU-to-GPU transfers are not a major bottleneck. Although data and parameter reduction work well for small problems (e.g. acoustic modeling for speech recognition), they are less attractive for problems with a large number of examples and dimensions (e.g. high-resolution images).

Nonetheless, there are two major challenges in designing a distributed deep learning system. First, deep learning models have a large number of parameters that must be synchronized between nodes whenever they are updated, which incurs substantial communication overhead. Consequently, scalability in terms of the training time needed to reach a given accuracy is a concern. Second, it is non-trivial for programmers to build and train models with deep and complex structures. Distributed training adds to this burden, e.g. through data and model partitioning and network communication. The problem is compounded by the fact that data scientists with little deep learning experience are increasingly expected to work with deep learning models.

In this study, we discuss a variety of topics in the context of parallel and distributed deep learning, ranging from vectorization to the efficient use of supercomputers. We present parallelism strategies for DNN evaluation and training, covering both research work and existing tools, as well as extensions to training algorithms and systems that support distributed environments.

## II. RELATED WORKS

Several surveys in the field focus on deep learning applications [1], neural networks and their history [2], [3], [4], [5], the scaling of deep learning [6], and hardware architectures for DNNs [7], [8], [9]. Specifically, three surveys [2], [4], [5] describe DNNs and the origins of deep learning methodologies from a historical perspective, and discuss the potential capabilities of DNNs with respect to training and representational power. Two of them [4], [5] also describe optimization and regularization methods in detail. Bengio [6] discusses the scaling of deep learning from different perspectives, focusing on models, optimization algorithms, and datasets. That paper also looks at some aspects of distributed computing, including sparse and asynchronous communication. Surveys of hardware architectures focus primarily on the computational side of training rather than on optimization. These include a recent survey [9] that analyzes computation techniques for DNN operators (layer types) and the mapping of computations to hardware, exploiting inherent parallelism. The survey also discusses reducing data representation (e.g. through quantization) to lower the overall memory bandwidth required by the hardware. Other surveys focus on accelerators for traditional neural networks [7] and the use of FPGAs in deep learning [8].

## III. TERMINOLOGY

This section sets out the theoretical and naming conventions for the material presented in the survey. We first describe deep neural networks, followed by the relevant foundations of parallel computer architecture and parallel programming.

### A. Deep Neural Network

A deep neural network (DNN) is an artificial neural network (ANN) with multiple layers between the input and output layers. The DNN finds the correct mathematical manipulation to turn the input into the output, whether the relationship is linear or non-linear. The network moves through the layers, calculating the probability of each output. For example, a DNN trained to recognize dog breeds will go over a given image and calculate the probability that the dog in the image is a certain breed. The user can review the results and select which probabilities the network should display (those above a certain threshold, etc.) and return the proposed label. Each such mathematical manipulation is considered a layer, and complex DNNs have many layers, hence the name "deep" networks.
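The layer-by-layer computation described above can be illustrated with a minimal sketch. The network below is a hypothetical toy (two inputs, one hidden layer, two classes); the weights, the labels, and the helper names `dense`, `relu`, `softmax`, and `predict` are assumptions made for illustration and do not come from any particular framework.

```python
import math

def dense(inputs, weights, biases):
    """One layer: weighted sum of the inputs plus a bias, per output neuron."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def relu(values):
    """Non-linear activation applied between layers."""
    return [max(0.0, v) for v in values]

def softmax(values):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def predict(features, layers, labels, threshold=0.5):
    """Pass the input through every layer and return (probabilities, label).

    The label is None when no class reaches the threshold.
    """
    activations = features
    for weights, biases in layers[:-1]:
        activations = relu(dense(activations, weights, biases))
    weights, biases = layers[-1]
    probabilities = softmax(dense(activations, weights, biases))
    best = max(range(len(labels)), key=lambda i: probabilities[i])
    label = labels[best] if probabilities[best] >= threshold else None
    return probabilities, label

# Hypothetical toy network: 2 inputs -> 2 hidden neurons -> 2 classes.
LAYERS = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),
    ([[2.0, 0.0], [0.0, 2.0]], [0.0, 0.0]),
]
LABELS = ["beagle", "poodle"]

probs, label = predict([4.0, 1.0], LAYERS, LABELS)
print(probs, label)
```

Raising `threshold` makes the sketch return no label when the network is not confident enough, which mirrors the thresholding step mentioned in the text.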

### B. Parallel Computer Architecture

*1) Multi-core Computing:* A multi-core processor is a processor that includes multiple processing units (called "cores") on the same chip. A multi-core processor can issue multiple instructions per clock cycle from multiple instruction streams. Each core in a multi-core processor can potentially be superscalar as well, i.e. on every clock cycle each core can issue multiple instructions from a single thread.

*2) Symmetric Multiprocessing:* A symmetric multiprocessor (SMP) is a computer system with multiple processors that share memory and are connected through a bus. Bus contention prevents bus architectures from scaling. As a result, SMPs generally contain no more than 32 processors. Because of the small size of the processors and the significant reduction in bus bandwidth requirements achieved by large caches, such symmetric multiprocessors are extremely cost-effective, provided that sufficient memory bandwidth is available.

*3) Distributed Computing:* A distributed computer (also known as a distributed memory multiprocessor) is a distributed memory computer system in which the processing elements are connected by a network. Distributed computers are highly scalable. The terms "concurrent computing," "parallel computing," and "distributed computing" overlap considerably, and there is no clear distinction between them. The same system can be characterized as both "parallel" and "distributed"; the processors in a typical distributed system run concurrently in parallel.

Fig. 1: Single machine and distributed system structure

### C. Parallel Programming

The programming techniques used to implement parallel learning algorithms on parallel computers depend on the target architecture. On single machines, they range from simple threaded implementations to OpenMP. Accelerators are usually programmed with special languages such as NVIDIA's CUDA or OpenCL, or with hardware design languages in the case of FPGAs. However, the details are often hidden behind library calls (e.g. cuDNN or MKL-DNN) that implement the time-consuming primitives. On multiple machines with distributed memory, it is possible to use simple communication mechanisms such as TCP/IP or Remote Direct Memory Access (RDMA). It is also possible to use more convenient libraries on distributed memory machines, such as the Message Passing Interface (MPI) or Apache Spark. MPI is a low-level library that aims to deliver portable performance, while Spark is a higher-level framework that focuses more on programmer productivity.

## IV. OPERATOR CONCURRENCY

There are several ways to parallelize the execution of a neural network layer. In most cases, computations (for example, in the case of pooling operators) can be parallelized directly. For other types of operators, however, computations need to be reshaped in order to expose parallelism. Below, we describe the concurrency analysis of three popular operators.

### A. Performance Modeling

Even with work and depth models, it is difficult to estimate the runtime of a single DNN operator, let alone a whole network. Several works nevertheless manage to approximate the runtime of a given DNN through performance modeling. Using measured runtimes as a lookup table, it was possible to predict the time to compute and backpropagate through mini-batches of various sizes with a 5–19 percent error, even on clusters of GPUs with asynchronous communication [10]. The same was done for CPUs in a distributed system [11], using a similar approach. Paleo [12] derives a performance model from operation counts alone (with a prediction error of 10–30 percent), and Pervasive CNN [13] uses performance modeling to pick networks with reduced precision to meet users' real-time requirements.
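To make the idea of modeling runtime from operation counts concrete, the following is a rough sketch and not the model used by any of the cited systems. It counts multiply-accumulate operations for a convolutional and a fully connected layer and divides by an assumed device throughput; the function names, the `efficiency` factor, and the example layer and device figures are all hypothetical.

```python
def conv_flops(batch, c_in, c_out, out_h, out_w, k_h, k_w):
    """FLOPs of a direct convolution; each multiply-accumulate counts as 2."""
    return 2 * batch * c_out * out_h * out_w * c_in * k_h * k_w

def fc_flops(batch, n_in, n_out):
    """FLOPs of a fully connected layer (one matrix multiplication)."""
    return 2 * batch * n_in * n_out

def estimate_seconds(flops, peak_flops_per_s, efficiency=0.5):
    """Estimated runtime = operations / (peak throughput * achieved fraction).

    `efficiency` is an assumed fudge factor; real devices rarely reach peak.
    """
    return flops / (peak_flops_per_s * efficiency)

# Hypothetical layer and device: 7x7 convolution, 3 -> 64 channels,
# 112x112 outputs, batch of 32, on a device with 4 TFLOP/s peak throughput.
layer_flops = conv_flops(32, 3, 64, 112, 112, 7, 7)
print(layer_flops, estimate_seconds(layer_flops, 4e12))
```

Such an estimate ignores memory traffic and communication, which is one reason the prediction errors reported in the literature are non-negligible.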

### B. Fully Connected Layers

A fully connected layer can be expressed and modeled as a matrix multiplication of the weights by the neuron values (one column per batch sample). To that end, efficient linear algebra libraries such as cuBLAS [14] and MKL [15] can be used. Vanhoucke et al. [16] present a variety of methods for further optimizing CPU implementations of fully connected layers. In particular, the paper shows efficient loop construction, vectorization, blocking, unrolling, and batching. The paper also demonstrates how weights can be quantized to use fixed-point math instead of floating-point.
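As a small illustration of this formulation, the sketch below evaluates a fully connected layer for a whole batch with a single matrix multiplication, storing one column per batch sample. It uses plain Python lists for clarity; a real implementation would call an optimized BLAS routine instead. The helper names and the example values are assumptions.

```python
def matmul(a, b):
    """Multiply an (m x k) matrix by a (k x n) matrix given as nested lists."""
    inner = len(b)
    assert all(len(row) == inner for row in a), "inner dimensions must match"
    cols = len(b[0])
    return [[sum(row[t] * b[t][j] for t in range(inner)) for j in range(cols)]
            for row in a]

def fully_connected(weights, batch_columns, biases):
    """weights: (n_out x n_in); batch_columns: (n_in x batch), one column per sample."""
    product = matmul(weights, batch_columns)
    # Add each output neuron's bias to its whole row (i.e. to every sample).
    return [[value + bias for value in row] for row, bias in zip(product, biases)]

# Hypothetical layer with 3 inputs and 2 outputs, applied to a batch of 2 samples.
W = [[1.0, 2.0, 3.0],
     [0.0, -1.0, 1.0]]
X = [[1.0, 0.0],
     [2.0, 1.0],
     [3.0, 1.0]]
B = [0.5, -0.5]

Y = fully_connected(W, X, B)
print(Y)  # one column of outputs per batch sample
```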

### C. Convolution

Convolutions constitute the bulk of the computations involved in DNN training and inference. As such, the research community and industry have invested significant effort in optimizing their computation on all platforms. While a convolution operator can be computed directly, doing so does not make full use of the capabilities of vector processors and multi-core architectures, which are geared toward many parallel multiply-accumulate operations. It is possible, however, to increase utilization by ordering operations to maximize data reuse [17], by introducing data redundancy, or through a basis transformation. The first occurrence of unrolling convolutions in CNNs [18] used both CPUs and GPUs for training. The approach was subsequently popularized by [19] and consists of reshaping the images in the mini-batch from 3D tensors to 2D matrices. Each 1D row in the matrix contains an unrolled 2D patch that would normally be convolved (possibly with overlap), which generates redundant data. The convolution kernels are then stored as a 2D matrix, where each column represents an unrolled kernel (one convolution filter). Multiplying these two matrices results in a matrix that contains the convolved tensor in 2D format, which can be reshaped to 3D for subsequent operations. Note that this operation can be generalized to 4D tensors (a whole batch), converting it into a single matrix multiplication. DNN primitive libraries, such as cuDNN [20], provide a variety of convolution methods and data layouts. To assist users in selecting an algorithm, these libraries include functions that choose the best-performing algorithm given tensor sizes and memory constraints. Internally, the libraries may run all methods and pick the fastest one.
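The unrolling scheme described above can be sketched as follows. This is a minimal illustration under simplifying assumptions (a single image, stride 1, no padding, and no kernel flipping, as is common in deep learning libraries); the function names are hypothetical, and optimized libraries implement the same idea far more efficiently.

```python
def im2col(image, k_h, k_w):
    """Unroll every k_h x k_w patch (over all channels) of a C x H x W image into one row."""
    channels, height, width = len(image), len(image[0]), len(image[0][0])
    rows = []
    for top in range(height - k_h + 1):
        for left in range(width - k_w + 1):
            rows.append([image[c][top + i][left + j]
                         for c in range(channels)
                         for i in range(k_h)
                         for j in range(k_w)])
    return rows

def unroll_kernels(kernels):
    """Store K kernels (each C x k_h x k_w) as a matrix with one column per kernel."""
    flat = [[v for channel in kernel for row in channel for v in row]
            for kernel in kernels]
    return [list(column) for column in zip(*flat)]

def matmul(a, b):
    """Multiply an (m x k) matrix by a (k x n) matrix given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def conv2d_im2col(image, kernels):
    """Convolve a C x H x W image with K kernels through one matrix multiplication."""
    k_h, k_w = len(kernels[0][0]), len(kernels[0][0][0])
    out_h = len(image[0]) - k_h + 1
    out_w = len(image[0][0]) - k_w + 1
    product = matmul(im2col(image, k_h, k_w), unroll_kernels(kernels))
    # product[p][k] is the response of kernel k at patch p; reshape to K x out_h x out_w.
    return [[[product[y * out_w + x][k] for x in range(out_w)]
             for y in range(out_h)]
            for k in range(len(kernels))]

# Hypothetical example: one-channel 3x3 image and two 2x2 kernels.
IMAGE = [[[1, 2, 3],
          [4, 5, 6],
          [7, 8, 9]]]
KERNELS = [[[[1, 0], [0, 1]]],
           [[[1, 1], [1, 1]]]]

OUTPUT = conv2d_im2col(IMAGE, KERNELS)
print(OUTPUT)
```

Note how overlapping patches repeat input values across rows of the unrolled matrix: this is the data redundancy that the approach trades for a single large, well-optimized matrix multiplication.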

### D. Recurrent Units

The complex gate systems within RNN units contain multiple operations, each involving a small matrix multiplication or an element-wise operation. For this reason, these layers have traditionally been implemented as a series of high-level operations, such as GEMMs. Nevertheless, these layers can be accelerated further. Moreover, since RNN units are usually chained together (forming consecutive recurrent layers), two types of concurrency can be considered: within the same layer and between consecutive layers. Appleyard et al. [21] describe several GPU optimizations. The first optimization fuses all computations (GEMMs and otherwise) into one function (kernel), keeping intermediate results in scratch-pad memory. This both reduces kernel scheduling overhead and saves round trips to global memory by exploiting the multi-level memory hierarchy of the massively parallel GPU. Other optimizations include pre-transposing matrices and allowing the GPU to execute independent recurrent units concurrently on different multi-processors. Concurrency between layers is achieved through pipeline parallelism, with which [21] implements stacked RNN unit computations, starting to propagate through the next layer as soon as its data dependencies are met. Overall, these optimizations result in an 11x increase in performance over the high-level implementation. From the memory consumption perspective, dynamic programming was proposed for RNNs [22] to balance between caching intermediate results and recomputing forward inferences for backpropagation.

## V. NETWORK CONCURRENCY

The high average parallelism of neural networks can be harnessed not only to compute individual operators efficiently, but also to evaluate the entire network concurrently along various dimensions. Below, we analyze three prominent partitioning techniques, together with a hybrid variant.

### A. Data Parallelism

A simple parallelization approach is to partition the samples of a mini-batch among multiple computational resources (cores or devices). This approach (initially called pattern parallelism, because input samples were called patterns) dates back to the first practical implementations of artificial neural networks [23]. Farber and Asanovic [24] used multiple vector accelerator microprocessors (Spert-II) to parallelize error backpropagation for neural network training. To support data parallelism, the paper presents a version of delayed gradient updates called "bunch mode," where the gradient is updated several times before the weights are updated. Raina et al. [25] performed one of the earliest mappings of DNN computations to data-parallel architectures (e.g., GPUs). When training Restricted Boltzmann Machines, the paper shows a speedup of up to 72.6x over the CPU. Today, the vast majority of deep learning frameworks support data parallelism, using a single GPU, multiple GPUs, or a cluster of multi-GPU nodes. Additional methods for data parallelism have been suggested in the literature. In ParallelSGD [26], SGD runs (possibly with mini-batches) k times in parallel, with the dataset split among the processors. After all SGD instances converge, the resulting weights are aggregated and averaged. ParallelSGD was developed using the MapReduce programming model [27]. Using MapReduce, it is easy to schedule parallel tasks on multiple processors and in distributed environments. While the MapReduce model was initially successful in deep learning, its generality hindered DNN-specific optimizations. Current implementations therefore use high-performance communication interfaces (e.g., MPI) to implement fine-grained parallelism features.
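The ParallelSGD scheme lends itself to a compact illustration. The sketch below is a simplified rendering under stated assumptions: it fits a one-parameter linear model on noise-free synthetic data, uses a thread pool merely as a stand-in for k processors (Python threads do not give a real speedup for this workload), and runs a fixed number of epochs instead of testing for convergence. The function names and hyperparameters are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def sgd_on_shard(shard, epochs, lr):
    """Plain SGD on one shard for the model y = w * x with squared error."""
    w = 0.0
    for _ in range(epochs):
        for x, y in shard:
            gradient = 2.0 * (w * x - y) * x
            w -= lr * gradient
    return w

def parallel_sgd(data, k, epochs=50, lr=0.01):
    """Split the data among k workers, run SGD independently, average the weights."""
    shards = [data[i::k] for i in range(k)]
    with ThreadPoolExecutor(max_workers=k) as pool:
        weights = list(pool.map(lambda shard: sgd_on_shard(shard, epochs, lr), shards))
    return sum(weights) / len(weights)

# Synthetic, noise-free data generated from y = 3x (an assumption for illustration).
DATA = [(float(x), 3.0 * x) for x in range(1, 9)]

w = parallel_sgd(DATA, k=4)
print(w)
```

The only communication in this scheme is the final averaging step, which is what made it a natural fit for MapReduce: the map phase runs SGD per shard and the reduce phase averages the results.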

### B. Model Parallelism

The second partitioning technique for DNN training is model parallelism, also known as network parallelism. This technique divides the work according to the neurons in each layer. In this case, the sample mini-batch is copied to all processors and different parts of the DNN are computed on different processors, which can save memory (because the entire network is not stored in one place) but incurs additional communication after each layer. To reduce communication costs in fully connected layers, Muller and Gunzinger [28] proposed introducing redundant computations into neural networks. In particular, the proposed method partitions a neural network in such a way that each processor is responsible for twice the neurons (with overlap), thus requiring more computation but less communication. Another approach proposed to reduce communication in fully connected layers is to use Cannon's matrix multiplication algorithm, modified for DNNs [29]. The paper reports that Cannon's algorithm achieves better efficiency and speedups than simple partitioning on small-scale multi-layer fully connected networks.
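A minimal sketch of this partitioning for a single fully connected layer is given below. Each simulated processor holds only its own slice of the neurons (rows of the weight matrix), receives the full input, and computes its part of the output; concatenating the parts plays the role of the communication step after the layer. The helper names and the example layer are assumptions, and the "processors" here are simply loop iterations.

```python
def dense(weights, biases, inputs):
    """Compute the outputs of the neurons described by the given weight rows."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def split_neurons(weights, biases, parts):
    """Assign contiguous groups of neurons (weight rows) to each processor."""
    n = len(weights)
    bounds = [n * p // parts for p in range(parts + 1)]
    return [(weights[lo:hi], biases[lo:hi]) for lo, hi in zip(bounds, bounds[1:])]

def model_parallel_dense(weights, biases, inputs, parts):
    """Every processor sees the full input but stores only its own neurons."""
    partial_outputs = [dense(w, b, inputs)
                       for w, b in split_neurons(weights, biases, parts)]
    # Gathering the partial outputs stands in for communication after the layer.
    return [value for part in partial_outputs for value in part]

# Hypothetical layer with 3 inputs and 4 neurons.
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0],
     [1.0, 1.0, 1.0],
     [-1.0, 0.5, 0.0]]
B = [0.0, 1.0, -1.0, 0.5]
X = [2.0, 4.0, 1.0]

print(model_parallel_dense(W, B, X, parts=2))
```

Because the next layer needs all outputs of this layer as its input, the gather step must be repeated after every layer, which is the communication cost the cited works try to reduce.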

Fig. 2: Neural Network Parallelism Schemes

### C. Pipelining

In deep learning, pipelining can refer either to overlapping computations, i.e. between one layer and the next (as data becomes ready), or to partitioning the DNN by depth and assigning layers to specific processors. Pipelining can be viewed as a form of data parallelism, as elements (samples) are processed in parallel through the network, but also as model parallelism, since the DNN structure determines the length of the pipeline. The first type of pipelining can be used to overlap forward evaluation, backpropagation, and weight updates. This scheme is commonly used in practice [30], [31], [32], and improves utilization by reducing processor idle time.
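The second form of pipelining (layers assigned to processors) can be visualized with a simple schedule. The sketch below assumes forward evaluation only, one layer group per stage, and equal processing time per stage; these are simplifications, and the function names are hypothetical. At time step t, stage s works on sample t - s whenever that sample exists.

```python
def pipeline_schedule(num_stages, num_items):
    """Return, for every time step, which item each stage processes (None = idle)."""
    steps = []
    for t in range(num_stages + num_items - 1):
        steps.append([t - s if 0 <= t - s < num_items else None
                      for s in range(num_stages)])
    return steps

def utilization(num_stages, num_items):
    """Fraction of stage time slots that do useful work."""
    total_slots = num_stages * (num_stages + num_items - 1)
    return num_stages * num_items / total_slots

# Hypothetical pipeline: 3 stages (layer groups) and 4 samples.
for t, row in enumerate(pipeline_schedule(3, 4)):
    print(t, row)
print(utilization(3, 4))
```

With S stages and M samples the pipeline finishes in S + M - 1 steps instead of S * M, and the idle slots at the start and end shrink in relative terms as M grows.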

### D. Hybrid Parallelism

Combining multiple parallelism schemes can overcome the drawbacks of each individual scheme. Below we look at successful examples of such hybrids. One hybrid approach applies data parallelism to the convolutional layers and model parallelism to the fully connected part. Using this approach, it is possible to achieve a speedup of up to 6.25x on 8 GPUs over a single GPU, with less than 1 percent loss of accuracy (due to an increase in batch size). AMPNet [33] is an asynchronous implementation of DNN training on CPUs, which uses an intermediate representation to implement fine-grained model parallelism. In particular, internal parallel tasks within and between layers are identified and scheduled asynchronously. In addition, asynchronous execution of dynamic control flow enables the tasks of forward evaluation, backpropagation, and weight updating to be pipelined. Finally, the DistBelief [34] distributed deep learning system combines all three parallelism strategies. In this implementation, training is performed on multiple model replicas simultaneously, where each replica is trained on different samples (data parallelism). Within each replica, the DNN is distributed both according to neurons in the same layer (model parallelism) and according to the different layers (pipelining).

## VI. TRAINING CONCURRENCY

So far we have discussed training algorithms in which only one copy of the model exists and its up-to-date value is directly visible to all processors. In distributed environments, there may be multiple instances of training agents operating independently, and therefore the overall algorithm has to be adapted.

### A. Model Consistency

Directly splitting computations between nodes produces a distributed form of data parallelism in which all nodes have to communicate their updates to the others before a new mini-batch is fetched. This imposes a large overhead on the overall system and hampers the scaling of training. The HOGWILD shared-memory algorithm [35] is a well-known instance of model inconsistency: it allows training agents to read parameters and update gradients at will, overwriting existing progress. Stale-Synchronous Parallelism (SSP) [36] proposes a compromise between consistent and inconsistent models in order to provide correctness guarantees in spite of asynchrony.
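The bounded-staleness rule at the heart of SSP can be sketched in a few lines. This is a minimal illustration of the rule only, under the assumption that each worker keeps a clock counting its completed iterations; it does not model a parameter server, gradients, or real concurrency, and the function names are hypothetical. A worker may start its next iteration only if it is at most `staleness` iterations ahead of the slowest worker.

```python
def may_proceed(clocks, worker, staleness):
    """A worker may advance if it is at most `staleness` clocks ahead of the slowest."""
    return clocks[worker] - min(clocks) <= staleness

def simulate(attempts, num_workers, staleness):
    """Replay a sequence of 'worker wants to start its next iteration' events."""
    clocks = [0] * num_workers
    log = []
    for worker in attempts:
        if may_proceed(clocks, worker, staleness):
            clocks[worker] += 1
            log.append((worker, "advance"))
        else:
            log.append((worker, "wait"))
    return clocks, log

# Hypothetical event order: worker 0 is fast, worker 1 is slow.
ATTEMPTS = [0, 0, 0, 1, 0]

print(simulate(ATTEMPTS, num_workers=2, staleness=1))
print(simulate(ATTEMPTS, num_workers=2, staleness=0))
```

Setting `staleness` to 0 forces all workers to stay in lockstep (a fully consistent model), while a very large value lets workers run without ever waiting, which approaches the inconsistent extreme.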

### B. Parameter Distribution

For distributed deep learning, there are two general ways to conserve communication bandwidth: compressing the parameters with efficient data representations, and avoiding sending unnecessary information altogether, which results in the communication of sparse data structures. While the methods in the former category are orthogonal to the network infrastructure, the methods in the latter category differ when applied to centralized (parameter server, PS) and decentralized topologies. A common data representation for gradient (or parameter) compression is quantization, i.e. the mapping of continuous information into buckets that represent sets of values (usually ranges). The distributions of parameter and gradient values have been shown to be narrowly dispersed [37], so these approaches succeed in representing the working range with fewer bits per parameter. Sparsification is another popular method used for parameter distribution. DNNs (especially CNNs) exhibit sparse gradients during parameter updates. The first application of gradient sparsification [38] prunes gradient values using a fixed threshold, below which values are not sent.
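Both ideas can be sketched briefly. The code below shows uniform quantization of values into 2^bits buckets over an assumed working range, and fixed-threshold sparsification that keeps only (index, value) pairs whose magnitude reaches the threshold. It is a simplified illustration, not the scheme of any cited paper; the function names, the range, and the threshold are assumptions.

```python
def quantize(values, lo, hi, bits):
    """Map each value in [lo, hi] to one of 2**bits evenly spaced bucket codes."""
    levels = (1 << bits) - 1
    step = (hi - lo) / levels
    codes = [round((min(max(v, lo), hi) - lo) / step) for v in values]
    return codes, step

def dequantize(codes, lo, step):
    """Recover approximate values from the bucket codes."""
    return [lo + code * step for code in codes]

def sparsify(gradient, threshold):
    """Keep only entries whose magnitude reaches the threshold, as (index, value) pairs."""
    return [(i, g) for i, g in enumerate(gradient) if abs(g) >= threshold]

def densify(pairs, length):
    """Rebuild a dense gradient on the receiver; pruned entries become zero."""
    dense = [0.0] * length
    for i, g in pairs:
        dense[i] = g
    return dense

# Hypothetical gradient and settings.
GRADIENT = [0.01, -0.5, 0.2, -0.03]

codes, step = quantize(GRADIENT, lo=-1.0, hi=1.0, bits=8)
print(dequantize(codes, -1.0, step))

pairs = sparsify(GRADIENT, threshold=0.1)
print(pairs, densify(pairs, len(GRADIENT)))
```

With 8 bits per value instead of 32, quantization cuts the message size by a factor of four at the cost of a rounding error of at most half a bucket width, while sparsification reduces the number of values sent when most gradient entries are small.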

## VII. TOOLS AND SOFTWARE

### A. cuDNN

cuDNN [20] is an efficient library of deep learning primitives. Deep learning workloads are computationally intensive, and optimizing their kernels is difficult and time-consuming. Kernels need to be reoptimized as parallel architectures evolve, which makes it difficult to maintain codebases over time. Libraries such as the Basic Linear Algebra Subroutines (BLAS) [39] have long addressed similar issues in the HPC community. For deep learning, however, no comparable library previously existed. cuDNN is a BLAS-like library of optimized routines for deep learning workloads. The implementation contains routines for GPUs, although, as with the BLAS library, these routines could be implemented for other platforms. The library is easy to integrate into existing frameworks and provides optimized performance and memory usage.

Fig. 3: Comparative Performance between Caffe and cuDNN

For example, integrating cuDNN into Caffe [32], a popular framework for convolutional networks, improves performance on a standard model by 36 percent while also reducing memory consumption. The goal of the library is to provide performance as close as possible to that of matrix multiplication while using no auxiliary memory. GPU memory has high bandwidth but low capacity, and is therefore a scarce resource. Ideally, when training deep networks, the GPU memory should be filled with data, parameters, and neuron responses, not with auxiliary data structures needed by the convolution algorithm. Figure 3 compares the performance of three convolution implementations on an NVIDIA Tesla K40: cuDNN, Caffe, and Cuda-convnet2. Importantly, even with a small mini-batch size of just 16, cuDNN still reaches 86 percent of its maximum performance, which indicates that the implementation performs well across the convolution parameter space.

### B. Torch

Torch7 [31] has been designed with efficiency in mind, leveraging SSE wherever possible and supporting two parallelization approaches: OpenMP and CUDA. These techniques are heavily used by the Tensor library (interfaced with the "torch" package in Lua). From the user's point of view, enabling CUDA and OpenMP can yield great speedups in any Lua script at zero implementation cost (because most packages rely on the Tensor library). Other packages (such as the "nn" package) also leverage OpenMP and CUDA for more specific uses not covered by the Tensor library. Torch is reported to be faster than Theano [40] on most benchmarks. Interestingly, Theano lags behind on small architectures, which could be explained by the high overhead of Python. Another interesting observation is the strong performance of OpenMP compared to the GPU implementation: only the largest architectures benefit from the GPU.

### C. TensorFlow

TensorFlow [30] is an interface for expressing machine learning algorithms, together with an implementation for executing them. A computation expressed using TensorFlow can be executed with little or no change on a wide variety of heterogeneous systems, ranging from mobile devices such as phones and tablets to large-scale distributed systems of hundreds of machines and thousands of computational devices such as GPU cards. The system is flexible and can be used to express a wide range of algorithms, including training and inference algorithms for deep neural network models. Client programs interact with the TensorFlow system by creating a Session. A tensor, the basic building block of the implementation, is a typed, multidimensional array. TensorBoard, a companion visualization tool included in the open-source release, helps users understand the structure of their computation graphs as well as the overall behavior of their machine learning models.

Fig. 4: Baseline throughput for synchronous replication with a null model. Sparse accesses enable TensorFlow to handle larger models, such as embedding matrices

## VIII. CONCLUSION

The world of deep learning is full of concurrency. Nearly every aspect of training is inherently parallel, from computing a convolution to meta-optimizing DNN architectures. Even if an aspect is sequential, its consistency requirements can be relaxed, thanks to the robustness of nonlinear optimization, to increase concurrency while still achieving reasonable accuracy, if not better. In this paper, we provide an overview of many of these aspects, the approaches documented in the literature, and their concurrency analysis. It is hard to predict what the future holds for this highly active research field (many have tried over the years), but it is safe to assume that much remains to be done in parallel and distributed deep learning. As research progresses, DNN architectures are becoming deeper and more interconnected, both between consecutive and between non-consecutive layers. Apart from accuracy, considerable effort is devoted to reducing the memory footprint and the number of operations so that inference can run successfully on mobile devices. This also means that post-training DNN compression is likely to be studied further, and that learning compressible networks will be desirable. Because mobile hardware is limited in memory capacity and must be energy-efficient, it calls for specialized DNN computational hardware.

## REFERENCES

1 Najafabadi, M. M., Villanustre, F., Khoshgoftaar, T. M., Seliya, N., Wald, R., and Muharemagic, E., "Deep learning applications and challenges in big data analytics," *Journal of Big Data*, vol. 2, no. 1, p. 1, 2015.

2 LeCun, Y., Bengio, Y., and Hinton, G., "Deep learning," *Nature*, vol. 521, no. 7553, pp. 436–444, 2015.

3 Li, Y., "Deep reinforcement learning: An overview," *arXiv preprint arXiv:1701.07274*, 2017.

4 Schmidhuber, J., "Deep learning in neural networks: An overview," *Neural networks*, vol. 61, pp. 85–117, 2015.

5 Wang, H., Raj, B., and Xing, E., "On the origin of deep learning," *arXiv preprint arXiv:1702.07800*, 2017.

6 Bengio, Y., "Deep learning of representations: Looking forward," in *International Conference on Statistical Language and Speech Processing*. Springer, 2013, pp. 1–37.

7 Ienne, P. *et al.*, "Architectures for neuro-computers: review and performance evaluation," *Computer Science Department Technical Report*, no. 93/21, 1993.

8 Lacey, G., Taylor, G. W., and Areibi, S., "Deep learning on fpgas: Past, present, and future," *arXiv preprint arXiv:1602.04283*, 2016.

9 Sze, V., Chen, Y.-H., Yang, T.-J., and Emer, J. S., "Efficient processing of deep neural networks: A tutorial and survey," *Proceedings of the IEEE*, vol. 105, no. 12, pp. 2295–2329, 2017.

10 Oyama, Y., Nomura, A., Sato, I., Nishimura, H., Tamatsu, Y., and Matsuoka, S., "Predicting statistics of asynchronous sgd parameters for a large-scale distributed deep learning system on gpu supercomputers," in *2016 IEEE International Conference on Big Data (Big Data)*. IEEE, 2016, pp. 66–75.

11 Yan, F., Ruwase, O., He, Y., and Chilimbi, T., "Performance modeling and scalability optimization of distributed deep learning systems," in *Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining*. ACM, 2015, pp. 1355–1364.

12 Qi, H., Sparks, E. R., and Talwalkar, A., "Paleo: A performance model for deep neural networks," 2016.

13 Song, M., Hu, Y., Chen, H., and Li, T., "Towards pervasive and user satisfactory cnn across gpu microarchitectures," in *2017 IEEE International Symposium on High Performance Computer Architecture (HPCA)*. IEEE, 2017, pp. 1–12.

14 Nvidia, C., "Cublas library," *NVIDIA Corporation, Santa Clara, California*, vol. 15, no. 27, p. 31, 2008.

15 Wang, E., Zhang, Q., Shen, B., Zhang, G., Lu, X., Wu, Q., and Wang, Y., "Intel math kernel library," in *High-Performance Computing on the Intel® Xeon Phi™*. Springer, 2014, pp. 167–188.

16 Vanhoucke, V., Senior, A., and Mao, M. Z., "Improving the speed of neural networks on cpus," 2011.

17 Demmel, J. and Dinh, G., "Communication-optimal convolutional neural nets," *arXiv preprint arXiv:1802.06905*, 2018.

18 Chellapilla, K., Puri, S., and Simard, P., "High performance convolutional neural networks for document processing," 2006.

19 Coates, A., Huval, B., Wang, T., Wu, D., Catanzaro, B., and Ng, A., "Deep learning with cots hpc systems," in *International conference on machine learning*, 2013, pp. 1337–1345.

20 Chetlur, S., Woolley, C., Vandermersch, P., Cohen, J., Tran, J., Catanzaro, B., and Shelhamer, E., "cudnn: Efficient primitives for deep learning," *arXiv preprint arXiv:1410.0759*, 2014.

21 Appleyard, J., Kocisky, T., and Blunsom, P., "Optimizing performance of recurrent neural networks on gpus," *arXiv preprint arXiv:1604.01946*, 2016.

22 Gruslys, A., Munos, R., Danihelka, I., Lanctot, M., and Graves, A., "Memory-efficient backpropagation through time," in *Advances in Neural Information Processing Systems*, 2016, pp. 4125–4133.

23 Zhang, X., Mckenna, M., Mesirov, J. P., and Waltz, D. L., "An efficient implementation of the back-propagation algorithm on the connection machine cm-2," in *Advances in neural information processing systems*, 1990, pp. 801–809.

24 Farber, P. and Asanovic, K., "Parallel neural network training on multispert," in *Proceedings of 3rd International Conference on Algorithms and Architectures for Parallel Processing*. IEEE, 1997, pp. 659–666.

25 Raina, R., Madhavan, A., and Ng, A. Y., "Large-scale deep unsupervised learning using graphics processors," in *Proceedings of the 26th annual international conference on machine learning*. ACM, 2009, pp. 873–880.

26 Zinkevich, M., Weimer, M., Li, L., and Smola, A. J., "Parallelized stochastic gradient descent," in *Advances in neural information processing systems*, 2010, pp. 2595–2603.

27 Dean, J. and Ghemawat, S., "Mapreduce: simplified data processing on large clusters," *Communications of the ACM*, vol. 51, no. 1, pp. 107–113, 2008.

28 Muller, U. and Gunzinger, A., "Neural net simulation on parallel computers," in *Proceedings of 1994 IEEE International Conference on Neural Networks (ICNN'94)*, vol. 6. IEEE, 1994, pp. 3961–3966.

29 Ericson, L. and Mbuvha, R., "On the performance of network parallel training in artificial neural networks," *arXiv preprint arXiv:1701.05130*, 2017.

30 Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M. *et al.*, "Tensorflow: Large-scale machine learning on heterogeneous distributed systems," *arXiv preprint arXiv:1603.04467*, 2016.

31 Collobert, R., Kavukcuoglu, K., and Farabet, C., "Torch7: A matlab-like environment for machine learning," in *BigLearn, NIPS workshop*, 2011.

32 Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., and Darrell, T., "Caffe: Convolutional architecture for fast feature embedding," in *Proceedings of the 22nd ACM international conference on Multimedia*. ACM, 2014, pp. 675–678.

33 Gaunt, A. L., Johnson, M. A., Riechert, M., Tarlow, D., Tomioka, R., Vytiniotis, D., and Webster, S., "Ampnet: asynchronous model-parallel training for dynamic neural networks," *arXiv preprint arXiv:1705.09786*, 2017.

34 Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., Ranzato, M., Senior, A., Tucker, P., Yang, K. *et al.*, "Large scale distributed deep networks," in *Advances in neural information processing systems*, 2012, pp. 1223–1231.

35 Recht, B., Re, C., Wright, S., and Niu, F., "Hogwild: A lock-free approach to parallelizing stochastic gradient descent," in *Advances in neural information processing systems*, 2011, pp. 693–701.

36 Ho, Q., Cipar, J., Cui, H., Lee, S., Kim, J. K., Gibbons, P. B., Gibson, G. A., Ganger, G., and Xing, E. P., "More effective distributed ml via a stale synchronous parallel parameter server," in *Advances in neural information processing systems*, 2013, pp. 1223–1231.

37 Köster, U., Webb, T., Wang, X., Nassar, M., Bansal, A. K., Constable, W., Elibol, O., Gray, S., Hall, S., Hornof, L. *et al.*, "Flexpoint: An adaptive numerical format for efficient training of deep neural networks," in *Advances in neural information processing systems*, 2017, pp. 1742–1752.

38 Strom, N., "Scalable distributed dnn training using commodity gpu cloud computing," in *Sixteenth Annual Conference of the International Speech Communication Association*, 2015.

39 Lawson, C. L., Hanson, R. J., Kincaid, D. R., and Krogh, F. T., "Basic linear algebra subprograms for fortran usage," 1977.

40 Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., and Bengio, Y., "Theano: a cpu and gpu math expression compiler," in *Proceedings of the Python for scientific computing conference (SciPy)*, vol. 4, no. 3. Austin, TX, 2010.
