Robust Training of Deep Neural Networks with Extremely Noisy Labels
Paper Type: Free Essay | Subject: Computer Science
Wordcount: 991 words | Published: 3rd Nov 2020
This paper introduces and investigates Co-teaching, a learning strategy that increases the robustness of deep neural networks to training datasets with noisy labels. The motivation is that noisy labels are common in real-world datasets, so neural networks should be robust to such noise. The authors point out that deep neural networks are known to fit noisy labels as the number of training epochs grows, owing to the so-called memorization effect: the networks memorize easy instances first and then gradually adapt to noisy instances. The proposed approach exploits this phenomenon by training two neural networks simultaneously, similar to the Co-training approach introduced by Blum and Mitchell [1]. In each mini-batch, each network selects its small-loss instances, and these instances are used to update the parameters of its peer network. Unlike the existing MentorNet and Decoupling approaches, in which the error of a network is fed directly back into the same network in the next mini-batch, Co-teaching leverages the fact that two networks have different learning abilities, so each can filter out errors introduced by noisy labels that the other would retain. Both networks were optimized using stochastic gradient descent with momentum, which is known to generalize well. The authors argue that once deep neural networks have memorized clean data, they become robust and will attenuate errors from the noisy data that follows. To validate the approach, experiments were conducted on noisy versions of the popular MNIST, CIFAR-10 and CIFAR-100 datasets. The results showed that Co-teaching generally performed better than existing state-of-the-art baselines under varying degrees of label noise.
Details of the Approach
The Co-teaching approach trains two deep neural networks, f with parameters Wf and g with parameters Wg. For each mini-batch, network f selects the fraction of instances with the smallest training loss; the size of this fraction is controlled by the parameter R(T). The selected instances are then fed to network g as useful knowledge for updating its parameters, and the same process is carried out with the roles of f and g swapped. The error flow therefore takes a crisscross path between the two networks.
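The crisscross exchange can be sketched in a few lines of plain Python. This is only an illustration of the selection step, not the authors' implementation: the function names, the example loss values and the fixed keep rate of 0.5 are all hypothetical, and the actual parameter updates of f and g are left out.

```python
def small_loss_indices(losses, keep_rate):
    """Return the indices of the keep_rate fraction of instances with the smallest loss."""
    num_keep = int(keep_rate * len(losses))
    order = sorted(range(len(losses)), key=lambda i: losses[i])
    return sorted(order[:num_keep])


def co_teaching_exchange(losses_f, losses_g, keep_rate):
    """Choose the instances each network passes to its peer for one mini-batch.

    Returns (batch_for_f, batch_for_g): f is updated on the instances that g
    found easy, and g is updated on the instances that f found easy.
    """
    selected_by_f = small_loss_indices(losses_f, keep_rate)
    selected_by_g = small_loss_indices(losses_g, keep_rate)
    return selected_by_g, selected_by_f


# Hypothetical per-instance losses for a mini-batch of six examples.
losses_f = [0.1, 2.3, 0.4, 0.2, 1.9, 0.3]
losses_g = [0.2, 0.3, 0.5, 0.1, 2.5, 2.0]

batch_for_f, batch_for_g = co_teaching_exchange(losses_f, losses_g, keep_rate=0.5)
print(batch_for_f)  # instances g considers clean, used to update f
print(batch_for_g)  # instances f considers clean, used to update g
```

In this toy batch the two networks disagree about instances 1 and 5, so each network is updated on a subset that its own losses alone would not have produced.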
The authors acknowledged that, for this approach to work, it was important to establish that the selected small-loss instances were indeed clean. Because the memorization effect lets neural networks separate clean from noisy instances by their loss values in the early stages of training, the authors keep more instances of each mini-batch at the beginning of training and then gradually drop the noisy instances by decreasing R(T), the fraction of instances kept. Selecting instances with a network's own losses is similar to boosting and active learning, which have been shown to be sensitive to outliers. Co-teaching combats this problem by exploiting the fact that two classifiers can produce different decision boundaries and therefore have different abilities to filter noise when exposed to noisy data. This was the motivation behind exchanging the selected small-loss instances between the networks to update their respective parameters. Although the authors drew motivation from Co-training, they argue that the proposed approach needs only a single set of features, whereas Co-training needs two, and that it exploits the memorization effect of deep neural networks, which Co-training does not.
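As a rough sketch of how such a schedule might look, the function below reads R(T) as the fraction of each mini-batch that is kept at epoch T. The linear form R(T) = 1 - min(T / T_k * τ, τ) is an assumption taken from how the schedule is usually stated for Co-teaching, not something spelled out in this essay; the names `keep_rate`, `noise_rate` (τ) and `warmup_epochs` (T_k) are hypothetical.

```python
def keep_rate(epoch, noise_rate, warmup_epochs):
    """Fraction R(T) of each mini-batch kept at the given epoch.

    Assumed form: R(T) = 1 - min(T / T_k * tau, tau), so every instance is
    kept at epoch 0 and a fraction tau is dropped from epoch T_k onwards.
    """
    if warmup_epochs <= 0:
        raise ValueError("warmup_epochs must be positive")
    return 1.0 - min(epoch / warmup_epochs * noise_rate, noise_rate)


# Example: estimated noise rate of 0.5, reached after 10 epochs.
schedule = [keep_rate(t, noise_rate=0.5, warmup_epochs=10) for t in range(0, 21, 5)]
print(schedule)  # [1.0, 0.75, 0.5, 0.5, 0.5]
```

Under this assumption the networks see the full mini-batch while they are still fitting the easy, mostly clean instances, and the selection tightens only as memorization of noisy labels becomes a risk.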
As stated earlier, the authors used three popular benchmark datasets to verify the effectiveness of their proposed model: MNIST, CIFAR-10 and CIFAR-100. The authors had to corrupt the datasets manually using a transition matrix Q, which flips clean labels to noisy labels. They defined two structures for Q: pair flipping, where labels are flipped within very similar classes, and symmetry flipping, where labels are flipped with a constant probability. A noise rate of 0.45 was chosen for pair flipping and 0.5 for symmetry flipping. The model was also evaluated on data with a noise rate of 0.2 in order to measure its performance on low-level noise. The performance on these datasets was compared to MentorNet, Decoupling, S-Model, Bootstrap, F-correction and a standard deep neural network trained on the noisy data. All of these methods were implemented with a convolutional neural network (CNN) using Leaky-ReLU as the activation function. A 9-layer CNN structure with the Adam optimizer and a learning rate of 0.001 was used. Test accuracy and label precision were used as the performance metrics. The results on MNIST revealed that Co-teaching achieved better results than all the other state-of-the-art methods with both 45% pair flipping and 50% symmetry flipping. It also performed better than all the other models except F-correction on data with a 20% noise rate. Co-teaching again outperformed its competitors on both the CIFAR-10 and CIFAR-100 datasets in the various noise conditions defined, except in the 20% noise rate case, where F-correction was better.
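The two noise structures can be illustrated with a small sketch. It assumes that pair flipping moves a label to the next class index with probability equal to the noise rate, and that symmetry flipping spreads the noise rate evenly over all other classes; the essay does not give the exact matrices, so these forms, and the helper names below, are assumptions made for illustration.

```python
import random


def pair_flip_matrix(num_classes, noise_rate):
    """Q for pair flipping: class i becomes class (i + 1) with probability noise_rate."""
    if num_classes < 2:
        raise ValueError("need at least two classes")
    q = [[0.0] * num_classes for _ in range(num_classes)]
    for i in range(num_classes):
        q[i][i] = 1.0 - noise_rate
        q[i][(i + 1) % num_classes] = noise_rate
    return q


def symmetric_flip_matrix(num_classes, noise_rate):
    """Q for symmetry flipping: noise_rate is shared equally by all other classes."""
    if num_classes < 2:
        raise ValueError("need at least two classes")
    off_diagonal = noise_rate / (num_classes - 1)
    return [[1.0 - noise_rate if i == j else off_diagonal
             for j in range(num_classes)]
            for i in range(num_classes)]


def corrupt_labels(labels, q, seed=0):
    """Sample a noisy label for each clean label y from row y of Q."""
    rng = random.Random(seed)
    classes = list(range(len(q)))
    return [rng.choices(classes, weights=q[y])[0] for y in labels]


# Toy example: 1000 clean labels over 4 classes, corrupted by 45% pair flipping.
clean = [0, 1, 2, 3] * 250
noisy = corrupt_labels(clean, pair_flip_matrix(4, 0.45))
flipped_rate = sum(c != n for c, n in zip(clean, noisy)) / len(clean)
print(flipped_rate)  # observed fraction of flipped labels, close to 0.45
```

Each row of Q is a probability distribution over noisy labels given the clean label, which is why the rows of both matrices sum to one.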
The idea presented is elegant and relevant given that real data may have noisy labels. The experiments produced consistent results that support the reliability of the proposed approach. In the implementation of the Co-teaching approach, it is assumed that the quality of the labels is unknown. The confidence in each label is therefore estimated from its small loss, and the noise rate is estimated by τ in the experiments, which determines how many instances are dropped through R(T).
[1] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In COLT, 1998.