Self-Training With Noisy Student Improves ImageNet Classification. Qizhe Xie, Eduard Hovy, Minh-Thang Luong, and Quoc V. Le. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10687-10698 (arXiv:1911.04252).

We present a simple self-training method that achieves 87.4% top-1 accuracy on ImageNet. By showing the models only labeled images, we limit ourselves from making use of unlabeled images, which are available in much larger quantities, to improve the accuracy and robustness of state-of-the-art models. On ImageNet, we first train an EfficientNet model on labeled images and use it as a teacher to generate pseudo labels for 300M unlabeled images; although the images in that dataset have labels, we ignore the labels and treat them as unlabeled data. We then train a larger EfficientNet as a student model on the combination of labeled and pseudo labeled images, and we iterate this process by putting back the student as the teacher. Our experiments showed that self-training with Noisy Student and EfficientNet can achieve an accuracy of 87.4%, which is 1.9% higher than without Noisy Student. Prior work [76] also proposed to first train only on unlabeled images and then finetune the model on labeled images as the final stage.

During the learning of the student, we inject noise such as dropout, stochastic depth, and data augmentation via RandAugment, so that the student generalizes better than the teacher. When generating pseudo labels, the teacher is not noised, so that the pseudo labels are as good as possible. Although noise may appear to be limited and uninteresting, when it is applied to unlabeled data it has the compound benefit of enforcing local smoothness in the decision function on both labeled and unlabeled data. Even in a simplified setting where we use the same architecture for the teacher and the student and do not perform iterative training, Noisy Student can still improve the accuracy by 1.6%. In contrast to the baseline, the predictions of the model with Noisy Student remain quite stable.

In our experiments, we also further scale up EfficientNet-B7 and obtain EfficientNet-L0, L1 and L2. Scaling width and resolution by a factor c leads to roughly c^2 times the training time, while scaling depth by c leads to c times the training time; doubling the depth therefore doubles the training cost, whereas doubling the width or resolution quadruples it. Overall, EfficientNets with Noisy Student provide a much better tradeoff between model size and accuracy than prior works. We use soft pseudo labels for our experiments unless otherwise specified, although whether soft or hard pseudo labels work better might need to be determined on a case-by-case basis; the sketch below illustrates the difference.
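To make that distinction concrete, here is a minimal sketch of soft versus hard pseudo labels in PyTorch. It is an illustration under assumed shapes (a toy batch of 4 images and 10 classes), not the paper's released TensorFlow implementation; the helper names are invented for the example.

```python
import torch
import torch.nn.functional as F

def pseudo_labels(teacher_logits: torch.Tensor, hard: bool = False) -> torch.Tensor:
    """Turn teacher logits into pseudo labels: the full softmax distribution
    (soft labels) or a one-hot vector for the argmax class (hard labels)."""
    probs = F.softmax(teacher_logits, dim=-1)
    if not hard:
        return probs
    num_classes = teacher_logits.shape[-1]
    return F.one_hot(probs.argmax(dim=-1), num_classes).float()

def student_loss(student_logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Cross-entropy between student predictions and (soft or hard) targets."""
    log_probs = F.log_softmax(student_logits, dim=-1)
    return -(targets * log_probs).sum(dim=-1).mean()

# Toy usage with random logits.
teacher_logits = torch.randn(4, 10)
student_logits = torch.randn(4, 10, requires_grad=True)
loss_soft = student_loss(student_logits, pseudo_labels(teacher_logits, hard=False))
loss_hard = student_loss(student_logits, pseudo_labels(teacher_logits, hard=True))
```

With hard labels the student only sees the teacher's top class; with soft labels it also sees the teacher's uncertainty over the remaining classes.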
Our ImageNet result is also a new state of the art and 1% better than the previous best method, which used an order of magnitude more weakly labeled data [44, 71]. Not only does our method improve standard ImageNet accuracy, it also improves classification robustness on much harder test sets by large margins: ImageNet-A [25] top-1 accuracy from 61.0% to 83.7%, ImageNet-C [24] mean corruption error (mCE) from 45.7 to 28.3, and ImageNet-P [24] mean flip rate (mFR) from 27.8 to 12.2 (an earlier version of the paper reported improvements from 16.6% to 74.2% on ImageNet-A, 45.7 to 31.2 mCE on ImageNet-C, and 27.8 to 16.1 mFR on ImageNet-P). For example, without Noisy Student the model predicts bullfrog for the image shown on the left of the second row, which might result from the black lotus leaf on the water. Zoph et al. [2] show that self-training is superior to pre-training with ImageNet supervised learning on a few computer vision tasks. In this work, we showed that it is possible to use unlabeled images to significantly advance both the accuracy and robustness of state-of-the-art ImageNet models.

We present Noisy Student Training, a semi-supervised learning approach that works well even when labeled data is abundant, and we found that self-training is a simple and effective algorithm to leverage unlabeled data at scale. We train our model using the self-training framework [59], which has three main steps: 1) train a teacher model on labeled images, 2) use the teacher to generate pseudo labels on unlabeled images, and 3) train a student model on the combination of labeled images and pseudo labeled images. For this purpose, we use the recently developed EfficientNet architectures [69] because they have a larger capacity than ResNet architectures [23]. We have also observed that using hard pseudo labels can achieve as good or slightly better results when a larger teacher is used. In the ablations, we start with the 130M unlabeled images and gradually reduce the number of images, and we study the importance of noise and the effect of several noise methods used in our model. EfficientNet-L1 approximately doubles the training time of EfficientNet-L0. Models are available at https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet.

Noisy Student Training is based on the self-training framework and is trained with 4 simple steps: train a classifier on labeled data (the teacher); use the teacher to infer pseudo labels on unlabeled data; train a larger, noised classifier (the student) on the combination of labeled and pseudo labeled data; and iterate the process by putting back the student as the teacher. Requiring the noised student to reproduce the teacher's predictions acts as an invariance constraint that reduces the degrees of freedom in the model. A schematic version of this loop is sketched below.
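The four steps can be written as a compact training loop. The sketch below is a toy, runnable rendition in PyTorch on synthetic data, not the released TensorFlow code: the tiny MLP models, the data tensors, and the widths are placeholders, dropout stands in for all of the model noise, and input noise such as RandAugment is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Synthetic stand-ins for the real data (the paper uses ImageNet as labeled
# data and a much larger unlabeled image collection).
labeled_x, labeled_y = torch.randn(256, 32), torch.randint(0, 10, (256,))
unlabeled_x = torch.randn(1024, 32)

def make_model(width: int) -> nn.Module:
    # Dropout is the stand-in here for the noise applied to the student.
    return nn.Sequential(nn.Linear(32, width), nn.ReLU(),
                         nn.Dropout(0.5), nn.Linear(width, 10))

def train(model: nn.Module, x: torch.Tensor, targets: torch.Tensor,
          epochs: int = 5) -> nn.Module:
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    model.train()  # noise (dropout) is active in training mode
    for _ in range(epochs):
        log_probs = F.log_softmax(model(x), dim=-1)
        loss = -(targets * log_probs).sum(dim=-1).mean()  # works for soft or one-hot targets
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model

# Step 1: train a teacher on labeled data.
teacher = train(make_model(width=64), labeled_x, F.one_hot(labeled_y, 10).float())

for width in (128, 256):  # Step 4: iterate, growing the student each round.
    # Step 2: the un-noised teacher (eval mode) generates soft pseudo labels.
    teacher.eval()
    with torch.no_grad():
        pseudo = F.softmax(teacher(unlabeled_x), dim=-1)
    # Step 3: train an equal-or-larger, noised student on labeled + pseudo-labeled data.
    x = torch.cat([labeled_x, unlabeled_x])
    y = torch.cat([F.one_hot(labeled_y, 10).float(), pseudo])
    student = train(make_model(width), x, y)
    teacher = student  # the student becomes the next teacher
```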
First, a teacher model is trained in a supervised fashion. The architectures for the student and teacher models can be the same or different. The idea extends self-training and distillation: by adding three kinds of noise and distilling multiple times, the student model ends up with better generalization performance than the teacher model. When the student model is deliberately noised, it is in effect trained to be consistent with the more powerful teacher model, which is not noised when it generates the pseudo labels. However, in the case with 130M unlabeled images, even with the noise function removed, the performance still improves to 84.3% from the 84.0% supervised baseline. As can be seen from the figure, our model with Noisy Student makes correct predictions for images under severe corruptions and perturbations such as snow, motion blur and fog, while the model without Noisy Student suffers greatly under these conditions; an evaluation script for ImageNet-A is available at https://github.com/hendrycks/natural-adv-examples/blob/master/eval.py.

For labeled images, we use a batch size of 2048 by default and reduce the batch size when we cannot fit the model into memory. For classes where we have too many unlabeled images, we take the images with the highest confidence, as in the selection sketch below.
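A possible implementation of that confidence-based selection is sketched below with NumPy. The per-class budget and the variable names are illustrative assumptions, not values from the paper.

```python
import numpy as np

def balance_by_confidence(confidences: np.ndarray, labels: np.ndarray, per_class: int) -> np.ndarray:
    """Return indices of at most `per_class` images per pseudo-labeled class,
    keeping the images on which the teacher is most confident.

    confidences[i] is the teacher's max softmax probability for image i and
    labels[i] is the corresponding pseudo label. `per_class` is an assumed,
    illustrative budget.
    """
    keep = []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        order = idx[np.argsort(-confidences[idx])]  # most confident first
        keep.append(order[:per_class])
    return np.concatenate(keep)

# Example: 10 pseudo-labeled images over 3 classes, keep at most 2 per class.
conf = np.array([0.9, 0.4, 0.8, 0.7, 0.6, 0.95, 0.3, 0.5, 0.85, 0.2])
lab  = np.array([0,   0,   0,   1,   1,   1,    2,   2,   2,    2])
print(balance_by_confidence(conf, lab, per_class=2))
```

Keeping only the most confident pseudo-labeled images for over-represented classes keeps the combined training set roughly balanced across classes.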
Unlike previous studies in semi-supervised learning that use in-domain unlabeled data (e.g., CIFAR-10 images as unlabeled data for a small CIFAR-10 training set), to improve ImageNet we must use out-of-domain unlabeled data. Noisy Student Training extends the idea of self-training and distillation with the use of equal-or-larger student models and noise added to the student during learning. Noisy Student (B7) means using EfficientNet-B7 for both the student and the teacher. For a small student model, using our best model, Noisy Student (EfficientNet-L2), as the teacher leads to more improvement than using the same model as the teacher, which shows that it is helpful to push the performance with our method when small models are needed for deployment. Our largest model, EfficientNet-L2, needs to be trained for 3.5 days on a Cloud TPU v3 Pod, which has 2048 cores. Noisy Student Training achieves 88.4% top-1 accuracy on ImageNet, which is 2.0% better than the state-of-the-art model that requires 3.5B weakly labeled Instagram images; those prior methods also did not show significant improvements in terms of robustness on ImageNet-A, C and P as we did. Self-training was previously used to improve ResNet-50 from 76.4% to 81.2% top-1 accuracy [76], which is still far from the state-of-the-art accuracy. Other related works use a teacher-student setup for domain adaptation; their purpose is different from ours, namely to adapt a teacher model trained on one domain to another domain. An important contribution of our work was to show that Noisy Student can potentially help address the lack of robustness in computer vision models: the swing in the picture is barely recognizable by a human, while the Noisy Student model still makes the correct prediction.

To noise the student we use stochastic depth [29], dropout [63] and RandAugment [14]. Stochastic depth is a training procedure that enables the seemingly contradictory setup of training short networks and using deep networks at test time; it reduces training time substantially and improves the test error significantly on almost all of the data sets on which it was evaluated. Here we show the evidence in Table 6: noise such as stochastic depth, dropout and data augmentation plays an important role in enabling the student model to perform better than the teacher. The performance consistently drops when the noise function is removed. For example, with all noise removed, the accuracy drops from 84.9% to 84.3% in the case with 130M unlabeled images and from 83.9% to 83.2% in the case with 1.3M unlabeled images. We use a resolution of 800x800 in this experiment.
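As a concrete illustration of how input noise might be applied to the student but not the teacher, here is a small PyTorch/torchvision sketch. It assumes a recent torchvision that ships transforms.RandAugment; the crop sizes, magnitude, and the helper function are illustrative choices, not the paper's exact settings (model noise such as dropout and stochastic depth lives inside the network itself and is switched off by eval mode).

```python
import torch
from torchvision import transforms

# Input noise for the student: RandAugment plus standard flips/crops.
student_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.RandAugment(num_ops=2, magnitude=9),
    transforms.ToTensor(),
])

# The teacher sees clean, deterministically preprocessed images when it
# produces pseudo labels.
teacher_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

def predict_clean(teacher: torch.nn.Module, images: torch.Tensor) -> torch.Tensor:
    """Pseudo-labeling pass: eval mode disables dropout and stochastic depth."""
    teacher.eval()
    with torch.no_grad():
        return torch.softmax(teacher(images), dim=-1)
```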
Figure 1(b) shows images from ImageNet-C and the corresponding predictions. Noisy Student Training seeks to improve on self-training and distillation in two ways: by using an equal-or-larger student model and by adding noise to the student during learning. The algorithm is iterated a few times by treating the student as a teacher to relabel the unlabeled data and training a new student; for example, with EfficientNet-L0 as the teacher, we then trained a student model EfficientNet-L1, a wider model than L0. In one set of experiments we vary the model size from EfficientNet-B0 to EfficientNet-B7 [69] and use the same model as both the teacher and the student. Prior work has experimentally validated that, for a target test resolution, using a lower training resolution offers better classification at test time, and proposed a simple yet effective strategy to optimize classifier performance when the train and test resolutions differ. However, an important requirement for Noisy Student to work well is that the student model needs to be sufficiently large to fit more data (labeled and pseudo labeled). Code is available at https://github.com/google-research/noisystudent.

As stated earlier, we hypothesize that noising the student is needed so that it does not merely learn the teacher's knowledge. In one ablation we use the standard augmentation instead of RandAugment. Stochastic depth is a simple yet ingenious way to add noise to the model by bypassing the transformations through skip connections; a minimal residual block with stochastic depth is sketched below.
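The block below illustrates that idea: during training the residual branch is randomly skipped, and at test time it is always applied but scaled by its survival probability. This follows the standard stochastic depth formulation; the layer sizes and survival probability are illustrative choices, not EfficientNet's.

```python
import torch
import torch.nn as nn

class StochasticDepthBlock(nn.Module):
    """Residual block whose transformation branch is randomly bypassed."""

    def __init__(self, channels: int, survival_prob: float = 0.8):
        super().__init__()
        self.survival_prob = survival_prob
        self.branch = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training:
            # With probability (1 - survival_prob), drop the branch entirely
            # and pass the input through the skip connection unchanged.
            if torch.rand(1).item() > self.survival_prob:
                return x
            return x + self.branch(x)
        # At test time, keep the branch but scale it by its survival probability.
        return x + self.survival_prob * self.branch(x)

# Usage: the block behaves stochastically only in train mode.
block = StochasticDepthBlock(channels=16)
x = torch.randn(2, 16, 8, 8)
block.train(); y_train = block(x)
block.eval();  y_eval = block(x)
```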