A more parameter-efficient bottleneck for EfficientNet!

Linear Bottleneck with Efficient Channel Attention instead of Squeeze Excitation

Introduction

Linear Bottleneck

Linear Bottleneck (cf MobileNetV2 paper)

Linear Bottleneck is a residual layer composed of one 1x1 convolution, followed by a 3x3 depthwise convolution, then finally a 1x1 convolution. It is used as a basis for EfficientNet because of its efficiency (yeah). 1x1 convolutions don’t require much parameters so do 3x3 depthwise convolutions, and they’re fast to compute.

However, linear bottlenecks don’t have an attention layer.

EfficientNet: Linear Bottleneck + Squeeze Excitation

EfficientNet solving the “attention problem”: linearbottleneck_se. A better representation with hard-sigmoid can be found in MobileNetV3 paper (page 3)

So it’s almost the same, 1x1conv, 3x3dwise, SE, 1x1conv, this is what powers SOTA.

SE: Provided a C channels feature map, SE produces C scalars to be multiplied with each channel. You have a feature map of shape B*C*H*W, you apply average pooling: you get B*C*1*1, then a 1x1 convolution (same as a simple linear layer multiplying with weights and summing everything with biases), an activation layer, another 1x1 and a sigmoid. You then get B*C*1*1 positive scalars to be multiplied with your B*C*H*W feature map, which effectively produces an attention over some channels of your feature map.

Problem is, SE is highly parametric. It undoubtedly enhances the accuracy but multiplying everything with everything isn’t necessary “efficient”. EfficientNetB0 has 5.3M #Params and 0.39B #FLOPS.

We reproduced Efficient Net B0 architecture with some tweaks on the hyperparameters (optimizer, augmentation..) for our usage (fast non-dataset specific learning). EfficientNet results are hard to reproduce on low/middle-end hardware (it’s also reported here). Working re-implementations of training in Pytorch include rwightman and kakaobrain’s fast-autoaugment. This is our result:

Our reproduction of EfficientNetB0 after ~50 epochs (source for red line)

As you can see, we didn’t manage to reproduce the reported accuracy with our setup, but it doesn’t really matter for this experiment.

EfficientNetECA: Linear Bottleneck + Efficient Channel Attention

ECA layer : https://arxiv.org/pdf/1910.03151.pdf

ECA models are more parameter-efficient than SENet-50 (SEResNeXt-50) which added +10% parameters compared to ResNet-50, while ECANet-50 adds less than 1% in parameters and GFLOPS (though ECANet does reduce the FPS by 3%, while SENet reduced the FPS by 13%). This inefficiency was already reported in the SE paper.

It’s basic, ECA captured what they called “local cross-channel interaction”. Which can be translated to “Conv1D with low kernel size”.

ECA: Provided a C channels feature map, ECA produces C scalars to be multiplied with each channel. You have a feature map of shape B*C*H*W, you apply average pooling: you get B*C*1*1, then a 1D-convolution with a low kernel_size (3 for example), no bias, sigmoid, and voilà. You then get B*C*1*1 positive scalars to be multiplied with your B*C*H*W feature map, which effectively produces an attention over some channels of your feature map.

And that’s it. Now introducing: Linear Bottleneck ECA. What is it? It’s Linear Bottleneck SE (EfficientNet main block), with an ECA layer in place of the SE layer. How it performs?

Validation Accuracy we got with EfficientNetECA

Ok, it’s the same, what’s the point? Well B0 has 5,288,548 parameters. ECA-B0 has 4,652,084 parameters. We reduced the number of parameters of the SOTA architecture by 12%!

After 66.560.000 images seen (52 epochs), for one training, on our setup, B0 without any attention layers is at 72.57%, B0-SE is at 73.17% and B0-ECA is at 73.31%. This is evaluated on the whole Imagenet validation set. It’s not significant but it supports the hypothesis that ECA performs better or the same than SE, with much less parameters.

Tl;dr

Conclusion / Further works

We don’t have the hardware to train EfficientNet up to it’s reported accuracy but we hope someone will try training a NoisyStudent-FixEfficientNetECA-L2. We’ll do it by ourselves if we manage to find the gpu-day for this usage, or we wish that the ECA team will try it (they already published the results and their github for ECA_MobileNetV2), stay tuned!

_____________________________________

More plots

Partial Top1Acc = f(n_images_seen) — with curve smoothing

We can see that SE seems to perform better at the beginning of training (in terms of iteration), even though a complete evaluation on the whole set after 52 epochs puts ECA ahead.

___

Partial Top1Acc = f(time) — with curve smoothing

In terms of training speed there’s no clear winner, it looks like SE performs better at the end of the training.

Addendums

Addendum 2 (2020/07/24): Of course accuracies would need to be evaluated more properly, ideally by averaging it on multiple trainings on a completely working reproduction of the SOTA. This article is just a quick ablation study, it’s not as rigourous as research papers.

Addendum 3 (2020/07/25): It’s worth noting that attention layers won’t improve your network massively. SE improved Resnet50 accuracy by 1.51% (in absolute value) and ECA improved it by 2.28%. It’s an easy improvement but on a custom dataset with a custom network you’ll probably see more improvements by modifying other hyperparameters.

Addendum 4 (2020/07/28): +plots, +code

Addendum 5 (2020/10/12): Other results could indicate that ECA, while being more parameter-efficient and still better than no attention, could have a lower score than SE on some trainings.

Code

____________________________________________

This article was written on Medium, by Elie D, for https://www.hyugen.com/en. We are working on AGI.

Support us 💙

Date: 2020/07/21

We work on statistics, machine learning, AI and we want to bring Artificial General Intelligence into the real world — Website: https://www.hyugen.com

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store