The process of scaling convolutional neural networks is not well understood and is sometimes done arbitrarily until a satisfactory result is found. This process can be tedious because the relevant parameters must be tuned manually [13]. Earlier methods scale a network along a single dimension: depth [33], width [34], or image resolution [35]. Tan and Le [13] studied the influence of these scaling methods in a bid to develop a more systematic way of scaling network architectures. Their key findings can be summarized in two points. Firstly, scaling up any single dimension (resolution, depth, or width) improves accuracy, but the gain diminishes for larger models. Secondly, to achieve better accuracy and efficiency, it is essential to balance a network's depth, width, and resolution instead of focusing on just one of them. Based on these findings, the authors presented a scaling method that uses a single compound coefficient, ϕ, to scale up networks in a more structured manner. Equation (1) shows how the authors [13] suggest scaling the depth, width, and resolution with respect to ϕ.

$$
d = \alpha^{\phi}, \quad w = \beta^{\phi}, \quad r = \gamma^{\phi}, \qquad \text{s.t. } \alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2, \quad \alpha \geq 1,\ \beta \geq 1,\ \gamma \geq 1 \tag{1}
$$

where d, w, and r represent the depth, width, and resolution of the network, respectively, while the constants α, β, and γ are determined by a hyperparameter tuning technique called grid search. The coefficient ϕ is user-specified and controls how many resources are available for model scaling. The constants define how those additional resources are assigned to the dimensions of the network. In this context, FLOPS refers to the number of floating-point operations required to execute the network; the same abbreviation is also used for "floating point operations per second", a measure of computer performance [36]. If the network's depth is doubled, the number of FLOPS required is doubled too. If the network's width or resolution is doubled, the number of FLOPS required is quadrupled. The total FLOPS therefore grow in proportion to (α · β² · γ²)^ϕ, and the constraint α · β² · γ² ≈ 2 in (1) means that, for any value of ϕ, the number of FLOPS increases by a factor of approximately 2^ϕ. Furthermore, the constants must be greater than or equal to one because none of the dimensions should be scaled down. The aim of this method [13] is to scale network depth, width, and resolution such that the accuracy of the network and its consumption of memory and FLOPS are optimized for the available resources.
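As a rough numerical illustration of the compound scaling rule, the sketch below computes the depth, width, and resolution multipliers for a given ϕ, together with the resulting FLOPS growth. The values α = 1.2, β = 1.1, and γ = 1.15 are those reported in [13] for the baseline network; they are not stated in the text above and are used here as an assumption. The function names are hypothetical.

```python
# Assumed constants: the grid-search values reported in [13] for the baseline.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15


def compound_scale(phi, alpha=ALPHA, beta=BETA, gamma=GAMMA):
    """Return the (depth, width, resolution) multipliers for coefficient phi."""
    return alpha ** phi, beta ** phi, gamma ** phi


def flops_factor(phi, alpha=ALPHA, beta=BETA, gamma=GAMMA):
    """Approximate FLOPS growth: linear in depth, quadratic in width and resolution."""
    d, w, r = compound_scale(phi, alpha, beta, gamma)
    return d * w ** 2 * r ** 2


if __name__ == "__main__":
    for phi in range(4):
        d, w, r = compound_scale(phi)
        print(
            f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}, "
            f"FLOPS x{flops_factor(phi):.2f} (2^phi = {2 ** phi})"
        )
```

With these assumed constants, α · β² · γ² is close to, but not exactly, 2, so the FLOPS factor tracks 2^ϕ only approximately.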

To demonstrate the effectiveness of the compound scaling method, the authors [13] developed a mobile-sized baseline network, EfficientNet-B0, using neural architecture search (a technique used here to optimize accuracy and efficiency with respect to FLOPS). The model is built from inverted residual blocks that incorporate squeeze-and-excitation optimization [37] and the swish activation [38]. Swish is defined as

$$
f(x) = x \cdot \sigma(x) = \frac{x}{1 + e^{-x}}
$$

where σ denotes the logistic sigmoid function.
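A minimal sketch of the swish activation for scalar inputs is given below. It assumes the common form x · sigmoid(x), that is, with the optional scaling parameter β described in [38] fixed to 1; the helper names are illustrative only.

```python
import math


def sigmoid(x):
    """Numerically stable logistic sigmoid for a scalar input."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # avoids overflow of exp(-x) for large negative x
    return z / (1.0 + z)


def swish(x):
    """Swish with beta assumed to be 1: x * sigmoid(x)."""
    return x * sigmoid(x)


if __name__ == "__main__":
    for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
        print(f"swish({x:+.1f}) = {swish(x):+.4f}")
```

Unlike ReLU, swish is smooth and takes small negative values for negative inputs, while approaching the identity for large positive inputs.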

The inverted residual block was introduced in the MobileNet-v2 architecture [31] and makes use of depth-wise separable convolution to decrease the number of parameters and multiplications needed to execute the network. This modification results in faster computation without adversely affecting performance. The inverted block consists of three major components: a convolutional layer (the expansion layer), which expands the number of channels to prepare the data for the next layer; a depth-wise convolutional layer; and another convolutional layer (the projection layer), which projects the data from a large number of channels back to a small number of channels. The first and last layers of a residual block are connected via a skip connection. Therefore, during fine-tuning, it is imperative to train entire blocks. Violating this restriction can damage the way the network learns [39]. The squeeze-and-excitation block consists of a global average pooling (GAP) layer, a reshaping step, and two convolutional layers. The GAP layer extracts global features, and the number of channels is then squeezed according to a predefined squeeze ratio.
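To make the parameter savings concrete, the following sketch counts convolution weights for a standard convolution, a depth-wise separable convolution, and the three layers of an inverted residual block. It is a simplified illustration rather than the exact EfficientNet configuration: biases, batch normalization, and the squeeze-and-excitation layers are ignored, and the 3×3 kernel and expansion factor of 6 are assumed example values. All function names are hypothetical.

```python
def standard_conv_params(c_in, c_out, k=3):
    """Weights of a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k=3):
    """Depth-wise k x k convolution followed by a 1 x 1 point-wise convolution."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


def inverted_residual_params(c_in, c_out, expansion=6, k=3):
    """Expansion (1 x 1), depth-wise (k x k), and projection (1 x 1) layers."""
    hidden = c_in * expansion
    expand = c_in * hidden     # expansion layer: few channels -> many
    depthwise = k * k * hidden # depth-wise layer on the expanded channels
    project = hidden * c_out   # projection layer: many channels -> few
    return expand + depthwise + project


if __name__ == "__main__":
    c = 32
    print("standard 3x3 conv:       ", standard_conv_params(c, c))
    print("depth-wise separable:    ", depthwise_separable_params(c, c))
    print("inverted residual block: ", inverted_residual_params(c, c))
    # For comparison: a standard 3x3 convolution at the expanded width.
    print("standard 3x3 at 6x width:", standard_conv_params(6 * c, 6 * c))
```

The last line is included because the fairer comparison for the inverted block is a standard convolution operating at the expanded channel width, which is where the depth-wise layer does its spatial filtering.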

The compound scaling method was then used to create the rest of the EfficientNet family, versions B1 to B7, by keeping the constants α, β, and γ fixed and scaling up ϕ.

The models were evaluated on the ImageNet dataset, where they surpassed state-of-the-art convolutional neural networks in accuracy while being considerably smaller and faster at CPU inference. The results (shown in Figure 7) indicate that, although the models have far fewer parameters and require far fewer FLOPS than established models, they still achieve higher accuracy.

Figure 7. A comparison of EfficientNets with established architectures on the classification of the ImageNet dataset (source: [13]).

These models have been successfully used for other histopathology image classification tasks [40–45]. However, at the time of this research, the EfficientNet architecture had not yet been investigated for classification of the ICIAR2018 dataset. We limit our experimentation to the first six EfficientNets due to computational resource restrictions.