As mentioned earlier, the proposed model was trained in a two-stage process, detailed in Algorithm 1. First, the classifier of each pretrained model was replaced by the previously described architecture, the feature layers were frozen, leaving only the fully connected layers available for training, and their parameters were randomly initialized with He uniform initialization [18]. This configuration was trained for 30 epochs in the first stage. An epoch refers to one complete training cycle through the training set; within each epoch, the data is fed in batches, and once all batches have been processed the epoch is complete. Immediately after, the second stage, also known as fine-tuning, unfreezes the whole model and trains it for 20 epochs (the Inception v3 model was trained for 30 epochs due to its higher complexity and overall better performance). Images were preprocessed and fed to the networks in batches of 32 for both training and testing. Since the single output node for binary classification uses a sigmoid activation, the loss function is binary cross-entropy. The two-stage training method prevents the random initialization of the output fully connected layer from disrupting the parameters already learned by the pretrained models: those parameters are only fine-tuned after the first stage has optimized the output layer.

Algorithm 1. Individual model training algorithm

Regarding the optimizer for the ensemble model, we used the Adam algorithm with decoupled weight decay [19], also known as the AdamW optimizer. It has been shown that the Adam optimizer with L2 regularization generally fails to converge to a global optimum, since its regularization term is not equivalent to weight decay as in Stochastic Gradient Descent (SGD) optimization; instead it converges quickly and uniformly to a local optimum. This is why SGD with momentum has been the optimizer of choice for many state-of-the-art neural networks. The AdamW optimizer, on the other hand, correctly applies the weight decay after the moving averages are computed. This substantially reduces overfitting (the model losing the ability to generalize and accurately predict on new data), especially when dealing with small datasets.
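In PyTorch the two variants are separate optimizer classes; a minimal sketch of the distinction (the hyperparameter values here are illustrative, not the paper's):

```python
import torch

w = torch.nn.Parameter(torch.randn(3, 3))

# Adam + L2: the decay term is added to the gradient and then rescaled by the
# adaptive moment estimates, so it does not act as a true weight decay.
adam_l2 = torch.optim.Adam([w], lr=1e-3, weight_decay=1e-2)

# AdamW: the decay is applied to the weights directly, decoupled from the
# gradient-based update, recovering the behavior of weight decay in SGD.
adamw = torch.optim.AdamW([w], lr=1e-3, weight_decay=1e-2)

loss = (w ** 2).sum()
loss.backward()
adamw.step()  # one update with decoupled weight decay
```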

The learning rate was set to 1e-3 for the first stage and 1e-4 for the second, to account for fine-tuning: smaller gradient descent steps prevent the feature layer values from varying significantly. This is one of the most important hyperparameters in neural networks: the learning rate controls how much the model's parameters are updated in response to the network's error. Selecting a proper learning rate is of utmost importance, since a value too high could cause the objective function to diverge, while a value too low could make learning too slow or cause the loss function to converge to a local optimum. To ensure more robust control of the training process, we set a learning rate scheduler for both stages, based on validation accuracy, with a patience of 10 and a reduction factor of 0.1. In this way, when training reaches a point where no improvement is seen for 10 epochs, the learning rate is reduced to 10% of its current value. This further improves training and helps avoid overfitting by gradually reducing the learning rate once the increase in validation accuracy has stalled.
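This schedule corresponds to PyTorch's `ReduceLROnPlateau`; a small sketch with a simulated stalled validation accuracy (the model and accuracy values are placeholders):

```python
import torch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Reduce the LR to 10% of its current value after 10 epochs without
# improvement; mode="max" because validation accuracy is to be maximized.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.1, patience=10
)

val_accuracy = 0.5  # stand-in for a validation accuracy that has stalled
for epoch in range(12):
    # in a real loop: train, evaluate, then step the scheduler on the metric
    scheduler.step(val_accuracy)

print(optimizer.param_groups[0]["lr"])  # ~1e-4, reduced from 1e-3
```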

It is important to note that we also tested a cosine annealing learning rate scheduler with both SGD with momentum [20] and AdamW optimizers, but found no significant improvement over the initial learning rate scheduler. Further testing of these scheduler and optimizer combinations is recommended to properly establish evidence (or lack thereof) of improvements on this dataset. Finally, to avoid training for excessive epochs, we set a best-model checkpoint based on validation accuracy for the second stage: the parameters that achieved the highest validation accuracy during training are the ones used by the model once training is over.
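A best-model checkpoint of this kind can be sketched as follows; the `validation_accuracy` function is a hypothetical stand-in for a real evaluation loop:

```python
import copy
import torch

model = torch.nn.Linear(10, 1)

def validation_accuracy(model, epoch):
    # Stand-in for a real validation loop; peaks at epoch 10 for illustration.
    return 0.9 - 0.01 * abs(epoch - 10)

best_acc, best_state = 0.0, None
for epoch in range(20):
    acc = validation_accuracy(model, epoch)
    if acc > best_acc:
        # Snapshot the weights whenever validation accuracy improves.
        best_acc = acc
        best_state = copy.deepcopy(model.state_dict())

model.load_state_dict(best_state)  # restore the best checkpoint after training
```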

The aforementioned training specifications were applied to each of the six individual models. Finally, after the second stage is completed, the models' outputs are concatenated and all models frozen, and the ensemble's single output neuron is trained for 15 epochs using an AdamW optimizer, a learning rate of 1e-3, and a best-model checkpoint. It is important to note that while some initializations required more than 15 epochs to reach convergence, most of the time convergence was achieved within the first 10 epochs.
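The ensemble head can be sketched as a module that concatenates the frozen base models' outputs and learns a single output neuron. Here small linear layers stand in for the six fine-tuned backbones; this is an illustration of the structure, not the authors' implementation:

```python
import torch
import torch.nn as nn

class Ensemble(nn.Module):
    """Concatenate frozen base model outputs; train one output neuron."""

    def __init__(self, base_models):
        super().__init__()
        self.base_models = nn.ModuleList(base_models)
        for m in self.base_models:
            for p in m.parameters():
                p.requires_grad = False  # base models stay frozen
        # Single output neuron over the six concatenated scores.
        self.head = nn.Linear(len(base_models), 1)

    def forward(self, x):
        outs = torch.cat([m(x) for m in self.base_models], dim=1)
        return self.head(outs)

# Toy stand-ins for the six fine-tuned single-output models.
bases = [nn.Linear(4, 1) for _ in range(6)]
ensemble = Ensemble(bases)
logits = ensemble(torch.randn(2, 4))  # shape: (2, 1)
```

Only `ensemble.head` has trainable parameters, so the 15-epoch ensemble stage updates just that single neuron.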

To test this model against an ensemble baseline, we built a voting classifier from the results of the six models. We used two voting methods: hard voting, in which the final class label is predicted by majority rule over the labels predicted by the six estimators; and soft voting, in which the scores of the six estimators are averaged and rounded to the nearest integer to yield the final prediction [11]. In both methods, a progressive validation of the ensemble's estimators was performed: models were sorted by validation accuracy from highest to lowest, and the ensemble's new accuracy was recorded as each estimator was added. The ensemble with the highest validation accuracy was selected, and its performance metrics were calculated on the holdout test set.
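The two voting rules can be sketched in plain Python (note that with an even number of estimators a hard-voting tie can occur; this sketch breaks ties toward class 0, which may differ from the authors' choice):

```python
def hard_vote(labels):
    """Majority vote over per-model predicted labels (0/1); ties go to 0."""
    return int(sum(labels) > len(labels) / 2)

def soft_vote(scores):
    """Average the per-model sigmoid scores, round to the nearest label."""
    return round(sum(scores) / len(scores))

hard_vote([1, 0, 1, 1, 0, 1])                # 4 of 6 vote 1 -> 1
soft_vote([0.9, 0.4, 0.6, 0.2, 0.55, 0.7])   # mean ~0.56 -> 1
```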

The proposed model was trained on an instance with 61 GB of RAM, four Intel Xeon vCPUs running at 2.7 GHz, and one NVIDIA K80 GPU with 12 GB of memory. The deep learning library PyTorch was used for data preprocessing and model training in Python.