A reliable dataset [16–18] is necessary for building a robust model. We used the benchmark datasets constructed by Zhang et al. [15]. The original collection contained 17,403 BLPs from three species, namely, bacteria, eukaryote, and archaea, which were collected from UniProt (Jul. 2016). Accordingly, four benchmark datasets were generated: one general dataset and three species-specific datasets (bacteria, eukaryote, and archaea). To avoid homology bias and remove redundant sequences from the benchmark datasets, BLASTClust [19] was used to cluster all the protein sequences with a sequence identity cutoff of 30%, and one protein was then randomly picked from each cluster as its representative. This yielded 863 BLPs as positive samples, of which 748 belong to bacteria, 70 to eukaryote, and 45 to archaea. Additionally, 7,093 nonredundant non-BLPs were collected as negative samples, comprising 4,919, 1,426, and 748 proteins from bacteria, eukaryote, and archaea, respectively. To construct balanced training datasets, 80% of the positive samples and an equal number of negative samples were randomly selected for model training. The remaining positive and negative samples were used for independent testing. The final four benchmark datasets are summarized in Table 1. All data are available at http://lin-group.cn/server/iBLP/download.html.
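The sampling procedure described above can be illustrated with a minimal Python sketch. It is not the original pipeline: the function names `pick_representatives` and `balanced_split`, the placeholder identifiers, the fixed random seed, and the rounding of the 80% fraction to the nearest integer are all assumptions made for illustration. Clusters are assumed to be already parsed into lists of sequence identifiers.

```python
import random


def pick_representatives(clusters, rng):
    """Randomly keep one sequence ID from each cluster (hypothetical helper)."""
    return [rng.choice(cluster) for cluster in clusters if cluster]


def balanced_split(positives, negatives, train_fraction=0.8, rng=None):
    """Split samples into a balanced training set and an independent test set.

    The training set takes `train_fraction` of the positives and an equal
    number of negatives; every remaining sample goes to the test set.
    Rounding to the nearest integer is an assumption, not taken from the text.
    """
    rng = rng or random.Random(0)
    n_train = round(train_fraction * len(positives))
    if n_train > len(negatives):
        raise ValueError("not enough negative samples for a balanced training set")

    pos, neg = list(positives), list(negatives)
    rng.shuffle(pos)
    rng.shuffle(neg)

    train = {"pos": pos[:n_train], "neg": neg[:n_train]}
    test = {"pos": pos[n_train:], "neg": neg[n_train:]}
    return train, test


rng = random.Random(2016)  # arbitrary seed, for reproducibility only

# Toy clusters standing in for parsed clustering output (placeholder IDs).
toy_clusters = [["P1", "P2", "P3"], ["P4"], ["P5", "P6"]]
representatives = pick_representatives(toy_clusters, rng)

# Placeholder IDs sized like the general dataset: 863 BLPs, 7,093 non-BLPs.
positives = [f"BLP_{i}" for i in range(863)]
negatives = [f"nonBLP_{i}" for i in range(7093)]
train, test = balanced_split(positives, negatives, rng=rng)

print(len(train["pos"]), len(train["neg"]), len(test["pos"]), len(test["neg"]))
```

The same `balanced_split` call could be repeated on the bacteria, eukaryote, and archaea subsets to obtain the species-specific splits; under the rounding assumption used here, the resulting counts may differ slightly from those reported in Table 1.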

The constructed benchmark datasets for BLP prediction.