A brief version of this protocol appeared in the authors' original research paper (Jan 2020).

Sequence Alignment Using Machine Learning for Accurate Template-based Protein Structure Prediction

Abstract

Template-based modeling, the process of predicting the tertiary structure of a protein from homologous protein structures, is useful when good templates are available. Indeed, modern homology detection methods can find remote homologs with high sensitivity. However, the accuracy of template-based models built from alignments produced by homology detection is often lower than that of models built from ideal alignments. In this study, we propose a new method that generates pairwise sequence alignments for more accurate template-based modeling. Our method trains a machine learning model using the structural alignments of known homologs. When calculating sequence alignments, the method dynamically predicts a substitution score from the trained model instead of using a fixed substitution matrix.

Keywords: Template-based modeling, Homology modeling, Sequence alignment, Machine learning, k-Nearest Neighbor

Background

Proteins are key molecules in biology, biochemistry, and pharmaceutical sciences. To reveal the functions of proteins, it is essential to understand the relationship between protein structure and function. Protein structures can be determined experimentally; the resulting structures are usually deposited in, and accessible through, the Protein Data Bank (PDB) (wwPDB consortium, 2018). However, despite improvements in experimental methods for determining protein structures, the rate at which amino acid sequences are determined has overtaken our ability to ascertain the corresponding structures (Muhammed and Aki-Yalcin, 2019). Therefore, protein structure prediction remains essential.

Among the various methods for protein structure prediction, template-based modeling (homology modeling) predicts a structure from templates and their sequence alignments to the target protein. Templates are the structures of homologous proteins, usually found by homology detection methods. Template-based modeling is currently the most practical approach because the predicted models are often accurate when good templates and good sequence alignments can be found. Such accurate models can be used for computer-aided drug design (CADD).

Indeed, recent homology search methods are able to detect remote homologs (Boratyn et al., 2012; Zimmermann et al., 2018). However, sufficiently accurate structure models sometimes cannot be obtained because the sequence alignment generated by the homology detection program is of poor quality. If a more accurate model is required, researchers must manually edit the alignment to improve its quality before modeling. In structural alignment, the structural difference between the target structure and the template structure is minimized; thus, sequence alignments derived from structural alignment are almost ideal for template-based modeling. The sequence alignments generated by homology detection methods often differ from those derived from structural alignment, especially for remote homologs. Thus far, the ability to detect remote homologs has been prioritized because models cannot be generated without a template. However, to achieve more accurate template-based modeling, improving sequence alignment generation is a critical open problem. This problem has been noted in several studies (e.g., Kopp et al., 2007), in which researchers tried to improve alignments manually based on their knowledge of biology; fully automated methods are still required.

Recently, machine learning methods have proven powerful in various fields (Lyons et al., 2014; Cao et al., 2016; Wang et al., 2016; Wei and Zou, 2016; Manavalan and Lee, 2017; Wang et al., 2017). Machine learning also seems promising for the problem of alignment generation for homology modeling. However, this topic has not been studied because it is challenging to treat alignment generation as a classification or regression problem.

To address this problem, we proposed a new sequence alignment generation protocol based on machine learning that learns from the structural alignments of known homologs (Makigaki and Ishida, 2019). During the dynamic programming that aligns two sequences, a substitution score is predicted dynamically by a k-Nearest Neighbor (k-NN) model instead of being taken from a fixed substitution matrix or a profile comparison. Machine learning is thus used in the substitution score prediction step.
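
As a rough illustration of this idea, the local alignment recurrence can be written with a predicted score in place of a matrix lookup. The notation $\hat{s}(i,j)$ and the linear gap penalty $g$ are assumptions made only for this sketch; the gap model actually used by the tool is not stated here.

```latex
H_{i,j} = \max\left\{\, 0,\;
    H_{i-1,j-1} + \hat{s}(i,j),\;
    H_{i-1,j} - g,\;
    H_{i,j-1} - g \,\right\},
\qquad H_{i,0} = H_{0,j} = 0
```

Here $\hat{s}(i,j)$ stands for the k-NN score predicted for residue $i$ of one sequence and residue $j$ of the other, where a conventional aligner would look up a fixed matrix entry for the two amino acid types.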

The proposed method is valuable for researchers who perform template-based modeling with remote homologs whose sequence identity to the target is low. This protocol gives an overview of the method as a procedure; more detailed usage of our tool and some examples are available in the source code repository (https://github.com/shuichiro-makigaki/exmachina).

Equipment

  1. Computer
    > 128 GiB RAM and > 150 GiB free storage are recommended
  2. Linux (> 3.10) or SUSE Linux Enterprise Server 12

Software

  1. PSI-BLAST (> v2.9)
    To generate PSSM of an amino acid sequence
    Download URL: https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download (Last access date: 2020-02-22)
    Installation document URL: https://www.ncbi.nlm.nih.gov/books/NBK279690/
  2. TM-align (> v20190822) (Zhang and Skolnick, 2005)
    To generate structural alignment of homologs
    Download and installation document URL: https://zhanglab.ccmb.med.umich.edu/TM-align/ (Last access date: 2020-02-22)
  3. Implementation: Source code and installation document are available in the source code repository.
    Download URL: https://github.com/shuichiro-makigaki/exmachina/archive/master.zip
    Installation Procedure: https://github.com/shuichiro-makigaki/exmachina#how-to-use
    (Last access date: 2020-02-22)
    Python 3.6: Required python packages are listed in the repository.
  4. FLANN (Muja and Lowe, 2009): k-Nearest Neighbor implementation. The installation procedure also contains the FLANN installation document.
  5. Structural Classification of Proteins (SCOP) database
    The SCOP database classifies proteins by class, fold, superfamily (SF), family, and domain, based on manually curated function/structure classifications, and it contains redundant sequences. We therefore used the SCOP40 database, which contains only domains whose sequence identity is < 40%, to avoid overfitting and to reduce execution time.
    Download URL: https://scop.berkeley.edu/astral/pdbstyle/ver=1.75 (Last access date: 2020-02-22)
  6. UniRef (The UniProt Consortium, 2016) database
    For Position Specific Scoring Matrix (PSSM) generation, we used three-iteration PSI-BLAST (Altschul et al., 1997) with the UniRef90 database.
    Download URL: https://www.uniprot.org/downloads#unireflink (Last access date: 2020-02-22)
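
As a hedged illustration of the PSSM generation step, the sketch below builds a PSI-BLAST command line in Python. The flags `-query`, `-db`, `-num_iterations`, and `-out_ascii_pssm` are standard BLAST+ options; the file names and the database name `uniref90` are hypothetical placeholders, and no E-value thresholds are set because none are stated in this protocol.

```python
from typing import List


def psiblast_pssm_command(query_fasta: str, db: str, out_pssm: str,
                          iterations: int = 3) -> List[str]:
    """Build a PSI-BLAST argument list that writes an ASCII PSSM file."""
    return [
        "psiblast",
        "-query", query_fasta,               # single-sequence FASTA file
        "-db", db,                           # formatted BLAST database name
        "-num_iterations", str(iterations),  # three iterations in this protocol
        "-out_ascii_pssm", out_pssm,         # PSSM written after the last round
    ]


# Hypothetical file and database names, for illustration only.
cmd = psiblast_pssm_command("target.fasta", "uniref90", "target.pssm")
print(" ".join(cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Building the argument list separately from running it makes the command easy to inspect or log before launching a long search.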

Procedure

The primary purpose of the training phase is to generate a k-NN model that is used for substitution score prediction in the prediction and alignment generation phase. The prediction phase consists of score prediction and alignment generation. Figure 1 shows an overview of the method. More detailed step-by-step commands and an example are available in the source code repository (https://github.com/shuichiro-makigaki/exmachina).


Figure 1. Overview of the proposed method

  1. Model training
    1. Download the SCOP40 database.
    2. Generate structural alignments of every domain pair in the same SF with TM-align.
    3. Select only pairs whose TM-score is ≥ 0.5.
    4. Generate a PSSM for each domain by three-iteration PSI-BLAST with the UniRef90 database.
    5. Generate training data and labels.
      The window size, a hyper-parameter, is 5.
    6. Reduce the training dataset to 1/10 by random sampling.
      The original training dataset was too large to process within a reasonable computation time.
    7. Save the training dataset and the labels in a FLANN-compatible data format.
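
The training data generation and subsampling steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the repository's implementation: the feature vector is assumed to be the concatenation of the PSSM rows in a window around each residue of a pair, zero-padded at sequence ends, and the label is assumed to be 1 when the pair is aligned in the structural alignment and 0 otherwise. All function names are hypothetical.

```python
import random


def window_features(pssm, center, window=5):
    """Flatten the PSSM rows in a window around `center`, zero-padding past the ends."""
    half = window // 2
    n_col = len(pssm[0])
    features = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(pssm):
            features.extend(pssm[pos])
        else:
            features.extend([0.0] * n_col)
    return features


def pair_vector(pssm_a, i, pssm_b, j, window=5):
    """Assumed feature layout: window of sequence A followed by window of sequence B."""
    return window_features(pssm_a, i, window) + window_features(pssm_b, j, window)


def build_training_set(pssm_a, pssm_b, aligned_pairs, window=5):
    """Label every residue pair 1 if structurally aligned, else 0 (an assumption)."""
    aligned = set(aligned_pairs)
    vectors, labels = [], []
    for i in range(len(pssm_a)):
        for j in range(len(pssm_b)):
            vectors.append(pair_vector(pssm_a, i, pssm_b, j, window))
            labels.append(1 if (i, j) in aligned else 0)
    return vectors, labels


def subsample(vectors, labels, fraction=0.1, seed=0):
    """Keep a random fraction of the examples, as in the 1/10 reduction step."""
    rng = random.Random(seed)
    n_keep = max(1, int(len(vectors) * fraction))
    keep = sorted(rng.sample(range(len(vectors)), n_keep))
    return [vectors[k] for k in keep], [labels[k] for k in keep]


# Toy data: random 20-column "PSSMs" for domains of length 6 and 7.
toy_rng = random.Random(1)
pssm_a = [[toy_rng.uniform(-5, 5) for _ in range(20)] for _ in range(6)]
pssm_b = [[toy_rng.uniform(-5, 5) for _ in range(20)] for _ in range(7)]
aligned_pairs = [(0, 0), (1, 1), (2, 3)]  # hypothetical structural alignment

vectors, labels = build_training_set(pssm_a, pssm_b, aligned_pairs)
small_vectors, small_labels = subsample(vectors, labels)
print(len(vectors), len(vectors[0]), sum(labels), len(small_vectors))
```

Because every residue pair of every domain pair becomes one example, the dataset grows with the product of the two sequence lengths, which is consistent with the need to subsample before building the k-NN index.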

  2. Score prediction and sequence alignment generation
    1. Prepare two homologous amino acid sequences.
      As a current limitation, our implementation expects the inputs to be sub-domains. When a protein consists of multiple domains, it should be split into domains. Usually, the domain regions can be predicted by homology detection.
    2. Generate a PSSM for each sequence by three-iteration PSI-BLAST with the UniRef90 database.
    3. Predict the substitution scores of all residue pairs.
      1. The query vector format is the same as in the training phase, and the k-NN classification scores are used directly as the substitution scores.
      2. As hyper-parameters, the window size is 5 and the number of neighbors is 1,000.
    4. Save the predicted substitution score matrix.
    5. Generate a local sequence alignment with the Smith-Waterman algorithm implemented in Biopython (https://biopython.org/).
      During the dynamic programming, the predicted substitution scores are used for score calculation.
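
The score prediction and alignment steps above can be illustrated with a small, self-contained sketch. It is not the tool's code: a brute-force nearest-neighbor search stands in for FLANN, a plain-Python Smith-Waterman stands in for the Biopython implementation, and the classification score is assumed to be the fraction of positive labels among the k neighbors. The linear gap penalty `gap` and the `offset` subtracted so that scores can become negative are hypothetical parameters introduced only for this example.

```python
def knn_score(query, train_vectors, train_labels, k):
    """Fraction of the k nearest training vectors (squared Euclidean) labeled 1."""
    ranked = sorted(
        (sum((q - t) ** 2 for q, t in zip(query, vec)), label)
        for vec, label in zip(train_vectors, train_labels)
    )
    nearest = ranked[:k]
    return sum(label for _, label in nearest) / len(nearest)


def smith_waterman(score, gap=0.5):
    """Local alignment over a precomputed score matrix with a linear gap penalty."""
    n, m = len(score), len(score[0])
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0.0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i][j] = max(0.0,
                          H[i - 1][j - 1] + score[i - 1][j - 1],
                          H[i - 1][j] - gap,
                          H[i][j - 1] - gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until the local score drops to zero.
    pairs = []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + score[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] - gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best, pairs


# Toy 1-D training set: vectors near 0 are mostly "aligned" (label 1).
train_vectors = [[0.0], [0.1], [0.2], [5.0], [5.1]]
train_labels = [1, 1, 0, 0, 0]

# One hypothetical query vector per residue pair of two length-2 sequences.
queries = [[[0.05], [5.0]],
           [[5.05], [0.1]]]
offset = 0.5  # hypothetical shift so that unlikely pairs score below zero
score = [[knn_score(q, train_vectors, train_labels, k=3) - offset for q in row]
         for row in queries]

best, pairs = smith_waterman(score)
print(best, pairs)
```

In the protocol itself the number of neighbors is 1,000 and the query vectors come from PSSM windows; the tiny values here only keep the example readable.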

Acknowledgments

This work was supported by JSPS KAKENHI [18K11524]. This protocol is based on our previous work (Makigaki and Ishida, 2019).

Competing interests

The authors declare no competing interests.

References

  1. Altschul, S. F., Madden, T. L., Schäffer, A. A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D. J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25(17): 3389-3402.
  2. Boratyn, G. M., Schaffer, A. A., Agarwala, R., Altschul, S. F., Lipman, D. J. and Madden, T. L. (2012). Domain enhanced lookup time accelerated BLAST. Biol Direct 7: 12.
  3. Cao, R., Bhattacharya, D., Hou, J. and Cheng, J. (2016). DeepQA: improving the estimation of single protein model quality with deep belief networks. BMC Bioinformatics 17(1): 495.
  4. Kopp, J., Bordoli, L., Battey, J. N., Kiefer, F. and Schwede, T. (2007). Assessment of CASP7 predictions for template-based modeling targets. Proteins 69 Suppl 8: 38-56.
  5. Lyons, J., Dehzangi, A., Heffernan, R., Sharma, A., Paliwal, K., Sattar, A., Zhou, Y. and Yang, Y. (2014). Predicting backbone Calpha angles and dihedrals from protein sequences by stacked sparse auto-encoder deep neural network. J Comput Chem 35(28): 2040-2046.
  6. Makigaki, S. and Ishida, T. (2019). Sequence alignment using machine learning for accurate template-based protein structure prediction. Bioinformatics.
  7. Manavalan, B. and Lee, J. (2017). SVMQA: support-vector-machine-based protein single-model quality assessment. Bioinformatics 33(16): 2496-2503.
  8. Muhammed, M. T. and Aki-Yalcin, E. (2019). Homology modeling in drug discovery: Overview, current applications, and future perspectives. Chem Biol Drug Des 93(1): 12-20.
  9. Muja, M. and Lowe, D. (2009). Fast approximate nearest neighbors with automatic algorithm configuration. In: VISAPP International Conference on Computer Vision Theory and Applications, Volume 1. Lisboa, Portugal, February 5-8, 2009.
  10. The UniProt Consortium (2016). UniProt: the universal protein knowledgebase. Nucleic Acids Res 45(D1): D158-D169.
  11. Wang, S., Peng, J., Ma, J. and Xu, J. (2016). Protein secondary structure prediction using deep convolutional neural fields. Sci Rep 6: 18962.
  12. Wang, S., Sun, S., Li, Z., Zhang, R. and Xu, J. (2017). Accurate de novo prediction of protein contact map by ultra-deep learning model. PLoS Comput Biol 13(1): e1005324.
  13. Wei, L. and Zou, Q. (2016). Recent progress in machine learning-based methods for protein fold recognition. Int J Mol Sci 17(12).
  14. wwPDB consortium. (2018). Protein Data Bank: the single global archive for 3D macromolecular structure data. Nucleic Acids Res 47(D1): D520-D528.
  15. Zhang, Y. and Skolnick, J. (2005). TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res 33(7): 2302-2309.
  16. Zimmermann, L., Stephens, A., Nam, S. Z., Rau, D., Kubler, J., Lozajic, M., Gabler, F., Soding, J., Lupas, A. N. and Alva, V. (2018). A completely reimplemented MPI bioinformatics toolkit with a new HHpred server at its core. J Mol Biol 430(15): 2237-2243.

Copyright: © 2020 The Authors; exclusive licensee Bio-protocol LLC.
Citation: Makigaki, S. and Ishida, T. (2020). Sequence Alignment Using Machine Learning for Accurate Template-based Protein Structure Prediction. Bio-protocol 10(9): e3600. DOI: 10.21769/BioProtoc.3600.