Structural Alignment and Covariation Analysis of RNA Sequences

Abstract

RNA molecules adopt defined structural conformations that are essential to exert their function. During the course of evolution, the structure of a given RNA can be maintained via compensatory base-pair changes that occur among covarying nucleotides in paired regions. Therefore, for comparative, structural, and evolutionary studies of RNA molecules, numerous computational tools have been developed to incorporate structural information into sequence alignments and a number of tools have been developed to study covariation. The bioinformatic protocol presented here explains how to use some of these tools to generate a secondary-structure-aware multiple alignment of RNA sequences and to annotate the alignment to examine the conservation and covariation of structural elements among the sequences.

Keywords: RNA, Sequence, Structure, Alignment, Covariation, Comparative analysis

Background

Biological RNA molecules fold into specific secondary (2D) and tertiary (3D) structures that are critical for their function. Comparative analysis usually requires a sequence alignment, so it is desirable to take structural information into account in order to obtain a more reliable and meaningful alignment. Numerous computational algorithms and tools have been developed to generate alignments based on secondary structure, such as MAFFT (Katoh and Toh, 2008), TurboFold (Tan et al., 2017), R-Coffee (Wilm et al., 2008a), LocARNA (Will et al., 2007), ProbCons (Do et al., 2005), MXSCARNA (Tabei et al., 2008), and LaRA (Bauer et al., 2007). In a benchmark comparison of these leading tools (Tan et al., 2017), TurboFold and MAFFT had comparable accuracies, which were the highest among the tools tested. Because the running time of TurboFold is considerably longer than that of MAFFT (Katoh and Toh, 2008; Tan et al., 2017), we use MAFFT in this protocol: its speed, which can be further increased by parallel processing (Katoh and Standley, 2013), allows a large number (>100) of sequences to be aligned in a limited amount of time. Like several of the other tools, MAFFT employs an iterative strategy in which pairwise structural alignments are computed first and are then progressively combined into a multiple alignment through several rounds of refinement.

Because of the tight structure-function relationship, functional RNAs are under selective pressure to maintain their structures (Nowick et al., 2019). This is reflected by covarying consistent or compensatory mutations in paired nucleotides, which can be observed in sequence alignments. Covariation data are therefore very valuable and have been used to validate or predict the secondary, and even tertiary, structure of RNAs and to understand their evolution (Michel and Westhof, 1990; Cannone et al., 2002). A number of software tools are available for examining covariation within alignments. These include the structural alignment editors RALEE (Griffiths-Jones, 2005), 4SALE (Seibel et al., 2006), S2S (Jossinet and Westhof, 2005), ConStruct (Wilm et al., 2008b), and SARSE (Andersen et al., 2007); R-chie (Lai et al., 2012), a tool that scores and annotates covariation; and more complex programs that perform statistical analysis of covariation with or without a phylogenetic framework, such as R-scape (Rivas et al., 2017) and CoMap (Dutheil, 2012). R-chie highlights base pairs, uses arc diagrams to represent the secondary structure alongside the alignment, and can generate highly customizable figures.

In the protocol below, we explain how to use MAFFT to compute a structural alignment of multiple RNA sequences, and how to use R-chie to annotate the alignment with conservation and covariation information.

Equipment

  1. Personal computer, preferably with multiple processors (CPUs) to speed up computations
    A Unix/Linux operating system is preferred. All software mentioned here, except the optional LaRA program, can also be run under Mac and Windows systems. For Windows, a terminal or Linux emulator such as Cygwin (http://www.cygwin.com/) or Ubuntu (https://www.microsoft.com/store/p/ubuntu/9nblggh4msv6) is needed. In any case, familiarity with the use of command-line-driven applications is required.

Software

  1. MAFFT (Katoh and Toh, 2008, https://mafft.cbrc.jp/alignment/software/source.html)
  2. R (R Development Core Team, 2018, http://www.R-project.org/)
  3. R-chie (Lai et al., 2012, https://www.e-rna.org/r-chie/)
  4. (Optional) MXSCARNA (Tabei et al., 2008, https://www.ncrna.org/softwares/mxscarna/)
  5. (Optional) LaRA (Bauer et al., 2007, http://www.mi.fu-berlin.de/w/LiSA/Lara)
  6. (Optional) FOLDALIGN (Sundfeld et al., 2016, http://rth.dk/resources/foldalign)
Notes:
  1. The MAFFT package is provided in different forms. Be sure to download the bundle that provides support for RNA structural alignment, such as the package with extensions for Unix/Linux, the Standard package for Mac, and the Ubuntu or Cygwin version for Windows.
  2. It is not necessary to install MXSCARNA separately as it is included within the MAFFT package.
  3. Install LaRA and/or FOLDALIGN only if you want to use them as alternatives to MXSCARNA. LaRA runs on Linux only. LaRA version 1.3 may need to be used as the later versions 1.31 and 1.32 frequently abort.

Procedure

  1. Prepare a set of related (homologous) RNA sequences to analyze, either from local files or by downloading them from a database. Sequences can be retrieved via keyword searches from general-purpose databases such as NCBI GenBank/RefSeq (https://www.ncbi.nlm.nih.gov/), ENA (https://www.ebi.ac.uk/ena), or Ensembl (http://www.ensembl.org/), or from specialized RNA databases such as SILVA (https://www.arb-silva.de/), ncRNA databases (https://ncrnadatabases.org/), RNAcentral (https://rnacentral.org/), the Comparative RNA Website (http://www.rna.icmb.utexas.edu/), or Rfam (https://rfam.org/). For a non-exhaustive listing of RNA sequence databases, see the Nucleic Acids Research website (http://www.oxfordjournals.org/nar/database/cat/2). Some of the RNA databases (e.g., SILVA, Rfam, Comparative RNA Website) provide sequences that are already aligned as structure-aware multiple alignments. Homologs of RNA sequences of interest can also be identified by sequence similarity search, e.g., using the well-known BLASTN tool; all of the general databases and some of the RNA databases provide a BLAST service. Related RNA sequences can also be searched for at the structural level with covariance models, using INFERNAL (Nawrocki and Eddy, 2013) or CMfinder (Yao et al., 2006).
      All collected sequences must be put in a single file in the commonly used FASTA format (https://en.wikipedia.org/wiki/FASTA_format; Figure 1).
    Note: The composition of the dataset may influence the analysis of covariation, depending on the amount of similarity or dissimilarity between the sequences and their phylogenetic relationships.


    Figure 1. Example of unaligned RNA sequences in FASTA format (visualized in Emacs)
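Before running the alignment, it can be worth checking that the collected file really is a single well-formed FASTA file. The following is a minimal Python sketch of such a check, not part of the original protocol. The function names (`parse_fasta`, `check_records`) and the accepted alphabet (IUPAC nucleotide codes plus the gap character) are assumptions made for illustration.

```python
from collections import Counter

# Assumed alphabet: IUPAC nucleotide codes (RNA or DNA spelling) plus the gap character.
RNA_ALPHABET = set("ACGUTNRYKMSWBDHV-")


def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def check_records(records):
    """Return a list of human-readable problems found in the records."""
    problems = []
    # The name is the first whitespace-delimited word of the header line.
    names = Counter(h.split()[0] if h else "" for h, _ in records)
    for name, count in names.items():
        if not name:
            problems.append("record with an empty name")
        elif count > 1:
            problems.append(f"duplicate name: {name}")
    for header, seq in records:
        if not seq:
            problems.append(f"empty sequence: {header}")
        bad = sorted(set(seq.upper()) - RNA_ALPHABET)
        if bad:
            problems.append(f"unexpected characters in {header}: {''.join(bad)}")
    return problems


# Made-up example sequences, for illustration only.
example = """>seq1 example
GGGAAACUUCGGUUUCCC
>seq2
GGGAGACUUCGGUCUCCC
"""
records = parse_fasta(example)
print(len(records), check_records(records))
```

An empty list from `check_records` means that no problem was detected by these particular checks; it does not guarantee that the sequences are homologous or suitable for the analysis.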

  2. Compute a structural multiple alignment by running MAFFT in the ‘X-INS-i’ mode:
    mafft-xinsi --scarnapair --nuc --reorder --maxiterate max_number_of_iterations --thread number_of_CPUs_to_use sequence_file.fasta 1> mafft_alignment.fasta 2> mafft_alignment_details.log

    The alignment generated will be in FASTA format (Figure 2).
    Notes:
    1. The option “--scarnapair” instructs MAFFT to use MXSCARNA to perform the pairwise structural alignments; this is the default. To use the LaRA aligner instead, pass the options “--larapair --laraparams parameter_file”, where “parameter_file” is a file with LaRA configuration parameters. A template file, “lara.params”, is provided with the LaRA software. To use the FOLDALIGN aligner, pass “--foldalignlocalpair” or “--foldalignglobalpair” to perform local or global pairwise alignment, respectively. In benchmarking comparisons (Katoh and Toh, 2008), the accuracy of MXSCARNA was generally higher than that of LaRA, except when the identity among the sequences was low (<40%), in which case LaRA may be preferred. FOLDALIGN is another highly accurate structural alignment program that can align sequences with low similarity (Havgaard et al., 2005; Sundfeld et al., 2016).
    2. To obtain high-accuracy alignments, use a large number of iterative refinements, e.g., set “max_number_of_iterations” to 1,000 for the “--maxiterate” option.
    3. For the “--thread” option, increase “number_of_CPUs_to_use” to speed up alignment computation. Runtime increases with the number and length of the sequences. For example, for a set of 100 sequences of length 50-200 nt calculations can take 3-10 min on a single CPU, and under a minute when using at least 8 CPUs.
    4. In some environments, MAFFT may abort with the following error: “mafft-xinsi: line 2369: /dev/stderr: Not a directory”. The error can be solved by replacing “/dev/stderr” by “/dev/null” in the mafft-xinsi bash script.


    Figure 2. Structural alignment produced by MAFFT visualized in 4SALE
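When the alignment has to be run repeatedly, for instance on several datasets, the command of step 2 can be wrapped in a small script. The sketch below is one possible way to do this in Python and is not part of the original protocol. It uses only the MAFFT options quoted above; the function names and default values are assumptions, and `run_mafft` requires that `mafft-xinsi` is installed and on the PATH.

```python
import subprocess


def build_mafft_command(fasta_path, max_iterations=1000, threads=1):
    """Assemble the mafft-xinsi argument list shown in step 2."""
    return [
        "mafft-xinsi",
        "--scarnapair",          # pairwise structural alignments with MXSCARNA
        "--nuc",
        "--reorder",
        "--maxiterate", str(max_iterations),
        "--thread", str(threads),
        fasta_path,
    ]


def run_mafft(fasta_path, alignment_path, log_path, **options):
    """Run MAFFT, writing the alignment (stdout) and the log (stderr) to files."""
    command = build_mafft_command(fasta_path, **options)
    with open(alignment_path, "w") as out, open(log_path, "w") as log:
        # check=True raises CalledProcessError if MAFFT exits with an error.
        subprocess.run(command, stdout=out, stderr=log, check=True)


# Show the command line that would be executed; nothing is run here.
print(" ".join(build_mafft_command("sequence_file.fasta", threads=8)))
```

A call such as `run_mafft("sequence_file.fasta", "mafft_alignment.fasta", "mafft_alignment_details.log", threads=8)` would then reproduce the redirections of the shell command.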

  3. (Optional) Predict a reference secondary structure:
    In order to reveal covariation, a reference secondary structure is needed. It can be the structure of one of the RNA sequences included in the analysis, a consensus structure inferred from the alignment, or an external structural model. In the absence of a known or experimentally determined model, the structure needs to be predicted. The 2D structure of a single RNA sequence can be predicted with widely used tools such as MFOLD (Zuker, 2003; http://unafold.rna.albany.edu/?q=mfold), RNAfold from the ViennaRNA package (Lorenz et al., 2011; https://www.tbi.univie.ac.at/RNA/), Fold or MaxExpect from the RNAstructure package (Reuter and Mathews, 2010; https://rna.urmc.rochester.edu/RNAstructure.html), or RNAshapes from the RNA shapes studio (Janssen and Giegerich, 2015; https://bibiserv.cebitec.uni-bielefeld.de/rnashapesstudio). All these tools can be run online on dedicated web servers or installed locally as command-line programs. For example, standard commands to run MFOLD or RNAfold with default parameters on an RNA sequence in FASTA format would be:

    mfold SEQ=sequence_file.fasta (structures will be output in files named “sequence_file*.ct”)
    RNAfold < sequence_file.fasta > structure_file.b

    The above-mentioned packages have numerous options to tune the folding computation (e.g., by changing the algorithm, temperature, ionic conditions), and most of them provide the possibility to impose constraints on the structure.
      There are also several programs for predicting consensus structures from alignments, such as RNAalifold (Bernhart et al., 2008) from the ViennaRNA package, the graphical tool ConStruct (Wilm et al., 2008b; http://www.biophys.uni-duesseldorf.de/construct3/), and RNAalishapes from the RNA shapes studio (Voß, 2006; https://bibiserv.cebitec.uni-bielefeld.de/rnaalishapes). Notably, RNAalifold does not weight the sequences and is highly sensitive to the particular sample of sequences under study: its prediction can be affected by the varying amount of similarity between the sequences as well as by sequences containing insertions. ConStruct incorporates custom sequence weighting for optimal consensus prediction, whereas RNAalishapes is based on the concept of abstract structural shapes and also includes gap-aware energy evaluation, which makes it outperform RNAalifold when sequences contain insertions and/or the alignment is of low quality (Voß, 2006). In our experience, RNAalishapes performed quite well. RNAalishapes relies on older Linux libraries and may not install on current architectures, so the online version may have to be used.
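To make the dot-bracket (Vienna) notation concrete, the following Python sketch converts a structure string into a list of base-paired positions and reads the sequence and structure from RNAfold-style output. It is an illustration under stated assumptions, not part of the original protocol: the function names are hypothetical, pseudoknot brackets are not supported, and the input is assumed to contain an optional header line, a sequence line, and a structure line that may be followed by an energy value.

```python
def dotbracket_to_pairs(structure):
    """Return sorted (i, j) base pairs (1-based, i < j) from a dot-bracket string."""
    stack, pairs = [], []
    for position, symbol in enumerate(structure, start=1):
        if symbol == "(":
            stack.append(position)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {position}")
            pairs.append((stack.pop(), position))
        elif symbol != ".":
            raise ValueError(f"unsupported symbol {symbol!r} at position {position}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


def read_vienna(text):
    """Return (sequence, structure) from header/sequence/structure text."""
    lines = [line.strip() for line in text.splitlines()
             if line.strip() and not line.startswith(">")]
    if len(lines) < 2:
        raise ValueError("expected a sequence line and a structure line")
    sequence = lines[0]
    structure = lines[1].split()[0]  # drop anything after the structure, e.g. an energy
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure differ in length")
    return sequence, structure


# Made-up example; the energy value is a placeholder, not a computed result.
example = """>seq1
GGGAAAUCC
(((...))) (-1.00)
"""
sequence, structure = read_vienna(example)
print(dotbracket_to_pairs(structure))
```

A check of this kind catches unbalanced brackets and length mismatches before the structure is used as a reference.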

  4. Analyze the covariation in the alignment and generate an image of the covariation-annotated alignment using R-chie:
    Covariation is revealed by mapping the reference secondary structure onto the multiple alignment. The structure can be provided to R-chie in one of the common formats used by structure prediction and analysis software, including dot-bracket (or Vienna; “.b”), connect-table (“.ct”), and bpseq (from the Comparative RNA Website; Cannone et al., 2002; http://www.rna.icmb.utexas.edu/). The alignment must be in FASTA format.

    rchie.R --msafile=mafft_alignment.fasta --pdf --output=alignment_covariation_figure.pdf --format1=vienna --rule1=7 --group1=4 --legend1 --legend3 --msaspecies --msagrid --msatext reference_structure.b &> rchie.log

    This will create a figure of the alignment in PDF format, in which the paired regions in the reference secondary structure are represented by arcs colored according to the covariation score and nucleotides in the sequences are colored according to their base-pairing status (Figure 3).
    Notes:
    1. The above command runs with a reference structure in dot-bracket (“.b”) format, which is indicated by the option “--format1=vienna”. For a structure in connect (“.ct”) or bpseq format “--format1=connect” or “--format1=bpseq” would be used, respectively.
    2. Option “--rule1=7” groups the base pairs according to their covariation scores, and option “--group1=4” sets the number of groups to 4. Various other criteria can be chosen for grouping; run “rchie.R --help” for details.
    3. R-chie is highly customizable. Options “--legend1”, “--legend3”, “--msaspecies”, “--msagrid”, and “--msatext” can be invoked or omitted at will in order to turn on or off legends, sequence names, and other graphical settings. Colors for every element in the image can be specified by additional options such as “--colour1”, “--palette1”, or “--msacol”; run “rchie.R --help” for details.


    Figure 3. Structure- and covariation-annotated alignment drawn by R-chie
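To illustrate what is being examined at each pair of alignment columns, the Python sketch below labels a candidate base pair as conserved, covarying (both positions change while pairing is maintained), or one-sided (only one position changes, as in G-C versus G-U). This is a deliberately simplified classification written for this illustration; it is not the covariation score computed by R-chie, and the function name, the labels, and the example alignment are assumptions.

```python
# Watson-Crick and G-U wobble pairs are treated as valid base pairs.
CANONICAL = {"AU", "UA", "GC", "CG", "GU", "UG"}


def pair_summary(alignment, i, j):
    """Summarize alignment columns i and j (1-based) as a candidate base pair."""
    observed = [(seq[i - 1] + seq[j - 1]).upper().replace("T", "U")
                for seq in alignment]
    canonical = [pair for pair in observed if pair in CANONICAL]
    kinds = set(canonical)
    if not canonical:
        status = "unpaired"
    elif len(kinds) == 1:
        status = "conserved"
    else:
        # Covarying: at least two valid pair types differ at both positions.
        both_change = any(a[0] != b[0] and a[1] != b[1]
                          for a in kinds for b in kinds)
        status = "covarying" if both_change else "one-sided"
    return {
        "pairs": sorted(kinds),
        "fraction_canonical": len(canonical) / len(observed),
        "status": status,
    }


# Made-up aligned sequences sharing the reference structure (((...))).
alignment = ["GGGAAAUCC", "GAGAAAUUC", "GGGAAACCC"]
for i, j in [(1, 9), (2, 8), (3, 7)]:
    print(i, j, pair_summary(alignment, i, j))
```

In this toy example the outer pair is conserved, the middle pair covaries (G-C and A-U), and the inner pair shows a one-sided change (G-U and G-C). Statistical assessment of covariation, for example with R-scape or CoMap as mentioned in the Background, requires the dedicated tools.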

  5. (Optional) If necessary, manually edit the alignment produced by MAFFT to correct errors or adjust the alignment, using a structure-aware editor such as RALEE (Griffiths-Jones, 2005; http://sgjlab.org/ralee/), 4SALE (Seibel et al., 2006; http://4sale.bioapps.biozentrum.uni-wuerzburg.de/), S2S (Jossinet and Westhof, 2005; http://bioinformatics.org/assemble/), ConStruct (Wilm et al., 2008b; http://www.biophys.uni-duesseldorf.de/construct3/), or SARSE (Andersen et al., 2007). Then, regenerate the covariation-annotated figure using R-chie.

Data analysis

The protocol described here has been used to examine the covariation within metastable regions in a structural alignment of 107 mRNAs of type I toxin-antitoxin systems from bacteria of the genus Helicobacter (Masachis et al., 2019, Figure 8, figure supplement 1 and 2 therein). All files including the unaligned sequences (in FASTA format), the unedited alignment produced by MAFFT (run with the “--scarnapair” option; in FASTA format), the manually-edited alignment (in FASTA format), the reference secondary structure (in dot-bracket/Vienna format), and the covariation-annotated alignment drawn by R-chie (in PDF format) for both datasets (metastable 1 and 2) analyzed in Masachis et al. (2019) are included as supplementary material, to allow the user to learn and reproduce the analyses.

Notes

In this protocol, we employ MAFFT to compute structural alignments, but the procedure can be performed similarly with any other tool that produces structure-aware sequence alignments.

Acknowledgments

This protocol was derived from our original study of metastable structures in mRNAs of type I toxin-antitoxin systems in the bacterium Helicobacter pylori (Masachis et al., 2019).

Competing interests

The authors declare that there are no conflicts of interest or competing interests.

References

  1. Andersen, E. S., Lind-Thomsen, A., Knudsen, B., Kristensen, S. E., Havgaard, J. H., Torarinsson, E., Larsen, N., Zwieb, C., Sestoft, P., Kjems, J. and Gorodkin, J. (2007). Semiautomated improvement of RNA alignments. RNA 13(11): 1850-1859.
  2. Bauer, M., Klau, G. W. and Reinert, K. (2007). Accurate multiple sequence-structure alignment of RNA sequences using combinatorial optimization. BMC Bioinformatics 8: 271.
  3. Bernhart, S. H., Hofacker, I. L., Will, S., Gruber, A. R. and Stadler, P. F. (2008). RNAalifold: Improved consensus structure prediction for RNA alignments. BMC Bioinformatics 9: 474.
  4. Cannone, J. J., Subramanian, S., Schnare, M. N., Collett, J. R., D’Souza, L. M., Du, Y., Feng, B., Lin, N., Madabusi, L. V., Müller, K. M., Pande, N., Shang, Z., Yu, N. and Gutell, R. R. (2002). The Comparative RNA Web (CRW) Site: An online database of comparative sequence and structure information for ribosomal, intron, and other RNAs. BMC Bioinformatics 3: 2.
  5. Do, C. B., Mahabhashyam, M. S. P., Brudno, M. and Batzoglou, S. (2005). ProbCons: Probabilistic consistency-based multiple sequence alignment. Genome Res 15(2): 330-340.
  6. Dutheil, J. Y. (2012). Detecting coevolving positions in a molecule: Why and how to account for phylogeny. Brief Bioinform 13(2): 228-243.
  7. Griffiths-Jones, S. (2005). RALEE - RNA alignment editor in Emacs. Bioinformatics 21(2): 257-259.
  8. Havgaard, J. H., Lyngsø, R. B., Stormo, G. D. and Gorodkin, J. (2005). Pairwise local structural alignment of RNA sequences with sequence similarity less than 40%. Bioinformatics 21(9): 1815-1824.
  9. Janssen, S. and Giegerich, R. (2015). The RNA shapes studio. Bioinformatics 31(3): 423-425.
  10. Jossinet, F. and Westhof, E. (2005). Sequence to Structure (S2S): Display, manipulate and interconnect RNA data from sequence to structure. Bioinformatics 21(15): 3320-3321.
  11. Katoh, K. and Standley, D. M. (2013). MAFFT multiple sequence alignment software version 7: Improvements in performance and usability. Mol Biol Evol 30(4): 772-780.
  12. Katoh, K. and Toh, H. (2008). Improved accuracy of multiple ncRNA alignment by incorporating structural information into a MAFFT-based framework. BMC Bioinformatics 9: 212.
  13. Lai, D., Proctor, J. R., Zhu, J. Y. A. and Meyer, I. M. (2012). R-CHIE: A web server and R package for visualizing RNA secondary structures. Nucleic Acids Res 40(12): e95.
  14. Lorenz, R., Bernhart, S. H., Höner zu Siederdissen, C., Tafer, H., Flamm, C., Stadler, P. F. and Hofacker, I. L. (2011). ViennaRNA Package 2.0. Algorithms Mol Biol 6: 26.
  15. Masachis, S., Tourasse, N. J., Lays, C., Faucher, M., Chabas, S., Iost, I. and Darfeuille, F. (2019). A genetic selection reveals functional metastable structures embedded in a toxin-encoding mRNA. Elife 8: e47549.
  16. Michel, F. and Westhof, E. (1990). Modelling of the three-dimensional architecture of group I catalytic introns based on comparative sequence analysis. J Mol Biol 216(3): 585-610.
  17. Nawrocki, E. P. and Eddy, S. R. (2013). Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics 29(22): 2933-2935.
  18. Nowick, K., Walter Costa, M. B., Höner zu Siederdissen, C. and Stadler, P. F. (2019). Selection pressures on RNA sequences and structures. Evol Bioinforma 15: 117693431987191.
  19. R Development Core Team. (2018). R: A language and environment for statistical computing. Vienna, Austria.
  20. Reuter, J. S. and Mathews, D. H. (2010). RNAstructure: Software for RNA secondary structure prediction and analysis. BMC Bioinformatics 11: 129.
  21. Rivas, E., Clements, J. and Eddy, S. R. (2017). A statistical test for conserved RNA structure shows lack of evidence for structure in lncRNAs. Nat Methods 14(1): 45-48.
  22. Seibel, P. N., Müller, T., Dandekar, T., Schultz, J. and Wolf, M. (2006). 4SALE - A tool for synchronous RNA sequence and secondary structure alignment and editing. BMC Bioinformatics 7: 498.
  23. Sundfeld, D., Havgaard, J. H., De Melo, A. C. M. A. and Gorodkin, J. (2016). Foldalign 2.5: Multithreaded implementation for pairwise structural RNA alignment. Bioinformatics 32(8): 1238-1240.
  24. Tabei, Y., Kiryu, H., Kin, T. and Asai, K. (2008). A fast structural multiple alignment method for long RNA sequences. BMC Bioinformatics 9: 33.
  25. Tan, Z., Fu, Y., Sharma, G. and Mathews, D. H. (2017). TurboFold II: RNA structural alignment and secondary structure prediction informed by multiple homologs. Nucleic Acids Res 45(20): 11570-11581.
  26. Voß, B. (2006). Structural analysis of aligned RNAs. Nucleic Acids Res 34(19): 5471-5481.
  27. Will, S., Reiche, K., Hofacker, I. L., Stadler, P. F. and Backofen, R. (2007). Inferring noncoding RNA families and classes by means of genome-scale structure-based clustering. PLoS Comput Biol 3(4): 680-691.
  28. Wilm, A., Higgins, D. G. and Notredame, C. (2008a). R-Coffee: A method for multiple alignment of non-coding RNA. Nucleic Acids Res 36(9): e52.
  29. Wilm, A., Linnenbrink, K. and Steger, G. (2008b). ConStruct: Improved construction of RNA consensus structures. BMC Bioinformatics 9: 219.
  30. Yao, Z., Weinberg, Z. and Ruzzo, W. L. (2006). CMfinder - A covariance model based RNA motif finding algorithm. Bioinformatics 22(4): 445-452.
  31. Zuker, M. (2003). Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Res 31(13): 3406-3415.

Copyright Tourasse and Darfeuille. This article is distributed under the terms of the Creative Commons Attribution License (CC BY 4.0).
Citation: Readers should cite both the Bio-protocol article and the original research article where this protocol was used:
  1. Tourasse, N. J. and Darfeuille, F. (2020). Structural Alignment and Covariation Analysis of RNA Sequences. Bio-protocol 10(3): e3511. DOI: 10.21769/BioProtoc.3511.
  2. Masachis, S., Tourasse, N. J., Lays, C., Faucher, M., Chabas, S., Iost, I. and Darfeuille, F. (2019). A genetic selection reveals functional metastable structures embedded in a toxin-encoding mRNA. Elife 8: e47549.