Genome-wide Estimation of Evolutionary Distance and Phylogenetic Analysis of Homologous Genes

Abstract

Homologous genes, including paralogs and orthologs, are genes that share sequence homology within or between species. Homologous genes descend from a common ancestral gene through speciation, gene duplication or horizontal gene transfer. Estimating the sequence divergence of homologous genes helps us to estimate divergence time, which makes it possible to understand the evolutionary patterns of speciation, gene duplication and gene transfer events. This protocol provides a detailed bioinformatics pipeline for identifying homologous genes and comparing their sequence divergence and phylogenetic relationships. It focuses on homologous genes that show syntenic relationships, using soybean (Glycine max) and common bean (Phaseolus vulgaris) as example species.

Keywords: Homologous genes, Whole genome duplication, Sequence alignment, Evolutionary distance, Phylogenetic analysis

Background

Gene duplication, including whole genome duplication or polyploidy, segmental duplication, and tandem duplication, is an important process that increases gene copy number and thus enhances genetic diversity in many organisms. For this reason, gene duplication is thought to be a major force in evolution (Ohno, 1970; Otto and Whitton, 2000; Blanc and Wolfe, 2004; Jiao et al., 2011). After duplication, duplicated genes are subject to a variety of changes, such as accumulation of point mutations, insertions and deletions, gene conversion and transposon insertions (Ilic et al., 2003; Gu et al., 2005; Sémon and Wolfe, 2007). In theory, the two or more copies of a duplicated gene experience different levels of selective constraint, which leads to their functional divergence. This is reflected in the sequence divergence of the homologous genes, measured as non-synonymous substitution (Ka) and synonymous substitution (Ks); a synonymous substitution leaves the encoded amino acid sequence unchanged. Because synonymous substitutions are considered largely neutral with respect to selection, Ks can be used to estimate rough divergence time. The Ka/Ks ratio can be used to estimate the selection pressure on genes. A Ka/Ks ratio equal to one indicates a lack of selection, as is observed in pseudogenes. Ka/Ks ratios higher and lower than one imply positive and purifying selection, respectively. Ka/Ks is < 1.0 for the vast majority of genes because purifying selection maintains function (Makalowski and Boguski, 1998; Nekrutenko et al., 2002). When comparing duplicate genes, differences in Ka/Ks suggest different levels or kinds of selection.
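The interpretation of Ka/Ks described above can be sketched in a few lines of Python. This is only an illustration: the function names, the `tolerance` band around one, and the substitution rate passed to `divergence_time_mya` are assumptions introduced here, not values from this protocol. The divergence time uses the common relation T = Ks / (2r), where r is an assumed synonymous substitution rate per site per year.

```python
def classify_selection(ka: float, ks: float, tolerance: float = 0.05) -> str:
    """Label a gene pair by its Ka/Ks ratio (omega).

    `tolerance` is a hypothetical band around 1.0 treated as "neutral";
    choose it to suit your own data.
    """
    if ks <= 0:
        return "undefined"  # omega cannot be computed when Ks is zero
    omega = ka / ks
    if abs(omega - 1.0) <= tolerance:
        return "neutral"
    return "positive" if omega > 1.0 else "purifying"


def divergence_time_mya(ks: float, rate_per_site_per_year: float) -> float:
    """Rough divergence time in millions of years, T = Ks / (2r).

    The rate r must be supplied by the user; no value is implied here.
    """
    return ks / (2.0 * rate_per_site_per_year) / 1e6
```

For example, `classify_selection(0.02, 0.2)` labels a pair with ω = 0.1 as being under purifying selection.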

To determine evolutionary distance, we first need to identify the homologous genes within or between species. Here, we mainly focus on homologous genes that show syntenic relationships between species. Syntenic genes are genes that retain their ancestral position in a given region of a chromosome. We refer to these syntenic homologous genes as “syntelogs” (Zhao et al., 2017). The advantage of focusing on syntelogs is that, if they are the result of polyploidy, all syntelogs arose at the same time, so groups of syntenic gene pairs can be compared with high confidence. Here, we describe a detailed pipeline for identifying syntelogs and comparing their evolutionary distances, using soybean (Glycine max) and common bean (Phaseolus vulgaris) as example species (Figure 1). It has been proposed that soybean experienced a recent whole genome duplication event roughly 5 to 13 million years ago (MYA; Schmutz et al., 2010), after its split from its close relative common bean roughly 19 MYA (Lavin et al., 2005; McClean et al., 2010). In this protocol, we estimate the evolutionary divergence of the duplicated gene pairs in soybean by comparing them with their orthologous genes in the common bean genome.

Equipment

  1. Linux/Unix cluster
    In this study, we use the Purdue Halstead supercomputer, which contains 508 nodes in total. Each node has two 10-core Intel Xeon-E5 processors (20 cores per node) and 128 GB of memory. Please refer to the website for more information: https://www.rcac.purdue.edu/compute/halstead
  2. Personal computer for post data processing (Lenovo, T430s, Intel Core i5-3320M CPU, 4 GB RAM)

Software

  1. Blastall or Blast+ (Altschul et al., 1997)
    https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastDocs&DOC_TYPE=Download
    This program is developed and distributed by NCBI for running BLAST searches.
  2. MCScanX (Wang et al., 2012), http://chibba.pgml.uga.edu/mcscan2/
    This toolkit is for detection and evolutionary analysis of gene synteny and collinearity. It can also identify tandemly duplicated genes.
  3. Multiple Sequence Comparison by Log-Expectation (MUSCLE) (Edgar, 2004), https://www.drive5.com/muscle/
  4. ClustalW (Thompson et al., 1994), http://www.clustal.org/clustal2/
    MUSCLE and ClustalW are both programs for multiple sequence alignment of nucleotide and protein sequences.
  5. Phylogenetic Analysis by Maximum Likelihood (PAML) (Yang, 2007), http://abacus.gene.ucl.ac.uk/software/paml.html
    PAML is a package of programs for phylogenetic analyses of DNA or protein sequences using maximum likelihood.
  6. Molecular Evolutionary Genetics Analysis (MEGA X) (Kumar et al., 2018), https://www.megasoftware.net/
    This software contains many sophisticated methods and tools for phylogenetic analysis, including constructing phylogenetic trees as well as evolutionary distance estimation.
  7. Perl, https://www.perl.org/, or Python, https://www.python.org/, programming languages
    These languages make it possible to post-process the primary data generated from some of the software used above.
  8. SAS software, https://www.sas.com/en_us/home.html
    SAS is a software suite for statistical analysis.

Procedure

  1. Identification of Syntenic Genes among Closely Related Species
    1. Retrieve gene sequence
      For plant genomes with an available genome annotation, download the coding sequences (CDS) and protein sequences of all protein-encoding genes from the corresponding databases. Some species have dedicated websites, such as TAIR for Arabidopsis (https://www.arabidopsis.org/), SoyBase for soybean (https://soybase.org/), and MaizeGDB for maize (https://www.maizegdb.org/). Sequences of many other species are deposited at NCBI (https://www.ncbi.nlm.nih.gov/), CoGe (https://genomevolution.org/coge/), Phytozome (https://phytozome.jgi.doe.gov/pz/portal.html) or other relevant databases. We downloaded the genome sequences and gene annotations of soybean (v1.1, Schmutz et al., 2010) and common bean (Schmutz et al., 2014) from Phytozome.
    2. Removal of transposon-related and hypothetical proteins
      To detect high-confidence syntelogs among closely related species, transposon-related and hypothetical proteins were first removed. Taking soybean (v1.1) as an example, a total of 53,927 annotated genes from the 20 soybean chromosomes were queried by BLASTN (Altschul et al., 1997) against the soybean transposon database SoyTEdb (Du et al., 2010, https://www.soybase.org/soytedb/). Any gene for which more than 80% of its total length matched transposon-related sequences with greater than 80% sequence similarity was removed. Genes annotated as hypothetical proteins were excluded as well (Figure 1). A typical setting for a local BLAST search is shown here.

      formatdb -i SoyBase_TE_Fasta.txt -p F -o T
      blastall -p blastn -i Soybean_gene_cds.fa -d SoyBase_TE_Fasta.txt -m 8 -a 8 -o Soybean_gene_cds_blast_TEs

      Note: Hypothetical proteins were identified from the gene annotation file and were not included in the analysis. Here, formatdb formats the nucleotide database “SoyBase_TE_Fasta.txt” so that it can be searched with blastall. blastall then compares the gene sequences in the file “Soybean_gene_cds.fa” with the database “SoyBase_TE_Fasta.txt”. A detailed description of each parameter can be found in the software manual.
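The 80% length and 80% similarity filter can be applied to the tabular BLAST output (`-m 8`) with a short script. The following is a minimal sketch, not the authors' script. It assumes the standard 12-column tabular layout (query ID in column 1, percent identity in column 3, query start and end in columns 7 and 8), and it assumes that coverage is measured by merging all qualifying hits on a gene, which the text above does not specify. All function names are hypothetical.

```python
from collections import defaultdict


def read_fasta_lengths(path):
    """Return {sequence_id: length} for a FASTA file."""
    lengths = {}
    name = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                name = line[1:].split()[0]  # ID is the first word of the header
                lengths[name] = 0
            elif name is not None:
                lengths[name] += len(line)
    return lengths


def merged_length(intervals):
    """Total length covered by possibly overlapping 1-based closed intervals."""
    covered, current_end = 0, 0
    for start, end in sorted(intervals):
        start = max(start, current_end + 1)  # skip the part already counted
        if end >= start:
            covered += end - start + 1
            current_end = end
    return covered


def te_related_genes(blast_path, gene_lengths, min_identity=80.0, min_coverage=0.8):
    """Return genes whose TE hits (identity > min_identity) cover
    more than min_coverage of the gene length."""
    hits = defaultdict(list)
    with open(blast_path) as handle:
        for line in handle:
            if not line.strip() or line.startswith("#"):
                continue
            fields = line.rstrip("\n").split("\t")
            query, identity = fields[0], float(fields[2])
            q_start, q_end = sorted((int(fields[6]), int(fields[7])))
            if identity > min_identity:
                hits[query].append((q_start, q_end))
    return {
        gene
        for gene, intervals in hits.items()
        if gene in gene_lengths
        and merged_length(intervals) > min_coverage * gene_lengths[gene]
    }
```

With the file names used above, `te_related_genes("Soybean_gene_cds_blast_TEs", read_fasta_lengths("Soybean_gene_cds.fa"))` would return the set of gene IDs to exclude.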


      Figure 1. Bioinformatics pipeline for identification and comparison of the syntenic homologous genes in soybean and common bean. The solid and the dotted circles indicate the presence and absence of the genes in the corresponding genomes, respectively. GmA and GmB represent either of the duplicated genes in soybean. Gm, Glycine max, soybean; Pv, Phaseolus vulgaris, common bean. This figure was modified from Zhao et al., 2017.

    3. Detection of candidate syntelogs
      The remaining protein-encoding genes in soybean and common bean were used in an all-against-all BLASTP search with default parameters and an E-value cutoff of 1e-10 (Altschul et al., 1997). The BLAST hits for all gene pairs were then loaded into MCScanX (Wang et al., 2012) to scan for syntenic homologous gene pairs.

      blastall -p blastp -i Soybean_gene_pep.fa -d Commonbean_gene_pep.fa -m 8 -a 8 -F F -o Soybean_blast_Commonbean
      MCScanX Soybean_CommonBean

      Note: The E-value cutoff is chosen according to the approximate divergence time. You can leave it at the default setting if the divergence time is unknown. MCScanX also requires a gff file listing the positions of all genes. Syntenic gene pairs between soybean and common bean were identified with MCScanX’s default settings (Match Score = 50, Match Size = 5, Gap Penalty = -1, Overlap Window = 5, Max Gaps = 25) and an E-value cutoff of 1e-10.
    4. Post-process the candidate syntelogous genes
      Because soybean underwent a whole genome duplication after its split from common bean, each gene in the common bean genome corresponds to one or two copies in the soybean genome, depending on whether both duplicates were retained in soybean. MCScanX reports all homologous gene pairs in addition to syntelogous gene pairs. To remove false-positive syntelogs, the duplicated block information from the reference genome (Schmutz et al., 2014) was incorporated, so that only homologous gene pairs that were located in the duplicated regions and showed a syntenic relationship between soybean and common bean were kept.
      An example of the final list of candidate syntelogous genes is shown in Table 1. Genes involved in tandem duplication have an ambiguous retention status and were therefore set aside for separate analysis.
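The gene-position file for MCScanX can be derived from a standard GFF3 annotation. The sketch below assumes, based on our reading of the MCScanX documentation, that MCScanX expects a four-column, tab-delimited file (chromosome label, gene ID, start, end) and that it reads the BLAST hits and this file from `<prefix>.blast` and `<prefix>.gff` for the prefix given on the command line; please check this against the version you install. The `chrom_map` argument and the function name are hypothetical, and gene IDs must match those in the BLAST output.

```python
def gff3_to_mcscanx(gff3_lines, chrom_map, feature_type="gene"):
    """Convert GFF3 lines to MCScanX-style rows: label, gene ID, start, end.

    chrom_map maps annotation sequence names to short labels,
    e.g. {"Gm01": "gm1"} (hypothetical names); features on sequences
    that are not in the map (such as scaffolds) are skipped.
    """
    rows = []
    for line in gff3_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != feature_type or cols[0] not in chrom_map:
            continue
        # GFF3 attributes are semicolon-separated key=value pairs
        attributes = dict(
            item.split("=", 1) for item in cols[8].split(";") if "=" in item
        )
        gene_id = attributes.get("ID")
        if gene_id:
            rows.append((chrom_map[cols[0]], gene_id, int(cols[3]), int(cols[4])))
    return ["\t".join(map(str, row)) for row in rows]
```

The rows for both species can be written into one file, for example `Soybean_CommonBean.gff`, to match the prefix used in the MCScanX command above.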

      Table 1. Example of syntenic homologous genes identified in the common bean and soybean genomes
      ||, gene not detected in the syntenic region.
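Grouping the soybean syntelogs of each common bean gene can be sketched as follows. This is an illustration under stated assumptions: collinear gene pairs are assumed to sit in the second and third tab-separated fields of non-comment lines in the MCScanX `.collinearity` output, and soybean genes are assumed to be recognizable by a `Glyma` ID prefix. Both should be checked against your own files. The sketch does not apply the duplicated block filter described in step 4.

```python
from collections import defaultdict


def parse_collinearity(lines):
    """Yield (gene_a, gene_b) pairs from MCScanX .collinearity lines."""
    for line in lines:
        if line.startswith("#"):
            continue  # block headers and parameter lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3:
            yield fields[1].strip(), fields[2].strip()


def group_syntelogs(pairs, is_soybean=lambda gene: gene.startswith("Glyma")):
    """Map each common bean gene to its sorted list of soybean syntelogs."""
    groups = defaultdict(set)
    for gene_a, gene_b in pairs:
        if is_soybean(gene_a) and not is_soybean(gene_b):
            groups[gene_b].add(gene_a)
        elif is_soybean(gene_b) and not is_soybean(gene_a):
            groups[gene_a].add(gene_b)
    return {bean: sorted(soy) for bean, soy in groups.items()}


def retention_status(groups):
    """Label each common bean gene by the number of soybean copies retained."""
    labels = {1: "singleton", 2: "duplicate"}
    return {bean: labels.get(len(soy), "ambiguous") for bean, soy in groups.items()}
```

Genes labeled `ambiguous` have more than two soybean partners and would need manual review, for example as candidates for tandem duplication.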


  2. Estimation of Evolutionary Distance for Syntelogous Genes
    1. Sequence alignment
      Although many genes have several alternative transcripts, only the primary transcript of each gene, as defined in the gene annotation, was used to estimate the sequence divergence between syntenic genes of soybean and common bean. The nucleotide sequences of the syntenic genes were aligned with MUSCLE (Edgar, 2004) or ClustalW (Thompson et al., 1994) using default parameters. The alignment can be viewed in Jalview (Figure 2A).

      muscle -in input -out output
      clustalw input    # alternative to MUSCLE

      Note: The primary transcript of each gene was determined from the gene annotation file. MUSCLE or ClustalW can only align one group of syntelogs at a time. For whole-genome analysis, we recommend that users write a Perl or Python script that automatically passes each pair of sequences to MUSCLE or ClustalW for alignment. At this step, we ran MUSCLE first and then ran ClustalW on the remaining gene pairs whose nucleotide alignment lengths were not integer multiples of three after MUSCLE alignment.
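The batch alignment described in the note can be sketched in Python as follows. This is one possible arrangement, not the authors' script. It assumes that each gene pair is stored in its own FASTA file, that `muscle` accepts the `-in`/`-out` options shown above (MUSCLE v3 syntax), and that `clustalw` accepts `-INFILE=`, `-OUTFILE=` and `-OUTPUT=FASTA`; the ClustalW options are an assumption to verify against your installed version. The function names are hypothetical.

```python
import subprocess
from pathlib import Path


def read_fasta(path):
    """Return {sequence_id: sequence} for a FASTA file."""
    records, name = {}, None
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def is_codon_compatible(aligned_fasta):
    """True if all aligned sequences share one length that is a multiple of three."""
    lengths = {len(seq) for seq in read_fasta(aligned_fasta).values()}
    if len(lengths) != 1:
        return False
    length = lengths.pop()
    return length > 0 and length % 3 == 0


def align_pair(fasta_path, out_dir):
    """Align one FASTA file with MUSCLE; fall back to ClustalW if the
    MUSCLE alignment length is not a multiple of three."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stem = Path(fasta_path).stem
    muscle_out = out_dir / f"{stem}.muscle.fa"
    subprocess.run(
        ["muscle", "-in", str(fasta_path), "-out", str(muscle_out)], check=True
    )
    if is_codon_compatible(muscle_out):
        return muscle_out
    clustal_out = out_dir / f"{stem}.clustalw.fa"
    subprocess.run(
        ["clustalw", f"-INFILE={fasta_path}", f"-OUTFILE={clustal_out}",
         "-OUTPUT=FASTA"],
        check=True,
    )
    return clustal_out
```

Calling `align_pair` in a loop over the files of a directory, for example `Path("pairs").glob("*.fa")` with a hypothetical `pairs` directory, processes every gene pair in turn.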
    2. Manual inspection
      The output alignment was manually inspected, and incorrectly aligned nucleotides were corrected (Figure 2B). This step is very important, although it may not be practical when there is a very large amount of data to verify.


      Figure 2. An example of sequence alignments of syntenic homologous genes in soybean and common bean. A. Original alignment generated by MUSCLE (Edgar, 2004). B. Manually modified alignment. Blue boxes indicate modified regions.

    3. Sequence divergence
      All pairwise alignments of the syntenic genes were converted into the input format required by PAML using Perl or Python scripts. Non-synonymous (Ka) and synonymous (Ks) substitutions were then calculated using the yn00 and baseml modules with default parameters, except that model was set to 1 instead of 0 (Yang, 2007). Please refer to the manual for more information about running programs in the PAML package.
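Preparing PAML input can be sketched as below. This is a minimal illustration, assuming the sequential PHYLIP layout that PAML reads (a header line with the number of sequences and the alignment length, then one line per sequence with at least two spaces between name and sequence) and the control-file keys found in the example `yn00.ctl` distributed with PAML. The values shown are assumptions to check against the PAML manual. The `model = 1` setting mentioned above is not written here, because the text does not state which control file it belongs to.

```python
from pathlib import Path


def write_paml_phylip(records, path):
    """Write aligned codon sequences ({name: sequence}) for PAML."""
    lengths = {len(seq) for seq in records.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    length = lengths.pop()
    if length % 3 != 0:
        raise ValueError("codon alignment length must be a multiple of three")
    lines = [f" {len(records)} {length}"]
    for name, seq in records.items():
        # two spaces separate the sequence name from the sequence
        lines.append(f"{name}  {seq}")
    Path(path).write_text("\n".join(lines) + "\n")


def write_yn00_ctl(seqfile, outfile, ctl_path="yn00.ctl"):
    """Write a yn00 control file; the values are assumed defaults."""
    settings = {
        "seqfile": seqfile,
        "outfile": outfile,
        "verbose": 0,
        "icode": 0,       # 0 = universal genetic code
        "weighting": 0,
        "commonf3x4": 0,
    }
    text = "".join(f"{key:>12} = {value}\n" for key, value in settings.items())
    Path(ctl_path).write_text(text)
```

After both files are written, yn00 is typically run from the same directory as `yn00 yn00.ctl`.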

  3. Phylogenetic analysis
    Phylogenetic trees are used to infer the phylogenetic relationships among homologous genes.
    1. Sequence Alignment
      The nucleotide sequences or protein sequences of the syntenic genes were aligned using the MUSCLE program (Edgar, 2004) or ClustalW (Thompson et al., 1994) using default parameters.

      muscle -in input -out output
      clustalw input    # alternative to MUSCLE

    2. Phylogenetic Tree Construction
      The sequence alignments of the homologous genes were imported into MEGA to construct phylogenetic trees by the neighbor-joining method, using the maximum composite likelihood model for nucleotide sequences and the Poisson correction model for protein sequences, with pairwise deletion of gaps (Kumar et al., 2018). Bootstrap values were calculated from 1,000 replicates.
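To make the distance settings concrete, the Poisson correction with pairwise deletion can be written out in a few lines. This sketch is not MEGA's implementation; it only illustrates the standard formula d = -ln(1 - p), where p is the proportion of differing sites among positions at which neither sequence of the pair has a gap. Tree building and bootstrapping are not covered, and the function names are hypothetical.

```python
import math

GAP_CHARS = set("-?")


def p_distance(seq_a, seq_b):
    """Proportion of differing sites, skipping sites gapped in either sequence."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in GAP_CHARS or b in GAP_CHARS:
            continue  # pairwise deletion: drop this site for this pair only
        compared += 1
        differences += a != b
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared


def poisson_distance(seq_a, seq_b):
    """Poisson-corrected distance d = -ln(1 - p) for two aligned proteins."""
    p = p_distance(seq_a, seq_b)
    if p >= 1.0:
        return math.inf  # correction is undefined when every site differs
    return -math.log(1.0 - p)
```

For two aligned peptides that differ at 2 of 5 comparable sites, p = 0.4 and the corrected distance is -ln(0.6), about 0.51.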

Data analysis

Student’s t-test was performed in SAS to compare the evolutionary distances of duplicates and singletons. The Bonferroni correction was then applied to the P values. P < 0.05 was considered significant, and P < 0.0001 was considered significant under the Bonferroni correction. Experimental values are reported as mean ± standard deviation or shown in a box plot (Figure 3).
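For readers without access to SAS, the comparison can be approximated with the Python standard library. This is a sketch and not the procedure used in this protocol: it computes the pooled-variance (Student) t statistic, approximates the two-sided P value with the standard normal distribution, which is reasonable only for large samples such as genome-wide gene sets, and applies a Bonferroni adjustment by multiplying each P value by the number of tests. The function names are hypothetical.

```python
import math
from statistics import NormalDist, mean, variance


def pooled_t_statistic(sample_a, sample_b):
    """Return (t, degrees of freedom) for Student's two-sample t-test."""
    n_a, n_b = len(sample_a), len(sample_b)
    df = n_a + n_b - 2
    pooled_var = (
        (n_a - 1) * variance(sample_a) + (n_b - 1) * variance(sample_b)
    ) / df
    standard_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(sample_a) - mean(sample_b)) / standard_error, df


def approx_two_sided_p(t):
    """Large-sample normal approximation to the two-sided P value."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(t)))


def bonferroni(p_values):
    """Bonferroni-adjusted P values, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]
```

For small samples, the P value should instead come from the t distribution with the returned degrees of freedom, using a statistics package.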


Figure 3. Comparison of the evolutionary distance between duplicates and singletons in soybean. Ka and Ks were calculated by pairwise comparison between soybean and common bean. The statistical analysis was conducted by Student’s t-test. **, P < 0.0001. Ka, non-synonymous substitution; Ks, synonymous substitution; ω, Ka/Ks. The bottom and top boundaries of the box are the first and third quartiles, and the bold lines within individual boxes are the medians, which are referred to as the second quartiles. The ends of the whiskers (the dotted lines) represent the minimum values and maximum values of the data.

Notes

  1. Some designated singleton genes, the homologs of which were not found in the syntenic region, may belong to duplicated pairs because of potential gene transposition. The homologs of the singletons may be transposed or translocated from the original syntenic regions to elsewhere in the genome, rather than being deleted. 
  2. Genes involved in tandem duplication always have an ambiguous retention status, and thus were separately analyzed.

Acknowledgments

This protocol was adapted from Zhao et al. (2017). This work was supported by soybean check-off funds from the United Soybean Board and Indiana Soybean Alliance and National Science Foundation Grant DBI-0822258 to J.M., by National Science Foundation Grant DBI-1237931 to D.L. and Purdue Startup Funds to D.L., and by Miami University Startup Funds to M.Z.

Competing interests

The authors declare that there are no conflicts of interest or competing interests.

References

  1. Altschul, S. F., Madden, T. L., Schäffer, A. A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D. J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25(17): 3389-3402.
  2. Blanc, G. and Wolfe, K. H. (2004). Widespread paleopolyploidy in model plant species inferred from age distributions of duplicate genes. Plant Cell 16(7): 1667-1678.
  3. Du, J., Grant, D., Tian, Z., Nelson, R. T., Zhu, L., Shoemaker, R. C. and Ma, J. (2010). SoyTEdb: a comprehensive database of transposable elements in the soybean genome. BMC Genomics 11: 113.
  4. Edgar, R. C. (2004). MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 5: 113.
  5. Gu, X., Zhang, Z. and Huang, W. (2005). Rapid evolution of expression and regulatory divergences after yeast gene duplication. Proc Natl Acad Sci U S A 102(3): 707-712.
  6. Ilic, K., SanMiguel, P. J. and Bennetzen, J. L. (2003). A complex history of rearrangement in an orthologous region of the maize, sorghum, and rice genomes. Proc Natl Acad Sci U S A 100(21): 12265-12270.
  7. Jiao, Y., Wickett, N. J., Ayyampalayam, S., Chanderbali, A. S., Landherr, L., Ralph, P. E., Tomsho, L. P., Hu, Y., Liang, H., Soltis, P. S., Soltis, D. E., Clifton, S. W., Schlarbaum, S. E., Schuster, S. C., Ma, H., Leebens-Mack, J. and dePamphilis, C. W. (2011). Ancestral polyploidy in seed plants and angiosperms. Nature 473(7345): 97-100.
  8. Kumar, S., Stecher, G., Li, M., Knyaz, C. and Tamura, K. (2018). MEGA X: Molecular Evolutionary Genetics Analysis across computing platforms. Mol Biol Evol 35(6): 1547-1549.
  9. Lavin, M., Herendeen, P. S. and Wojciechowski, M. F. (2005). Evolutionary rates analysis of Leguminosae implicates a rapid diversification of lineages during the tertiary. Syst Biol 54(4): 575-594.
  10. Makalowski, W. and Boguski, M. S. (1998). Evolutionary parameters of the transcribed mammalian genome: an analysis of 2,820 orthologous rodent and human sequences. Proc Natl Acad Sci U S A 95(16): 9407-9412.
  11. McClean, P. E., Mamidi, S., McConnell, M., Chikara, S. and Lee, R. (2010). Synteny mapping between common bean and soybean reveals extensive blocks of shared loci. BMC Genomics 11: 184.
  12. Nekrutenko, A., Makova, K. D. and Li, W. H. (2002). The KA/KS ratio test for assessing the protein-coding potential of genomic regions: an empirical and simulation study. Genome Res 12(1): 198-202.
  13. Ohno, S. (1970). Evolution by gene duplication. Springer-Verlag, New York, p. 160.
  14. Otto, S. P. and Whitton, J. (2000). Polyploid incidence and evolution. Annu Rev Genet 34: 401-437.
  15. Schmutz, J., Cannon, S. B., Schlueter, J., Ma, J., Mitros, T., Nelson, W., Hyten, D. L., Song, Q., Thelen, J. J., Cheng, J., Xu, D., Hellsten, U., May, G. D., Yu, Y., Sakurai, T., Umezawa, T., Bhattacharyya, M. K., Sandhu, D., Valliyodan, B., Lindquist, E., Peto, M., Grant, D., Shu, S., Goodstein, D., Barry, K., Futrell-Griggs, M., Abernathy, B., Du, J., Tian, Z., Zhu, L., Gill, N., Joshi, T., Libault, M., Sethuraman, A., Zhang, X. C., Shinozaki, K., Nguyen, H. T., Wing, R. A., Cregan, P., Specht, J., Grimwood, J., Rokhsar, D., Stacey, G., Shoemaker, R. C. and Jackson, S. A. (2010). Genome sequence of the palaeopolyploid soybean. Nature 463(7278): 178-183. 
  16. Schmutz, J., McClean, P. E., Mamidi, S., Wu, G. A., Cannon, S. B., Grimwood, J., Jenkins, J., Shu, S., Song, Q., Chavarro, C., Torres-Torres, M., Geffroy, V., Moghaddam, S. M., Gao, D., Abernathy, B., Barry, K., Blair, M., Brick, M. A., Chovatia, M., Gepts, P., Goodstein, D. M., Gonzales, M., Hellsten, U., Hyten, D. L., Jia, G., Kelly, J. D., Kudrna, D., Lee, R., Richard, M. M., Miklas, P. N., Osorno, J. M., Rodrigues, J., Thareau, V., Urrea, C. A., Wang, M., Yu, Y., Zhang, M., Wing, R. A., Cregan, P. B., Rokhsar, D. S. and Jackson, S. A. (2014). A reference genome for common bean and genome-wide analysis of dual domestications. Nat Genet 46(7): 707-713.
  17. Sémon, M. and Wolfe, K. H. (2007). Consequences of genome duplication. Curr Opin Genet Dev 17(6): 505-512.
  18. Thompson, J. D., Higgins, D. G. and Gibson, T. J. (1994). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22(22): 4673-4680.
  19. Wang, Y., Tang, H., Debarry, J. D., Tan, X., Li, J., Wang, X., Lee, T. H., Jin, H., Marler, B., Guo, H., Kissinger, J. C. and Paterson, A. H. (2012). MCScanX: a toolkit for detection and evolutionary analysis of gene synteny and collinearity. Nucleic Acids Res 40(7): e49.
  20. Yang, Z. (2007). PAML 4: phylogenetic analysis by maximum likelihood. Mol Biol Evol 24(8): 1586-1591.
  21. Zhao, M., Zhang, B., Lisch, D. and Ma, J. (2017). Patterns and consequences of subgenome differentiation provide insights into the nature of paleopolyploidy in plants. Plant Cell 29(12): 2974-2994.

Copyright: © 2018 The Authors; exclusive licensee Bio-protocol LLC.
Citation: Zhao, M., Zhang, B., Ma, J. and Lisch, D. (2018). Genome-wide Estimation of Evolutionary Distance and Phylogenetic Analysis of Homologous Genes. Bio-protocol 8(23): e3097. DOI: 10.21769/BioProtoc.3097.