Sep 2020

Computational Analysis and Phylogenetic Clustering of SARS-CoV-2 Genomes

Abstract

COVID-19, the disease caused by the novel SARS-CoV-2 coronavirus, originated as an isolated outbreak in the Hubei province of China but soon created a global pandemic and is now a major threat to healthcare systems worldwide. Following the rapid human-to-human transmission of the infection, institutes around the world have made efforts to generate genome sequence data for the virus. With thousands of genome sequences for SARS-CoV-2 now available in the public domain, it is possible to analyze the sequences and gain a deeper understanding of the disease, its origin, and its epidemiology. Phylogenetic analysis is a potentially powerful tool for tracking the transmission pattern of the virus with a view to aiding identification of potential interventions. Toward this goal, we have created a comprehensive protocol for the analysis and phylogenetic clustering of SARS-CoV-2 genomes using Nextstrain, a powerful open-source tool for the real-time interactive visualization of genome sequencing data. Approaches to focus the phylogenetic clustering analysis on a particular region of interest are detailed in this protocol.

Keywords: COVID-19, SARS-CoV-2, Phylogenetic analysis, Genomes, Coronavirus

Background

Severe Acute Respiratory Syndrome-related coronaviruses (SARS-CoV) are one of the largest single-stranded RNA virus families known to date (Zhu et al., 2020). Recently, SARS-CoV-2, a novel strain of coronavirus, has been identified as the causal pathogen for the ongoing Coronavirus disease 2019 (COVID-19) pandemic (Huang et al., 2020). The infectious disease, which originated in Wuhan, China, spread to other nations at an alarmingly rapid pace. With 3,517,345 cases reported globally and a death toll of 243,401 (as of 5th May 2020), the disease continues to be a public health concern and a potential threat to the socio-economic welfare of nations and healthcare systems worldwide (World Health Organization, 2020. Novel Coronavirus (2019-nCoV): situation report, 106).


Owing to the rapid advancement of next-generation sequencing (NGS) technology and analysis methods, sequencing the viral genome has been recognized as a viable tool to aid the diagnosis and treatment of COVID-19 and to help understand the disease epidemiology. As the disease evolves over time, more sequencing data for SARS-CoV-2 genomes are being made available in the public domain. To date, there are over 25,000 publicly available genomes of SARS-CoV-2 from different geographical origins. Phylogenetic principles have previously been used successfully to contain and defuse recent outbreaks such as avian influenza, the Zika virus epidemic, and HIV (Salemi et al., 2008; Babakir-Mina et al., 2009; Angeletti et al., 2016). With the rapid accumulation of sequencing data, phylogenetic and phylodynamic analyses are potentially powerful tools for studying the evolutionary patterns of rapidly evolving RNA viruses and can therefore help to understand the epidemiology of the outbreak.


Visualizing evolutionary epidemiology can help to provide a deeper understanding of the global diversity of SARS-CoV-2. Nextstrain is an open-source project that aims to provide real-time interactive visualization of rapidly evolving pathogens coupled with additional data such as geographic information (Hadfield et al., 2018). Nextstrain utilizes Augur, a bioinformatics toolkit for the systematic analysis of genome sequences, and Auspice, an interactive web service for the visualization of analysis results. This protocol has been created to aid bioinformaticians in gaining an epidemiological understanding of the SARS-CoV-2 pathogen using the powerful phylogenetic analysis toolkit provided by Nextstrain. The data and parameters used in this protocol are specific to SARS-CoV-2 genomes; however, Nextstrain is a generalized toolkit for the analysis of pathogen phylogenies and can be customized using the appropriate data and parameters suited to the pathogen of interest. All software and datasets used in this protocol are available in the public domain.

Equipment

We explicitly assume that the user has some experience working with shell commands on a Linux-based operating system and has superuser privileges.

  1. Computational Requirements

    We recommend using a workstation or a server with a 64-bit Linux-based operating system, 8 GB RAM, and sufficient hard disk space (at least 250 GB) to store the files used and produced in this analysis. The commands given in this protocol have been validated on the Ubuntu 18.04 LTS Linux distribution.

Software

  1. Required Software

    This protocol uses the following tools and Nextstrain software to perform the phylogenetic analysis:

    1. Docker Engine (https://www.docker.com/)

    2. Anaconda (https://www.anaconda.com/)

    3. Nextstrain (Hadfield et al., 2018)

    4. Augur (Hadfield et al., 2018)

    5. MAFFT (Katoh and Standley, 2013)

    6. IQ-TREE (Nguyen et al., 2015)

    All requisite tools and their dependencies must be installed before proceeding with the analysis.

  2. Datasets

    The protocol uses the SARS-CoV-2 genome sequence datasets made available by the Global Initiative on Sharing All Influenza Data (GISAID) (Shu and McCauley, 2017).

    The installation steps for all tools used in this protocol and the instructions for downloading the requisite datasets are given in the following section.

Procedure

The individual steps involved in this protocol and the Augur modules used in each step are summarized in Figure 1.

Downloading and installing requisite software tools and datasets

  1. Install Docker Engine

    Docker is an open-source technology based on virtualization, which is used for developing and running software applications in the form of containers. The Docker Engine can be installed using the following commands:


    sudo apt-get update

    sudo apt-get install apt-transport-https ca-certificates curl gnupg-agent software-properties-common

    curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -

    sudo apt-key fingerprint 0EBFCD88

    sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"

    sudo apt-get update

    sudo apt-get install docker-ce docker-ce-cli containerd.io

    To activate and test Docker installation, execute the following commands:


    sudo groupadd docker

    sudo usermod -aG docker $USER

    newgrp docker

    docker run hello-world


    Figure 1. The different steps described in this protocol and the Augur modules used in each of the analysis steps


  2. Install Anaconda

    Anaconda is an open-source distribution of Python that simplifies the management of Python packages and environments. To install Anaconda, use the following commands:


    wget https://repo.anaconda.com/archive/Anaconda3-2020.02-Linux-x86_64.sh

    bash Anaconda3-2020.02-Linux-x86_64.sh

    Proceed with the installation by following the on-screen instructions. You can find the anaconda3 folder in the directory shown in the installer script. You can activate and test your installation by running the following commands:


    source ~/.bashrc

    conda list

  3. Install Nextstrain-CLI

    Nextstrain is available as a Python package and can be installed using pip.


    python3 -m pip install nextstrain-cli

    To check whether Nextstrain has been successfully installed, use the following command:


    nextstrain version

    The version number shown in the output should be 1.16.1 or higher.
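The version requirement can also be checked mechanically. The following is a minimal sketch and not part of the original protocol: the helper name `version_ge` is hypothetical, it relies on the `-V` (version sort) option of GNU `sort`, and it assumes that the version number is the last field of the first line printed by `nextstrain version`.

```shell
#!/usr/bin/env bash
# version_ge A B: succeed when version A is greater than or equal to version B.
# Relies on GNU sort's -V (version sort) option.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

required="1.16.1"
if command -v nextstrain >/dev/null 2>&1; then
    # Assumption: the version is the last field of the first output line.
    installed="$(nextstrain version | head -n 1 | awk '{print $NF}')"
    if version_ge "$installed" "$required"; then
        echo "nextstrain-cli $installed is recent enough"
    else
        echo "nextstrain-cli $installed is older than $required"
    fi
else
    echo "nextstrain not found on PATH"
fi
```

A plain string comparison would rank 1.9.0 above 1.16.1, which is why a version-aware sort is used here.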


  4. Install Augur

    Augur is the toolkit provided by Nextstrain for phylogenetic analysis. Augur is also available as a Python package and can be installed using the following command:


    python3 -m pip install nextstrain-augur

  5. Install MAFFT

    MAFFT (Multiple Alignment using Fast Fourier Transform) is required by Augur to perform multiple-sequence alignments. To download and install this tool, use the following command:


    sudo apt-get install mafft

  6. Install IQ-TREE

    IQ-TREE is an open-source tool for constructing maximum-likelihood trees using phylogenetic data. IQ-TREE is required by Augur for constructing a phylogenetic tree from sequence data. To install IQ-TREE use the following command:


    sudo apt-get install iqtree

    It is recommended to use IQ-TREE version 1.6.1 (default version installed for Ubuntu 18.04 LTS) or higher.
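Before moving on to the datasets, it can help to confirm that every tool installed above is visible on the PATH. The sketch below is an illustration only; the function name `check_tools` is hypothetical, and the command names are assumed from the packages installed in the previous steps.

```shell
#!/usr/bin/env bash
# Report which of the given command-line tools are visible on PATH.
# Returns a non-zero status if at least one tool is missing.
check_tools() {
    local missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Command names are assumptions based on the packages installed above.
check_tools docker conda nextstrain augur mafft iqtree \
    || echo "Install the missing tools before continuing."
```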


  7. Download the SARS-CoV-2 sequence dataset

    The Global Initiative on Sharing All Influenza Data (GISAID) is the most up-to-date public repository of SARS-CoV-2 genome sequences. For this phylogenetic clustering protocol, we downloaded a dataset of ~15,000 complete SARS-CoV-2 genome sequences from GISAID (as of 1st May 2020). Access to the database requires registering for a GISAID account. Once the account has been activated, the sequence dataset can be downloaded by logging into the GISAID EpiCoV™ database and navigating to the Browse option (https://www.epicov.org/epi3/frontend).

    To create the metadata file required by Augur, you will also need to download the Acknowledgment Table for all submissions provided by GISAID, which can also be found on the Browse page.


  8. Download the SARS-CoV-2 reference genome

    Before proceeding with the analysis, you also need to download the reference genome for SARS-CoV-2 from NCBI in GenBank format (.gb). For this analysis, we downloaded the genome with the accession number MN908947.3.
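The protocol does not specify how the record was downloaded. One possible route, shown here as a sketch, is the NCBI E-utilities `efetch` endpoint. The function name `efetch_genbank_url`, the `DOWNLOAD` switch, and the output file name MN908947.gb (taken from the later commands) are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Build the NCBI E-utilities efetch URL for a nucleotide record
# in GenBank flat-file format.
efetch_genbank_url() {
    local accession="$1"
    echo "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=${accession}&rettype=gb&retmode=text"
}

# Download only when DOWNLOAD=1 is set, so the snippet can be read offline.
if [ "${DOWNLOAD:-0}" = "1" ]; then
    curl -fsSL "$(efetch_genbank_url MN908947.3)" -o MN908947.gb
    # A GenBank flat file starts with a LOCUS line.
    head -n 1 MN908947.gb
fi
```

Downloading the file manually from the NCBI Nucleotide page for MN908947.3 gives the same record.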


  9. Preparing input files

    To use Nextstrain for phylogenetic analysis and visualization, you need to prepare the following input files (Table 1):


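    Table 1. List of input files required to run the different steps in the analysis pipeline

    File                  Description
    Required input files
    sequences.fasta       Collection of SARS-CoV-2 sequences to be analyzed, in FASTA format
    metadata.tsv          Tab-delimited text file describing all sequences in the sequences.fasta file
    clades.tsv            Tab-delimited text file containing clade definitions, downloaded from the Nextstrain GitHub repository
    MN908947.gb           SARS-CoV-2 reference genome in GenBank format
    Additional configuration files
    auspice_config.json   Text file in JSON format specifying the visualization settings
    lat_longs.tsv         Tab-delimited text file used to display geographic traits
    colors.tsv            Tab-delimited file containing hex color codes for metadata elements
    Optional configuration file
    include_file          Text file containing the names of sequences to include in the analysis regardless of other subsampling criteria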



    1. sequences.fasta

      A single FASTA file containing a collection of pathogen sequences to be analyzed. For this analysis, we used the sequence dataset downloaded from GISAID. Each sequence in the FASTA file should have the strain ID of the virus as the sequence header. A sample sequence record for the FASTA file is shown in Figure 2.



      Figure 2. Sample record for the hCoV-19/India/1-27/2020 SARS-CoV-2 strain in the sequences.fasta file


    2. metadata.tsv

      A tab-delimited metadata file that describes the sequences given in the FASTA file. The various fields to be included in the metadata file are as follows:

      1. Required fields: Strain, Virus, Date

        For each strain ID in the sequences.fasta file, there should be an entry under the strain column in the metadata file.

      2. Additional fields (if using published data): Accession, Authors, URL, Title, Journal, Paper_URL.

      3. To infer ancestral traits, additional information fields such as region, country, state, and city need to be included in the metadata file.

      The information for the various fields in the metadata file can be taken from the Acknowledgment Table downloaded from GISAID. A sample metadata spreadsheet is linked here as Supplementary Data 1.
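Because every strain ID in sequences.fasta needs a matching row in the metadata file, a quick consistency check can catch mismatches before Augur is run. The following is a sketch under stated assumptions: the metadata header names the column `strain` in lowercase, each FASTA header consists of the strain ID only, and the function name `strains_missing_metadata` is hypothetical.

```shell
#!/usr/bin/env bash
# List FASTA headers that have no row in the metadata 'strain' column.
# Usage: strains_missing_metadata sequences.fasta metadata.tsv
strains_missing_metadata() {
    awk -F '\t' '
        # First file: the metadata table.
        NR == FNR {
            if (FNR == 1) {
                # Locate the strain column from the header row.
                for (i = 1; i <= NF; i++) if ($i == "strain") col = i
                next
            }
            if (col) known[$col] = 1
            next
        }
        # Second file: the FASTA file; header lines start with ">".
        /^>/ {
            id = substr($0, 2)
            if (!(id in known)) print id
        }
    ' "$2" "$1"
}

if [ -f sequences.fasta ] && [ -f metadata.tsv ]; then
    strains_missing_metadata sequences.fasta metadata.tsv
fi
```

An empty output means that every sequence has a metadata entry.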


    3. clades.tsv

      This file is required for the addition of clade labeling to the phylogenetic tree. The file specifies the mutations (amino acid or nucleotide) specific to a particular clade of the virus (Figure 3). The clades.tsv file should contain the following fields:

      1. clade: To describe the name of a clade.

      2. gene: The name of the gene in which the mutation lies (for nucleotide changes, the gene name should be ‘nuc’).

      3. site: The position of the mutation within the genome.

      4. alt: The mutated amino acid or nucleotide found at that position.

      For this analysis, we used the clades definition for SARS-CoV-2 genomes defined by Nextstrain ( https://github.com/nextstrain/ncov ).



      Figure 3. Summary screenshot of the clades.tsv file provided by Nextstrain for SARS-CoV-2 genomes
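If the clades.tsv file is edited by hand, its layout can be checked with a short script. This sketch assumes a header row naming the four fields listed above in that order, and it skips blank lines and lines starting with `#`; the function name `validate_clades` is hypothetical and the check is not part of Augur.

```shell
#!/usr/bin/env bash
# Check that a clades file has the four expected columns and numeric sites.
# Prints one message per problem and fails if any problem was found.
# Usage: validate_clades clades.tsv
validate_clades() {
    awk -F '\t' '
        NR == 1 {
            if ($1 != "clade" || $2 != "gene" || $3 != "site" || $4 != "alt") {
                print "line 1: expected header clade, gene, site, alt"; bad = 1
            }
            next
        }
        /^#/ || NF == 0  { next }   # skip comments and blank lines
        NF != 4          { print "line " NR ": expected 4 tab-separated fields"; bad = 1; next }
        $3 !~ /^[0-9]+$/ { print "line " NR ": site is not a positive integer"; bad = 1 }
        END { exit bad + 0 }
    ' "$1"
}

if [ -f clades.tsv ]; then
    validate_clades clades.tsv && echo "clades.tsv looks well-formed"
fi
```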


    4. auspice_config.json

      This file is needed to set various display options for visualization. A sample config file is linked here as Supplementary Data 2.

    5. lat_longs.tsv

      A tab-separated file containing latitudes and longitudes for all regions, countries, states, and cities in the dataset (Figure 4). This file will be used to display geographic traits during visualization.



      Figure 4. Summary screenshot of the lat_longs.tsv file required by Nextstrain for visualizing geographic traits
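A location that appears in the metadata but not in lat_longs.tsv cannot be placed on the map, so it is worth checking coverage in advance. The sketch below makes several assumptions: lat_longs.tsv has four tab-separated columns without a header (geographic resolution, name, latitude, longitude), the metadata has a `country` column, and names are compared case-insensitively. The function name `countries_without_coordinates` is hypothetical.

```shell
#!/usr/bin/env bash
# Print countries that occur in the metadata but have no "country" row
# in the lat_longs file.
# Usage: countries_without_coordinates metadata.tsv lat_longs.tsv
countries_without_coordinates() {
    awk -F '\t' '
        # First file: lat_longs.tsv (resolution, name, latitude, longitude).
        NR == FNR { if ($1 == "country") have[tolower($2)] = 1; next }
        # Second file: metadata; locate the country column from the header.
        FNR == 1 { for (i = 1; i <= NF; i++) if ($i == "country") col = i; next }
        # Report each uncovered country once.
        col && $col != "" && !(tolower($col) in have) && !seen[$col]++ { print $col }
    ' "$2" "$1"
}

if [ -f metadata.tsv ] && [ -f lat_longs.tsv ]; then
    countries_without_coordinates metadata.tsv lat_longs.tsv
fi
```

The same pattern applies to the region, state, and city fields by changing the two occurrences of `country`.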


    6. Quality assessment

      In this visualization, we would also like to segregate high-quality FASTA sequences in the dataset from low-quality ones. Accordingly, we added an additional field, ‘quality,’ to the metadata file. The following quality metrics define a high-quality sequence:

      1. Percentage identity to the reference genome after pairwise alignment: >99%

      2. Percentage of gaps in the alignment: <1%

      3. Percentage of N (unknown nucleic acid residue) bases in the sequence: <1%

      4. No degenerate bases in the sequence

        Based on the above criteria, the ‘quality’ metadata field can hold the values, ‘High,’ ‘Low,’ and ‘Not Assessed.’
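Criteria 3 and 4 can be computed directly from the FASTA file. The sketch below is one possible way to do so and is not the authors' script: the function name `base_quality` is hypothetical, and "degenerate bases" is taken to mean the IUPAC ambiguity codes other than N. Criteria 1 and 2 require a pairwise alignment to the reference and are not covered here.

```shell
#!/usr/bin/env bash
# For each FASTA record print: strain ID, sequence length, percentage of
# N bases, and the number of other IUPAC ambiguity codes (degenerate bases).
# Usage: base_quality sequences.fasta
base_quality() {
    awk '
        function report() {
            if (id != "") {
                len = length(seq)
                n = gsub(/[Nn]/, "", seq)                      # count N bases
                deg = gsub(/[RYSWKMBDHVryswkmbdhv]/, "", seq)  # count degenerate bases
                pct = (len > 0) ? 100 * n / len : 0
                printf "%s\t%d\t%.2f\t%d\n", id, len, pct, deg
            }
        }
        /^>/ { report(); id = substr($0, 2); seq = ""; next }
        { seq = seq $0 }
        END { report() }
    ' "$1"
}

if [ -f sequences.fasta ]; then
    base_quality sequences.fasta | head
fi
```

A record meets criteria 3 and 4 when the third column is below 1 and the fourth column is 0.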

      To visualize the quality assessment, we created an additional configuration file ‘colors.tsv,’ a tab-delimited file containing hex codes for each value of the sequence quality field that you want to represent. In this analysis, high-quality is shown in green, low-quality in red, and unassessed sequences in yellow by specifying the corresponding hex codes for the required colors in the ‘colors.tsv’ file (Figure 5).


           
            Figure 5. Summary screenshot of the colors.tsv file created for visualizing sequence quality
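A colors.tsv file of this kind can be written with a few `printf` calls. In this sketch the three-column layout (field name, field value, hex code, with no header) and the specific hex codes are assumptions chosen for illustration; the codes used in the original analysis are not stated in the text. The function name `write_quality_colors` is hypothetical.

```shell
#!/usr/bin/env bash
# Write a colors file mapping each value of the 'quality' field to a hex color.
# Columns (tab-separated, no header): field name, field value, hex code.
# Usage: write_quality_colors OUTPUT_FILE
write_quality_colors() {
    {
        printf 'quality\tHigh\t#00FF00\n'          # green (illustrative code)
        printf 'quality\tLow\t#FF0000\n'           # red (illustrative code)
        printf 'quality\tNot Assessed\t#FFFF00\n'  # yellow (illustrative code)
    } > "$1"
}

write_quality_colors colors.tsv
cat colors.tsv
```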

Data analysis

Due to legibility and performance constraints, Nextstrain can only handle ~3,000 sequences in a single view. Since we are working with a set of ~15,000 genome sequences, we subsampled our data and analyzed them by focusing on an individual geographic region (i.e., India).


  1. Filter sequences

    The input sequence set can be filtered based on certain criteria and subsampled using the augur filter command. The following command filters the SARS-CoV-2 sequences by the date recorded in the metadata and groups them by country, year, and month. All sequences dated before 2013 or lacking a date record are dropped, and the global data are subsampled to at most 100 sequences per country per year per month.


    augur filter --sequences sequences.fasta --metadata metadata.tsv --output filtered_ncov.fasta --group-by country year month --sequences-per-group 100 --min-date 2013
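To judge whether 100 sequences per group keeps the dataset within the ~3,000-sequence limit, the size of the subsample can be estimated from the metadata alone. This is a rough sketch, not a reimplementation of augur filter: it assumes lowercase `country` and `date` columns, counts only rows with a complete YYYY-MM-DD date, and ignores any other filtering. The function name `estimate_subsample` is hypothetical.

```shell
#!/usr/bin/env bash
# Estimate how many sequences remain after subsampling to at most CAP
# sequences per country, year, and month.
# Usage: estimate_subsample metadata.tsv [CAP]
estimate_subsample() {
    awk -F '\t' -v cap="${2:-100}" '
        NR == 1 {
            for (i = 1; i <= NF; i++) {
                if ($i == "country") c = i
                if ($i == "date") d = i
            }
            next
        }
        # Count only rows with a complete YYYY-MM-DD date.
        c && d && $d ~ /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]$/ {
            count[$c SUBSEP substr($d, 1, 7)]++   # group key: country + YYYY-MM
        }
        END {
            for (k in count) total += (count[k] < cap ? count[k] : cap)
            print total + 0
        }
    ' "$1"
}

if [ -f metadata.tsv ]; then
    estimate_subsample metadata.tsv 100
fi
```

If the estimate is well above 3,000, a smaller value for --sequences-per-group is one way to bring it down.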


    To focus on a particular geographic region, the filter command also contains parameters that help to include or exclude certain sequences from the analysis:


    --include <include_file> This constraint can be used to include sequences regardless of other subsampling criteria. For this analysis, the include_file will contain the line hCoV-19/Wuhan/WH01/2019, since we will be using this genome as the root in the phylogenetic tree. The names of any other sequences that you want to include in your analysis can be added to this file.

    --exclude-where <CONDITION> This constraint will be used for focusing the analysis on a particular region.


    To subsample the dataset for a single geographic region, use the following command:


    augur filter --sequences sequences.fasta --metadata metadata.tsv --output filtered_ncov_india.fasta --exclude-where country!=India --include include_file
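The include file is a plain text file with one strain name per line. The sketch below writes it and warns about names that do not occur as FASTA headers; the function name `write_include_file` is hypothetical, and the check assumes that each FASTA header consists of the strain name only.

```shell
#!/usr/bin/env bash
# Create the include file with one strain name per line.
# Usage: write_include_file OUTPUT STRAIN [STRAIN ...]
write_include_file() {
    local out="$1"; shift
    printf '%s\n' "$@" > "$out"
}

# The root strain used later with the --root parameter.
write_include_file include_file "hCoV-19/Wuhan/WH01/2019"

# Warn if a listed strain is absent from the FASTA headers.
if [ -f sequences.fasta ]; then
    while IFS= read -r strain; do
        grep -qxF ">$strain" sequences.fasta || echo "not in sequences.fasta: $strain"
    done < include_file
fi
```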


  2. Alignment to the reference genome

    Augur uses MAFFT to perform multiple-sequence alignments. To create an alignment file using Augur use the following command:


    augur align --sequences filtered_ncov.fasta --reference-sequence MN908947.gb --output aligned_ncov.fasta --nthreads 2 --remove-reference --fill-gaps


    For the geographic region-focused analysis, use the following command:


    augur align --sequences filtered_ncov_india.fasta --reference-sequence MN908947.gb --output aligned_ncov_india.fasta --nthreads 2 --remove-reference --fill-gaps


  3. Constructing the phylogenetic tree

    Augur uses IQ-TREE as the default software to construct a phylogenetic tree from the multiple-sequence alignment file. The branch lengths in the tree are a measure of nucleotide divergence. The following command will generate a phylogenetic tree in Newick format (.nwk):


    augur tree --alignment aligned_ncov.fasta --output raw_tree_ncov.nwk --nthreads 4


    For the geographic region-focused analysis, use the following command:


    augur tree --alignment aligned_ncov_india.fasta --output raw_tree_ncov_india.nwk --nthreads 4


  4. Refining the phylogenetic tree

    The raw tree constructed in the previous step can be further processed by Augur, which uses TreeTime to rescale the branch lengths according to the sampling dates of the sequences. For this analysis, we specified the root of the tree explicitly by passing the sequence name hCoV-19/Wuhan/WH01/2019 to the --root parameter of the refine command. The --clock-rate parameter fixes the evolutionary rate so that the analysis produces a robust time-resolved phylogeny; for SARS-CoV-2 genomes, we fixed this rate at 0.0008 (8 × 10^-4) substitutions per site per year. The --clock-filter-iqd parameter removes genomes that deviate from the molecular clock by more than the given number of interquartile ranges (4 in the commands below). To produce a time-resolved tree, use the following command:


    augur refine --tree raw_tree_ncov.nwk --alignment aligned_ncov.fasta --metadata metadata.tsv --output-tree refined_ncov_tree.nwk --output-node-data branch_lengths_ncov.json --root hCoV-19/Wuhan/WH01/2019 --timetree --clock-rate 0.0008 --clock-std-dev 0.0004 --coalescent skyline --date-inference marginal --divergence-unit mutations --date-confidence --no-covariance --clock-filter-iqd 4


    For the geographic region-focused analysis, use the following command:


    augur refine --tree raw_tree_ncov_india.nwk --alignment aligned_ncov_india.fasta --metadata metadata.tsv --output-tree refined_ncov_tree_india.nwk --output-node-data branch_lengths_ncov_india.json --root hCoV-19/Wuhan/WH01/2019 --timetree --clock-rate 0.0008 --clock-std-dev 0.0004 --coalescent skyline --date-inference marginal --divergence-unit mutations --date-confidence --no-covariance --clock-filter-iqd 4
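To get a feel for what a fixed clock rate of 0.0008 substitutions per site per year means, it can be converted into expected substitutions per genome. The short calculation below is an illustration only; it uses the length of the MN908947.3 reference genome (29,903 nt), and the function name `expected_substitutions` is hypothetical.

```shell
#!/usr/bin/env bash
# Expected substitutions per genome for a clock rate given in
# substitutions per site per year.
# Usage: expected_substitutions RATE GENOME_LENGTH YEARS
expected_substitutions() {
    awk -v rate="$1" -v len="$2" -v years="$3" \
        'BEGIN { printf "%.1f\n", rate * len * years }'
}

# 29,903 nt is the length of the MN908947.3 reference genome.
expected_substitutions 0.0008 29903 1        # over one year
expected_substitutions 0.0008 29903 0.0833   # over about one month
```

This works out to roughly 24 substitutions per genome per year, or about two per month, which is the scale against which the clock filter judges outlier sequences.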


  5. Annotating ancestral traits

    Augur can use the time tree to infer the region and country of all internal nodes. The ancestral traits for all nodes can be annotated using the following command:


    augur traits --tree refined_ncov_tree.nwk --metadata metadata.tsv --output ncov_traits.json --columns region country --confidence --sampling-bias-correction 2.5


    For the geographic region-focused analysis, use the following command:


    augur traits --tree refined_ncov_tree_india.nwk --metadata metadata.tsv --output ncov_traits_india.json --columns city --confidence --sampling-bias-correction 2.5


  6. Inferring ancestral sequences and nucleotide mutations

    The following command will identify the nucleotide mutations on the branches of the tree and infer the ancestral sequence at each internal node:


    augur ancestral --tree refined_ncov_tree.nwk --alignment aligned_ncov.fasta --output-node-data ncov_nt_muts.json --inference joint --infer-ambiguous


    For the geographic region-focused analysis, use the following command:


    augur ancestral --tree refined_ncov_tree_india.nwk --alignment aligned_ncov_india.fasta --output-node-data ncov_nt_muts_india.json --inference joint --infer-ambiguous


  7. Inferring amino acid mutations

    The following command will identify the amino acid mutations using the reference genome and ancestral sequences:


    augur translate --tree refined_ncov_tree.nwk --ancestral-sequences ncov_nt_muts.json --reference-sequence MN908947.gb --output ncov_aa_muts.json


    For the geographic region-focused analysis, use the following command:


    augur translate --tree refined_ncov_tree_india.nwk --ancestral-sequences ncov_nt_muts_india.json --reference-sequence MN908947.gb --output ncov_aa_muts_india.json


  8. Identifying clades

    The following command will label clades within the dataset using the nucleotide and amino acid mutations specified in the clades.tsv file:


    augur clades --tree refined_ncov_tree.nwk --mutations ncov_aa_muts.json ncov_nt_muts.json --clades clades.tsv --output-node-data ncov_clades.json


    For the geographic region-focused analysis, use the following command:


    augur clades --tree refined_ncov_tree_india.nwk --mutations ncov_aa_muts_india.json ncov_nt_muts_india.json --clades clades.tsv --output-node-data ncov_clades_india.json


  9. Exporting output files for visualization

    The following command will export all output files generated in the previous steps of the analysis as a single JSON file to visualize the data using Nextstrain:


    augur export v2 --tree refined_ncov_tree.nwk --metadata metadata.tsv --node-data branch_lengths_ncov.json ncov_aa_muts.json ncov_nt_muts.json ncov_traits.json ncov_clades.json --auspice-config auspice_config.json --lat-longs lat_longs.tsv --colors colors.tsv --output auspice/COVID_global.json


    For the geographic region-focused analysis, use the following command:


    augur export v2 --tree refined_ncov_tree_india.nwk --metadata metadata.tsv --node-data branch_lengths_ncov_india.json ncov_aa_muts_india.json ncov_nt_muts_india.json ncov_traits_india.json ncov_clades_india.json --auspice-config auspice_config.json --lat-longs lat_longs.tsv --colors colors.tsv --output auspice/COVID_india.json
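The global and region-focused runs differ only in their input file, output file names, and trait columns, so steps 2 to 9 can be wrapped in one function. The following is a sketch rather than the authors' workflow: the function names `run` and `run_pipeline` and the `DRY_RUN` switch are hypothetical, while the Augur options and file names are copied from the commands above. By default the script only prints the commands; setting DRY_RUN=0 executes them.

```shell
#!/usr/bin/env bash
# Run the Augur steps for one dataset; SUFFIX distinguishes output file names.
# With DRY_RUN=1 (the default here) commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    echo "+ $*"
    if [ "$DRY_RUN" != "1" ]; then "$@"; fi
}

# Usage: run_pipeline FILTERED_FASTA SUFFIX OUTPUT_NAME TRAIT_COLUMN [TRAIT_COLUMN ...]
run_pipeline() {
    local fasta="$1" s="$2" name="$3"; shift 3
    local ref="MN908947.gb" meta="metadata.tsv" root="hCoV-19/Wuhan/WH01/2019"
    local aln="aligned_ncov${s}.fasta" raw="raw_tree_ncov${s}.nwk" tree="refined_ncov_tree${s}.nwk"

    # Each step runs only if the previous one succeeded.
    run augur align --sequences "$fasta" --reference-sequence "$ref" --output "$aln" \
        --nthreads 2 --remove-reference --fill-gaps &&
    run augur tree --alignment "$aln" --output "$raw" --nthreads 4 &&
    run augur refine --tree "$raw" --alignment "$aln" --metadata "$meta" \
        --output-tree "$tree" --output-node-data "branch_lengths_ncov${s}.json" \
        --root "$root" --timetree --clock-rate 0.0008 --clock-std-dev 0.0004 \
        --coalescent skyline --date-inference marginal --divergence-unit mutations \
        --date-confidence --no-covariance --clock-filter-iqd 4 &&
    run augur traits --tree "$tree" --metadata "$meta" --output "ncov_traits${s}.json" \
        --columns "$@" --confidence --sampling-bias-correction 2.5 &&
    run augur ancestral --tree "$tree" --alignment "$aln" \
        --output-node-data "ncov_nt_muts${s}.json" --inference joint --infer-ambiguous &&
    run augur translate --tree "$tree" --ancestral-sequences "ncov_nt_muts${s}.json" \
        --reference-sequence "$ref" --output "ncov_aa_muts${s}.json" &&
    run augur clades --tree "$tree" --mutations "ncov_aa_muts${s}.json" "ncov_nt_muts${s}.json" \
        --clades clades.tsv --output-node-data "ncov_clades${s}.json" &&
    run augur export v2 --tree "$tree" --metadata "$meta" \
        --node-data "branch_lengths_ncov${s}.json" "ncov_aa_muts${s}.json" \
        "ncov_nt_muts${s}.json" "ncov_traits${s}.json" "ncov_clades${s}.json" \
        --auspice-config auspice_config.json --lat-longs lat_longs.tsv \
        --colors colors.tsv --output "auspice/${name}.json"
}

run_pipeline filtered_ncov.fasta "" COVID_global region country
run_pipeline filtered_ncov_india.fasta _india COVID_india city
```

Another region can be analyzed by filtering with a different --exclude-where condition and calling `run_pipeline` with a new suffix and output name.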


  10. Viewing the data

    To visualize the output, use the following command:


    nextstrain view auspice/ --allow-remote-access


    This command will start the Auspice server on port 4000. The output can then be visualized through a browser by navigating to http://127.0.0.1:4000/ or using the IP address of the machine on which the Auspice service is running and navigating to http://IP_ADDRESS_OF_MACHINE:4000/. The different subsampled datasets can be found under the ‘Dataset’ dropdown menu (Figure 6).

    Note: The links above point to a local Auspice server, so they work only after the steps in this protocol have been completed. Once the server is running, the phylogeny can be viewed in a browser on the user's own system.



    Figure 6. Screenshot of the visualization produced by Nextstrain for the COVID_global and COVID_india datasets

Acknowledgments

This protocol is adapted from the Nextstrain project ( Hadfield et al., 2018 ). The authors acknowledge help from Mukta Poojary and Aastha V to evaluate this protocol. The present work was funded by the Council of Scientific and Industrial Research (CSIR India) through grants given to Vinod Scaria, CSIR-IGIB. BJ acknowledges a GATE Fellowship from the Council of Scientific and Industrial Research. The funders played no role in the preparation of the manuscript or the decision to publish. The authors declare no competing interests.

References

  1. Angeletti, S., Lo Presti, A., Giovanetti, M., Grifoni, A., Amicosante, M., Ciotti, M., Alcantara, L. J., Cella, E. and Ciccozzi, M. (2016). Phylogenesys and homology modeling in Zika virus epidemic: food for thought. Pathog Glob Health 110(7-8): 269-274.
  2. Babakir-Mina, M., Ciccozzi, M., Ciotti, M., Marcuccilli, F., Balestra, E., Dimonte, S., Perno, C. F. and Aquaro, S. (2009). Phylogenetic analysis of the surface proteins of influenza A (H5N1) viruses isolated in Asian and African populations. New Microbiol 32(4): 397-403.
  3. Hadfield, J., Megill, C., Bell, S. M., Huddleston, J., Potter, B., Callender, C., Sagulenko, P., Bedford, T. and Neher, R. A. (2018). Nextstrain: real-time tracking of pathogen evolution. Bioinformatics 34(23): 4121-4123.
  4. Huang, C., Wang, Y., Li, X., Ren, L., Zhao, J., Hu, Y., Zhang, L., Fan, G., Xu, J., Gu, X. et al. (2020). Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China. Lancet 395(10223): 497-506.
  5. Katoh, K. and Standley, D. M. (2013). MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol 30(4): 772-780.
  6. Nguyen, L. T., Schmidt, H. A., von Haeseler, A. and Minh, B. Q. (2015). IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol Biol Evol 32(1): 268-274.
  7. Salemi, M., de Oliveira, T., Ciccozzi, M., Rezza, G. and Goodenow, M. M. (2008). High-resolution molecular epidemiology and evolutionary history of HIV-1 subtypes in Albania. PLoS One 3(1): e1390.
  8. Shu, Y. and McCauley, J. (2017). GISAID: Global initiative on sharing all influenza data - from vision to reality. Euro Surveill 22(13).
  9. Zhu, N., Zhang, D., Wang, W., Li, X., Yang, B., Song, J., Zhao, X., Huang, B., Shi, W., Lu, R., Niu, P., Zhan, F., Ma, X., Wang, D., Xu, W., Wu, G., Gao, G. F. and Tan, W. (2020). A Novel Coronavirus from Patients with Pneumonia in China, 2019. N Engl J Med 382(8): 727-733.


简介

[摘要] COVID-19,造成新的SARS冠状病毒,冠状病毒2的疾病,起源于在湖北省一个孤立爆发的中国,但很快创建一个全球性的流行以及现在对医疗保健系统的世界的主要威胁宽。在人与人之间迅速传播感染之后,世界各地的研究机构都在努力生成该病毒的基因组序列数据。如今,在公共领域已经有了数千种SARS-CoV-2的基因组序列,可以对这些序列进行分析并加深对这种疾病及其起源的了解,及其流行病学。系统发育分析是追踪病毒的传播模式的潜在的强大工具,以帮助IDENTIF ication的潜在的干预。为了实现这一目标,我们已经创建了使用Nextstrain,一个强大的开源工具,SARS-COV-2基因组分析和进化集群的全面协议的基因组测序数据的实时交互可视化。该协议中详细介绍了将系统聚类分析重点放在特定目标区域上的方法。


[背景]严重急性呼吸系统综合症相关冠状病毒(SARS-CoV)是迄今为止已知的最大的单链RNA病毒家族之一(Zhu等,2020)。最近,SARS-CoV-2是一种新型冠状病毒,已被确定为正在进行的冠状病毒Ddisease 2019(COVID-19)大流行的病原体(Huang等,2020)。最初起源于中国武汉的传染病以惊人的速度蔓延到其他国家。截止到2020年5月5日,全球报告了3,517,345例病例,死亡人数为243,401人,该疾病仍然是公共卫生问题,并且可能威胁着世界各国和医疗系统的社会经济福利(世界卫生组织,2020新型冠状病毒(2019-nCoV) :形势报告,106)。

由于下一代测序(NGS)技术和分析方法的飞速发展,对病毒基因组进行测序已被认为是有助于诊断和治疗COVID-19并帮助了解疾病流行病学的可行工具。随着疾病的发展,SARS-CoV-2基因组的更多测序数据已在公共领域提供。迄今为止,已有来自不同地理起源的25,000多种SARS-CoV-2的可公开获得的基因组。系统发生原理以前已被成功地用来遏制和散布最近的大流行病,例如禽流感,寨卡病毒流行病和HIV(Salemi等,2008; Babakir-Mina等,2009; Angeletti等,2016)。 。随着测序数据的快速积累,系统发育分析和系统动力学分析是研究快速发展的RNA病毒进化模式的潜在强大工具,因此有助于了解爆发的流行病学。

可视化进化流行病学可以帮助您更深入地了解SARS-CoV-2的全球多样性。Nextstrain是一个开源项目,旨在提供快速发展的病原体的实时交互式可视化以及地理信息等其他数据(Hadfield et al。,2018)。Nextstrain利用奥格,生物信息学工具包,该系统ANALY的SIS基因组序列,并前兆,对于交互式Web服务的visualiz的通货膨胀分析结果。已创建此协议来帮助生物信息学家使用Nextstrain提供的强大的系统发育分析工具包,对SARS-CoV-2病原体进行流行病学理解。该协议中使用的数据和参数特定于SARS-CoV-2基因组;然而,Nextstrain为广义的工具包的ANALY的SIS病原体系统发育,并且可以使用适合于目标病原体的适当的数据和参数进行定制。该协议中使用的所有软件和数据集都可以在公共领域获得。

关键字:COVID-19, SARS-CoV-2, 系统发育分析, 基因组, 冠状病毒

设备

我们明确地假设用户有一定的经验与Linux的shell命令的工作-为基础的操作系统,具有超级用户权限。

计算要求
我们建议使用工作站或与64位Linux服务器-为基础的操作系统,具有荷兰国际集团8 GB RAM和足够的硬盘空间(至少250 GB)来存储使用和生产这种分析的文件。此分析协议中给出的命令已在Ubuntu(18.04 LTS)Linux发行版上进行了验证。


软件


必备软件
该协议使用以下工具和Nextstrain软件执行系统发育分析:


Docker引擎(https://www.docker.com/)
水蟒(https://www.anaconda.com/)
Nextstrain(Hadfield等人,2018)
Augur(Hadfield等人,2018)
MAFFT(Katoh和Standley,2013年)
智商(Nguyen et al。,2015)
所有必要的工具和他们的dependen TS必须在分析之前进行安装。


数据集
该协议使用了全球共享所有流感数据倡议(GISAID)提供的SARS-CoV-2基因组序列数据集(Shu和McCauley,2017)。


下一节给出了此协议中使用的所有工具的安装步骤以及下载必需数据集的说明。


程序


图1总结了此协议中涉及的各个步骤以及每个步骤中使用的A ugur模块。


下载和我nstall荷兰国际集团所需的软件工具和数据集


安装Docker引擎
Docker是一种基于虚拟化的开源技术,用于以容器的形式开发和运行软件应用程序。该泊坞引擎可以使用以下的命令安装:


sudo apt-get更新


sudo apt-get install apt-transport-https ca-certificates curl gnupg-agent software-properties-common


curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt键添加-


须藤apt键指纹0EBFCD88


sudo add-apt-repository“ deb [arch = amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs)稳定”


sudo apt-get更新


sudo apt-get install docker-ce docker-ce-cli containerd.io


要激活和测试Docker安装,请执行以下命令:


sudo groupadd泊坞窗


sudo usermod -aG泊坞窗$ USER


newgrp码头工人


docker运行hello-world






图1.该协议中描述的不同步骤以及每个分析步骤中使用的A ugur模块


安装Anaconda
Anaconda是P ython的开源发行版,可简化P ython程序包和环境的管理。要安装Anaconda,请使用以下命令:


wget https://repo.anaconda.com/archive/Anaconda3-2020.02-Linux-x86_64.sh


bash Anaconda3-2020.02-Linux-x86_64.sh


按照屏幕上的说明继续安装。您可以在安装脚本中显示的目录中找到anaconda3文件夹。您可以通过运行以下命令来激活和测试安装:


来源〜/ .bashrc


康达清单


安装Nextstrain-CLI
Nextstrain是P ython软件包,可以使用pip进行安装。


python3 -m pip安装nextstrain-cli


要检查是否Nextstrain已成功安装,请使用以下命令:


nextstrain版本


输出中显示的版本号应为1.16.1或更高。


安装Augur
Augur是Nextstrain提供的用于系统发育分析的工具包。Augur也可以作为P ython软件包提供,并且可以使用以下命令进行安装:


python3 -m pip安装nextstrain-augur


安装MAFFT
MAFFT(多序列比对利用快速傅里叶变换)是通过奥格需要执行多个-序列比对小号。要下载并安装个是工具,使用以下命令:


须藤apt-get install mafft


安装IQ-TREE
IQ-TREE是使用系统发育数据构建最大似然树的开源工具。Augur需要IQ-TREE才能从序列数据构建系统发育树。要安装IQ-TREE,请使用以下命令:


须藤apt-get install iqtree


建议使用IQ-TREE版本1.6.1(为Ubuntu 18.04 LTS安装的默认版本)或更高版本。


下载的SARS冠状病毒-2序列数据集
全球共享所有流感数据倡议(GISAID)是SARS-CoV-2基因组序列的最新公共存储库。对于此系统发生聚类协议,我们从GISAID下载了约15,000个完整的(截至2020年5月1日)SARS-CoV-2基因组序列的数据集。可以通过注册GISAID帐户来访问该数据库。牛浦n成功激活,序列数据集可以通过登录到全球共享禽流感数据倡议组织EpiCoV下载TM数据库并导航到浏览选项(https://www.epicov.org/epi3/frontend)。


要建立由预言者所需的元数据文件,你还需要下载确认表由全球共享禽流感数据倡议组织所提供的所有意见,这也可以在浏览网页上找到。


下载SARS-CoV-2参考基因组
在进行分析之前,您还需要从GenBank(.gb)中的NCBI下载SARS-CoV-2的参考基因组。为了进行此分析,我们下载了登录号为MN908947.3。的基因组。


准备我NPUT ˚F尔斯
要将Nextstrain用于系统发育分析和可视化,您需要准备以下输入文件(表1):


表1.运行分析管道中不同步骤所需的输入文件列表


文件


描述


所需的输入文件


序列法


以FASTA格式分析SARS-CoV-2序列的集合


metas.tsv


制表符分隔的文本文件,描述序列s.fasta文件中的所有序列


clades.tsv


制表符分隔含分支定义文本文件下载编从该Nextstrain GitHub的库


MN908947.gb


Genbank格式的SARS-CoV-2参考基因组


其他配置文件


auspice_config.json


JSON格式的文本文件,指定可视化设置


lat_longs.tsv


制表符分隔的文本文件,用于显示地理特征


colors.tsv


制表符分隔的文件,其中包含元数据元素的十六进制颜色代码


可选配置文件


include_file


文本文件,包含要包括在分析中的序列名称,而不考虑其他子采样标准


序列法
单个FASTA文件包含要分析的病原体序列的集合。为了便于分析,我们我们编从全球共享禽流感数据倡议组织下载的序列数据集。FASTA文件中的每个序列都应将病毒的毒株ID作为序列头。为FASTA文件的样本序列记录被示于图2。






图2.样本记录的冠状-19 /印度/ 1-27 / 2020 SARS-COV2在sequences.fasta˚F应变ORMAT 。


metas.tsv
制表符分隔的元数据文件,它描述FASTA文件中给定的序列。元数据文件中包含的各个字段如下:


必填字段:应变,病毒,日期
对于在sequences.fasta文件中的每个应变ID ,应该有应变列下的条目中的元数据文件。


其他字段(如果使用发布的数据):收录,作者,URL,标题,期刊,Paper_URL 。
推断祖先性状,附加信息等领域的区域,国家,州,被列入元数据和城市需要的文件。
该通知在元数据文件的各个领域的通货膨胀可以采取从确认表由全球共享禽流感数据倡议组织下载。甲小号充足米etadata小号preadsheet这里链接为补充数据1。


clades.tsv
此文件是必需的附加的银行足球比赛分支标签的系统发育树。该文件指定了特定于该病毒进化枝的突变(氨基酸或核苷酸)(图3)。clades.tsv文件应包含以下字段:


进化枝:描述进化枝的名称。
基因:突变所在的基因的名称(f或核苷酸变化,基因名称应为“ nuc”)。
现场:突变的位置与在基因组中。
alt:在该位置发现的突变的氨基酸或核苷酸。
对于此分析,我们使用Nextstrain(https://github.com/nextstrain/ncov)定义的SARS-CoV-2基因组的进化枝定义。






图3. Nextstrain为SARS-CoV-2基因组提供的clades.tsv文件的摘要屏幕快照。


auspice_config以.json
需要此文件来设置各种显示选项以进行可视化。甲小号充足Ç onfig ˚F ILE这里链接为补充数据2。


lat_longs.tsv
包含所有地区,国家,州纬度和经度甲制表符分隔的文件,并在该数据集的城市(图4)。该文件将用于在可视化过程中显示地理特征。






图4. Nextstrain所需的lat_longs.tsv文件的摘要屏幕快照,用于可视化地理特征。


质量一ssessment
在此可视化中,我们还希望将数据集中的高质量FASTA序列与低质量的FASTA序列分开。因此,我们加编一个额外的领域,“质量,”以元数据文件。以下质量指标定义了高质量的序列:


成对比对后与参考基因组的同一性百分比:> 99%
间隙的百分比的对准:<1%
序列中N(未知核酸残基)碱基的百分比:<1%
序列中没有简并碱基
基于上述标准,“质量”的元数据字段能够保存的值,“高,”“低,”和“未评估。'


以可视化的质量评估,我们创建ð一个额外的配置文件“colors.tsv ,”含有六角制表符分隔的文件编码的序列质量字段的每个值,你要代表。在该分析中,高品质的显示在绿,低质量的红色,和未经评估序列中通过“colors.tsv”文件(图5)指定为所需的颜色对应的十六进制代码黄色。






图5.为查看序列质量而创建的colors.tsv文件的摘要屏幕快照。


数据分析


由于易读性和性能限制,Nextstrain在一个视图中只能处理约3,000个序列。因为我们有一组〜15000基因组序列的工作,我们subsampl ED我们的数据和analyz编他们专注于一个单独的地理区域(即,印度)。



筛选顺序
可以根据某些条件过滤输入序列集,并使用此命令进行二次采样。以下命令将过滤基于他们的提交日期和组的SARS-COV2序列米的国家,每年,和一个月。日所有序列之前,2013年或拥有丢失日期记录将被丢弃。全球数据还将每月每个国家被子采样为100个序列。


预告片过滤器--sequences --metadata --output -按国家/地区年份月份分组-每个组的序列100-最小日期2013


专注于特定的地理区域,所述过滤器命令还包含参数,以帮助包括或排除某些序列的分析:


--include 此约束可用于包括序列,而与其他子采样标准无关。对于此分析,include_file将包含hCoV-19 / Wuhan / WH01 / 2019行,因为我们将使用此基因组作为系统发育树的根。您想要包括在分析中的任何其他序列的名称都可以添加到该文件中。


--exclude-where <条件>此约束将用于将分析重点放在特定区域上。


要对单个地理区域的数据集进行二次采样,请使用以下命令:


预言过滤器--sequences --metadata --output --exclude-where country!=印度--include


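Aligning with the reference genome
Augur uses MAFFT to perform multiple-sequence alignments. To create an alignment file using Augur, use the following command:


augur align --sequences --reference-sequence --output --nthreads <2> --remove-reference --fill-gaps


For the analysis focused on a geographic region, use the following command:


augur align --sequences --reference-sequence --output --nthreads <2> --remove-reference --fill-gaps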


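Building the phylogenetic tree
Augur uses IQ-TREE as the default software to build a phylogenetic tree from the multiple-sequence alignment file. Branch lengths in the tree are a measure of nucleotide divergence. The following command generates a phylogenetic tree in Newick format (.nwk):


augur tree --alignment --output --nthreads <4>


For the analysis focused on a geographic region, use the following command:


augur tree --alignment --output --nthreads <4>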


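Refining the phylogenetic tree
The raw tree built in the previous step can be processed further by Augur, which uses TreeTime to adjust the branch lengths according to the sampling dates of the sequences. For this analysis, we specified the root of the tree explicitly by passing the sequence name hCoV-19/Wuhan/WH01/2019 to the --root parameter of the refine command. The --clock-rate parameter was used to run the analysis with a fixed evolutionary rate in order to produce a robust time-resolved phylogeny, and the --clock-filter-iqd parameter filters out genomes that do not follow the evolutionary rate, or molecular clock. For SARS-CoV-2 genomes, the rate was fixed at 0.0008 (8 × 10^-4) substitutions per site per year. To generate the time-resolved tree, use the following command:


augur refine --tree --alignment --metadata --output-tree --output-node-data --root hCoV-19/Wuhan/WH01/2019 --timetree --clock-rate 0.0008 --clock-std-dev 0.0004 --coalescent skyline --date-inference marginal --divergence-units mutations --date-confidence --no-covariance --clock-filter-iqd 4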
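To get a feel for what the fixed clock rate means in practice, it can be converted into an expected number of substitutions per genome. The short calculation below is a worked example rather than part of the pipeline. The genome length of 29,903 nucleotides is an assumption (the length of the Wuhan-Hu-1 reference, NC_045512.2), and the function name is hypothetical.

```python
CLOCK_RATE = 0.0008    # substitutions per site per year (value passed to --clock-rate)
GENOME_LENGTH = 29903  # assumption: length of the Wuhan-Hu-1 reference, NC_045512.2


def expected_substitutions(years, rate=CLOCK_RATE, length=GENOME_LENGTH):
    """Expected substitutions accumulated along one lineage over the given time."""
    return rate * length * years


per_year = expected_substitutions(1.0)
per_month = expected_substitutions(1.0 / 12)
print(round(per_year, 1), round(per_month, 1))  # 23.9 2.0
```

Under these assumptions a lineage accumulates roughly two substitutions per month, so a genome that sits far from this expectation for its sampling date is a candidate for removal by the clock filter.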


For the analysis focused on a geographic region, use the following command:


augur refine --tree --alignment --metadata --output-tree --output-node-data --root hCoV-19/Wuhan/WH01/2019 --timetree --clock-rate 0.0008 --clock-std-dev 0.0004 --coalescent skyline --date-inference marginal --divergence-units mutations --date-confidence --no-covariance --clock-filter-iqd 4


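Annotating ancestral traits
Augur can use the time tree to infer the region and country of all internal nodes. The ancestral traits of all nodes can be annotated using the following command:


augur traits --tree --metadata --output --columns region country --confidence --sampling-bias-correction 2.5


For the analysis focused on a geographic region, use the following command:


augur traits --tree --metadata --output --columns city --confidence --sampling-bias-correction 2.5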


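Inferring ancestral sequences and nucleotide mutations
The following command identifies the nucleotide mutations on the branches of the tree and infers the ancestral sequence at each node:


augur ancestral --tree --alignment --output-node-data --inference joint --infer-ambiguous


For the analysis focused on a geographic region, use the following command:


augur ancestral --tree --alignment --output-node-data --inference joint --infer-ambiguous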


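Inferring amino acid mutations
The following command identifies amino acid mutations using the reference genome and the ancestral sequences:


augur translate --tree --ancestral-sequences --reference-sequence --output


For the analysis focused on a geographic region, use the following command:


augur translate --tree --ancestral-sequences --reference-sequence --output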


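Identifying clades
The following command labels the clades in the dataset using the nucleotide and amino acid mutations specified in the clades.tsv file:


augur clades --tree --mutations --clades --output-node-data


For the analysis focused on a geographic region, use the following command:


augur clades --tree --mutations --clades --output-node-data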


Exporting output files for visualization
The following command exports all output files generated in the previous steps of the analysis as a single JSON file for visualizing the data with Nextstrain:


augur export v2 --tree --metadata --node-data --auspice-config auspice_config.json --lat-longs lat_longs.tsv --colors colors.tsv --output auspice/COVID_global.json


For the analysis focused on a geographic region, use the following command:


augur export v2 --tree --metadata --node-data --auspice-config auspice_config.json --lat-longs lat_longs.tsv --colors colors.tsv --output auspice/COVID_india.json
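Before opening the exported file in a browser, it can be useful to confirm how many genomes it contains, given the ~3,000-sequence limit mentioned earlier. The sketch below walks the tree of a version 2 Auspice dataset and lists its tips. It assumes the top-level "tree" entry is a single nested object whose nodes carry "name" and optional "children" keys; the tiny dataset shown is hypothetical and stands in for a file such as auspice/COVID_india.json loaded with json.load.

```python
def tip_names(dataset):
    """Return the names of all leaf nodes in an Auspice v2 dataset dictionary."""
    names, stack = [], [dataset["tree"]]
    while stack:  # iterative traversal avoids deep recursion on large trees
        node = stack.pop()
        children = node.get("children", [])
        if children:
            stack.extend(reversed(children))
        else:
            names.append(node["name"])
    return names


# Hypothetical miniature dataset; node names other than the root genome are invented.
dataset = {
    "version": "v2",
    "meta": {},
    "tree": {
        "name": "NODE_1",
        "children": [
            {"name": "hCoV-19/Wuhan/WH01/2019"},
            {"name": "NODE_2", "children": [{"name": "sample_a"}, {"name": "sample_b"}]},
        ],
    },
}
print(len(tip_names(dataset)), tip_names(dataset))
```

For a real export, replace the in-memory dictionary with the parsed contents of the JSON file.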


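Viewing the data
To visualize the output, use the following command:


nextstrain view auspice/ --allow-remote-access


This command launches the Auspice server on port 4000. The output can then be visualized in a browser by navigating to http://127.0.0.1:4000/ or, using the IP address of the machine on which the Auspice service is running, to http://IP_ADDRESS_OF_MACHINE:4000/. The different subsampled datasets can be found under the "Dataset" dropdown menu (Figure 6).


Note: For the links to work, users need to follow the steps given in the protocol. The hyperlinks correspond to a local server run through Auspice (installation and instructions are detailed in this protocol), which lets users view the phylogeny on their own system through a browser.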






Figure 6. Screenshots of the visualizations generated by Nextstrain for the COVID_global and COVID_india datasets.


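Acknowledgments


This protocol was adapted from the Nextstrain project (Hadfield et al., 2018). The authors acknowledge help from Mukta Poojary and Aastha V in evaluating the protocol. The present work was funded by the Council of Scientific and Industrial Research (CSIR India) through a grant to Vinod Scaria, CSIR-IGIB. BJ acknowledges a GATE fellowship from the Council of Scientific and Industrial Research. The funders had no role in the preparation of the manuscript or the decision to publish. The authors declare no competing interests.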


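References


Angeletti, S., Lo Presti, A., Giovanetti, M., Grifoni, A., Amicosante, M., Ciotti, M., Alcantara, L. J., Cella, E. and Ciccozzi, M. (2016). Phylogenesys and homology modeling in Zika virus epidemic: food for thought. Pathog Glob Health 110(7-8): 269-274.
Babakir-Mina, M., Ciccozzi, M., Ciotti, M., Marcuccilli, F., Balestra, E., Dimonte, S., Perno, C. F. and Aquaro, S. (2009). Phylogenetic analysis of the surface proteins of influenza A (H5N1) viruses isolated in Asian and African populations. New Microbiol 32(4): 397-403.
Hadfield, J., Megill, C., Bell, S. M., Huddleston, J., Potter, B., Callender, C., Sagulenko, P., Bedford, T. and Neher, R. A. (2018). Nextstrain: real-time tracking of pathogen evolution. Bioinformatics 34(23): 4121-4123.
Huang, C., Wang, Y., Li, X., Ren, L., Zhao, J., Hu, Y., Zhang, L., Fan, G., Xu, J., Gu, X., et al. (2020). Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China. Lancet 395(10223): 497-506.
Katoh, K. and Standley, D. M. (2013). MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol 30(4): 772-780.
Nguyen, L. T., Schmidt, H. A., von Haeseler, A. and Minh, B. Q. (2015). IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol Biol Evol 32(1): 268-274.
Salemi, M., de Oliveira, T., Ciccozzi, M., Rezza, G. and Goodenow, M. M. (2008). High-resolution molecular epidemiology and evolutionary history of HIV-1 subtypes in Albania. PLoS One 3(1): e1390.
Shu, Y. and McCauley, J. (2017). GISAID: Global initiative on sharing all influenza data - from vision to reality. Euro Surveill 22(13).
Zhu, N., Zhang, D., Wang, W., Li, X., Yang, B., Song, J., Zhao, X., Huang, B., Shi, W., Lu, R., Niu, P., Zhan, F., Ma, X., Wang, D., Xu, W., Wu, G., Gao, G. F. and Tan, W. (2020). A novel coronavirus from patients with pneumonia in China, 2019. N Engl J Med 382(8): 727-733.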
Copyright: © 2021 The Authors; exclusive licensee Bio-protocol LLC.
Citation: Jolly, B. and Scaria, V. (2021). Computational Analysis and Phylogenetic Clustering of SARS-CoV-2 Genomes. Bio-protocol 11(8): e3999. DOI: 10.21769/BioProtoc.3999.