
Abridged version of this protocol
Mar 2013


HoSeIn: A Workflow for Integrating Various Homology Search Results from Metagenomic and Metatranscriptomic Sequence Datasets


Abstract

Data generated by metagenomic and metatranscriptomic experiments is both enormous and inherently noisy. When using taxonomy-dependent alignment-based methods to classify and label reads, the first step consists in performing homology searches against sequence databases. To obtain the most information from the samples, nucleotide sequences are usually compared to various databases (nucleotide and protein) using local sequence aligners such as BLASTN and BLASTX. Nevertheless, the analysis and integration of these results can be problematic because the outputs from these searches usually show inconsistencies, which can be especially pronounced when working with RNA-seq. Moreover, and to the best of our knowledge, existing tools do not criss-cross and integrate information from the different homology searches, but provide the results of each analysis separately. We developed the HoSeIn workflow to intersect the information from these homology searches, and then determine the taxonomic and functional profile of the sample using this integrated information. The workflow is based on the assumption that the sequences that correspond to a certain taxon are composed of:
1) sequences that were assigned to the same taxon by both homology searches;
2) sequences that were assigned to that taxon by one of the homology searches but returned no hits in the other one.

Keywords: Metagenomics, Metatranscriptomics, Next Generation Sequencing, Homology Search, Taxonomic Profile, Functional Profile

Background

The microbiome can be characterised and its potential function inferred using metagenomics, whereas metatranscriptomics provides a snapshot of the active functional (and taxonomic) profile of the microbial community by analysing the collection of expressed RNAs through high-throughput sequencing of the corresponding cDNAs (Marchesi and Ravel, 2015). Data generated by metagenomic and metatranscriptomic experiments is both enormous and inherently noisy (Wooley et al., 2010). The pipelines used to analyse this kind of data normally include three main steps: (1) pre-processing and (2) processing of the reads, and (3) downstream analyses (Aguiar-Pulido et al., 2016). Pre-processing mainly involves removing adapters, filtering by quality and length, and preparing data for subsequent analysis (Aguiar-Pulido et al., 2016). After pre-processing the reads, the next step (processing) consists in classifying each read according to the organism with the highest probability of being the origin of that read. This classification and labelling can be either taxonomy-dependent or independent. Taxonomy-dependent methods use reference databases, and these can be further classified as alignment-based, composition-based, or hybrid. Alignment-based methods usually give the highest accuracy but are limited by the reference database and the alignment parameters used, and are generally computation and memory intensive. Composition-based methods have not yet achieved the accuracy of alignment-based approaches, but require fewer computational resources because they use compact models instead of whole genomes (Aguiar-Pulido et al., 2016). Taxonomy-independent methods do not require a priori knowledge because they separate reads based on certain properties (distance, k-mers, abundance levels, and frequencies) (Aguiar-Pulido et al., 2016).

Once the reads have been classified or labelled as best as possible, downstream analyses (step 3) attempt to extract useful knowledge from the data, such as the potential (metagenomics) or active (metatranscriptomics) functional profile. There are various useful resources for the functional annotation of the genes to which the reads are mapped, such as functional databases–gene ontology (GO) (Ashburner et al., 2000; Blake et al., 2015), Kyoto Encyclopaedia of Genes and Genomes (KEGG) (Ogata et al., 1999; Kotera et al., 2015), Clusters of Orthologous Groups (COG) (Tatusov, 2000), InterPRO (Finn et al., 2017), SPARCLE (Marchler-Bauer et al., 2017), and SEED (Overbeek et al., 2014)–and other tools that can also be used to obtain functional profiles. Among the latter, some are web-based, such as MG-RAST (Glass and Meyer, 2011) and IMG/M (Markowitz et al., 2012), and others are standalone programs, like MEGAN (Huson et al., 2007). MEGAN uses the NCBI taxonomy to classify the results from the homology searches, and uses reference InterPRO (Finn et al., 2017), EggNOG (Powell et al., 2012), KEGG (Ogata et al., 1999) and SEED (Overbeek et al., 2014) databases to perform functional assignment.

The same suite of tools can be used to perform taxonomic assignments of metagenomic and metatranscriptomic data. Nevertheless, in both cases the same limitations are encountered, including algorithms that have to process large volumes of data (short reads), and the paucity of reference sequences in the databases. Additionally, most of these tools only use a subset of available genomes or focus on certain organisms, and many do not include eukaryotes. On the other hand, there are major differences in how each workflow determines the taxonomic profile, because some perform searches against protein databases, whereas others do so in nucleotide space (a review can be found in Shakya et al., 2019). Our HoSeIn workflow (from Homology Search Integration) centres on the processing and downstream analyses steps, and we developed it for use with taxonomy-dependent alignment-based methods (Video 1). As we already mentioned, the latter use homology searches against sequence databases as the first step to classify and label reads. To obtain as much information as possible from the samples, the nucleotide datasets are compared to nucleotide and protein databases using local sequence aligners such as BLAST (Altschul et al., 1990) or FASTA (Pearson, 2004). Nevertheless, once the homology searches are complete, the analysis and integration of these results can be problematic because the outputs from these searches usually show differences and inconsistencies, which can be particularly pronounced when working with RNA-seq (Video 1 and Figure 1). On the one hand, amino acid-based searches can detect organisms distantly related to those in the reference database but are prone to false discovery. In contrast, nucleotide searches are more specific but are unable to identify insufficiently conserved sequences. Consequently, taxonomic and functional profiles should be carefully interpreted when they are assigned using one or the other.
For example, assignments using searches against nucleotide databases, especially for protein coding genes, are likely to be less effective if no near neighbours exist in the reference databases. In this respect, and to the best of our knowledge, existing tools do not intersect the information from the different homology searches to integrate their results, but provide the results of each analysis separately. We developed the HoSeIn workflow to criss-cross the information from both homology search results (nucleotide and protein) and then perform final assignments on the basis of this integrated information. Sequences are assigned to a certain taxon either if they were assigned to that taxon by both homology searches, or if they were assigned to that taxon by one of the homology searches but returned no hits in the other one (Video 1 and Figure 1). Specifically, our workflow extracts all the available information for each sequence from the different tools that were used to process the dataset (homology searches and whatever method was used to classify and label the sequences, for example MEGAN [Huson et al., 2007]), and uses it to build a local database. The data for each sequence is then intersected to define the taxonomic profile of the sample following the above-mentioned criteria. Consequently, the main novelty of our workflow is that final assignments integrate results from both homology searches, capitalising on their strengths and thus making them more robust and reliable (Video 1). For metatranscriptomics in particular, where results are difficult to interpret, this represents a very useful tool.
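The decision rule just described can be sketched with standard shell tools. The two best-hit tables below are invented for illustration only (in the actual workflow the criss-crossing is done per taxonomic rank in an SQLite database; see Procedure):

```shell
# Toy best-hit tables (contig taxon, or NO_HIT); hypothetical data for illustration.
cat > nucl_hits.txt <<'EOF'
Contig1 Enterococcus
Contig2 Spodoptera
Contig3 NO_HIT
Contig4 Enterococcus
EOF
cat > prot_hits.txt <<'EOF'
Contig1 Enterococcus
Contig2 NO_HIT
Contig3 Pseudomonas
Contig4 Bacillus
EOF

sort nucl_hits.txt > nucl.sorted
sort prot_hits.txt > prot.sorted

# join gives: contig  nucl_taxon  prot_taxon
join nucl.sorted prot.sorted | awk '
  $2 == $3       { print $1, $2; next }  # both searches agree -> assign
  $2 == "NO_HIT" { print $1, $3; next }  # hit only in the protein search
  $3 == "NO_HIT" { print $1, $2; next }  # hit only in the nucleotide search
                 { print $1, "REVISE" }  # conflicting hits -> revise manually
' > assigned.txt
cat assigned.txt
```

Agreement or a single-sided hit yields an assignment, while two conflicting hits are flagged (here as REVISE) for manual inspection.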


Video 1. Homology Search Integration (HoSeIn) workflow abstract video: This 6 min teaser gives a quick overview of the background context and modus operandi of the HoSeIn workflow.



Figure 1. Rationale behind the HoSeIn workflow. A. There are various methods for determining the taxonomic profile of a microbiome in a sequencing-based analysis, and these can be taxonomy dependent or independent (see text for details). B. When using taxonomy-dependent alignment-based methods to analyse metagenomic or metatranscriptomic datasets (1), these are usually compared to nucleotide and protein databases using local sequence aligners such as BLAST (Altschul et al., 1990) or FASTA (Pearson, 2004) (2). Nevertheless, the analysis and integration of these results can be problematic because the outputs from these searches usually show inconsistencies (3). C. The HoSeIn workflow intersects the information from both homology search results and final assignments are determined on the basis of this integrated information. In this way, sequences are assigned to a certain taxon if they were assigned to that taxon by both homology searches (1), or if they were assigned to that taxon by one of the homology searches but returned no hits in the other one (2 and 3).

Equipment

  1. Desktop computer with an Intel Core i7 2600 processor (3.40 GHz, 8 MB cache, 4 cores, 8 threads, integrated graphics and Turbo Boost); Intel DH67BL motherboard, LGA 1155 socket; 7.1+2 channel audio; 1 Gb/s Ethernet; RAID 0, 1, 5 and 10; and four Kingston 1,333 MHz DDR3 4 GB memory modules

Software

  1. Ubuntu 18.04.3 LTS (Ubuntu, https://ubuntu.com/#download), last accessed on 11/9/2019 
  2. BLAST (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/), blast-2.2.25+ last accessed 2/7/2013
    Note: FASTA programs (FASTA DNA:DNA and FASTX) can also be used for the homology searches. Nevertheless, BLAST and FASTA programs represent a major computational bottleneck when aligning high-throughput datasets against protein databases, and different tools have recently been developed to improve performance. In particular, DIAMOND is an open-source sequence aligner for protein and translated DNA searches which performs at 500x-20,000x the speed of BLAST, is suitable for running on standard desktops and laptops, and offers various output formats as well as taxonomic classification (Buchfink et al., 2015). Thus, when aligning large datasets against protein databases with limited computational resources, we recommend using DIAMOND.
  3. MEGAN6 (http://ab.inf.uni-tuebingen.de/software/megan6/); MEGAN_Community_windows-x64_6_17_0 version last accessed on 18/9/2019
    In this tutorial MEGAN is used to process the homology search output files and then extract the taxonomic and functional information. For downloading and installing this software:
    1. Go to the MEGAN website (http://ab.inf.uni-tuebingen.de/software/megan6) and download the MEGAN6 version that matches your Operating System, as well as the corresponding mapping files
    2. Run the installer
  4. DB Browser for SQLite (DB4S) (https://sqlitebrowser.org/); DB.Browser.for.SQLite-3.11.2-win64 version last accessed on 5/9/2019
    DB Browser for SQLite (DB4S) is a high quality, visual, open source tool used to create, design, and edit database files compatible with SQLite. It uses a familiar spreadsheet-like interface, and does not require learning complicated SQL commands. In our workflow we use DB4S to create a local database that includes all the available information for each sequence from the dataset. All this data is then used to define the taxonomic and functional profile of the sample. For downloading and installing this software:
    1. Download the DB4S version that matches your Operating System from the website (https://sqlitebrowser.org/)
    2. Run the installer

Procedure

Note: This tutorial describes the global procedure for analysing high-throughput metatranscriptomic sequences from an environmental sample, and focuses on how to define its taxonomic and functional profile in a robust and reliable way.
  It does not include a detailed description of the pre-processing of high-throughput sequences obtained from an environmental sample (for this, see Kim et al., 2013; Aguiar-Pulido et al., 2016), nor on how to use MEGAN (for this, see Huson et al. [2007 and 2011] and the MEGAN user manual).
  Below we provide a detailed tutorial to show how HoSeIn works, exemplifying with one of our samples, a sequence dataset obtained from the gut of a lepidopteran larva. The analysis of the metatranscriptomic part of this dataset was recently accepted for publication (Rozadilla et al., 2020). As this type of analysis is often dictated by the goals of the experiment (Shakya et al., 2019), a few remarks follow to explain certain distinctive features of this particular sample and its subsequent analysis. Spodoptera frugiperda (Lepidoptera: Noctuidae) is an economically important agricultural pest native to the American continent. The purpose for analysing this pest was to describe the taxonomic and functional profile of the larval gut transcriptome and associated metatranscriptome to identify new pest control targets. For this, total RNA was extracted from fifth instar larval guts, submitted to a one-step reverse transcription and PCR sequence-independent amplification procedure, and then pyrosequenced (McCarthy et al., 2015); the high-throughput reads were later assembled into contigs (Rozadilla et al., 2020). As we were interested in identifying, differentiating and characterising both the host (S. frugiperda) gut transcriptome and its associated metatranscriptome, we downloaded the following NCBI databases to perform the homology searches locally (ftp://ftp.ncbi.nlm.nih.gov/blast/db/) (download_db.mp4, a video tutorial that shows how to download different types of database files from NCBI, and download_db.sh, a bash script that automatically downloads these database files one by one, are provided as Supplementary Material 1):

1) Nucleotide:

–“Non-redundant” nucleotide sequence (nt)

–16S rRNA gene (16S)

–Lepidopteran whole genome shotgun (Lep) projects completed at the time of the analysis.

Sequences from nt, 16S and Lep, were then combined in a single database (DB:nt16SLep) using the appropriate BLAST+ applications (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/) (blast_tutorial.mp4, a video tutorial that shows how to build and combine different databases and how to run a homology search locally with BLAST, and blast_commands.txt, which contains the commands used in the tutorial, are provided as Supplementary Material 2). The Lep sequences in the combined nucleotide database simplified the identification of host sequences (which represented the majority), and the nt and 16S databases enabled the identification of the associated metatranscriptome (and of host sequences).

2) Protein:

–Non-redundant protein sequence (nr)


Below follows an outline of the main steps included in our workflow (Figure 2; see the tutorial for details):
  1. Analyse sequences with local sequence aligners: Contigs are compared locally to the combined nucleotide database (nt16SLep) using BLASTN (Altschul et al., 1990) with a 1e-50 cutoff E-value, and to the protein database (nr) using BLASTX (Altschul et al., 1990) with a 1e-17 cutoff E-value (Supplementary Material 2).
    Note: Here we use BLASTX because the dataset we are querying is small (the aforementioned assembled reads, namely 737 contigs); but for large datasets and limited computational resources we recommend using DIAMOND (Buchfink et al., 2015).
  2. Process the homology search results:
    Step A (the stepA.mp4 video tutorial guides you through step-by-step): The output files from both homology searches are then processed with MEGAN, a software which performs taxonomic binning and assigns sequences to taxa using the Lowest Common Ancestor (LCA)-assignment algorithm (Huson et al., 2007). Taxonomic and functional assignments performed by MEGAN for each contig are then exported using a MEGAN functionality.
    Note: MEGAN computes a “species profile” by finding the lowest node in the NCBI taxonomy that encompasses the set of hit taxa and assigns the sequence to the taxon represented by that lowest node. With this approach, every sequence is assigned to some taxon; if the sequence aligns very specifically only to a single taxon, then it is assigned to that taxon; the less specifically a sequence hits taxa, the higher up in the taxonomy it is placed (see the “MEGAN User Manual” for a detailed explanation). We chose MEGAN because this software uses the LCA-assignment algorithm and has a straightforward functionality for exporting the taxonomic and functional information for each sequence from the dataset (i.e., the “species profile” for each sequence can easily be accessed and downloaded). Nevertheless, any other tool or platform that provides this same functionality (i.e., exporting the taxonomic/functional assignment for each sequence from the dataset) can also be used.
    Step B (the stepB.mp4 video tutorial guides you through step-by-step): The output files from both homology searches are also processed with a custom bash script. This script parses the homology search output files and generates two files (one for each homology search) containing the name of each contig, its best hit (or no hit) and the corresponding E-value.
  3. Create local database:
    Step C (the stepC.mp4 video tutorial found in Supplementary Material Step C guides you through step-by-step): All this information (from the exported MEGAN files and from the bash script output files) is then used to create a local SQLite database which includes all the available information for each contig (from both homology searches).
  4. Analyse the local database:
    Step D (the stepD.mp4 video tutorial found in Supplementary Material Step D guides you through step-by-step): Final taxonomic assignments are then performed by criss-crossing and comparing all this information using different SQLite commands.
    Step E (the stepE.mp4 video tutorial found in Supplementary Material Step E guides you through step-by-step): Transcript assignment is achieved by executing certain SQLite commands to group transcripts that correspond to mRNA, rRNA, those that cannot be assigned (not assigned), and those that have to be revised manually (Revise).
    Step F (the stepF.mp4 video tutorial found in Supplementary Material Step F guides you through step-by-step): Finally, functional assignments of transcripts that were classified by the functional databases are integrated in a single column by executing certain SQLite commands. Only transcripts in the “mRNA” and “Revise” categories can putatively be classified by the functional databases, but because the information in these reference databases is still considerably limited, only around a third of these are assigned a function. Functional assignment of the rest of these transcripts can be done manually on the basis of the homology search results (which are included in the local SQLite database) (see Data analysis).
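To make the Step D criss-crossing concrete, here is a minimal SQLite sketch at a single rank (genus). The layout loosely follows the protocol's naming convention (final_*, state_taxo), but the table, data and commands below are invented for illustration; the actual commands are provided in step_D_crisscrossing_taxonomy.txt:

```shell
rm -f toy.db   # start from a clean toy database
sqlite3 toy.db <<'EOF'
CREATE TABLE taxonomy (
  contig TEXT PRIMARY KEY,
  nucl_genus TEXT,                 -- genus from the BLASTN-based search
  prot_genus TEXT,                 -- genus from the BLASTX-based search
  final_genus TEXT,
  state_taxo TEXT DEFAULT 'unassigned'
);
INSERT INTO taxonomy (contig, nucl_genus, prot_genus) VALUES
  ('Contig1', 'Enterococcus', 'Enterococcus'),
  ('Contig2', 'Spodoptera',   NULL),
  ('Contig3', NULL,           'Pseudomonas'),
  ('Contig4', 'Enterococcus', 'Bacillus');

-- Rule 1: both homology searches assigned the same taxon
UPDATE taxonomy SET final_genus = nucl_genus, state_taxo = 'assigned'
  WHERE nucl_genus = prot_genus AND state_taxo = 'unassigned';

-- Rule 2: assigned by one search, no hit in the other
UPDATE taxonomy SET final_genus = COALESCE(nucl_genus, prot_genus), state_taxo = 'assigned'
  WHERE (nucl_genus IS NULL OR prot_genus IS NULL) AND state_taxo = 'unassigned';

-- Conflicting assignments are left for manual revision
UPDATE taxonomy SET state_taxo = 'revise' WHERE state_taxo = 'unassigned';

SELECT contig, final_genus, state_taxo FROM taxonomy ORDER BY contig;
EOF
```

The state_taxo guard in each UPDATE mirrors the protocol's use of an auxiliary column to avoid multiple assignments of the same contig.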


    Figure 2. HoSeIn (Homology Search Integration) workflow. This figure shows an overview of the main steps that make up the workflow (see the tutorial for details): I. Analyse sequences with local sequence aligners: Sequences are submitted to homology searches against nucleotide (nt16SLep) and protein (nr) databases. II. Process results: Results from both homology searches are processed with MEGAN and with a custom bash script. Step A: Files containing the MEGAN taxonomic and functional assignments are exported using a MEGAN functionality. Step B: The custom bash script generates files containing the name of each sequence, its best hit (or no hit) and the corresponding E-value, for both homology searches. III. Create local database: Step C: The MEGAN and bash script files are then used to create a local SQLite database which includes all available data for each sequence. IV. Analyse local database: Final taxonomic and functional assignments are then performed by criss-crossing and comparing all this information using different SQLite commands. Step D: Taxonomic assignment is defined by using certain criteria and filters: 1) The “host filter” determines which sequences correspond to the S. frugiperda gut transcriptome; 2) the “E-value = 0 filter” for nt16SLep hits, groups those unequivocal hits; 3) the “Lowest Common Ancestor criterion” for sequences with nt16SLep hits showing E-value ≠ 0, defines the identity of the rest of the contigs by criss-crossing the results from both homology searches and retaining the lowest common ancestor assignments. Step E: Transcript assignment is achieved by searching for certain keywords in the hit description. In this way, it is possible to group transcripts that correspond to mRNA, rRNA, those that cannot be assigned (not assigned), and those that have to be revised manually (revise). Step F: Finally, functional annotation of all the sequences that correspond to mRNA (or “Revise”) is then integrated in a single column. 
N.A., Not assigned.
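The Step E keyword grouping can be illustrated outside the database as well. The hit descriptions below are invented; the real workflow performs the equivalent by searching for keywords in the hit descriptions stored in the local SQLite database:

```shell
# Toy %-separated hit file (name%best hit%E-value), as produced in Step B.
cat > hits.txt <<'EOF'
Contig1%Enterococcus mundtii 16S ribosomal RNA gene%1e-60
Contig2%Spodoptera frugiperda trypsin-like protease mRNA%3e-80
Contig3%%
EOF

# Classify each transcript by keyword in its hit description.
awk -F'%' '
  $2 == ""             { print $1 "%not_assigned"; next }  # no hit at all
  $2 ~ /ribosomal RNA/ { print $1 "%rRNA"; next }          # rRNA keyword
                       { print $1 "%mRNA" }                # everything else
' hits.txt > rna_type.txt
cat rna_type.txt
```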

We provide various files as Supplementary Material for the reader to be able to go through the tutorial and reproduce the same results we show below:
–A FASTA file containing the assembled sequences (Sf_TV_contigs.fasta) and a text file (coverage.csv) containing the assembly information for each contig (i.e., contig name, number of reads used to assemble each contig, read length and contig coverage), are provided as Supplementary Material 3;
–The output files from both homology searches in BLAST pairwise format (blastn_nt16SLep_total-contigs_Sf-TV.txt and blastx_nr_total-contigs_Sf-TV.txt) are provided as Supplementary Material 4;
–Two custom scripts written in bash that process the homology search results (search_parser.sh and analyser_blast.sh) are provided as Supplementary Material 5;
–The "RMA" files generated by MEGAN6 after processing the homology search output files (blastn_nt16sLep_total-contigs.rma6 and blastx_nr_total-contigs.rma6) are provided as Supplementary Material 6;
–Text files containing different commands to intersect, assign and analyse the data in the local SQLite database: step_C_creating_taxonomy.txt found in Supplementary Material Step C, step_D_crisscrossing_taxonomy.txt found in Supplementary Material Step D, step_E_assigning transcripts.txt found in Supplementary Material Step E, step_F_functional_assignment.txt found in Supplementary Material Step F, and analysing_taxonomy.txt.


HoSeIn Tutorial (also see Figure 2):

I. Analyse sequences with local sequence aligners: As mentioned previously, homology searches were performed locally using BLASTN and BLASTX (Altschul et al., 1990) against the combined nucleotide (nt16SLep) and protein (nr) databases with 1e-50 and 1e-17 cutoff E-values, respectively. The homology search results are found in the blastn_nt16SLep_total-contigs_Sf-TV.txt and blastx_nr_total-contigs_Sf-TV.txt files (Supplementary Material 4). The video tutorial download_db.mp4 shows how to download different types of database files from NCBI, and the bash script download_db.sh automatically downloads these database files one by one (Supplementary Material 1). The video tutorial blast_tutorial.mp4 shows how to build and combine different databases, and how to run a homology search locally with BLAST; the commands used in this video can be found in blast_commands.txt (Supplementary Material 2).
II. Process the homology search results: The output files from both homology searches were processed with MEGAN and saved as blastn_nt16sLep_total-contigs.rma6 and blastx_nr_total-contigs.rma6 (Supplementary Material 6).

  1. Export the taxonomic and functional assignments performed by MEGAN for both homology searches (Figures 3-9, StepA.mp4 video tutorial and Supplementary Figure S1):
    1. Use MEGAN to open the provided RMA files (blastn_nt16sLep_total-contigs.rma6 and blastx_nr_total-contigs.rma6 found in Supplementary Material 6). To open the RMA files, select File > Open and then browse to the desired file (Figure 3). The main window is used to display the taxonomy and to control the program using the main menus. Once a dataset has been processed, the taxonomy induced by that dataset is shown. The size of the nodes indicates the number of sequences that have been assigned to the nodes (see “MEGAN User Manual” for a detailed explanation).
    2. To extract the taxonomic information from MEGAN, the taxonomic tree must be progressively expanded from Domain to species, selecting all the leaves, and extracting the text files in csv (comma-separated values) format. Choose the taxonomic level that you wish to extract (from Domain to Species): “Tree” > “Rank” (Figure 4).
    3. To select the leaves: “Select” > “All Leaves” (Figure 5).
    4. Without deselecting the leaves, export the file to csv format: “File” > “Export” > “CSV Format” (Figure 6).
    5. Choose what data you want to export and how it will be delimited in the csv file (choose “summarized” so it exports the sequences contained in the chosen taxonomic level as well as all the lower levels) (Figure 7). We recommend naming the file with a representative name indicating the type of homology search and the taxonomic level, for example “nucl_domain”.
    6. In this way a csv text file is obtained (which can be viewed in a basic text editor such as WordPad). Each of these files has two fields, one with the sequence name and another with the corresponding assigned taxonomic level (Domain, Phylum, etc.) (Figure 8).
    7. Repeat this procedure for each taxonomic level, and for the other homology search. In this way, 14 files are obtained (7 files for each homology search), each one corresponding to a particular taxonomic level (Figure 9). 
    8. The procedure for extracting the functional information is very similar (see Supplementary Figure S1 and StepA.mp4 video tutorial): 1) Choose the functional tree you want to visualise, e.g., InterPro2GO (Figure S1A.1); 2) Uncollapse all nodes: “Tree” > “Uncollapse All” (Figure S1A.2); 3) “Select” > “All Leaves” (Figure S1A.3); 4) Without deselecting the leaves, export the file to csv format: “File” > “Export” > “CSV Format” (Figure S1A.4); 5) Choose what data will be exported: “readName_to_interpro2goPath” (Figure S1A.5). Repeat the procedure for all the functional trees you want to include in the final analysis; for this tutorial we also exported the EggNOG assignments (Figure S1B).


      Figure 3. Opening the provided .rma6 files in MEGAN. A. Select “Open” from the “File” menu (red rectangle). B. Select one of the provided .rma6 files from the location where you saved it (blastn_nt16sLep_total-contigs.rma6 was selected here; red rectangle). C. Appearance of blastn_nt16sLep_total-contigs.rma6 collapsed to “species” taxonomic rank.


      Figure 4. Selecting a taxonomic rank. A. Select “Rank” from the “Tree” menu (red rectangle). B. Tick the desired rank from the dropdown menu (the red arrow indicates that “Domain” was selected here).


      Figure 5. Selecting all the leaves from the chosen taxonomic rank. Select “All Leaves” from the “Select” menu (red rectangle).


      Figure 6. Exporting in csv format. Select “Export” from the “File” menu, and choose “Text (CSV) format” from the dropdown menu (red rectangle).


      Figure 7. Selecting the way in which data will be exported. Select “readName_to_taxonName” from the “Choose data to export” dropdown menu, “summarized” from the “Choose count to use” dropdown menu, and “tab” from the “Choose separator to use” dropdown menu (red rectangle).


      Figure 8. Partial view of the exported “nucl_domain.csv” file. The first column contains the sequence name (e.g., Contig479), and the second column the taxonomic rank assigned to each sequence (in this figure “Domain”, e.g., “Bacteria”).


      Figure 9. Exported MEGAN files. Folder containing all 14 files exported from MEGAN, named according to the corresponding homology search and taxonomic level (red rectangle).
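Assuming exported files of the two-column form shown in Figure 8 (the file names and contents below are invented), the per-level exports can also be stitched together outside MEGAN, for example to eyeball one lineage line per contig before building the database. This two-file sketch uses only the domain and phylum levels:

```shell
# Invented stand-ins for two of the MEGAN exports (tab-separated: name, taxon).
printf 'Contig1\tBacteria\nContig2\tEukaryota\n'    > nucl_domain.csv
printf 'Contig1\tFirmicutes\nContig2\tArthropoda\n' > nucl_phylum.csv

awk -F'\t' '
  FNR == NR { dom[$1] = $2; next }           # first file: remember domain per contig
            { print $1 "\t" dom[$1] "\t" $2 }  # second file: append the phylum
' nucl_domain.csv nucl_phylum.csv > nucl_lineage.tsv
cat nucl_lineage.tsv
```

Extending the same idea across all seven level files yields one lineage table per homology search, which is essentially what the SQLite database assembles in Step C.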

  2. Parse the output files from the homology searches (Figure 10 and Supplementary Material stepB.mp4 video tutorial):
    This step must be carried out in a Linux Operating System because the scripts that parse the homology search results were written in bash. The scripts process the FASTA file (containing the query sequences) and the homology search output files:
    1. The provided scripts (search_parser.sh and analyser_blast.sh from Supplementary Material 5), FASTA file (Sf_TV_contigs.fasta from Supplementary Material 3) and output files from the homology searches (blastn_nt16SLep_total-contigs_Sf-TV.txt and blastx_nr_total-contigs_Sf-TV.txt from Supplementary Material 4), must all be placed in the same folder (Figure 10A).
    2. Of the two bash scripts we provide, only execute search_parser.sh. This script works by executing various analyser_blast.sh scripts simultaneously to speed up the process. Open a terminal in the folder that contains the files and execute the search_parser.sh bash script with its arguments strictly in the following order (also see example below and Figures 10B-10C):
      bash search_parser.sh (name and file extension of the query fasta) (name and file extension of the homology search result) (chosen name and file extension for the output file)

      bash search_parser.sh Sf_TV_contigs.fasta blastn_nt16Slep_total-contigs.txt nucl_hits.csv


      Figure 10. Using the bash script. A. The pale blue rectangle indicates the folder containing all the necessary files for the scripts to work correctly. B. The red rectangle indicates how to execute the script to process the BLASTN output file (homology search against the nucleotide database). C. The image shows the messages that appear in the terminal after processing the output files from the BLASTN homology search (1 and 2) and from the BLASTX homology search (3 and 4).

    3. The script generates a csv file with three fields separated by “%”: the first one shows the name of each sequence, the second shows its best hit (or nothing if there is no hit), and the third its corresponding E-value (or nothing if there is no hit) (Figure 11).


      Figure 11. Partial images of the csv files generated by the script. Files showing the listed BLASTN (A) and BLASTX hits (B). In both files, fields showing the sequence name, its best hit and the corresponding E-value, are separated by “%”.
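For readers who prefer to see the idea rather than the full scripts, the awk sketch below reproduces the three %-separated fields from a toy BLAST pairwise excerpt. It is a simplified, hypothetical stand-in for analyser_blast.sh: it only copes with the invented excerpt shown here, not with every variation of real BLAST output:

```shell
# Invented fragment of BLAST pairwise output: one query with hits, one without.
cat > toy_blast.txt <<'EOF'
Query= Contig1

Sequences producing significant alignments:                       (Bits) Value

gb|AB012345.1| Enterococcus mundtii 16S ribosomal RNA gene          200  1e-60
gb|CD678901.1| Enterococcus faecalis 16S ribosomal RNA gene         180  1e-55

Query= Contig2

***** No hits found *****
EOF

awk '
  /^Query=/ { if (q != "") print q "%" hit "%" ev   # flush the previous query
              q = $2; hit = ""; ev = ""; want = 0; next }
  /^Sequences producing/ { want = 1; next }          # hit list starts below
  /No hits found/        { want = 0; next }
  want && NF > 2 && hit == "" {                      # first listed hit = best hit
              ev = $NF                               # E-value is the last field
              hit = $0
              sub(/[ \t]+[^ \t]+[ \t]+[^ \t]+$/, "", hit)  # drop bits + E-value
              next }
  END       { if (q != "") print q "%" hit "%" ev }
' toy_blast.txt > nucl_hits.csv
cat nucl_hits.csv
```

Queries with no hits come out as a name followed by two empty fields, matching the format described in step 3 above.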

  3. Create the database in DB4S (Figures 12-18, and step_C_creating_taxonomy.txt and stepC.mp4 video tutorial found in Supplementary Material Step C):
    The file containing all the information for the assembled reads (coverage.csv found in Supplementary Material 3; includes contig name, number of reads used to assemble each contig, read length, and contig coverage), the exported MEGAN files (14 taxonomic files and 2 functional files from Step A) and the bash script output files (2 files from Step B), will now be used to create a local database with DB4S which will include all the available information for each contig (from both homology searches, BLASTN and BLASTX). To do this, first each csv file must be imported individually to DB4S:
    1. Create a new database clicking on “New Database” and choosing where to save it (Figure 12).
    2. Individually import the csv files that were exported from MEGAN (16 files), those that were created by the bash script (2 files), and the file containing the assembly information for the contigs (1 file): “File” > “Import” > “Table from CSV file” (Figure 13).
    3. Choose a name for the table and indicate field separator (for the files imported from MEGAN, Tab; for the files generated by the bash script, Other > %) (Figure 14). Only for coverage.csv, check the “Column names in first line” box, which will automatically give the columns their correct name (Figure 14C).
    4. We recommend renaming table columns with representative names as indicated in Figure 15 to be able to use the commands we provide for Steps C6, D, E and F (and also to simplify interpretation of the commands and avoid mistakes). 
    5. Figure 16 shows what the list of tables should look like after renaming them.
    6. The next step consists in integrating the information from all the imported files in a single new table, as indicated in Figures 17 and 18. The set of commands in step_C_creating_taxonomy.txt from Supplementary Material Step C, creates a new table named “taxonomy” that unifies the columns from all the imported csv files, and adds empty columns (final_domain, final_phylum, etc.) to be filled with the result of the subsequent taxonomic criss-crossing (Step D). It also adds auxiliary columns “state_taxo” (to indicate if the contig was taxonomically assigned or not and to avoid multiple assignments; Step D), “rna_type” (to indicate if the transcripts correspond to mRNA or rRNA; Step E), “state_function” (to indicate whether the transcripts were assigned or need to be revised; Step E), and “function_type” (to integrate the functional assignment of those transcripts that were classified by the functional databases; Step F).
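For readers who prefer the command line, the gist of this unification step can be sketched with Python's built-in sqlite3 module (the table and column names follow Figure 15; the rows are invented, and the authoritative commands are the ones in step_C_creating_taxonomy.txt):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
# Two imported tables (one per homology search), as after steps C2-C5; data invented.
cur.execute("CREATE TABLE nucl_domain (sequence TEXT, nucl_domain TEXT)")
cur.execute("CREATE TABLE prot_domain (sequence TEXT, prot_domain TEXT)")
cur.executemany("INSERT INTO nucl_domain VALUES (?, ?)",
                [("Contig1", "Bacteria"), ("Contig2", "Eukaryota")])
cur.executemany("INSERT INTO prot_domain VALUES (?, ?)", [("Contig1", "Bacteria")])
# Unify into a single "taxonomy" table; a LEFT JOIN keeps contigs that returned
# no hits in the other search. Then add an empty column for the final call.
cur.execute("""
    CREATE TABLE taxonomy AS
    SELECT n.sequence, n.nucl_domain, p.prot_domain
    FROM nucl_domain n LEFT JOIN prot_domain p ON n.sequence = p.sequence
""")
cur.execute("ALTER TABLE taxonomy ADD COLUMN final_domain TEXT")
print(cur.execute("SELECT * FROM taxonomy ORDER BY sequence").fetchall())
```

The real "taxonomy" table joins all 19 imported tables and adds the auxiliary columns described above; the sketch shows the pattern for a single rank only.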


      Figure 12. Creating a new database in DB4S. A. Click on “New Database” (red arrow). B. A window will appear requesting you to choose a filename and location to save the new database (red rectangle).


      Figure 13. Importing the csv files. A. Click on the “File” leaf (red arrow) and select “Import” and “Table from CSV file” from the dropdown menus (red rectangle). B. Select the csv file that you want to import (“nucl_domain.txt” in the figure; red rectangle).


      Figure 14. Naming the imported tables. Once you have selected a csv file, a new window will appear in which you have to name the table and indicate the field separator. A. For files imported from MEGAN the field separator is “Tab”; this table was named “nucl_domain” (red rectangle). B. For files generated by the bash script, the field separator is “Other” > “%”; this table was named “nucl_hit” (red rectangle). C. For the “coverage” file, check the “Column names in first line” box (indicated by the red arrow) and choose “Tab” as field separator; the pale blue rectangle indicates the assigned column names.


      Figure 15. Renaming table columns. A. In the main window, select the table that you want to modify with the mouse´s right button, and click on “Modify table” (red rectangle). B. A new window will open where the fields (which correspond to the column names) appear; double click on each to rename. For the tables obtained from the MEGAN files, rename “field1” as “sequence”, name the table and “field2” according to the database which was used for the homology search (nucl or prot) followed by the appropriate taxonomic rank (e.g., domain); in the figure, “nucl_domain” (red rectangles). C. The tables obtained from the files generated by the bash script have three columns: rename “field1” as “sequence”; rename “field2” as “nucl_hit” for the BLASTN hits, and as “prot_hit” for the BLASTX hits; rename “field3” as “nucl_Evalue” for the BLASTN hits, and as “prot_Evalue” for the BLASTX hits (red rectangles).
      Note: Columns from the “coverage” file were named in the previous step so they do not need to be modified.


      Figure 16. “Database Structure” window in DB4S listing all the imported and renamed tables


      Figure 17. Unifying the information from all the imported files to create a single table. 1) In the “Execute SQL” leaf, 2) paste the commands contained in step_C_creating_taxonomy.txt and 3) execute them with “Play”; 4) the lower panel indicates if the commands were executed successfully.


      Figure 18. View of the “taxonomy” table created in the previous step. A. In the “Database Structure” leaf (red arrow) a new “taxonomy” table will appear (red rectangle). B. To browse the new “taxonomy” table, select it from the dropdown menu in the “Browse Data” leaf (indicated by the red arrow and rectangle); the light blue rectangle shows a partial view of the columns that were created in “taxonomy”.

  4. Determining the taxonomic profile (Figure 19, and step_D_crisscrossing_taxonomy.txt and stepD.mp4 video tutorial found in Supplementary Material Step D):
    The next step consists in intersecting all the information contained in the “taxonomy” table to elucidate the taxonomic profile of the sample, based on the following criteria:
    1. Contigs that were assigned to Arthropoda in at least one of the homology searches are assigned to the host transcriptome.
    2. Contigs that have hits with E-value = 0 in the nt16SLep search, are directly assigned to that taxon.
    3. The rest of the contigs are assigned by comparing the MEGAN assignments from both homology searches according to the LCA logic; i.e., the level of taxonomic assignment for a contig is the one found in common for both results, or for the only result if it returns no hits in the other homology search.
    4. Contigs that were not assigned to any taxon by MEGAN in any of the homology searches are considered as “not assigned”; contigs that returned no hits in both homology searches are considered as “no hits”.
      These assignments are carried out by executing the commands in step_D_crisscrossing_taxonomy.txt in DB4S (Figure 19 and stepD.mp4 video tutorial from Supplementary Material Step D).
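The logic of criterion 3 (keep the assignment common to both searches, or the only available one) can be sketched in Python; the lineages and the helper's name are ours, not part of the protocol, whose actual SQLite commands are in step_D_crisscrossing_taxonomy.txt:

```python
def crisscross(nucl_lineage, prot_lineage):
    """Return the final lineage for one contig.

    Each lineage is a list from domain downwards, or None if that homology
    search returned no hits. Following the LCA logic, the final assignment is
    the deepest level on which both searches agree, or the single available
    result when the other search returned no hits.
    """
    if nucl_lineage is None and prot_lineage is None:
        return None  # "no hits" in both searches
    if nucl_lineage is None:
        return prot_lineage
    if prot_lineage is None:
        return nucl_lineage
    common = []
    for a, b in zip(nucl_lineage, prot_lineage):
        if a != b:
            break
        common.append(a)
    return common or None  # None -> disagreement even at domain level

# Both searches agree down to class (Insecta), so the final call stops there.
print(crisscross(["Eukaryota", "Arthropoda", "Insecta", "Noctuidae"],
                 ["Eukaryota", "Arthropoda", "Insecta", "Sphingidae"]))
```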


      Figure 19. Determining final taxonomic assignments in the “taxonomy” table. A. To perform these assignments: 1) in the “Execute SQL” leaf, 2) paste the commands contained in step_D_crisscrossing_taxonomy.txt and 3) execute them with “Play”; 4) the lower panel indicates if the commands were executed successfully. B. To view the updated “taxonomy” table, select it from the dropdown menu in the “Browse Data” leaf (indicated by the red arrow and rectangle); the light blue rectangle indicates the columns with the final taxonomic assignments after criss-crossing the taxonomic information from both homology searches (final_domain, final_phylum, etc.).

  5. Classifying the transcripts (Figure 20, and step_E_assigning transcripts.txt and stepE.mp4 video tutorial found in Supplementary Material Step E):
    As this sample was obtained from total RNA, the sequences can be further classified as either messenger or ribosomal RNA using the information contained in the hit description from the homology searches (“nucl_hit” and “prot_hit” columns), based on the following criteria:
    1. Contigs are assigned to mRNA if they show the word “mRNA” in the “nucl_hit” column.
    2. Contigs are assigned to rRNA if they show the words “rRNA/ribosomal RNA” in the “nucl_hit” column.
    3. Contigs are considered as functionally “Not assigned” if they show the words "genome/chromosome/scaffold/contig" or "uncharacterized/hypothetical/unknown/ncRNA" in the “nucl_hit” and “prot_hit” columns, respectively.
    4. All the rest of the contigs are included in a “Revise” category to be manually revised.
      These assignments are carried out by executing the commands found in step_E_assigning transcripts.txt in DB4S (Figure 20 and stepE.mp4 video tutorial from Supplementary Material Step E).
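An equivalent sketch of the keyword rules above (the function name and sample descriptions are ours; the protocol's own commands are in step_E_assigning transcripts.txt):

```python
def classify_transcript(nucl_hit, prot_hit):
    """Apply the keyword criteria above to one contig's hit descriptions."""
    nucl = (nucl_hit or "").lower()
    prot = (prot_hit or "").lower()
    if "mrna" in nucl:
        return "mRNA"
    if "rrna" in nucl or "ribosomal rna" in nucl:
        return "rRNA"
    if any(w in nucl for w in ("genome", "chromosome", "scaffold", "contig")):
        return "Not assigned"
    if any(w in prot for w in ("uncharacterized", "hypothetical", "unknown", "ncrna")):
        return "Not assigned"
    return "Revise"  # everything else is revised manually

print(classify_transcript("S. frugiperda mRNA for trypsin", None))      # mRNA
print(classify_transcript("Bacterium sp. 16S ribosomal RNA gene", None))  # rRNA
print(classify_transcript("Insect virus, complete genome", "hypothetical protein"))  # Not assigned
```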


      Figure 20. Determining transcript assignments in the “taxonomy” table. A. To perform these assignments: 1) in the “Execute SQL” leaf, 2) paste the commands contained in step_E_assigning transcripts.txt and 3) execute them with “Play”; 4) the lower panel indicates if the commands were executed successfully. B. To view the updated “taxonomy” table, select it from the dropdown menu in the “Browse Data” leaf (indicated by the red arrow and rectangle); the light blue rectangle indicates the columns with the final transcript assignments: “rna_type” and “state_function”. The “rna_type” column indicates what category the transcript was assigned to (i.e., “mRNA”, “rRNA”, “Not assigned” or “Revise”). An “Assigned” status in the “state_function” column appears when the transcripts are assigned to either “mRNA”, “rRNA” or “Not assigned”; if the transcript was included in a “Revise” category, it is “NULL”.

  6. Functional assignment (Figure 21, and step_F_functional assignment.txt and stepF.mp4 video tutorial found in Supplementary Material Step F):
    This step integrates all the functional assignments of those transcripts that were classified by the functional databases in a single column (“function_type”). This is done by executing the commands found in step_F_functional assignment.txt in DB4S (Figure 21 and stepF.mp4 video tutorial from Supplementary Material Step F). As can be deduced from the previous step, only transcripts in the “mRNA” and “Revise” categories can putatively be classified by the functional databases. Nevertheless, only about a third are assigned a function because the information in these reference databases is still considerably limited (see “Data analysis”).
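Conceptually, the integration amounts to coalescing the per-database assignment columns into "function_type". A reduced sketch with Python's sqlite3 module (the column names mirror the functional exports from Step A, and the rows are invented; the authoritative commands are in step_F_functional assignment.txt):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("""CREATE TABLE taxonomy
               (sequence TEXT, interpro2go TEXT, eggnog TEXT, function_type TEXT)""")
cur.executemany("INSERT INTO taxonomy VALUES (?, ?, ?, NULL)", [
    ("Contig1", "GO:0004252 serine-type endopeptidase activity", None),
    ("Contig2", None, "COG0539 Ribosomal protein S1"),
    ("Contig3", None, None),  # not classified by any functional database
])
# Keep the first available functional annotation for each contig.
cur.execute("UPDATE taxonomy SET function_type = COALESCE(interpro2go, eggnog)")
print(cur.execute("SELECT sequence, function_type FROM taxonomy ORDER BY sequence").fetchall())
```

Contigs classified by no functional database keep a NULL "function_type", which is what flags them for the manual assignment described under Data analysis.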


    Figure 21. Integrating functional assignments in the “taxonomy” table. A. 1) in the “Execute SQL” leaf, 2) paste the commands contained in step_F_functional assignment.txt and 3) execute them with “Play”; 4) the lower panel indicates if the commands were executed successfully. B. To view the updated “taxonomy” table, select it from the dropdown menu in the “Browse Data” leaf (indicated by the red arrow and rectangle); the light blue rectangle indicates the “function_type” column showing the integrated functional assignments. Transcripts that were not classified by the functional databases appear as “NULL”.

Data analysis

Browsing the “taxonomy” table (Supplementary Figures S2-S5 and analysing_taxonomy.txt):
    We are finally ready to browse and analyse the updated “taxonomy” table in the “Browse Data” leaf. One or more filters can be applied to facilitate data analysis. Further, some commands can be executed to visualise, for example, the taxonomic distribution of the sample (Supplementary Figure S2), the distribution by transcript type (Supplementary Figure S3), or a non-redundant list of the hits obtained in the homology searches (Supplementary Figure S4).
    To analyse the functional profile in more detail, all the “mRNA” and “Revise” transcripts that were classified by the functional databases, and those that were not, can be listed (Supplementary Figure S5). As we mentioned before, because the reference databases are not comprehensive, we found that only around 30% of all the transcripts that could putatively be classified by the functional databases (“mRNA” and “Revise” categories), were actually classified (49 contigs; Supplementary Figure S5A). To determine the functional profile of the remaining 70% (122 contigs; Supplementary Figure S5B), functional assignment of these transcripts can be determined individually on the basis of the homology search results and then entered manually in the database. To determine an order of priorities and help reduce this considerable workload, contigs can be viewed according to coverage (Supplementary Figure S5B).
    All these commands are included in analysing_taxonomy.txt and can be adapted to cover other interests.
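As an illustration of the kind of query involved (schema reduced to three columns, data invented; the actual commands are in analysing_taxonomy.txt), unclassified transcripts can be ranked by coverage like this:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("""CREATE TABLE taxonomy
               (sequence TEXT, function_type TEXT, coverage REAL)""")
cur.executemany("INSERT INTO taxonomy VALUES (?, ?, ?)", [
    ("Contig1", "GO:0004252 serine-type endopeptidase activity", 12.1),
    ("Contig2", None, 85.0),
    ("Contig3", None, 7.3),
])
# Unclassified contigs only, highest coverage first: a priority list for
# manual functional assignment.
rows = cur.execute("""
    SELECT sequence, coverage FROM taxonomy
    WHERE function_type IS NULL
    ORDER BY coverage DESC
""").fetchall()
print(rows)
```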

Notes

  1. Due to file size constraints, we exemplified the use of our workflow with the assembled reads (737 contigs), but we have also used HoSeIn to analyse the unassembled reads (~300,000) and it works seamlessly.
  2. This workflow was originally developed to analyse high-throughput metatranscriptomic sequences, but we have also used it to analyse high-throughput metagenomic sequences. Moreover, we validated our workflow by analysing a mock metagenome (BMock12) (Sevim et al., 2019) and comparing the results we obtained with those reported for the synthetic metagenome (Sevim et al., 2019). This validation was included in the study in which we presented the analysis of the dataset used for this tutorial, which was recently accepted for publication (Rozadilla et al., 2020), and is included here as a Supplementary Analysis (Supplementary Analysis_mock metagenome.docx). In summary, we contrasted our results with those reported by Sevim et al. (2019) (Table S1) and found that our workflow not only identified all the members of the mock metagenome, but also that the number of contigs that we identified per community member was greater (or the same, but never lower) than what the authors reported (Table S1). In conclusion, our workflow enabled us to identify all the community members of the mock metagenome with greater sensitivity than what was previously reported.
  3. Even though our workflow has quite a few manual steps, these are comparable to the number of steps used by taxonomy-dependent alignment-based methods to classify and label reads from metatranscriptomic/metagenomic datasets. There are bioinformatic workflows for metatranscriptomic datasets which aim to streamline some of this complexity by connecting multiple individual tools into a workflow that can take raw sequencing reads, process them and provide data files with taxonomic identities, functional genes, and/or differentially expressed transcripts (Shakya et al., 2019). Nevertheless, to define the taxonomic and functional assignments, these platforms perform their sequence-based searches against either protein or nucleotide databases, not both (Shakya et al., 2019). As has already been mentioned, searches against protein databases enable the detection of distantly related organisms but are liable to false discovery, whereas searches against nucleotide databases are more specific but are unable to identify insufficiently conserved sequences. For this reason, analyses of metatranscriptomes using these streamlined workflows must be carefully interpreted. Another major drawback is that several of these workflows assign taxonomy by searching against databases that are designed for functional characterisation (Shakya et al., 2019).
  4. Summary of the unique innovations in the HoSeIn workflow:
    1. All the available information for each sequence is assembled and integrated in a local database, from both homology searches and from whatever method was used to classify and label the sequences, and it can be easily viewed and analysed.
    2. The taxonomic profile of the sample is defined by comparing the taxonomic assignments from both homology searches for each sequence following the LCA logic; i.e., the taxonomic assignment level of a sequence is the one found in common for both homology search results, or for the only result if it returns no hits in the other homology search.
    3. Consequently, the novelty of our workflow is that final assignments integrate results from both homology searches, capitalising on their strengths, and thus making them more robust and reliable. For metatranscriptomics in particular, where results are difficult to interpret, this represents a very useful tool.
    4. The functional profile is defined by first assigning transcripts and then integrating all the functional information in a single column (in the local database). In our experience, functional databases are currently only able to classify ~30% of all the transcripts that can putatively be functionally classified. To the best of our knowledge, the functional information for the remaining ~70% of those transcripts remains unresolved in other existing tools. In contrast, with our workflow the functional assignment of these transcripts can be determined based on the homology search results (which are included in the local database), thus providing a much more complete and detailed functional profile.

Acknowledgments

This research was supported by Agencia Nacional de Promoción Científica y Tecnológica (PICT PRH 112 and PICT CABBIO 3632), and Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET) (PIP 0294) grants to CBM. CBM is a member of the CONICET research career. GR and JMC are the recipients of CONICET fellowships. This paper was derived from McCarthy et al. (2013) and Rozadilla et al. (2020).

Competing interests

The authors declare no competing interests.

References

  1. Aguiar-Pulido, V., Huang, W., Suarez-Ulloa, V., Cickovski, T., Mathee, K. and Narasimhan, G. (2016). Metagenomics, metatranscriptomics, and metabolomics approaches for microbiome analysis. Evol Bioinform Online 12(Suppl 1): 5-16.
  2. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. and Lipman, D. J. (1990). Basic local alignment search tool. J Mol Biol 215(3): 403-410.
  3. Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis, A. P., Dolinski, K., Dwight, S. S., Eppig, J. T., Harris, M. A., Hill, D. P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J. C., Richardson, J. E., Ringwald, M., Rubin, G. M. and Sherlock, G. (2000). Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25(1): 25-29.
  4. Blake, J. A., Christie, K. R., Dolan, M. E., Drabkin, H. J., Hill, D. P., Ni, L., Sitnikov, D., et al. (2015). Gene Ontology Consortium: going forward. Nucleic Acids Res 43(Database issue): D1049-1056.
  5. Buchfink, B., Xie, C. and Huson, D. H. (2015). Fast and sensitive protein alignment using DIAMOND. Nature Methods 12(1): 59-60.
  6. Finn, R. D., Attwood, T. K., Babbitt, P. C., Bateman, A., Bork, P., Bridge, A. J., Chang, H. Y., Dosztanyi, Z., El-Gebali, S., Fraser, M., Gough, J., Haft, D., Holliday, G. L., Huang, H., Huang, X., Letunic, I., Lopez, R., Lu, S., Marchler-Bauer, A., Mi, H., Mistry, J., Natale, D. A., Necci, M., Nuka, G., Orengo, C. A., Park, Y., Pesseat, S., Piovesan, D., Potter, S. C., Rawlings, N. D., Redaschi, N., Richardson, L., Rivoire, C., Sangrador-Vegas, A., Sigrist, C., Sillitoe, I., Smithers, B., Squizzato, S., Sutton, G., Thanki, N., Thomas, P. D., Tosatto, S. C., Wu, C. H., Xenarios, I., Yeh, L. S., Young, S. Y. and Mitchell, A. L. (2017). InterPro in 2017-beyond protein family and domain annotations. Nucleic Acids Res 45(D1): D190-D199.
  7. Glass, E.M. and Meyer, F. (2011). The metagenomics RAST server: A public resource for the automatic phylogenetic and functional analysis of metagenomes. In: Handbook of Molecular Microbial Ecology I: Metagenomics and Complementary Approaches, BioMed Central 9(1): 325-331.
  8. Huson, D. H., Auch, A. F., Qi, J. and Schuster, S. C. (2007). MEGAN analysis of metagenomic data. Genome Res 17(3): 377-386.
  9. Huson, D. H., Mitra, S., Ruscheweyh, H. J., Weber, N. and Schuster, S. C. (2011). Integrative analysis of environmental sequences using MEGAN4. Genome Res 21(9): 1552-1560.
  10. Kim, M., Lee, K. H., Yoon, S. W., Kim, B. S., Chun, J. and Yi, H. (2013). Analytical tools and databases for metagenomics in the next-generation sequencing era. Genomics Inform 11(3): 102-113.
  11. Kotera, M., Moriya, Y., Tokimatsu, T., Kanehisa, M. and Goto, S. (2015). KEGG and GenomeNet, New Developments, Metagenomic Analysis. In: Encyclopedia of Metagenomics: Genes, Genomes and Metagenomes: Basics, Methods, Databases and Tools. Nelson. K. E. (Ed.). Boston, MA, Springer US: 329-339.
  12. Marchesi, J. R. and Ravel, J. (2015). The vocabulary of microbiome research: a proposal. Microbiome 3: 31.
  13. Marchler-Bauer, A., Bo, Y., Han, L., He, J., Lanczycki, C. J., Lu, S., Chitsaz, F., Derbyshire, M. K., Geer, R. C., Gonzales, N. R., Gwadz, M., Hurwitz, D. I., Lu, F., Marchler, G. H., Song, J. S., Thanki, N., Wang, Z., Yamashita, R. A., Zhang, D., Zheng, C., Geer, L. Y. and Bryant, S. H. (2017). CDD/SPARCLE: functional classification of proteins via subfamily domain architectures. Nucleic Acids Res 45(D1): D200-D203.
  14. Markowitz, V. M., Chen, I. M., Palaniappan, K., Chu, K., Szeto, E., Grechkin, Y., Ratner, A., Jacob, B., Huang, J., Williams, P., Huntemann, M., Anderson, I., Mavromatis, K., Ivanova, N. N. and Kyrpides, N. C. (2012). IMG: the Integrated Microbial Genomes database and comparative analysis system. Nucleic Acids Res 40(Database issue): D115-122.
  15. McCarthy, C. B., Santini, M. S., Pimenta, P. F. and Diambra, L. A. (2013). First comparative transcriptomic analysis of wild adult male and female Lutzomyia longipalpis, vector of visceral leishmaniasis. PLoS One 8(3): e58645.
  16. McCarthy, C. B., Cabrera, N. A. and Virla, E. G. (2015). Metatranscriptomic Analysis of Larval Guts from Field-Collected and Laboratory-Reared Spodoptera frugiperda from the South American Subtropical Region. Genome Announc 3(4): e00777-15.
  17. Ogata, H., Goto, S., Sato, K., Fujibuchi, W., Bono, H. and Kanehisa, M. (1999). KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res 27(1): 29-34.
  18. Overbeek, R., Olson, R., Pusch, G. D., Olsen, G. J., Davis, J. J., Disz, T., Edwards, R. A., Gerdes, S., Parrello, B., Shukla, M., Vonstein, V., Wattam, A. R., Xia, F. and Stevens, R. (2014). The SEED and the Rapid Annotation of microbial genomes using Subsystems Technology (RAST). Nucleic Acids Res 42(Database issue): D206-214.
  19. Pearson, W. (2004). Finding protein and nucleotide similarities with FASTA. Curr Protoc Bioinformatics Chapter 3: Unit3 9.
  20. Powell, S., Szklarczyk, D., Trachana, K., Roth, A., Kuhn, M., Muller, J., Arnold, R., Rattei, T., Letunic, I., Doerks, T., Jensen, L. J., von Mering, C. and Bork, P. (2012). eggNOG v3.0: orthologous groups covering 1133 organisms at 41 different taxonomic ranges. Nucleic Acids Res 40(Database issue): D284-289.
  21. Rozadilla, G., Cabrera, N. A., Virla, E. G., Greco, N. M. and McCarthy, C. B. (2020). Gut microbiota of Spodoptera frugiperda (J.E. Smith) larvae as revealed by metatranscriptomic analysis. Journal of Applied Entomology n/a(n/a). doi.org/10.1111/jen.12742.
  22. Sevim, V., Lee, J., Egan, R., Clum, A., Hundley, H., Lee, J., Everroad, R. C., Detweiler, A. M., Bebout, B. M., Pett-Ridge, J., Goker, M., Murray, A. E., Lindemann, S. R., Klenk, H. P., O'Malley, R., Zane, M., Cheng, J. F., Copeland, A., Daum, C., Singer, E. and Woyke, T. (2019). Shotgun metagenome data of a defined mock community using Oxford Nanopore, PacBio and Illumina technologies. Sci Data 6(1): 285.
  23. Shakya, M., Lo, C. C. and Chain, P. S. G. (2019). Advances and challenges in metatranscriptomic analysis. Front Genet 10: 904.
  24. Tatusov, R. L., Galperin, M. Y., Natale, D. A. and Koonin, E. V. (2000). The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res 28(1): 33-36.
  25. Wooley, J. C., Godzik, A. and Friedberg, I. (2010). A primer on metagenomics. PLoS Comput Biol 6(2): e1000667.


Once the reads have been classified or labelled as well as possible, downstream analyses (step 3) attempt to extract useful knowledge from the data, such as the potential (metagenomics) or active (metatranscriptomics) functional profile. Many useful resources are available for the functional annotation of the genes to which reads map, such as functional databases — Gene Ontology (GO) (Ashburner et al., 2000; Blake et al., 2015), the Kyoto Encyclopedia of Genes and Genomes (KEGG) (Ogata et al., 1999; Kotera et al., 2015), Clusters of Orthologous Groups (COG) (Tatusov et al., 2000), InterPro (Finn et al., 2017), SPARCLE (Marchler-Bauer et al., 2017), and SEED (Overbeek et al., 2014) — and other tools can also be used to obtain functional profiles. Among the latter, some are web-based, such as MG-RAST (Glass and Meyer, 2011) and IMG/M (Markowitz et al., 2012), whereas others are standalone programs, such as MEGAN (Huson et al., 2007). MEGAN classifies the results of homology searches using the NCBI taxonomy, and assigns functions using the InterPro (Finn et al., 2017), EggNOG (Powell et al., 2012), KEGG (Ogata et al., 1999) and SEED (Overbeek et al., 2014) reference databases.

The same set of tools can be used to perform taxonomic assignments for metagenomic and metatranscriptomic data. In both cases, however, the same limitations are encountered, including algorithms that must handle massive amounts of data (short reads), and the insufficiency of reference sequences in the databases. Furthermore, most of these tools only use a subset of the available genomes or focus on certain organisms, and many do not include eukaryotes. On the other hand, there are important differences in how each workflow determines the taxonomic profile, because some perform their searches against protein databases whereas others search in nucleotide space (for a review see Shakya et al., 2019). Our HoSeIn workflow (from Homology Search Integration) focuses on the processing and downstream analysis steps, and we therefore developed it for use with taxonomy-dependent alignment-based methods (Video 1). As we have already mentioned, the latter use homology searches against sequence databases as the first step for classifying and labelling reads. To obtain as much information as possible from the samples, nucleotide datasets are compared to both nucleotide and protein databases using local sequence aligners such as BLAST (Altschul et al., 1990) or FASTA (Pearson, 2004). Nevertheless, once the homology searches are complete, the analysis and integration of these results can be problematic because the outputs from these searches usually show discrepancies and inconsistencies, which can be particularly notorious when working with RNA-seq (Video 1 and Figure 1). On the one hand, amino acid-based searches can detect organisms that are distantly related to those in the reference database, but are liable to false discovery. In contrast, nucleotide searches are more specific, but are unable to identify insufficiently conserved sequences. Assignments made using one or the other to define taxonomic and functional profiles should therefore be interpreted carefully. For example, if no close relative exists in the reference database, assignments made using searches against nucleotide databases (particularly for protein-coding genes) can be rather inefficient. In this regard, and to the best of our knowledge, existing tools do not intersect the information from the different homology searches to integrate their results, but provide the results of each analysis separately. We developed the HoSeIn workflow to criss-cross the information from the homology search results (nucleotide and protein), and then make the final assignments based on this integrated information. Sequences are assigned to a certain taxon if they were assigned to that taxon by both homology searches, or if they were assigned to it by one of the homology searches but returned no hits in the other one (Video 1 and Figure 1). Specifically, our workflow retrieves all the available information for each sequence from the different tools that were used to process the dataset (the homology searches, and whatever method was used to classify and label the sequences, e.g., MEGAN [Huson et al., 2007]), and uses it to build a local database. The data for each sequence is then intersected to define the taxonomic profile of the sample following the aforementioned criteria. The main novelty of our workflow is thus that final assignments integrate the results from both homology searches, capitalising on their strengths, and thereby making them more robust and reliable (Video 1). For metatranscriptomics in particular, where results are difficult to interpret, this represents a very useful tool.





Video 1. Homology Search Integration (HoSeIn) workflow abstract video: this 6-min trailer gives a quick overview of the background and modus operandi of the HoSeIn workflow.



Figure 1. Rationale behind the HoSeIn workflow. A. In sequence-based analyses, various approaches can be used to determine the taxonomic profile of the microbiome, which can be either taxonomy-dependent or -independent (see text for details). B. When taxonomy-dependent alignment-based methods are used to analyse metagenomic or metatranscriptomic datasets (1), these are usually compared to nucleotide and protein databases using local sequence aligners such as BLAST (Altschul et al., 1990) or FASTA (Pearson, 2004) (2). However, the analysis and integration of these results can be problematic because the outputs from these searches usually show inconsistencies (3). C. The HoSeIn workflow intersects the information from the homology search results and determines the final assignments based on this integrated information. In this way, sequences are assigned to a certain taxon if they were assigned to that taxon by both homology searches (1), or if they were assigned to it by one of the homology searches but returned no hits in the other one (2 and 3).


Equipment

Desktop computer with an Intel Core i7 2600 processor (3.40 GHz, 8 Mb, 4 cores, 8 threads, video and Turboboost); Intel DH67BL motherboard, LGA 1155 socket; 7.1 + 2 sound; 1 Gb network; RAID 0, 1, 5 and 10; and four Kingston 1,333 MHz DDR3 4 GB memory modules
 


Software

  1. Ubuntu 18.04.3 LTS (Ubuntu, https://ubuntu.com/#download), last accessed 11/9/2019
  2. BLAST (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/), blast-2.2.25+ last accessed 7/2/2013
     Note: The FASTA programs (FASTA DNA:DNA and FASTX) can also be used for the homology searches. Nevertheless, when aligning high-throughput datasets against protein databases the BLAST and FASTA programs remain a major computational bottleneck, and various tools have recently been developed to improve performance. In particular, DIAMOND is an open-source sequence aligner for protein and translated DNA searches that performs at 500x-20,000x the speed of BLAST, is suitable for running on standard desktops and laptops, and provides various output formats as well as taxonomic classification (Buchfink et al., 2015). We therefore recommend using DIAMOND when aligning large datasets against protein databases with limited computational resources.
  3. MEGAN6 (http://ab.inf.uni-tuebingen.de/software/megan6/); MEGAN_Community_windows-x64_6_17_0 version last accessed 18/9/2019
     In this tutorial, MEGAN is used to process the homology search output files and then extract the taxonomic and functional information. To download and install this software:
     1. Go to the MEGAN website (http://ab.inf.uni-tuebingen.de/software/megan6) and download the MEGAN6 version that matches your operating system, together with the corresponding mapping files.
     2. Run the installer.
  4. DB Browser for SQLite (DB4S) (https://sqlitebrowser.org/); DB.Browser.for.SQLite-3.11.2-win64 version last accessed 5/9/2019
     DB Browser for SQLite (DB4S) is a high-quality, visual, open-source tool for creating, designing and editing database files compatible with SQLite. It uses a spreadsheet-like interface and does not require learning complicated SQL commands. In our workflow, DB4S is used to create a local database that contains all the available information for each sequence in the dataset. All these data are then used to define the taxonomic and functional profiles of the sample. To download and install this software:
     1. Download the DB4S version that matches your operating system from the website (https://sqlitebrowser.org/)
     2. Run the installer.
 


Procedure

Note: This tutorial describes the global process used to analyse high-throughput metatranscriptomic sequences from an environmental sample, focusing on how to define their taxonomic and functional profiles in a robust and reliable way. It does not include a detailed explanation of the pre-processing of high-throughput sequences obtained from environmental samples (for this, see Kim et al., 2013 and Aguiar-Pulido et al., 2016), nor of how to use MEGAN (for this, see Huson et al. [2007 and 2011] and the MEGAN User Manual).

Below we provide a detailed tutorial to show how HoSeIn works, using as an example one of our samples, a sequence dataset obtained from the gut of lepidopteran larvae. The analysis of the metatranscriptomic portion of this dataset was recently accepted for publication (Rozadilla et al., 2020). Since this kind of analysis is usually determined by the aims of the experiment (Shakya et al., 2019), a few words follow to explain certain unique features of this particular sample and its subsequent analysis. The fall armyworm, Spodoptera frugiperda (Lepidoptera: Noctuidae), is an economically important agricultural pest native to the American continent. The aim of analysing this pest was to describe the taxonomic and functional profiles of the larval gut transcriptome and associated metatranscriptome, to identify novel pest control targets. For this, total RNA was extracted from fifth instar larval guts and subjected to a one-step reverse transcription and sequence-independent PCR amplification procedure, followed by pyrosequencing (McCarthy et al., 2015); the high-throughput reads were subsequently assembled into contigs (Rozadilla et al., 2020). Since we were interested in identifying, discriminating and characterising the host (S. frugiperda) gut transcriptome and its associated metatranscriptome, we downloaded the following NCBI databases to perform the homology searches locally (ftp://ftp.ncbi.nlm.nih.gov/blast/db/) (download_db.mp4, a video tutorial that shows how to download the different types of database files from NCBI, and download_db.sh, a bash script that automatically downloads these database files one by one, are provided as Supplementary Material 1):


1) Nucleotide:

- "Non-redundant" nucleotide sequences (nt)

- 16S rRNA genes (16S)

- Lepidopteran whole-genome shotgun (Lep) projects that were complete at the time of the analysis.

The nt, 16S and Lep sequences were then combined into a single database (DB: nt16SLep) using the appropriate BLAST+ applications (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/) (blast_tutorial.mp4, a video tutorial that shows how to build and combine different databases and how to run homology searches locally using BLAST, and blast_commands.txt, which contains the commands used in the tutorial, are provided as Supplementary Material 2). The Lep sequences in the combined nucleotide database simplified the identification of host sequences (which represent the majority), whereas the nt and 16S databases made the identification of the associated metatranscriptome (as well as of host sequences) possible.


2) Protein:

- Non-redundant protein sequences (nr)


The main steps included in the workflow are outlined below (Figure 2; see the tutorial for details):


I. Analyse the sequences with local sequence aligners: the contigs were locally aligned against the combined nucleotide database (nt16SLep) using BLASTN (Altschul et al., 1990) with a 1e-50 cut-off E-value, and against the protein database (nr) using BLASTX (Altschul et al., 1990) with a 1e-17 cut-off E-value (Supplementary Material 2).
Note: Here we used BLASTX because the dataset we are querying is small (the assembled reads mentioned above, i.e., 737 contigs); for large datasets and limited computational resources, however, we recommend using DIAMOND (Buchfink et al., 2015).


II. Process the homology search results:
Step A (the stepA.mp4 video tutorial guides you step by step): the output files from both homology searches are processed with MEGAN, a software that performs taxonomic binning and assigns sequences to taxa using the Lowest Common Ancestor (LCA) assignment algorithm (Huson et al., 2007). The taxonomic and functional assignments made by MEGAN for each contig are then exported using MEGAN functionalities.

Note: MEGAN computes a "species profile" by finding the lowest node in the NCBI taxonomy that encompasses the set of hit taxa, and assigns the sequence to the taxon represented by that lowest node. With this approach, every sequence is assigned to some taxon. If a sequence aligns very specifically to a single taxon, it is assigned to that taxon; the less specifically a sequence hits taxa, the higher up in the taxonomy it is placed (for a detailed explanation see the MEGAN User Manual). We chose MEGAN because this software uses the LCA assignment algorithm and has a simple functionality for exporting the taxonomic and functional information for each sequence in the dataset (i.e., the "species profile" for each sequence can be easily accessed and downloaded). However, any other tool or platform that provides this functionality (i.e., exporting the taxonomic/functional assignment of each sequence in the dataset) can also be used.

Step B (the stepB.mp4 video tutorial guides you step by step): the output files from both homology searches are also processed using custom bash scripts. The scripts parse the homology search output files and generate two files (one for each homology search) containing the name of each contig, its best hit (or no hit), and the corresponding E-value.

III. Create the local database: Step C (the stepC.mp4 video tutorial found in Supplementary Material Step C guides you step by step): all this information (from the exported MEGAN files and the bash script output files) is then used to create a local SQLite database that contains all the available information for each contig (from both homology searches).

IV. Analyse the local database: Step D (the stepD.mp4 video tutorial found in Supplementary Material Step D guides you step by step): the final taxonomic assignments are then performed by criss-crossing and comparing all this information using different SQLite commands. Step E (the stepE.mp4 video tutorial found in Supplementary Material Step E guides you step by step): transcript assignment is achieved by executing certain SQLite commands to group the transcripts that correspond to mRNA, rRNA, those that cannot be assigned (Not assigned), and those that have to be manually revised (Revise). Step F (the stepF.mp4 video tutorial found in Supplementary Material Step F guides you step by step): finally, the functional assignments of the transcripts that were classified by the functional databases are integrated in a single column by executing certain SQLite commands. Only transcripts in the "mRNA" and "Revise" categories can putatively be classified by the functional databases, but only about a third are assigned a function because the information in these reference databases is still considerably limited. The functional assignment of the remaining transcripts can be completed manually on the basis of the homology search results (which are included in the local SQLite database) (see Data analysis).
 


 


Figure 2. The HoSeIn (Homology Search Integration) workflow. The figure shows an overview of the main steps that make up the workflow (see the tutorial for details): I. Analyse the sequences with local sequence aligners: sequences are submitted to homology searches against nucleotide (nt16SLep) and protein (nr) databases. II. Process the results: the results from both homology searches are processed with MEGAN and with custom bash scripts. Step A: files containing the MEGAN taxonomic and functional assignments are exported using MEGAN functionalities. Step B: the custom bash scripts generate files containing the name of each sequence, its best hit (or no hit) and the corresponding E-value, for both homology searches. III. Create the local database: Step C: the MEGAN and bash script files are then used to create a local SQLite database that contains all the available data for each sequence. IV. Analyse the local database: the final taxonomic and functional assignments are performed by criss-crossing and comparing all this information using different SQLite commands. Step D: taxonomic assignments are defined using certain criteria and filters: 1) a "host filter" determines which sequences correspond to the S. frugiperda gut transcriptome; 2) an "E-value = 0 filter" for the nt16SLep hits groups those unambiguous hits; 3) a "lowest common ancestor criterion" is used for the rest of the contigs (those with nt16SLep hits showing E-values > 0), criss-crossing the results from both homology searches and keeping the lowest common ancestor assignment to define the identity of the contig. Step E: transcript assignment is achieved by searching for certain keywords in the hit descriptions. In this way, it is possible to group the transcripts that correspond to mRNA, rRNA, those that cannot be assigned (Not assigned), and those that have to be manually revised (Revise). Step F: finally, the functional annotations of all the sequences corresponding to mRNA (or "Revise") are integrated in a single column. NA, Not assigned.


 


Various files are provided as Supplementary Material so that readers can follow this tutorial and reproduce the same results shown below:


- a FASTA file containing the assembled sequences (Sf_TV_contigs.fasta) and a text file (coverage.csv) containing the assembly information for each contig (i.e., contig name, number of reads used to assemble each contig, read length, and contig coverage), provided as Supplementary Material 3;

- the output files from both homology searches in BLAST pairwise format (blastn_nt16SLep_total-contigs_Sf-TV.txt and blastx_nr_total-contigs_Sf-TV.txt), provided as Supplementary Material 4;

- the two custom scripts written in bash that process the homology search results (search_parser.sh and analyser_blast.sh), provided as Supplementary Material 5;

- the "RMA" files generated by MEGAN6 after processing the homology search output files (blastn_nt16sLep_total-contigs.rma6 and blastx_nr_total-contigs.rma6), provided as Supplementary Material 6;

- text files containing the different commands for intersecting, assigning and analysing the data in the local SQLite database: step_C_creating_taxonomy.txt found in Supplementary Material Step C, step_D_crisscrossing_taxonomy.txt found in Supplementary Material Step D, step_E_assigning transcripts.txt found in Supplementary Material Step E, step_F_functional_assignment.txt found in Supplementary Material Step F, and analysing_taxonomy.txt.


 






HoSeIn tutorial (see also Figure 2):


 


I. Analyse the sequences with local sequence aligners: as mentioned above, the homology searches were performed locally using BLASTN and BLASTX (Altschul et al., 1990) against the combined nucleotide (nt16SLep) and protein (nr) databases, with cut-off E-values of 1e-50 and 1e-17, respectively. The homology search results can be found in the blastn_nt16SLep_total-contigs_Sf-TV.txt and blastx_nr_total-contigs_Sf-TV.txt files (Supplementary Material 4). The download_db.mp4 video tutorial shows how to download the different types of database files from NCBI, and the download_db.sh bash script automatically downloads these database files one by one (Supplementary Material 1). The blast_tutorial.mp4 video tutorial shows how to build and combine different databases and how to run homology searches locally using BLAST; the commands used in that video can be found in blast_commands.txt (Supplementary Material 2).


II. Process the homology search results: the output files from both homology searches were processed with MEGAN and saved as blastn_nt16sLep_total-contigs.rma6 and blastx_nr_total-contigs.rma6 (Supplementary Material 6).


Export the taxonomic and functional assignments made by MEGAN for both homology searches (Figures 3-9, the StepA.mp4 video tutorial, and Supplementary Figure S1):
Open the provided RMA files with MEGAN (blastn_nt16sLep_total-contigs.rma6 and blastx_nr_total-contigs.rma6, found in Supplementary Material 6). To open an RMA file, select "File" > "Open" and browse to the desired file (Figure 3). The main window is used to display the taxonomy and to control the program via the main menu. Once a dataset has been processed, the taxonomy induced by that dataset is displayed. The size of a node indicates the number of sequences that have been assigned to it (see the MEGAN user manual for a detailed description).
To extract the taxonomic information from MEGAN, the taxonomic tree must be progressively expanded from "Domain" down to species, all leaves selected, and text files exported in csv (comma-separated values) format. Select the taxonomic level you want to extract (from domain to species): "Tree" > "Rank" (Figure 4).
Select the leaves: "Select" > "All Leaves" (Figure 5).
Without deselecting the leaves, export the file in csv format: "File" > "Export" > "CSV Format" (Figure 6).
Choose which data to export and how they should be tabulated in the csv file (choose "summarized", so that the selected taxonomic rank is exported together with the sequences included in all lower ranks) (Figure 7). We suggest naming the files with representative names that indicate the type of homology search and the taxonomic level, for example "nucl_domain".
In this way, a csv text file is obtained (which can be viewed in a basic text editor such as WordPad). Each of these files has two fields, one with the sequence name and the other with the corresponding assignment at the specified taxonomic level (Domain, Phylum, etc.) (Figure 8).
Repeat this process for each taxonomic level and for the other homology search. In this way, 14 files are obtained (7 for each homology search), each corresponding to a specific taxonomic rank (Figure 9).
The process for extracting the functional information is very similar (see Supplementary Figure S1 and the StepA.mp4 video tutorial): 1) select the functional tree you want to visualise, for example InterPro2GO (Figure S1A.1); 2) uncollapse all nodes: "Tree" > "Uncollapse All" (Figure S1A.2); 3) "Select" > "All Leaves" (Figure S1A.3); 4) without deselecting the leaves, export the file in csv format: "File" > "Export" > "CSV Format" (Figure S1A.4); 5) choose the data to export: "readName_to_interpro2goPath" (Figure S1A.5). Repeat the process for all the functional trees you want to include in the final analysis; for this tutorial we also exported the EggNOG assignments (Figure S1B).
 


 


Figure 3. Opening the provided .rma6 files in MEGAN. A. Select "Open" from the "File" menu (red rectangle). B. Select one of the provided .rma6 files from wherever you saved them (blastn_nt16sLep_total-contigs.rma6 is selected here; red rectangle). C. Appearance of blastn_nt16sLep_total-contigs.rma6 expanded down to the "Species" taxonomic rank.


 


 


Figure 4. Selecting a taxonomic rank. A. Select "Rank" from the "Tree" menu (red rectangle). B. Select the desired rank from the drop-down menu (the red arrow indicates that "Domain" has been selected here).


 


 


Figure 5. Selecting all the leaves of the chosen taxonomic rank. Select "All Leaves" from the "Select" menu (red rectangle).


 


 


Figure 6. Exporting in csv format. Select "Export" from the "File" menu and then "Text (CSV) Format" from the drop-down menu (red rectangle).


 


 


Figure 7. Choosing how the data are exported. Select "readName_to_taxonName" from the "Choose data to export" drop-down menu, "summarized" from the "Choose counts to use" drop-down menu, and "tab" from the "Choose separator to use" drop-down menu (red rectangles).


 


 


Figure 8. Partial view of the exported "nucl_domain.csv" file. The first column contains the sequence name (e.g., Contig479) and the second contains the taxonomic rank assigned to each sequence ("Domain" in this figure, e.g., "Bacteria").


 


 


Figure 9. Exported MEGAN files. Folder containing all 14 files exported from MEGAN, named according to the corresponding homology search and taxonomic level (red rectangle).


 


Parse the homology search output files (Figure 10 and the stepB.mp4 video tutorial from the Supplementary Materials):
This step must be carried out on a Linux operating system, because the scripts that parse the homology search results are written in bash. The scripts process the FASTA file (containing the query sequences) and the homology search output files:


The provided scripts (search_parser.sh and analyser_blast.sh from Supplementary Material 5), the FASTA file (Sf_TV_contigs.fasta from Supplementary Material 3) and the homology search output files (blastn_nt16SLep_total-contigs_Sf-TV.txt and blastx_nr_total-contigs_Sf-TV.txt from Supplementary Material 4) must all be placed in the same folder (Figure 10A).
Of the two bash scripts we provide, only search_parser.sh is executed. This script speeds up processing by running several instances of the analyser_blast.sh script simultaneously. Open a terminal in the folder containing the files and execute the search_parser.sh bash script strictly in the following order (see also the example below and Figures 10B-10C):
bash search_parser.sh (name and file extension of the query fasta) (name and file extension of the homology search results) (name and file extension chosen for the output file)


 


bash search_parser.sh Sf_TV_contigs.fasta blastn_nt16Slep_total-contigs.txt nucl_hits.csv


 


 


Figure 10. Using the bash scripts. A. The light blue rectangle indicates the folder containing all the files necessary for the scripts to work. B. The red rectangle shows how to execute the script to process the BLASTN output file (homology search against the nucleotide database). C. This image shows the messages that appear in the terminal after processing the output files from the BLASTN homology search (1 and 2) and from the BLASTX homology search (3 and 4).


 


The script generates a csv file with three fields separated by "%": the first shows the name of each sequence, the second its best hit (or nothing if there was no hit), and the third the corresponding E-value (or, again, nothing if there was no hit) (Figure 11).
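To illustrate the layout of these files, the following sketch builds the same three-field, "%"-separated listing from mock search results. All contig names and hit descriptions here are invented, and tabular input is used for brevity (the real search_parser.sh parses BLAST pairwise format):

```shell
# Mock tabular BLAST-style output: query name, hit title, E-value.
printf 'Contig1\tEnterococcus mundtii 16S ribosomal RNA gene\t0.0\n'  > mock_blast.tsv
printf 'Contig2\thypothetical protein EM01_00123\t1e-30\n'           >> mock_blast.tsv

# Rewrite it in the name%hit%E-value layout the scripts produce;
# a query with no hit keeps its name and two empty fields.
awk -F'\t' '{print $1 "%" $2 "%" $3}' mock_blast.tsv > nucl_hits.csv
printf 'Contig3%%%%\n' >> nucl_hits.csv

cat nucl_hits.csv
```

Running this prints one line per query, e.g. `Contig1%Enterococcus mundtii 16S ribosomal RNA gene%0.0`, with `Contig3%%` for the no-hit query.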
 


 


Figure 11. Partial images of the csv files generated by the script. The files show the listed BLASTN (A) and BLASTX (B) hits. In both files, the fields showing the sequence name, its best hit and the corresponding E-value are separated by "%".


 


Create the database in DB4S (Figures 12-18, and step_C_creating_taxonomy.txt and the stepC.mp4 video tutorial found in Supplementary Material Step C):
The file containing all the assembly information for the reads (coverage.csv, found in Supplementary Material 3; it includes contig name, number of reads used to assemble each contig, read length, and contig coverage), the exported MEGAN files (14 taxonomic files and 2 functional files from Step A) and the bash script output files (2 files from Step B) will now be used to create a local database with DB4S, which will contain all the available information (from both homology searches, BLASTN and BLASTX) for each contig. To do this, each csv file must first be imported into DB4S separately:


Create a new database by clicking "New Database" and choosing where to save it (Figure 12).
Import, one by one, the csv files exported from MEGAN (16 files), the csv files created by the bash scripts (2 files) and the file containing the contigs' assembly information (1 file): "File" > "Import" > "Table from CSV file" (Figure 13).
Choose a name for each table and indicate the field separator (Tab for the files imported from MEGAN; Other > % for the files generated by the bash scripts) (Figure 14). For coverage.csv only, tick the "Column names in first line" box, which automatically gives the columns their correct names (Figure 14C).
We suggest renaming the table columns with representative names, as shown in Figure 15, to be able to use the commands we provide for Steps C6, D, E and F (and also to simplify interpreting the commands and avoid errors).
Figure 16 shows the list of tables after renaming.
The next step consists in integrating the information from all the imported files into a single new table, as shown in Figures 17 and 18. The set of commands in step_C_creating_taxonomy.txt, from Supplementary Material Step C, creates a new table called "taxonomy" that unifies the columns of all the imported csv files and adds empty columns (final_domain, final_phylum, etc.) to be filled afterwards with the results of the taxonomic criss-crossing (Step D). It also adds the auxiliary columns "state_taxo" (to indicate whether a contig has been taxonomically assigned, and to avoid assigning it more than once; Step D), "rna_type" (to indicate whether a transcript corresponds to mRNA or rRNA; Step E), "state_function" (to indicate whether a transcript has been assigned or needs to be revised; Step E), and "function_type" (which integrates the functional assignments made by the functional databases for those transcripts; Step F).
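For readers who prefer the command line, the following sketch reproduces the gist of Step C with the sqlite3 shell instead of the DB4S GUI. The table and column names follow the tutorial, but the data rows are invented and only two of the nineteen imported tables are shown; the authoritative commands are in step_C_creating_taxonomy.txt:

```shell
# Command-line sketch of the DB4S import/unification step (invented rows).
rm -f taxonomy.db
printf 'Contig1\tBacteria\nContig2\tEukaryota\n' > nucl_domain.csv

sqlite3 taxonomy.db <<'EOF'
CREATE TABLE nucl_domain(sequence TEXT, nucl_domain TEXT);
.mode tabs
.import nucl_domain.csv nucl_domain
-- in DB4S the bash-script files are imported with '%' as field separator;
-- here the equivalent table is loaded directly
CREATE TABLE nucl_hit(sequence TEXT, nucl_hit TEXT, nucl_Evalue TEXT);
INSERT INTO nucl_hit VALUES
 ('Contig1','Enterococcus sp. 16S ribosomal RNA gene','0.0');
-- unify everything into one "taxonomy" table and add the empty result
-- column plus the auxiliary state columns used in Steps D-F
CREATE TABLE taxonomy AS
SELECT d.sequence, d.nucl_domain, h.nucl_hit, h.nucl_Evalue,
       NULL AS final_domain, NULL AS state_taxo, NULL AS rna_type,
       NULL AS state_function, NULL AS function_type
FROM nucl_domain d LEFT JOIN nucl_hit h ON d.sequence = h.sequence;
EOF

sqlite3 taxonomy.db "SELECT sequence, nucl_domain, IFNULL(nucl_hit,'-') FROM taxonomy;"
```

The final query lists every contig with its domain and best nucleotide hit, with "-" standing in for contigs that had no hit.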
 


 


Figure 12. Creating a new database in DB4S. A. Click "New Database" (red arrow). B. A window will appear asking you to choose a file name and location for saving the new database (red rectangle).


 


 


Figure 13. Importing the csv files. A. Click on the "File" menu (red arrow) and select "Import" and "Table from CSV file" from the drop-down menu (red rectangle). B. Select the csv file to import ("nucl_domain.txt" in the figure; red rectangle).


 


 


Figure 14. Naming the imported tables. After selecting a csv file, a new window appears in which you must name the table and indicate the field separator. A. For the files imported from MEGAN, the field separator is "Tab"; this table was named "nucl_domain" (red rectangle). B. For the files generated by the bash scripts, the field separator is "Other" > "%"; this table was named "nucl_hit" (red rectangle). C. For the "coverage" file, tick the "Column names in first line" box (indicated by the red arrow) and select "Tab" as the field separator; the light blue rectangle shows the assigned column names.


 


 


Figure 15. Renaming the table columns. A. In the main window, select the table you want to modify with the right mouse button and click "Modify Table" (red rectangle). B. A new window opens showing the fields (corresponding to the column names); double-click on each one to rename it. For the tables obtained from the MEGAN files, rename "field1" to "sequence", and name the table and "field2" according to the database used for the homology search (nucl or prot) followed by the appropriate taxonomic rank (e.g., domain); in the figure, "nucl_domain" (red rectangle). C. The tables obtained from the files generated by the bash scripts have three columns: rename "field1" to "sequence"; rename "field2" to "nucl_hit" for the BLASTN hits or "prot_hit" for the BLASTX hits; and rename "field3" to "nucl_Evalue" for the BLASTN hits or "prot_Evalue" for the BLASTX hits (red rectangles).


Note: The columns of the "coverage" file were named in the previous step, so they do not need to be modified.


 


 


Figure 16. The "Database Structure" window in DB4S listing all the imported and renamed tables.


 


 


Figure 17. Unifying the information from all the imported files to create a single table. 1) In the "Execute SQL" tab, 2) paste the commands contained in step_C_creating_taxonomy.txt and 3) execute them with "Play"; 4) the lower panel indicates whether the commands were executed successfully.


 


 


Figure 18. View of the "taxonomy" table created in the previous step. A. In the "Database Structure" tab (red arrow), a new "taxonomy" table appears (red rectangle). B. To browse the new "taxonomy" table, select it from the drop-down menu in the "Browse Data" tab (indicated by the red arrow and rectangle); the light blue rectangle shows a partial view of the columns created in "taxonomy".


 


Define the taxonomic profile (Figure 19, and step_D_crisscrossing_taxonomy.txt and the stepD.mp4 video tutorial found in Supplementary Material Step D):
The next step consists in criss-crossing all the information contained in the "taxonomy" table to elucidate the taxonomic profile of the sample, according to the following criteria:


Contigs assigned to Arthropoda in at least one of the homology searches are assigned to the host transcriptome.
Contigs with hits with E-value = 0 in the nt16SLep search are directly assigned to that taxon.
The remaining contigs are assigned by comparing the MEGAN assignments from both homology searches following LCA logic; that is, a contig's taxonomic assignment is the level that both results have in common or, if there was no hit in the other homology search, the only available result.
Contigs that were not assigned to any taxon by MEGAN in either homology search are considered "Unassigned"; contigs with no hits in either homology search are considered "No hits".
These assignments are carried out by executing the commands from step_D_crisscrossing_taxonomy.txt in DB4S (Figure 19 and the stepD.mp4 video tutorial from Supplementary Material Step D).
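The criteria above can be sketched as a series of SQLite updates. This is only an illustrative reconstruction, with invented rows and a single rank (phylum); the actual commands shipped with the workflow are in step_D_crisscrossing_taxonomy.txt and cover all ranks:

```shell
# Illustrative reconstruction of the Step D criss-cross criteria.
rm -f crisscross.db
sqlite3 crisscross.db <<'EOF'
CREATE TABLE taxonomy(sequence TEXT, nucl_phylum TEXT, prot_phylum TEXT,
                      nucl_Evalue TEXT, final_phylum TEXT, state_taxo TEXT);
INSERT INTO taxonomy VALUES
 ('Contig1','Arthropoda',NULL,'1e-60',NULL,NULL),
 ('Contig2','Firmicutes','Firmicutes','0.0',NULL,NULL),
 ('Contig3',NULL,'Proteobacteria',NULL,NULL,NULL);
-- 1) host filter: Arthropoda in at least one search -> host transcriptome
UPDATE taxonomy SET final_phylum='Arthropoda', state_taxo='assigned'
 WHERE state_taxo IS NULL
   AND (nucl_phylum='Arthropoda' OR prot_phylum='Arthropoda');
-- 2) E-value = 0 filter: unambiguous nt16SLep hits keep that assignment
UPDATE taxonomy SET final_phylum=nucl_phylum, state_taxo='assigned'
 WHERE state_taxo IS NULL AND nucl_Evalue='0.0';
-- 3) LCA criterion: keep what both searches agree on, or the only
--    assignment available when the other search returned no hit
UPDATE taxonomy SET final_phylum=COALESCE(nucl_phylum, prot_phylum),
                    state_taxo='assigned'
 WHERE state_taxo IS NULL
   AND (nucl_phylum=prot_phylum OR nucl_phylum IS NULL OR prot_phylum IS NULL);
SELECT sequence, final_phylum FROM taxonomy;
EOF
```

The "state_taxo" guard mirrors its role in the tutorial: once a contig is assigned by an earlier filter, later filters skip it.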


 


 


Figure 19. Defining the final taxonomic assignments in the "taxonomy" table. A. To carry out these assignments: 1) in the "Execute SQL" tab, 2) paste the commands contained in step_D_crisscrossing_taxonomy.txt and 3) execute them with "Play"; 4) the lower panel indicates whether the commands were executed successfully. B. To view the updated "taxonomy" table, select it from the drop-down menu in the "Browse Data" tab (indicated by the red arrow and rectangle); the light blue rectangle indicates the columns with the final taxonomic assignments (final_domain, final_phylum, etc.) obtained after criss-crossing the taxonomic information from both homology searches.


 


Classify the transcripts (Figure 20, and step_E_assigning transcripts.txt and the stepE.mp4 video tutorial found in Supplementary Material Step E):
Since this sample was obtained from total RNA, the information contained in the hit descriptions of the homology searches (the "nucl_hit" and "prot_hit" columns) can be used to further classify these sequences as messenger RNA or ribosomal RNA, based on the following criteria:


Contigs showing the word "mRNA" in the "nucl_hit" column are assigned to mRNA.
Contigs showing "rRNA/ribosomal RNA" in the "nucl_hit" column are assigned to rRNA.
Contigs showing "genome/chromosome/scaffold/contig" in the "nucl_hit" column and "uncharacterized/hypothetical/unknown/ncRNA" in the "prot_hit" column are considered functionally "Unassigned".
All other contigs are included in the "Revise" category, to be amended manually.
These assignments are carried out by executing the commands found in step_E_assigning transcripts.txt in DB4S (Figure 20 and the stepE.mp4 video tutorial from Supplementary Material Step E).
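The keyword rules above can be sketched as SQLite updates. The hit descriptions below are invented, and only a subset of the keywords is shown; the actual commands are in step_E_assigning transcripts.txt:

```shell
# Keyword-based transcript classification in the spirit of Step E.
rm -f rna.db
sqlite3 rna.db <<'EOF'
CREATE TABLE taxonomy(sequence TEXT, nucl_hit TEXT, prot_hit TEXT,
                      rna_type TEXT, state_function TEXT);
INSERT INTO taxonomy VALUES
 ('Contig1','Enterococcus mundtii mRNA for GroEL chaperone',NULL,NULL,NULL),
 ('Contig2','Enterococcus sp. 16S ribosomal RNA gene',NULL,NULL,NULL),
 ('Contig3','Pseudomonas sp. genome assembly','hypothetical protein',NULL,NULL),
 ('Contig4','Spodoptera clone X partial sequence',NULL,NULL,NULL);
UPDATE taxonomy SET rna_type='mRNA', state_function='assigned'
 WHERE nucl_hit LIKE '%mRNA%';
UPDATE taxonomy SET rna_type='rRNA', state_function='assigned'
 WHERE rna_type IS NULL
   AND (nucl_hit LIKE '%rRNA%' OR nucl_hit LIKE '%ribosomal RNA%');
UPDATE taxonomy SET rna_type='Unassigned', state_function='assigned'
 WHERE rna_type IS NULL
   AND (nucl_hit LIKE '%genome%' OR nucl_hit LIKE '%chromosome%'
        OR nucl_hit LIKE '%scaffold%')
   AND (prot_hit LIKE '%hypothetical%' OR prot_hit LIKE '%uncharacterized%'
        OR prot_hit LIKE '%unknown%' OR prot_hit LIKE '%ncRNA%');
-- everything else is flagged for manual revision
UPDATE taxonomy SET rna_type='Revise' WHERE rna_type IS NULL;
SELECT sequence, rna_type FROM taxonomy;
EOF
```

As in the tutorial, transcripts sent to "Revise" keep a NULL "state_function" until they are reviewed by hand.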


 


 


Figure 20. Defining the transcript assignments in the "taxonomy" table. A. To carry out these assignments: 1) in the "Execute SQL" tab, 2) paste the commands contained in step_E_assigning transcripts.txt and 3) execute them with "Play"; 4) the lower panel indicates whether the commands were executed successfully. B. To view the updated "taxonomy" table, select it from the drop-down menu in the "Browse Data" tab (indicated by the red arrow and rectangle); the light blue rectangles indicate the columns with the final transcript assignments: "rna_type" and "state_function". The "rna_type" column indicates the category to which the transcript was assigned (i.e., "mRNA", "rRNA", "Unassigned" or "Revise"). When a transcript is assigned to "mRNA", "rRNA" or "Unassigned", the "state_function" column shows an "assigned" status; it shows "NULL" if the transcript was included in the "Revise" category.


 


Functional assignment (Figure 21, and step_F_functional assignment.txt and the stepF.mp4 video tutorial found in Supplementary Material Step F):
This step integrates into a single column ("function_type") all the functional assignments for those transcripts that were classified by the functional databases. This is done by executing the commands found in step_F_functional assignment.txt in DB4S (Figure 21 and the stepF.mp4 video tutorial from Supplementary Material Step F). From the previous step it follows that only transcripts in the "mRNA" and "Revise" categories can putatively be classified by the functional databases. Nevertheless, only about a third of them were assigned a function, because the information in these reference databases is still very limited (see "Data analysis").
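A minimal sketch of this integration step, assuming hypothetical "interpro2go" and "eggnog" columns standing in for the exported MEGAN functional assignments (the rows are invented; the actual commands are in step_F_functional assignment.txt):

```shell
# Merging the per-database functional assignments into 'function_type'.
rm -f func.db
sqlite3 func.db <<'EOF'
CREATE TABLE taxonomy(sequence TEXT, rna_type TEXT,
                      interpro2go TEXT, eggnog TEXT, function_type TEXT);
INSERT INTO taxonomy VALUES
 ('Contig1','mRNA','GO:0006457 protein folding',NULL,NULL),
 ('Contig2','mRNA',NULL,'COG0443 molecular chaperone',NULL),
 ('Contig3','Revise',NULL,NULL,NULL);
-- keep whichever functional annotation is available; transcripts with no
-- annotation in any functional database stay NULL
UPDATE taxonomy SET function_type = COALESCE(interpro2go, eggnog)
 WHERE rna_type IN ('mRNA','Revise');
SELECT sequence, IFNULL(function_type,'NULL') FROM taxonomy;
EOF
```

Contigs left as NULL here correspond to the roughly two-thirds of classifiable transcripts whose function must later be filled in manually from the homology search results.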


 


 


Figure 21. Integrating the functional assignments into the "taxonomy" table. A. 1) In the "Execute SQL" tab, 2) paste the commands contained in step_F_functional assignment.txt and 3) execute them with "Play"; 4) the lower panel indicates whether the commands were executed successfully. B. To view the updated "taxonomy" table, select it from the drop-down menu in the "Browse Data" tab (indicated by the red arrow and rectangle); the light blue rectangle indicates the "function_type" column, which shows the integrated functional assignments. Transcripts that were not classified by the functional databases show "NULL".


 


Data analysis


 


Browsing the "taxonomy" table (Supplementary Figures S2-S5 and analysing_taxonomy.txt):


  We are finally ready to browse and analyse the updated "taxonomy" table in the "Browse Data" tab. One or more filters can be applied to facilitate data analysis. In addition, commands can be executed to visualise, for example, the taxonomic distribution of the sample (Supplementary Figure S2), the distribution of transcript types (Supplementary Figure S3), or a non-redundant list of the hits obtained in the homology searches (Supplementary Figure S4).


  To analyse the functional profile in more detail, all the "mRNA" and "Revise" transcripts that were classified by the functional databases can be listed, as well as those that were not (Supplementary Figure S5). As mentioned before, since the reference databases are not comprehensive, we found that only around 30% of all the transcripts that could actually be classified by the functional databases (the "mRNA" and "Revise" categories) were really classified (Supplementary Figure S5A). To define the functional profile of the remaining 70% (122 contigs; Supplementary Figure S5B), the functional assignment of these transcripts can be determined individually from the homology search results and then entered manually into the database. To prioritise, and to help reduce this considerable workload, the contigs can be viewed according to their coverage (Supplementary Figure S5B).


All these commands are included in analysing_taxonomy.txt, and can be adjusted to cover other interests.


 


Notes


 


Because of file size issues, we used assembled reads (737 contigs) to illustrate the use of our workflow, but we have also used HoSeIn to analyse our reads (~300,000) and it works perfectly.
This workflow was originally developed for analysing high-throughput metatranscriptomic sequences, but we have also used it to analyse high-throughput metagenomic sequences. In addition, we validated our workflow by analysing a mock metagenome (BMock12) (Sevim et al., 2019) and comparing our results with those reported for this synthetic metagenome (Sevim et al., 2019). This validation was included in the study in which we present the analysis of the dataset used for this tutorial, which was recently accepted for publication (Rozadilla et al., 2020), and is included here as a supplementary analysis (Supplementary Analysis_mock metagenome.docx). In summary, we contrasted our results with those reported by Sevim et al. (2019) (Table S1) and found that our workflow not only identified all the members of the mock metagenome, but also that the number of contigs we identified for each community member was higher than (or the same as, but never lower than) that reported by the authors (Table S1). In conclusion, our workflow allowed us to identify all the community members of the mock metagenome with higher sensitivity than the previously reported approach.
Although our workflow includes many manual steps, these are comparable in number to the steps used by taxonomy-dependent alignment-based methods to classify and label reads from metatranscriptomic/metagenomic datasets. There are bioinformatic workflows for metatranscriptomic datasets that aim to simplify this complexity by connecting multiple individual tools into a single workflow that takes raw sequencing reads, processes them, and delivers data files with taxonomic identities, functional genes and/or differentially expressed transcripts (Shakya et al., 2019). However, to define the taxonomic and functional assignments, these platforms perform sequence-based searches against either protein or nucleotide databases, but not both (Shakya et al., 2019). As already mentioned, searches against protein databases enable the detection of distantly related organisms but are prone to false discoveries, whereas searches against nucleotide databases are more specific but fail to identify poorly conserved sequences. Metatranscriptomic analyses performed with these simplified workflows must therefore be interpreted carefully. Another major drawback is that some of these workflows assign taxonomy by searching against databases designed for functional characterisation (Shakya et al., 2019).
Summary of the unique innovations of the HoSeIn workflow:
After classifying and labelling the sequences (by whatever method), all the available information for each sequence, from the homology searches and elsewhere, is assembled and integrated into a local database, where it can easily be viewed and analysed.
The taxonomic profile of the sample is defined by comparing the taxonomic assignments of each sequence from the two homology searches following LCA logic; that is, a sequence's taxonomic assignment is the level found in common between the two homology search results or, when no hit was returned in the other homology search, the only available result.
Thus, the novelty of our workflow lies in the fact that the final assignment integrates the results of both homology searches, exploiting their respective strengths and thereby making the assignments more robust and reliable. This is a particularly useful tool for metatranscriptomics, where results are difficult to interpret.
The functional profile is defined by first assigning the transcripts and then integrating all the functional information into a single column (in the local database). What we observed is that the functional databases are currently only capable of classifying around 30% of all the transcripts that could potentially be functionally classified. To the best of our knowledge, other existing tools cannot resolve the functional information for the remaining two thirds of these transcripts. In contrast, in our workflow the functional assignment of these transcripts can be determined from the homology search results (contained in the local database), providing a more complete and detailed functional profile.
 


Acknowledgments


 


This research was supported by grants from the Agencia Nacional de Promoción Científica y Tecnológica (PICT PRH 112 and PICT CABBIO 3632) and the Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET) (PIP 0294) awarded to CBM. CBM is a member of the CONICET research career. GR and JMC are recipients of CONICET fellowships. This protocol was derived from McCarthy et al. (2013) and Rozadilla et al. (2020).


 


Competing interests


 


The authors declare no competing interests.


 


References


 


Aguiar-Pulido, V., Huang, W., Suarez-Ulloa, V., Cickovski, T., Mathee, K. and Narasimhan, G. (2016). Metagenomics, metatranscriptomics, and metabolomics approaches for microbiome analysis. Evol Bioinform Online 12(Suppl 1): 5-16.
Altschul, S. F., Gish, W., Miller, W., Myers, E. W. and Lipman, D. J. (1990). Basic local alignment search tool. J Mol Biol 215(3): 403-410.
Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis, A. P., Dolinski, K., Dwight, S. S., Eppig, J. T., Harris, M. A., Hill, D. P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J. C., Richardson, J. E., Ringwald, M., Rubin, G. M. and Sherlock, G. (2000). Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25(1): 25-29.
Blake, J. A., Christie, K. R., Dolan, M. E., Drabkin, H. J., Hill, D. P., Ni, L., Sitnikov, D., et al. (2015). Gene Ontology Consortium: going forward. Nucleic Acids Res 43(Database issue): D1049-1056.
Buchfink, B., Xie, C. and Huson, D. H. (2015). Fast and sensitive protein alignment using DIAMOND. Nat Methods 12(1): 59-60.
Finn, R. D., Attwood, T. K., Babbitt, P. C., Bateman, A., Bork, P., Bridge, A. J., Chang, H. Y., Dosztanyi, Z., El-Gebali, S., Fraser, M., Gough, J., Haft, D., Holliday, G. L., Huang, H., Huang, X., Letunic, I., Lopez, R., Lu, S., Marchler-Bauer, A., Mi, H., Mistry, J., Natale, D. A., Necci, M., Nuka, G., Orengo, C. A., Park, Y., Pesseat, S., Piovesan, D., Potter, S. C., Rawlings, N. D., Redaschi, N., Richardson, L., Rivoire, C., Sangrador-Vegas, A., Sigrist, C., Sillitoe, I., Smithers, B., Squizzato, S., Sutton, G., Thanki, N., Thomas, P. D., Tosatto, S. C., Wu, C. H., Xenarios, I., Yeh, L. S., Young, S. Y. and Mitchell, A. L. (2017). InterPro in 2017-beyond protein family and domain annotations. Nucleic Acids Res 45(D1): D190-D199.
Glass, E. M. and Meyer, F. (2011). The metagenomics RAST server: a public resource for the automatic phylogenetic and functional analysis of metagenomes. In: Handbook of Molecular Microbial Ecology I: Metagenomics and Complementary Approaches 9(1): 325-331.
Huson, D. H., Auch, A. F., Qi, J. and Schuster, S. C. (2007). MEGAN analysis of metagenomic data. Genome Res 17(3): 377-386.
Huson, D. H., Mitra, S., Ruscheweyh, H. J., Weber, N. and Schuster, S. C. (2011). Integrative analysis of environmental sequences using MEGAN4. Genome Res 21(9): 1552-1560.
Kim, M., Lee, K. H., Yoon, S. W., Kim, B. S., Chun, J. and Yi, H. (2013). Analytical tools and databases for metagenomics in the next-generation sequencing era. Genomics Inform 11(3): 102-113.
Kotera, M., Moriya, Y., Tokimatsu, T., Kanehisa, M. and Goto, S. (2015). KEGG and GenomeNet, new developments, metagenomic analysis. In: Encyclopedia of Metagenomics: Genes, Genomes and Metagenomes: Basics, Methods, Databases and Tools. Nelson, K. E. (Ed.). Springer US, Boston, MA: 329-339.
Marchesi, J. R. and Ravel, J. (2015). The vocabulary of microbiome research: a proposal. Microbiome 3: 31.
Marchler-Bauer, A., Bo, Y., Han, L., He, J., Lanczycki, C. J., Lu, S., Chitsaz, F., Derbyshire, M. K., Geer, R. C., Gonzales, N. R., Gwadz, M., Hurwitz, D. I., Lu, F., Marchler, G. H., Song, J. S., Thanki, N., Wang, Z., Yamashita, R. A., Zhang, D., Zheng, C., Geer, L. Y. and Bryant, S. H. (2017). CDD/SPARCLE: functional classification of proteins via subfamily domain architectures. Nucleic Acids Res 45(D1): D200-D203.
Markowitz, V. M., Chen, I. M., Palaniappan, K., Chu, K., Szeto, E., Grechkin, Y., Ratner, A., Jacob, B., Huang, J., Williams, P., Huntemann, M., Anderson, I., Mavromatis, K., Ivanova, N. N. and Kyrpides, N. C. (2012). IMG: the integrated microbial genomes database and comparative analysis system. Nucleic Acids Res 40(Database issue): D115-122.
McCarthy, C. B., Santini, M. S., Pimenta, P. F. and Diambra, L. A. (2013). First comparative transcriptomic analysis of wild adult male and female Lutzomyia longipalpis, vector of visceral leishmaniasis. PLoS One 8(3): e58645.
McCarthy, C. B., Cabrera, N. A. and Virla, E. G. (2015). Metatranscriptomic analysis of larval guts from field-collected and laboratory-reared Spodoptera frugiperda from the South American subtropical region. Genome Announc 3(4): e00777-15.
Ogata, H., Goto, S., Sato, K., Fujibuchi, W., Bono, H. and Kanehisa, M. (1999). KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res 27(1): 29-34.
Overbeek, R., Olson, R., Pusch, G. D., Olsen, G. J., Davis, J. J., Disz, T., Edwards, R. A., Gerdes, S., Parrello, B., Shukla, M., Vonstein, V., Wattam, A. R., Xia, F. and Stevens, R. (2014). The SEED and the Rapid Annotation of microbial genomes using Subsystems Technology (RAST). Nucleic Acids Res 42(Database issue): D206-214.
Pearson, W. R. (2004). Finding protein and nucleotide similarities with FASTA. Curr Protoc Bioinformatics Chapter 3: Unit3.9.
Powell, S., Szklarczyk, D., Trachana, K., Roth, A., Kuhn, M., Muller, J., Arnold, R., Rattei, T., Letunic, I., Doerks, T., Jensen, L. J., von Mering, C. and Bork, P. (2012). eggNOG v3.0: orthologous groups covering 1133 organisms at 41 different taxonomic ranges. Nucleic Acids Res 40(Database issue): D284-289.
Rozadilla, G., Cabrera, N. A., Virla, E. G., Greco, N. M. and McCarthy, C. B. (2020). Gut microbiota of Spodoptera frugiperda (J. E. Smith) larvae as revealed by metatranscriptomic analysis. J Appl Entomol n/a(n/a). doi.org/10.1111/jen.12742.
Sevim, V., Lee, J., Egan, R., Clum, A., Hundley, H., Lee, J., Everroad, R. C., Detweiler, A. M., Bebout, B. M., Pett-Ridge, J., Goker, M., Murray, A. E., Lindemann, S. R., Klenk, H. P., O'Malley, R., Zane, M., Cheng, J. F., Copeland, A., Daum, C., Singer, E. and Woyke, T. (2019). Shotgun metagenome data of a defined mock community using Oxford Nanopore, PacBio and Illumina technologies. Sci Data 6(1): 285.
Shakya, M., Lo, C. C. and Chain, P. S. G. (2019). Advances and challenges in metatranscriptomic analysis. Front Genet 10: 904.
Tatusov, R. L., Galperin, M. Y., Natale, D. A. and Koonin, E. V. (2000). The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res 28(1): 33-36.
Wooley, J. C., Godzik, A. and Friedberg, I. (2010). A primer on metagenomics. PLoS Comput Biol 6(2): e1000667.
Copyright: © 2020 The Authors; exclusive licensee Bio-protocol LLC.
引用:Rozadilla, G., Moreiras Clemente, J. and McCarthy, C. B. (2020). HoSeIn: A Workflow for Integrating Various Homology Search Results from Metagenomic and Metatranscriptomic Sequence Datasets. Bio-protocol 10(14): e3679. DOI: 10.21769/BioProtoc.3679.