Jul 2019

A Pipeline for Non-model Organisms for de novo Transcriptome Assembly, Annotation, and Gene Ontology Analysis Using Open Tools: Case Study with Scots Pine


Abstract

RNA sequencing (RNA-seq) has opened up the possibility of studying virtually any organism at the whole-transcriptome level. Nevertheless, the absence of a sequenced and accurately annotated reference genome may be an obstacle for applying this technique to non-model organisms, especially those with a complex genome. While de novo transcriptome assembly can circumvent this problem, it is often computationally demanding. Furthermore, transcriptome annotation and Gene Ontology enrichment analysis without an automated system is often a laborious task. Here we describe, step by step, the pipeline that was used to perform the transcriptome assembly, annotation, and Gene Ontology analysis of Scots pine (Pinus sylvestris), a gymnosperm species with a complex genome. Using only free software available to the scientific community and running on a standard personal computer, the pipeline is intended to facilitate transcriptomic studies of non-model species, while remaining flexible enough to be used with any organism.

Keywords: RNA-seq, De novo assembly, Non-model organism, Pinus sylvestris, Gymnosperm, Transcriptome tutorial

Background

Non-model organisms are valuable for environmental and evolutionary studies. However, the absence of a closely related model species to serve as reference for transcriptomic studies was a limitation until the development of de novo assembly methods. De novo assemblers brought non-model organisms into the omics era, but the required analyses may be computationally demanding and are often time consuming. This is especially the case for testing, evaluating, and establishing the pipeline for performing the required analyses. The costs of acquiring a computing server and/or software for automated data processing, which could speed up the process, may impose limitations on some research groups. We detail here the procedures that were employed for the transcriptome assembly, annotation, and gene ontology (GO) analysis of Scots pine (Pinus sylvestris), an organism with a complex genome (Duarte et al., 2019). The pipeline was developed based on works and benchmark evaluations that describe the best practices for transcriptome studies (Conesa et al., 2016; Honaas et al., 2016; Geniza and Jaiswal, 2017). The pine transcriptome analyses were performed on an 8-core machine with 32 GB of RAM. The outputs of three different assemblers, namely BinPacker (Liu et al., 2016), SOAPdenovo-Trans (Xie et al., 2014), and Trinity (Grabherr et al., 2011), were evaluated individually and combined. For the pine transcriptome assembly, the best result was obtained by combining the outputs of the different assemblers, followed by filtering for redundancies with EvidentialGene (Gilbert, 2020). The selection and combination of assemblers, nevertheless, should be adjusted according to the input data, and the decision can be made based on the quality assessment that is described step by step here.
InterProScan (Jones et al., 2014) performs a comprehensive signature annotation of the predicted proteins, while BLAST+ (Camacho et al., 2009) allows intra- and interspecific mapping for functional comparisons. Because of that, we combined InterProScan’s protein signatures with several BLAST+ searches for annotating the transcriptome of P. sylvestris (Duarte et al., 2019), which were integrated with Trinotate (Bryant et al., 2017). The unique GO identifiers were retrieved with a Python script that we make available, and used for GO enrichment analysis with BiNGO (Maere et al., 2005). The pipeline is based on tools that are open for the scientific community. While it is especially interesting for studies with non-model organisms, which lack closely related and well-annotated species for guiding the assembly and annotation, the pipeline is flexible to be applied to virtually any organism.


Software

Software – Linux OS:

Several programs that will be used in the protocol, such as BLAST+, Bowtie2, FASTA Splitter, and BUSCO, can easily be managed using the Bioconda channel (Grüning et al., 2018; https://bioconda.github.io/). Bioconda is a repository of packages specialized in bioinformatics, and it requires the conda package manager to be installed. The manager automatically installs all dependencies for a given program and makes it available in your PATH. The software supported via Bioconda is continuously updated, and we suggest always checking whether the required software is available through this repository. After installing conda, add the channels following the recommended order (please refer to https://bioconda.github.io/user/install.html#install-conda). Other software will need to be installed individually and may require adjustments depending on your system. Next, we list the software that will be required for each step, summarized in the flowchart (Figure 1).



Figure 1. Flowchart illustrating the five steps for transcriptome assembly, annotation, and Gene Ontology analysis. The input for starting each step is represented in the boxes, and the main procedures of each step are described in black. The software required for each step is listed in blue. For performing the transcriptome assembly (B), the user may use the software of choice. The three assemblers described in this pipeline were used for the Scots pine study (Duarte et al., 2019), and are marked as optional [opt].



  1. Data pre-processing

    1. FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc) – quality control of sequencing files

    2. FastQ Screen (https://www.bioinformatics.babraham.ac.uk/projects/fastq_screen) – screening of sequencing files

    3. Trimmomatic (Bolger et al., 2014; available via Bioconda) – trimming of sequencing files

    4. Trinity package (Grabherr et al., 2011) – in silico normalization


  2. Transcriptome assembly

    1. BinPacker (Liu et al., 2016; https://sourceforge.net/projects/transcriptomeassembly/) – assembler

    2. EvidentialGene (Gilbert, 2020; http://arthropods.eugenes.org/EvidentialGene/) – assembly filtering

    3. Galaxy platform (Afgan et al., 2018; usegalaxy.org) – open source online platform for data research

    4. SOAPdenovo-Trans (Xie et al., 2014; https://github.com/aquaskyline/SOAPdenovo-Trans) – assembler

    5. Trinity package (Grabherr et al., 2011; available in Galaxy platform) – assembler


  3. Quality assessment

    1. Bowtie2 (Langmead and Salzberg, 2012; available via Bioconda) – short read aligner

    2. BLAST+ (Camacho et al., 2009; available via Bioconda) – tool for comparing sequences and searching databases

    3. BUSCO (Simão et al., 2015; available via Bioconda) – assembly completeness assessment

    4. Samtools (Li et al., 2009; https://github.com/samtools/samtools) – data manipulation

    5. DETONATE (Li et al., 2014; available via Bioconda) – evaluation of de novo transcriptome assemblies

    6. Trinity package (Grabherr et al., 2011) – estimation of full-length transcripts in the assembly


  4. Transcriptome annotation

    1. BLAST+ (Camacho et al., 2009; available via Bioconda) – tool for comparing sequences and searching databases

    2. Bowtie2 (Langmead and Salzberg, 2012; available via Bioconda) – short read aligner

    3. Corset (Davidson and Oshlack, 2014; available via Bioconda) – clustering of transcripts based on shared reads

    4. FASTA splitter (http://kirill-kryukov.com/study/tools/fasta-splitter/) – creation of subsets of a FASTA file

    5. getorf (Rice et al., 2000; http://emboss.sourceforge.net/) – prediction of open reading frames

    6. HMMER (http://hmmer.org/) – prediction of domains for annotation

    7. InterProScan (Jones et al., 2014; https://www.ebi.ac.uk/interpro/) – protein domains identifier

    8. RNAmmer (Lagesen et al., 2007; http://www.cbs.dtu.dk/services/RNAmmer) – ribosomal RNA annotation

    9. SignalP (Petersen et al., 2011; https://services.healthtech.dtu.dk/) – prediction of signal peptides

    10. TmHMM (Krogh et al., 2001; https://services.healthtech.dtu.dk/) – protein transmembrane helix prediction

    11. TransDecoder (Haas et al., 2013; http://transdecoder.github.io) – identification of coding regions

    12. Trinotate (Bryant et al., 2017; https://trinotate.github.io/) – suite for transcriptome annotation


  5. Gene Ontology analysis

    1. BiNGO (Maere et al., 2005; download available via Cytoscape platform) – Gene Ontology analysis

    2. Cytoscape (Shannon et al., 2003; https://cytoscape.org/) – platform for performing the Gene Ontology analysis

    3. GO retriever – provided script for retrieving transcript ID - Gene Ontology ID relationships

Procedure

The protocol requires basic bioinformatics and Linux command line skills. We provide general instructions for using each software package for the required analyses. Each software's settings may be adjusted according to the user's needs; for that, we suggest referring to the software's manual for further information and troubleshooting. The time required for performing each step will depend entirely on your computing power and on the size of your dataset. While the assembly and quality assessment steps may require some hours, the comprehensive annotation procedure may require a few days to run. Based on the results obtained with Scots pine, an example of application is shown at the end of this protocol as supporting information.


  1. Pre-processing the RNA-seq dataset

    For starting this pipeline, it is necessary that your RNA-seq samples have been trimmed and quality checked. As a general guideline, FastQC will provide you with an overview of the quality of your reads. If you observe remaining vector sequences, they can be removed with FastQ Screen by providing a list of vector sequences, which can be obtained from the National Center for Biotechnology Information (NCBI, https://ftp.ncbi.nlm.nih.gov/pub/UniVec/). Trimmomatic can filter your samples based on the quality value of choice and prepares your paired reads. Trimmomatic also performs trimming of adapter sequences, which must be provided according to the method used for preparing your library. Both FastQ Screen and Trimmomatic have detailed manuals that are easy to follow.


    [Tip 1] Several files will be generated for performing all steps of this protocol. Save them with descriptive names that are easy to identify, and sort them in folders accordingly.

    [Tip 2] Some assemblers also accept the unpaired reads as an input. Do not discard them.


    For building a common de novo transcriptome reference for a given species, it is recommended to merge all RNA-seq files (i.e., all samples and conditions) for using a unique input for the assembly. Some software may perform this step as a part of their command line. Always refer to the software's manual for specific details.

    1. Using the Linux terminal, concatenate all trimmed fastq files (left, right, and unpaired) of all samples as three unique inputs for the assemblers (e.g., “all_left.file”, “all_right.file”, and “all_unpaired.file”). Concatenation of the data files can be done by using the “cat” function:


      > cat sample_1-1.file sample_2-1.file sample_n-1.file > all_left.file

      > cat sample_1-2.file sample_2-2.file sample_n-2.file > all_right.file


    2. [Optional] Perform in silico normalization of the three concatenated files to reduce redundancy in read datasets. Trinity package provides a script for it (insilico_read_normalization.pl). If you have paired-end reads, the command will be:


      > ./insilico_read_normalization.pl --seqType <string> \

      --left all_left.file --right all_right.file


      where seqType specifies the format of your reads (fa or fq). Several other parameters can be adjusted, such as the RNA-seq read orientation (--SS_lib_type <string>; RF should be used for the dUTP method). If you intend to use Trinity and your unpaired reads, append them to the left ones (e.g., “all_left_all_unpaired.file”) using the “cat” function. Note that the backslash (\) is merely used for breaking the command into several lines for visualization purposes (it can be removed).


  2. Transcriptome assembly (analyses performed in Linux OS)

    For performing the transcriptome assembly, you may use the software of choice. Here, we provide instructions for the assemblers that were used for the P. sylvestris transcriptome (Duarte et al., 2019), which therefore will be marked as optional. BinPacker and SOAPdenovo-Trans can be run locally on a standard personal computer but may require in silico normalization to reduce the input data. Because Trinity is computationally demanding, an alternative is to run it using the Galaxy platform (usegalaxy.org). Using a multiple k-mer strategy can improve the assembly by balancing sensitivity and specificity (Zerbino and Birney, 2008), but it may also introduce redundancy (Xie et al., 2014). This issue will be corrected later. Trinity is optimized to run with a fixed k-mer size of 25 (Grabherr et al., 2011). If RAM is limiting, BinPacker can be used for performing assemblies with the smaller k-mers, while SOAPdenovo-Trans is efficient for longer lengths. For de Bruijn graph-based assemblers, the increase in k-mer size must be based on odd numbers to avoid the creation of palindromes, which may block the graph construction (Zerbino and Birney, 2008).

    1. Perform the transcriptome assembly with different assemblers.

      1. [Optional] BinPacker only accepts paired inputs. The parameters must be adjusted according to the data set and run using various k-mer lengths (-k option), as for instance:


        > ./binpacker -s <string> -p <string> -k <int> \

        -l all_left_normalized.file -r all_right_normalized.file


        where -k <int> is the k-mer length of choice, -s <string> selects between fa and fq files, and -p <string> sets single-end (single, requiring, e.g., -u single_reads.file) or paired-end (pair, requiring -l left_reads.file -r right_reads.file) mode. At the end, concatenate all assemblies into one file using the “cat” function.

      2. [Optional] The easiest way to run SOAPdenovo-Trans is by setting all assembly parameters and input files in the configuration file, including the k-mer length of choice. It also allows both paired (q1=path_to_all_left_normalized.file and q2=path_to_all_right_normalized.file) and unpaired (q=path_to_all_unpaired_normalized.file) reads to be used. SOAPdenovo-Trans has two versions: “SOAPdenovo-Trans-31mer”, which performs assemblies with k-mer sizes from 13 to 31, and “SOAPdenovo-Trans-127mer”, which accepts k-mers up to 127 (always odd values). After adjusting the configuration file, run the assemblies using different k-mers:


        > ./SOAPdenovo-Trans-127mer all -s config_file -o output_prefix


        Increasing the k-mer value reduces the number of transcripts that are recovered, until none can be retrieved. At the end, combine all assemblies into one final file using the “cat” function.
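
        For reference, a SOAPdenovo-Trans configuration file is a plain-text file of key=value pairs. The fragment below is only an illustrative sketch with placeholder values (the read length, insert size, and file paths are assumptions that must be adjusted to your library; see the SOAPdenovo-Trans manual for the full list of options):

```
#maximal read length
max_rd_len=150
[LIB]
#average insert size of the paired-end library (placeholder value)
avg_ins=300
#sequencing direction: 0 = forward-reverse (typical paired-end)
reverse_seq=0
#use the reads for both contig and scaffold assembly
asm_flags=3
#paired reads
q1=path_to_all_left_normalized.file
q2=path_to_all_right_normalized.file
#unpaired reads
q=path_to_all_unpaired_normalized.file
```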

      3. [Optional] Trinity can be run using the Galaxy platform, either via the web servers (https://usegalaxy.org/) or using your own instance (https://galaxyproject.org/admin/get-galaxy/). Trinity can also perform the assembly combining the paired and unpaired reads, which must be appended to the left reads beforehand. Tutorials are available showing how to use the web server (https://usegalaxy.org/). Upload the quality-processed fastq files (e.g.,“all_right_normalized.file” and “all_left_all_unpaired_normalized.file”). Select Trinity in the tools list and set up the parameters according to your data. If your data have already been normalized, turn off this option and start the analysis.


      [Tip 3] We suggest testing different assemblers. The quality assessment that will be described in the next section will help you to decide which assembly or assemblies will be used as your reference transcriptome.


    2. Filter out the short isotigs from each assembly, which might be spurious sequencing products. A routinely used threshold is 200 bp. The “Filter sequences by length” tool in the Galaxy platform can perform this task. If you used other assemblers locally, upload the assemblies to Galaxy for filtering. The tool can be found under “Genomic File Manipulation > FASTA/FASTQ > Filter sequences by length”. Set the minimal length and run. Once it is done, download all filtered assembly files.
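
    If you prefer to perform this length filtering locally rather than in Galaxy, the same result can be obtained with a short script. The sketch below is a minimal illustration (not part of the original protocol) that keeps only sequences of at least 200 bp:

```python
# Minimal sketch: filter a FASTA file, keeping only sequences
# of at least MIN_LEN bases (200 bp by default).

MIN_LEN = 200

def filter_fasta(lines, min_len=MIN_LEN):
    """Yield (header, sequence) pairs for records of at least min_len bases."""
    header, seq = None, []
    for line in list(lines) + [">"]:      # ">" sentinel flushes the last record
        line = line.strip()
        if line.startswith(">"):
            if header is not None and len("".join(seq)) >= min_len:
                yield header, "".join(seq)
            header, seq = line, []
        elif line:
            seq.append(line)
```

    The function accepts any iterable of lines (e.g., an open file handle), so it also works with multi-line FASTA records.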

    3. [Recommended] If you opted for using more than one assembly file (different k-mers and/or different software), it is also possible to combine them into a single super-set. Because the combination of different assemblies creates a redundant file with transcripts of heterogeneous quality, they must be filtered. The EvidentialGene tr2aacds pipeline performs a coding potential analysis and removal of perfectly redundant transcripts rather than filtering the transcripts based solely on their length. The package provides several other scripts organized in different folders. Perform the tasks in the following order using as input your assembly files:

      1. Regularize the transcript IDs of each assembly using the “trformat.pl” script (required for the tr2aacds pipeline, located in the /scripts/rnaseq/ folder), as for instance:


        > ./trformat.pl -input assembly_1.fa -output ID-assembly_1.fa


        Combine the ID-regularized assemblies into one super-set (e.g., using the “cat” function);

      2. Run the tr2aacds (“tr2aacds4.pl”, which can be found in /scripts/prot/ folder) pipeline using as input the concatenated sets obtained with different assemblers and the super-set:


        > ./tr2aacds4.pl -mrnaseq ID-assembly_1.fa


        where -mrnaseq sets the name of the assembly to be evaluated.

      3. The pipeline outputs the relevant transcripts as “okay” sets (“okay-main” and “okay-alts”), which may be combined and used. A folder containing discarded sequences (“drop set”) is also created.


    4. The “assemblystats” tool provides a summary of the general statistics of the transcriptome assembly, such as isotig length and the number of assembled isotigs. If you are running your own Galaxy instance, the tool repository may be found at https://toolshed.g2.bx.psu.edu/view/konradpaszkiewicz/assemblystats/6544228ea290. Alternatively, it may be accessed at the PlantGenIE Galaxy server (http://galaxy.plantgenie.org).


  3. Quality assessment (analyses performed in Linux OS)

    For selecting the most representative assembly, it is necessary to compare the quality of the candidate assemblies. In this pipeline, four criteria will be evaluated for quality assessment: (1) mapping back the reads to the transcriptome assembly; (2) completeness; (3) representation of full-length transcripts; (4) evaluation of the DETONATE scores. It is highly recommended to perform all four analyses for a complete overview of the quality of your assemblies. These four criteria must be applied to each of the assemblies to be compared.

    1. [Recommended] For mapping back the reads to the transcriptome, a read alignment tool is necessary. For instance, Bowtie2 can be used as follows:

      1. First, build an index for each of the transcriptome assemblies to be tested:


        > bowtie2-build input_transcriptome.fasta input_transcriptome.fasta


      2. Perform the alignment of the RNA-seq reads to each of your assemblies, as for instance:


        > bowtie2 --local --no-unal -x input_transcriptome.fasta \

        -q -1 all_left_normalized.file \

        -2 all_right_normalized.file | samtools view -Sb - -o output.bam


        The alignment statistics will be reported at the end of the run.

    2. [Recommended] Completeness of a transcriptome can be assessed with BUSCO. Check for the newest version at https://busco.ezlab.org/. Running BUSCO is straightforward by using the following command line:


      > ./busco -m transcriptome -i input_transcriptome.fasta -o output_name -l LINEAGE


      where -m sets the mode for BUSCO (genome, proteins, or transcriptome), and the lineage (-l) is the dataset that will be used as reference for comparison, which must be related to the organism of your study. The lineage dataset can be downloaded from the BUSCO website. Optionally, the number of CPU threads (--cpu <int>) and compression of output files (--tarzip) can be set.

    3. [Recommended] Representation of full-length transcripts. When working with a model organism, the representation can be estimated by running a translated BLAST+ (blastx) search against the organism’s reference protein dataset. For non-model organisms, because of the absence of a specific reference for comparison, the analysis has to be done against the proteins of a closely related species (when available), or a dataset containing all known proteins (e.g., Swiss-Prot, available at https://www.uniprot.org). Next, the percentage of best aligning targets can be recovered using the script “analyze_blastPlus_topHit_coverage.pl” available in the Trinity package.

      1. First, create a BLAST database using your reference protein set:


        > makeblastdb -in ref_prot_dataset.fasta -dbtype prot


        where -in sets your reference input, and -dbtype sets the type of sequence (prot, for proteins)

      2. Run blastx against your reference:


        > blastx -query input_transcriptome.fasta -db ref_prot_dataset.fasta \

        -out blastx_output.outfmt6 -evalue 1e-20 -max_target_seqs 1 -outfmt 6


        where -query is your transcriptome input, -db sets the reference database, -evalue sets the threshold, -max_target_seqs informs the number of sequences to be retained (1, for only the best hit), and -outfmt <list> sets the format of the output file (6 is required for using the script). Optionally, the number of CPU threads may be set (-num_threads <int>).

      3. Run “analyze_blastPlus_topHit_coverage.pl” script on the output file:


        > ./analyze_blastPlus_topHit_coverage.pl blastx_output.outfmt6 \
        input_transcriptome.fasta ref_prot_dataset.fasta

        The script reports for each percentage level (1st column) the number of best matching transcripts (2nd column) and the comparison to the level below (3rd column).
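
        To clarify what the script consumes: each line of a tabular (-outfmt 6) BLAST report holds twelve tab-separated fields, the last one being the bit score. As a simplified, hypothetical sketch (not the Trinity script itself), the single best hit per transcript can be extracted as follows:

```python
# Minimal sketch: keep the single best BLAST hit per query from a tabular
# (-outfmt 6) report. Default column order: qseqid sseqid pident length
# mismatch gapopen qstart qend sstart send evalue bitscore.

def best_hits(lines):
    """Return {query_id: (subject_id, bitscore)}, keeping the highest-scoring hit."""
    best = {}
    for line in lines:
        if not line.strip():
            continue
        fields = line.split("\t")
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return best
```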

    4. [Recommended] Evaluation of the DETONATE scores. RSEM-EVAL, which is a component of the DETONATE v1.11 package, performs a reference-free evaluation of the assembly quality using a probabilistic model (Li et al., 2014). DETONATE requires some settings to be adjusted prior to the quality evaluation:

      1. First, estimate the transcript length distribution using the “rsem-eval-estimate-transcript-length-distribution” script provided in the package. If a reference dataset is not available, a closely related species can be used:


        > ./rsem-eval-estimate-transcript-length-distribution \
        input_reference.fasta parameter_file

        where parameter_file is the output that will store the information about the estimated mean length and standard deviation of the transcripts;

      2. Next, calculate the scores with the “rsem-eval-calculate-score” script:


        > rsem-eval-calculate-score \

        --transcript-length-parameters parameter_file \

        --paired-end all_left_normalized.file all_right_normalized.file \

        input_transcriptome.fasta \

        assembly_name L


        where --transcript-length-parameters sets the parameter file, --paired-end sets the analysis for paired-end sequences, input_transcriptome.fasta should be the assembly that will be evaluated, assembly_name is the prefix that will be used for the output files, and L is an integer for the average fragment length of your paired-end data. The program uses Bowtie2 by default. If it is not available in your PATH, use the --bowtie-path <path> option to set the destination. Several other options are available to be set if necessary. DETONATE scores will always be negative, but the higher the score value, the better the assembly.


  4. Transcriptome annotation (analyses performed in Linux OS)

    InterProScan performs the recognition of protein signatures, which is useful for retrieving the possible function of the assembled transcripts. However, the analysis is time consuming. Alternatively, Trinotate performs a less detailed protein annotation, but provides a good way to integrate BLAST searches and to extract information from them, thus being a good option when a closely related species is well annotated. Trinotate also uses some tools for protein signature recognition, but its input format is tied to the outputs generated by Trinity, thus requiring a few extra steps when other assemblers are used. For performing a detailed protein signature recognition with InterProScan, the following steps should be performed:

    1. First, predict the open reading frames (ORFs) of the transcripts with getorf, which is part of the EMBOSS package (http://emboss.sourceforge.net/), or with TransDecoder. For instance, getorf can be used with the following command:


      > getorf -sequence input_transcriptome.fasta -outseq ORF-predicted_transcriptome.fasta \

      -table <list> -find 0 -minsize <int>


      where -minsize <int> sets the minimum nucleotide size of the ORFs (e.g., 200), -table <list> sets the genetic code to be used (0 for standard), and -find <list> sets the searching method (0 for translation of regions between STOP codons).

    2. InterProScan analysis can be computationally demanding for large datasets. When not running on a computing server, in order to circumvent memory-related issues, an alternative is to divide the ORF-predicted transcriptome file into several parts. Next, run the analysis for each of the subsets (for instance, using a script to start sequential analyses). First, use the FASTA splitter tool for dividing the ORF-predicted transcriptome file:


      > ./fasta-splitter.pl --measure count --part-size <int> ORF-predicted_transcriptome.fasta


      where --measure (all, seq, or count) specifies whether all data, sequence length, or number of sequences will be used for dividing the file, --part-size <int> sets the number of fasta sequences to be present in each subset.

    3. Next, run InterProScan analysis for each of the subsets. For best results with InterProScan, activate the databases of partner resources (e.g. Panther, SignalP, SMART and TMHMM) and make use of the lookup service which has pre-calculated results (requires internet connection and the latest version of the software). If you intend to perform Gene Ontology enrichment analysis, which will be detailed in the next section, the --goterms and --iprlookup options are required. Detailed information about InterProScan configuration can be found in InterProScan Documentation page (https://github.com/ebi-pf-team/interproscan).


      [Tip 4]
      In case you are using the conda tool, check whether your Python and Java JDK/JRE versions match InterProScan’s requirements. If not, it is possible to create new conda environments for running specific software versions (check how to create new environments on the Conda webpage; https://docs.conda.io/projects/conda/en/latest/index.html).

    4. If you divided your ORF-predicted transcriptome in several subsets, concatenate all .tsv output files using the “cat” function.

    5. [Optional] For integrating Trinotate to the analysis when non-Trinity assemblies are included, it is necessary to provide a gene-transcript relationship of the transcriptome, which can be created with Corset (https://github.com/Oshlack/Corset/wiki):

      1. First, create a Bowtie2 index for the transcriptome assembly file as described in Step C1a;

      2. Next, map back each of the RNA-seq samples used for the assembly to the transcriptome with Bowtie2. If you used both paired and unpaired reads for your assembly, the command would be:


        > bowtie2 --all -X <int> -x input_transcriptome.fasta \

        -q -1 sample_n-1.fastq -2 sample_n-2.fastq \

        -U sample_n-1.unpaired.fastq,sample_n-2.unpaired.fastq | \

        samtools view -Sb - > sample_n.bam


        where -X <int> is the estimated fragment size of your data. If only paired reads were used for the assembly, remove the -U part;

      3. The .bam files are used as input for Corset. First, create the summaries independently for each input sample:


        > ./corset -m <int> -r true-stop sample_n.bam


        where -m sets to filter out transcripts with fewer than <int> reads, and -r true-stop specifies that only the summaries must be created (outputs .corset-reads). Since we are only interested in having a rough estimation for creating a required input file for Trinotate, we can set -m 0 (consider all transcripts). Later, if performing gene expression analysis, the number of reads mapping to a transcript may be precisely considered by the tool you opt to use.

      4. When the summaries have been created for all samples, resume Corset run defining the grouping for analysis:


        > ./corset -i corset -g <list> .corset-reads


        where -i corset sets the summaries previously created as the input file (the default is .bam). If you have groups of samples, for instance two replicates each of control and treatment, they can be specified with the -g option separated by commas (e.g., -g control,control,treatment,treatment). The input files have to follow the same order. If -g is not set, each sample will be considered independently. The command for such an example would be as follows:


        > ./corset -i corset -g control,control,treatment,treatment \

        control_1.bam.corset-reads control_2.bam.corset-reads \

        treatment_1.bam.corset-reads treatment_2.bam.corset-reads


      5. When the analysis is over, edit the Corset clusters file, which has the gene-transcript relationship, by switching the position of the two columns to conform to Trinotate's requirements.
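
        The column swap itself is trivial; a minimal Python sketch (assuming a plain two-column, tab-separated clusters file) could be:

```python
# Minimal sketch: swap the two tab-separated columns of the Corset clusters
# file (transcript<TAB>cluster) so each line reads cluster<TAB>transcript,
# i.e., the gene-transcript order expected by Trinotate.

def swap_columns(lines):
    """Yield each two-column, tab-separated line with its columns swapped."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        first, second = line.split("\t")
        yield second + "\t" + first
```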

    6. [Optional] A detailed Trinotate step-by-step guide is available at http://trinotate.github.io. Trinotate relies on the Swiss-Prot database for performing its annotation. An interesting feature of this tool is that it integrates BLAST searches into the analysis. When working with non-model species or samples from uncontrolled environments, BLAST searches against higher taxa can be used for identifying spurious transcripts, which you might consider filtering out.


  5. Gene Ontology analysis

    Several methods and tutorials are currently available for running differential gene expression analysis (e.g., edgeR [McCarthy et al., 2012] and DESeq2 [Love et al., 2014], among others), which is a straightforward procedure. Therefore, we will focus on the description of the GO enrichment analysis. Nevertheless, for this step, the annotated set of genes and the list of differentially expressed genes are both required.

    BiNGO is a tool that runs in the Cytoscape platform (https://cytoscape.org/). It is available for Linux, Mac OS, and Microsoft Windows. BiNGO allows one to perform the GO enrichment analysis using a custom annotation as background. The custom file for BiNGO must be parsed as “transcript_ID = number”, where the transcript ID is associated with a unique GO ID number (one entry per line). Therefore, before performing the analysis it is necessary to retrieve the GO identifiers from the annotation file. We provide a Python script for retrieving the GO information from InterProScan (“GO_retriever.py”, available as Supplementary File S1). For each transcript ID in the 1st column, the script will return the respective GO number, if any, from the Nth column. When more than one GO term is associated with a transcript, the script splits the transcript ID into two or more lines, each one being linked to a unique GO number. The output is a .txt file. The script can also be applied for extracting the GO numbers from the Trinotate output, but it would be necessary to remove the GO descriptions from the report file beforehand.
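
    The core logic of such a script can be sketched as follows. This is a simplified illustration, not GO_retriever.py itself; the column index, the vertical-bar split character, and the stripping of the “GO:” prefix are assumptions that must be adjusted to your annotation file and to the format BiNGO expects:

```python
# Simplified illustration of the GO-retrieval logic: for each tab-separated
# annotation line, pair the transcript ID (1st column) with every GO identifier
# found in the Nth column, emitting one "transcript_ID = number" entry per term.

GO_COLUMN = 13      # 0-based index of the GO column (assumption; adjust to your file)
SPLIT_CHAR = "|"    # InterProScan separates multiple GO terms with a vertical bar

def retrieve_go(lines, go_column=GO_COLUMN, split_char=SPLIT_CHAR):
    """Yield 'transcript_ID = number' strings, one per associated GO term."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= go_column:
            continue
        transcript_id, go_field = fields[0], fields[go_column]
        for go_id in go_field.split(split_char):
            if go_id.startswith("GO:"):
                # the "GO:" prefix is stripped here, assuming BiNGO expects
                # the bare number; keep it if your setup differs
                yield transcript_id + " = " + go_id[3:]
```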

    1. For running the script, it is necessary to edit it according to your input data:

      1. Line 1: Replace INPUT_FILE.tsv with the name of your input file.

      2. [Optional] Line 2: Change the name of the output file (OUTPUT_FILE_GOs.txt).

      3. Line 12: Based on your annotation file, set the column (N) where the GO ID numbers were reported.

      4. Line 17: By default, the script is set to recognize the GO split character used by InterProScan (vertical bar). If you intend to use the script with Trinotate’s output, you will first need to remove the descriptions provided for each term (keep only GO:number), adjust the split character in the script accordingly, and change the column where the information is reported (previous step).

      5. With the annotation file and the script in the same directory, run the script:


        > python GO_retriever.py input_file
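    The core logic of the supplementary script can be sketched as follows. This is a minimal re-implementation for readers adapting it to other annotation formats, not the supplementary script itself; the example rows, file-free interface, and column index are illustrative placeholders:

    ```python
    # Minimal sketch of the GO-retrieval logic: for each transcript ID in the
    # first column of a tab-separated annotation file, emit one
    # (transcript_ID, GO:number) pair per associated GO term, so the result
    # can be written as "transcript_ID = number" lines for BiNGO.

    def retrieve_go_terms(lines, go_column, split_char="|"):
        """Yield (transcript_id, go_id) pairs from tab-separated annotation rows."""
        for line in lines:
            fields = line.rstrip("\n").split("\t")
            if len(fields) <= go_column:
                continue  # row has no GO column
            transcript_id = fields[0]
            go_field = fields[go_column].strip()
            if not go_field.startswith("GO:"):
                continue  # no GO annotation in this row
            # One transcript may carry several GO terms; split into one pair each
            for go_id in go_field.split(split_char):
                yield transcript_id, go_id

    # Hypothetical InterProScan-like rows (GO terms in the 4th column, index 3)
    rows = [
        "t1\tPfam\tPF00001\tGO:0005524|GO:0016301",
        "t2\tPfam\tPF00002\t-",
    ]
    pairs = list(retrieve_go_terms(rows, go_column=3))
    # -> [("t1", "GO:0005524"), ("t1", "GO:0016301")]
    ```

    Each pair would then be written out as one “transcript_ID = number” line, which is the background-file format BiNGO expects.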

    2. BiNGO can be installed directly through Cytoscape (on the menu bar, select Apps > App Manager ..., select BiNGO on the list, install). After installation, BiNGO will be listed in the Apps menu. With BiNGO running:

      1. From your table with differentially expressed genes, copy the IDs of those that you want to check for overrepresentation (e.g., up-regulated or down-regulated genes).

      2. Select "paste Genes from Text" option, and paste the IDs in the box.

      3. "Overrepresentation" and "Visualization" are selected by default.

      4. Select the statistical test, multiple testing correction, and significance level (e.g., Benjamini & Hochberg False Discovery Rate, P < 0.05).

      5. Select the categories which you want to visualize. By default, only those overrepresented after the correction will be shown.

      6. For the reference set, select the whole annotation (default).

      7. Select the ontology file (.obo file, e.g., “go-basic.obo”), which can be downloaded from the GO Consortium website (https://geneontology.github.io/). BiNGO provides some ontology file options but they may be out of date.

      8. Under "Select namespace", it is possible to choose which GO category will be evaluated, such as biological process, cellular component, or molecular function. For each of them, an independent run will be required.

      9. Finally, under "Select organism/annotation", select as your background the custom annotation file created in the previous steps.

      10. If you check the box for saving the data, you can already set the destination folder. Otherwise, you can also save the results later.


    BiNGO outputs a graphical visualization of the network, representing the statistically significant enriched GO terms by colours. The nodes can be adjusted for better visualization. The image can be exported by selecting this option under the "File" menu. At the bottom, it is possible to retrieve the significance level of each enriched category for compiling a summary table for easier inspection.


Example of the protocol application – Scots pine transcriptome analysis


For describing this pipeline, the datasets used as example are detailed in Duarte et al. (2019). In summary, RNA samples were isolated from pools of Scots pine needles from three areas contaminated by radionuclides after the Chernobyl Nuclear Power Plant accident (Kulazhin, Masany, and Zabor’e), and from a reference plot. The samples were 125 bp paired-end sequenced. FastQC v0.11.5 was used to quality-check the sequence data, and FastQ Screen v0.9.2 was applied for removing vector sequences from the RNA-seq reads. The reads were then trimmed with Trimmomatic v0.36, using the ILLUMINACLIP:Adapters.fa:2:30:10, HEADCROP:10, MINLEN:35, LEADING:5, TRAILING:5, and SLIDINGWINDOW:4:30 parameters. Preliminary tests performed with three different Phred qualities for trimming (5, 20, and 30) indicated that the highest Phred cut-off was the most reliable for performing the transcriptome assembly (Duarte et al., 2019). All Linux-dependent analyses were performed on Ubuntu 16.04 or 18.04 releases, while the GO enrichment evaluation was run on Windows 10. The processed reads are available in the Sequence Read Archive (BioProject PRJNA497687).


  1. Transcriptome assembly

    For the pine transcriptome assembly (Duarte et al., 2019), we opted for normalizing the concatenated read files to reduce redundancy using the insilico_read_normalization.pl script available in the Trinity v2.2.0 package. Next, three different assemblers were used, namely BinPacker v1.1, SOAPdenovo-Trans v1.04, and Trinity v2.2.0. Five different assemblies were generated and filtered for isotigs with a minimum length of 200 bp. BinPacker was run with default parameters except for the gap length (-g), which was set to 500 according to the insert size of the data set (Duarte et al., 2019). The combined BinPacker set comprised five different k-mers (23, 25, 27, 29, and 31) and was then filtered for redundancies using the EvidentialGene pipeline (release 19-Mar-2015). For the SOAPdenovo-Trans assemblies, ten different k-mer lengths were used and then combined (25, 31, 37, 43, 49, 55, 65, 75, 85, and 95). The program was run using the following parameters: max_rd_len = 115, rd_len_cutoff = 115, avg_ins = 500, reverse_seq = 0, asm_flags = 3, map_len = 32, -f, -F. No data were retrieved with k-mer > 95. On an 8-core machine with 32 GB of RAM, the BinPacker and low k-mer SOAPdenovo-Trans assemblies required a few hours to run. The concatenated SOAPdenovo-Trans set was filtered with EvidentialGene. Trinity was run on the Galaxy platform (usegalaxy.org), using the paired-end mode, with unpaired reads appended to the left reads, and default parameters except for the normalisation, which was switched off. Although Trinity was run only with its optimized k-mer length, filtering with EvidentialGene was also performed for comparison. Finally, the “assemblystats” tool was used for retrieving the general statistics of each assembly, which are shown in Table 1.
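    If the assemblystats tool is not available, the same general statistics can be computed with a few lines of code. The sketch below is an illustrative stand-in, not the tool itself; it assumes the isotig lengths have already been extracted from the assembly FASTA into a list:

    ```python
    # Compute basic assembly statistics (isotig count, min, max, mean, N50)
    # from a list of isotig lengths; a lightweight stand-in for "assemblystats".

    def assembly_stats(lengths):
        lengths = sorted(lengths, reverse=True)
        total = sum(lengths)
        # N50: length of the shortest isotig in the smallest set of longest
        # isotigs that together cover at least half of the total assembly.
        running, n50 = 0, 0
        for length in lengths:
            running += length
            if running * 2 >= total:
                n50 = length
                break
        return {
            "isotigs": len(lengths),
            "min": lengths[-1],
            "max": lengths[0],
            "mean": total / len(lengths),
            "N50": n50,
        }

    # Toy example with five isotig lengths
    stats = assembly_stats([200, 300, 500, 1000, 2000])
    # total = 4000; the longest isotig alone covers 2000*2 >= 4000, so N50 = 2000
    ```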


    Table 1. General statistics for P. sylvestris assemblies comparing the output of BinPacker, SOAPdenovo-Trans, Trinity, and the super-set combining all assemblies. Adapted from Duarte et al. (2019), Table S1.

    Assembly                   BinPacker  SOAPdenovo-Trans   Trinity            Trinity            Super-set
    EvidentialGene pipeline    Yes        Yes                No                 Yes                Yes
    Reads                      Paired     Paired + unpaired  Paired + unpaired  Paired + unpaired  Paired + unpaired
    Min isotig length (bp)     200        200                201                201                200
    Max isotig length (bp)     16,876     9,566              14,596             12,302             16,876
    Mean isotig length (bp)    1487.12    885.15             624.99             1009.45            1072.61

    In general, SOAPdenovo-Trans performed poorly at assembling long isotigs, as shown by the maximum isotig length. However, considering the mean length, this assembler scored higher than unfiltered Trinity. Trinity apparently produces a highly redundant output, as indicated by the number of isotigs before and after filtering. The Trinity assembly also seems to contain a high proportion of short isotigs, which would explain the shortest median and N50 lengths retrieved among the assemblers. Filtering Trinity’s output, however, markedly reduced the number of isotigs in the N50 set. Considering these general statistics, BinPacker, the filtered super-set, and unfiltered Trinity showed the most promising initial results.


  2. Quality assessment

    The four criteria for transcriptome assembly quality assessment described in this protocol were also applied in the P. sylvestris transcriptome study (Duarte et al., 2019). They are summarized in Table 2. The reads were mapped back to each of the five assemblies using Bowtie v2.2.9. Completeness was assessed with BUSCO v2 using the plant set (embryophyta_odb9). A translated nucleotide BLAST (blastx v2.7.1) search was performed against the Pinus taeda v1.01 proteins retrieved from the TreeGenes Database (https://treegenesdb.org/) and the TrEMBL Viridiplantae protein sequences (release 2017_03). Finally, DETONATE v1.11 was used for scoring the quality, using P. taeda nucleotide sequences for estimating the transcript length distribution. Performing all four analyses required some hours, and the results for each assembly are provided in Table 2. The filtered SOAPdenovo-Trans set had the lowest scores for all four criteria evaluated, while filtering Trinity’s output did not improve its quality scores. On the other hand, the filtered BinPacker set comprising five different k-mer lengths, the filtered super-set comprising all assemblies, and unfiltered Trinity had similar alignment rates and completeness. Although the unfiltered Trinity assembly had a slightly better DETONATE score, the super-set showed a better representation of full-length transcripts and better BUSCO scores (highest completeness and lowest number of missing BUSCOs). The combined super-set was therefore selected for the evaluation of differentially expressed genes and the GO enrichment analysis (Duarte et al., 2019).
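    The representation of full-length transcripts follows the logic of the analyze_blastPlus_topHit_coverage.pl script from the Trinity package. As a rough illustration only (not the script itself), best-hit coverage can be estimated from a tabular blastx report as sketched below; the column layout is the BLAST+ default for -outfmt 6, and the reference protein lengths, read here from a hypothetical dictionary, must be supplied separately:

    ```python
    # For each query transcript, estimate what fraction of its best-hit
    # reference protein is covered by the alignment. Default -outfmt 6 columns:
    # qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore

    def best_hit_coverage(outfmt6_lines, subject_lengths):
        best = {}  # query -> (bitscore, coverage of the subject protein)
        for line in outfmt6_lines:
            f = line.rstrip("\n").split("\t")
            query, subject = f[0], f[1]
            sstart, send = int(f[8]), int(f[9])
            bitscore = float(f[11])
            coverage = (abs(send - sstart) + 1) / subject_lengths[subject]
            # Keep only the highest-scoring hit per query
            if query not in best or bitscore > best[query][0]:
                best[query] = (bitscore, coverage)
        return {q: cov for q, (_bs, cov) in best.items()}

    # Hypothetical blastx hits for one transcript against two proteins
    hits = [
        "t1\tprotA\t90.0\t100\t5\t0\t1\t300\t1\t100\t1e-50\t200",
        "t1\tprotB\t80.0\t50\t5\t0\t1\t150\t1\t50\t1e-20\t90",
    ]
    coverage = best_hit_coverage(hits, {"protA": 100, "protB": 120})
    # t1's best hit (protA, bitscore 200) spans residues 1-100 of a
    # 100-residue protein, i.e., full-length coverage of 1.0
    ```

    Binning such coverage values (e.g., at 10% intervals) gives the full-length representation summary that the Trinity script reports.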


    Table 2. Quality assessment of P. sylvestris assemblies comparing the output of BinPacker, SOAPdenovo-Trans, Trinity, and the super-set with all assemblies. Adapted from Duarte et al. (2019), Table S1.


  3. Transcriptome annotation and gene ontology enrichment analysis

    For the pine transcriptome study (Duarte et al., 2019), InterProScan v5.26-65.0 results were combined with the summary of BLAST searches reported by Trinotate v3.1.1. For running InterProScan, the ORFs were predicted with getorf (EMBOSS v6.6.0). Then, FASTA splitter v0.2.6 was used to divide the ORF-predicted file into subsets containing 10,000 transcripts. InterProScan was run with the following methods: CDD, Gene3D, Hamap, PANTHER, Pfam, PIRSF, PRINTS, ProDom, ProSitePatterns, ProSiteProfiles, SFLD, SMART, SUPERFAMILY, and TIGRFAM. The comprehensive annotation with InterProScan required a few days to complete. For integrating Trinotate into the analysis, the gene-transcript relationship was created using Corset v1.06. The prediction of coding regions for Trinotate was performed with TransDecoder v5.3.0 (http://transdecoder.github.io). Blastp and blastx searches were run against the Swiss-Prot database (release 2017_03), reporting only the best hit (-max_target_seqs 1). Blastp and blastx analyses were also conducted against a collection of gymnosperm-specific protein sequences retrieved from the TreeGenes Database (https://treegenesdb.org/) and ConGenIE (Sundell et al., 2015) (http://congenie.org/), namely Picea abies v1.0, Picea glauca PG29V4, Pinus lambertiana v1.0, Pinus taeda v1.01, and Pseudotsuga menziesii v1.0. In addition, blastp was run against the TrEMBL Viridiplantae protein sequences (release 2017_03). Only gymnosperm- and/or Viridiplantae-related sequences were retained. Protein domains were identified with HMMER v2.3 (http://hmmer.org/). Signal peptides, rRNA transcripts, and transmembrane regions were predicted using SignalP v4.1, RNAMMER v1.2, and TMHMM v2.0, respectively. As the final result, the combination of the Trinotate and InterProScan analyses allowed 75,652 of the 98,870 pine transcripts (76.5%) to be successfully annotated (Duarte et al., 2019).
Using this final dataset, the transcript ID–GO ID relationships were retrieved with the GO_retriever.py script provided in this protocol, generating the background file that was used for the GO enrichment analysis with BiNGO v3.03. The enrichment analysis is provided as supplementary Figures S2 and S3 and Table S3 in Duarte et al. (2019).


Supplementary material

Supplementary File S1. GO_retriever: Python script for extracting GO IDs from an annotation file.

Acknowledgments

The authors thank M. J. Neave for the Python script, and Duarte et al. (2019) for the Scots pine data. This work was supported by the Russian Science Foundation (Grant #14-14-00666) and by the Russian Foundation for Basic Research (Grant # 18-34-20012).

Competing interests

The authors declare no competing interests.

References

  1. Afgan, E., Baker, D., Batut, B., van den Beek, M., Bouvier, D., Cech, M., Chilton, J., Clements, D., Coraor, N. and Grüning, B. A., et al. (2018). The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2018 update. Nucleic Acids Res 46(W1): W537-W544.
  2. Bolger, A. M., Lohse, M. and Usadel, B. (2014). Trimmomatic: A flexible trimmer for Illumina sequence data. Bioinformatics 30(15): 2114-2120.
  3. Bryant, D. M., Johnson, K., DiTommaso, T., Tickle, T., Couger, M. B., Payzin-Dogru, D., Lee, T. J., Leigh, N. D., Kuo, T. H. and Davis, F. G., et al. (2017). A tissue-mapped axolotl de novo transcriptome enables identification of limb regeneration factors. Cell Rep 18(3): 762-776.
  4. Camacho, C., Coulouris, G., Avagyan, V., Ma, N., Papadopoulos, J., Bealer, K. and Madden, T. L. (2009). BLAST+: architecture and applications. BMC Bioinformatics 10: 421.
  5. Conesa, A., Madrigal, P., Tarazona, S., Gomez-Cabrero, D., Cervera, A., McPherson, A., Szcześniak, M. W., Gaffney, D. J., Elo, L. L., Zhang, X. and Mortazavi, A. (2016). A survey of best practices for RNA-seq data analysis. Genome Biol 17: 13.
  6. Davidson, N. M. and Oshlack, A. (2014). Corset: enabling differential gene expression analysis for de novo assembled transcriptomes. Genome Biol 15(7): 410.
  7. Duarte, G. T., Volkova, P. Yu. and Geras’kin, S. A. (2019). The response profile to chronic radiation exposure based on the transcriptome analysis of Scots pine from Chernobyl affected zone. Environ Pollut 250: 618-626.
  8. Geniza, M. and Jaiswal, P. (2017). Tools for building de novo transcriptome assembly. Curr Plant Biol 11-12: 41-45.
  9. Gilbert, D. (2020). Longest protein, longest transcript or most expression, for accurate gene reconstruction of transcriptomes. bioRxiv. doi: https://doi.org/10.1101/829184
  10. Grabherr, M. G., Haas, B. J., Yassour, M., Levin, J. Z. and Thompson, D. A., et al. (2011). Trinity: reconstructing a full-length transcriptome without a genome from RNA-Seq data. Nat Biotechnol 29(7): 644-652.
  11. Grüning, B., Dale, R., Sjödin A, Chapman, B. A., Rowe, J. et al. (2018). Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods 15(7): 475-476.
  12. Haas, B. J, Papanicolaou, A., Yassour, M., Grabherr, M., Blood, P. D. and Bowden, J., et al. (2013). De novo transcript sequence reconstruction from RNA-Seq: reference generation and analysis with Trinity. Nat Protoc 8(8): 1494-1512.
  13. Honaas, L. A., Wafula, E. K., Wickett, N. J., Der, J. P., Zhang, Y., Edger, P. P., Altman, N. S., Pires, J. C., Leebens-Mack, J. H. and dePamphilis, C. W. (2016). Selecting superior de novo transcriptome assemblies: lessons learned by leveraging the best plant genome. PLoS One 11(1): e0146062.
  14. Jones, P., Binns, D., Chang, H. Y., Fraser, M., Li, W., McAnulla, C., McWilliam, H., Maslen, J., Mitchell, A. and Nuka, G., et al. (2014). InterProScan 5: genome-scale protein function classification. Bioinformatics 30(9): 1236-1240.
  15. Krogh, A., Larsson, B., von Heijne, G. and Sonnhammer, E. L. (2001). Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol 305(3): 567-580.
  16. Lagesen, K., Hallin, P., Rødland, E. A., Staerfeldt, H. H., Rognes, T. and Ussery, D. W. (2007). RNAmmer: consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Res 35(9): 3100-3108.
  17. Langmead, B. and Salzberg, S. (2012). Fast gapped-read alignment with Bowtie 2. Nat Methods 9(4): 357-359.
  18. Li, B., Fillmore, N., Bai, Y., Collins, M., Thomson, J. A., Stewart, R. and Dewey, C. N. (2014). Evaluation of de novo transcriptome assemblies from RNA-Seq data. Genome Biol 15(12): 553.
  19. Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G. and Durbin, R., 1000 Genome Project Data Processing Subgroup. (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics 25(16): 2078-2079.
  20. Liu, J., Li, G., Chang, Z., Yu, T., Liu, B., McMullen, R., Chen, P. and Huang, X. (2016). BinPacker: packing-based de novo transcriptome assembly from RNA-seq data. PLoS Comput Biol 12(2): e1004772.
  21. Love, M. I., Huber, W. and Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol 15: 550.
  22. Maere, S., Heymans, K. and Kuiper, M. (2005). BiNGO: a Cytoscape plugin to assess overrepresentation of Gene Ontology categories in biological networks. Bioinformatics 21(16): 3448-3449.
  23. McCarthy, D. J., Chen, Y. and Smyth, G. K. (2012). Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucleic Acids Res 40(10): 4288-4297.
  24. Petersen, T. N., Brunak, S., von Heijne, G. and Nielsen, H. (2011). SignalP 4.0: discriminating signal peptides from transmembrane regions. Nat Methods 8(10): 785-786.
  25. Rice, P., Longden, I. and Bleasby, A. (2000). EMBOSS: the european molecular biology open software suite. Trends Genet 16(6): 276-277.
  26. Shannon, P., Markiel, A., Ozier, O., Baliga, N. S., Wang, J. T., Ramage, D., Amin, N., Schwikowski, B. and Ideker, T. (2003). Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 13(11): 2498-2504.
  27. Simão, F. A., Waterhouse, R. M., Ioannidis, P., Kriventseva, E. V. and Zdobnov, E. M. (2015). BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31(19): 3210-3212.
  28. Sundell, D., Mannapperuma, C., Netotea, S., Delhomme, N., Lin, Y. C., Sjödin, A., Van de Peer, Y., Jansson, S., Hvidsten, T. R. and Street, N. R. (2015). The Plant Genome Integrative Explorer Resource: PlantGenIE.org. New Phytol 208(4): 1149-1156.
  29. Xie, Y., Wu, G., Tang, J., Luo, R., Patterson, J., Liu, S., Huang, W., He, G., Gu, S. and Li, S. (2014). SOAPdenovo-Trans: de novo transcriptome assembly with short RNA-seq reads. Bioinformatics 30(12): 1660-1666.
  30. Zerbino, D. R. and Birney, E. (2008). Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Res 18(5): 821-829.

简介

[摘要] RNA测序(RNA-seq)开启了研究整个转录组水平上几乎任何生物的可能性。然而,不存在测序和注释的准确参照基因组的可以是用于障碍物施加这种技术对非模式生物,尤其是对于那些具有复杂的基因组中。虽然从头转录组装配可以解决此问题,但通常在计算上要求很高。此外,没有自动化系统的转录组注释和基因本体富集分析通常是一项艰巨的任务。在这里,我们逐步描述了用于进行苏格兰松树转录组组装,注释和基因本体分析的管道( 樟子松(Pinus sylvestris ),一种裸子植物,具有复杂的基因组。该管道仅使用可用于科学界且在标准个人计算机上运行的免费软件,旨在促进非模型物种的转录组研究,但仍可灵活地与任何生物一起使用。


[背景]非模型生物对于环境和进化研究非常有价值。然而,直到从头组装方法的发展,缺乏密切相关的模型物种作为转录组学研究的参考一直是一个局限。从头组装人员将非模型生物带入了组学时代,但所需的分析可能对计算要求很高,并且通常很耗时。在测试,评估和建立执行所需分析的管道的情况下尤其如此。获得用于自动数据处理的计算服务器和/或软件的成本(可能会加快该过程)可能会限制某些研究小组的工作。我们在此详细介绍了具有复杂基因组生物的苏格兰松(Pinus sylvestris )的转录组组装,注释和基因本体(GO)分析所采用的程序(Duarte等,2019)。该管道是根据描述转录组研究最佳实践的工作和基准评估开发的(Conesa等,2016; Honaas等,2016; Geniza和Jaiswal,2017)。松树转录组分析是在32G RAM 8核计算机上进行的。分别评估并组合了三种不同的汇编器(BinPacker(Liu等人,2016),SOAPdenovo-Trans(Xie等人,2014)和Trinity(Grabherr等人,2011))的输出。对于松树转录组组装,通过用EvidentialGene过滤冗余后,将不同组装器的输出组合起来,可获得最佳结果(Gilbert,2020年)。尽管如此,组装人员的选择和组合仍应根据输入数据进行调整,并且可以根据此处逐步介绍的质量评估来做出决定。InterProScan(Jones等人,2014)对预测的蛋白质进行全面的特征注释,而BLAST +(Camacho等人,2009)允许进行种内和种间作图进行功能比较。因此,我们将InterProScan的蛋白质签名与多个BLAST +搜索结合起来,以注释P的转录组。樟子松(Duarte等,2019),与Trinotate整合(Bryant等,2017)。使用我们提供的Python脚本检索了唯一的GO标识符,并将其用于BiNGO的GO富集分析(Maere等,2005)。该管道基于对科学界开放的工具。虽然对于非模型生物的研究特别有趣,因为缺乏模型的生物缺乏密切相关且注释明确的物种来指导组装和注释,但该管道可以灵活地应用于几乎任何生物。

关键字:RNA-seq, 从头组装, 非模式生物, 樟子松, 裸子植物, 转录组教程

软件
软件– Linux OS :

              该协议中将使用的几个程序,例如BLAST +,Bowtie2,FASTA Splitter和BUSCO,可以使用Bioconda通道轻松进行管理(Grüning等人,2018; https://bioconda.github.io/)。Bioconda是专门用于生物信息学的软件包的存储库,它需要安装conda软件包。该经理会自动安装给定的程序都依赖关系,并使得它在你的可用路径。通过Bioconda进行的软件支持会不断更新,我们建议始终检查此存储库中是否提供了所需的软件。安装conda后,请按照建议的顺序添加频道(请参阅https://bioconda.github.io/user/install.html#install-conda)。其他软件将需要单独安装,并且可能需要进行调整,具体取决于您的系统。接下来,我们列出了每个步骤所需的软件,这些软件在流程图中进行了汇总(图1)。





图1.流程图说明了转录组组装,注释和基因本体分析的五个步骤。在方框中表示开始每个步骤的输入,每个步骤的主要步骤用黑色表示。每个步骤所需的软件以蓝色列出。为了执行转录组组装(B),用户可以使用选择的软件。该管道中描述的三个组装器用于苏格兰松树研究(Duarte等人,2019),并标记为可选[opt]。


A.数据预处理     
FastQC(http://www.bioinformatics.babraham.ac.uk/projects/fastqc)–测序文件的质量控制
FASTQ屏幕(https://www.bioinformatics.babraham.ac.uk/projects/fastq_screen)-筛选排序文件
Trimmomatic(Bolger等人,2014;可通过Bioconda获得)–整理测序文件
Trinity软件包(Grabherr等,2011)–计算机标准化

B.转录组组装     
BinPacker(刘等人,2016 ; https://sourceforge.net/projects/transcriptomeassembly/)-汇编
EvidentialGene(吉尔伯特,2020; http://arthropods.eugenes.org/EvidentialGene/) -组件滤波
Galaxy平台(Afgan等人,2018;usegalaxy.org)–用于数据研究的开源在线平台
SOAPdenovo反式(解等人,2014; https://github.com/aquaskyline/SOAPdenovo-Trans)-汇编
Trinity软件包(Grabherr等人,2011;在Galaxy平台上可用)–汇编程序


C.质量评估     
Bowtie2(Langmead和Salzberg,2012;可通过Bioconda获得)–短读对准仪
BLAST +(Camacho等,2009;可通过Bioconda获得)–用于比较序列和搜索数据库的工具
BUSCO(Simão等人,2015;可通过Bioconda获得)–组装完整性评估
Samtools(李等人,2009; https://github.com/samtools/samtools)-数据操作
DETONATE(Li等人,2014;可通过Bioconda获得)–从头转录组装配的评估
三位一体软件包(Grabherr等,2011)–汇编中全长转录本的估计

D.转录组注释     
BLAST +(Camacho等,2009;可通过Bioconda获得)–用于比较序列和搜索数据库的工具
Bowtie2(Langmead和Salzberg,2012;可通过Bioconda获得)–短读对准仪
紧身胸衣(Davidson和Oshlack,2014年;可通过Bioconda获得)–读取计数的聚类
FASTA分路器(http://kirill-kryukov.com/study/tools/fasta-splitter/)-创建FASTA文件的子集
getorf(水稻等人; 2000年http://emboss.sourceforge.net/)-开放阅读框预测
HMMER(http://hmmer.org/)–注释域的预测
InterProScan的(琼斯等人,2014; https://www.ebi.ac.uk/interpro/)-蛋白质结构域标识符
RNAMMER(Lagesen等人,2007; http://www.cbs.dtu.dk/services/RNAmmer)-核糖体RNA注解
SIGNALP(彼得森等人,2011; https://services.healthtech.dtu.dk/)-信号肽的预测
TMHMM(克罗等人,2001; https://services.healthtech.dtu.dk/)-蛋白跨膜螺旋预测
TransDecoder(Haas et al。,2013; http://transdecoder.github.io)–识别编码区
Trinotate(科比等人,2017; https://trinotate.github.io/)-套件转录注释

E.基因本体分析     
BiNGO(Maere等人,2005;可通过Cytoscape平台下载)–基因本体分析
Cytoscape的(香农等人,2000; https://cytoscape.org/)-用于执行基因本体论分析平台
GO检索器–提供的脚本用于检索笔录ID-基因本体ID关系

程序


              该协议需要基本的生物信息学和Linux命令行技能。我们提供有关使用每种软件进行所需分析的一般说明。可以根据用户需要调整每种软件设置,为此,建议您参考软件手册以获取更多信息和故障排除。执行每个步骤所需的时间将完全取决于您的计算能力和数据集的大小。虽然组装和质量评估步骤可能需要几个小时,但全面的注释过程可能需要几天才能运行。根据使用Scots pine获得的结果,该协议末尾显示了一个应用示例作为支持信息。


预处理的RNA-SEQ数据集
              要启动此管道,必须修剪您的RNA-seq样品并进行质量检查。作为一般准则,FastQC将为您提供阅读质量的概述。如果您观察到剩余的载体序列,则可以使用FastQ屏幕将其除去,以提供载体序列列表,可以从美国国家生物技术信息中心(NCBI,https: //ftp.ncbi.nlm.nih.gov/pub获得)/ UniVec /)。Trimmomatic可以根据所选的质量值过滤样品,并准备配对的读数。Trimmomatic还可以对衔接子序列进行修饰,必须根据制备文库的方法提供该序列。FastQ Screen和Trimmomatic都有详细的手册,易于阅读。


[提示1]将生成几个文件以执行该协议的所有步骤。用易于识别的描述性名称保存它们,并相应地在文件夹中对它们进行排序。

              [提示2]一些汇编程序还接受未配对的读取作为输入。不要丢弃它们。


              用于构建通用从头对于给定物种的转录组的参考,建议合并所有RNA-SEQ文件(我。ê ,所有样品和条件),用于使用一个唯一的输入组件。某些软件可能会将其作为命令行的一部分来执行。有关特定的详细信息,请始终参阅软件的手册。


利用Linux终端,串联的所有样本的所有修整的fastq文件(左,右,和未配对)作为用于装配三个唯一输入(É 。克,“all_left.f ILE ”,“all_right.f ILE ”和“ all_unpaired.f ILE “)。数据文件的串联可以通过使用“ cat”函数来完成:

>猫sample_1-1.f ILE sample_2-1.f ILE sample_n-1.F ILE > all_left.f ILE

>猫sample_1-2.f ILE sample_2-2.f ILE sample_n-2.F ILE > all_right.f ILE


[可选]对三个串联的文件进行计算机标准化,以减少读取数据集中的冗余。Trinity软件包为此提供了一个脚本(insilico_read_normalization.pl )。如果您具有配对末端读取,则命令将为:

> ./insilico_read_normalization.pl --seqType <字符串> \

--left all_left.file --right all_right.file


其中seqType指定读取的格式(fa或fq )。可以调整其他几个参数,如RNA序列读取方向(--SS_lib_type ;dUTP方法应使用RF )。如果你打算使用三位一体和你配对的读取,它们附加到左边的人(例如,“all_left_all_unpaired.f ILE使用“猫”功能”)。请注意,反斜杠(\)仅用于出于可视化目的将命令分成几行(可以将其删除)。


转录组汇编(在Linux OS中执行的分析)
              为了执行转录组组装,您可以使用选择的软件。在此,我们提供了用于樟子松转录组的组装程序的说明(Duarte等,2019),因此将其标记为可选。BinPacker和SOAPdenovo-Trans可以在标准个人计算机上本地运行,但可能需要进行计算机标准化,以减少输入数据。由于Trinity的计算要求很高,因此可以选择使用Galaxy平台(usegalaxy.org)运行它。使用多个k聚体策略可以通过平衡敏感性和特异性来改善组装(Zerbino和Birney,2008),但它也可能引入冗余(Xie等人,2014)。稍后将解决此问题。Trinity已优化为以固定的k-mer大小(25 k-mers )运行(Grabherr等人,2011)。如果RAM内存有限,则可以使用BinPacker使用较小的k-mers进行组装,而SOAPdenovo-Trans可以有效地处理更长的长度。对于基于Bruijn的汇编程序,k-mer大小的增加必须基于奇数,以避免产生回文,这可能会阻碍图的构造(Zerbino和Birney,2008)。


                用不同的汇编器执行转录组组装。

[可选] BinPacker仅接受配对的输入。必须根据数据集调整参数,并使用各种k-mer长度(-k选项)运行,例如:

> ./binpacker -s <字符串> -p <字符串> -k <整数> \

-l all_left_ normalized.file -r all_right_ normalized.file


其中-k 是选择的k-mer长度,-s 设置fa或fq文件之间,-p 设置单端(单个,例如,需要-u single_reads.file)或成对结束(成对,需要-l left_reads.file -r right_reads.file)。最后,使用“ cat”功能将所有程序集连接到一个文件中。


[可选]运行SOAPdenovo-Trans的最简单方法是在配置文件中设置所有程序集参数和输入文件,包括所选的k-mer长度。它还允许同时使用配对(q1 = path_to_ all_left_ normalized.file和q2 = path_to_all_right_ normalized.file )和非配对(q = path_to_ all_unpaired_normalized.file)的读取。SOAPdenovo-Trans有两个版本:“ SOAPdenovo-Trans-31mer”,它执行k-mer大小从13到31的程序集;“ SOAPdenovo-Trans-127mer”,它接受最多127个k-mer(总是奇数值) 。调整配置文件后,使用不同的k-mers运行程序集:

> ./SOAPdenovo-Trans-127mer all -s config_file -o output_prefix


增加k-mer值会减少恢复的转录本数量,直到无法检索到为止。最后,使用“ cat”功能将所有程序集合并到一个最终文件中。


[可选] Trinity可以使用Galaxy平台(通过Web服务器(https://usegalaxy.org/)或使用您自己的实例(https://galaxyproject.org/admin/get-galaxy/)来运行。Trinity还可以将成对和不成对的读段结合起来进行汇编,必须将这些读段预先附加到左侧的读段上。提供了一些教程,显示了如何使用Web服务器(https://usegalaxy.org/)。上载经过质量处理的fastq文件(例如“ all_right_normalized.file”和“ all_left_all_unpaired_normalized.file”)。在工具列表中选择Trinity,然后根据您的数据设置参数。如果您的数据已经标准化,请关闭此选项并开始分析。

[提示3]我们建议测试不同的汇编程序。下一部分将描述的质量评估将帮助您决定将哪个程序集用作参考转录组。


从每个程序集中过滤出较小长度的isotig,这可能是伪测序产物。常规使用的阈值为200 bp。Galaxy平台中的“按长度过滤序列”工具可以执行此任务。如果您在本地使用其他汇编程序,则将这些程序集上载到Galaxy以对其进行过滤。可以在“基因组文件操作> FASTA / FASTQ>按长度过滤序列”下找到该工具。设置最小长度并运行。完成后,下载所有过滤的程序集文件。

[推荐]如果您选择使用多个汇编文件(不同的k-mers和/或不同的软件),则还可以将它们组合成一个单一的超集。由于不同程序集的组合会创建质量不同的副本的冗余文件,因此必须对其进行过滤。EvidentialGene tr2aacds管道执行编码潜力分析并删除完全冗余的笔录,而不是仅根据其长度过滤笔录。该软件包提供了其他几个组织在不同文件夹中的脚本。使用组装文件作为输入,按以下顺序执行任务:

使用“ trformat.pl”脚本(位于/ scripts / rnaseq /文件夹中的tr2aacds管道必需)来规范化每个程序集的脚本ID,例如:

> ./trformat.pl-输入assembly_1.fa-输出ID-assembly_1.fa


合并ID-正规化组件插入一个超集(ê 。克,采用“猫”功能”。);
运行tr2aacds(“ tr2aacds4.pl”,可在/ scripts / prot /文件夹中找到)管道,使用通过不同汇编程序获得的级联集和超集作为输入,以运行:

> ./tr2aacds4.pl -mrnaseq ID-assembly_1.fa


其中-mrnaseq设置要评估的程序集的名称。


管道将相关的成绩单输出为“ okay”集(“ okay-main”和“ okay-alts”),可以进行组合和使用。还创建了一个包含丢弃序列(“放置集”)的文件夹。

“ assemblystats”工具可以提供有关转录组装配的一般统计信息的摘要,例如isotig长度和装配的isotig的数量,称为“ assemblystat”。如果您正在运行自己的Galaxy实例,则可以在https://toolshed.g2.bx.psu.edu/view/konradpaszkiewicz/assemblystats/6544228ea290中找到工具存储库。或者,可以在PlantGenIE Galaxy服务器(http://galaxy.plantgenie.org)上访问它。

质量评估(在Linux OS中执行的分析)
              为了选择最具代表性的组件,有必要比较它们的质量。在此流程中,将评估四个标准以进行质量评估:(1)将读段映射回转录组。(2)完整性;(3)成绩单的表示;(4)评估DETONAT E分数。强烈建议执行所有四个分析,以全面了解装配件的质量。必须对要比较的每个程序集应用这四个标准。


[推荐]为了将读段映射回转录组,必须使用读段比对工具。例如,Bowtie2可以按以下方式使用:

首先,为要测试的每个转录组程序集建立一个索引:

> bowtie2-build input_transcriptome.fasta input_transcriptome.fasta

对每个程序集进行RNA-seq读数的比对,例如:
> bowtie2 --local --no-unal -x input_transcriptome.fasta \

-q -1 all_left_normalized.file \

-2 all_right_normalized.file | samtools视图-Sb--o output.bam


比对统计信息将在运行结束时报告。


[推荐]可以使用BUSCO评估转录组的完整性。在https://busco.ezlab.org/上检查最新版本。通过使用以下命令行,运行BUSCO非常简单:

> ./busco -m转录组-i input_transcriptome.fasta -o output_name -l LINEAGE


其中-m设置BUSCO的模式(基因组,蛋白质或转录组),谱系(-l)是将用作比较参考的数据集,并且必须与您的研究有机体相关。血统数据集可从BUSCO网站下载。(可选)可以设置CPU线程数(--cpu )和输出文件压缩(--tarzip)。


[推荐]全长成绩单的表示。当与模型生物一起工作时,可以通过对该生物的参考蛋白质数据集进行翻译后的BLAST +(blastx)搜索来估计表示形式。对于非模式生物,因为不存在用于比较的特定参考的,分析必须针对密切相关的物种(当可用时),或含所有已知的蛋白质(数据集的蛋白质进行ë 。克。,瑞士-Prot,网址为https://www.uniprot.org)。接下来,可以使用Trinity软件包中提供的脚本“ analyze_blastPlus_topHit_coverage.pl”来恢复最佳对齐目标的百分比。

首先,使用您的参考蛋白质组创建一个BLAST数据库:

> makeblastdb-在ref_prot_dataset.fasta -dbtype prot中


其中-in设置参考输入,-dbtype设置序列类型(prot ,对于蛋白质)

根据您的参考运行blastx:

> blastx -query input_transcriptome.fasta -db ref_prot_dataset.fasta \

-out blastx_output.outfmt6 -evalue 1e-20 -max_target_seqs 1 -outfmt 6


其中-query是您的转录组输入,-db设置参考数据库,-evalue设置阈值,-max_target_seqs通知要保留的序列数(1 ,仅针对最佳匹配),-outfmt 设置格式的输出文件(使用脚本需要6个)。(可选)可以设置CPU线程数(-num_threads )。


在输出文件上运行“ analyze_blastPlus_topHit_coverage.pl”脚本:

> ./analyze_blastPlus_topHit_coverage.pl blastx_output.outfmt6 \

input_transcriptome.fasta ref_prot_dataset.fasta


对于每个百分率水平(1脚本报告第一列)最佳匹配的转录物(2的数量第二列)和比较下面(3电平RD列)。


[推荐]评估DETONATE分数。RSEM-EVAL是DETONATE v1.11软件包的组成部分,它使用概率模型对装配质量进行无参考评估(Li等,2014)。DETONATE需要在质量评估之前调整一些设置:

首先,使用包装中提供的“ rsem-eval-estimate-transcript-length-distribution”脚本估算脚本的长度分布。如果没有参考数据集,则可以使用密切相关的物种:

> ./rsem-eval-estimate-transcript-length-distribution \

input_reference.fasta parameter_file


其中parameter_file是输出,将存储有关笔录的估计平均长度和标准差的信息;


接下来,使用“ rsem-eval-calculate-score”脚本计算分数:

> rsem-eval-calculate-score \

--transcript-length-parameters parameter_file \

--paired-end all_left_normalized.file all_right_normalized.file \

input_transcriptome.fasta \

assembly_name L


其中--transcript-length-parameters设置参数文件,--paired设置对末端序列的分析,input_transcriptome.fasta应该是将要评估的程序集,而Assembly_name是将用于输出文件的前缀,L是配对端数据的平均片段长度的整数。该程序默认使用Bowtie2。如果PATH中没有它,请使用--bowtie-path 选项设置目标。如有必要,可以设置其他几个选项。引爆分数将始终为负,但分数值越高,装配效果越好。

           
转录组注释(在Linux OS中执行的分析)
              InterProScan执行蛋白质签名的识别,这对于检索组装转录本的可能功能很有用。但是,分析很耗时。另外,Trinotate可以用于不太详细的蛋白质注释,但是提供了一种很好的方式来整合BLAST搜索并从中提取信息,因此当对密切相关的物种进行了良好注释时,它是一个不错的选择。Trinotate还使用一些工具进行蛋白质签名识别,但是输入文件受Trinity生成的输出限制,因此在使用其他汇编程序时,需要一些额外的步骤。为了使用InterProScan执行详细的蛋白质特征识别,应执行以下步骤:


首先,使用getorf预测转录本的开放阅读框(ORF),getorf是EMBOSS软件包(http://emboss.sourceforge.net/)或Transdecoder的一部分。例如,getorf可以与以下命令一起使用:

> getorf -sequence input_transcriptome.fasta -outseq ORF-predicted_transcriptome.fasta \

-表<列表>-查找0-最小


其中-minsize 的设置ORF的最小核苷酸大小(É 。克。,200), -表<列表>将遗传密码中使用(0为标准),和-find <列表>将搜索方法(0用于终止密码子之间区域的翻译)。


对于大型数据集,InterProScan分析可能在计算上要求很高。当不在计算服务器上运行时,为了避开与内存相关的问题,另一种方法是将ORF预测的转录组文件分成几部分。接下来,对每个子集运行分析(例如,使用脚本开始顺序分析)。首先,使用FASTA拆分器工具划分ORF预测的转录组文件:

> ./fasta-splitter.pl-测量数量-部分大小 ORF-predicted_transcriptome.fasta


其中--measure(all ,seq或count )指定是否使用所有数据,序列长度或序列数来划分文件,--part-size 设置要在其中存在的fasta序列数每个子集。


接下来,对每个子集运行InterProScan分析。对于InterProScan的最佳效果,激活合作伙伴资源的数据库(ē 。摹,豹,SIGNALP,SMART和TMHMM),并使用具有预先计算的结果的查找服务(需要互联网连接和软件的最新版本) 。如果您打算执行基因本体富集分析(将在下一节中详细介绍),则需要--goterms和--iprlookup选项。有关InterProScan配置的详细信息,可以在InterProScan文档页面(https://github.com/ebi-pf-team/interproscan)中找到。

[提示4]如果您使用的是conda工具,请检查您的Python和Java JDK / JRE版本是否符合InterProScan的要求。如果不是,则可以创建新的conda环境以与特定的软件版本一起运行(请在Conda网页上查看如何创建新环境;https://docs.conda.io/projects/conda/en/latest/index.html) 。


如果您将ORF预测的转录组划分为几个子集,请使用“ cat”函数将所有.tsv输出文件连接起来。
[可选]为了在包含非Trinity程序集时将Trinotate集成到分析中,有必要提供转录组的基因-转录本关系,可以使用Corset(https://github.com/Oshlack/Corset/ Wiki):
首先,按照步骤C1a中的说明为转录组汇编文件创建Bowtie2索引;
接下来,将每个用于组装的RNA-seq样本映射回带有Bowtie2的转录组。如果对程序集同时使用了成对和非成对读取,则命令将为:

> bowtie2 --all -X -x input_transcriptome.fasta \

-q -1 sample_n-1.fastq -2 sample_n-2.fastq \

-U sample_n-1.unpaired.fastq,sample_n-2.unpaired.fastq | \

samtools视图-Sb> sample_n.bam


其中-X 是数据的估计片段大小。如果仅将成对的读取用于程序集,则删除-U部分;


.bam文件用作紧身胸衣的输入。首先,为每个输入样本独立创建摘要:

> ./corset -m -r true-stop sample_n.bam


其中-m设置以过滤出少于个读的记录,而-r true-stop指定仅创建摘要(输出.corset-reads)。由于我们只想对为Trinotate创建所需的输入文件进行粗略估算,因此我们可以设置-m 0 (考虑所有记录)。以后,如果执行基因表达分析,则您选择使用的工具可能会精确考虑映射到转录本的读段数。


为所有样本创建汇总后,请继续进行Corset运行,定义要进行分析的分组:

> ./corset -i corset -g <列表> .corset-reads


其中-i corset设置先前创建的摘要作为输入文件(默认为.bam)。如果有样本组,如例如控制和处理的两次重复,它们可与由逗号(分离的-g选项指定ë 。克。,-g控制,控制,治疗,治疗)。输入文件必须遵循相同的顺序。如果未设置-g,则将独立考虑每个样本。给定示例的命令如下:


> ./corset -i corset -g控制,控制,处理,治疗\

control_1.bam.corset-reads control_2.bam.corset-reads \

treatment_1.bam.corset-reads treatment_2.bam.corset-reads


分析结束后,通过切换两列的位置以符合Trinotate的要求,编辑具有基因-转录本关系的Corset聚类文件。

[可选]可在http://trinotate.github.io获得详细的Trinotate分步指南。Trinotate依靠Swiss-Prot数据库执行其注释。该工具的一个有趣功能是它将BLAST搜索集成到分析中。当处理非模型物种或不受控制的环境中的样本时,针对高等分类单元的BLAST搜索可用于识别伪造的转录本,您可以考虑将其过滤掉。
           
基因本体分析
              几种方法和教程目前可用于运行的差异基因表达分析(É 。克。,轧边机[麦卡锡等人,2012 ] ,Deseq2 [洛夫等人,2014 ] ,等等),这是简单的。因此,我们将专注于GO富集分析的描述。但是,对于此步骤,都需要注释的基因集和具有差异表达基因的列表。

BiNGO是在Cytoscape平台(https://cytoscape.org/)中运行的工具。它可用于Linux,Mac OS和Microsoft Windows。BiNGO允许使用自定义注释作为背景执行GO富集分析。BiNGO的自定义文件必须解析为“ transcript_ID =数字”,其中,转录本ID与唯一的GO ID编号相关联(每行一个条目)。因此,在执行分析之前,有必要从注释文件中检索GO标识符。我们提供了一个Python脚本,用于从InterProScan检索GO信息(“ GO_retriever.py”,可作为补充文件S1获得)。有关1各转录ID ST列,脚本将返回各自的GO数,如果有的话,从N个列。当一个转录本与多个GO术语相关联时,该脚本会将转录本ID分成两行或更多行,每一行都链接到一个唯一的GO编号。输出是一个.txt文件。该脚本也可以用于从Trinotate输出中提取GO编号,但是有必要事先从报告文件中删除GO描述。


为了运行脚本,有必要根据您的输入数据对其进行编辑:
1号线:小号ubstitute INPUT_FILE.tsv通过输入文件名。
[可选]第2行:ç焊割输出文件(OUTPUT_FILE_GOs.txt)的名称。
第12行:B在注释文件上,设置报告GO ID号的列(N)。
第17行:乙Ý默认情况下,脚本被设置为识别由InterProScan的(竖线)中使用的GO分裂字符。如果打算将脚本用于Trinotate的输出,则首先需要删除为每个术语提供的描述(仅保留GO:number),相应地调整脚本中的分割字符,并更改信息所在的列报告(上一步)。
将注释文件和脚本放在同一目录中,运行脚本:

> python GO_retriever.py input_file


BiNGO can be installed directly through Cytoscape (on the menu bar, select "Apps" > "App Manager...", select "BiNGO" from the list, and install). After installation, BiNGO will be listed in the "Apps" menu. With BiNGO running:
From the table of differentially expressed genes, copy the IDs of those to be checked for overrepresentation (e.g., up- or down-regulated genes).
Select the option "Paste Genes from Text" and paste the IDs into the box.
"Overrepresentation" and "Visualization" are selected by default.
Select the statistical test, the multiple testing correction, and the significance level (e.g., Benjamini & Hochberg false discovery rate, P < 0.05).
Select the categories you want to visualize. By default, only those that are overrepresented after the correction will be shown.
For the reference set, select the whole annotation (default).
Select the ontology file (.obo file, e.g., "go-basic.obo"), which can be downloaded from the GO Consortium website (https://geneontology.github.io/). BiNGO provides some ontology file options, but they may be outdated.
Under "Select namespace", choose the GO category to be evaluated, i.e., biological process, cellular component, or molecular function. An independent run will be required for each of them.
Finally, under "Select organism/annotation", select the custom annotation file created in the previous steps as background.
If the box for saving the data is checked, the target folder can already be set. Otherwise, the results can also be saved later.
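For reference, the Benjamini & Hochberg false discovery rate correction offered by BiNGO follows the standard procedure; a generic sketch (not BiNGO's code) is:

```python
# Benjamini-Hochberg FDR adjustment: rank the raw p-values, scale each
# by m/rank, then enforce monotonicity from the largest rank downward.

def benjamini_hochberg(pvals):
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank_idx in range(m - 1, -1, -1):  # walk from largest p downward
        i = order[rank_idx]
        value = pvals[i] * m / (rank_idx + 1)
        running_min = min(running_min, value)
        adjusted[i] = running_min
    return adjusted
```

A GO term is then reported as enriched when its adjusted p-value falls below the chosen significance level (e.g., 0.05).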

              BiNGO outputs a graphical network visualization in which the significantly enriched GO terms are indicated by color. The nodes can be adjusted for better visualization. The image can be exported by selecting this option under the "File" menu. At the bottom of the output, the significance level of each enriched category can be recovered, allowing a table to be drawn for easier visualization.


Example of protocol application – Scots pine transcriptome analysis


              The dataset used as example for describing this pipeline is detailed in Duarte et al. (2019). Briefly, RNA samples were isolated from Scots pine needle banks from three plots contaminated with radionuclides after the Chernobyl nuclear power plant accident (Kulazhin, Masany, and Zaborye) and from one reference plot. The samples were sequenced as 125-bp paired-end reads. FastQC v0.11.5 was used for quality checking the sequence data, and FastQ Screen v0.9.2 for removing vector sequences from the RNA-seq reads. The reads were then trimmed with Trimmomatic v0.36 using the parameters ILLUMINACLIP:Adapters.fa:2:30:10, HEADCROP:10, MINLEN:35, LEADING:5, TRAILING:5, and SLIDINGWINDOW:4:30. Preliminary tests trimming at three different Phred qualities (5, 20, and 30) showed that the highest Phred cut-off was the most reliable approach for performing the transcriptome assembly (Duarte et al., 2019). All Linux-related analyses were performed on Ubuntu version 16.04 or 18.04, while the GO enrichment evaluation was run on Windows 10. The processed reads are available at the Sequence Read Archive (BioProject PRJNA497687).


Transcriptome assembly
              For the pine transcriptome assembly (Duarte et al., 2019), we chose to normalize the concatenated read files to reduce redundancy, using the insilico_read_normalization.pl script available in the Trinity v2.2.0 package. Next, three different assemblers were used, namely BinPacker v1.1, SOAPdenovo-Trans v1.04, and Trinity v2.2.0. Five different assemblies were generated, and isotigs with a minimum length of 200 bp were filtered. BinPacker was run with default parameters, but with the gap length (-g) set to 500 according to the insert size of the dataset (Duarte et al., 2019). The combined set comprising five different k-mers (23, 25, 27, 29, and 31) was then filtered for redundancy using the EvidentialGene pipeline (release of 19 March 2015). For the SOAPdenovo-Trans assemblies, ten different k-mer lengths were used and then combined (25, 31, 37, 43, 49, 55, 65, 75, 85, and 95). The program was run with the following parameters: max_rd_len = 115, rd_len_cutoff = 115, avg_ins = 500, reverse_seq = 0, asm_flags = 3, map_len = 32, -f, -F. No data were retrieved with k-mers > 95. On the 32-Gb RAM, 8-core machine, the BinPacker and the less demanding SOAPdenovo-Trans assemblies took a few hours to run. The concatenated set was filtered with EvidentialGene. Trinity was run on the Galaxy platform (usegalaxy.org) using the paired-end mode, with the unpaired reads appended to the left reads file and default parameters, except for the normalization, which was disabled. Although Trinity was run only at its optimized k-mer length, its output was also filtered with EvidentialGene for comparison. Finally, the general statistics of each assembly were retrieved with the "assemblystat" tool, as shown in Table 1.
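To help interpret the statistics in Table 1: N50 is the isotig length L such that isotigs of length ≥ L together cover at least half of the assembled bases. A generic sketch of this definition (not the assemblystat code):

```python
# N50 of an assembly: sort isotig lengths from longest to shortest and
# accumulate until at least half of the total assembled bases are
# covered; the length reached at that point is the N50.

def n50(lengths):
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
```

A larger N50 generally indicates that more of the assembly is held in long isotigs, although it says nothing about redundancy or correctness on its own.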


Table 1. General statistics of the P. sylvestris assemblies, comparing the outputs of BinPacker, SOAPdenovo-Trans, Trinity, and the superset combining all assemblies. Adapted from Duarte et al. (2019), Table S1.



Assembly                | BinPacker | SOAPdenovo-Trans  | Trinity           | Trinity           | Superset
EvidentialGene pipeline | Yes       | Yes               | No                | Yes               | Yes
Reads                   | Paired    | Paired + unpaired | Paired + unpaired | Paired + unpaired | Paired + unpaired

Isotig length
Minimum length          | 200       | 200       | 201       | 201       | 200
Longest length          | 16,876    | 9,566     | 14,596    | 12,302    | 16,876
Mean length             | 1,487.12  | 885.15    | 624.99    | 1,009.45  | 1,072.61
Standard deviation      | 946.61    | 752.01    | 684.85    | 894.45    | 888.64
Median length           | 1,305     | 570       | 334       | 606       | 742
N50 isotig length       | 1,963     | 1,378     | 1,050     | 1,714     | 1,677

Number of isotigs
All isotigs             | 62,188    | 51,761    | 211,646   | 48,409    | 110,767
Isotigs ≥ 1 kb          | 38,593    | 15,658    | 35,612    | 17,071    | 44,162
Isotigs in N50          | 16,962    | 10,756    | 33,842    | 9,681     | 24,080

Bases in isotigs
Bases in all isotigs    | 92,480,965 | 45,816,195 | 132,276,358 | 48,866,676 | 118,810,053
Bases in isotigs ≥ 1 kb | 77,776,253 | 28,686,010 | 67,951,389  | 34,341,436 | 85,844,429
GC content of isotigs   | 42.82%     | 43.37%     | 41.10%      | 42.85%     | 43.08%


              In general, SOAPdenovo-Trans performed less well at assembling longer isotigs, as shown by the maximum isotig length. Considering the mean length, however, this assembler scored higher than the unfiltered Trinity. Trinity clearly provided a highly redundant output, as shown by the number of isotigs before and after filtering. The Trinity assembly must also contain a large proportion of short isotigs, which can explain the shortest median and N50 lengths retrieved among the assemblers. Filtering Trinity's output, however, markedly reduced the number of isotigs in the N50. Considering these overall statistics, BinPacker, the filtered superset, and the unfiltered Trinity had the most promising initial status.


Quality assessment
              The four criteria for assessing transcriptome assembly quality described in this protocol were also applied in the Scots pine transcriptome study (Duarte et al., 2019). They are summarized in Table 2. The reads were mapped back to each of the five assemblies using Bowtie2 v2.2.9. Completeness was evaluated with BUSCO v2 using the plant set (embryophyta_odb9). Translated nucleotide BLAST (blastx v2.7.1) searches were performed against the Pinus taeda v1.01 proteins retrieved from the TreeGenes database (https://treegenesdb.org/) and against the TrEMBL Viridiplantae protein sequences (release 2017_03). Finally, DETONATE v1.11 was used for quality scoring, with the P. taeda nucleotide sequences used to estimate the transcript length distribution. Performing all four analyses took a few hours; the results for each assembly are provided in Table 2. The filtered SOAPdenovo-Trans set scored the lowest in all four evaluation criteria, while filtering Trinity's output did not improve its quality scores. On the other hand, the filtered BinPacker set comprising the five different k-mer lengths, the filtered superset including all assemblies, and the unfiltered Trinity had similar alignment rates and completeness. Although the unfiltered Trinity assembly had a slightly better DETONATE score, the superset showed a better full-length transcript representation and better BUSCO scores (highest completeness and lowest number of missing BUSCOs). The combined superset was selected for the evaluation of differentially expressed genes and the GO enrichment analysis (Duarte et al., 2019).

Table 2. Quality assessment of the P. sylvestris assemblies, comparing the outputs of BinPacker, SOAPdenovo-Trans, Trinity, and the superset combining all assemblies. Adapted from Duarte et al. (2019), Table S1.



Assembly                | BinPacker | SOAPdenovo-Trans | Trinity | Trinity | Superset
EvidentialGene pipeline | Yes       | Yes              | No      | Yes     | Yes

Alignment
Overall alignment rate   | 84.27%        | 81.73%        | 85.78%        | 81.02%        | 84.48%

BUSCO (n = 1,440)
Complete BUSCOs          | 1,132 (78.6%) | 1,124 (78.0%) | 1,155 (80.3%) | 1,114 (77.4%) | 1,171 (81.3%)
Complete and single-copy | 851 (59.1%)   | 869 (60.3%)   | 669 (46.5%)   | 975 (67.7%)   | 815 (56.6%)
Complete and duplicated  | 281 (19.5%)   | 255 (17.7%)   | 486 (33.8%)   | 139 (9.7%)    | 356 (24.7%)
Fragmented               | 62 (4.3%)     | 72 (5.0%)     | 58 (4.0%)     | 54 (3.8%)     | 58 (4.0%)
Missing BUSCOs           | 246 (17.1%)   | 244 (17.0%)   | 227 (15.7%)   | 272 (18.8%)   | 211 (14.7%)

Blastx vs. Pinus taeda proteins, by protein hit length (%): hit count in bin / cumulative count (this bin and above)
100 | 3735/3735 | 3282/3282 | 3455/3455 | 3324/3324 | 3959/3959
90  | 671/4406  | 673/3955  | 672/4127  | 648/3972  | 679/4638
80  | 485/4891  | 516/4471  | 498/4625  | 485/4457  | 485/5123
70  | 427/5318  | 473/4944  | 509/5134  | 469/4926  | 444/5567
60  | 414/5732  | 474/5418  | 514/5648  | 493/5419  | 422/5989
50  | 316/6048  | 426/5844  | 469/6117  | 406/5825  | 359/6348
40  | 220/6268  | 332/6176  | 406/6523  | 345/6170  | 241/6589
30  | 180/6448  | 284/6460  | 401/6924  | 309/6479  | 238/6827
20  | 71/6519   | 156/6616  | 252/7176  | 188/6667  | 118/6945
10  | 7/6526    | 16/6632   | 32/7208   | 16/6683   | 9/6954

Blastx vs. TrEMBL Viridiplantae proteins, by protein hit length (%): hit count in bin / cumulative count (this bin and above)
100 | 3577/3577 | 3277/3277 | 3459/3459 | 3238/3238 | 3856/3856
90  | 1207/4784 | 1186/4463 | 1228/4687 | 1157/4395 | 1271/5127
80  | 718/5502  | 743/5206  | 760/5447  | 723/5118  | 761/5888
70  | 499/6001  | 512/5718  | 555/6002  | 505/5623  | 540/6428
60  | 421/6422  | 507/6225  | 499/6501  | 446/6069  | 483/6911
50  | 374/6796  | 442/6667  | 484/6985  | 397/6466  | 458/7369
40  | 406/7202  | 453/7120  | 588/7573  | 463/6929  | 489/7858
30  | 326/7528  | 426/7546  | 601/8174  | 415/7344  | 427/8285
20  | 160/7688  | 285/7831  | 613/8787  | 323/7667  | 322/8607
10  | 22/7710   | 50/7881   | 151/8938  | 61/7728   | 57/8664

DETONATE
Score | -1.50E+09 | -1.63E+09 | -1.35E+09 | -1.54E+09 | -1.50E+09


Transcriptome annotation and Gene Ontology enrichment analysis
For the pine transcriptome study (Duarte et al., 2019), the InterProScan v5.26-65.0 results were combined with the BLAST search summaries reported by Trinotate v3.1.1. For running InterProScan, the ORFs were predicted with getorf (EMBOSS v6.6.0). The file with the ORF predictions was then divided into subsets of 10,000 transcripts using FASTA splitter v0.2.6. InterProScan was run with the following methods: CDD, Gene3D, Hamap, PANTHER, Pfam, PIRSF, PRINTS, ProDom, ProSitePatterns, ProSiteProfiles, SFLD, SMART, SUPERFAMILY, and TIGRFAM. The full InterProScan annotation took a few days to complete. To integrate Trinotate into the analysis, the gene-to-transcript relationships were created with Corset v1.06. The prediction of coding regions for Trinotate was performed with TransDecoder v5.3.0 (http://transdecoder.github.io). Blastp and blastx searches were run against the Swiss-Prot database (release 2017_03), with only the best hit reported (-max_target_seqs 1). Blastp and blastx analyses were also run against the gymnosperm-specific protein sequences retrieved from the TreeGenes database (https://treegenesdb.org/) and from ConGenIE (Sundell et al., 2015) (http://congenie.org/), namely Picea abies v1.0, Picea glauca PG29V4, Pinus lambertiana v1.0, Pinus taeda v1.01, and Pseudotsuga menziesii v1.0. In addition, blastp was run against the TrEMBL Viridiplantae protein sequences (release 2017_03); only the gymnosperm-related sequences were retained. Protein domains were identified with HMMER v2.3 (http://hmmer.org/). The predictions of signal peptides, rRNA transcripts, and transmembrane regions were performed with SignalP v4.1, RNAMMER v1.2, and TmHMM v2.0, respectively. As a final result, the combination of the Trinotate and InterProScan analyses allowed 75,652 of the 98,870 pine transcripts (76.5%) to be successfully annotated (Duarte et al., 2019). Using this final dataset, the transcript ID ‒ GO ID relationships were retrieved with the GO_retriever.py script provided in this protocol to generate the background file, which was used for the GO enrichment analysis in BiNGO v3.03. The enrichment analyses are provided as Supplementary Figures S2 and S3 and Table S3 of Duarte et al. (2019).
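The way the two annotation sources are combined into an overall coverage figure can be illustrated with a small sketch (a hypothetical helper, assuming each tool yields a set of annotated transcript IDs):

```python
# Sketch: a transcript counts as annotated if either annotation source
# (e.g., Trinotate or InterProScan) reported a hit for it; coverage is
# the annotated fraction of all assembled transcripts.

def annotation_coverage(all_ids, source_a_ids, source_b_ids):
    annotated = (set(source_a_ids) | set(source_b_ids)) & set(all_ids)
    return len(annotated), len(annotated) / len(all_ids)
```

Applied to the pine dataset, this union over the Trinotate and InterProScan results is what yields the reported 75,652 / 98,870 (76.5%) annotated transcripts.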


Supplementary materials

Supplementary File S1. GO_retriever: Python script for extracting GO IDs from annotation files.


Acknowledgments


              The authors are thankful to M. J. Neave for the Python scripts, and to Duarte et al. (2019) for the Scots pine data. This work was supported by the Russian Science Foundation (grant No. 14-14-00666) and the Russian Foundation for Basic Research (grant No. 18-34-20012).


Competing interests


              The authors declare no conflicts of interest.


References


Afgan, E., Baker, D., Batut, B., van den Beek, M., Bouvier, D., Cech, M., Chilton, J., Clements, D., Coraor, N., Grüning, B. A. et al. (2018). The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2018 update. Nucleic Acids Res 46(W1): W537-W544.
Bolger, A. M., Lohse, M. and Usadel, B. (2014). Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30(15): 2114-2120.
Bryant, D. M., Johnson, K., DiTommaso, T., Tickle, T., Couger, M. B., Payzin-Dogru, D., Lee, T. J., Leigh, N. D., Kuo, T. H., Davis, F. G. et al. (2017). A tissue-mapped axolotl de novo transcriptome enables identification of limb regeneration factors. Cell Rep 18(3): 762-776.
Camacho, C., Coulouris, G., Avagyan, V., Ma, N., Papadopoulos, J., Bealer, K. and Madden, T. L. (2009). BLAST+: architecture and applications. BMC Bioinformatics 10: 421.
Conesa, A., Madrigal, P., Tarazona, S., Gomez-Cabrero, D., Cervera, A., McPherson, A., Szcześniak, M. W., Gaffney, D. J., Elo, L. L., Zhang, X. and Mortazavi, A. (2016). A survey of best practices for RNA-seq data analysis. Genome Biol 17: 13.
Davidson, N. M. and Oshlack, A. (2014). Corset: enabling differential gene expression analysis for de novo assembled transcriptomes. Genome Biol 15(7): 410.
Duarte, G. T., Volkova, P. Yu. and Geras'kin, S. A. (2019). The response profile to chronic radiation exposure based on the transcriptome analysis of Scots pine from Chernobyl affected zone. Environ Pollut 250: 618-626.
Geniza, M. and Jaiswal, P. (2017). Tools for building de novo transcriptome assembly. Curr Plant Biol 11-12: 41-45.
Gilbert, D. (2020). Longest protein, longest transcript or most expression, for accurate gene reconstruction of transcriptomes? bioRxiv. doi: https://doi.org/10.1101/829184
Grabherr, M. G., Haas, B. J., Yassour, M., Levin, J. Z., Thompson, D. A. et al. (2011). Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol 29(7): 644-652.
Grüning, B., Dale, R., Sjödin, A., Chapman, B. A., Rowe, J. et al. (2018). Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods 15(7): 475-476.
Haas, B. J., Papanicolaou, A., Yassour, M., Grabherr, M., Blood, P. D., Bowden, J. et al. (2013). De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nat Protoc 8(8): 1494-1512.
Honaas, L. A., Wafula, E. K., Wickett, N. J., Der, J. P., Zhang, Y., Edger, P. P., Altman, N. S., Pires, J. C., Leebens-Mack, J. H. and dePamphilis, C. W. (2016). Selecting superior de novo transcriptome assemblies: lessons learned by leveraging the best plant genome. PLoS One 11(1): e0146062.
Jones, P., Binns, D., Chang, H. Y., Fraser, M., Li, W., McAnulla, C., McWilliam, H., Maslen, J., Mitchell, A., Nuka, G. et al. (2014). InterProScan 5: genome-scale protein function classification. Bioinformatics 30(9): 1236-1240.
Krogh, A., Larsson, B., von Heijne, G. and Sonnhammer, E. L. (2001). Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol 305(3): 567-580.
Lagesen, K., Hallin, P., Rødland, E. A., Staerfeldt, H. H., Rognes, T. and Ussery, D. W. (2007). RNAmmer: consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Res 35(9): 3100-3108.
Langmead, B. and Salzberg, S. (2012). Fast gapped-read alignment with Bowtie 2. Nat Methods 9(4): 357-359.
Li, B., Fillmore, N., Bai, Y., Collins, M., Thomson, J. A., Stewart, R. and Dewey, C. N. (2014). Evaluation of de novo transcriptome assemblies from RNA-Seq data. Genome Biol 15(12): 553.
Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G. and Durbin, R., 1000 Genome Project Data Processing Subgroup. (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics 25(16): 2078-2079.
Liu, J., Li, G., Chang, Z., Yu, T., Liu, B., McMullen, R., Chen, P. and Huang, X. (2016). BinPacker: packing-based de novo transcriptome assembly from RNA-seq data. PLoS Comput Biol 12(2): e1004772.
Love, M. I., Huber, W. and Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol 15: 550.
Maere, S., Heymans, K. and Kuiper, M. (2005). BiNGO: a Cytoscape plugin to assess overrepresentation of Gene Ontology categories in biological networks. Bioinformatics 21(16): 3448-3449.
McCarthy, D. J., Chen, Y. and Smyth, G. K. (2012). Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucleic Acids Res 40(10): 4288-4297.
Petersen, T. N., Brunak, S., von Heijne, G. and Nielsen, H. (2011). SignalP 4.0: discriminating signal peptides from transmembrane regions. Nat Methods 8(10): 785-786.
Rice, P., Longden, I. and Bleasby, A. (2000). EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet 16(6): 276-277.
Shannon, P., Markiel, A., Ozier, O., Baliga, N. S., Wang, J. T., Ramage, D., Amin, N., Schwikowski, B. and Ideker, T. (2003). Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 13(11): 2498-2504.
Simão, F. A., Waterhouse, R. M., Ioannidis, P., Kriventseva, E. V. and Zdobnov, E. M. (2015). BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31(19): 3210-3212.
Sundell, D., Mannapperuma, C., Netotea, S., Delhomme, N., Lin, Y. C., Sjödin, A., Van de Peer, Y., Jansson, S., Hvidsten, T. R. and Street, N. R. (2015). The Plant Genome Integrative Explorer Resource: PlantGenIE.org. New Phytol 208(4): 1149-1156.
Xie, Y., Wu, G., Tang, J., Luo, R., Patterson, J., Liu, S., Huang, W., He, G., Gu, S., Li, S. et al. (2014). SOAPdenovo-Trans: de novo transcriptome assembly with short RNA-seq reads. Bioinformatics 30(12): 1660-1666.
Zerbino, D. R. and Birney, E. (2008). Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res 18(5): 821-829.
Copyright: © 2021 The Authors; exclusive licensee Bio-protocol LLC.
Cite as: Duarte, G. T., Yu., P. V. and Geras'kin, S. A. (2021). A Pipeline for Non-model Organisms for de novo Transcriptome Assembly, Annotation, and Gene Ontology Analysis Using Open Tools: Case Study with Scots Pine. Bio-protocol 11(3): e3912. DOI: 10.21769/BioProtoc.3912.