Dec 2020

Protocol for RNA-seq Expression Analysis in Yeast


Abstract

Genome-wide sequencing of RNA (RNA-seq) has become an inexpensive tool to gain key insights into cellular and disease mechanisms. Sample preparation and sequencing are streamlined and allow the acquisition of hundreds of gene expression profiles in a few days; however, data processing, curation, and analysis involve numerous steps that can be overwhelming to non-experts. Here, the sample preparation, sequencing, and data processing workflow for RNA-seq expression analysis in yeast is described. While this protocol covers only a small portion of the RNA-seq landscape, it describes the principal workflow common to such experiments, allowing the reader to adapt the protocol where necessary.


Graphic abstract:


Basic workflow of RNA-seq expression analysis.


Keywords: mRNA, Sequence analysis, Yeast, Relative expression levels, Next-generation sequencing, Systems biology, Whole genome

Background

With the emergence of next-generation sequencing, sequencing of RNA (RNA-seq) has become a powerful tool to measure the presence and quantity of RNA in a given cell population or even within a single cell. Since its initial uses (Bainbridge et al., 2006; Cheung et al., 2006; Emrich et al., 2007; Weber et al., 2007; Nagalakshmi et al., 2008), RNA-seq has seen a large variety of applications: from gene expression analysis by quantitating the relative amounts of RNA sequence reads to the discovery of novel transcripts or splice variants, ribosome profiling, or the detection of single nucleotide polymorphisms. Even though many specific uses of RNA-seq exist, the basic workflow remains the same: RNA molecules are extracted, amplified, sequenced, and aligned to the genome of the host organism, and the data are subsequently analyzed. Sample requirements are relatively low; typically, 1 μg down to 10 ng input RNA is sufficient for downstream amplification and library generation. For single-cell RNA-seq, as little as 10 pg is required since the low amount of input material is amplified prior to library generation (Haque et al., 2017). RNA-seq can be applied to any population of extracted RNA, independent of the source organism. Due to the wide applicability of RNA-seq, sample preparation kits that vary in complexity are commercially available (from RNA extraction to whole RNA-seq library generation, including the computational analysis).


An integral part of RNA-seq is sequencing of the extracted RNA population. Most commonly, sequencing is performed by the detection of fluorescently labeled nucleic acids bound to the surface of flowcells, e.g., using platforms such as Illumina and Pacific Biosciences (PacBio). To this end, the RNA fragments are converted into a cDNA library and amplified, and flowcell adapters are introduced. During each sequencing cycle, DNA polymerases attach fluorescently labeled nucleotides to the flowcell-bound library molecules, which are then detected by the sequencer, typically generating read lengths of 150–300 bp to several kbp (for Illumina and PacBio, respectively). More recently, sequencing by passage of nucleic acids through protein nanopores embedded in membranes has emerged (e.g., by Oxford Nanopore Technologies) (Logsdon et al., 2020), allowing for the sequencing of much longer fragments (up to Mbp). At the time of writing, the most commonly available sequencers (e.g., Illumina or PacBio) cost around $100 k for the instrument alone, whereas table-top sequencers using the nanopore technology are considerably cheaper (~$10 k), promising wider applicability in the near future. Due to the considerable cost of the most commonly available sequencing systems, resources are often shared among labs or institutes and managed by trained professionals who ensure the acquisition and integrity of high-quality sequencing data.


While it extends beyond the scope of this manuscript to describe all the applications of RNA-seq, this protocol aims to provide a workflow for RNA-seq expression analysis that can be used as a reference backbone, which the reader can adapt to their specific needs (e.g., RNA extraction from a different source or the addition of splice-aware alignment steps for genomes of higher eukaryotes). RNA-seq expression analysis is a powerful and commonly used tool to identify genes that are up- or downregulated in a stressed sample (e.g., in the presence of genomic mutations, UV light, drugs, chemical or nutrient stress) as compared with a relaxed sample (e.g., wild-type cell population). A gene is “upregulated” or “downregulated,” respectively, when more or less of its RNA is measured (i.e., expressed in the cell) under the stressed conditions as compared with the wild type.


Here, the workflow for RNA-seq expression analysis in S. cerevisiae is described, from cell growth to RNA extraction, library generation, data processing, and analysis. This protocol focuses on using commonly available lab resources wherever possible and utilizes open source and free-of-cost software packages provided by the bioinformatics community. This workflow has proven to be robust and useful for the analysis of gene expression profiles in libraries of histone point mutants in yeast (Braberg et al., 2020).

Materials and Reagents

  1. 1.5 ml Eppendorf tubes (e.g., Eppendorf, catalog number: 0030120086)

  2. Petri dishes, plastic, 10-cm diameter (e.g., Falcon, catalog number: 353003)

  3. Sterile pipette tips

  4. Toothpicks (autoclaved)

  5. Dry ice

  6. Agar (e.g., Becton Dickinson, catalog number: 214030)

  7. Bacto Peptone (e.g., Becton Dickinson, catalog number: 211677)

  8. CHCl3 (Acid phenol, e.g., ThermoFisher, catalog number: AM9720)

  9. DEPC-ddH2O (Diethyl pyrocarbonate-treated water, e.g., Invitrogen, catalog number: 750024)

  10. EDTA (Ethylenediaminetetraacetic acid disodium salt dihydrate, e.g., Sigma-Aldrich, catalog number: E6635)

  11. EtOH (Ethanol, e.g., Sigma-Aldrich, catalog number: 459836)

  12. Formamide (e.g., Sigma-Aldrich, catalog number: 11814320001)

  13. Glucose (e.g., Molekula, catalog number: 13002238)

  14. HCl (Hydrochloric acid, e.g., Sigma-Aldrich, catalog number: 320331)

  15. NaAc (Anhydrous sodium acetate, e.g., Sigma-Aldrich, catalog number: S2889)

  16. NaOH (Sodium hydroxide pellets, e.g., Sigma-Aldrich, catalog number: 1064980500)

  17. SDS (Dodecyl sulfate sodium salt, e.g., Merck, catalog number: 13760)

  18. Tris (2-Amino-2-(hydroxymethyl)-1,3-propanediol, e.g., Sigma-Aldrich, catalog number: T1503)

  19. Yeast extract (e.g., Serva, catalog number: 24540)

  20. Yeast Extract Peptone Dextrose (YEPD) media (see Recipes)

  21. 1 M Tris-HCl solution, pH 7.5 (see Recipes)

  22. 0.5 M EDTA solution, pH 8.0 (see Recipes)

  23. 20% SDS solution (see Recipes)

  24. Tris-EDTA-SDS (TES) solution (see Recipes)

  25. 3 M NaAc solution, pH 5.2 (see Recipes)

Equipment

  1. Autoclave

  2. Centrifuge and table-top centrifuge

  3. Vortex

  4. Flasks, autoclavable

  5. Incubator

  6. pH meter

  7. Pipettes (1-ml, 200-μl, 20-μl, 2-μl)

  8. Stir bar and stir plate, magnetic

  9. Thermocycler

Software

  1. bbmap (BBMap; Bushnell, B.; Version 38.90; sourceforge.net/projects/bbmap)

  2. BiocManager (https://cran.r-project.org/web/packages/BiocManager/vignettes/BiocManager.html; Version 3.12; https://bioconductor.org/install/)

  3. bioconda (Grüning et al., 2018; http://bioconda.github.io/user/install.html)

  4. bowtie2 (Langmead and Salzberg, 2012; Version 2.4.2; http://bowtie-bio.sourceforge.net/bowtie2/index.shtml)

  5. bwa (Burrows-Wheeler Aligner, Li and Durbin, 2009; Version 0.7.17; http://bio-bwa.sourceforge.net/)

  6. DESeq2 (Love et al., 2014; Version 1.30.1, http://bioconductor.org/packages/release/bioc/html/DESeq2.html)

  7. fastqc (Andrews, 2010; FastQC: a quality control tool for high throughput sequence data; Version 0.11.9; https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)

  8. htseq-count (Anders et al., 2014; Version 0.11.1; https://htseq.readthedocs.io/en/release_0.11.1/install.html)

  9. Integrative Genomics Viewer (Robinson et al., 2011; Version 2.9.2; https://software.broadinstitute.org/software/igv/download)

  10. R (R Core Team, 2017; Version 4.0.4; https://cran.r-project.org/bin/windows/base)

  11. samtools (Li et al., 2009; Version 1.11; http://www.htslib.org/download/)

  12. tophat (Langmead et al., 2009; Version 2.1.1; https://ccb.jhu.edu/software/tophat/index.shtml)

  13. trimmomatic (Bolger et al., 2014; Version 0.40; http://www.usadellab.org/cms/?page=trimmomatic)

Procedure

  1. Sample preparation and RNA extraction

    It is of utmost importance when handling RNA that all materials and reagents are RNase-free. Furthermore, it must be noted that the computation of relative expression values described in Procedure D requires at least three biological replicates. Accordingly, for example, if the expression levels of a mutant strain are to be compared with a wild-type strain, six RNA samples need to be prepared (three for each strain), which can then be used to create six RNA-seq libraries.

    Here, an efficient and reliable method to extract RNA despite the robust yeast cell wall is described (Collart and Oliviero, 2001):

    1. With a sterile toothpick or pipette tip, pick single colonies of S. cerevisiae and inoculate 2 ml YEPD liquid media for growth at 30°C overnight.

    2. Inoculate 10 ml liquid YEPD with 50 μl overnight culture.

    3. Harvest the cells in the mid-log phase (OD600 ~1.0) by centrifugation, and transfer to a 1.5-ml Eppendorf tube. Resuspend in 300 μl DEPC-ddH2O, fast-spin in a table-top centrifuge (up to 9,500 × g), remove the supernatant, flash-freeze on dry ice, and store at -80°C.

    4. Resuspend the cell pellet in 400 µl TES solution. Add 400 µl acid phenol (CHCl3), cap the tube, and vortex vigorously for 10 s (avoid leakage and handle carefully!). Incubate for 60 min at 65°C, vortexing every 15 min (Collart and Oliviero, 2001).

    5. Place on ice for 5 min. Spin in a microfuge at 18,000 × g for 10 min at 4°C. Transfer the aqueous top layer to a clean tube (avoiding the white protein phase). Add 400 µl CHCl3 and vortex vigorously for 10 s. Spin in a microfuge at 18,000 × g for 10 min at 4°C. Transfer the aqueous top layer to a clean tube (pipette carefully and avoid the CHCl3 layer).

    6. Add a 1/10 volume of 3 M NaAc, pH 5.2, and 2.5 volumes EtOH (-20°C). Precipitate at -80°C for at least 60 min. Spin in a microfuge at 18,000 × g for 10 min at 4°C. Carefully remove the supernatant and wash the pellet by vortexing in 70% EtOH (-20°C). Spin in a microfuge at 18,000 × g for 10 min at 4°C.

    7. Resuspend the pellet in 100% formamide (at 4°C). Try an equal volume of liquid to pellet first, and move up from there. Most RNA should dissolve instantly. To aid solubilization, allow to sit at room temperature for 15 min, pipetting every 5 min. If the sample needs to be very concentrated, store at 4°C overnight.

    8. Determine the concentration by diluting 1/100 in H2O and measuring the absorbance at 260 nm and 280 nm (an OD260 of 1 corresponds to approximately 40 µg/ml RNA). Remember to add formamide at a 1/100 dilution to the blank.
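
As a worked example of the concentration estimate in Step A8, the following minimal sketch multiplies the measured OD260 by 40 µg/ml and by the dilution factor. The function name rna_conc and the example reading of 0.25 are illustrative assumptions and not part of the original protocol.

```shell
# rna_conc OD260 DILUTION_FACTOR
# Prints the estimated RNA concentration of the undiluted sample in µg/ml,
# using the rule of thumb that an OD260 of 1 corresponds to ~40 µg/ml RNA.
rna_conc() {
  awk -v a="$1" -v d="$2" 'BEGIN { printf "%.1f\n", a * 40 * d }'
}

# Hypothetical reading: OD260 = 0.25 measured on a 1/100 dilution.
rna_conc 0.25 100
```

With these example values, the undiluted sample would contain about 1,000 µg/ml (1 µg/µl) RNA.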


  2. RNA-seq library generation and sequencing

    Before sample preparation and submission to a sequencing facility, it is strongly recommended to discuss the aims of the project with the trained personnel. For successful library generation, the input RNA concentration is critical, commonly ranging from 1 μg down to 10 ng per sample. While it is possible to generate RNA-seq libraries from scratch (i.e., producing adaptors, buffers, polymerase, etc., using your own materials), it is strongly recommended to use commercially available kits that require minimum common lab resources and are, most importantly, more reliable in the hands of researchers unfamiliar with RNA-seq library formation.

      During RNA-seq library generation, platform-specific adapters are attached to the extracted RNA molecules; therefore, the library kit must be chosen according to the sequencing platform to be used. Here, the QuantSeq 3’ mRNA-seq Library Prep Kit FWD for Illumina (Lexogen) was used for the generation of single-end (i.e., fragments will be sequenced from one end only), 50-bp reads, sufficient for RNA-seq expression analysis in yeast. For more complex eukaryotic genomes containing larger amounts of introns, and when longer reads are required, consider paired-end library kits and sequencing (i.e., fragments will be sequenced from both ends), after consultation with the sequencing facility staff.

    1. Generate cDNA libraries containing sequencer- and sample-specific adapters by carefully following the steps described in the manufacturer’s manual.

    2. Check the quality of the generated libraries and measure the cDNA concentration.

    3. Sequence the cDNA library. Here, an Illumina HiSeq 4000 sequencer was used.

    4. Check the quality of the raw read data, typically supplied in fastq format (Figure 1). The most common checks involve the number of reads per sample (should be the same order of magnitude for all sequenced samples, which means the files should be similar in size), the GC content (should match the overall GC content of the host organism), and the overall base quality. Several quality control tools exist; here, fastqc was used (see also Batut, 2021). It is strongly recommended that quality control is performed after each processing step to ensure the overall integrity of the data.



      Figure 1. An example read sequenced on an Illumina platform in FASTQ format. Line 1 contains the basic read information, line 2 contains the actual sequence, line 3 is a separator line beginning with “+”, and line 4 contains the quality score for each base in Phred33 or Phred64 code.
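
To complement fastqc with a quick sanity check, the four-line FASTQ layout described in Figure 1 can be summarized directly on the command line. The sketch below is a minimal illustration that assumes Phred33 encoding and unwrapped four-line records; the function name fastq_stats is hypothetical.

```shell
# fastq_stats FILE
# Prints the number of reads, the mean read length, and the mean base quality.
# Assumes four lines per read and Phred33 quality encoding (ASCII code minus 33).
fastq_stats() {
  awk '
    BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }  # char -> ASCII code
    NR % 4 == 2 { bases += length($0) }                             # sequence line
    NR % 4 == 0 {                                                   # quality line
      n = length($0)
      for (j = 1; j <= n; j++) qsum += ord[substr($0, j, 1)] - 33
    }
    END {
      reads = NR / 4
      printf "reads=%d mean_len=%.1f mean_q=%.1f\n", reads, bases / reads, qsum / bases
    }
  ' "$1"
}

# Usage (hypothetical file name):
# fastq_stats sample1.fastq
```

Comparing the read counts reported for all samples gives a fast check that they are of the same order of magnitude, as recommended in Step B4.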


  3. Processing of raw sequence data

    C1. Preparation of data processing

    The following steps describe the setup for the computational workflow described in Step C2, as well as the data analysis described in Procedure D. This workflow uses open source programs available on Linux operating systems (and its derivatives). While it is possible to process sequence files on Mac- or Windows-operated instruments, the reader is strongly recommended to use Linux-based utilities due to their wide applicability, timely updates, and community-based troubleshooting.

      For most of the processing steps described in Step C2 and Procedure D, multiple tools exist; in particular, for the acquisition of genome assembly and gene annotations (Step C2.1), creation of index files (Step C2.2), adapter trimming (Step C2.3), read alignment (Step C2.4), and read filtering based on quality (Step C2.5). While it extends beyond the scope of this manuscript to describe all the tools in detail, alternatives to the programs used in this protocol are suggested.

      Several of the tools used here are available through so-called package managers, such as Bioconda or BiocManager, allowing for easy installation of the most recent versions of software and their dependencies; hence, it is recommended to follow the installation order described here.


    1. Install the Conda package manager (Miniconda) using the installer script, and then install R, bowtie2, and samtools through the Bioconda channel.


      curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

      sh Miniconda3-latest-Linux-x86_64.sh

      conda install -c bioconda R

      conda install -c bioconda bowtie2

      conda install -c bioconda samtools


    2. Install BiocManager and the DESeq2 package from within R.


      R

      if (!requireNamespace("BiocManager", quietly = TRUE))

          install.packages("BiocManager")

      BiocManager::install()

      BiocManager::install(c("DESeq2"))


    C2. Processing of raw sequence read files

    After passing quality checks, the sequence reads now undergo pre-processing and eventually, alignment to the genome of the host organism. First, genome assembly, gene annotation, and the genome index need to be prepared (C2.1 and C2.2). The sequence reads contain adapter contamination, random primer sequences, and low-quality tail reads, which need to be removed (C2.3) before the alignment of filtered reads to the genome of the host organism (C2.4). Finally, reads are filtered based on their quality score (C2.5) and indexed for downstream analysis (C2.6).


    1. Download the S. cerevisiae genome assembly and gene annotation. Here, UCSC versions, sacCer3.fa and sacCer3.ensGene.gtf, were used, respectively (downloaded from http://hgdownload.cse.ucsc.edu/goldenPath/sacCer3/bigZips/). Besides the UCSC genome browser, other platforms allow the download of genome assemblies and gene annotations (e.g., NIH/NCBI or Ensembl). Since every platform maintains a slightly different naming convention, it is important that genome assembly and gene annotation are acquired from the same platform to avoid program errors during processing.

    2. Create the index files based on the genome assembly (“sacCer3.fa”) from Step C2.1. With the bwa command below, the index output files are stored next to the assembly with the base filename “sacCer3.fa” (e.g., sacCer3.fa.bwt). This index will be used by the alignment program in Step C2.4; hence, the indexing step must be adapted to the aligner of choice, and an example for bwa is given below. Caution: bwa is not a splice-aware alignment tool. If splice events need to be considered during analysis, aligners like tophat need to be used for index creation as well as alignment (Step C2.4).


      bwa index sacCer3.fa


    3. Remove the random primer sequence, adapter contamination, and low-quality tails. Here, bbmap was used for the library kit described in Step B1, according to the manufacturer’s recommended settings (see command below, from www.lexogen.com/quantseq-data-analysis). bbmap is a fast, splice-aware global alignment tool for RNA and DNA sequencing reads; its script “bbduk.sh” performs the trimming and is used together with the polyA-tail sequence (polyA.fa.gz) and Illumina-specific adapter sequence information (truseq.fa.gz), both located in the installation folder (bbmap/resources/).

      Adapter sequences and low-quality reads can also be removed using different tools such as “trimmomatic” (see the software section). Independent of the software used, parameters need to be adjusted according to the library kit and sequencing platform used. As a result of trimming, the size of the output file should be slightly smaller than that of the input file.


      bbmap/bbduk.sh in=sample1.fastq out=sample1_trimmed.fastq ref=bbmap/resources/polyA.fa.gz,bbmap/resources/truseq.fa.gz k=13 ktrim=r forcetrimleft=11 useshortkmers=t mink=5 qtrim=t trimq=10 minlength=20


    4. Create alignments of the pre-processed sequence reads from Step C2.3 using an alignment tool, such as tophat, bbmap, bowtie2, or bwa. Here, bwa was used. Depending on the sequence file size (i.e., the number of reads), genome size, and CPU used, this step can take several minutes to several hours. On an Intel® Xeon® CPU E5-2699 v3 @ 2.3 GHz, alignment took about 3 s per 100 k reads.

      The following command will align the input file reads (sample1_trimmed.fastq) to the genome index from Step C2.2 (sacCer3.fa) and write the results to “sample1_trimmed_aligned.sam.”


      bwa mem sacCer3.fa sample1_trimmed.fastq > sample1_trimmed_aligned.sam


    5. Filter the aligned reads by mapping quality (MAPQ) using samtools. Here, only reads with a MAPQ score of at least 50 were retained (i.e., the probability of correct mapping is at least 99.999%; MAPQ is Phred-scaled like the base qualities shown in Figure 1), and all other reads were removed from the mapped read files generated in Step C2.4. As a result of the quality filtering, the output file size will be smaller than the input file size. If no, or very few, reads remain, try filtering with a less stringent quality score (e.g., 20). If this recovers the number of reads, downstream analysis may still be possible, albeit less reliable.


      samtools view -bq 50 sample1_trimmed_aligned.sam > sample1_trimmed_aligned_mapq50.bam
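
To judge how stringent a given threshold is, a MAPQ score Q can be converted into the probability of correct mapping via the Phred relation P(correct) = 1 - 10^(-Q/10). The following minimal sketch performs this conversion; the function name mapq_to_prob is hypothetical.

```shell
# mapq_to_prob Q
# Converts a Phred-scaled mapping quality into the probability that the
# reported mapping position is correct: 1 - 10^(-Q/10).
mapq_to_prob() {
  awk -v q="$1" 'BEGIN { printf "%.5f\n", 1 - 10 ^ (-q / 10) }'
}

mapq_to_prob 50   # threshold used in this protocol
mapq_to_prob 20   # less stringent alternative
```

A threshold of 50 therefore corresponds to one expected mapping error in 100,000 reads, and a threshold of 20 to one in 100.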


    6. Sort the filtered, aligned reads from Step C2.5 and create the index files using samtools. This will create an index file with the same name as the input file, including the additional ending of “.bai.”

      Aside from the quality check of the output file using tools such as fastqc, the mapped reads can be visualized using the Integrative Genomics Viewer (IGV). In addition to ensuring that the overall mapping of reads is correct, tools like IGV allow confirmation of the presence of intended genomic mutations or gene deletions (Figure 2).


      samtools sort sample1_trimmed_aligned_mapq50.bam -o sample1_trimmed_aligned_mapq50_sorted.bam

      samtools index sample1_trimmed_aligned_mapq50_sorted.bam



      Figure 2. Snapshot of the IGV browser visualization. Here, reads (grey bars) mapping to the genomic region of Set2 (blue, YJL168C) in yeast are compared between wild type (upper lane) and the ∆Set2 mutant (lower lane).
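
Because every replicate passes through the same commands, Steps C2.3 to C2.6 can be applied to all samples with a simple loop. The sketch below only prints the commands from this section for each input file, so that the file names can be checked before anything is run. The function name print_pipeline and the sample names WT_1 and MUT_1 are hypothetical, and the file-naming scheme simply follows the sample1 example used above.

```shell
# print_pipeline FILE.fastq [FILE.fastq ...]
# Prints (does not run) the trimming, alignment, filtering, sorting, and
# indexing commands of Steps C2.3 to C2.6 for every FASTQ file given.
print_pipeline() {
  for fq in "$@"; do
    s="${fq%.fastq}"   # sample1.fastq -> sample1
    echo "bbmap/bbduk.sh in=${s}.fastq out=${s}_trimmed.fastq ref=bbmap/resources/polyA.fa.gz,bbmap/resources/truseq.fa.gz k=13 ktrim=r forcetrimleft=11 useshortkmers=t mink=5 qtrim=t trimq=10 minlength=20"
    echo "bwa mem sacCer3.fa ${s}_trimmed.fastq > ${s}_trimmed_aligned.sam"
    echo "samtools view -bq 50 ${s}_trimmed_aligned.sam > ${s}_trimmed_aligned_mapq50.bam"
    echo "samtools sort ${s}_trimmed_aligned_mapq50.bam -o ${s}_trimmed_aligned_mapq50_sorted.bam"
    echo "samtools index ${s}_trimmed_aligned_mapq50_sorted.bam"
  done
}

# Hypothetical sample files; six replicates would be listed in practice.
print_pipeline WT_1.fastq MUT_1.fastq
```

Once the printed commands look correct, piping the output into sh would execute them, assuming that bbmap, bwa, and samtools are installed and the index from Step C2.2 exists.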


  4. Data analysis, calculation of expression values, and visualization of results

    The sorted and indexed files prepared in Procedure C contain all the reads that were successfully aligned to the host’s genome (here, S. cerevisiae), numbering typically from several hundreds of thousands to millions of reads per replicate. This vast amount of information is a major hurdle for analysis by the researcher. Differential gene expression (DGE) analysis aims to determine which, if any, genes show a higher or lower amount of aligned reads across the tested conditions. To this end, reads belonging to a feature (i.e., a gene) are summed for each replicate, and differential expression values are calculated across conditions considering the variance within a condition among replicates; hence, it is critical for DGE that several replicates of the same condition are considered (typically, n = 3). Gene expression values are usually reported as log2-fold changes, in conjunction with adjusted P-values describing the significance of the change (cut-offs vary, but typically adjusted P-values < 0.05 are considered reliable).

      The number of aligned reads can differ strongly between replicates due to technical reasons (e.g., fluctuations in the amount of input RNA, variations in temperature of the thermocycler during library amplification, or differences in the binding capacity of the flowcell lanes); hence, reads must be normalized across replicates and conditions. Several normalization methods for the calculation of DGE values exist, such as Reads Per Kilobase of transcript per Million mapped reads (RPKM), Fragments Per Kilobase of transcript per Million mapped reads (FPKM), Transcripts per Million reads (TPM), or counts per feature (i.e., gene) (Dillies et al., 2013).

      Here, the count-based normalization by DESeq2 was used, based on the assumption that most genes are not differentially expressed across conditions; therefore, the counts per feature are extracted from each file generated in Step C2.6 using htseq (Step D1), combined, and indexed (Steps D2 and D3). Finally, the counts are normalized, and DGE values are calculated using DESeq2 (Steps D4 and D5). The results are visually represented using MA plots, where log-fold changes are plotted against the mean expression values (Step D5).
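
For comparison with the count-based approach used here, the length-normalized measures can be written as RPKM = count × 10^9 / (total mapped reads × gene length in bp) and TPM = (count / length) / Σ(count / length) × 10^6. The following minimal sketch computes both from a tab-delimited file; the three-column input layout (gene, length in bp, read count) and the function name rpkm_tpm are assumptions made for illustration and are not produced by any step of this protocol.

```shell
# rpkm_tpm FILE
# FILE: tab-delimited, one gene per line: gene name, gene length (bp), read count.
# Prints: gene name, RPKM, TPM.
rpkm_tpm() {
  awk -F '\t' '
    {
      gene[NR] = $1; len[NR] = $2; cnt[NR] = $3
      total    += $3        # total number of mapped reads
      rate_sum += $3 / $2   # sum of length-normalized counts
    }
    END {
      for (i = 1; i <= NR; i++) {
        rpkm = cnt[i] * 1e9 / (total * len[i])
        tpm  = (cnt[i] / len[i]) / rate_sum * 1e6
        printf "%s\t%.1f\t%.1f\n", gene[i], rpkm, tpm
      }
    }
  ' "$1"
}

# Usage (hypothetical file name):
# rpkm_tpm gene_lengths_counts.txt
```

These values are useful for inspecting expression levels within a sample, whereas the DGE statistics in this protocol are computed by DESeq2 from the raw counts.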


    1. Extract the counts for each sample using htseq-count. Here, the aligned, filtered, and sorted reads (e.g., sample1_trimmed_aligned_mapq50_sorted.bam) from Step C2.6 and the gene annotation file (sacCer3.ensGene.gtf) from Step C2.1 were used. This command generates a .txt file containing the number of reads assigned to each gene annotated in the gtf-file.


      htseq-count -f bam sample1_trimmed_aligned_mapq50_sorted.bam sacCer3.ensGene.gtf > sample1_trimmed_aligned_mapq50_sorted_counts.txt


    2. Count-based expression values are calculated using R and DESeq2; this requires the count data to be assembled in a text document (here, “counts.txt”) as well as in an index file (here, “table.txt,” Step D3).

      Generate a “counts.txt”-file that contains the counts for each replicate of a given sample (here, MUT_X) as well as the reference sample (here, WT_X) generated in Step D1 as columns in a tab-delimited txt document (Figure 3). As a quality check, it is recommended to check several lines (i.e., genes) for consistency (i.e., similar read counts among replicates of a certain condition). Importantly, the read counts are not yet normalized to the total number of read counts in each sample, and respective variations are expected.



      Figure 3. Example of a counts.txt-file in tab-delimited format. The first column designates the names of open reading frames (ORFs), and the first row indicates the names of the wild-type and mutant replicates. The numerical matrix contains the number of reads mapped in each replicate to the respective ORF.
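
The counts.txt file can be assembled by hand in a spreadsheet or on the command line. The sketch below is one possible way to merge the per-sample htseq-count outputs from Step D1 into a single tab-delimited table. It uses the header name region_name expected by the R command in Step D4 and skips the summary lines that htseq-count appends (lines beginning with "__", such as __no_feature). The function name merge_counts and the column names are hypothetical.

```shell
# merge_counts "NAME1 NAME2 ..." FILE1 FILE2 ...
# Merges htseq-count outputs (gene<TAB>count) into one table with one column
# per replicate; column names are taken from the first argument, in order.
merge_counts() {
  names="$1"; shift
  awk -F '\t' -v names="$names" '
    FNR == 1 { col++ }                 # a new input file starts a new column
    /^__/    { next }                  # skip htseq-count summary lines
    {
      if (!($1 in seen)) { seen[$1] = 1; order[++n] = $1 }
      count[$1, col] = $2
    }
    END {
      split(names, name, " ")
      printf "region_name"
      for (c = 1; c <= col; c++) printf "\t%s", name[c]
      printf "\n"
      for (i = 1; i <= n; i++) {
        printf "%s", order[i]
        for (c = 1; c <= col; c++) printf "\t%d", count[order[i], c]
        printf "\n"
      }
    }
  ' "$@"
}

# Usage (hypothetical file names following the naming scheme of Step D1):
# merge_counts "WT_1 WT_2 WT_3 MUT_1 MUT_2 MUT_3" WT_1_counts.txt WT_2_counts.txt \
#   WT_3_counts.txt MUT_1_counts.txt MUT_2_counts.txt MUT_3_counts.txt > counts.txt
```

Genes are listed in the order in which they first appear, and a gene missing from one file is reported with a count of 0.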


    3. Generate a “table.txt”-file for each sample, indexing each column of data (Figure 4) in tab-delimited format.



      Figure 4. Example of a table.txt-file in tab-delimited format. Replicate names, as designated in counts.txt from Step D2, are indexed by their common condition (e.g., wild type or mutant).
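
The table.txt file can likewise be generated from the replicate names. The sketch below assumes that replicates are named CONDITION_NUMBER (e.g., WT_1) and derives the condition by removing the trailing replicate number; the header names sample_name and condition are those used by the R commands in Steps D4 and D5, whereas the function name make_sample_table is hypothetical.

```shell
# make_sample_table NAME [NAME ...]
# Prints a tab-delimited table mapping each replicate name (e.g., WT_1)
# to its condition (the name without the trailing _NUMBER, e.g., WT).
make_sample_table() {
  printf 'sample_name\tcondition\n'
  for s in "$@"; do
    printf '%s\t%s\n' "$s" "${s%_*}"
  done
}

# Hypothetical replicate names; list them in the same order as the columns of counts.txt.
make_sample_table WT_1 WT_2 WT_3 MUT_1 MUT_2 MUT_3
```

Redirecting the output to table.txt yields a file in the layout described for Figure 4.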


    4. In R, load the DESeq2 library, the combined counts-file from Step D2, and the table-file from Step D3.


      library(DESeq2)

      count_table <- read.delim('counts.txt', sep = '\t', header = TRUE, row.names = 'region_name')

      sample_table <- read.delim('table.txt', sep = '\t', header = TRUE, row.names = 'sample_name')


    5. Write the RNA-seq expression and p-values to file using DESeq2. The generated .txt file (wt_mutant_p-values.txt) contains the log2-fold expression and p-values for the respective mutant in tab-delimited format and can now be used for further analysis or visualization. Note that DESeq2 takes the alphabetically first level of the condition column as the reference unless the factor levels are set explicitly, so the direction of the reported log2-fold changes should be confirmed. For data inspection, an MA-plot is generated (Figure 5). In MA-plots generated by DESeq2, significant hits are colored in red; hence, the first quality check is how many data points are colored in black (i.e., since most genes are not differentially expressed, most data points should be colored in black).


      dds <- DESeqDataSetFromMatrix(countData = count_table,colData = sample_table,design = ~ condition)

      dds <- DESeq(dds)

      res <- results(dds)

      resOrdered <- res[order(res$padj),]

      plot <- plotMA(res, main = 'mutant', ylim = c(-2,2), xlab = 'mean count')

      write.table(as.data.frame(resOrdered), sep = '\t', quote = FALSE, file = 'wt_mutant_p-values.txt')



      Figure 5. Example of MA-plot analysis as generated by DESeq2. Genes that are statistically significantly up- or downregulated are marked in red above and below the x-axis, respectively.
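
For further analysis outside of R, the results file can be filtered for significant genes on the command line. The sketch below assumes the layout written by the write.table command above: a header line naming the six DESeq2 result columns (baseMean, log2FoldChange, lfcSE, stat, pvalue, padj) and data lines that additionally start with the gene name, so that padj is the seventh field. The function name significant_genes and the default cut-off argument are hypothetical.

```shell
# significant_genes FILE [ALPHA]
# Prints gene name, log2-fold change, and adjusted p-value for all genes
# with an adjusted p-value below ALPHA (default 0.05); NA values are skipped.
significant_genes() {
  awk -F '\t' -v alpha="${2:-0.05}" '
    NR > 1 && $7 != "NA" && ($7 + 0) < alpha { print $1 "\t" $3 "\t" $7 }
  ' "$1"
}

# Usage:
# significant_genes wt_mutant_p-values.txt 0.05
```

Because the file is already ordered by adjusted p-value, the most significant genes appear first in the output.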

Recipes

  1. Yeast Extract Peptone Dextrose (YEPD) media

    For each liter of YEPD, autoclave a mixture of 20 g Bacto Peptone, 10 g yeast extract, and 950 ml H2O. Add 50 ml 40% (w/v) glucose, mix and cool before use.

    For YEPD plates, add 24 g agar to the solution before autoclaving. Place the autoclaved solution on a magnetic stir plate, add a stir bar and 50 ml 40% (w/v) glucose, and cool the solution while stirring. Pour warm media into Petri dishes, allow to cool until solid, and store at 4°C until use.

  2. 1 M Tris-HCl solution, pH 7.5

    Dissolve 121.14 g Tris in 800 ml H2O.

    Adjust the pH to 7.5 with HCl.

    Bring the final volume to 1 L with deionized H2O.

    Autoclave and store at room temperature.

  3. 0.5 M EDTA solution, pH 8.0

    Add 18.6 g EDTA to 80 ml H2O (use DEPC-treated H2O).

    Mix on a magnetic stirrer until dissolved.

    Adjust the pH to 8.0 with NaOH (~2 g NaOH pellets).

    Bring the final volume to 100 ml with H2O.

    Dispense into aliquots and sterilize by autoclaving.

  4. 20% SDS solution

    Dissolve 20 g SDS in 90 ml H2O (use DEPC-treated H2O).

    Heat to 68°C and mix with a magnetic stirrer until dissolved.

    Bring the final volume to 100 ml with H2O.

  5. Tris-EDTA-SDS (TES) solution

    10 mM Tris-HCl pH 7.5

    10 mM EDTA pH 8.0

    0.5% SDS

  6. 3 M NaAc solution, pH 5.2

    Add 24.6 g sodium acetate to 80 ml H2O.

    Mix on a magnetic stirrer until dissolved.

    Adjust the pH to 5.2 with glacial acetic acid.

    Bring the volume to 100 ml with H2O.
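
Recipe 5 lists only final concentrations. As a worked example, the stock volumes needed for the TES solution follow from C_stock × V_stock = C_final × V_final. The sketch below applies this relation to the stock solutions from Recipes 2 to 4; the batch volume of 10 ml and the function name stock_volume are illustrative assumptions.

```shell
# stock_volume FINAL_CONC STOCK_CONC FINAL_VOLUME
# Prints the stock volume (same unit as FINAL_VOLUME) required to reach
# FINAL_CONC; both concentrations must be given in the same unit.
stock_volume() {
  awk -v cf="$1" -v cs="$2" -v vf="$3" 'BEGIN { printf "%.3f\n", cf * vf / cs }'
}

# Hypothetical 10 ml batch of TES solution:
stock_volume 10 1000 10   # ml of 1 M Tris-HCl, pH 7.5, for 10 mM
stock_volume 10 500 10    # ml of 0.5 M EDTA, pH 8.0, for 10 mM
stock_volume 0.5 20 10    # ml of 20% SDS for 0.5%
```

For this example, 0.1 ml Tris-HCl, 0.2 ml EDTA, and 0.25 ml SDS stock would be combined and brought to 10 ml with 9.45 ml H2O.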

Acknowledgments

I would like to thank Dr. Pavel Sinitcyn, Dr. Assa Yeroslaviz, and Dr. Rin Ho Kim from the Next-Generation Sequencing Core Facility at MPI Biochemistry for critical reading of the manuscript.

This protocol is based on the RNA-seq expression analysis performed in Braberg et al. (2020).

References

  1. Andrews, S. (2010). FastQC: a quality control tool for high throughput sequence data.
  2. Anders, S., Pyl, P. T. and Huber, W. (2014). HTSeq – A Python framework to work with high-throughput sequencing data. bioRxiv. doi: https://doi.org/10.1101/002824.
  3. Bainbridge, M. N., Warren, R. L., Hirst, M., Romanuik, T., Zeng, T., Go, A., Delaney, A., Griffith, M., Hickenbotham, M., Magrini, V., Mardis, E. R., Sadar, M. D., Siddiqui, A. S., Marra, M. A. and Jones, S. J. (2006). Analysis of the prostate cancer cell line LNCaP transcriptome using a sequencing-by-synthesis approach. BMC Genomics 7: 246.
  4. Batut, B. (2021). Quality Control (Galaxy Training Materials).
  5. Bolger, A. M., Lohse, M. and Usadel, B. (2014). Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30(15): 2114-2120.
  6. Braberg, H., Echeverria, I., Bohn, S., Cimermancic, P., Shiver, A., Alexander, R., Xu, J., Shales, M., Dronamraju, R., Jiang, S., Dwivedi, G., Bogdanoff, D., Chaung, K. K., Huttenhain, R., Wang, S., Mavor, D., Pellarin, R., Schneidman, D., Bader, J. S., Fraser, J. S., Morris, J., Haber, J. E., Strahl, B. D., Gross, C. A., Dai, J., Boeke, J. D., Sali, A. and Krogan, N. J. (2020). Genetic interaction mapping informs integrative structure determination of protein complexes. Science 370(6522).
  7. Cheung, F., Haas, B. J., Goldberg, S. M., May, G. D., Xiao, Y. and Town, C. D. (2006). Sequencing Medicago truncatula expressed sequenced tags using 454 Life Sciences technology. BMC Genomics 7: 272.
  8. Collart, M. A. and Oliviero, S. (2001). Preparation of yeast RNA. Curr Protoc Mol Biol Chapter 13: Unit13 12.
  9. Dillies, M. A., Rau, A., Aubert, J., Hennequet-Antier, C., Jeanmougin, M., Servant, N., Keime, C., Marot, G., Castel, D., Estelle, J., Guernec, G., Jagla, B., Jouneau, L., Laloe, D., Le Gall, C., Schaeffer, B., Le Crom, S., Guedj, M., Jaffrezic, F. and French StatOmique, C. (2013). A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Brief Bioinform 14(6): 671-683.
  10. Grüning, B., Dale, R., Sjodin, A., Chapman, B. A., Rowe, J., Tomkins-Tinch, C. H., Valieris, R., Koster, J. and Bioconda, T. (2018). Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods 15(7): 475-476.
  11. Emrich, S. J., Barbazuk, W. B., Li, L. and Schnable, P. S. (2007). Gene discovery and annotation using LCM-454 transcriptome sequencing. Genome Res 17(1): 69-73.
  12. Haque, A., Engel, J., Teichmann, S. A. and Lonnberg, T. (2017). A practical guide to single-cell RNA-sequencing for biomedical research and clinical applications. Genome Med 9(1): 75.
  13. Langmead, B. and Salzberg, S. L. (2012). Fast gapped-read alignment with Bowtie 2. Nat Methods 9(4): 357-359.
  14. Langmead, B., Trapnell, C., Pop, M. and Salzberg, S. L. (2009). Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10(3): R25.
  15. Li, H. and Durbin, R. (2009). Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25(14): 1754-1760.
  16. Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., Durbin, R. and Genome Project Data Processing, S. (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics 25(16): 2078-2079.
  17. Logsdon, G. A., Vollger, M. R. and Eichler, E. E. (2020). Long-read human genome sequencing and its applications. Nat Rev Genet 21(10): 597-614.
  18. Love, M. I., Huber, W. and Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol 15(12): 550.
  19. R Core Team. (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.
  20. Robinson, J. T., Thorvaldsdottir, H., Winckler, W., Guttman, M., Lander, E. S., Getz, G. and Mesirov, J. P. (2011). Integrative genomics viewer. Nat Biotechnol 29(1): 24-26.
  21. Nagalakshmi, U., Wang, Z., Waern, K., Shou, C., Raha, D., Gerstein, M. and Snyder, M. (2008). The transcriptional landscape of the yeast genome defined by RNA sequencing. Science 320(5881): 1344-1349.
  22. Weber, A. P., Weber, K. L., Carr, K., Wilkerson, C. and Ohlrogge, J. B. (2007). Sampling the Arabidopsis transcriptome with massively parallel pyrosequencing. Plant Physiol 144(1): 32-42.

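Introduction

[Abstract] Genome-wide sequencing of RNA (RNA-seq) has become an inexpensive tool to gain key insights into cellular and disease mechanisms. Sample preparation and sequencing are streamlined and allow the acquisition of hundreds of gene expression profiles in a few days; however, data processing, curation, and analysis in particular involve numerous steps that can be overwhelming to non-experts. Here, the sample preparation, sequencing, and data processing workflow for RNA-seq expression analysis in yeast is described. While this protocol covers only a small portion of the RNA-seq landscape, the principal workflow common to such experiments is described, allowing the reader to adapt the protocol where necessary.

Graphic abstract:

Basic workflow of RNA-seq expression analysis.


[Background] Sequencing of RNA (RNA-seq) has, with the emergence of next-generation sequencing, become a powerful tool to measure the presence and quantity of RNA in a given cell population or even within a single cell. Since its initial uses (Bainbridge et al., 2006; Cheung et al., 2006; Emrich et al., 2007; Weber et al., 2007; Nagalakshmi et al., 2008), RNA-seq has seen a large variety of applications: from gene expression analysis by quantitating the relative amounts of RNA sequence reads to the discovery of novel transcripts or splice variants, ribosome profiling, or the detection of single nucleotide polymorphisms. Even though a multitude of specific RNA-seq uses exists, the basic workflow remains the same: RNA molecules are extracted, amplified, sequenced, and aligned to the genome of the host organism, and the data is subsequently analyzed. Sample requirements are relatively low; typically, 1 μg down to 10 ng input RNA is sufficient for downstream amplification and library generation. For single-cell RNA-seq, as little as 10 pg is required since the low amount of input material is amplified prior to library generation (Haque et al., 2017). RNA-seq can be applied to any population of extracted RNA, independent of the source organism. Due to the wide applicability of RNA-seq, sample preparation kits that vary in complexity are commercially available (from RNA extraction to whole RNA-seq library generation, including the computational analysis).
An integral part of RNA-seq is the sequencing of the extracted RNA population. Most commonly, sequencing is performed by detecting fluorescently labeled nucleotides bound to the surface of a flow cell, e.g., using platforms such as Illumina and Pacific Biosciences. For this, RNA fragments are converted into a cDNA library and amplified, and flow-cell adapters are introduced. In each sequencing cycle, a DNA polymerase attaches fluorescently labeled nucleotides to the flow-cell-bound library molecules, which are then detected by the sequencer, typically resulting in read lengths of 150-300 bp up to several kbp (for Illumina and Pacific Biosciences, respectively). A recent development is the sequencing of nucleic acids through protein nanopore channels embedded in a membrane (e.g., by Oxford Nanopore Technologies) (Logsdon et al., 2020), allowing the sequencing of much longer fragments (up to Mbp). At the time of writing, the most commonly used sequencers (e.g., from Illumina or Pacific Biosciences) cost approximately $100K for the instrument alone, whereas benchtop sequencers using nanopore technology are considerably cheaper (~$10K), promising wider applicability in the near future. Due to the considerable cost of the most commonly used sequencing systems, resources are often shared between laboratories or research institutes and managed by trained professionals to ensure the acquisition and integrity of high-quality sequencing data.
While it extends beyond the scope of this manuscript to describe all applications of RNA-seq, this protocol aims to provide a workflow for RNA-seq expression analysis that can be used as a reference backbone, which the reader can adapt to specific needs (e.g., RNA extraction from a different source or the addition of a splice-aware alignment step for higher eukaryotic genomes). RNA-seq expression analysis is a powerful and commonly used tool to determine which genes are up- or downregulated in a stressed sample (e.g., in the presence of genomic mutations, UV light, drugs, chemicals, or nutrient stress) compared with an unstressed sample (e.g., a wild-type cell population). A gene is "upregulated" or "downregulated" when more or less of its RNA, respectively, is measured (i.e., expressed in the cell) under stress conditions compared with the wild type.
Here, the workflow for RNA-seq expression analysis in Saccharomyces cerevisiae is described, from cell growth to RNA extraction, library generation, data processing, and analysis. The protocol focuses on using commonly available laboratory resources wherever possible and makes use of open-source and free software packages provided by the bioinformatics community. This workflow has proven robust and useful for the analysis of gene expression profiles in a library of yeast histone point mutants (Braberg et al., 2020).

Keywords: mRNA, Sequence analysis, Yeast, Relative expression levels, Next-generation sequencing, Systems biology, Whole genome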

 
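Materials and Reagents
 
1. 1.5 mL Eppendorf tubes (e.g., Eppendorf, catalog number: 0030120086)
2. Petri dishes, plastic, 10 cm diameter (e.g., Falcon, catalog number: 353003)
3. Sterile pipette tips
4. Toothpicks (autoclaved)
5. Dry ice
6. Agar (e.g., Becton Dickinson, catalog number: 214030)
7. Bacto Peptone (e.g., Becton Dickinson, catalog number: 211677)
8. CHCl3 (chloroform, acid phenol preparation, e.g., Thermo Fisher, catalog number: AM9720)
9. DEPC-ddH2O (diethyl pyrocarbonate-treated water, e.g., Invitrogen, catalog number: 750024)
10. EDTA (ethylenediaminetetraacetic acid disodium salt dihydrate, e.g., Sigma-Aldrich, catalog number: E6635)
11. EtOH (ethanol, e.g., Sigma-Aldrich, catalog number: 459836)
12. Formamide (e.g., Sigma-Aldrich, catalog number: 11814320001)
13. Glucose (e.g., Molekula, catalog number: 13002238)
14. HCl (hydrochloric acid, e.g., Sigma-Aldrich, catalog number: 320331)
15. NaAc (anhydrous sodium acetate, e.g., Sigma-Aldrich, catalog number: S2889)
16. NaOH (sodium hydroxide pellets, e.g., Sigma-Aldrich, catalog number: 1064980500)
17. SDS (sodium dodecyl sulfate, e.g., Merck, catalog number: 13760)
18. Tris (2-amino-2-(hydroxymethyl)-1,3-propanediol, e.g., Sigma-Aldrich, catalog number: T1503)
19. Yeast extract (e.g., Serva, catalog number: 24540)
20. Yeast extract peptone dextrose (YEPD) medium (see Recipes)
21. 1 M Tris-HCl solution, pH 7.5 (see Recipes)
22. 0.5 M EDTA solution, pH 8.0 (see Recipes)
23. 20% SDS solution (see Recipes)
24. Tris-EDTA-SDS (TES) solution (see Recipes)
25. 3 M NaAc solution, pH 5.2 (see Recipes)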
 
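Equipment
 
1. Autoclave
2. Centrifuge and tabletop centrifuge
3. Vortex
4. Flasks, autoclavable
5. Incubator
6. pH meter
7. Pipettes (1-mL, 200-μL, 20-μL, 2-μL)
8. Stir bar and stir plate, magnetic
9. Thermocycler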


Software
 
bbmap (BBMap – Bushnell, B.; version 38.90; sourceforge.net/projects/bbmap)
BiocManager (https://cran.r-project.org/web/packages/BiocManager/vignettes/BiocManager.html; version 3.12; https://bioconductor.org/install/)
bioconda (Grüning et al., 2018; http://bioconda.github.io/user/install.html)
bowtie2 (Langmead and Salzberg, 2012; version 2.4.2; http://bowtie-bio.sourceforge.net/bowtie2/index.shtml)
bwa (Burrows-Wheeler Aligner, Li and Durbin, 2009; version 0.7.17; http://bio-bwa.sourceforge.net/)
DESeq2 (Love et al., 2014; version 1.30.1; http://bioconductor.org/packages/release/bioc/html/DESeq2.html)
fastqc (Andrews, 2010; FastQC: a quality control tool for high throughput sequence data; version 0.11.9; https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
htseq-count (Anders et al., 2014; version 0.11.1; https://htseq.readthedocs.io/en/release_0.11.1/install.html)
Integrative Genomics Viewer (Robinson et al., 2011; version 2.9.2; https://software.broadinstitute.org/software/igv/download)
R (R Core Team, 2017; version 4.0.4; https://cran.r-project.org/bin/windows/base)
samtools (Li et al., 2009; version 1.11; http://www.htslib.org/download/)
tophat (Langmead et al., 2009; version 2.1.1; https://ccb.jhu.edu/software/tophat/index.shtml)
trimmomatic (Bolger et al., 2014; version 0.40; http://www.usadellab.org/cms/?page=trimmomatic)
 
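Procedure
 
A. Sample preparation and RNA extraction
When handling RNA, it is of utmost importance that all materials and reagents are RNase-free. Furthermore, note that the calculation of relative expression values described in Procedure D requires at least three biological replicates. Hence, if, for example, the expression levels of one mutant strain are compared with those of a wild-type strain, six RNA samples need to be prepared (three for each strain), which can then be used to create six RNA-seq libraries.
Here, an efficient and reliable method to extract RNA despite the robust yeast cell wall is described (Collart and Oliviero, 2001):
1. With a sterile toothpick or pipette tip, pick a single colony of S. cerevisiae, inoculate 2 mL of liquid YEPD medium, and grow overnight at 30 °C.
2. Inoculate 10 mL of liquid YEPD with 50 μL of the overnight culture.
3. Harvest the cells at mid-log phase (OD600 ~1.0) by centrifugation and transfer to a 1.5-mL Eppendorf tube. Resuspend in 300 μL DEPC-ddH2O, spin quickly in a tabletop centrifuge (up to 9,500 × g), remove the supernatant, flash-freeze on dry ice, and store at -80 °C.
4. Resuspend the cell pellet in 400 μL TES solution. Add 400 μL acid phenol (CHCl3), cap the tube, and vortex vigorously for 10 s (avoid leakage and handle carefully!). Incubate at 65 °C for 60 min, vortexing every 15 min (Collart and Oliviero, 2001).
5. Place on ice for 5 min. Spin in a microcentrifuge at 18,000 × g for 10 min at 4 °C. Transfer the aqueous top layer to a clean tube (avoid the white protein interphase). Add 400 μL CHCl3 and vortex vigorously for 10 s. Spin in a microcentrifuge at 18,000 × g for 10 min at 4 °C. Transfer the aqueous top layer to a clean tube (pipette carefully and avoid the CHCl3 layer).
6. Add 1/10 volume of 3 M NaAc, pH 5.2, and 2.5 volumes of ice-cold EtOH (-20 °C). Precipitate at -80 °C for at least 60 min. Spin in a microcentrifuge at 18,000 × g for 10 min at 4 °C. Carefully remove the supernatant and wash the pellet by vortexing in 70% EtOH (-20 °C). Spin in a microcentrifuge at 18,000 × g for 10 min at 4 °C.
7. Resuspend the pellet in 100% formamide (at 4 °C). First, try a volume of liquid equal to that of the pellet and increase from there. Most of the RNA should dissolve immediately. To aid dissolution, leave at room temperature for 15 min, pipetting up and down every 5 min. If the sample needs to be very concentrated, store at 4 °C overnight.
8. Determine the concentration by diluting 1/100 in H2O and measuring the OD260/280 (an OD260 of 1 ≈ 40 μg/mL RNA). Remember to add formamide at a 1/100 dilution to the blank.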
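
As a worked example of the calculation in step 8, the following shell sketch converts the OD260 reading of a diluted sample into the concentration of the undiluted stock, using the approximation that an OD260 of 1 corresponds to about 40 μg/mL RNA. The function names (rna_conc, od_ratio) and the example readings are illustrative assumptions and not part of the original protocol.

```shell
# Hypothetical helper: RNA concentration (ug/mL) of the undiluted stock.
# Usage: rna_conc <OD260 of the diluted sample> <dilution factor, e.g., 100>
rna_conc() {
    awk -v od="$1" -v dil="$2" 'BEGIN { printf "%.1f\n", od * 40 * dil }'
}

# Hypothetical helper: OD260/OD280 ratio from the two absorbance readings.
od_ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

# Made-up readings for a 1/100 dilution
rna_conc 0.25 100     # 0.25 * 40 * 100 = 1000.0 ug/mL
od_ratio 0.25 0.125   # 2.00
```

With these made-up readings, the undiluted sample would contain about 1 μg/μL RNA, which can then be used to plan the input amount for library generation.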
 
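B. RNA-seq library generation and sequencing
Prior to sample preparation and submission to a sequencing facility, it is highly recommended to discuss the goals of the project with trained personnel. For successful library generation, the amount of input RNA is critical and typically ranges from 1 μg to 10 ng per sample. While it is possible to generate RNA-seq libraries de novo (i.e., producing adapters, buffers, polymerases, etc. with your own material), it is highly recommended to use commercially available kits, which require minimal common laboratory resources and, most importantly, are more reliable in the hands of researchers unfamiliar with RNA-seq library generation.
  During RNA-seq library generation, platform-specific adapters are attached to the extracted RNA molecules; therefore, the library kit must be chosen according to the sequencing platform used. Here, the QuantSeq 3' mRNA-Seq Library Prep Kit FWD for Illumina (Lexogen) is used for the generation of single-end (i.e., fragments are sequenced from one end only) 50-bp reads, which is sufficient for RNA-seq expression analysis in yeast. For more complex eukaryotic genomes containing a larger number of introns, and when longer reads are required, consider paired-end library kits and sequencing (i.e., fragments are sequenced from both ends) after consultation with the sequencing facility staff.
1. Carefully follow the steps described in the manufacturer's manual to generate a cDNA library containing sequencer- and sample-specific adapters.
2. Check the quality of the generated library and measure the cDNA concentration.
3. Sequence the cDNA library. Here, an Illumina HiSeq 4000 sequencer was used.
4. Inspect the quality of the raw read data, typically supplied in FASTQ format (Figure 1). The most common checks involve the number of reads per sample (should be of the same order of magnitude for all sequenced samples, meaning that the files should be similar in size), the GC content (should match the overall GC content of the host organism), and the overall base quality. Multiple quality control tools exist; here, fastqc is used (see also Batut, 2021). It is highly recommended to perform quality control after each processing step to ensure the overall integrity of the data.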
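
The two simplest checks named in step 4, the number of reads and the GC content, can also be approximated directly on the command line before running a full quality control tool. The sketch below is a minimal illustration that assumes an uncompressed FASTQ file with exactly four lines per read; the function name fastq_stats and the two example reads are made up for this purpose, and the sketch does not replace fastqc.

```shell
# Hypothetical helper: print the read count and GC content (%) of a FASTQ file.
# Assumes four lines per read (sequences are not wrapped over several lines).
fastq_stats() {
    awk 'NR % 4 == 2 {
             n++                          # one sequence line per read
             total += length($0)
             gc += gsub(/[GCgc]/, "")     # gsub returns the number of G/C bases
         }
         END {
             if (total == 0) { print "reads=0 gc=NA"; exit }
             printf "reads=%d gc=%.1f\n", n, 100 * gc / total
         }' "$1"
}

# Tiny made-up example file containing two reads
tmp=$(mktemp)
printf '@read1\nGGCCAATT\n+\nIIIIIIII\n@read2\nATATGCGC\n+\nIIIIIIII\n' > "$tmp"
fastq_stats "$tmp"    # reads=2 gc=50.0
rm -f "$tmp"
```

Running the function on every sample and comparing the results gives a quick impression of whether the read numbers are of the same order of magnitude across samples.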
 
 
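Figure 1. Example of a sequencing read in FASTQ format from an Illumina platform. Line 1 contains the basic read information, line 2 contains the actual sequence, and line 4 contains the quality score of each base in Phred33 or Phred64 encoding.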
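
To illustrate how the quality line of a FASTQ record is read, the following sketch decodes a quality string under the assumption of Phred33 encoding, in which the quality score of a base equals the ASCII code of its character minus 33. The function name mean_phred33 and the example strings are hypothetical; files in Phred64 encoding would require an offset of 64 instead.

```shell
# Hypothetical helper: mean Phred score of one quality string,
# assuming Phred33 encoding (quality = ASCII code - 33).
mean_phred33() {
    printf '%s\n' "$1" | awk '
        BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }
        length($0) == 0 { next }
        {
            sum = 0
            for (j = 1; j <= length($0); j++) sum += ord[substr($0, j, 1)] - 33
            printf "%.1f\n", sum / length($0)
        }'
}

mean_phred33 'IIII'   # "I" is ASCII 73, i.e., Q40 for every base: 40.0
mean_phred33 '!I'     # "!" is Q0 and "I" is Q40: 20.0
```

A Phred score Q corresponds to a base-calling error probability of 10^(-Q/10), so Q40 means an expected error rate of 1 in 10,000.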
 
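C. Processing of raw sequence data
C1. Preparation for data processing
The following steps describe the setup of the computational workflow described in Step C2, as well as of the data analysis described in Procedure D. This workflow uses open-source programs available for the Linux operating system (and its derivatives). While it is possible to process sequence files on Mac or Windows operating systems, the reader is strongly advised to use Linux-based utilities because of their wide applicability, timely updates, and community-based troubleshooting.
  For most of the processing steps described in Step C2 and Procedure D, multiple tools exist; in particular, for obtaining genome assemblies and gene annotations (Step C2.1), creating index files (Step C2.2), adapter trimming (Step C2.3), read alignment (Step C2.4), and quality-based read filtering (Step C2.5). While it extends beyond the scope of this manuscript to describe all specific tools, alternatives to the programs used in this protocol are suggested.
  Some of the tools used here are available through so-called package managers, such as Bioconda or BiocManager, allowing the easy installation of up-to-date software versions and dependencies; hence, it is recommended to follow the installation order described here.
 
1. Install the Conda package manager (here, using the Miniconda installer) and then, using the dedicated Bioconda channel, R, bowtie2, and samtools.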
 
  curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
  sh Miniconda3-latest-Linux-x86_64.sh
  conda install -c bioconda R
  conda install -c bioconda bowtie2
  conda install -c bioconda samtools
 
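2. Install the BiocManager and DESeq2 packages from within R.
 
     # start an R session from the shell
     R
     # install BiocManager only if it is not yet available, then DESeq2
     if (!requireNamespace("BiocManager", quietly = TRUE))
          install.packages("BiocManager")
     BiocManager::install()
     BiocManager::install(c("DESeq2"))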
 
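C2. Processing of raw sequence read files
After passing the quality check, the sequence reads undergo preprocessing and, eventually, alignment to the genome of the host organism. First, the genome assembly, gene annotation, and genome index need to be prepared (Steps C2.1 and C2.2). Sequence reads contain adapter contamination, random primer sequences, and low-quality tails, which need to be removed (Step C2.3) before the filtered reads are aligned to the genome of the host organism (Step C2.4). Finally, the reads are filtered based on their quality scores (Step C2.5) and indexed for downstream analysis (Step C2.6).
 
1. Download the S. cerevisiae genome assembly and gene annotation. Here, the UCSC versions, sacCer3.fa and sacCer3.ensGene.gtf, respectively, are used (downloaded from http://hgdownload.cse.ucsc.edu/goldenPath/sacCer3/bigZips/). Besides the UCSC Genome Browser, other platforms allow the download of genome assemblies and gene annotations (e.g., NIH/NCBI or Ensembl). Since each platform maintains a slightly different naming convention, it is important that the genome assembly and the gene annotation are acquired from the same platform to avoid program errors during processing.
2. Create the index files based on the genome assembly ("sacCer3.fa" from Step C2.1). Here, the index output files are stored next to the assembly, using the name of the assembly file ("sacCer3.fa") as their prefix, which is the bwa default. This index will be used by the alignment program in Step C2.4; hence, the indexing step must be adapted to the chosen aligner, and an example for bwa is given below. Caution: bwa is not a splice-aware alignment tool. If splicing events need to be considered during the analysis, a splice-aware tool such as tophat needs to be used for creating the index as well as for the alignment (Step C2.4).
 
  bwa index sacCer3.fa
 
3. Remove random primer sequences, adapter contamination, and low-quality tails. Here, bbmap is used for the library kit described in Step B1, with the settings recommended by the manufacturer (see the command below, from www.lexogen.com/quantseq-data-analysis). bbmap is a fast, splice-aware global alignment tool for RNA and DNA sequencing reads. The script "bbduk.sh" is used together with the polyA-tail sequence (polyA.fa.gz) and the Illumina-specific adapter sequence information (truseq.fa.gz), both located in the installation folder bbmap/resources/.
Adapter sequences and low-quality reads can also be removed using different tools, such as "trimmomatic" (see the Software section). Independent of the software used, the parameters need to be adjusted according to the library kit and sequencing platform used. As a result of the trimming, the output file should be slightly smaller than the input file.
 
  bbmap/bbduk.sh in=sample1.fastq out=sample1_trimmed.fastq ref=bbmap/resources/polyA.fa.gz,bbmap/resources/truseq.fa.gz k=13 ktrim=r forcetrimleft=11 useshortkmers=t mink=5 qtrim=t trimq=10 minlength=20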
 
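4. Create the alignment of the preprocessed sequence reads from Step C2.3 using an alignment tool, such as tophat, bbmap, bowtie2, or bwa. Here, bwa is used. Depending on the sequence file size (i.e., the number of reads), the genome size, and the CPU used, this step may take from several minutes to several hours. On an Intel® Xeon® CPU E5-2699 v3 @ 2.3 GHz, the alignment took approximately 3 s per 100K reads.
The following command aligns the reads of the input file (sample1_trimmed.fastq) to the genome index from Step C2.2 (sacCer3.fa) and writes the result to "sample1_trimmed_aligned.sam."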
 
  bwa mem sacCer3.fa sample1_trimmed.fastq > sample1_trimmed_aligned.sam
 
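5. Filter the data based on mapping quality (MAPQ) using samtools. Here, all reads with a mapping quality score of less than 50 (i.e., only reads with a probability of correct mapping of at least 99.999% are kept) are removed from the mapped read file generated in Step C2.4. As a result of the quality filtering, the output file will be smaller than the input file. If no, or very few, reads remain, try filtering with a less stringent quality score (e.g., 20). If this restores the number of reads, downstream analysis is still possible, albeit less reliable.
 
  samtools view -bq 50 sample1_trimmed_aligned.sam > sample1_trimmed_aligned_mapq50.bam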
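
The relation between the MAPQ cutoff and the mapping probability quoted in step 5 follows from the definition MAPQ = -10 × log10(probability that the mapping position is wrong). The sketch below evaluates this relation and counts how many alignments in a SAM file would pass a given cutoff, which can help when choosing a less stringent threshold. The function names and the three example alignments are hypothetical; the count mirrors the samtools behavior of keeping alignments with a MAPQ equal to or above the cutoff.

```shell
# Hypothetical helper: probability (%) that a read is mapped correctly,
# given its MAPQ score (P_wrong = 10^(-MAPQ/10)).
mapq_to_percent() {
    awk -v q="$1" 'BEGIN { printf "%.3f\n", 100 * (1 - 10 ^ (-q / 10)) }'
}

# Hypothetical helper: count alignments in a SAM file with MAPQ >= cutoff
# (column 5 holds the MAPQ; header lines start with "@").
count_mapq_pass() {
    awk -F '\t' -v t="$2" '!/^@/ && $5 + 0 >= t + 0 { n++ } END { print n + 0 }' "$1"
}

mapq_to_percent 50    # 99.999
mapq_to_percent 20    # 99.000

# Made-up SAM content: one header line and three alignments (MAPQ 60, 20, 0)
sam=$(mktemp)
printf '@SQ\tSN:chrI\tLN:230218\n' > "$sam"
printf 'r1\t0\tchrI\t100\t60\t8M\t*\t0\t0\tGGCCAATT\tIIIIIIII\n' >> "$sam"
printf 'r2\t0\tchrI\t200\t20\t8M\t*\t0\t0\tATATGCGC\tIIIIIIII\n' >> "$sam"
printf 'r3\t0\tchrI\t300\t0\t8M\t*\t0\t0\tATATATAT\tIIIIIIII\n' >> "$sam"
count_mapq_pass "$sam" 50    # 1
count_mapq_pass "$sam" 20    # 2
rm -f "$sam"
```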
 
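6. Sort the filtered, aligned reads from Step C2.5 and create an index file with samtools. This creates an index file with the same name as the input file, with the additional ending ".bai."
In addition to the quality check of the output files using tools such as fastqc, the short reads can be visualized using the Integrative Genomics Viewer (IGV). Besides ensuring that the overall mapping of the reads is correct, tools like IGV allow the confirmation of expected genomic mutations or gene deletions (Figure 2).
 
  samtools sort sample1_trimmed_aligned_mapq50.bam -o sample1_trimmed_aligned_mapq50_sorted.bam
  samtools index sample1_trimmed_aligned_mapq50_sorted.bam
 
Figure 2. Snapshot of an IGV browser visualization. Here, reads (gray bars) mapped to the genomic region of Set2 (blue, YJL168C) in yeast are compared between the wild type (upper lane) and a ∆Set2 mutant (lower lane).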
 
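D. Data analysis, calculation of expression values, and visualization of results
The sorted, indexed files prepared in Procedure C contain all reads that were successfully aligned to the host genome (here, S. cerevisiae), generally numbering from several hundred thousand to millions of reads per replicate. Such a vast amount of information is a major obstacle for researchers to analyze. Differential gene expression (DGE) analysis aims to determine which genes, if any, show a higher or lower number of aligned reads under the tested conditions. For this, the reads belonging to a feature (i.e., a gene) are summed for each replicate, and differential expression values are calculated across conditions, taking into account the variance between replicates within a condition; hence, it is crucial for DGE that several replicates of the same condition are considered (typically, n = 3). Gene expression values are commonly reported as log2 fold changes, in combination with adjusted p-values describing the significance of the change (cutoffs vary, but typically, p-values < 0.05 are considered reliable).
  The number of aligned reads can differ strongly between replicates for technical reasons (e.g., fluctuations in the amount of input RNA, temperature variations in the thermocycler during library amplification, or differences in the binding capacity of the flow-cell lanes); hence, reads must be normalized across replicates and conditions. Several normalization methods exist for the calculation of DGE values, such as reads per kilobase of transcript per million mapped reads (RPKM), fragments per kilobase of transcript per million mapped reads (FPKM), transcripts per million (TPM), or counts per feature (i.e., gene) (Dillies et al., 2013).
  Here, the count-based normalization of DESeq2 is used, which is based on the assumption that most genes are not differentially expressed across conditions; therefore, the counts per feature are extracted from each file generated in Step C2.6 using htseq (Step D1), combined, and indexed (Steps D2 and D3). Finally, the counts are normalized, and DGE values are calculated using DESeq2 (Steps D4 and D5). The results are visualized using an MA plot, in which the log fold change is plotted against the mean expression value (Step D5).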
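
For orientation only, the length-normalized measures named above can be written out explicitly: RPKM divides the read count of a gene by its length in kilobases and by the total number of mapped reads in millions, whereas TPM first divides each count by the gene length and then scales these rates so that they sum to one million. The sketch below computes both for a small made-up table; the function name rpkm_tpm, the three-column input layout (gene, length in bp, raw count), and the numbers are assumptions for illustration. This is not the DESeq2 normalization used in this protocol.

```shell
# Hypothetical helper: RPKM and TPM from a tab-delimited table with the columns
# gene name, gene length (bp, must be > 0), and raw read count.
rpkm_tpm() {
    awk -F '\t' '
        {
            name[NR] = $1; len[NR] = $2; cnt[NR] = $3
            total += $3               # total number of counted reads
            rate_sum += $3 / $2       # sum of reads per base over all genes
        }
        END {
            for (i = 1; i <= NR; i++) {
                rpkm = cnt[i] * 1e9 / (len[i] * total)
                tpm  = (cnt[i] / len[i]) / rate_sum * 1e6
                printf "%s\t%.1f\t%.1f\n", name[i], rpkm, tpm
            }
        }' "$1"
}

# Made-up example: three genes of different lengths
tab=$(mktemp)
printf 'geneA\t1000\t500\ngeneB\t2000\t500\ngeneC\t4000\t1000\n' > "$tab"
rpkm_tpm "$tab"
# geneA   250000.0   500000.0
# geneB   125000.0   250000.0
# geneC   125000.0   250000.0
rm -f "$tab"
```

In this example, geneB and geneC receive the same normalized value although geneC has twice the raw count, because geneC is also twice as long.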
 
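1. Extract the counts for each sample using htseq-count. Here, the aligned, filtered, and sorted reads (e.g., sample1_trimmed_aligned_mapq50_sorted.bam) from Step C2.6 and the gene annotation file (sacCer3.ensGene.gtf) from Step C2.1 are used. This command generates a .txt file containing the number of reads assigned to each gene annotated in the gtf file.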
 
  htseq-count -f bam sample1_trimmed_aligned_mapq50_sorted.bam sacCer3.ensGene.gtf > sample1_trimmed_aligned_mapq50_sorted_counts.txt
 
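2. Calculate count-based expression values using R and DESeq2; this requires the count data to be assembled in a text file (here, "counts.txt"), as well as an index file (here, "table.txt," Step D3).
Generate a "counts.txt" file containing the counts generated in Step D1 for each replicate of a given sample (here, MUT_X) and of the reference sample (here, WT_X) as a tab-delimited txt column file (Figure 3). As a quality check, it is recommended to inspect several rows (i.e., genes) for consistency (i.e., similar read counts between the replicates of a given condition). Importantly, the read counts are not yet normalized to the total number of read counts in each sample, and corresponding variations are to be expected.
 
 
Figure 3. Example of a counts.txt file in tab-delimited format. The first column denotes the names of the open reading frames (ORFs), and the first row denotes the names of the wild-type and mutant replicates. The numerical matrix contains the number of reads mapped to the respective ORF for each replicate.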
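
One possible way of assembling counts.txt from the per-sample files of Step D1 is sketched below. It assumes the two-column output of htseq-count (gene identifier and count, followed by summary rows whose names start with "__") and that all files list the genes of the same annotation. The function name merge_counts, the sample names, and the example counts are hypothetical; the header name "region_name" is chosen to match the row.names argument used when the file is read into R in Step D4.

```shell
# Hypothetical helper: merge htseq-count outputs into one tab-delimited table.
# Usage: merge_counts "NAME1,NAME2,..." FILE1 FILE2 ...
merge_counts() {
    names=$1; shift
    awk -F '\t' -v names="$names" '
        BEGIN {
            n = split(names, col, ",")
            printf "region_name"
            for (i = 1; i <= n; i++) printf "\t%s", col[i]
            printf "\n"
        }
        FNR == 1 { f++ }                    # f = index of the current file
        /^__/    { next }                   # skip htseq-count summary rows
        f == 1   { order[++g] = $1 }        # remember gene order of the first file
        { count[$1, f] = $2 }
        END {
            for (i = 1; i <= g; i++) {
                printf "%s", order[i]
                for (j = 1; j <= n; j++) printf "\t%d", count[order[i], j]
                printf "\n"
            }
        }' "$@"
}

# Made-up example with two samples and two genes
dir=$(mktemp -d)
printf 'YAL001C\t10\nYAL002W\t0\n__no_feature\t5\n' > "$dir/wt1.txt"
printf 'YAL001C\t20\nYAL002W\t3\n__no_feature\t7\n' > "$dir/mut1.txt"
merge_counts "WT_1,MUT_1" "$dir/wt1.txt" "$dir/mut1.txt"
# region_name   WT_1   MUT_1
# YAL001C       10     20
# YAL002W       0      3
rm -rf "$dir"
```

In practice, the output would be redirected to counts.txt, with one name and one file per replicate listed in the same order.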
 
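3. Generate a "table.txt" file that indexes each column of the count data (i.e., each sample) in tab-delimited format (Figure 4).
 
Figure 4. Example of a table.txt file in tab-delimited format. The replicate names, as specified in counts.txt from Step D2, are indexed by their common condition (e.g., wild type or mutant).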
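
If the replicate names follow a regular pattern, table.txt can be derived from the header of counts.txt instead of being typed by hand. The sketch below assumes names of the form CONDITION_NUMBER (e.g., WT_1, MUT_2) and takes the condition to be the name without its trailing number; the function name make_sample_table is hypothetical, and the column names "sample_name" and "condition" are chosen to match the R commands in Steps D4 and D5.

```shell
# Hypothetical helper: derive table.txt from the header of counts.txt,
# taking the condition to be the sample name without its trailing "_<number>".
make_sample_table() {
    head -n 1 "$1" | awk -F '\t' '
        {
            print "sample_name\tcondition"
            for (i = 2; i <= NF; i++) {      # column 1 holds the gene names
                cond = $i
                sub(/_[0-9]+$/, "", cond)
                print $i "\t" cond
            }
        }'
}

# Made-up counts.txt with two wild-type replicates and one mutant replicate
cnt=$(mktemp)
printf 'region_name\tWT_1\tWT_2\tMUT_1\nYAL001C\t10\t12\t20\n' > "$cnt"
make_sample_table "$cnt"
# sample_name   condition
# WT_1          WT
# WT_2          WT
# MUT_1         MUT
rm -f "$cnt"
```

Because the rows are written in the order of the columns of counts.txt, the sample order of the two files agrees, which DESeq2 expects when the count matrix and the sample table are combined.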
 
4. In R, load the DESeq2 library, the combined counts file from Step D2, and the table file from Step D3.
 
  library(DESeq2)
  count_table <- read.delim('counts.txt',sep='\t',header=TRUE,row.names='region_name')
  sample_table <- read.delim('table.txt',sep='\t',header=TRUE,row.names='sample_name')
 
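5. Write the RNA-seq expression values and p-values to a file using DESeq2. The generated .txt file (wt_mutant_p-values.txt) contains the log2 fold expression changes and p-values of the mutant in tab-delimited format and can now be used for further analysis or visualization. For data inspection, an MA plot is generated (Figure 5). In the MA plot generated by DESeq2, significant hits are indicated in red; thus, a first quality check is how many data points are colored in black (i.e., since most genes are not differentially expressed, most data points should be colored in black).
 
dds <- DESeqDataSetFromMatrix(countData = count_table,colData = sample_table,design = ~ condition)
dds <- DESeq(dds)
res <- results(dds)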
resOrdered <- res[order(res$padj),]
plot <- plotMA(res, main = 'mutant', ylim = c(-2,2), xlab = 'mean count')
write.table(as.data.frame(resOrdered),sep='\t',quote=FALSE,file='wt_mutant_p-values.txt')
 
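Figure 5. Example of an MA plot analysis generated by DESeq2. Statistically significantly up- or downregulated genes are marked in red above and below the x-axis, respectively.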
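
As one example of further analysis of wt_mutant_p-values.txt, the genes passing an adjusted p-value cutoff can be listed on the command line. The sketch assumes the file layout produced by the write.table call above, in which the header row has one field fewer than the data rows (the gene names in the first column have no header entry), so that the log2 fold change is in column 3 and the adjusted p-value in column 7. The function name significant_genes, the gene names, and all numbers in the example are made up.

```shell
# Hypothetical helper: list genes with an adjusted p-value below a cutoff.
# Usage: significant_genes FILE [CUTOFF]   (default cutoff: 0.05)
significant_genes() {
    awk -F '\t' -v cut="${2:-0.05}" '
        NR == 1 { next }                     # header row (has no row-name field)
        $7 != "NA" && $7 + 0 < cut + 0 {     # skip genes without adjusted p-value
            dir = ($3 + 0 > 0) ? "up" : "down"
            print $1 "\t" $3 "\t" $7 "\t" dir
        }' "$1"
}

# Made-up results table in the layout written by write.table
res=$(mktemp)
printf 'baseMean\tlog2FoldChange\tlfcSE\tstat\tpvalue\tpadj\n' > "$res"
printf 'geneA\t850.2\t-4.1\t0.3\t-13.7\t1e-42\t5e-39\n' >> "$res"
printf 'geneB\t120.5\t0.1\t0.2\t0.5\t0.61\t0.92\n' >> "$res"
printf 'geneC\t3.1\t1.2\t0.9\t1.3\t0.19\tNA\n' >> "$res"
significant_genes "$res"    # geneA   -4.1   5e-39   down
rm -f "$res"
```

Genes with an adjusted p-value of NA are skipped, since no adjusted p-value is available for them.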
 
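Recipes
 
1. Yeast extract peptone dextrose (YEPD) medium
For each liter of YEPD, autoclave a mixture of 20 g Bacto Peptone, 10 g yeast extract, and 950 mL H2O. Add 50 mL 40% (w/v) glucose, mix, and cool before use.
For YEPD plates, add 24 g agar to the solution before autoclaving. Place the autoclaved solution on a magnetic stir plate, add a stir bar and 50 mL 40% (w/v) glucose, and cool the solution while stirring. Pour the warm medium into Petri dishes, allow to cool until solid, and store at 4 °C until use.
2. 1 M Tris-HCl solution, pH 7.5
Dissolve 121.14 g Tris in 800 mL H2O.
Adjust the pH to 7.5 with HCl.
Bring the final volume to 1 L with deionized H2O.
Autoclave and store at room temperature.
3. 0.5 M EDTA solution, pH 8.0
Add 18.6 g EDTA to 80 mL H2O (use DEPC-treated H2O).
Mix on a magnetic stirrer until dissolved.
Adjust the pH to 8.0 with NaOH (~2 g NaOH pellets).
Dispense into aliquots and autoclave.
4. 20% SDS solution
Dissolve 20 g SDS in 90 mL H2O (use DEPC-treated H2O).
Heat to 68 °C and mix with a magnetic stirrer until dissolved.
5. Tris-EDTA-SDS (TES) solution
10 mM Tris-HCl pH 7.5
10 mM EDTA pH 8.0
0.5% SDS
6. 3 M NaAc solution, pH 5.2
Add 24.6 g sodium acetate to 80 mL H2O.
Mix on a magnetic stirrer until dissolved.
Adjust the pH to 5.2 with glacial acetic acid.
Bring the volume to 100 mL with H2O.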
 
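Acknowledgments
 
I would like to thank Dr. Pavel Sinitcyn, Dr. Assa Yeroslaviz from the Next-Generation Sequencing core facility, and Dr. 凛镐 at the MPI of Biochemistry for critical reading of the manuscript.
This protocol is based on the RNA-seq expression analysis performed in Braberg et al. (2020).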
 
References
 
1. Andrews, S. (2010). FastQC: a quality control tool for high throughput sequence data.
2. Anders, S., Pyl, P. T. and Huber, W. (2014). HTSeq – A Python framework to work with high-throughput sequencing data. bioRxiv. doi: https://doi.org/10.1101/002824.
3. Bainbridge, M. N., Warren, R. L., Hirst, M., Romanuik, T., Zeng, T., Go, A., Delaney, A., Griffith, M., Hickenbotham, M., Magrini, V., Mardis, E. R., Sadar, M. D., Siddiqui, A. S., Marra, M. A. and Jones, S. J. (2006). Analysis of the prostate cancer cell line LNCaP transcriptome using a sequencing-by-synthesis approach. BMC Genomics 7: 246.
4. Batut, B. (2021). Quality Control (Galaxy Training Materials).
5. Bolger, A. M., Lohse, M. and Usadel, B. (2014). Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30(15): 2114-2120.
6. Braberg, H., Echeverria, I., Bohn, S., Cimermancic, P., Shiver, A., Alexander, R., Xu, J., Shales, M., Dronamraju, R., Jiang, S., Dwivedi, G., Bogdanoff, D., Chaung, K. K., Huttenhain, R., Wang, S., Mavor, D., Pellarin, R., Schneidman, D., Bader, J. S., Fraser, J. S., Morris, J., Haber, J. E., Strahl, B. D., Gross, C. A., Dai, J., Boeke, J. D., Sali, A. and Krogan, N. J. (2020). Genetic interaction mapping informs integrative structure determination of protein complexes. Science 370(6522).
7. Bushnell, B. bbmap, version 38.90. sourceforge.net/projects/bbmap.
8. Cheung, F., Haas, B. J., Goldberg, S. M., May, G. D., Xiao, Y. and Town, C. D. (2006). Sequencing Medicago truncatula expressed sequenced tags using 454 Life Sciences technology. BMC Genomics 7: 272.
9. Collart, M. A. and Oliviero, S. (2001). Preparation of yeast RNA. Curr Protoc Mol Biol Chapter 13: Unit 13.12.
10. Dillies, M. A., Rau, A., Aubert, J., Hennequet-Antier, C., Jeanmougin, M., Servant, N., Keime, C., Marot, G., Castel, D., Estelle, J., Guernec, G., Jagla, B., Jouneau, L., Laloe, D., Le Gall, C., Schaeffer, B., Le Crom, S., Guedj, M., Jaffrezic, F. and French StatOmique Consortium. (2013). A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Brief Bioinform 14(6): 671-683.
11. Grüning, B., Dale, R., Sjodin, A., Chapman, B. A., Rowe, J., Tomkins-Tinch, C. H., Valieris, R., Koster, J. and Bioconda Team. (2018). Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods 15(7): 475-476.
12. Emrich, S. J., Barbazuk, W. B., Li, L. and Schnable, P. S. (2007). Gene discovery and annotation using LCM-454 transcriptome sequencing. Genome Res 17(1): 69-73.
13. Haque, A., Engel, J., Teichmann, S. A. and Lonnberg, T. (2017). A practical guide to single-cell RNA-sequencing for biomedical research and clinical applications. Genome Med 9(1): 75.
14. Langmead, B. and Salzberg, S. L. (2012). Fast gapped-read alignment with Bowtie 2. Nat Methods 9(4): 357-359.
15. Langmead, B., Trapnell, C., Pop, M. and Salzberg, S. L. (2009). Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10(3): R25.
16. Li, H. and Durbin, R. (2009). Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25(14): 1754-1760.
17. Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., Durbin, R. and 1000 Genome Project Data Processing Subgroup. (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics 25(16): 2078-2079.
18. Logsdon, G. A., Vollger, M. R. and Eichler, E. E. (2020). Long-read human genome sequencing and its applications. Nat Rev Genet 21(10): 597-614.
19. Love, M. I., Huber, W. and Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol 15(12): 550.
20. R Core Team. (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.
21. Robinson, J. T., Thorvaldsdottir, H., Winckler, W., Guttman, M., Lander, E. S., Getz, G. and Mesirov, J. P. (2011). Integrative genomics viewer. Nat Biotechnol 29(1): 24-26.
22. Nagalakshmi, U., Wang, Z., Waern, K., Shou, C., Raha, D., Gerstein, M. and Snyder, M. (2008). The transcriptional landscape of the yeast genome defined by RNA sequencing. Science 320(5881): 1344-1349.
23. Weber, A. P., Weber, K. L., Carr, K., Wilkerson, C. and Ohlrogge, J. B. (2007). Sampling the Arabidopsis transcriptome with massively parallel pyrosequencing. Plant Physiol 144(1): 32-42.
 
Copyright: © 2021 The Authors; exclusive licensee Bio-protocol LLC.
Citation: Readers should cite both the Bio-protocol article and the original research article where this protocol was used:
  1. Bohn, S. (2021). Protocol for RNA-seq Expression Analysis in Yeast. Bio-protocol 11(18): e4161. DOI: 10.21769/BioProtoc.4161.
  2. Braberg, H., Echeverria, I., Bohn, S., Cimermancic, P., Shiver, A., Alexander, R., Xu, J., Shales, M., Dronamraju, R., Jiang, S., Dwivedi, G., Bogdanoff, D., Chaung, K. K., Huttenhain, R., Wang, S., Mavor, D., Pellarin, R., Schneidman, D., Bader, J. S., Fraser, J. S., Morris, J., Haber, J. E., Strahl, B. D., Gross, C. A., Dai, J., Boeke, J. D., Sali, A. and Krogan, N. J. (2020). Genetic interaction mapping informs integrative structure determination of protein complexes. Science 370(6522).