参见作者原研究论文

本实验方案简略版
Sep 2019

本文章节


 

Detecting Differentially Methylated Promoters in Genes Related to Disease Phenotypes Using R
利用R检测与疾病表型相关基因的差异甲基化启动子   

引用 收藏 提问与回复 分享您的反馈 Cited by

Abstract

DNA methylation in gene promoters plays a major role in gene expression regulation, and alterations in methylation patterns have been associated with several diseases. In this context, different software suites and statistical methods have been proposed to analyze differentially methylated positions and regions. Among them, the novel statistical method implemented in the mCSEA R package proposed a new framework to detect subtle, but consistent, methylation differences. Here, we provide an easy-to-use pipeline covering all the necessary steps to detect differentially methylated promoters with mCSEA from Illumina 450K and EPIC methylation BeadChips data. This protocol covers the download of data from public repositories, quality control, data filtering and normalization, estimation of cell type proportions, and statistical analysis. In addition, we show the procedure to compare disease vs. normal phenotypes, obtaining differentially methylated regions including promoters or CpG Islands. The entire protocol is based on R programming language, which can be used in any operating system and does not require advanced programming skills.

Keywords: Methylation (甲基化作用), Differentially methylated region (差异甲基化区域), Illumina 450K (Illumina 450K), Epigenome-wide association analysis (表观基因组范围的关联分析), Promoter (启动子), Gene set enrichment analysis (基因集合富集分析)

Background

DNA methylation plays an important role in many cellular processes and is currently being widely studied to gain a better understanding of human development and disease (Robertson, 2005). Most epigenome-wide association studies (EWAS) search for associations between DNA methylation and disease (Flanagan, 2015). For this aim, Illumina’s BeadChip arrays are widely used to measure DNA methylation in humans. Methylation in promoters is associated with gene expression repression (Boyes and Bird, 1992). The mechanisms of expression repression include impeding the binding of transcription factors and recruiting transcription repressors (Cedar and Bergman, 2012). Aberrant DNA methylation in these regions has been linked to several diseases, including cancer (Ehrlich and Lacey, 2013) and autoimmune disorders (Dozmorov et al., 2014). There are several R packages designed to detect differentially methylated regions that apply de novo and predefined strategies, as previously reviewed (Martorell-Marugán et al., 2019). Most of the predefined methods can be applied directly to detect differentially methylated promoters. On the contrary, de novo methods search for differentially methylated regions along the entire genome that should be annotated in order to detect which regions are located at promoters.


In this protocol, we present the complete data and statistical analysis pipeline to detect differentially methylated promoters in disease phenotypes from Illumina BeadChip data based on the mCSEA R package (Martorell-Marugán et al., 2019), which applies a predefined regions strategy. We used previously published data from patients with two rare neurodevelopmental diseases: Williams syndrome (WS) and 7q11.23 duplication syndrome (Dup7), as well as typically developing (TD) patients. Methylation in blood cells was measured in all the samples by the authors of the original study (Strong et al., 2015). This pipeline can be easily adapted to study other genomic regions such as gene body specific methylation patterns or CpG islands (CGIs). The complete code for this protocol is available as Supplemental_script file.

Equipment

  1. Personal computer with Windows, MacOS, or Unix-based operating system

Software

  1. R software environment 4.0.2. (https://www.r-project.org/)

  2. RStudio integrated development environment 1.3.1056 (not required, but strongly recommended, https://rstudio.com/)

  3. GEOquery R package 2.56.0 (Davis and Meltzer, 2007), https://www.bioconductor.org/packages/release/bioc/html/GEOquery.html

  4. Minfi R package 1.34.0 (Aryee et al., 2014), https://www.bioconductor.org/packages/release/bioc/html/minfi.html

  5. M3C R package 1.10.0 (John et al., 2020), https://www.bioconductor.org/packages/release/bioc/html/M3C.html

  6. ggfortify R package 0.4.11 (Yuan et al., 2016), https://cran.r-project.org/web/packages/ggfortify/index.html

  7. DMRcate R package 2.2.3 (Peters et al., 2015), https://www.bioconductor.org/packages/release/bioc/html/DMRcate.html

  8. wateRmelon R package 1.32.0 (Pidsley et al., 2013), https://www.bioconductor.org/packages/release/bioc/html/wateRmelon.html

  9. FlowSorted.Blood.450k R package 1.26.0, http://bioconductor.org/packages/release/data/experiment/html/FlowSorted.Blood.450k.html

  10. mCSEA R package 1.8.0 (Martorell-Marugán et al., 2019), http://bioconductor.org/packages/release/bioc/html/mCSEA.html


Datasets
  1. Methylation data generated previously (Strong et al., 2015). GEO identifier: GSE66552 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE66552)

Procedure

  1. Install the necessary R packages

    1. The required packages to run this pipeline can be installed from CRAN and from Bioconductor repositories using the following commands in an R terminal:


      install.packages("ggfortify")

      if (!requireNamespace("BiocManager", quietly = TRUE))

          install.packages("BiocManager")

      BiocManager::install(c("GEOquery", "minfi", "M3C", "DMRcate", "watermelon",

                         "FlowSorted.Blood.450k", "mCSEA"))


  2. Download raw data

    1. Figure 1 shows a diagram of the data analysis workflow. Illumina’s BeadChip arrays raw data are idat files containing the fluorescence intensities at each microarray spot. These idat files can be downloaded from the public Gene Expression Repository GEO (Edgar et al., 2002) using the GEOquery package (Davis and Meltzer, 2007). In this protocol, we are going to analyze data reported by Strong et al. (2015) that can be retrieved using the GEO identifier GSE66552. In an R session, execute the following code:


      library(GEOquery)

      getGEOSuppFiles("GSE66552")


    2. When the download finishes, there will be a folder in the current directory with the compressed data. Use this code to uncompress the data into the IDAT folder:


      untar ("GSE66552/GSE66552_RAW.tar", exdir = "IDAT")


    3. Finally, save the metadata of the experiment into a variable called pheno:


      GEOData <- getGEO("GSE66552")

      pheno <- pData(phenoData(GEOData[[1]]))



      Figure 1. Data analysis workflow. Black boxes contain the inputs, outputs, and intermediate files. Blue boxes contain the steps described in this protocol.


  3. Read idat files

    1. Minfi package (Aryee et al., 2014) can be used to read the idat files and manipulate the methylation data. For this aim, a target dataframe should be created from the experiment metadata. Table 1 contains the first 5 rows of this dataset to show the structure that this object should have.


      library(minfi)

      basenames_split <- strsplit(pheno$supplementary_file, "suppl/")

      basenames <- sapply(basenames_split, function(x) {return(x[2])})

      targets <- data.frame("Sample_Name" = pheno$title, "Basename" = basenames)


      Table 1. First 5 rows of the target dataframe

      Sample_Name Basename
      TD control (Sample_1) GSM1624830_9376525009_R03C02_Grn.idat.gz
      TD control (Sample_2) GSM1624831_9376525033_R01C01_Grn.idat.gz
      WS (Sample_16) GSM1624832_8795207093_R01C01_Grn.idat.gz
      WS (Sample_17) GSM1624833_8795207093_R03C01_Grn.idat.gz
      WS (Sample_18) GSM1624834_8795207093_R04C01_Grn.idat.gz


    2. Read the data and store it in a RGChannelSet object that will be used for later analyses. RGChannelSet is the class used by minfi to store the raw methylation data. The sample names in this object may also be specified in this step.


      rgchannelset <- read.metharray.exp(targets = targets, base = "IDAT")

      sampleNames(rgchannelset) <- targets$Sample_Name


  4. Perform quality control (QC)

    1. To check that there are no technical problems affecting any samples, QC steps should be performed. Minfi package includes a function to explore the median intensities of unmethylated and methylated channels of each sample. Low intensities may indicate problems with a sample. Figure 2A shows the output plot of this QC. In this case, there are no samples failing this QC. If any sample fails this QC, it should be discarded from the analyses.


      msetraw <- preprocessRaw(rgchannelset)

      qc <- getQC(msetraw)

      plotQC(qc)


    2. Principal Components Analysis (PCA) is also a recommended QC step to detect potential outliers. To this end, M3C (John et al., 2020) and ggfortify (Yuan et al., 2016) R packages can be used. Figure 2B shows the PCA plot for our example. As can be observed, there are no outlier samples and the groups are partially separated. The observable mixing between groups may be due to covariates such as gender, age, or ancestry of the patients.


      library(M3C)

      library(ggfortify)

      betas.raw <- getBeta(msetraw)

      phenoPCA <- data.frame("Group" = pheno$`group:ch1`, stringsAsFactors = F)

      pcaResult <- prcomp(t(na.omit(betas.raw)), scale. = TRUE)

      autoplot(pcaResult, data = phenoPCA, colour = "Group")



      Figure 2. Quality control of methylation data. (A) Minfi QC with the median unmethylated and methylated signal intensities of each sample. (B) PCA plot representing the first two principal components. Samples are colored by experimental group.


  5. Preprocess data

    1. To correct for technical biases and improve the data quality, it is essential to filter some probes and apply normalization methods. Minfi preprocessNoob function applies Noob (normal-exponential out-of-band), a background correction with dye-bias normalization (Triche et al., 2013). The output of this normalization is a MethylSet object containing the methylated and unmethylated signals for each CpG probe. This object can be transformed to a RatioSet class where β-values are stored instead of signals. Finally, a GenomicRatioSet object containing the β-values mapped to the genome should be created. The GenomicRatioSet is the necessary class for further analyses.


      mset <- preprocessNoob(rgchannelset)

      rset <- ratioConvert(mset)

      grset <- mapToGenome(rset)


    2. Subsequently, a detection P-value can be obtained for each probe and sample. This P-value is calculated by comparing the probe signal to the background signal measured using negative control probes. A detection P-value >0.01 indicates an untrustworthy signal. A filter step can be applied to remove those probes with a detection P-value >0.01 in a percentage of samples (in this case, at least 10% of samples).


      detectionPvalues <- detectionP(rgchannelset)

      detectionPvalues <- detectionPvalues[rownames(grset),]

      propGood <- apply(detectionPvalues, 1, function(x) {

      length(x[which(x<0.01)])/ncol(detectionPvals)

      })

      grset <- grset[which(propGood >= 0.9),]


    3. It is recommended to remove those probes that are located in sex chromosomes (only if both male and female samples are included) (Maksimovic et al., 2017), probes containing single nucleotide polymorphisms (SNPs) (Naeem et al., 2014), and cross-reactive probes (Chen et al., 2013). DMRCate package (Peters et al., 2015) contains the convenient function rmSNPandCH, which can be used to filter all these probes in one step:


      library(DMRcate)

      beta.noob <- getBeta(grset)

      beta.noob.filtered <- rmSNPandCH(beta.noob, rmXY = TRUE)


    4. Finally, Beta-Mixture Quantile (BMIQ) normalization may be applied to mitigate the systematic differences between type I and type II probe designs present in 450K and EPIC arrays (Teschendorff et al., 2013). For this purpose, the wateRmelon package (Pidsley et al., 2013) may be used:


      library(wateRmelon)

      annot <- getAnnotation(rgchannelset)

      probeType <- as.data.frame(annot[rownames(beta.noob.filtered),c("Name","Type")])

      probeType <- ifelse(probeType$Type %in% "I",1,2)

      beta.normalized <- apply(beta.noob.filtered, 2, function(x) {

      wateRmelon::BMIQ(x, design.v = probeType)$nbeta

      })


  6. Estimate cell type heterogeneity

    1. One of the major sources of variability in blood methylation data is the different cell type proportions across samples (McGregor et al., 2016). In this context, reference data from the FlowSorted.Blood.450k package can be used to estimate these proportions in each sample. In the case of data generated with the EPIC platform, FlowSorted.Blood.EPIC package should be used instead.


      library(FlowSorted.Blood.450k)

      cellCounts <- estimateCellCounts(rgchannelset)


    2. The cellCounts object is a matrix that can be merged with the experimental metadata to consider these estimations as covariates in further analyses:


      pheno = merge(pheno, cellCounts, by.x = "title", by.y = "row.names")

      rownames(pheno) <- pheno$title


  7. Identify differentially methylated promoters

    1. One of the most common methylation analyses is to detect regions that show different methylation levels across phenotypes, known as Differentially Methylated Regions (DMRs). To this end, different statistical procedures have been developed, most of which are based on well-established methodologies such as linear models (Ritchie et al., 2015). In this protocol, we use the mCSEA package (Martorell-Marugán et al., 2019) to perform a DMRs analysis focusing on promoter regions. The first step is to rank CpG probes according to their differential methylation across conditions using linear models. In these models, available covariates (e.g., the estimated cell type proportions) can be included to adjust their possible effect on the data. Here, we are first analyzing the WS vs. TD control comparison:


      library(mCSEA)

      phenoComp <- pheno[,c("group:ch1","CD8T", "CD4T", "NK", "Bcell", "Mono", "Gran")]

      rankedCpGs1 <- rankProbes(beta.normalized, phenoComp,

                     refGroup = "TD control", caseGroup = "WS",

                       covariates = c("CD8T", "CD4T", "NK", "Bcell", "Mono", "Gran"),

                       continuous = c("CD8T", "CD4T", "NK", "Bcell", "Mono", "Gran"))


    2. The following step is used to calculate the differentially methylated promoters. The output object is a dataframe with gene symbols as row names, estimated P-values in the pval column, and P-values adjusted by the Benjamini-Hochberg method (Benjamini and Hochberg, 1995) in the padj column. In addition, given that the mCSEA algorithm is based on Gene Set Enrichment Analysis (GSEA) (Subramanian et al., 2005), the Enrichment Score (ES) is provided for each gene. The sign of this score indicates whether the promoter is hypermethylated (positive sign) or hypomethylated (negative sign) in the case group as compared with the controls. Finally, the Normalized ES (NES) and the amount of CpGs in each promoter are reported in the NES and size columns, respectively. Table 2 shows the first 20 rows of results_promoters1 object. In the case of EPIC data, the parameter platform = “EPIC” should be added to the mCSEATest function. Furthermore, the regionsTypes parameter can be modified to “genes” or “CGI” to analyze the gene bodies and CGIs, respectively.


      set.seed(123)

      results_mCSEA1 <- mCSEATest(rankedCpGs1, beta.normalized,

      phenoComp, regionsTypes = "promoters")

      results_promoters1 <- results_mCSEA1$promoters[order(results_mCSEA1$promoters$padj),

                                                    -c(3,7)]


      Table 2. Top 20 differentially methylated promoters in WS vs. TD control comparison. Promoters are sorted by their adjusted P-value.

      Gene P-value Adjusted P-value ES NES Size
      MYEOV2 1.00E-10 1.09E-07 -0.89 -3.18 26
      KIAA1949 1.00E-10 1.09E-07 -0.65 -2.79 69
      DDR1 1.02E-10 1.09E-07 -0.65 -2.69 56
      FGF20 1.00E-10 1.09E-07 -0.97 -2.59 10
      LOC650226 1.00E-10 1.09E-07 0.99 2.13 7
      FLJ37307 1.00E-10 1.09E-07 0.99 2.14 7
      HCN1 1.00E-10 1.09E-07 0.99 2.25 8
      PRSS50 1.00E-10 1.09E-07 1.00 2.26 8
      CRISP2 1.00E-10 1.09E-07 0.99 2.31 9
      PRDM9 1.00E-10 1.09E-07 1.00 2.33 9
      RBPJL 1.00E-10 1.09E-07 0.98 2.38 10
      AURKC 1.00E-10 1.09E-07 0.95 2.39 11
      PCDHB15 1.00E-10 1.09E-07 0.91 2.46 15
      PLD6 1.00E-10 1.09E-07 0.99 2.63 14
      HOXA4 1.00E-10 1.09E-07 0.87 2.67 24
      DNAH6 1.25E-10 1.25E-07 0.94 2.43 12
      ANKRD30B 3.07E-10 2.89E-07 0.91 2.43 14
      HORMAD2 3.31E-10 2.94E-07 -0.92 -2.63 13
      HOXB6 3.81E-10 3.21E-07 -0.76 -2.77 29
      LYPD5 4.39E-10 3.52E-07 0.86 2.47 18


    3. Perform the same analysis for Dup7 vs. TD control comparison:


      rankedCpGs2 <- rankProbes(beta.normalized, phenoComp,

               refGroup = "TD control", caseGroup = "Dup7",

               covariates = c("CD8T", "CD4T", "NK", "Bcell", "Mono", "Gran"),

               continuous = c("CD8T", "CD4T", "NK", "Bcell", "Mono", "Gran"))

      set.seed(123)

      results_mCSEA2 <- mCSEATest(rankedCpGs2, beta.normalized,

      phenoComp, regionsTypes = "promoters")

      results_promoters2 <- results_mCSEA2$promoters[order(results_mCSEA2$promoters$padj),

                                                    -c(3,7)]


    4. Furthermore, methylation values can be represented for any promoter in their genomic context. Figure 3A shows the result of this plot for the CRISP2 gene, whose promoter is hypermethylated in WS patients and hypomethylated in Dup7 patients. To construct these plots, use the mCSEAPlot function:


      mCSEAPlot(results_mCSEA1, "promoters", "CRISP2", makePDF = F, leadingEdge = F)


    5. Finally, a GSEA plot can also be generated for each promoter. In this plot, all the analyzed CpGs are ordered by their differential methylation on the x axis, and vertical lines mark the CpGs belonging to the represented promoter. Figure 3B contains the CRISP2 GSEA plot, which can be generated with the mCSEAPlotGSEA function:


      mCSEAPlotGSEA(rankedCpGs1, results_mCSEA1, "promoters", "CRISP2")



      Figure 3. mCSEA plots for the CRISP2 gene. (A) Genomic context of the CRISP2 promoter. Points represent the β values for each CpG and sample. Lines show the mean methylation of each experimental group. (B) GSEA plot for the CRISP2 promoter in WS vs. TD control comparison. Vertical lines mark the location of CRISP2 promoter CpGs along the list of analyzed CpGs (horizontal black line).


    Conclusions:

    In this protocol, we show how to perform the whole analysis of Illumina BeadChip methylation data to identify differentially methylated promoters in disease samples using R packages. This protocol can be directly applied to any case-control study design. Furthermore, it can be easily adopted to study differential methylation in other regions such as gene bodies and CGIs. We are confident that this workflow will be useful for researchers studying methylation alterations associated with different disorders.

Acknowledgments

Jordi Martorell-Marugán is partially funded by Ministerio de Economía, Industria y Competitividad. This work was partially supported by FEDER/Junta de Andalucía-Consejería de Economía y Conocimiento (grant number: CV20-36723) and Consejería de Salud, Junta de Andalucía (Grant PI‐0173‐2017). We acknowledge the original research paper in which this protocol was derived (Martorell-Marugán et al., 2019).

Competing interests

The authors declare that they have no competing interests.

References

  1. Aryee, M.J., Jaffe, A.E., Corrada-Bravo, H., Ladd-Acosta, C., Feinberg, A.P., Hansen, K.D., and Irizarry, R.A. (2014). Minfi: a flexible and comprehensive Bioconductor package for the analysis of Infinium DNA methylation microarrays. Bioinformatics 30: 1363-1369.
  2. Benjamini, Y., and Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. J R Stat Soc 57: 289-300.
  3. Boyes, J., and Bird, A. (1992). Repression of genes by DNA methylation depends on CpG density and promoter strength: evidence for involvement of a methyl-CpG binding protein. EMBO J 11(1): 327-333.
  4. Cedar, H., and Bergman, Y. (2012). Programming of DNA Methylation Patterns. Annu Rev Biochem 81: 97-117.
  5. Chen, Y., Lemire, M., Choufani, S., Butcher, D.T., Grafodatskaya, D., Zanke, B.W., Gallinger, S., Hudson, T.J., and Weksberg, R. (2013). Discovery of cross-reactive probes and polymorphic CpGs in the Illumina Infinium HumanMethylation450 microarray. Epigenetics 8: 203-209.
  6. Davis, S., and Meltzer, P.S. (2007). GEOquery: a bridge between the Gene Expression Omnibus (GEO) and BioConductor. Bioinformatics 23: 1846-1847.
  7. Dozmorov, M.G., Wren, J.D., and Alarcón-Riquelme, M.E. (2014). Epigenomic elements enriched in the promoters of autoimmunity susceptibility genes. Epigenetics 9: 276-285.
  8. Edgar, R., Domrachev, M., and Lash, A.E. (2002). Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res 30: 207-210.
  9. Ehrlich, M., and Lacey, M. (2013). DNA Hypomethylation and Hemimethylation in Cancer. Adv Exp Med Biol 754:31-56.
  10. Flanagan, J.M. (2015). Epigenome-wide association studies (EWAS): past, present, and future. Methods Mol Biol Clifton NJ 1238: 51-63.
  11. John, C.R., Watson, D., Russ, D., Goldmann, K., Ehrenstein, M., Pitzalis, C., Lewis, M., and Barnes, M. (2020). M3C: Monte Carlo reference-based consensus clustering. Sci Rep 10: 1816.
  12. Maksimovic, J., Phipson, B., and Oshlack, A. (2016). A cross-package Bioconductor workflow for analysing methylation array data. F1000Res 5: 1281.
  13. Martorell-Marugán, J., González-Rumayor, V., and Carmona-Sáez, P. (2019). mCSEA: detecting subtle differentially methylated regions. Bioinforma Oxf Engl 35: 3257-3262.
  14. McGregor, K., Bernatsky, S., Colmegna, I., Hudson, M., Pastinen, T., Labbe, A., and Greenwood, C.M.T. (2016). An evaluation of methods correcting for cell-type heterogeneity in DNA methylation studies. Genome Biol 17: 84.
  15. Naeem, H., Wong, N.C., Chatterton, Z., Hong, M.K.H., Pedersen, J.S., Corcoran, N.M., Hovens, C.M., and Macintyre, G. (2014). Reducing the risk of false discovery enabling identification of biologically significant genome-wide methylation status using the HumanMethylation450 array. BMC Genomics 15: 51.
  16. Peters, T.J., Buckley, M.J., Statham, A.L., Pidsley, R., Samaras, K., V Lord, R., Clark, S.J., and Molloy, P.L. (2015). De novo identification of differentially methylated regions in the human genome. Epigenetics Chromatin 8: 6.
  17. Pidsley, R., Y Wong, C.C., Volta, M., Lunnon, K., Mill, J., and Schalkwyk, L.C. (2013). A data-driven approach to preprocessing Illumina 450K methylation array data. BMC Genomics 14: 293.
  18. Ritchie, M.E., Phipson, B., Wu, D., Hu, Y., Law, C.W., Shi, W., and Smyth, G.K. (2015). limma powers differential expression analyses for RNA-sequencing and microarray studies.Nucleic Acids Res 43: e47-e47.
  19. Robertson, K.D. (2005). DNA methylation and human disease. Nat Rev Genet 6: 597-610.
  20. Strong, E., Butcher, D.T., Singhania, R., Mervis, C.B., Morris, C.A., De Carvalho, D., Weksberg, R., and Osborne, L.R. (2015). Symmetrical Dose-Dependent DNA-Methylation Profiles in Children with Deletion or Duplication of 7q11.23. Am J Hum Genet 97: 216-227.
  21. Subramanian, A., Tamayo, P., Mootha, V.K., Mukherjee, S., Ebert, B.L., Gillette, M.A., Paulovich, A., Pomeroy, S.L., Golub, T.R., Lander, E.S., et al. (2005). Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci 102: 15545-15550.
  22. Teh, A.L., Pan, H., Lin, X., Lim, Y.I., Patro, C.P.K., Cheong, C.Y., Gong, M., MacIsaac, J.L., Kwoh, C.-K., Meaney, M.J., et al. (2016). Comparison of Methyl-capture Sequencing vs. Infinium 450K methylation array for methylome analysis in clinical samples. Epigenetics 11: 36-48.
  23. Teschendorff, A.E., Marabita, F., Lechner, M., Bartlett, T., Tegner, J., Gomez-Cabrero, D., and Beck, S. (2013). A beta-mixture quantile normalization method for correcting probe design bias in Illumina Infinium 450 k DNA methylation data. Bioinforma Oxf Engl 29: 189-196.
  24. Triche, T.J., Weisenberger, D.J., Van Den Berg, D., Laird, P.W., and Siegmund, K.D. (2013). Low-level processing of Illumina Infinium DNA Methylation BeadArrays. Nucleic Acids Res 41: e90-e90.
  25. Yuan, T., Horikoshi, M., and Li, W. (2016). ggfortify: Unified Interface to Visualize Statistical Results of Popular R Packages. R J 8: 474-485.

简介

[摘要]基因启动子中的DNA甲基化在基因表达调节中起主要作用,并在甲基化模式的改变已经与多种疾病相关联。在这种情况下,不同的软件套件和统计方法已经被提出来ANALY ž è差异甲基化的立场和地区。其中,在mCSEA R软件包中实施的新颖统计方法提出了一个检测细微但一致的甲基化差异的新框架。在这里,我们提供了一个易于使用的管道,涵盖了所有必要的步骤,以检测来自mCSEA的差异甲基化启动子。 Illumina 450K和EPIC甲基化BeadChip数据。该协议覆盖了下载的数据从公共仓库,质量控制,数据滤波和归一化的,细胞类型的比例的估计,和统计分析。此外,我们展示了比较疾病与疾病的程序。正常表型,获得包括启动子或CpG岛在内的差异甲基化区域。整个协议是基于R的编程语言,它可以在任何操作系统一起使用,并且不需要高级编程技能。

[背景] DNA甲基化起着在许多细胞过程中起重要作用,目前正在广泛研究到g AIN更好的理解人类发育和疾病的(罗伯逊,2005) 。大多数表观基因组范围的关联研究(EWAS)都在寻找DNA甲基化与疾病之间的关联(Flanagan,2015)。为了这个目的,Illumina的BeadChip阵列被广泛用于测量人类的DNA甲基化。启动子中的甲基化与基因表达抑制有关(Boyes and Bird,1992)。表达抑制的机制包括阻止转录因子的结合和募集转录抑制子(Cedar and Bergman,2012)。在这些区域中异常的DNA甲基化已被链接到几种疾病,包括癌症(艾氏和雷斯,2013年)和自身免疫性DIS顺序小号(Dozmorov等人,2014) 。如先前所评论(Martorell-Marugánet al。,2019),有几种R软件包设计用于检测差异甲基化区域,它们应用了从头和预定义的策略。大多数预定义的方法可以直接应用于检测差异甲基化的启动子。相反,从头方法沿整个基因组搜索差异甲基化区域,该区域应加注解,以检测哪些区域位于启动子上。

在这个协议中,我们提出完整的数据和统计分析流水线来检测差异在从Illumina的疾病表型甲基化的启动子BeadChip芯片基于所述数据mCSEA ř包(马托雷尔-马鲁甘等人,2019) ,其申请IES一个预定义区域的策略。我们使用了先前发表的来自两种罕见神经发育疾病患者的数据:威廉姆斯综合征(WS)和7q11.23复制综合征(Dup7),以及典型的发育性(TD)患者。原始研究的作者对所有样品中的血细胞甲基化进行了测量(Strong等人,2015)。这个管道可以容易地适应研究其他基因组区域,例如基因BOD ý特定甲基化模式或CpG岛(CGI程序)。该协议的完整代码可作为Supplemental_script文件获得。

关键字:甲基化作用, 差异甲基化区域, Illumina 450K, 表观基因组范围的关联分析, 启动子, 基因集合富集分析



设备


个人电脑与Windows,MacOS的,或基于Unix的操作系统


软件


R软件环境4.0.2。(https://www.r-project.org/)
RStudio集成开发环境1.3.1056(不是必需的,但强烈建议使用https://rstudio.com/)
GEOquery R包2.56.0 (Davis和Meltzer,2007),https: //www.bioconductor.org/packages/release/bioc/html/GEOquery.html
Minfi R package 1.34.0 (Aryee et al。,2014),https: //www.bioconductor.org/packages/release/bioc/html/minfi.html
M3C R包1.10.0 (John等,2020),https://www.bioconductor.org/packages/release/bioc/html/M3C.html
ggfortify R包0.4.11 (Yuan et al。,2016),https: //cran.r-project.org/web/packages/ggfortify/index.html
DMRcate R包2.2.3 (Peters等人,2015),https: //www.bioconductor.org/packages/release/bioc/html/DMRcate.html
wateRmelon R软件包1.32.0 (Pidsley等,2013),https: //www.bioconductor.org/packages/release/bioc/html/wateRmelon.html
FlowSorted.Blood.450k R软件包1.26.0,http: //bioconductor.org/packages/release/data/experiment/html/FlowSorted.Blood.450k.html
mCSEA R软件包1.8.0 (Martorell-Marugánet al。,2019),http: //bioconductor.org/packages/release/bioc/html/mCSEA.html


数据集


先前生成的甲基化数据(Strong等人,2015)。GEO标识符:GSE66552(https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE66552)


程序


安装必要的R软件包
可以在R终端中使用以下命令从CRAN和Bioconductor存储库安装运行此管道所需的软件包:


install.packages (“ ggfortify ”)


if (!requireNamespace (“ BiocManager ”,静默= TRUE))


    install.packages (“ BiocManager ”)


BiocManager :: install(c(“ GEOquery ”,“ minfi ”,“ M3C ”,“ DMRcate ”,“西瓜”,           

                    “ FlowSorted.Blood.450k”,“ mCSEA ”))


下载原始数据
图1示出的图的数据分析的工作流程。Illumina的BeadChip阵列原始数据是idat文件,其中包含每个微阵列点的荧光强度。这些IDAT文件可以从公众基因表达进行下载[R epository GEO (埃德加等人,2002年)使用该GEOquery包(Davis和梅尔策,2007年)。在这个协议中,我们将到ANALY ž由强报告Ë数据等。(2015),可以使用GEO标识符GSE66552检索 。在R会话中,执行以下代码:


图书馆(GEOquery )


getGEOSuppFiles (“ GSE66552”)


下载完成后,当前目录中将存在一个包含压缩数据的文件夹。使用以下代码将数据解压缩到IDAT文件夹中:


解压缩(“ GSE66552 / GSE66552_RAW.tar”,exdir =“ IDAT”)


最后,将实验的元数据保存到名为pheno的变量中:


GEOData <-getGEO(“ GSE66552”)


现象<-pData(phenoData(GEOData [[1]]))




图1.数据分析工作流。黑盒子包含输入,输出,和中间文件。蓝色框包含此协议中描述的步骤。


读取idat文件
Minfi软件包(Aryee等,2014)可用于读取idat文件并处理甲基化数据。当t他的目的,目标数据框应该从实验的元数据来创建。表1包含此数据集的前5行,以显示此对象应具有的结构。


图书馆(minfi )


basenames_split < -strsplit (pheno $ supplementary_file ,“ suppl /”)


basenames < -sapply (basenames_split ,function(x){return(x [2])})


目标< -data.frame (“ Sample_Name ” = pheno $ title ,“ Basename ” = basenames )


表1.第5行的所述目标数据帧


读取数据并将其存储在RGChannelSet对象中,该对象将用于以后的分析。RGChannelSet是minfi用于存储原始甲基化数据的类。此对象中的样本名称也可以在此步骤中指定。


rgchannelset < -read.metharray.exp (目标=目标,基准=“ IDAT”)


sampleNames (rgchannelset )<- targets $ Sample_Name


执行质量控制(QC)
为确保没有影响任何样品的技术问题,应执行质量控制步骤。Minfi程序包包含一项功能,可探索每个样品的未甲基化通道和甲基化通道的中值强度。低强度可以指示问题小号与样品。图2 A显示了此QC的输出图。在这种情况下,没有任何样品未通过此质量控制。如果任何样品未通过此质量控制,则应将其从分析中丢弃。


msetraw < - preprocessRaw (rgchannelset )


qc < -getQC (msetraw )


plotQC (qc)


主成分小号分析(PCA)也是推荐的QC步骤,检测潜在的异常值。为此,可以使用M3C (John等人,2020)和ggfortify (Yuan等人,2016)R包。图2B显示了我们示例的PCA图。如可以观察到,存在有没有异常值样品小号和基团被部分地分离。组之间可观察到的混合可能是由于协变量,如性别,年龄,或患者的血统。


图书馆(M3C)


图书馆(ggfortify )


betas.raw < -getBeta (msetraw )


phenoPCA < -data.frame (“ Group” = pheno $`group:ch1`,stringsAsFactors = F )


pcaResult < -prcomp (t(na.omit (betas.raw )),scale。= TRUE)


自动绘图(pcaResult ,data = phenoPCA ,color =“ Group”)






图2.甲基化数据的质量控制。(A)Minfi QC ,每个样品的中位未甲基化和甲基化信号强度。(B)表示前两个主要成分的PCA图。样品按实验组着色。


预处理数据
牛逼Ø正确的技术偏见和提高数据质量,必须过滤掉一些探头和应用规范化方法。Minfi preprocessNoob函数应用Noob(正态指数带外),这是一种通过染料偏差归一化进行的背景校正(Triche等人,2013)。标准化输出是一个MethylSet对象,其中包含每个CpG探针的甲基化和未甲基化信号s 。可以将该对象转换为RatioSet类,在其中存储β值而不是信号。最后,应创建一个包含映射到基因组的β值的GenomicRatioSet对象。该GenomicRatioSet是用于进一步分析必要的类。


MSET < - preprocessNoob (rgchannelset )


rset < -ratioConvert (mset )


grset < -mapToGenome (rset )


随后,可以获得每个探针和样品的检测P值。通过将探针信号与使用阴性对照探针测得的背景信号进行比较,可以计算出此P值。甲d etection P -值> 0.01指示š的不可信值得信号。过滤步骤可以应用于与以除去那些探针一个检测P样品中的百分比-值> 0.01(在这种情况下,样品的至少10%)。


detectionPvalues < - detectionP (rgchannelset )


detectionPvalues < - detectionPvalues [ rownames (grset ),]


propGood < -apply (detectionPvalues ,1,function(x){


长度(x [which (x <0.01)])/ ncol (detectionPvals )


})


grset < -grset [which(propGood > = 0.9),]           



我t被建议以除去那些探针位于性染色体(仅当被包括男性和女性的样本)(Maksimovic等人。,2017),含有单核苷酸多态性(SNP)的探针(纳伊姆等人,2014),和交叉反应性探针(陈等人。,2013)。DMRCate包(彼得斯。等人,2015)中包含的便利功能rmSNPandCH ,其可用于在一个步骤中所有这些探针筛选:


图书馆(DMRcate )


beta.noob < -getBeta (grset )


beta.noob.filtered < -rmSNPandCH (beta.noob ,rmXY = TRUE)


最后,可以使用Beta-Mixture Quantile(BMIQ)归一化来减轻450K和EPIC阵列中存在的I型和II型探针设计之间的系统差异(Teschendorff等,2013)。对于第是目的,所述西瓜包(Pidsley 。等人,2013) ,可以使用:


图书馆(wateRmelon )


annot < -getAnnotation (rgchannelset )


probeType < -as.data.frame (annot [rownames(beta.noob.filtered),c(“ Name”,“ Type”)])


probeType < -ifelse (probeType $ Type %in%“ I”,1,2)


beta.normalized < -apply (beta.noob.filtered ,2,function(x){


wateRmelon :: BMIQ(x,design.v = probeType )$ nbeta


})


估计细胞类型异质性
血液甲基化数据变异性的主要来源之一是样本中不同的细胞类型比例(McGregor等,2016)。在这种情况下,可以使用FlowSorted.Blood.450k程序包中的参考数据来估计每个样本中的这些比例。在所述与EPIC平台产生的数据的情况下,FlowSorted.Blood.EPIC包应该被代替使用。


库(FlowSorted.Blood.450k)


cellCounts < - estimateCellCounts (rgchannelset )


所述cellCounts对象是一个可与实验合并的矩阵人的元数据以考虑这些估计作为在进一步的分析协变量:


pheno = merge(pheno ,cellCounts ,by.x =“ title”,by.y =“ row.names ”)


行名(pheno )<- pheno $ title


识别差异甲基化的启动子
最常见的甲基化分析之一是检测跨表型显示不同甲基化水平的区域,称为差异甲基化区域(DMR)。为此,不同的统计人程序已经开发出来,大部分的这些基于成熟的方法,如线性模型(里奇等人,2015) 。在这个协议中,我们使用的mCSEA包(马托雷尔-马鲁甘等人,2019)来进行的DMRS分析集中于启动子区。第一步是使用线性模型根据不同条件下CpG探针的甲基化差异对其进行排名。在这些模型中,可以包括可用的协变量(例如,估计的细胞类型比例)以调整其对数据的可能影响。在这里,我们是第一ANALY ž荷兰国际集团的WS VS 。TD控件比较:


图书馆(mCSEA )


phenoComp < -pheno [,c(“ group:ch1”,“ CD8T”,“ CD4T”,“ NK”,“ Bcell ”,“ Mono”,“ Gran”)]]


rankCpGs1 < -rankProbes (beta.normalized ,phenoComp ,


                                                                                                refGroup =“ TD控件”,caseGroup =“ WS”,

                                                                                                协变量= c(“ CD8T”,“ CD4T”,“ NK”,“ Bcell ”,“ Mono”,“ Gran”),


                                                                                                连续= c(“ CD8T”,“ CD4T”,“ NK”,“ Bcell ”,“ Mono”,“ Gran”))


下面步骤是使用于计算差异甲基化的启动子。输出对象是一个数据帧与基因符号行的名字,在估计的P值的PVAL柱,和P-值通过调整所述的Benjamini -Hochberg方法(和的Benjamini霍赫贝格,1995)中所述padj柱。此外,鉴于THA吨的mCSEA算法是基于基因集富集分析(GSEA) (萨勃拉曼尼亚等人,2005) ,富集得分(ES)被提供给每个基因。这个分数的符号指示是否启动子超甲基化的(正号)或低甲基化(负号)的情况下组中作为比较用的控制。最后,将归一化的ES(NES)和量的CpG中的每个启动子中报告的NES和大小的列,分别。表2显示了results_promoters1对象的前20行。在该EPIC数据的情况下,参数平台=“EPIC”应该被添加到mCSEATest功能。此外,该regionsTypes参数可以被修改以“基因”或“CGI”到ANALY Ž e本基因体和CGI ,分别。


种子(123)


results_mCSEA1 < -mCSEATest (rankCpGs1,beta.normalized ,


phenoComp ,regionsTypes =“发起人”)


results_promoters1 <-results_mCSEA1 $ promoters [order(results_mCSEA1 $ promoters $ padj),


                                               - C(3,7)]


表2. WS vs. WS中排名前20位的差异甲基化启动子。TD控制比较。启动子按其调整后的P值排序。


对Dup7与进行相同的分析。TD控件比较:


rankCpGs2 < -rankProbes (beta.normalized ,phenoComp ,


                                                                                                refGroup =“ TD控件”,caseGroup =“ Dup7”,

                                                                                                协变量= c(“ CD8T”,“ CD4T”,“ NK”,“ Bcell ”,“ Mono”,“ Gran”),


                                                                                                连续= c(“ CD8T”,“ CD4T”,“ NK”,“ Bcell ”,“ Mono”,“ Gran”))


种子(123)


results_mCSEA2 < -mCSEATest (rankCpGs2,beta.normalized ,

phenoComp ,regionsTypes =“发起人”)


results_promoters2 <-results_mCSEA2 $ promoters [order(results_mCSEA2 $ promoters $ padj),


                                               - C(3,7)]


此外,可以在其基因组背景下代表任何启动子的甲基化值。图3A显示了该CRISP2基因图的结果,该基因的启动子在WS患者中高甲基化,在Dup7患者中低甲基化。要构建这些图,请使用mCSEAPlot函数:


mCSEAPlot (results_mCSEA1, “启动子”, “CRISP2”,makePDF = F,前缘= F)


最后,还可以为每个启动子生成一个GSEA图。在该图中,所有的ANALY Ž ED的CpG被排序它们差异性甲基化的x轴,和垂直线标记的CpG属于表示启动子。图3B包含CRISP2 GSEA图,可以使用mCSEAPlotGSEA函数生成该图:


mCSEAPlotGSEA (排名为CpGs1,results_mCSEA1,“发起人”,“ CRISP2”)




图3. mCSEA地块进行了CRISP2基因。(A)基因组的上下文中的CRISP2启动子。点代表每个CpG和样本的β值。线显示每个实验组的平均甲基化。(B)GSEA为情节的在WS CRISP2启动子VS 。TD控制比较。垂直线标记CRISP2启动子的位置的CpG沿ANALY列表ž ED的CpG (水平黑线)。


结论:


在该协议中,我们展示了如何使用R软件包对Illumina BeadChip甲基化数据进行整体分析,以鉴定疾病样品中差异甲基化的启动子。该协议可以直接应用于任何病例对照研究设计。此外,它可以很容易地用于研究其他区域(如基因体和CGI)的差异甲基化。我们相信,这种工作流程将对研究与不同疾病相关的甲基化改变的研究人员有用。


致谢


尔迪Martorell的-马鲁甘被部分资金部:德Economía ,INDUSTRIA ÿ Competitividad 。这项工作是由部分支持FEDER /军政府的Andalucía- Consejería德Economía Ÿ Conocimiento (批准号:CV20-36723 )和Consejería德Salud的,安达卢西亚委员会(格兰特PI - 0173 - 2017年)。我们承认原有的研究论文中,该协议是派生(Martorell的-马鲁甘等人。,2019) 。


利益争夺


作者宣称他们没有竞争利益。


参考


1. Aryee,MJ,Jaffe,AE,Corrada-Bravo,H.,Ladd-Acosta,C.,Feinberg,AP,Hansen,KD,and Irizarry,RA(2014)。Minfi:一种灵活而全面的Bioconductor软件包,用于分析Infinium DNA甲基化微阵列。生物信息学30 :1363 - 1369年。     

2.的Benjamini ,ÿ 。和霍赫贝格,ÿ 。(1995)。控制˚F ALSE d iscovery ř吃:一个p ractical和p owerful一个接近角到米ultiple吨esting。JR统计志57 :289 - 300。     

3. Boyes ,J 。和Bird ,A 。(1992)。DNA甲基化对基因的抑制取决于CpG密度和启动子强度:甲基CpG结合蛋白参与的证据。EMBO J.11 (1):327-333。     

4. Cedar,H.和Bergman,Y.(2012)。DNA甲基化模式的编程。Annu启生物化学81 :97 - 117。     

5. Chen,Y.,Lemire,M.,Choufani,S.,Butcher,DT,Grafodatskaya,D.,Zanke,BW,Gallinger,S.,Hudson,TJ和Weksberg,R.(2013)。在Illumina Infinium HumanMethylation450微阵列中发现交叉反应探针和多态CpG。表观遗传学8 :203 - 209。     

6. Davis,S.和Meltzer,PS(2007)。GEOquery:基因表达综合(GEO)与BioConductor之间的桥梁。生物信息学23 :1846年- 1847年。     

7. Dozmorov,MG,Wren,JD和Alarcón-Riquelme,ME(2014年)。表观基因组元素富含自身免疫敏感性基因的启动子。表观遗传学9 ,276-285。     

8. R. Edgar,M. Domrachev和AE,Lash(2002)。基因表达综合:NCBI基因表达和杂交阵列数据存储库。核酸研究。30 :207 - 210。     

9. Ehrlich,M.和Lacey,M.(2013)。癌症中的DNA次甲基化和半甲基化。生物医学进展754:31-56。     

10.弗拉纳根,JM(2015)。表观基因组范围的关联研究(EWAS):过去,现在和将来。方法分子生物学克利夫顿NJ 1238 :51 - 63。 

11.约翰·华润,华盛顿·沃森,罗斯,华盛顿,戈德曼,爱伦斯坦,皮茨卡里斯,路易斯和米纳斯·巴恩斯(2020)。M3C:基于蒙特卡罗参考的共识聚类。科学代表10 :1816。 

12. Maksimovic,J.,Phipson,B.和Oshlack,A.(2016)。用于分析甲基化阵列数据的交叉包装Bioconductor工作流程。F1000Res 5:1281。 

13.马尔托雷尔-马鲁甘,J.,冈萨雷斯- Rumayor,V.,和卡莫纳-Sáez研究,P。(2019)。mCSEA:检测微妙的差异甲基化区域。Bioinforma OXF英格兰35 :3257 - 3262。 

14. K. McGregor,S。Bernatsky,I。Colmegna,M。Hudson,T。Pastinen,A。Labbe和CMT Greenwood(2016)。对DNA甲基化研究中纠正细胞类型异质性的方法的评估。基因组生物学杂志17 :84。 

15. Naeem,H.,Wong,NC,Chatterton,Z.,Hong,MKH,Pedersen,JS,Corcoran,NM,Hovens,CM和Macintyre,G.(2014年)。减少错误发现的风险,从而可以使用HumanMethylation450阵列鉴定生物学上重要的全基因组甲基化状态。BMC基因组学15 :51。 

16.彼得斯,新泽西州,巴克利,密西根州,斯坦森,阿拉巴马州,皮兹利,R。,萨马拉斯,K.,V·洛德,R。,克拉克,SJ和莫洛伊(PL)(2015年)。从头鉴定人类基因组中差异甲基化的区域。表观遗传染色质8 :6。 

17. R. Pidsley,R. Y Wong,CC,Volta,M.,Lunnon,K.,Mill,J.和Schalkwyk,LC(2013)。一种数据驱动的方法来预处理Illumina 450K甲基化阵列数据。BMC基因组学14 :293。 

18.缅因州里奇(Ritchie,ME),皮普森(B. limma为RNA测序和微阵列研究提供了差异表达分析的能力。核酸研究43 :e47 - e47。 

19.罗伯逊,KD(2005)。DNA甲基化与人类疾病。纳特冯遗传学6 :597 - 610。 

20. Strong,E.,DT,Butcher,DT,Singhania,R.,Mervis,CB,Morris,CA,De Carvalho,D.,Weksberg,R。和Osborne,LR(2015年)。删除或重复7q11.23儿童的对称剂量依赖性DNA甲基化分布图。上午Ĵ人类遗传学97 :216 - 227。 

21. Subramanian,A.,Tamayo,P.,Mootha,VK,Mukherjee,S.,Ebert,BL,Gillette,MA,Paulovich,A.,Pomeroy,SL,Golub,TR,Lander,ES等。(2005)。基因集富集分析:一种基于知识的方法来解释全基因组表达谱。PROC国家科科学院院刊102 :15545 - 15550。 

22. Teh,AL,Pan,H.,Lin,X.,Lim,YI,Patro,CPK,Cheong,CY,Gong,M.,MacIsaac,JL,Kwoh,C.-K.,Meaney,MJ等al。(2016)。在临床样品中进行甲基化组分析的甲基捕获测序与Infinium 450K甲基化阵列的比较。表观遗传学11 :36 - 48。 

23. AE的Teschendorff,F 。的Marabita,M。的Lechner,T。的Bartlett,J。的Tegner,D。的Gomez-Cabrero和S.的Beck(2013)。一种用于纠正Illumina Infinium 450 k DNA甲基化数据中探针设计偏倚的β混合物分位数归一化方法。Bioinforma OXF英格兰29 :189 - 196。 

24. Tche的Triche,DJ的Weisenberger,D。的Van Den Berg,PW的Laird和KD的Siegmund(2013)。Illumina Infinium DNA甲基化BeadArrays的低级处理。核酸研究41 :e90 - e90。 

25. Yuan T.,Horikoshi M.和Li W.(2016)。ggfortify:统一界面,用于可视化常用R包的统计结果。RJ 8 :474 - 485。  
登录/注册账号可免费阅读全文
  • English
  • 中文翻译
免责声明 × 为了向广大用户提供经翻译的内容,www.bio-protocol.org 采用人工翻译与计算机翻译结合的技术翻译了本文章。基于计算机的翻译质量再高,也不及 100% 的人工翻译的质量。为此,我们始终建议用户参考原始英文版本。 Bio-protocol., LLC对翻译版本的准确性不承担任何责任。
Copyright: © 2021 The Authors; exclusive licensee Bio-protocol LLC.
引用:Martorell-Marugán, J. and Carmona-Sáez, P. (2021). Detecting Differentially Methylated Promoters in Genes Related to Disease Phenotypes Using R. Bio-protocol 11(11): e4033. DOI: 10.21769/BioProtoc.4033.
提问与回复
提交问题/评论即表示您同意遵守我们的服务条款。如果您发现恶意或不符合我们的条款的言论,请联系我们:eb@bio-protocol.org。

如果您对本实验方案有任何疑问/意见, 强烈建议您发布在此处。我们将邀请本文作者以及部分用户回答您的问题/意见。为了作者与用户间沟通流畅(作者能准确理解您所遇到的问题并给与正确的建议),我们鼓励用户用图片的形式来说明遇到的问题。

如果您对本实验方案有任何疑问/意见, 强烈建议您发布在此处。我们将邀请本文作者以及部分用户回答您的问题/意见。为了作者与用户间沟通流畅(作者能准确理解您所遇到的问题并给与正确的建议),我们鼓励用户用图片的形式来说明遇到的问题。