
Reference-free Association Mapping from Sequencing Reads Using k-mers

Abstract

Association mapping is the process of linking phenotypes with genotypes. In genome wide association studies (GWAS), individuals are first genotyped using microarrays or by aligning sequenced reads to reference genomes. However, both of these approaches rely on reference genomes, which limits their application to organisms with no or incomplete reference genomes. To address this, reference free association mapping methods have been developed. Here we present the protocol of an alignment free method for association studies, which is based on counting k-mers in sequenced reads, testing for associations between k-mers and the phenotype of interest, and local assembly of the statistically significant k-mers. The method can map associations of categorical phenotypes to sequence and structural variations without requiring prior sequencing of reference genomes.

Keywords: Association mapping, Genome wide association studies (GWAS), Reference free, k-mer

Background

Association mapping, i.e., the process of associating genotypes with phenotypes, is most frequently performed in the form of genome wide association studies (GWAS) with single nucleotide polymorphisms (SNPs). Microarrays are used to genotype individuals at a large number of known SNP locations, and each SNP is tested for association with the phenotype of interest. However, this approach requires prior sequencing of a reference genome and determination of the SNP locations. Moreover, it precludes mapping associations to structural variations such as insertion-deletions (indels) and copy number variations, and to variations outside of the reference genome.

With advances in sequencing technologies, the use of whole genome sequenced reads for association mapping is becoming increasingly widespread. This is most commonly done by mapping the reads to a reference genome, calling variants, and testing for association between the variants and the phenotype. However, this approach also requires a reference genome, and regions missing from the reference are not included in the study.

To address these issues, a number of reference free methods for association mapping have been developed. They are based on testing for association between the phenotype and k-mers, i.e., contiguous sequences of length k in sequenced reads. Sheppard et al., 2013, Earle et al., 2016, Lees et al., 2016 and Jaillard et al., 2018 presented methods for association mapping in bacterial genomes, where high plasticity makes application of reference based methods difficult. Rahman et al., 2018 introduced a method for mapping associations to categorical phenotypes that is applicable to organisms with large genomes, and more recently Voichek and Weigel, 2020 presented a method for both categorical and quantitative phenotypes.
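To make the notion of a k-mer concrete, the following minimal Python sketch counts every k-mer in a handful of reads. It is an illustration only: the function name `count_kmers` is hypothetical and is not part of HAWK, and in the protocol itself counting is delegated to Jellyfish, which is built to handle whole sequencing runs.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every contiguous substring of length k across all reads."""
    counts = Counter()
    for read in reads:
        # A read of length n contains n - k + 1 k-mers (none if n < k).
        for start in range(len(read) - k + 1):
            counts[read[start:start + k]] += 1
    return counts


if __name__ == "__main__":
    # Two toy reads; real input would be millions of sequencing reads.
    print(count_kmers(["ACGACG", "CGAT"], k=3))
```

A dictionary-based counter like this is only practical for toy inputs, which is one reason the protocol relies on a dedicated k-mer counter.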

Here we present the protocol of the reference free association mapping tool HAWK, which was developed by Rahman et al., 2018 and extended by Mehrab et al., 2020. It works by first counting k-mers in reads from each individual using Jellyfish (Marçais and Kingsford, 2011). A likelihood ratio test is then used to find k-mers with significantly different counts in case and control samples. Next, population structure is determined using Eigenstrat (Patterson et al., 2006, Price et al., 2006). Finally, k-mers associated with the phenotype are identified and locally assembled to obtain a sequence for each associated locus. The results obtained with HAWK were largely in agreement with those of reference based methods. Moreover, HAWK was able to map associations to structural variants and to variants in regions not present in the reference. The method is applicable to any genetic disease or trait. However, it currently supports only categorical, i.e., binary, phenotypes, although work is ongoing to extend it to quantitative phenotypes.

Equipment

  1. Computer (We recommend at least 16GB RAM and multiple cores)

Software

  1. HAWK ( Rahman et al., 2018 ; Mehrab et al., 2020 )

    The primary software for association mapping from sequencing reads using k-mers. The software is available for download at https://github.com/atifrahman/HAWK/releases and installation instructions are at https://github.com/atifrahman/HAWK.

  2. Modified version of Jellyfish/Jellyfish 2 (Marçais and Kingsford, 2011)

    For k-mer counting. Modified versions available for download at https://github.com/atifrahman/HAWK/tree/master/supplements and installation instructions are in the README.md file.

  3. Modified version of EIGENSTRAT ( Patterson et al., 2006 , Price et al., 2006 )

    For population structure determination. A modified version can be downloaded from https://github.com/atifrahman/HAWK/tree/master/supplements and installation instructions are available in the README file.

  4. ABySS ( Simpson et al., 2009 )

    To assemble k-mers of statistical significance. Download and installation instructions are at https://github.com/bcgsc/abyss.

  5. GNU sort with parallel support, usually included with Linux distributions

  6. R. Available for download at https://cran.r-project.org/

Procedure

An overview of the procedure for association mapping from sequencing reads with k-mers using HAWK is shown in Figure 1. The steps are described in more detail below. More information on the process is available at https://github.com/atifrahman/HAWK/blob/master/README.md.



Figure 1. Overview of association mapping from sequencing reads with k-mers using HAWK

  1. Counting and sorting k-mers

    1. Count k-mers in each sample by running Jellyfish or Jellyfish 2. To do this, modify the example scripts available at https://github.com/atifrahman/HAWK/tree/master/supplements.

    2. countKmers_jf2 or countKmers_jf1 can be used to count and sort k-mers using Jellyfish 2 or Jellyfish respectively when the reads from each sample are in a separate folder whereas countKmers_jf2_sra or countKmers_jf1_sra can be used when reads first need to be downloaded from the sequence read archive (SRA).

    3. The scripts will write the names of the sorted k-mer count files in 'sorted_files.txt' and the total k-mer counts in each sample in 'total_kmer_counts.txt'.


  2. Running HAWK

    1. Create a tab-separated file named 'gwas_info.txt' with one row per sample and three columns: the sample ID, the sex of the sample (M, F or U for male, female or unknown) and the Case/Control status of the sample.

    2. Copy the files 'sorted_files.txt', 'total_kmer_counts.txt' and 'gwas_info.txt' into a single folder.

    3. Copy the scripts 'runHawk' and 'runAbyss' into the folder, edit the variables hawkDir, eigenstratDir and isDiploid, and run ./runHawk

      This will complete the following steps:

      1. Identifying k-mers associated with cases and controls. This will be done by running ‘hawk’ and will generate the initial lists of k-mers associated with cases and controls in files ‘case_out_w_bonf.kmerDiff’ and ‘control_out_w_bonf.kmerDiff’ respectively as well as the files to be used by Eigenstrat to determine population structure.

      2. Determining population structure. Population structure will be detected by running Eigenstrat. Optionally, the population structure can be investigated by running the R script pca_plot.R. This will read the ‘gwas_eigenstrat.evec’ file generated by Eigenstrat and output the principal component analysis (PCA) plot in ‘pca_plot.eps’. Adjust the variables PC1 and PC2 to select along which principal components the data will be plotted.

      3. Correcting for population structure. The p-values of the k-mers identified in the first sub-step (identifying k-mers associated with cases and controls) will be adjusted for confounding factors such as population structure, total number of k-mers and sex. The k-mers found to be significantly associated with cases and controls will be in the files 'case_kmers.fasta' and 'control_kmers.fasta' respectively. Additional information about the k-mers will be in ‘pvals_case_top_merged.txt’ and ‘pvals_control_top_merged.txt’.


  3. Assembling k-mers

    1. Edit the variable ‘abyssDir’ in the script ‘runAbyss’ and run ./runAbyss
      This will assemble the k-mers to generate one sequence for each associated region and the sequences associated with cases and controls will be in ‘case_abyss.25_49.fa’ and ‘control_abyss.25_49.fa’ respectively.

  4. Downstream analysis

    Once the k-mers or the assembled sequences are obtained, they can be analyzed in a number of ways.

    1. To obtain summary statistics such as average p-values, average counts of the constituent k-mers and the average number of times they are present in cases and controls, edit the script ‘runKmerSummary’ (at https://github.com/atifrahman/HAWK/tree/master/supplements) to set the HAWK directory, the input filename and whether the sequences are from cases or controls, then run ./runKmerSummary (see https://github.com/atifrahman/HAWK for details).

    2. If no reference genome is available, BLAST ( Altschul et al., 1990 ) the sequences to check for hits to sequences in related organisms and analyze the matched sequences.

    3. If a reference genome is available, the k-mers can be mapped to the reference using a tool such as Bowtie 2 (Langmead and Salzberg, 2012), and their positions and p-values can be visualized using Manhattan plots. To align the k-mers to a reference and generate Manhattan plots, edit the shell script ‘runBowtie2’ and the R script ‘manhattan_plasmid.R’ respectively, both available at https://github.com/atifrahman/HAWK/tree/master/ecoli_analysis.
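Because the 'gwas_info.txt' file created in the "Running HAWK" step is written by hand, it can be worth checking it before starting a run. The sketch below validates the three tab-separated columns described in that step. The function name `parse_gwas_info` is hypothetical, and the assumptions that the status is spelled exactly "Case" or "Control" and that blank lines are ignored are ours rather than documented HAWK behavior.

```python
VALID_SEX = {"M", "F", "U"}
VALID_STATUS = {"Case", "Control"}  # exact spelling is an assumption


def parse_gwas_info(lines):
    """Parse gwas_info.txt lines into (sample_id, sex, status) tuples.

    Raises ValueError on the first malformed row.
    """
    records = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines (an assumption of this sketch)
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(f"line {number}: expected 3 tab-separated columns")
        sample_id, sex, status = (field.strip() for field in fields)
        if sex not in VALID_SEX:
            raise ValueError(f"line {number}: sex must be M, F or U")
        if status not in VALID_STATUS:
            raise ValueError(f"line {number}: status must be Case or Control")
        records.append((sample_id, sex, status))
    return records


if __name__ == "__main__":
    with open("gwas_info.txt") as handle:
        samples = parse_gwas_info(handle)
    print(f"{len(samples)} samples parsed")
```

The number of parsed samples can then be compared by eye with the number of lines in 'sorted_files.txt' and 'total_kmer_counts.txt'.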

Data analysis

To identify the k-mers that are present significantly more often in cases than in controls, or vice versa, HAWK assumes that k-mer counts are Poisson distributed and performs a likelihood ratio test. Population structure is determined by running Eigenstrat to perform a principal component analysis on a binary matrix denoting the presence or absence of a random set of k-mers. Correction for population structure and other confounders is done by fitting two logistic regression models of the phenotype, one against the confounders alone and one against the confounders together with the k-mer count vector, from which adjusted p-values are obtained for the k-mers identified in the first step. Bonferroni correction is performed to correct for multiple testing. See Rahman et al., 2018 and Mehrab et al., 2020 for a detailed description of the methods and supporting results.
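The likelihood ratio test on Poisson counts can be sketched as follows. This is a simplified illustration and not HAWK's own code: it assumes that the count of a k-mer in a sample is Poisson with mean proportional to that sample's total k-mer count, pools the counts within cases and within controls, and uses a chi-square approximation with one degree of freedom. The function names are hypothetical.

```python
import math


def _k_log_rate(count, total):
    # Term count * ln(count / total); its limit is 0 when count is 0.
    return count * math.log(count / total) if count > 0 else 0.0


def poisson_lrt(case_count, case_total, control_count, control_total):
    """Return (statistic, p_value) for a two-group Poisson rate comparison.

    case_count / control_count: occurrences of one k-mer summed over cases
    or controls. case_total / control_total: total k-mer counts summed over
    the same samples.
    """
    # Null model: a single rate shared by cases and controls.
    pooled = _k_log_rate(case_count + control_count, case_total + control_total)
    # Alternative model: separate rates for cases and controls.
    separate = (_k_log_rate(case_count, case_total)
                + _k_log_rate(control_count, control_total))
    # Clamp at zero to absorb floating-point rounding.
    statistic = max(0.0, 2.0 * (separate - pooled))
    # Chi-square survival function with 1 d.o.f.: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


if __name__ == "__main__":
    print(poisson_lrt(200, 1000, 100, 1000))
```

Working with the pooled totals keeps the statistic in closed form, because the maximum likelihood rate under each model is simply the pooled count divided by the pooled total.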

    Next we present an example data analysis using the pipeline. We use the E. coli ampicillin resistance dataset from Earle et al., 2016 , which was also analyzed by Rahman et al., 2018 . The dataset contains sequenced reads from 241 strains, of which 189 were ampicillin resistant and the rest were not. First, Jellyfish 2 was used to count k-mers in reads from each sample. Of the 176,284,643 distinct k-mers in total, the first step of HAWK identified 4,752,738 and 4,007,202 k-mers to be associated with cases and controls respectively before correcting for confounders. Next, Eigenstrat was run on 342,988 randomly chosen k-mers to detect population structure. Figure 2A shows the PCA plot of the samples along the first two principal components revealing population stratification.

    We then adjusted the p-values using the first ten principal components and the total number of k-mers in each sample. After correcting for confounders, 4,125 k-mers were associated with cases and none with controls. The k-mers associated with cases were then assembled with ABySS, yielding 11 sequences.

    The k-mers found to be associated with ampicillin resistance were mapped to the E. coli strain DTU-1 genome [GenBank: CP026612.1] and the E. coli strain KBN10P04869 plasmid pKBN10P04869A sequence [GenBank: CP026474.1] using Bowtie 2. Manhattan plots in Figures 2B and 2C show -log10(p-values) of the k-mers against their locations in the E. coli strain DTU-1 genome and the plasmid pKBN10P04869A sequence respectively. The vertical lines denote the location of the β-lactamase TEM-1 gene. We observe that no k-mers map to the E. coli strain DTU-1 genome. However, k-mers map to the plasmid pKBN10P04869A sequence near the β-lactamase TEM-1 gene, the presence of which is known to confer ampicillin resistance.



Figure 2. Association mapping of ampicillin resistance in E. coli. (A) Plot of the first two principal components of the E. coli strains in the ampicillin resistance dataset. Manhattan plots showing -log10(p-values) of k-mers associated with ampicillin resistance and their locations in (B) Escherichia coli strain DTU-1 genome and (C) plasmid pKBN10P04869A sequence.

Notes

  1. By default, HAWK uses Bonferroni correction to address the issue of multiple testing. If the study is underpowered for Bonferroni correction, the Benjamini-Hochberg procedure can be used instead. To do this, run ./runBHCorrection after executing ‘runHawk’. The resulting k-mers will be in ‘case_kmers_bh_correction.fasta’ and ‘control_kmers_bh_correction.fasta’.

  2. By default, HAWK uses the first two principal components found by Eigenstrat, the sex of the samples and sequencing depth, in the form of total k-mer counts, as confounders. To change the default settings, edit the variables ‘noPC’ and ‘useSexConfounder’ in the script ‘runHawk’. To provide additional confounders, create a file with one line per sample and, on each line, specify the covariates as numbers separated by spaces or tabs. Then set the variable ‘covFile’ in ‘runHawk’ to the name of the confounder file.
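The difference between the two multiple testing corrections mentioned in Note 1 can be illustrated with a short Python sketch. It operates on a plain list of p-values and is not how ‘runHawk’ or ‘runBHCorrection’ are implemented; the function names and the significance level of 0.05 are assumptions made for the example.

```python
def bonferroni(p_values, alpha=0.05):
    """Indices of tests that remain significant after Bonferroni correction."""
    if not p_values:
        return []
    threshold = alpha / len(p_values)
    return [i for i, p in enumerate(p_values) if p <= threshold]


def benjamini_hochberg(p_values, alpha=0.05):
    """Indices of tests rejected by the Benjamini-Hochberg step-up procedure."""
    m = len(p_values)
    # Indices ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= (k / m) * alpha.
    largest_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * alpha:
            largest_rank = rank
    # Reject every hypothesis ranked at or below that k.
    return sorted(order[:largest_rank])


if __name__ == "__main__":
    p_values = [0.9, 0.001, 0.2, 0.025, 0.012]
    print("Bonferroni:", bonferroni(p_values))
    print("Benjamini-Hochberg:", benjamini_hochberg(p_values))
```

Bonferroni controls the family-wise error rate and is therefore stricter, whereas Benjamini-Hochberg controls the false discovery rate and typically retains more k-mers in an underpowered study.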

Recipes

The HAWK pipeline can be used to find sex specific k-mers. To do this, in the file 'gwas_info.txt', provide the sample IDs in the first column, write U in the second column and, in the third column, specify Case or Control according to whether the sample is male or female. For example, if SRR3050845 and SRR3050847 are female and SRR3050846 is male, the file will be:

SRR3050845      U      Control

SRR3050846      U      Case

SRR3050847      U      Control
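For studies with many samples, a file like the one above can be generated rather than typed. The following Python sketch builds the rows from a mapping of sample IDs to sex, following the recipe's example in which males are cases and females are controls. The function name `sex_specific_gwas_info` and the "Male"/"Female" input labels are hypothetical.

```python
def sex_specific_gwas_info(sample_sex):
    """Build gwas_info.txt content that treats male samples as cases.

    sample_sex maps a sample ID to "Male" or "Female".
    """
    status_for = {"Male": "Case", "Female": "Control"}
    rows = []
    for sample_id, sex in sample_sex.items():
        # Column 2 is written as U, as the recipe specifies.
        rows.append("\t".join([sample_id, "U", status_for[sex]]))
    return "\n".join(rows) + "\n"


if __name__ == "__main__":
    content = sex_specific_gwas_info({
        "SRR3050845": "Female",
        "SRR3050846": "Male",
        "SRR3050847": "Female",
    })
    with open("gwas_info.txt", "w") as handle:
        handle.write(content)
```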

Acknowledgments

Lior Pachter and Atif Rahman were funded in part by NIH R21 HG006583. This paper describes the protocol of a method originally presented in the paper “Association mapping from sequencing reads using k-mers” by Atif Rahman, Ingileif Hallgrímsdóttir, Michael Eisen and Lior Pachter, and extended in “A faster implementation of association mapping from k-mers” by Zakaria Mehrab, Jaiaid Mobin, Ibrahim Asadullah Tahmid and Atif Rahman.

Competing interests

The authors declare no competing interests.

References

  1. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. and Lipman, D. J. (1990). Basic local alignment search tool. J Mol Biol 215(3): 403-410.
  2. Earle, S. G., Wu, C. H., Charlesworth, J., Stoesser, N., Gordon, N. C., Walker, T. M., Spencer, C. C., Iqbal, Z., Clifton, D. A., Hopkins, K. L. and Woodford, N. (2016). Identifying lineage effects when controlling for population structure improves power in bacterial association studies. Nature Microbiol 1(5): 1-8.
  3. Jaillard, M., Lima, L., Tournoud, M., Mahe, P., van Belkum, A., Lacroix, V. and Jacob, L. (2018). A fast and agnostic method for bacterial genome-wide association studies: Bridging the gap between k-mers and genetic events. PLoS Genet 14(11): e1007758.
  4. Langmead, B. and Salzberg, S. L. (2012). Fast gapped-read alignment with Bowtie 2. Nat Methods 9(4): 357-359.
  5. Lees, J. A., Vehkala, M., Valimaki, N., Harris, S. R., Chewapreecha, C., Croucher, N. J., Marttinen, P., Davies, M. R., Steer, A. C., Tong, S. Y., Honkela, A., Parkhill, J., Bentley, S. D. and Corander, J. (2016). Sequence element enrichment analysis to determine the genetic basis of bacterial phenotypes. Nat Commun 7: 12797.
  6. Marçais, G. and Kingsford, C. (2011). A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27(6): 764-770.
  7. Mehrab, Z., Mobin, J., Tahmid, I. A. and Rahman, A. (2020). A faster implementation of association mapping from k-mers. bioRxiv. doi: https://doi.org/10.1101/2020.04.14.040675.
  8. Patterson, N., Price, A. L. and Reich, D. (2006). Population structure and eigenanalysis. PLoS Genet 2(12): e190.
  9. Price, A. L., Patterson, N. J., Plenge, R. M., Weinblatt, M. E., Shadick, N. A. and Reich, D. (2006). Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 38(8): 904-909.
  10. Rahman, A., Hallgrimsdottir, I., Eisen, M. and Pachter, L. (2018). Association mapping from sequencing reads using k-mers. Elife 7: e32920.
  11. Sheppard, S. K., Didelot, X., Meric, G., Torralbo, A., Jolley, K. A., Kelly, D. J., Bentley, S. D., Maiden, M. C., Parkhill, J. and Falush, D. (2013). Genome-wide association study identifies vitamin B5 biosynthesis as a host specificity factor in Campylobacter. Proc Natl Acad Sci U S A 110(29): 11923-11927.
  12. Simpson, J. T., Wong, K., Jackman, S. D., Schein, J. E., Jones, S. J. and Birol, I. (2009). ABySS: a parallel assembler for short read sequence data. Genome Res 19(6): 1117-1123.
  13. Voichek, Y. and Weigel, D. (2020). Identifying genetic variants underlying phenotypic variation in plants without complete genomes. Nat Genet 52(5): 534-540.

Copyright Mehrab et al. This article is distributed under the terms of the Creative Commons Attribution License (CC BY 4.0).
Citation: Readers should cite both the Bio-protocol article and the original research article where this protocol was used:
  1. Mehrab, Z., Mobin, J., Tahmid, I. A., Pachter, L. and Rahman, A. (2020). Reference-free Association Mapping from Sequencing Reads Using k-mers. Bio-protocol 10(21): e3815. DOI: 10.21769/BioProtoc.3815.
  2. Rahman, A., Hallgrimsdottir, I., Eisen, M. and Pachter, L. (2018). Association mapping from sequencing reads using k-mers. Elife 7: e32920.