RNA-seq data were quantified using Salmon 0.91 to calculate read counts and transcripts per million in a transcript- and gene-wise fashion, using the quasi-mapping offline algorithm (54) on the GRCh38 (National Center for Biotechnology Information) database. edgeR was used for differential gene expression analysis (DEA), using generalized linear regression methods, to identify pattern of differential expression following two different schemes:

1) A factorial analysis based on the definition of one group of scrambled and one group of KD samples to identify genes dysregulated similarly across short hairpins characterized by different efficiencies.

2) A numerical analysis in which log-normalized [Trimmed Mean of M-values (TMM)] BAZ1B levels, as quantified by RNA-seq, was used as independent variable.

All analyses were performed dropping individual variations (~individual+KD or ~individual+BAZ1B) to account for the genetic background of each individual. In particular, this design is expected to permit the identification of genes, which change expression level upon KD even in situations in which genotype-specific makeups would lead BAZ1B-dependent genes to have unique expression levels in scramble lines. In the factorial analysis, DEGs were identified and characterized by filtering for fold change (FC) > 1.25 and FDR < 0.05 unless explicitly indicated.

To our knowledge, performing a regression analysis at a gene-specific level has never been performed. We were able to do this because of the availability of a large set of samples (11 individuals) and because of the two short hairpins robustly respectively reducing BAZ1B expression levels, respectively by ~40 and ~70% in all individuals lines. To validate the quality of our numerical differential expression analysis, we took advantage of HipSci data (55, 56) and iPSCpoweR tools (29). We took 50 of 105 possible combinations of 13 random individual RNA-seq data from the healthy HipSci cohort, representing both sexes and having at least two technical replicates per individual. Unfortunately, HipSci does not contain at least 13 individuals with three clones per individual. Thus, we performed four alternative DEAs with edgeR (table S4) on the 50 different random combinations of 13 individuals identified (200 DEAs in total, on 22 samples, two clones per individual), using the same model matrix used for the regression analysis (~individual+BAZ1B) and using BAZ1B levels of scramble and sh2 lines. All analyses identified very low number of spurious DEGs (fig. S2E). Thus, we used the “Edg2” pipeline (table S4) because it does not discard genes with higher variability (Edg2 and Edg4 versus Edg1 and Edg3), and it is based on a better suited algorithm (Edg2 versus Edg4). With our model matrix, filtering by P < 0.01 (and FDR < 0.25), using Edg2 on a random HipSci data, we obtained an average of 93.32 DEGs (on average) with a median equal to 43 (table S5). GO enrichments were performed using topGO R package version 2.28.0.

Master regulatory analysis was performed via hypergeometric test by measuring gene set enrichments in lists of transcription factor targets provided by the TFBS tools database (57). Both GO and transcription factor enrichment analyses were performed considering background genes expressed in at least two samples in our NCSC cohort.