
Computational determination of neighborhoods enriched with deletions

Procedure

“Junction” reads (reads that align to the reference virus genome, but not as one continuous alignment) were grouped into clusters of deleted fragments with similar start and end positions. Specifically, deleted fragments whose start and end positions diverged by fewer than 10 positions were grouped into one cluster (an “elementary deletion cluster”). The size of an elementary deletion cluster is its “junction coverage”: the number of reads whose junctions fall within the cluster’s narrow start and end position intervals. The frequency of an elementary deletion cluster is the ratio of its “junction coverage” to the sum of its “junction coverage” and “continuous coverage” (the latter derived from reads with continuous alignments that cover the deletion’s start and end intervals).
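The grouping and frequency calculation above can be sketched in Python. This is a minimal illustration, not the authors' implementation: the greedy strategy, the use of a cluster's first member as its representative, the requirement that start and end each differ by fewer than 10 positions, and the names `Cluster`, `cluster_junctions` and `cluster_frequency` are all assumptions made for the example.

```python
from dataclasses import dataclass

MAX_DIVERGENCE = 10  # from the text: fewer than 10 positions of divergence


@dataclass
class Cluster:
    start: int                  # representative start (first member seen; an assumption)
    end: int                    # representative end (first member seen; an assumption)
    junction_coverage: int = 0  # number of junction reads assigned to the cluster


def cluster_junctions(junctions, max_div=MAX_DIVERGENCE):
    """Greedily group (start, end) deletion junctions into elementary clusters.

    Assumption: a junction joins a cluster when both its start and its end
    differ from the cluster representative by fewer than max_div positions.
    """
    clusters = []
    for start, end in sorted(junctions):
        for c in clusters:
            if abs(start - c.start) < max_div and abs(end - c.end) < max_div:
                c.junction_coverage += 1
                break
        else:
            # No existing cluster is close enough: open a new one.
            clusters.append(Cluster(start, end, 1))
    return clusters


def cluster_frequency(junction_coverage, continuous_coverage):
    """Frequency = junction coverage / (junction coverage + continuous coverage)."""
    return junction_coverage / (junction_coverage + continuous_coverage)
```

For example, a cluster supported by 3 junction reads at a locus covered by 997 continuously aligned reads would have a frequency of 3 / (3 + 997) = 0.003, i.e. 0.3%.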

The most abundant deletions (with frequencies higher than 0.0085%) from all high-MOI replicates were selected to determine the regions of the viral genome in which deletions were more predominant. Since the deletions passing this threshold are not uniformly distributed, enriched areas can be found unequivocally by applying the nested neighborhood algorithm in a two-dimensional plane in which the start and end positions of the deletions are the X and Y coordinates. This method detects the area (neighborhood) enriched by points around a certain center by sequentially extending the neighborhood’s border (to the distance from the center to the next closest point) and calculating the neighborhood’s fractal dimension at each step to obtain more accurate p values of enrichment. For the calculation of the fractal dimension, refer to the “Fractal dimensions of neighborhoods” section.

Putative centers of the neighborhoods enriched by deletions were detected on the plane using a grid method. Namely, a number of deletions (“points”) were randomly selected as grid references, and the distances from every point to these reference points were calculated. For every point, the product of the rounded logarithms of its Euclidean distances to the reference points was used as a hashing index. The hashing indexes of all points were sorted, and sufficiently large islands of points in the sorted list that share the same hash index (threshold ≥ 7 elementary deletion points) were considered to contain putative centers of significant enrichment. Any point of an island can be used as a center for the subsequent determination of the neighborhood of that center that is most significantly enriched by points/deletions.

Let the null hypothesis be that all points/deletions are uniformly distributed on the start/end plane (i.e., no enrichment). Then the probability of a given number of points falling in a neighborhood of radius $r$ around the center, with volume $V_r$ (an area if the space is a two-dimensional plane), can be calculated from the Poisson distribution. Indeed, if $n$ points are uniformly distributed in a neighborhood of radius $R$, $R > r$, with volume $V_R$, then the number of points in the neighborhood with the same center and radius $r$ (and therefore volume $V_r$) is a random variable with a Poisson distribution with parameter $\lambda = \alpha \cdot r^k$, where $k$ is the dimension of the space and $\alpha = n / R^k$ is the density of the uniform distribution of the $n$ points in $V_R$. Thus, the expected number of points, $\lambda$, is proportional to $V_r$. The probability $P$ of finding $m$ or more points in a neighborhood of radius $r$ (the p value) is:

$$P(X \ge m) = \sum_{i=m}^{\infty} \frac{\lambda^{i} e^{-\lambda}}{i!} = 1 - \sum_{i=0}^{m-1} \frac{\lambda^{i} e^{-\lambda}}{i!}$$
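The Poisson tail probability and the grid hashing step described above can be illustrated with a short Python sketch. Several details are assumptions made only for this example, since the text does not specify them: the number of reference points (`n_refs=3`), the logarithm base and the use of log(1 + distance) so that a zero distance is defined, and all function names. The tail is the standard Poisson upper tail for the stated parameter λ.

```python
import math
import random
from collections import defaultdict


def hash_index(point, references):
    """Product of the rounded logs of the distances to the reference points.

    Assumption: natural log of (1 + distance), so zero distances are defined.
    """
    index = 1
    for ref in references:
        index *= round(math.log1p(math.dist(point, ref)))
    return index


def putative_center_islands(points, n_refs=3, min_size=7, seed=0):
    """Return groups ("islands") of points that share one hash index.

    Grouping in a dict is equivalent to sorting the indexes and taking
    runs of equal values; min_size=7 is the threshold given in the text.
    """
    references = random.Random(seed).sample(points, n_refs)
    islands = defaultdict(list)
    for p in points:
        islands[hash_index(p, references)].append(p)
    return [members for members in islands.values() if len(members) >= min_size]


def expected_count(n, r, big_r, k=2):
    """lambda = alpha * r**k with alpha = n / R**k (k = dimension of the space)."""
    return n * (r / big_r) ** k


def poisson_tail(m, lam):
    """P(X >= m) for X ~ Poisson(lam), as 1 - sum_{i<m} exp(-lam) * lam**i / i!."""
    term = math.exp(-lam)  # i = 0 term of the Poisson mass function
    cdf = 0.0
    for i in range(m):
        cdf += term
        term *= lam / (i + 1)
    return max(0.0, 1.0 - cdf)
```

The complement form used in `poisson_tail` mirrors the formula directly but loses precision for very small p values; summing the upper tail directly is more accurate in that regime.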

All deletions/points were sorted according to their closeness to the selected central point. The regions most enriched by points were then determined as follows. In the sorted list, consider the transition from a neighborhood with radius $r_t$, equal to the distance from the center to point $t$ in the sorting, to a neighborhood with radius $r_{t+1}$. The Poisson p value of the enrichment of the neighborhood of radius $r_t$ containing $t$ points is calculated from the perspective of the extended neighborhood with radius $r_{t+1}$, under the assumption that its $t + 1$ points are uniformly distributed in this volume of a space of fractal dimension; the fractal dimension is defined by the sequence of distances from the $t + 1$ points to the center. For non-uniform dense areas, the Hausdorff fractal dimension is higher than the geometrical dimension of the plane, which is two. This higher dimension makes the drop in p value on the transition from $t$ to $t + 1$ sharper than in two-dimensional space. The $t$-neighborhood with the most significant Poisson p value was selected as the best neighborhood, i.e., the one that is most enriched by deletions/points.
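The nested neighborhood scan can be sketched as follows. This is an illustration under stated assumptions rather than the published implementation: the fractal dimension is defined in a separate section, so it is passed in here as a hypothetical `dimension` callable whose default simply returns the geometrical dimension 2; the function names are likewise invented for the example. For each $t$, the expected count is taken as $\lambda = (t + 1)\,(r_t / r_{t+1})^k$, which follows from $\lambda = \alpha \cdot r^k$ with $\alpha = n / R^k$, $n = t + 1$, $r = r_t$ and $R = r_{t+1}$.

```python
import math


def poisson_upper_tail(m, lam):
    """P(X >= m) for X ~ Poisson(lam), summing the upper tail in log space.

    Direct summation keeps very small p values from being rounded to zero.
    """
    if m <= 0:
        return 1.0
    if lam <= 0:
        return 0.0
    log_term = -lam + m * math.log(lam) - math.lgamma(m + 1)  # log P(X = m)
    total = 0.0
    i = m
    while True:
        term = math.exp(log_term)
        total += term
        i += 1
        log_term += math.log(lam / i)  # step from P(X = i - 1) to P(X = i)
        # Past the mode (i > lam) the terms only shrink; stop once negligible.
        if i > lam and term <= total * 1e-17:
            break
    return min(total, 1.0)


def best_neighborhood(center, points, dimension=lambda dists: 2.0):
    """Return (t, radius, p_value) of the most enriched nested neighborhood.

    dimension: hypothetical callable returning the dimension k from the sorted
    distances of the t + 1 nearest points; the default 2.0 is a placeholder
    for the fractal dimension described elsewhere in the article.
    """
    dists = sorted(math.dist(center, p) for p in points if p != center)
    best = None
    for t in range(1, len(dists)):
        r_t, r_next = dists[t - 1], dists[t]
        if r_next == 0:
            continue  # coincident points: the ratio r_t / r_next is undefined
        k = dimension(dists[: t + 1])
        lam = (t + 1) * (r_t / r_next) ** k  # expected points within r_t
        p_value = poisson_upper_tail(t, lam)
        if best is None or p_value < best[2]:
            best = (t, r_t, p_value)
    return best
```

With a tight group of points followed by a large gap to the next point, the scan selects the neighborhood that ends just before the gap, because the expected count $\lambda$ collapses there while the observed count $t$ does not.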
