nature methods | ADVANCE ONLINE PUBLICATION | 1
correspondence
there has been a homozygous insertion or deletion at this location,
the distribution p(Ci) will shift (Fig. 1a). If the observed
cluster is the site of a heterozygous indel, approximately half of
the observed mate pairs will be generated from the shifted
distribution, and the other half will come from the original, unshifted
p(Y) (Fig. 1b). MoDIL represents the expected size of the indel
(the mean insert size minus the mapped distance) with two random
variables, one for each haplotype.
Given a cluster, MoDIL identifies the two distributions, {D1,D2},
with the fixed shape of p(Y) and arbitrary means, that best fit
the observed data according to the Kolmogorov-Smirnov test. To find
the means of the two distributions, MoDIL uses the
expectation-maximization algorithm with appropriate Bayesian priors to
prevent over-fitting.
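
The two-haplotype mixture described above can be illustrated with a small sketch. It is not MoDIL's implementation: it assumes p(Y) is Gaussian with a known standard deviation, fixes the mixing weights at one half each, omits the Bayesian priors, and uses simulated data. The names fit_two_means, ks_statistic, LIBRARY_MEAN and LIBRARY_SD are hypothetical and chosen only for illustration.

```python
import math
import random


def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def normal_cdf(x, mean, sd):
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def fit_two_means(distances, sd, iterations=200):
    """EM for an equal-weight mixture of two Gaussians sharing a known sd.

    Only the two means are estimated; the weights stay at 0.5 each
    (an assumption of this sketch, with no priors applied).
    """
    # Start the two means at the extremes of the data so they begin apart.
    m1, m2 = min(distances), max(distances)
    for _ in range(iterations):
        # E-step: responsibility of component 1 for each observation.
        resp = []
        for x in distances:
            a = normal_pdf(x, m1, sd)
            b = normal_pdf(x, m2, sd)
            resp.append(a / (a + b) if a + b > 0.0 else 0.5)
        # M-step: responsibility-weighted means.
        w1 = sum(resp)
        w2 = len(distances) - w1
        if w1 > 0.0:
            m1 = sum(r * x for r, x in zip(resp, distances)) / w1
        if w2 > 0.0:
            m2 = sum((1.0 - r) * x for r, x in zip(resp, distances)) / w2
    return min(m1, m2), max(m1, m2)


def ks_statistic(distances, m1, m2, sd):
    """Largest gap between the empirical CDF and the fitted mixture CDF."""
    xs = sorted(distances)
    n = len(xs)
    worst = 0.0
    for i, x in enumerate(xs):
        model = 0.5 * normal_cdf(x, m1, sd) + 0.5 * normal_cdf(x, m2, sd)
        worst = max(worst, abs(model - i / n), abs(model - (i + 1) / n))
    return worst


# Simulated heterozygous 30-bp deletion: half of the mapped distances
# come from the unshifted library distribution, half are 30 bp longer.
rng = random.Random(0)
LIBRARY_MEAN, LIBRARY_SD = 200.0, 5.0  # hypothetical library parameters
observed = [rng.gauss(LIBRARY_MEAN, LIBRARY_SD) for _ in range(200)]
observed += [rng.gauss(LIBRARY_MEAN + 30.0, LIBRARY_SD) for _ in range(200)]

mean_a, mean_b = fit_two_means(observed, LIBRARY_SD)
# Indel size per haplotype = library mean minus fitted mean mapped distance
# (negative values indicate a deletion under this sign convention).
indel_a = LIBRARY_MEAN - mean_a
indel_b = LIBRARY_MEAN - mean_b
fit_quality = ks_statistic(observed, mean_a, mean_b, LIBRARY_SD)
```

In this simulation one fitted haplotype should sit near 0 bp and the other near -30 bp, and a small Kolmogorov-Smirnov statistic indicates that the two-component model describes the cluster well. A homozygous indel would instead appear as two fitted means that nearly coincide at the shifted position.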
MoDIL: detecting small indels from clone-end sequencing with mixtures of distributions
To the Editor: Human genetic variation comes in a wide range
of sizes, from single-nucleotide polymorphisms and very small
insertions and deletions (indels) to ‘structural’ variants, in which
large segments of the genome are inserted, deleted, inverted or
duplicated. Recently, several methods have been developed for
identifying both small indels (<10 base pairs (bp))1 and larger ones
(>50 bp)2,3 from high-throughput sequencing data.
There should also be a large amount of ‘medium-sized’ variation:
insertions and deletions of 10–50 nucleotides. Here we describe
MoDIL (mixture of distributions indel locator), the first
method to identify 20–50-bp indels from
high-throughput sequencing data. MoDIL
is available at http://compbio.cs.toronto.edu/modil/.
Most sequencing techniques allow for
the generation of mate pairs: two reads separated by
an approximately known distance (the insert
size). Mate pairs are used to locate structural
variants by comparing the distance
between the mapped locations of the read
pairs