We revisit a fundamental problem in string matching: given a pattern of length m and a text of length n, both over an alphabet of size sigma, compute the Hamming distance (i.e., the number of mismatches) between the pattern and the text at every location. Several randomized (1+eps)-approximation algorithms have been proposed in the literature (e.g., by Karloff (Inf. Proc. Lett., 1993), Indyk (FOCS 1998), and Kopelowitz and Porat (SOSA 2018)), with running time of the form O(eps^{-O(1)} n log n log m), all using fast Fourier transform (FFT). We describe a simple randomized (1+eps)-approximation algorithm that is faster and does not need FFT. Combining our approach with additional ideas leads to numerous new results (all Monte-Carlo randomized) in different settings:

- We design the first truly
*linear-time*approximation algorithm for constant eps; the running time is O(eps^{-2}n). In fact, the time bound can be made slightly sublinear in n if the alphabet size sigma is small (by using bit packing tricks). - We apply our approximation algorithms to design a faster
*exact*algorithm computing all Hamming distances up to a threshold k; its runtime of O(n + min(nk sqrt{log m} / sqrt{m}, nk^2/m)) improves upon previous results by logarithmic factors and is linear for k <= sqrt{m}. - We alternatively design approximation algorithms with better
*eps-dependence*, by using fast rectangular matrix multiplication. In fact, the time bound is O(n polylog n) when the pattern is sufficiently long, i.e., m >= eps^{-c} for a specific constant c. Previous algorithms with the best eps-dependence require O(eps^{-1} n polylog n) time. - When k is not too small, we design a truly
*sublinear-time*algorithm to find all locations with Hamming distance approximately (up to a constant factor) less than k, in time O((n/k^{Omega(1)} + occ) n^{o(1)}) time, where occ is the output size. The algorithm leads to a*property tester*for pattern matching that costs O~(delta^{-1/3}n^{2/3} + delta^{-1}n/m) time and, with high probability, returns true if an exact match exists and false if the Hamming distance is more than delta*m at every location. - We design a
*streaming*algorithm that approximately computes the Hamming distance for all locations with the distance approximately less than k, using O(eps^{-2} sqrt{k}) space. Previously, streaming algorithms were known for the exact problem with O(k) space (which is tight up to the polylogn factor) or for the approximate problem with O~(eps^{-O(1)} sqrt{m}) space. For the special case of k=m, we improve the space usage to O~(eps^{-1.5} sqrt{m}).

A set of strings, called a *string dictionary*, is a basic string data structure. The most primitive query, where one seeks the existence of a pattern in the dictionary, is called a *lookup query*. Approximate lookup queries, i.e., to lookup the existence of a pattern with a bounded number of errors, is a fundamental string problem. Several data structures have been proposed to do so efficiently. Almost all solutions consider a single error, as will this result. Lately, Belazzougui and Venturini (CPM 2013) raised the question whether one can construct efficient indexes that support lookup queries with one error in optimal query time, that is, O(|p|/w + occ), where p is the query, w the machine word-size, and occ the number of occurrences.

Specifically, for the problem of one mismatch and constant alphabet size, we obtain optimal query time. For a dictionary of d strings our proposed index uses O(w d log^{1+eps}d) additional bit space (beyond the dictionary which can be maintained in compressed form). Our results are parameterized for a space-time tradeoff.

We propose more results for the case of lookup queries with one insertion/deletion on dictionaries over a constant sized alphabet. These results are especially effective for large patterns.

We present a collection of new results on problems related to 3SUM, including:

- The first truly subquadratic algorithm for
- computing the (min,+) convolution for monotone increasing sequences with integer values bounded by O(n),
- solving 3SUM for monotone sets in 2D with integer coordinates bounded by O(n), and
- preprocessing a binary string for histogram indexing (also called jumbled indexing).

- The first algorithm for histogram indexing for any constant alphabet size that achieves truly subquadratic preprocessing time and truly sublinear query time.
- A truly subquadratic algorithm for integer 3SUM in the case when the given set can be partitioned into n^{1-delta} clusters each covered by an interval of length n, for any constant delta > 0.
- An algorithm to preprocess any set of n integers so that subsequently 3SUM on any given subset can be solved in O(n^{13/7} polylog n) time.

- Preliminary arXiv version
- PDF talk slides
- In Proc. 47th ACM Symposium on Theory of Computing (STOC), pages 31-40, 2015

Jumbled indexing is the problem of indexing a text T for queries that ask whether there is a substring of T matching a pattern represented as a Parikh vector, i.e., the vector of frequency counts for each character. Jumbled indexing has garnered a lot of interest in the last four years. There is a naive algorithm that preprocesses all answers in O(n^2 |Sigma|) time allowing quick queries afterwards, and there is another naive algorithm that requires no preprocessing but has O(n log |Sigma|) query time. Despite a tremendous amount of effort there has been little improvement over these running times.

In this paper we provide good reason for this. We show that, under a 3SUM-hardness assumption, jumbled indexing for alphabets of size omega(1) requires Omega(n^{2-epsilon}) preprocessing time or Omega(n^{1-delta}) query time for any epsilon,delta>0. In fact, under a stronger 3SUM-hardness assumption, for any constant alphabet size r >= 3 there exist describable fixed constant epsilon_r and delta_r such that jumbled indexing requires Omega(n^{2-epsilon_r}) preprocessing time or Omega(n^{1-delta_r}) query time.

- PDF file | arXiv version
- In Proc. 41st International Colloquium on Automata, Languages, and Programming (ICALP), Lecture Notes in Computer Science, volume 8572, pages 114-125, 2014

The documents contained in this directory are included by the contributing authors as a means to ensure timely dissemination of scholarly and technical work on a non-commercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.

Timothy Chan (Last updated Oct 2020)