Timothy M. Chan's Publications: String algorithms and text indexing


Approximating pattern-to-text Hamming distances

(with
Shay Golan, Tomasz Kociumaka, Tsvi Kopelowitz, and Ely Porat)

We revisit a fundamental problem in string matching: given a pattern of length m and a text of length n, both over an alphabet of size sigma, compute the Hamming distance (i.e., the number of mismatches) between the pattern and the text at every location. Several randomized (1+eps)-approximation algorithms have been proposed in the literature (e.g., by Karloff (Inf. Proc. Lett., 1993), Indyk (FOCS 1998), and Kopelowitz and Porat (SOSA 2018)), with running times of the form O(eps^{-O(1)} n log n log m), all using the fast Fourier transform (FFT). We describe a simple randomized (1+eps)-approximation algorithm that is faster and does not need FFT. Combining our approach with additional ideas leads to numerous new results (all Monte-Carlo randomized) in different settings:

  1. We design the first truly linear-time approximation algorithm for constant eps; the running time is O(eps^{-2}n). In fact, the time bound can be made slightly sublinear in n if the alphabet size sigma is small (by using bit packing tricks).
  2. We apply our approximation algorithms to design a faster exact algorithm computing all Hamming distances up to a threshold k; its runtime of O(n + min(nk sqrt{log m} / sqrt{m}, nk^2/m)) improves upon previous results by logarithmic factors and is linear for k <= sqrt{m}.
  3. We alternatively design approximation algorithms with better eps-dependence, by using fast rectangular matrix multiplication. In fact, the time bound is O(n polylog n) when the pattern is sufficiently long, i.e., m >= eps^{-c} for a specific constant c. Previous algorithms with the best eps-dependence require O(eps^{-1} n polylog n) time.
  4. When k is not too small, we design a truly sublinear-time algorithm to find all locations with Hamming distance approximately (up to a constant factor) less than k, in O((n/k^{Omega(1)} + occ) n^{o(1)}) time, where occ is the output size. The algorithm leads to a property tester for pattern matching that costs O~(delta^{-1/3}n^{2/3} + delta^{-1}n/m) time and, with high probability, returns true if an exact match exists and false if the Hamming distance is more than delta*m at every location.
  5. We design a streaming algorithm that approximately computes the Hamming distance for all locations with the distance approximately less than k, using O(eps^{-2} sqrt{k}) space. Previously, streaming algorithms were known for the exact problem with O(k) space (which is tight up to a polylog n factor) or for the approximate problem with O~(eps^{-O(1)} sqrt{m}) space. For the special case of k=m, we improve the space usage to O~(eps^{-1.5} sqrt{m}).
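
To make the problem statement concrete, here is a minimal Python sketch of the naive O(nm) exact baseline, together with a simple position-sampling estimator. Neither is the algorithm of the paper: the function names, the `samples` parameter, and the fixed `seed` are assumptions made for illustration. The estimator is unbiased, but its relative error can be poor when the true distance is small, so it should not be read as a (1+eps)-approximation.

```python
import random

def hamming_distances(pattern, text):
    """Exact Hamming distance between the pattern and every aligned
    window of the text, by direct comparison in O(nm) time."""
    m, n = len(pattern), len(text)
    return [
        sum(1 for j in range(m) if pattern[j] != text[i + j])
        for i in range(n - m + 1)
    ]

def sampled_hamming_estimates(pattern, text, samples, seed=0):
    """Illustrative estimator (an assumption, not the paper's method):
    check the same random sample of pattern positions at every
    alignment and rescale the mismatch count by m / samples."""
    rng = random.Random(seed)
    m, n = len(pattern), len(text)
    positions = [rng.randrange(m) for _ in range(samples)]
    scale = m / samples
    return [
        scale * sum(1 for j in positions if pattern[j] != text[i + j])
        for i in range(n - m + 1)
    ]
```

The exact routine is useful mainly as a reference against which any approximate output can be checked on small inputs.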


Fast string dictionary lookup with one error

(with
Moshe Lewenstein)

A set of strings, called a string dictionary, is a basic string data structure. The most primitive query, where one seeks the existence of a pattern in the dictionary, is called a lookup query. Approximate lookup, i.e., looking up the existence of a pattern with a bounded number of errors, is a fundamental string problem. Several data structures have been proposed to do so efficiently. Almost all solutions consider a single error, as will this result. Lately, Belazzougui and Venturini (CPM 2013) raised the question of whether one can construct efficient indexes that support lookup queries with one error in optimal query time, that is, O(|p|/w + occ), where p is the query, w the machine word-size, and occ the number of occurrences.

Specifically, for the problem of one mismatch and constant alphabet size, we obtain optimal query time. For a dictionary of d strings our proposed index uses O(w d log^{1+eps}d) additional bit space (beyond the dictionary which can be maintained in compressed form). Our results are parameterized for a space-time tradeoff.
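
As an illustration of the one-mismatch lookup query itself, the following Python sketch uses a simple hash-based wildcard index. It is a textbook-style baseline, not the index proposed in the paper, and it comes nowhere near the stated space or query bounds (slicing alone costs time quadratic in |p| per query). The names `build_index`, `lookup_one_mismatch`, and `WILDCARD` are hypothetical, and the sketch assumes non-empty strings that never contain the wildcard character.

```python
from collections import defaultdict

WILDCARD = "\0"  # assumption: this character never occurs in any string

def build_index(dictionary):
    """Map every 'one position masked' variant of each dictionary
    string to the set of strings that produce it."""
    index = defaultdict(set)
    for s in dictionary:
        for i in range(len(s)):
            index[s[:i] + WILDCARD + s[i + 1:]].add(s)
    return index

def lookup_one_mismatch(index, p):
    """Return all dictionary strings of the same length as p that
    differ from p in at most one position (exact matches included)."""
    found = set()
    for i in range(len(p)):
        found |= index.get(p[:i] + WILDCARD + p[i + 1:], set())
    return found
```

For example, with the dictionary `["cat", "cot", "dog", "cart"]`, the query `"cut"` returns both `"cat"` and `"cot"`, while `"car"` returns only `"cat"` because `"cart"` has a different length.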

We propose more results for the case of lookup queries with one insertion/deletion on dictionaries over a constant-sized alphabet. These results are especially effective for large patterns.


Clustered integer 3SUM via additive combinatorics

(with
Moshe Lewenstein)

We present a collection of new results on problems related to 3SUM, including:

All these results are obtained by a surprising new technique, based on the Balog-Szemeredi-Gowers Theorem from additive combinatorics.
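
For readers unfamiliar with the base problem, here is a minimal Python sketch of the standard quadratic-time 3SUM baseline (sort, then scan with two pointers). It assumes the common formulation "find three input numbers summing to zero"; it is only a reference point and does not reflect the additive-combinatorics technique of the paper. The function name `three_sum` is hypothetical.

```python
def three_sum(values):
    """Return a triple (a, b, c), taken from three distinct input
    positions, with a + b + c == 0, or None if no such triple exists.
    Runs in O(n^2) time after sorting."""
    a = sorted(values)
    n = len(a)
    for i in range(n - 2):
        lo, hi = i + 1, n - 1
        while lo < hi:
            s = a[i] + a[lo] + a[hi]
            if s == 0:
                return (a[i], a[lo], a[hi])
            if s < 0:
                lo += 1   # sum too small: advance the lower pointer
            else:
                hi -= 1   # sum too large: retreat the upper pointer
    return None
```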


On hardness of jumbled indexing

(with
Amihood Amir, Moshe Lewenstein, and Noa Lewenstein)

Jumbled indexing is the problem of indexing a text T for queries that ask whether there is a substring of T matching a pattern represented as a Parikh vector, i.e., the vector of frequency counts for each character. Jumbled indexing has garnered a lot of interest in the last four years. There is a naive algorithm that preprocesses all answers in O(n^2 |Sigma|) time allowing quick queries afterwards, and there is another naive algorithm that requires no preprocessing but has O(n log |Sigma|) query time. Despite a tremendous amount of effort there has been little improvement over these running times.

In this paper we provide good reason for this. We show that, under a 3SUM-hardness assumption, jumbled indexing for alphabets of size omega(1) requires Omega(n^{2-epsilon}) preprocessing time or Omega(n^{1-delta}) query time for any epsilon,delta>0. In fact, under a stronger 3SUM-hardness assumption, for any constant alphabet size r >= 3 there exist describable fixed constants epsilon_r and delta_r such that jumbled indexing requires Omega(n^{2-epsilon_r}) preprocessing time or Omega(n^{1-delta_r}) query time.
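
The naive no-preprocessing query mentioned above can be sketched in Python as a sliding window over the text. This is an illustrative baseline only: the function name `jumbled_match` is hypothetical, the Parikh vector is represented as a `collections.Counter`, and the sketch relies on hashing rather than attempting to reproduce the exact O(n log |Sigma|) bound.

```python
from collections import Counter

def jumbled_match(text, target):
    """Return True if some substring of text has exactly the character
    counts given by target (a Parikh vector as a Counter)."""
    m = sum(target.values())
    if m == 0:
        return True            # the empty substring matches
    if m > len(text):
        return False
    # diff[c] = (count of c in the current window) - target[c]
    diff = {c: -k for c, k in target.items() if k}
    nonzero = len(diff)        # number of characters with diff != 0

    def bump(c, delta):
        nonlocal nonzero
        old = diff.get(c, 0)
        new = old + delta
        if old == 0:
            nonzero += 1       # delta is +1 or -1, so new is nonzero
        elif new == 0:
            nonzero -= 1
        diff[c] = new

    for i, c in enumerate(text):
        bump(c, +1)                  # character entering the window
        if i >= m:
            bump(text[i - m], -1)    # character leaving the window
        if i >= m - 1 and nonzero == 0:
            return True
    return False
```

A query such as `jumbled_match("abcab", Counter("bca"))` succeeds because the window `"abc"` has the same character counts as the pattern, regardless of order.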


Copyright Notice

The documents contained in this directory are included by the contributing authors as a means to ensure timely dissemination of scholarly and technical work on a non-commercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.


Timothy Chan (Last updated Oct 2020)