A set of strings, called a *string dictionary*, is a basic string data structure. The most primitive query, where one seeks the existence of a pattern in the dictionary, is called a *lookup query*. Approximate lookup queries, i.e., to lookup the existence of a pattern with a bounded number of errors, is a fundamental string problem. Several data structures have been proposed to do so efficiently. Almost all solutions consider a single error, as will this result. Lately, Belazzougui and Venturini (CPM 2013) raised the question whether one can construct efficient indexes that support lookup queries with one error in optimal query time, that is, O(|p|/w + occ), where p is the query, w the machine word-size, and occ the number of occurrences.

Specifically, for the problem of one mismatch and constant alphabet size, we obtain optimal query time. For a dictionary of d strings our proposed index uses O(w d log^{1+eps}d) additional bit space (beyond the dictionary which can be maintained in compressed form). Our results are parameterized for a space-time tradeoff.

We propose more results for the case of lookup queries with one insertion/deletion on dictionaries over a constant sized alphabet. These results are especially effective for large patterns.

We present a collection of new results on problems related to 3SUM, including:

- The first truly subquadratic algorithm for
- computing the (min,+) convolution for monotone increasing sequences with integer values bounded by O(n),
- solving 3SUM for monotone sets in 2D with integer coordinates bounded by O(n), and
- preprocessing a binary string for histogram indexing (also called jumbled indexing).

- The first algorithm for histogram indexing for any constant alphabet size that achieves truly subquadratic preprocessing time and truly sublinear query time.
- A truly subquadratic algorithm for integer 3SUM in the case when the given set can be partitioned into n^{1-delta} clusters each covered by an interval of length n, for any constant delta > 0.
- An algorithm to preprocess any set of n integers so that subsequently 3SUM on any given subset can be solved in O(n^{13/7} polylog n) time.

- Preliminary arXiv version
- PDF talk slides
- In Proc. 47th ACM Symposium on Theory of Computing (STOC), pages 31-40, 2015

Jumbled indexing is the problem of indexing a text T for queries that ask whether there is a substring of T matching a pattern represented as a Parikh vector, i.e., the vector of frequency counts for each character. Jumbled indexing has garnered a lot of interest in the last four years. There is a naive algorithm that preprocesses all answers in O(n^2 |Sigma|) time allowing quick queries afterwards, and there is another naive algorithm that requires no preprocessing but has O(n log |Sigma|) query time. Despite a tremendous amount of effort there has been little improvement over these running times.

In this paper we provide good reason for this. We show that, under a 3SUM-hardness assumption, jumbled indexing for alphabets of size omega(1) requires Omega(n^{2-epsilon}) preprocessing time or Omega(n^{1-delta}) query time for any epsilon,delta>0. In fact, under a stronger 3SUM-hardness assumption, for any constant alphabet size r >= 3 there exist describable fixed constant epsilon_r and delta_r such that jumbled indexing requires Omega(n^{2-epsilon_r}) preprocessing time or Omega(n^{1-delta_r}) query time.

- PDF file | arXiv version
- In Proc. 41st International Colloquium on Automata, Languages, and Programming (ICALP), Lecture Notes in Computer Science, volume 8572, pages 114-125, 2014

The documents contained in this directory are included by the contributing authors as a means to ensure timely dissemination of scholarly and technical work on a non-commercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.

Timothy Chan (Last updated September 2018)