Category Archives: Uncategorized

News for June 2024

A nice diverse month, this June 2024. We have a selection of papers on quantum property testing, shortest paths, polynomial threshold functions, and nearest neighbor search. And a nice survey on a closely related topic. Onward with the papers!

Invitation to Local Algorithms by Václav Rozhoň (arXiv). This is a great survey on local algorithms, a topic (roughly) about message passing distributed algorithms on graphs. While that is not a sublinear algorithm per se, the core combinatorial question has relevance to our readers. Informally, a “local problem” is one where, if a solution is incorrect, that can be determined by looking at a small radius around some vertex. For example, consider problems like maximal independent set, matching, etc. The problem is to construct such a solution in a distributed manner by only looking at low radius balls. There is a direct connection with Local Computation Algorithms (LCAs), with the primary difference being the complexity resource. In Local Algorithms as defined, one is interested in the number of rounds of computation, corresponding to the radius of balls. In LCAs, we care about the number of vertices seen, which is the size of the balls. The survey gives a detailed discussion of these connections.

Testably Learning Polynomial Threshold Functions by Lucas Slot, Stefan Tiegel, and Manuel Wiedmer (arXiv). Consider the classic model of agnostic learning with a distributional assumption. Our aim is to learn a function \(f: X \to \{-1,1\}\). The learner gets labeled samples \((x,f(x))\), where \(x\) is drawn from an unknown distribution \(\mathcal{D}\). In agnostic learning, we aim to output a hypothesis \(\hat{f}\) such that \(\|f – \hat{f}\|\) is small, where distance is measured according to \(\mathcal{D}\). Often one makes distributional assumptions (like \(\mathcal{D}\) is Gaussian) to get non-trivial results. This is a severe limitation, since such distributional assumptions cannot be validated. A recent model of Rubinfeld-Vasilyan asks for “tester-learner” pairs that address this issue. We first run a “tester” to check the property (say, Gaussian) of the distribution. If the tester passes, then we run the learner. If \(\mathcal{D}\) satisfies the distributional property, the tester is guaranteed to pass (whp). If the tester passes, then the learner is guaranteed to output a valid hypothesis. Thus, the tester is allowed to pass distributions that may be far from satisfying the property, but in such situations, the learner outputs a correct hypothesis. This paper gives such tester-learner pairs for learning polynomial threshold functions with polynomially many samples. Previously, such learning results were not known even for degree-2 PTFs with respect to the Gaussian distribution.

A Sublinear Algorithm for Approximate Shortest Paths in Large Networks by Sabyasachi Basu, Nadia Kōshima, Talya Eden, Omri Ben-Eliezer, C. Seshadhri (arXiv). This paper brings a mix of a classic problem, a real-world observation, and a new graph class. Finding shortest paths in graphs is about as classic as it gets. There is a plethora of applied work on shortest paths algorithms for real-world networks, much of which is based on landmarking. One precomputes distances between chosen landmarks, and then uses those to output shortest path distance queries. This paper asks: is it possible to get approximation shortest paths in sublinear time? Hence, we are not even allowed to preprocess the entire graph. This paper builds on previous work in social networks about “core-periphery” structure. The core is a denser, small region of the graph; the rest of the vertices form the periphery that connect to the core. The hypothesis is that shortest paths in real-world graphs typically go via the core. Given the (small) core, one can build small sublinear data structures that quickly answer shortest path queries. The paper uses insights from previous work on core-periphery structure, and gives a practical sublinear algorithm for computing approximate shortest paths. If the graph comes from a Chung-Lu distribution, then queries can be provably answered in \(n^{o(1)}\) time.

Quantum Property Testing Algorithm for the Concatenation of Two Palindromes Language by Kamil Khadiev and Danil Serov (arXiv). An early result of property testing is that all regular languages can tested in constant time. But the simple (non-regular) language \(\mathcal{L}\) of strings formed by concatenating two palindromes requires super-constant queries. Specifically, the complexity of property testing \(\mathcal{L}\) is \(\Theta(\sqrt{n})\) (ignoring log and \(\varepsilon\) factors). Here, \(n\) denotes the string size. This paper shows that there exist quantum property testing algorithms that beat this bound. There is a quantum property tester for the language \(\mathcal{L}\) that makes \(\widetilde{O}(n^{1/3})\) queries. The paper also shows that quantum query complexity of the exact decision problem is \(\Theta(\sqrt{n})\). Thus, one gets non-trivial quantum speedup for both the property testing and exact problem.

A Bi-metric Framework for Fast Similarity Search by Haike Xu, Sandeep Silwal, and Piotr Indyk (arXiv). Nearest-neighbor (NN) search is one of the most important algorithmic tasks in modern data analysis. The usual setting is that we have a set of \(n\) points over a metric, denoted by \(D\). We preprocess the points and construct a data structure. Given a query point \(q\), we wish to find the (approximate) closest points from our data set, in time sublinear in the data size. This paper introduces a bi-metric setting. Think of \(D\) as expensive to compute, while there exists a cheap “proxy metric” \(d\). For example, if the points represent images, the expensive metric \(D\) could be some complex (but semantically accurate) metric, while \(d\) may represent a simple Euclidean metric over an embedding. Formally, \(d\) is a (large) constant factor approximation of \(D\). This paper gives an algorithm that performs nearest neighbor search, but only preprocesses over \(d\). Given, a query \(q\), it makes sublinear queries to the expensive \(D\). All in all, it is a truly sublinear algorithm in terms of \(D\), and leverages the preprocessing over the cheap metric \(d\). Technically, this paper gives a meta-algorithm that can be applied to any “base” NN algorithm. The paper proves results for two popular NN algorithms, and gives extensive empirical evidence for the underlying approach.

News for January 2021

The first month of 2021 has brought with it 5 papers, covering graph testing, Boolean function testing, and distribution testing — as well as database theory. Let’s dive into it.

Random walks and forbidden minors III: \(\mathrm{poly}(d/\varepsilon)\)-time partition oracles for minor-free graph classes, by Akash Kumar, C. Seshadhri, and Andrew Stolman (arXiv). Minor-closed bounded-degree graphs have a very nice property: denoting by \(d\) the degree bound and \(n\) the number of edges, it is always possible to partition any such graph into components of constant size, \(O_\varepsilon(1)\), just by removing a linear number of edges, merely \(\varepsilon dn\) (\(\varepsilon\) being a small parameter). This is a crucial result in many graph property testing algorithms, those which rely on something called a “partition oracle”: loosely speaking, a routine which makes “few” queries to the graph, and is able to indicate which component of an underlying partition any given vertex belongs to. But what is “few” here? The first partition oracles made \(d^{\mathrm{poly}(d,1/\varepsilon)}\) queries to the graph to answer any such request. This later got significantly improved to \(d^{\mathrm{log}(d/\varepsilon)}\). Using spectral graph theory techniques previously developed by the authors (hence the “III” in the title), this work settles the question, achieving partition oracles which as good as it gets: making only \(\mathrm{poly}(d,1/\varepsilon)\) queries to the graph! This in turns has immediate consequences for graph property testing, which the paper details.

And since we are on the topic of oracles… more exciting news on that front:

Spectral Clustering Oracles in Sublinear Time, by Grzegorz Gluch, Michael Kapralov, Silvio Lattanzi, Aida Mousavifar, Christian Sohler (arXiv). Given a graph \(G\) and an integer \(k\), one often want to partition the graph into components \(C_1,C_2,\dots,C_k\), such that each \(C_i\) is well-connected and has few edges to the other \(C_j\)’s. But can we get ultra efficient algorithms for such spectral clusterings? Specifically, can we design oracles for them: sublinear-time algorithms which provide implicit access to an underlying “good” spectral clustering \(\hat{C}_1,\hat{C}_2,\dots,\hat{C}_k\), by returning on any query vertex \(v\) the index of the cluster \(\hat{C}_i\) to which \(v\) belongs? This paper introduces the question, and answers it in the affirmative: in more detail, it provides a spectral clustering oracle which, for any \(\varepsilon>0\), has preprocessing time \(2^{\mathrm{poly}(k/\varepsilon)}n^{1/2+O(\varepsilon)}\), query time \(n^{1/2+O(\varepsilon)}\), and space \(n^{1/2+O(\varepsilon)}\); and provides access to a clustering with relative error \(O(\varepsilon\log k)\) per cluster. The paper also allows tradeoffs between query time and space, and discusses applications to the Local Computation Algorithms (LCA) model.

Next stop, distribution testing…

The Sample Complexity of Robust Covariance Testing, by Ilias Diakonikolas and Daniel M. Kane (arXiv). Suppose you have i.i.d. samples from some high-dimensional Gaussian \(\mathcal{N}(0,\Sigma)\) in \(d\) dimensions, and want to test whether the unknown covariance matrix \(\Sigma\) is the identity, versus \(\varepsilon\)-far from it (in Frobenius norm). Good news: we know how to do that, and \(\Theta(d/\varepsilon^2)\) samples are necessary and sufficient. (To learn \(\Sigma\), that’d be \(\Theta(d^2/\varepsilon^2)\).) Bad news: you don’t have i.i.d. samples from some high-dimensional Gaussian \(\mathcal{N}(0,\Sigma)\); what you have is i.i.d. samples from a noisy version of it, \((1-\alpha)\mathcal{N}(0,\Sigma) + \alpha B\), where \(B\) is an arbitrary “bad” distribution (not necessarily Gaussian itself). You still want to test whether the covariance \(\Sigma\) is the identity, but now you have that extra \(\alpha\) fraction of noisy samples, and you need to do that testing robustly… The good news is, you can still do that by learning the covariance matrix \(\Sigma\), robustly, with \(O(d^2/\varepsilon^2)\) samples. The bad news is the main result of this paper: that’s also the best you can do. That is, \(\Omega(d^2)\) samples are necessary: if you have to be robust to noise, testing is no longer easier than learning….

Onto Boolean functions!

Junta Distance Approximation with Sub-Exponential Queries, by Vishnu Iyer, Avishay Tal, and Michael Whitmeyer (ECCC). If you follow this blog, you may have seen over the past couple years a flurry of results about tolerant junta testing: “given query access to some Boolean function \(f\colon\{-1,1\}^n\to\{-1,1\}\), how close is \(f\) to only depending on \(k \ll n\) of its variables?”
This paper contains several results on this problem, including an improved bicriteria tolerant testing algorithm: an efficient algorithm to distinguish between \(\varepsilon\)-close to \(k\)-junta and \(1.1\varepsilon\)-far from \(k’\)-junta making \(\mathrm{poly}(k,1/\varepsilon)\) queries (and \(k’ = O(k/\varepsilon^2)\)). But the main result of the paper, and the one giving it its name, is for the non-relaxed version where \(k’=k\): while all previous works had a query complexity \(2^{O(k)}\), here the authors show how to break that exponential barrier, giving a fully tolerant testing algorithm with query complexity \(2^{\tilde{O}(\sqrt{k})}\)!

And finally, a foray into database theory:

Towards Approximate Query Enumeration with Sublinear Preprocessing Time, by Isolde Adler and Polly Fahey (arXiv). In this paper, the authors are concerned with the task of (approximate) query enumeration on databases, aiming for ultra efficient (i.e., sublinear-time) algorithms. Leveraging techniques from property testing (specifically, in the bounded-degree graph model), they show the following:
On input databases of bounded degree and bounded tree-width, every (fixed) first-order definable query can be enumerated approximately in time linear in the output size, after only a sublinear-time preprocessing phase.

That’s all for this month! If you noticed a paper we missed, please let us know in the comments.

News for October 2020

Sorry for the delay in writing this monthly digest: we hope you didn’t spend the week frantically refreshing your browsers! We found four papers this month: let’s dive in.

A Structural Theorem for Local Algorithms with Applications to Coding, Testing, and Privacy, by Marcel Dall’Agnol, Tom Gur, and Oded Lachish (ECCC). This paper introduces the notion of “robust (local) algorithm,” an abstraction which encompasses may types of algorithms: e.g., property testing algorithms, locally decodable codes, etc. The main result of this work is that any (possibly adaptive) \(q\)-query robust local algorithm can be transformed into a non-adaptive, sample-based one making \(n^{1-1/(q^2\log q)}\) queries (where \(n\) is the input size). Here, “sample-based” means that the algorithm doesn’t get to make arbitrary queries, but just gets to observe randomly chosen coordinates of the input. As application of this result, the authors derive new upper and lower bounds for several of the types of local algorithms mentioned above, resolving open questions from previous works.

Testing Tail Weight of a Distribution Via Hazard Rate, by Maryam Aliakbarpour, Amartya Shankha Biswas, Kavya Ravichandran, Ronitt Rubinfeld (arXiv). The authors consider, from the point of view of distribution testing, the question of deciding whether data is heavy-tailed: as would, for instance, data following a power law. The paper first sets out to formalize the question, and discusses various possible definitional choices before setting on one; after which it provides a test, and analyzes its sample complexity as a function of various parameters (such as smoothness of the unknown distribution). The authors finally back these results with empirical evaluation of their algorithm.

On Testing of Samplers, by Kuldeep S. Meel, Yash Pote, Sourav Chakraborty (arXiv). Suppose you are given a (known) distribution \(p\) over some domain \(\Omega\), and want to sample from it conditioned on some predicate \(\varphi\). Now someone comes to you with an algorithm which does exactly that, efficiently, and cheaply: great! But can you easily check that you’re not getting fooled, and that this sampler actually does what it claims? This paper provides this: an algorithm which accepts if the sampled distribution is \(\varepsilon\)-close to what it should (roughly, in a multiplicative, KL divergence sense), and rejects if it’s \(\varepsilon’\)-far (in total variation distance). The number of samples required is polynomial in \(\varepsilon’-\varepsilon\), and depends on some characteristic of \(p\) and \(\varphi\), the “tilt” (ratio between max and min probability of the conditional distribution).

Finally, an omission from late September:
Sample optimal Quantum identity testing via Pauli Measurements, by Nengkun Yu (arXiv). The abstract is concise and clear enough to speak for itself: “In this paper, we show that \(\Theta(\textrm{poly}(n)\cdot\frac{4^n}{\varepsilon^2})\) is the sample complexity of testing whether two \(n\)-qubit quantum states \(\rho\) and \(\sigma\) are identical or \(\varepsilon\)-far in trace distance using two-outcome Pauli measurements.”

Please let us know if you spotted a paper we missed!

News for January 2017

2017 starts off rather slow for property testing. Though, we have an intriguing paper to report – an experimental analysis of a classic sublinear algorithm.

Evaluating a Sublinear-time Algorithm for the Minimum Spanning Tree Weight Problem, by Gabriele Santi and Leonardo De Laurentiis (arXiv). The Chazelle-Rubinfeld-Trevisan Minimum Spanning Tree algorithm is a classic in sublinear algorithms. This algorithms provides a \((1+\varepsilon)\) approximation to the MST in time independent of the number of vertices (although it does depend on the average degree). But how this compare with Prim’s algorithm on real instances, in a real (not theoretical) computer? This intriguing paper does a detailed experimental comparison. Having done experimental graph algorithms myself, I can attest to the various challenges: how to choose a test set of graphs? How to set error parameters? Can data structure optimization on the coding side beat asymptotic improvements? This paper does a series of experiments on synthetic graph generators (such as Erdős-Rényi, Barabási-Albert, Watts-Strogatz models). They do validate the basic CRT algorithm at scale, by showing that it is faster than Prim for graphs with more than a million edges. Their experiments suggest that the sublinear-time algorithm gives little benefits when \(\varepsilon \leq 0.2\). The paper has many experiments for a variety of settings, and the authors do a comprehensive study of the various parameters. I’d definitely recommend to anyone interested in exploring how property testing might influence algorithms in the real world.

News for December 2015

Greetings from the exciting Workshop on Sublinear Algorithms at John Hopkins University! As this workshop and the upcoming SODA and ITCS conferences get 2016 to a roaring start, let us take one last look back at property testing news from last year. In December, one work in particular caught my eye:

 Non-Local Probes Do Not Help with Graph Problems by Mika Göös, Juho Hirvonen, Reut Levi, Moti Medina, and Jukka Suomela (arXiv). A generalization of property testing that has recently seen some fascinating developments in the past few years is the local computation algorithms (LCA) model, in which the algorithm is asked to answer some local query (such as “what is the color of this vertex in some fixed, legal coloring of the graph?”) in sublinear-time. This paper relates the LCA model to message-passing models and in the process gives a powerful new tool for establishing lower bounds in LCAs.