Author Archives: Clement Canonne

News for January 2023

Welcome to the first batch of 2023. Looks like it’s going to be a good year, with 5 property testing or related papers (that I could find) already:

An efficient asymmetric removal lemma and its limitations, by Lior Gishboliner, Asaf Shapira, and Yuval Wigderson (arXiv). One of the jewels of graph property testing is the triangle removel lemma (and its many generalizations and variants), which relates the number of triangles in a dense graph to its distance from being triangle-free: namely, any graph \(\varepsilon\)-far from being triangle-free must have \(\delta(\varepsilon)n^3\) triangles, where the density \(\delta(\varepsilon)\) only depends on the distance (and not the size of the graph!). This immediately leads to constant-query testers (and even “proximity-oblivious” testers) for triangle-freness (and, more generally, pattern-freeness). Unfortunately, the dependence on \(\varepsilon\) is quite bad, essentially a tower-type function (and it is known no polynomial bound is possible). This work attempts to bypass this impossibility result by proving an asymmetric removal lemma, or the form “any graph \(\varepsilon\)-far from being triangle-free must have \(\mathrm{poly}(\varepsilon)n^5\) 5-cycles” (and generalizations beyond triangles). This seems like a very interesting direction, with potential applications to property testing, and (who knows!) efficient testers for many properties hithertho only known to be (practically) testable for constant \(\varepsilon\).

Related (more removal lemmata!), a different work on this topic:

The Minimum Degree Removal Lemma Thresholds, by Lior Gishboliner, Zhihan Jin, and Benny Sudakov (arXiv). As mentioned above, removal lemmata relate the distance \(\varepsilon\) from being \(H\)-free (for a given subgraph \(H\)) to the density \(\delta(\varepsilon)\) of occurrences of \(H\) in the graph. Sadly, it is known that this density will be superpolynomial (in the distance) unless \(H\) is bipartite… which, while technically still yielding testing algorithms (query complexity independent of the size of the graph!), yields very inefficient testers (very bad dependence on \(\varepsilon\)!). This paper studies one direction to bypass this sad state of affairs: under which additional assumption on the underlying graph (specifically, bounds on its minimum degree) can we obtain a polynomial bound on \(\delta(\varepsilon)\)? And a linear bound? The authors give a tight degree condition for \(\delta(\varepsilon)\) to be polynomial when \(H\) is an odd cycle, and their results for the linear-dependence case establishes a separation between the two. Put differently: obtaining polynomial-query testers via removal lemmas is possible for a strictly larger class of graphs than linear-query ones!

And now, for something completely different: testing binary matrices!

A Note on Property Testing of the Binary Rank, by Nader H. Bshouty (arXiv). The binary rank of a matrix \(M\in\{0,1\}^{n\times m}\) is the smallest \(d\) such that there exist \(A\in\{0,1\}^{n\times d}\) and \(B\in\{0,1\}^{d\times m}\) with \(M=AB\); this can also be seen as the minimal number of bipartite cliques needed to partition the edges of a bipartite graph represented by \(M\). One can also define the relaxed notion of \(s\)-binary rank, if one enforces that each edge of the bipartite graph is covered by at most \(s\) bipartite cliques. The property testing question is then to decide, given inputs \(s,d,\varepsilon\), if \(M\) has \(s\)-binary rank at most \(d\), or is \(\varepsilon\)-far from it. The main result of this note is to give one-sided testers (one adaptive, and one non-adaptive) for \(s\)-binary rank with query complexity \(\tilde{O}(2^d)\) (for constant \(s\)), improving on the previous algorithms by a factor \(2^d\).

Into the quantum realm!

Testing quantum satisfiability, by Ashley Montanaro, Changpeng Shao, and Dominic Verdon (arXiv). Classically, one can study the property version of \(k\)-SAT, which asks to decide whether a given instance is satisfiable or far from being so. And people (namely, Alon and Shapira, in 2003) did! Quantumly, one can define an analogue of \(k\)-SAT, “quantum \(k\)-SAT”: and people (namely, Bravyi, in 2011) did! But what about property testing of quantum \(k\)-SAT? Well, now, people (namely, the authors of this paper) just did! Showing (Corollary 1.10) that one can efficiently distinguish between (1) the quantum \(k\)-SAT is satisfiable, and (2) it is far from satisfiable by a product state. This, effectively, extends the result of Alon–Shapira’03 to the quantum realm.

And to conclude, a foray into reinforcement learning via distribution testing…

Lower Bounds for Learning in Revealing POMDPs, by Fan Chen, Huan Wang, Caiming Xiong, Song Mei, and Yu Bai (arXiv). Alright, I’m even more out of my depth than usual here, so I’ll just copy (part of) the abstract, for fear I don’t do justice to the authors’ work: “This paper studies the fundamental limits of reinforcement learning (RL) in the challenging partially observable setting. While it is well-established that learning in Partially Observable Markov Decision Processes (POMDPs) requires exponentially many samples in the worst case, a surge of recent work shows that polynomial sample complexities are achievable under the revealing condition — A natural condition that requires the observables to reveal some information about the unobserved latent states. However, the fundamental limits for learning in revealing POMDPs are much less understood, with existing lower bounds being rather preliminary and having substantial gaps from the current best upper bounds. We establish strong PAC and regret lower bounds for learning in revealing POMDPs. […] Technically, our hard instance construction adapts techniques in distribution testing, which is new to the RL literature and may be of independent interest.” (Emphasis mine)

News for September 2022

Apologies for the delay! Another quiet month in the property testing world, with only one paper (that we found). If we missed any, let us know in the comments!

On Interactive Proofs of Proximity with Proof-Oblivious Queries, by Oded Goldreich, Guy Rothblum, and Tal Skverer (ECCC). Interactive Proofs of Proximity (IPPs) are the “interactive” version of property testers, where the algorithm can both query the input and interact with an all-knowing (but untrusted) prover. In this work, the authors study the power of a specific and natural type of “adaptivity” for IPPs, asking what happens when the choice of queries and the interaction with the prover are independent, or restricted. That is, what happens when these two aspects of the IPP algorithm are in separate “modules”? Can we still test various properties as efficiently? The paper proves various results in under several models (=restrictions between the two “modules”), focusing on the intermediate restriction where the two modules (queries to the input and interaction with the prover) are separate (no interaction), but have access to shared randomness.

News for May 2022

The crazy numbers from last month are not quite gone: we have five papers this month, not bad at all!

Codes! Distributed computing! Probability distributions!

Improved local testing for multiplicity codes, by Dan Karliner and Amnon Ta-Shma (ECCC). Take the Reed–Muller code with parameters \(m, d\), whose codewords are the evaluation tables of all degree-\(m\) polynomials over \(\mathbb{F}^d\). RM codes are great, they are everywhere, and they are locally testable: one can test whether a given input \(x\) is a valid codeword (or far from every codeword) with only very few queries to \(x\). Now, take the multiplicity code: instead of just the evaluation table of the polynomial themselves, a codeword includes the evaluations of all its derivatives, up to order \(s\). These beasts generalize RM codes: are they also locally testable? Yes they are! And this work improves on our understanding of this aspect, by providing better bounds on the locality (how few queries are necessary to test), and simplifies the argument from previous work by Karliner, Salama, and Ta-Shma (2022).

Overcoming Congestion in Distributed Coloring, by Magnús M. Halldórsson, Alexandre Nolin, Tigran Tonoyan (arXiv). Two of the main distributed computing models, LOCAL and CONGEST, differ in how they model the bandwidth constraints. In the former, nodes can send messages of arbitrary size, and the limiting quantity is the number of rounds of communications; while in the latter, each node can only send a logarithmic number of bits at each round. This paper introduces a new technique that allows for communication-efficient distributed (coordinated) sampling, which as a direct applications enables porting several LOCAL algorithms to the CONGEST model at a small cost: for instance, \((\Delta+1)\)-List Coloring. This new technique also has applications beyond these distributed models, to graph property testing – in a slightly non-standard setting where we define farness from the property in a “local” sense (detect vertices or edges which contribute to many violations, i.e., are “locally far” from the property considered).

Robust Testing in High-Dimensional Sparse Models, by Anand Jerry George and Clément L. Canonne (arXiv). In the Gaussian mean testing problem, you are given samples from a high-dimensional Gaussian \(N(\mu, I_d)\), where \(\mu\) is either zero or has \(\ell_2\) norm greater than \(\varepsilon\), and you want to decide which of the two holds. This “mean testing” equivalent (due to, erm, “standard facts”) to testing in total variation distance, and captures the setting where one wantss to figure out whether an underlying signal \(\mu\), subject to white noise, is null or significant. Now, what if this \(\mu\) was promised to be \(s\)-sparse? Can we test more efficiently? But what if a small fraction of the samples were arbitrarily corrupted — how much harder does the testing task become? For some related tasks, it is known that being robust against adversarial corruptions makes testing as hard as learning… This paper addresses this “robust sparse mean testing” question, providing matching upper and lower bounds; as well as the related question of (robust, sparse) linear regression.

Sequential algorithms for testing identity and closeness of distributions, by Omar Fawzi, Nicolas Flammarion, Aurélien Garivier, and Aadil Oufkir (arXiv). Consider the two “usual suspects” of distribution testing, identity and closeness testing, where we must test if an unknown distribution is equal to some reference one or \(\varepsilon\)-far (in total variation distance) from it; or, the same thing, but with two unknown distributions (no reference one). These are, by now, quite well understood… but the algorithms for them take a worst-case number of samples, function of the distance parameter \(\varepsilon\). But if the two distributions are much further apart than \(\varepsilon\), fewer samples should be required! This is the focus of this paper, showing that with a sequential test one can achieve this type of guarantees: a number of samples which, in the “far” case, depends on the actual distance, not on its worst-case lower bound \(\varepsilon\). One could achieve this by combining known algorithms with a “doubling search;” however, this still would lose some constant factors in the sample complexity. The authors provide sequential tests which improve on this “doubling search technique” by constant factors, and back this up with empirical evaluations of their algorithms.

Estimation of Entropy in Constant Space with Improved Sample Complexity, by Maryam Aliakbarpour, Andrew McGregor, Jelani Nelson, and Erik Waingarten (arXiv). Suppose that, given samples from an unknown distribution \(p\) over \(n\) elements, your task is to estimate its (Shannon) entropy \(H(p)\) up to \(\pm\Delta\). You’re in luck! We know that \(\Theta(n/(\Delta\log n)+ (\log^2 n)/\Delta^2)\) samples are necessary and efficient. But what if you had to do that under strict memory constraints? Say, using only a constant number of words of memory? Previous work by Acharya, Bhadane, Indyk, and Sun (2019) shows that it is still possible, but the number of samples required shoots up, with their algorithm now requiring (up to polylog factors) \(n/\Delta^3\) samples. This works improves upon the dependence on \(\Delta\), providing a constant-memory algorithm with sample complexity \(O(n/\Delta^2 \cdot \log^4(1/\Delta))\); they further conjecture this to be optimal, up to the polylog factors.

New Property Testing book, by Arnab Bhattacharyya and Yuichi Yoshida

More great news: a new textbook on property testing, 📘 Property Testing: Problems and Techniques, by two experts in the field, Arnab Bhattacharyya and Yuichi Yoshida, is now available!

Property Testing: Problems and Techniques (Springer, 2022)

As the overview below outlines (from the book’s website), the book covers a wide range of topics, and should give anyone interested a great overview of scope, techniques, and results in testing.

This book introduces important results and techniques in property testing, where the goal is to design algorithms that decide whether their input satisfies a predetermined property in sublinear time, or even in constant time – that is, time is independent of the input size.

This book consists of three parts. The first part provides an introduction to the foundations of property testing. The second part studies the testing of specific properties on strings, graphs, functions, and constraint satisfaction problems. Vectors and matrices over real numbers are also covered. The third part is more advanced and explains general conditions, including full characterizations, under which properties are constant-query testable.

The first and second parts of the book are intended for first-year graduate students in computer science. They should also be accessible to undergraduate students with the adequate background. The third part can be used by researchers or ambitious graduate students who want to gain a deeper theoretical understanding of property testing.

2022: Voilà, WOLA!

Good news, everyone! WOLA, the Workshop on Local Algorithms, is coming back this year, with WOLA 2022 taking place in person* in Warsaw on June 25–27. Exciting speakers, events and outings are being planned!

Keep track of updates by visiting the website, and register at https://ideas-ncbr.pl/en/wola/registration/ (even if you intend to attend remotely).

* Virtual participation is also possible.

News for January 2022

A slow month to start 2022, as far as property testing (and myself) are concerned — “only” 3 papers, and a delay of several days in posting this. Let’s jump in with quantum testing!

Testing matrix product states, by Mehdi Soleimanifar and John Wright (arXiv). Suppose you are given a state \(|\psi\rangle\) of \(n\) qubits, and want to know “how entangled” this whole thing is: for instance, is \(|\psi\rangle\) a product state (no entanglement between the \(n\) qudits)? More generally, the “amount of entanglement” allowed is captured by an integer \(r\), the bond dimension, where product state corresponds to \(r=1\), and larger \(r\) allows for more entanglement. This paper then considers the following property testing question: how many copies of \(|\psi\rangle\) are needed to test whether it has bond dimension at most \(r\), or is \(\varepsilon\)-far from every such state (in trace distance)? While the case \(r=1\) had been previously considered, this paper considers the general case; and, in particular, shows a qualitative gap between \(r=1\) (for which a constant number of copies, \(O(1/\varepsilon^2)\), suffice) and \(r\geq 2\) (for which they show the number of states is \(\Omega(\sqrt{n}/\varepsilon^2)\), and \(O(n r^2/\varepsilon^2)\)).

Constant-time one-shot testing of large-scale graph states, by Hayata Yamasaki and Sathyawageeswar Subramanian (arXiv). In this paper, the authors consider the task of testing if the physical error rate of a given system is below a given threshold — namely, the threshold below which fault-tolerant measurement-based quantum computation (MBQC) becomes feasible. Casting this into the framework of property testing, the paper shows that measuring very few (a constant number!) of the input state is enough to test whether the error rate is low.

And, to conclude, a paper which escaped us in December, on private distribution testing:

Pure Differential Privacy from Secure Intermediaries, by Albert Cheu and Chao Yan (arXiv). Throwback to April 2020 and August 2021, which covered results on distribution testing (uniformity testing!) under the shuffle model of differential privacy. Namely, there was an upper bound of $$ O( k^{2/3}/(\alpha^{4/3}\varepsilon^{2/3})\log^{1/3}(1/\delta) + k^{1/2}/(\alpha\varepsilon) \log^{1/2}(1/\delta) + k^{1/2}/\alpha^2)$$ samples for testing uniformity of distributions over \([k]\), to distance \(\alpha\), under \((\varepsilon,\delta)\)shuffle privacy (so, approximate privacy: \(\delta>0\)). A partial lower bound existed for pure differential privacy, i.e., when \(\delta=0\): however, no upper bound was known for pure shuffle privacy.
Until now: this new paper shows that pure DP basically comes at no cost, by providing an \((\varepsilon,0)\)-shuffle private testing algorithm with sample complexity $$ O( k^{2/3}/(\alpha^{4/3}\varepsilon^{2/3}) + k^{1/2}/(\alpha\varepsilon) + k^{1/2}/\alpha^2)$$ The paper actually does a lot more, focusing on a different problem, private summation; and the testing upper bound is a corollary of the new methods they develop in the process.

News for October 2021

The month of September was quite busy, with seven papers, spanning (hyper)graphs, proofs, probability distributions, and sampling.

Better Sum Estimation via Weighted Sampling, by Lorenzo Beretta and Jakub Tětek (arXiv). This paper considers the following question: “given a large universe of items, each with an unknown weight, estimate the total weight to a multiplicative \(1\pm \varepsilon\).” The key is in the type of access you have to those items: here, the authors consider the setting where items can be sampled proportionally to their unknown weights, and show improved bounds on the sample/query complexity in this model. And there something for everyone: they also discuss connections to edge estimation in graphs (assuming random edge queries) and to distribution testing (specifically, in the “dual” or “probability-revealing” models of Canonne–Rubinfeld and Onak–Sun).

This gives us an easy segue to distribution testing, which is the focus of the next two papers.

As Easy as ABC: Adaptive Binning Coincidence Test for Uniformity Testing, by Sudeep Salgia, Qing Zhao, and Lang Tong (arXiv). Most of the work in distribution testing (from the computer science community) focuses on discrete probability distributions, for several reasons. Including a technical one: total variation distance is rather fickle with continuous distributions, unless one makes some assumption on the unknown distribution. This paper does exactly this: assuming the unknown distribution has a Lipschitz density function, it shows how to test uniformity by adaptively discretizing the domain, achieving (near) sample complexity.

Exploring the Gap between Tolerant and Non-tolerant Distribution Testing, by Sourav Chakraborty, Eldar Fischer, Arijit Ghosh, Gopinath Mishra, and Sayantan Sen (arXiv). It is known that tolerant testing of distributions can be much harder than “standard” testing – for instance, for identity testing, the sample complexity can blow up by nearly a quadratic factor, from \(\sqrt{n}\) to \(\frac{n}{\log n}\)! But is it the worse that can happen, in general, for other properties? This work explores this question, and answers it in some notable cases of interest, such as for label-invariant (symmetric) properties.

And now, onto graphs!

Approximating the Arboricity in Sublinear Time, by Talya Eden, Saleet Mossel, and Dana Ron (arXiv). The arboricity of a graph is the minimal number of spanning forests required to cover all its edges. Many graph algorithms, especially sublinear-time ones, can be parameterized by this quantity: which is very useful, but what do you do if you don’t know the arboricity of your graph? Well, then you estimate it. Which this paper shows how to do efficiently, given degree and neighbor queries. Moreover, the bound they obtain — \(\tilde{O}(n/\alpha)\) queries to obtain a constant-factor approximation of the unknown arboricity \(\alpha\) — is optimal, up to logarithmic factors in the number of vertices \(n\).

Sampling Multiple Nodes in Large Networks: Beyond Random Walks, by Omri Ben-Eliezer, Talya Eden, Joel Oren, and Dimitris Fotakis (arXiv). Another thing which one typically wants to do with very large graphs is sample nodes from them, either uniformly or according to some prescribed distribution. This is a core building block in many other algorithms; unfortunately, approaches to do so via random walks will typically require a number of queries scaling with the mixing time \(t_{\rm mix}(G)\) of the graph \(G\), which might be very small for nicely expanding graphs, but not so great in many practical settings. This paper proposes and experimentally evaluates a different algorithm which bypasses this linear dependence on \(t_{\rm mix}(G)\), by first going through a random-walk-based “learning” phase (learn something about the structure of the graph) before using this learned structure to perform faster sampling, focusing on small connected components.

Why stop at graphs? Hypergraphs!

Hypergraph regularity and random sampling, by Felix Joos, Jaehoon Kim, Daniela Kühn, Deryk Osthus (arXiv). The main result in this paper is a hypergraph analogue of a result of Alon, Fischer, Newman and Shapira (for graphs), which roughly states that if a hypergraph satisfies some regularity condition, then so does with high probability a randomly sampled sub-hypergraph — and conversely. This in turn has direct implications to characterizing which hypergraph properties are testable: see the companion paper, by the same authors.
(Note: this paper is a blast from the past, as the result it shows was originally established in the linked companion paper, from 2017; however, the authors split this paper in two this October, leading to this new, standalone paper.)

And, to conclude, Arthur, Merlin, and proofs:

Sample-Based Proofs of Proximity, by Guy Goldberg, Guy Rothblum (ECCC). Finally, consider the setting of interactive proofs of proximities (IPPs), where the prover is as usual computationally unbounded, but the verifier must run in sublinear time (à la property testing). This has received significant interest in the past years: but what if the verifier didn’t even get to make queries, but only got access to uniformly random locations of the input? These “SIPP” (Sample-based IPPs), and their non-interactive counterpart SAMPs (Sample-based Merlin-Arthur Proofs of Proximity) are the object of study of this paper, which it introduces and motivates in the context, for instance, of delegation of computation for sample-based algorithms.

Workshop on Algorithms for Large Data: We found WALD(O), and so can you!

Ainesh Bakshi, Rajesh Jayaram, and Samson Zhou are organizing a 3-day Workshop on Algorithms for Large Data (nicely abbreviated as WALD(O), the O standing for Online), featuring many talks which should be of interest to the readers of this blog, as well as an open problems and a poster sessions, and a junior/senior lunch. As the organizers describe it:

This workshop aims to foster collaborations between researchers across multiple disciplines through a set of central questions and techniques for algorithm design for large data. We will focus on topics such as sublinear algorithms, randomized numerical linear algebra, streaming and sketching, and learning and testing.

The workshop will take place on August 23 — August 25 (ET). Attendance is free, but registration is required by August 20th. More details at https://waldo2021.github.io/

New for July 2021

This month saw three papers appear online, together covering a rather broad range of topics: testing of regular languages, distribution testing under differential privacy, and local testability from high-dimensional expanders. Let’s dive in!

Property Testing of Regular Languages with Applications to Streaming Property Testing of Visibly Pushdown Languages, by Gabriel Bathie and Tatiana Starikovskaya (paper). Let \(L\in \Sigma^\ast\) be a regular language recognized by an automation with \(m\) states and \(k\) connected components: given as input a word \(u\in \Sigma^n\), what is the query complexity to test membership to \(L\) in Hamming distance? Edit distance? Or, more generally, weighted edit distance, where each letter of the word \(u\) comes with a weight? In this paper, the authors focus on non-adaptive, one-sided errors testing algorithms, for which they show an upper bound of \(q=O(k m \log(m/\varepsilon)/\varepsilon)\) queries (with running time \(O(m^2 q)\)), which they complement by a query complexity lower bound of \(\Omega(\log(1/\varepsilon)/\epsilon)\), thus matching the upper bound for languages recognized by constant-size automata. The guarantee for the upper bound is with respected to weighted edit distance, and thus implies the same upper bound for testing with respect to Hamming distance.
To conclude, the authors use an existing connection to streaming property testing to obtain new algorithms for property testing of visibly pushdown languages (VPL) in the streaming model, along with a new lower bound in that model.

High dimensional expansion implies amplified local testability, by Tali Kaufman and Izhar Oppenheim (arXiv). This paper sets out to show that codes that arise from high-dimensional expanders are locally testable (membership to the code can be tested using very few queries). To do so, the authors define a new notion of high-dimensional expanding system (HDE system), as well as that of amplified local testability, a stronger notion than local testability; they then prove that a code based on a HDE system satisfies this stronger notion. Moreover, they show that many well-known families of codes are, in fact, HDE system codes, and therefore satisfy this stronger notion of local testability as well.

Finally, a survey on differential privacy, with a foray into distribution testing:

Differential Privacy in the Shuffle Model: A Survey of Separations, by Albert Cheu (arXiv). If you are familiar with differential privacy (DP), you may recall that there are several notions of DP, each meant to address a different “threat model” (depending on whom you trust with your data). Shuffle DP is one of them, intermediate between “central” DP and the more stringent “local” DP. Long story short: with shuffle DP, the tradeoff between privacy and accuracy can be strictly in-between what’s achievable in central and local DP, and that’s the case for one of the usual suspects of distribution testing, uniformity testing (“I want to test if the data uniformly distributed, but now, with privacy of that data in mind”). The survey discusses what is known about this in Sections 3.3 and 6, and what the implications are; but there are quite a few questions left unanswered… Long story short: a very good introduction to shuffle privacy, and to open problems in that area!

Looking back at WOLA’21: Videos available

The fifth Workshop on Local Algorithms (WOLA’21) took place earlier this month, and the recordings of the invited talks are now available on YouTube. If you missed the workshop, or want to refresh your memory, here are the recordings (ordered by the workshop schedule):

Thanks again to the speakers and organizers, and looking forward to WOLA’22!