Author Archives: Clement Canonne

News for January 2022

A slow month to start 2022, as far as property testing (and myself) are concerned — “only” 3 papers, and a delay of several days in posting this. Let’s jump in with quantum testing!

Testing matrix product states, by Mehdi Soleimanifar and John Wright (arXiv). Suppose you are given a state \(|\psi\rangle\) of \(n\) qubits, and want to know “how entangled” this whole thing is: for instance, is \(|\psi\rangle\) a product state (no entanglement between the \(n\) qudits)? More generally, the “amount of entanglement” allowed is captured by an integer \(r\), the bond dimension, where product state corresponds to \(r=1\), and larger \(r\) allows for more entanglement. This paper then considers the following property testing question: how many copies of \(|\psi\rangle\) are needed to test whether it has bond dimension at most \(r\), or is \(\varepsilon\)-far from every such state (in trace distance)? While the case \(r=1\) had been previously considered, this paper considers the general case; and, in particular, shows a qualitative gap between \(r=1\) (for which a constant number of copies, \(O(1/\varepsilon^2)\), suffice) and \(r\geq 2\) (for which they show the number of states is \(\Omega(\sqrt{n}/\varepsilon^2)\), and \(O(n r^2/\varepsilon^2)\)).

Constant-time one-shot testing of large-scale graph states, by Hayata Yamasaki and Sathyawageeswar Subramanian (arXiv). In this paper, the authors consider the task of testing if the physical error rate of a given system is below a given threshold — namely, the threshold below which fault-tolerant measurement-based quantum computation (MBQC) becomes feasible. Casting this into the framework of property testing, the paper shows that measuring very few (a constant number!) of the input state is enough to test whether the error rate is low.

And, to conclude, a paper which escaped us in December, on private distribution testing:

Pure Differential Privacy from Secure Intermediaries, by Albert Cheu and Chao Yan (arXiv). Throwback to April 2020 and August 2021, which covered results on distribution testing (uniformity testing!) under the shuffle model of differential privacy. Namely, there was an upper bound of $$ O( k^{2/3}/(\alpha^{4/3}\varepsilon^{2/3})\log^{1/3}(1/\delta) + k^{1/2}/(\alpha\varepsilon) \log^{1/2}(1/\delta) + k^{1/2}/\alpha^2)$$ samples for testing uniformity of distributions over \([k]\), to distance \(\alpha\), under \((\varepsilon,\delta)\)shuffle privacy (so, approximate privacy: \(\delta>0\)). A partial lower bound existed for pure differential privacy, i.e., when \(\delta=0\): however, no upper bound was known for pure shuffle privacy.
Until now: this new paper shows that pure DP basically comes at no cost, by providing an \((\varepsilon,0)\)-shuffle private testing algorithm with sample complexity $$ O( k^{2/3}/(\alpha^{4/3}\varepsilon^{2/3}) + k^{1/2}/(\alpha\varepsilon) + k^{1/2}/\alpha^2)$$ The paper actually does a lot more, focusing on a different problem, private summation; and the testing upper bound is a corollary of the new methods they develop in the process.

News for October 2021

The month of September was quite busy, with seven papers, spanning (hyper)graphs, proofs, probability distributions, and sampling.

Better Sum Estimation via Weighted Sampling, by Lorenzo Beretta and Jakub Tětek (arXiv). This paper considers the following question: “given a large universe of items, each with an unknown weight, estimate the total weight to a multiplicative \(1\pm \varepsilon\).” The key is in the type of access you have to those items: here, the authors consider the setting where items can be sampled proportionally to their unknown weights, and show improved bounds on the sample/query complexity in this model. And there something for everyone: they also discuss connections to edge estimation in graphs (assuming random edge queries) and to distribution testing (specifically, in the “dual” or “probability-revealing” models of Canonne–Rubinfeld and Onak–Sun).

This gives us an easy segue to distribution testing, which is the focus of the next two papers.

As Easy as ABC: Adaptive Binning Coincidence Test for Uniformity Testing, by Sudeep Salgia, Qing Zhao, and Lang Tong (arXiv). Most of the work in distribution testing (from the computer science community) focuses on discrete probability distributions, for several reasons. Including a technical one: total variation distance is rather fickle with continuous distributions, unless one makes some assumption on the unknown distribution. This paper does exactly this: assuming the unknown distribution has a Lipschitz density function, it shows how to test uniformity by adaptively discretizing the domain, achieving (near) sample complexity.

Exploring the Gap between Tolerant and Non-tolerant Distribution Testing, by Sourav Chakraborty, Eldar Fischer, Arijit Ghosh, Gopinath Mishra, and Sayantan Sen (arXiv). It is known that tolerant testing of distributions can be much harder than “standard” testing – for instance, for identity testing, the sample complexity can blow up by nearly a quadratic factor, from \(\sqrt{n}\) to \(\frac{n}{\log n}\)! But is it the worse that can happen, in general, for other properties? This work explores this question, and answers it in some notable cases of interest, such as for label-invariant (symmetric) properties.

And now, onto graphs!

Approximating the Arboricity in Sublinear Time, by Talya Eden, Saleet Mossel, and Dana Ron (arXiv). The arboricity of a graph is the minimal number of spanning forests required to cover all its edges. Many graph algorithms, especially sublinear-time ones, can be parameterized by this quantity: which is very useful, but what do you do if you don’t know the arboricity of your graph? Well, then you estimate it. Which this paper shows how to do efficiently, given degree and neighbor queries. Moreover, the bound they obtain — \(\tilde{O}(n/\alpha)\) queries to obtain a constant-factor approximation of the unknown arboricity \(\alpha\) — is optimal, up to logarithmic factors in the number of vertices \(n\).

Sampling Multiple Nodes in Large Networks: Beyond Random Walks, by Omri Ben-Eliezer, Talya Eden, Joel Oren, and Dimitris Fotakis (arXiv). Another thing which one typically wants to do with very large graphs is sample nodes from them, either uniformly or according to some prescribed distribution. This is a core building block in many other algorithms; unfortunately, approaches to do so via random walks will typically require a number of queries scaling with the mixing time \(t_{\rm mix}(G)\) of the graph \(G\), which might be very small for nicely expanding graphs, but not so great in many practical settings. This paper proposes and experimentally evaluates a different algorithm which bypasses this linear dependence on \(t_{\rm mix}(G)\), by first going through a random-walk-based “learning” phase (learn something about the structure of the graph) before using this learned structure to perform faster sampling, focusing on small connected components.

Why stop at graphs? Hypergraphs!

Hypergraph regularity and random sampling, by Felix Joos, Jaehoon Kim, Daniela Kühn, Deryk Osthus (arXiv). The main result in this paper is a hypergraph analogue of a result of Alon, Fischer, Newman and Shapira (for graphs), which roughly states that if a hypergraph satisfies some regularity condition, then so does with high probability a randomly sampled sub-hypergraph — and conversely. This in turn has direct implications to characterizing which hypergraph properties are testable: see the companion paper, by the same authors.
(Note: this paper is a blast from the past, as the result it shows was originally established in the linked companion paper, from 2017; however, the authors split this paper in two this October, leading to this new, standalone paper.)

And, to conclude, Arthur, Merlin, and proofs:

Sample-Based Proofs of Proximity, by Guy Goldberg, Guy Rothblum (ECCC). Finally, consider the setting of interactive proofs of proximities (IPPs), where the prover is as usual computationally unbounded, but the verifier must run in sublinear time (à la property testing). This has received significant interest in the past years: but what if the verifier didn’t even get to make queries, but only got access to uniformly random locations of the input? These “SIPP” (Sample-based IPPs), and their non-interactive counterpart SAMPs (Sample-based Merlin-Arthur Proofs of Proximity) are the object of study of this paper, which it introduces and motivates in the context, for instance, of delegation of computation for sample-based algorithms.

Workshop on Algorithms for Large Data: We found WALD(O), and so can you!

Ainesh Bakshi, Rajesh Jayaram, and Samson Zhou are organizing a 3-day Workshop on Algorithms for Large Data (nicely abbreviated as WALD(O), the O standing for Online), featuring many talks which should be of interest to the readers of this blog, as well as an open problems and a poster sessions, and a junior/senior lunch. As the organizers describe it:

This workshop aims to foster collaborations between researchers across multiple disciplines through a set of central questions and techniques for algorithm design for large data. We will focus on topics such as sublinear algorithms, randomized numerical linear algebra, streaming and sketching, and learning and testing.

The workshop will take place on August 23 — August 25 (ET). Attendance is free, but registration is required by August 20th. More details at https://waldo2021.github.io/

New for July 2021

This month saw three papers appear online, together covering a rather broad range of topics: testing of regular languages, distribution testing under differential privacy, and local testability from high-dimensional expanders. Let’s dive in!

Property Testing of Regular Languages with Applications to Streaming Property Testing of Visibly Pushdown Languages, by Gabriel Bathie and Tatiana Starikovskaya (paper). Let \(L\in \Sigma^\ast\) be a regular language recognized by an automation with \(m\) states and \(k\) connected components: given as input a word \(u\in \Sigma^n\), what is the query complexity to test membership to \(L\) in Hamming distance? Edit distance? Or, more generally, weighted edit distance, where each letter of the word \(u\) comes with a weight? In this paper, the authors focus on non-adaptive, one-sided errors testing algorithms, for which they show an upper bound of \(q=O(k m \log(m/\varepsilon)/\varepsilon)\) queries (with running time \(O(m^2 q)\)), which they complement by a query complexity lower bound of \(\Omega(\log(1/\varepsilon)/\epsilon)\), thus matching the upper bound for languages recognized by constant-size automata. The guarantee for the upper bound is with respected to weighted edit distance, and thus implies the same upper bound for testing with respect to Hamming distance.
To conclude, the authors use an existing connection to streaming property testing to obtain new algorithms for property testing of visibly pushdown languages (VPL) in the streaming model, along with a new lower bound in that model.

High dimensional expansion implies amplified local testability, by Tali Kaufman and Izhar Oppenheim (arXiv). This paper sets out to show that codes that arise from high-dimensional expanders are locally testable (membership to the code can be tested using very few queries). To do so, the authors define a new notion of high-dimensional expanding system (HDE system), as well as that of amplified local testability, a stronger notion than local testability; they then prove that a code based on a HDE system satisfies this stronger notion. Moreover, they show that many well-known families of codes are, in fact, HDE system codes, and therefore satisfy this stronger notion of local testability as well.

Finally, a survey on differential privacy, with a foray into distribution testing:

Differential Privacy in the Shuffle Model: A Survey of Separations, by Albert Cheu (arXiv). If you are familiar with differential privacy (DP), you may recall that there are several notions of DP, each meant to address a different “threat model” (depending on whom you trust with your data). Shuffle DP is one of them, intermediate between “central” DP and the more stringent “local” DP. Long story short: with shuffle DP, the tradeoff between privacy and accuracy can be strictly in-between what’s achievable in central and local DP, and that’s the case for one of the usual suspects of distribution testing, uniformity testing (“I want to test if the data uniformly distributed, but now, with privacy of that data in mind”). The survey discusses what is known about this in Sections 3.3 and 6, and what the implications are; but there are quite a few questions left unanswered… Long story short: a very good introduction to shuffle privacy, and to open problems in that area!

Looking back at WOLA’21: Videos available

The fifth Workshop on Local Algorithms (WOLA’21) took place earlier this month, and the recordings of the invited talks are now available on YouTube. If you missed the workshop, or want to refresh your memory, here are the recordings (ordered by the workshop schedule):

Thanks again to the speakers and organizers, and looking forward to WOLA’22!

Announcing WOLA’21 (Workshop on Local Algorithms)

The fifth WOLA (Workshop on Local Algorithms) will be virtual, and take place June 14-15. Registration is free, but required: please fill this form by June 10th to attend.

Local algorithms — that is, algorithms that compute and make decisions on parts of the output considering only a portion of the input — have been studied in a number of areas in theoretical computer science and mathematics. Some of the related areas include sublinear-time algorithms, distributed algorithms, streaming algorithms, (massively) parallel algorithms, inference in large networks, and graphical models. These communities have similar goals but a variety of approaches, techniques, and methods. This workshop is aimed at fostering dialogue and cross-pollination of ideas between the various communities.

This year, the workshop will include:

  • A poster session: Please submit your poster proposal (title and abstract) at by May 26th. Everyone is invited to contribute. This session will take place on gather.town.
  • Invited long talks: the tentative schedule is available, and features talks by James Aspnes, Jelani Nelson, Elaine Shi, Christian Sohler, Uri Stemmer, and Mary Wootters.
  • Junior-Senior social meetings
  • An AMA (Ask Me Anything) session, moderated by Merav Parter
  • A Slack channel
  • An Open Problems session

The Program Committee of WOLA 2021 is comprised of:

  • Venkatesan Guruswami (CMU)
  • Elchanan Mossel (MIT)
  • Merav Parter (Weizmann Institute of Science)
  • Sofya Raskhodnikova (chair) (Boston University)
  • Gregory Valiant (Stanford)

and the organizing committee:

  • Sebastian Brandt (ETH)
  • Yannic Maus (Technion)
  • Slobodan Mitrović (MIT)

For more detail, see the website; to stay up to date with the latest announcements concerning WOLA, join our mailing list!

News for April 2021

A somewhat “sublinear” month of April, as far as property testing is concerned, with only one paper. (We may have missed some; if so, please let us know in the comments!)

Graph Streaming Lower Bounds for Parameter Estimation and Property Testing via a Streaming XOR Lemma, by Sepehr Assadi and Vishvajeet N (arXiv). This paper establishes space vs. pass trade-offs lower bounds for streaming algorithms, for a variety of graph tasks: that is, of the sort “any \(m\)-pass-streaming algorithm for task \(\mathcal{T}\) must use memory at least \(f(m)\).” The tasks considered include graph property estimation (size of the maximum matching, of the max cut, of the weight of the MST) and property testing for sparse graphs (connectivity, bipartiteness, and cycle-freeness). The authors obtained exponentially improved lower bounds for those, via reductions to a relatively standard problem, (noisy) gap cycle counting, for which they establish their main lower bound. As a key component of their proof, they prove a general direct product result (XOR lemma) for the streaming setting, showing that the advantage for solving the XOR of \(\ell\) copies of a streaming predicate \(f\) decreases exponentially with \(\ell\).

News for January 2021

The first month of 2021 has brought with it 5 papers, covering graph testing, Boolean function testing, and distribution testing — as well as database theory. Let’s dive into it.

Random walks and forbidden minors III: \(\mathrm{poly}(d/\varepsilon)\)-time partition oracles for minor-free graph classes, by Akash Kumar, C. Seshadhri, and Andrew Stolman (arXiv). Minor-closed bounded-degree graphs have a very nice property: denoting by \(d\) the degree bound and \(n\) the number of edges, it is always possible to partition any such graph into components of constant size, \(O_\varepsilon(1)\), just by removing a linear number of edges, merely \(\varepsilon dn\) (\(\varepsilon\) being a small parameter). This is a crucial result in many graph property testing algorithms, those which rely on something called a “partition oracle”: loosely speaking, a routine which makes “few” queries to the graph, and is able to indicate which component of an underlying partition any given vertex belongs to. But what is “few” here? The first partition oracles made \(d^{\mathrm{poly}(d,1/\varepsilon)}\) queries to the graph to answer any such request. This later got significantly improved to \(d^{\mathrm{log}(d/\varepsilon)}\). Using spectral graph theory techniques previously developed by the authors (hence the “III” in the title), this work settles the question, achieving partition oracles which as good as it gets: making only \(\mathrm{poly}(d,1/\varepsilon)\) queries to the graph! This in turns has immediate consequences for graph property testing, which the paper details.

And since we are on the topic of oracles… more exciting news on that front:

Spectral Clustering Oracles in Sublinear Time, by Grzegorz Gluch, Michael Kapralov, Silvio Lattanzi, Aida Mousavifar, Christian Sohler (arXiv). Given a graph \(G\) and an integer \(k\), one often want to partition the graph into components \(C_1,C_2,\dots,C_k\), such that each \(C_i\) is well-connected and has few edges to the other \(C_j\)’s. But can we get ultra efficient algorithms for such spectral clusterings? Specifically, can we design oracles for them: sublinear-time algorithms which provide implicit access to an underlying “good” spectral clustering \(\hat{C}_1,\hat{C}_2,\dots,\hat{C}_k\), by returning on any query vertex \(v\) the index of the cluster \(\hat{C}_i\) to which \(v\) belongs? This paper introduces the question, and answers it in the affirmative: in more detail, it provides a spectral clustering oracle which, for any \(\varepsilon>0\), has preprocessing time \(2^{\mathrm{poly}(k/\varepsilon)}n^{1/2+O(\varepsilon)}\), query time \(n^{1/2+O(\varepsilon)}\), and space \(n^{1/2+O(\varepsilon)}\); and provides access to a clustering with relative error \(O(\varepsilon\log k)\) per cluster. The paper also allows tradeoffs between query time and space, and discusses applications to the Local Computation Algorithms (LCA) model.

Next stop, distribution testing…

The Sample Complexity of Robust Covariance Testing, by Ilias Diakonikolas and Daniel M. Kane (arXiv). Suppose you have i.i.d. samples from some high-dimensional Gaussian \(\mathcal{N}(0,\Sigma)\) in \(d\) dimensions, and want to test whether the unknown covariance matrix \(\Sigma\) is the identity, versus \(\varepsilon\)-far from it (in Frobenius norm). Good news: we know how to do that, and \(\Theta(d/\varepsilon^2)\) samples are necessary and sufficient. (To learn \(\Sigma\), that’d be \(\Theta(d^2/\varepsilon^2)\).) Bad news: you don’t have i.i.d. samples from some high-dimensional Gaussian \(\mathcal{N}(0,\Sigma)\); what you have is i.i.d. samples from a noisy version of it, \((1-\alpha)\mathcal{N}(0,\Sigma) + \alpha B\), where \(B\) is an arbitrary “bad” distribution (not necessarily Gaussian itself). You still want to test whether the covariance \(\Sigma\) is the identity, but now you have that extra \(\alpha\) fraction of noisy samples, and you need to do that testing robustly… The good news is, you can still do that by learning the covariance matrix \(\Sigma\), robustly, with \(O(d^2/\varepsilon^2)\) samples. The bad news is the main result of this paper: that’s also the best you can do. That is, \(\Omega(d^2)\) samples are necessary: if you have to be robust to noise, testing is no longer easier than learning….

Onto Boolean functions!

Junta Distance Approximation with Sub-Exponential Queries, by Vishnu Iyer, Avishay Tal, and Michael Whitmeyer (ECCC). If you follow this blog, you may have seen over the past couple years a flurry of results about tolerant junta testing: “given query access to some Boolean function \(f\colon\{-1,1\}^n\to\{-1,1\}\), how close is \(f\) to only depending on \(k \ll n\) of its variables?”
This paper contains several results on this problem, including an improved bicriteria tolerant testing algorithm: an efficient algorithm to distinguish between \(\varepsilon\)-close to \(k\)-junta and \(1.1\varepsilon\)-far from \(k’\)-junta making \(\mathrm{poly}(k,1/\varepsilon)\) queries (and \(k’ = O(k/\varepsilon^2)\)). But the main result of the paper, and the one giving it its name, is for the non-relaxed version where \(k’=k\): while all previous works had a query complexity \(2^{O(k)}\), here the authors show how to break that exponential barrier, giving a fully tolerant testing algorithm with query complexity \(2^{\tilde{O}(\sqrt{k})}\)!

And finally, a foray into database theory:

Towards Approximate Query Enumeration with Sublinear Preprocessing Time, by Isolde Adler and Polly Fahey (arXiv). In this paper, the authors are concerned with the task of (approximate) query enumeration on databases, aiming for ultra efficient (i.e., sublinear-time) algorithms. Leveraging techniques from property testing (specifically, in the bounded-degree graph model), they show the following:
On input databases of bounded degree and bounded tree-width, every (fixed) first-order definable query can be enumerated approximately in time linear in the output size, after only a sublinear-time preprocessing phase.

That’s all for this month! If you noticed a paper we missed, please let us know in the comments.

News for October 2020

Sorry for the delay in writing this monthly digest: we hope you didn’t spend the week frantically refreshing your browsers! We found four papers this month: let’s dive in.

A Structural Theorem for Local Algorithms with Applications to Coding, Testing, and Privacy, by Marcel Dall’Agnol, Tom Gur, and Oded Lachish (ECCC). This paper introduces the notion of “robust (local) algorithm,” an abstraction which encompasses may types of algorithms: e.g., property testing algorithms, locally decodable codes, etc. The main result of this work is that any (possibly adaptive) \(q\)-query robust local algorithm can be transformed into a non-adaptive, sample-based one making \(n^{1-1/(q^2\log q)}\) queries (where \(n\) is the input size). Here, “sample-based” means that the algorithm doesn’t get to make arbitrary queries, but just gets to observe randomly chosen coordinates of the input. As application of this result, the authors derive new upper and lower bounds for several of the types of local algorithms mentioned above, resolving open questions from previous works.

Testing Tail Weight of a Distribution Via Hazard Rate, by Maryam Aliakbarpour, Amartya Shankha Biswas, Kavya Ravichandran, Ronitt Rubinfeld (arXiv). The authors consider, from the point of view of distribution testing, the question of deciding whether data is heavy-tailed: as would, for instance, data following a power law. The paper first sets out to formalize the question, and discusses various possible definitional choices before setting on one; after which it provides a test, and analyzes its sample complexity as a function of various parameters (such as smoothness of the unknown distribution). The authors finally back these results with empirical evaluation of their algorithm.

On Testing of Samplers, by Kuldeep S. Meel, Yash Pote, Sourav Chakraborty (arXiv). Suppose you are given a (known) distribution \(p\) over some domain \(\Omega\), and want to sample from it conditioned on some predicate \(\varphi\). Now someone comes to you with an algorithm which does exactly that, efficiently, and cheaply: great! But can you easily check that you’re not getting fooled, and that this sampler actually does what it claims? This paper provides this: an algorithm which accepts if the sampled distribution is \(\varepsilon\)-close to what it should (roughly, in a multiplicative, KL divergence sense), and rejects if it’s \(\varepsilon’\)-far (in total variation distance). The number of samples required is polynomial in \(\varepsilon’-\varepsilon\), and depends on some characteristic of \(p\) and \(\varphi\), the “tilt” (ratio between max and min probability of the conditional distribution).

Finally, an omission from late September:
Sample optimal Quantum identity testing via Pauli Measurements, by Nengkun Yu (arXiv). The abstract is concise and clear enough to speak for itself: “In this paper, we show that \(\Theta(\textrm{poly}(n)\cdot\frac{4^n}{\varepsilon^2})\) is the sample complexity of testing whether two \(n\)-qubit quantum states \(\rho\) and \(\sigma\) are identical or \(\varepsilon\)-far in trace distance using two-outcome Pauli measurements.”

Please let us know if you spotted a paper we missed!

Videos from the WoLA 2020 workshop

The 4th Workshop on Local Algorithms (WoLA 2020) recently concluded: aimed at “fostering dialogue and cross-pollination of ideas between the various communities” related to local algorithms, broadly construed, it featured invited and contributed talks on a variety of topics, many (if not most) very relevant to the sublinear algorithms and property testing community.

Because of the online format of the workshop (imposed by the current circumstances), all talks were recorded and posted online. As such, all videos all available on the workshop’s website and YouTube channel: a good list of resources to peruse!