**Testing in the bounded-degree graph model with degree bound two** by Oded Goldreich and Laliv Tauber (ECCC). One of the great, central results in graph property testing is that all monotone properties are testable (with query complexity independent of graph size) on dense graphs. The sparse graph universe is far, far more complicated and interesting. Even for graphs with degree bound 3, natural graph properties can have anywhere from constant to linear (in \(n\)) query complexity. This note shows that when considering graphs with degree bound at most 2, the landscape is quite plain. The paper shows that all properties are testable with \(poly(\varepsilon^{-1})\) queries. Any graph with degree at most 2 is a collection of paths and cycles. In \(poly(\varepsilon^{-1})\) queries, one can approximately learn the graph. (After which the testing problem is trivial.) The paper gives a simple \(O(\varepsilon^{-4})\) query algorithm, which is improved to the nearly optimal \(\widetilde{O}(\varepsilon^{-2})\) bound.

**On the power of nonstandard quantum oracles** by Roozbeh Bassirian, Bill Fefferman, and Kunal Marwaha (arXiv). This paper is on the power of oracles in quantum computation. An important question in quantum complexity theory is whether \(QCMA\) is a strict subset of \(QMA\). The former consists of languages decided by Merlin-Arthur quantum protocols with a classical witness (the string that Merlin provides). The latter class allows Merlin to provide a quantum witness. This paper gives a property testing problem that exhibits such a separation. The property is essentially graph non-expansion (does there exist a set of low conductance?). The input graph should be thought of as an even (bounded) degree graph with “exponentially many” vertices. So it has \(N = 2^n\) vertices. The graph is represented through a special “graph-coded” function. The paper shows that there is a \(poly(n)\)-sized quantum witness for non-expansion that can be verified in \(poly(n)\) time, which includes queries to the graph-coded function. On the other hand, there is no classical \(poly(n)\)-sized witness that can be verified in \(poly(n)\) queries to the graph-coded function. (Informally speaking, any \(QCMA\) protocol needs exponentially many queries to the graph.)

Our own Clément Canonne has written a beautiful survey which is now available in FnT book format from now publishers. This appears to be a very promising read — especially for the Distribution Testers among you. Today’s post is a mere advertisement for this beautiful survey/book which is clearly the result of a dedicated pursuit.

Let me now dig into this survey a teeny tiny bit. One among the many cool features of this survey is that it uses one central example (testing goodness-of-fit) to give a unified treatment to the diverse tools and techniques used in distribution testing. Another plus for me is the historical notes section that accompanies every chapter. In particular, I really liked jumping into the informative history section at the end of Chapter 2, which has an almost story-like feel to it. If the above points do not catch your fancy, then please try opening the survey. You will be hard-pressed to find a book that is typeset in such an aesthetically pleasing way, with colored fonts to emphasize various parameters in several intricate proofs. Happy Reading!

**Sublinear Time Algorithms and Complexity of Approximate Maximum Matching** by Soheil Behnezhad, Mohammad Roghani, Aviad Rubinstein (arXiv) This paper significantly advances our understanding of the maximum matching problem in the sublinear regime. Your goal is to estimate the size of the maximum matching and you may assume that you have query access to the adjacency list of your graph. Our posts from Dec 2021 and June 2022 reported some impressive progress on this problem. The upshot from these works essentially said that you can beat greedy matching and obtain a \(\frac{1}{2} + \Omega(1)\) approximate maximum matching in sublinear time. Let me first go over the algorithmic results from the current paper. The paper shows the following two algorithmic results:

(1) An algorithm that runs in time \(n^{2 - \Omega_{\varepsilon}(1)}\) and returns a \(2/3 - \varepsilon\) approximation to maximum matching in general graphs, and

(2) An algorithm that runs in time \(n^{2 - \Omega_{\varepsilon}(1)}\) and returns a \(2/3 + \varepsilon\) approximation to maximum matching size in *bipartite* graphs.

The question remained: can we show a lower bound that grows superlinearly with \(n\)? The current work achieves this and shows that *even on bipartite graphs*, you must make at least \(n^{1.2 - o(1)}\) *queries* to the adjacency list to get a \(2/3 + \Omega(1)\) approximation. (An aside: A concurrent work by Bhattacharya-Kiss-Saranurak from December also obtains similar algorithmic results for approximating the maximum matching size in general graphs.)

**Directed Isoperimetric Theorems for Boolean Functions on the Hypergrid and an \(\widetilde{O}(n \sqrt d)\) Monotonicity Tester** by Hadley Black, Deeparnab Chakrabarty, C. Seshadhri (arXiv) Boolean Monotonicity testing is as classic as classic gets in property testing. Encouraged by the success of isoperimetric theorems over the hypercube domain and the monotonicity testers powered by these isoperimetries (over the hypercube), one may wish to obtain efficient monotonicity testers for the hypergrid \([n]^d\). Indeed, the same gang of authors as above showed in a previous work that a Margulis style directed isoperimetry can be extended from the lowly hypercube to the hypergrid. This resulted in a tester with \(\widetilde{O}(d^{5/6})\) queries. The more intricate task of proving a directed Talagrand style isoperimetry that underlies the Khot-Minzer-Safra breakthrough was a challenge. Was. The featured work extends this isoperimetry from the hypercube to the hypergrid and this gives a tester with query complexity \(\widetilde{O}(n \sqrt d)\) which is an improvement over the \(d^{5/6}\) bound for domains where \(n\) is (say) some small constant. But as they say, when it rains, it pours. This brings us to a concurrent paper with the same result.

**Improved Monotonicity Testers via Hypercube Embeddings** by Mark Braverman, Subhash Khot, Guy Kindler, Dor Minzer (arXiv) Similar to the paper above, this paper also obtains monotonicity testers over the hypergrid domain, \([n]^d\), with \(\widetilde{O}(n^3 \sqrt d)\) queries. This paper also presents monotonicity testers over the standard hypercube domain — \(\{0,1\}^d\) in the \(p\)-biased setting. In particular, their tester issues \(\widetilde{O}(\sqrt d)\) queries to successfully test monotonicity on the \(p\)-biased cube. Coolly enough, this paper also proves directed Talagrand style isoperimetric inequalities both over the hypergrid and the \(p\)-biased hypercube domains.

**Toeplitz Low-Rank Approximation with Sublinear Query Complexity** by Michael Kapralov, Hannah Lawrence, Mikhail Makarov, Cameron Musco, Kshiteej Sheth (arXiv) Another intriguing paper for the holiday month. So, take a Toeplitz matrix. Did you know that any *psd* Toeplitz matrix admits a (near-optimal in the Frobenius norm) low-rank approximation which is itself Toeplitz? This is a remarkable statement. The featured paper proves this result and uses it to get more algorithmic mileage. In particular, suppose you are given a \(d \times d\) Toeplitz matrix \(T\). Armed with the techniques from the paper you get algorithms that return a Toeplitz matrix \(\widetilde{T}\) with rank slightly bigger than \(rank(T)\) which is a very good approximation to \(T\) in the Frobenius norm. Moreover, the algorithm only issues a number of queries sublinear in the size of \(T\).

**Sampling an Edge in Sublinear Time Exactly and Optimally** by Talya Eden, Shyam Narayanan and Jakub Tětek (arXiv) Regular readers of PTReview are no strangers to the fundamental task of sampling a random edge from a graph which you can access via query access to its vertices. Of course, you don’t have direct access to the edges of this graph. This paper considers the task of sampling a truly uniform edge from the graph \(G = (V,E)\) with \(|V| = n, |E| = m\). In STOC 22, Tětek and Thorup presented an algorithm for a relaxation of this problem where you want an \(\varepsilon\)-approximately uniform edge. This algorithm runs in time \(O\left(\frac{n}{\sqrt{m}} \cdot \log(1/\varepsilon) \right)\). The featured paper presents an algorithm that samples an honest-to-goodness uniform edge in expected time \(O(n/\sqrt{m})\). This closes the problem as we already know a matching lower bound. Indeed, just consider a graph with \(O(\sqrt m)\) vertices which induce a clique and where all the remaining components are singletons. You need to sample at least \(\Omega(n/\sqrt m)\) vertices before you see any edge.

**Support Size Estimation: The Power of Conditioning** by Diptarka Chakraborty, Gunjan Kumar, Kuldeep S. Meel (arXiv) This work considers the classic problem of support size estimation with a slight twist. You are given access to a stronger (conditioning based) sampling oracle. Let me highlight one of the results from this paper. So, you are given a distribution \(D\) where \(supp(D) \subseteq [n]\). You want to obtain an estimate of \(|supp(D)|\) that lies within \(|supp(D)| \pm \varepsilon n\) with high probability. Suppose you are also given access to the following sampling oracle. You may choose any subset \(S \subseteq [n]\) and you may request a sample \(x \sim D\vert_S\). An element \(x \in S\) is returned with probability \(D\vert_S(x) = D(x)/D(S)\) (for simplicity of this post, let us assume \(D(S) > 0\)). In addition, this oracle also reveals for you the value \(D(x)\). The paper shows that the algorithmic task of obtaining a high probability estimate of the support size (to within \(\pm \varepsilon n\)) with this sampling oracle admits a lower bound of \(\Omega(\log \log n)\) calls to the sampling oracle.
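To make the oracle concrete, here is a minimal Python sketch of it for an explicitly known \(D\). The names `make_cond_oracle` and `cond_sample` are made up for this post and are not from the paper; a real tester would of course only see the oracle, not the dictionary behind it.

```python
import random

def make_cond_oracle(D, seed=0):
    """Build a conditional sampling oracle for D (dict: element -> probability).

    Hypothetical helper for illustration; the tester only calls cond_sample.
    """
    rng = random.Random(seed)

    def cond_sample(S):
        # Keep only the elements of S that carry positive mass under D.
        items = [(x, D[x]) for x in sorted(S) if D.get(x, 0.0) > 0]
        mass = sum(p for _, p in items)
        if mass <= 0:
            raise ValueError("the post assumes D(S) > 0")
        # x is returned with probability D(x) / D(S).
        x = rng.choices([x for x, _ in items],
                        weights=[p for _, p in items], k=1)[0]
        # The oracle also reveals the value D(x).
        return x, D[x]

    return cond_sample

D = {1: 0.5, 2: 0.25, 5: 0.25}
oracle = make_cond_oracle(D)
x, px = oracle({2, 5, 7})   # x is 2 or 5, each with probability 1/2
```

Conditioning on \(S = \{2, 5, 7\}\) ignores the element 7 (it has no mass) and renormalizes the remaining mass \(D(S) = 1/2\).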

**Computing (1+epsilon)-Approximate Degeneracy in Sublinear Time** by Valerie King, Alex Thomo, Quinton Yong (arXiv) Degeneracy is one of the important graph parameters which is relevant to several problems in algorithmic graph theory. A graph \(G = (V,E)\) is \(\delta\)-degenerate if all induced subgraphs of \(G\) contain a vertex with degree at most \(\delta\). The featured paper presents algorithms for a \((1 + \varepsilon)\)-approximation to degeneracy of \(G\) where you are given access to \(G\) via its adjacency list.
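For reference, the exact degeneracy can be computed by the textbook peeling procedure: repeatedly delete a vertex of minimum degree and record the largest degree seen at deletion time. The sketch below is this classical linear-ish time baseline, not the sublinear algorithm from the paper; the function name and the adjacency-dictionary input format are choices made for this post.

```python
import heapq

def degeneracy(adj):
    """Exact degeneracy of an undirected graph {vertex: set_of_neighbors}.

    Classical min-degree peeling (a baseline, not the paper's algorithm).
    Vertices are assumed to be comparable, e.g. integers.
    """
    deg = {v: len(nbrs) for v, nbrs in adj.items()}
    heap = [(d, v) for v, d in deg.items()]
    heapq.heapify(heap)
    removed = set()
    best = 0
    while heap:
        d, v = heapq.heappop(heap)
        if v in removed or d != deg[v]:
            continue  # stale heap entry, a fresher one exists
        removed.add(v)
        best = max(best, d)  # degree of v in the remaining induced subgraph
        for u in adj[v]:
            if u not in removed:
                deg[u] -= 1
                heapq.heappush(heap, (deg[u], u))
    return best

triangle = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
print(degeneracy(triangle), degeneracy(star))
```

The star has a vertex of degree 3 but degeneracy 1, which is why degeneracy is a finer sparsity measure than maximum degree.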

**Learning and Testing Latent-Tree Ising Models Efficiently** by Davin Choo, Yuval Dagan, Constantinos Daskalakis, Anthimos Vardis Kandiros (arXiv) Ising models are emerging as a rich and fertile frontier for Property Testing and Learning Theory researchers (at least to the uninitiated ones like me). This paper considers latent-tree Ising models. These are Ising models that can only be observed at their leaf nodes. One of the results in this paper gives an algorithm for testing whether the leaf distributions attached to two latent-tree Ising models are close or far in the TV distance.

**A constant lower bound for the union-closed sets conjecture** by Justin Gilmer (arXiv) The union-closed sets conjecture of Frankl states that for any union-closed set system \(\mathcal{F} \subseteq 2^{[n]}\), there is a mysterious element \(i \in [n]\) that shows up in at least a \(c = 1/2\) fraction of the sets in \(\mathcal{F}\). Gilmer took a first swipe at this problem and gave a constant lower bound of \(c = 0.01\). This has already been improved by at least four different groups to \(\frac{3-\sqrt{5}}{2}\), a bound which is the limit of Gilmer’s method (which takes all of only 9 pages!).

The key lemma Gilmer proves is the following. Suppose you sample two sets: \(A, B \sim \mathcal{D}_n\) *(iid)* from some distribution \(\mathcal{D}_n\) over the subsets of \([n]\). Suppose for every index \(i \in [n]\), it holds that the probability that the element \(i\) shows up in the random set \(A\) is at most \(0.01\). Then you have \(H(A \cup B) \geq 1.26 H(A)\). This is all you need to finish Gilmer’s proof (of \(c = 0.01\)). The remaining argument is as follows. Suppose, by way of contradiction, that no element shows up in at least a \(0.01\) fraction of sets in the union-closed family \(\mathcal{F}\). An application of the key lemma would then give \(H(A \cup B) > H(A)\), which is a contradiction if \(A,B\) are chosen uniformly from \(\mathcal{F}\): the set \(A \cup B\) also lies in \(\mathcal{F}\), and the uniform distribution has the largest entropy among all distributions over \(\mathcal{F}\). The proof of the key lemma is also fairly slick and uses pretty simple information theoretic tools.
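If you want to play with these quantities on small families, here is a rough Python sketch. It checks union-closedness, computes the frequency of the most popular element, and computes \(H(A \cup B)\) for \(A, B\) i.i.d. uniform over the family. All function names are invented for this post, and the code only illustrates the definitions; it is not part of Gilmer's proof.

```python
from collections import Counter
from itertools import product
from math import log2

def is_union_closed(family):
    """True if the union of any two member sets is again a member."""
    fam = set(family)
    return all((a | b) in fam for a in fam for b in fam)

def max_element_frequency(family):
    """Largest fraction of member sets that contain a single element."""
    fam = list(family)
    counts = Counter(x for s in fam for x in s)
    return max(counts.values()) / len(fam) if counts else 0.0

def entropy(dist):
    """Shannon entropy (in bits) of a dict mapping outcomes to probabilities."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

def union_entropy(family):
    """H(A u B) for A, B drawn i.i.d. uniformly from the family."""
    fam = list(family)
    counts = Counter(a | b for a, b in product(fam, repeat=2))
    total = len(fam) ** 2
    return entropy({s: c / total for s, c in counts.items()})

# The power set of {1, 2}, which is union-closed.
F = {frozenset(), frozenset({1}), frozenset({2}), frozenset({1, 2})}
print(is_union_closed(F), max_element_frequency(F))
print(union_entropy(F), log2(len(F)))  # H(A u B) versus H(A) = log2 |F|
```

On a union-closed family the first entropy can never exceed the second, which is exactly the tension that the key lemma exploits.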

**Gaussian Mean Testing Made Simple**, by Ilias Diakonikolas, Daniel Kane and Ankit Pensia (arXiv). Consider an unknown distribution \(p\) over \(\mathbb{R}^d\) that we have sample access to. The paper studies the problem of determining whether \(p\) is a standard Gaussian with zero mean or whether it is a Gaussian with large mean. More formally, the task is to distinguish between the case that \(p\) is \(\mathcal{N}(0, I_d)\) and the case that \(p\) is a Gaussian of the form \(\mathcal{N}(\mu, \Sigma)\), where \(||\mu||_2 \geq \epsilon\) and \(\Sigma\) is an unknown covariance matrix. Canonne, Chen, Kamath, Levi and Weingarten (2021) gave a sample-optimal algorithm for this problem with sample complexity \(\Theta(\sqrt{d}/\epsilon^2)\). The current paper gives another sample-optimal algorithm for the same problem with a simpler analysis. In addition to being sample-optimal, the algorithm in the current paper also runs in time linear in the total sample size, which is an improvement over the work of Canonne et al.

**Superpolynomial lower bounds for decision tree learning and testing**, by Caleb Koch, Carmen Strassle and Li-Yang Tan (arXiv). Roughly speaking, the paper studies the problems of testing if a function has a low-depth decision tree and learning a low-depth decision tree approximating a function (provided that one such tree exists). In what follows, we summarize the testing results in the paper. Given an explicit representation of a function \(f:\{0,1\}^n \to \{0,1\}\) and access to samples from a known distribution \(\mathcal{D}\) over \(\{0,1\}^n\), one can aim to determine, with probability at least \(2/3\), if \(f\) has a decision tree of depth at most \(d\) or whether \(f\) is \(\epsilon\)-far from having a decision tree of depth at most \(d\log d\), where the distance is measured with respect to \(\mathcal{D}\). The paper shows that, under the randomized exponential time hypothesis, this problem cannot be solved in time \(\exp(d^{\Omega(1)})\). An immediate corollary is that the same lower bound holds for the problem of distribution-free testing of the property of having depth-\(d\) decision trees. The bound in the current paper is an improvement over the recent work of Blais, Ferreira Pinto Jr., and Harms (2021), who give a lower bound of \(\tilde{\Omega}(2^d)\) on the query complexity of testers for the same problem. However, the advantage of the latter result is that it is unconditional, as opposed to the result in the current paper.

**On Interactive Proofs of Proximity with Proof-Oblivious Queries**, by Oded Goldreich, Guy Rothblum, and Tal Skverer (ECCC). Interactive Proofs of Proximity (IPPs) are the “interactive” version of property testers, where the algorithm can both query the input and interact with an all-knowing (but untrusted) prover. In this work, the authors study the power of a specific and natural type of “adaptivity” for IPPs, asking what happens when the choice of queries and the interaction with the prover are independent, or restricted. That is, what happens when these two aspects of the IPP algorithm are in separate “modules”? Can we still test various properties as efficiently? The paper proves various results under several models (=restrictions between the two “modules”), focusing on the intermediate restriction where the two modules (queries to the input and interaction with the prover) are separate (no interaction), but have access to shared randomness.

**Training Overparametrized Neural Networks in Sublinear Time** by Hang Hu, Zhao Song, Omri Weinstein, Danyang Zhuo (arXiv). Think of a classification problem where the inputs are in \(\mathbb{R}^d\). We have \(n\) such points (with their true labels, as training data) and wish to train a Neural Network. A two layer Rectified Linear Unit (ReLU) Neural Network (NN) works as follows. The first layer has \(m\) vertices, where each vertex has vector weight \(\vec{w}_i \in \mathbb{R}^d\). The second “hidden layer” has \(m\) vertices, each with a scalar weight \(a_1, a_2, \ldots, a_m\). This network is called overparametrized when \(m \gg n\). The output of this NN on input vector \(\vec{x}\) is (up to scaling) \(\sum_{i \leq m} a_i \phi(\vec{w_i} \cdot \vec{x})\) (where \(\phi\) is a thresholded linear function). Observe that to compute the value on a single input takes \(O(md)\) time, so the total time to compute all values on \(n\) training inputs takes \(O(mnd)\) time. The training is done by gradient descent methods; given a particular setting of weights, we compute the total loss, and then modify the weights along the gradient. Previous work showed how a single iteration can be done in time \(O(mnd + n^3)\). When \(m \gg n^2\), this can be thought of as linear in computing the loss function (which requires evaluating the NN on all the \(n\) points). This paper shows how to implement a single iteration in \(O(m^{1-\alpha}nd + n^3)\) time, for some \(\alpha > 0\). Hence, the time for an iteration is sublinear in the trivial computation. The techniques used are sparse recovery methods and random projections.
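As a sanity check on the \(O(md)\) and \(O(mnd)\) counts, here is a bare-bones Python sketch of the forward pass. It takes \(\phi\) to be the ReLU \(z \mapsto \max(0, z)\), drops the scaling factor, and uses squared loss; the loss and the function names are assumptions made for this post rather than details from the paper.

```python
def relu(z):
    """The thresholded linear function phi."""
    return max(0.0, z)

def forward(W, a, x):
    """Network output sum_i a_i * relu(w_i . x) on one input x.

    W is a list of m weight vectors of length d, a is a list of m scalars.
    Each of the m inner products costs O(d), so one evaluation is O(m d).
    """
    return sum(a_i * relu(sum(w_ij * x_j for w_ij, x_j in zip(w_i, x)))
               for w_i, a_i in zip(W, a))

def total_squared_loss(W, a, X, y):
    """Evaluate the network on all n training points: O(m n d) time."""
    return sum((forward(W, a, x) - y_k) ** 2 for x, y_k in zip(X, y))

W = [[1.0, 0.0], [0.0, -1.0]]   # m = 2 hidden units, d = 2
a = [1.0, 2.0]
print(forward(W, a, [3.0, 4.0]))                      # only the first unit fires
print(total_squared_loss(W, a, [[3.0, 4.0]], [1.0]))
```

This naive evaluation is the "trivial computation" that the paper's per-iteration cost is measured against.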

**Testing of Index-Invariant Properties in the Huge Object Model** (by Sourav Chakraborty, Eldar Fischer, Arijit Ghosh, Gopinath Mishra, and Sayantan Sen)(arXiv) This paper explores a class of distribution testing problems in the Huge Object Model introduced by Goldreich and Ron (see our coverage of the model here). A quick refresher of this model: so, suppose you want to test whether a distribution \(\mathcal{D}\) supported over, say the boolean hypercube \(\{0,1\}^n\) has a certain property \(\mathcal{P}\). You pick a string \(x \sim \mathcal{D}\) where the length of \(x\) is \(n\). In situations where \(n\) is really large, you might not want to read all of \(x\) and you may instead want to read only a few bits from it. To this end, Goldreich and Ron formulated a model where you have query access to the strings you sample. The distribution \(\mathcal{D}\) is deemed to be \(\varepsilon\)-far from \(\mathcal{P}\) if \(EMD(\mathcal{D}, \mathcal{P}) \geq \varepsilon\) (here \(EMD\) denotes the earthmover distance with respect to the relative Hamming distance between bitstrings). In this model, one parameter of interest is the query complexity of your tester.

One of the results in the featured paper above shows the following: Let \(\sf{MONOTONE}\) denote the class of monotone distributions supported over \(\{0,1\}^n\) (a distribution \(D\) belongs to the class \(\sf{MONOTONE}\) if \(D(x) \leq D(y)\) whenever \(0^n \preceq x \preceq y \preceq 1^n\)). Let \(\mathcal{B}_d\) denote the class of distributions supported over \(\{0,1\}^n\) whose supports have VC dimension at most \(d\). Let \(\mathcal{P} = \sf{MONOTONE} \cap \mathcal{B}_d\). Then, for any \(\varepsilon > 0\), you can test whether a distribution \(\mathcal{D}\) belongs to \(\mathcal{P}\) or whether it is \(\varepsilon\)-far from \(\mathcal{P}\) with query complexity \(poly(1/\varepsilon)\). In fact, the paper shows this for a much richer class \(\mathcal{P}\), which is the class of so-called *index-invariant* distributions with bounded VC dimension. The paper also shows the necessity of both of these conditions for efficient testability. Do check it out!

**Identity Testing for High-Dimensional Distributions via Entropy Tensorization** (by Antonio Blanca, Zongchen Chen, Daniel Štefankovič, and Eric Vigoda)(arXiv)

This paper considers a classic in distribution testing. Namely, the problem of testing whether the hidden input distribution \(\pi\) is identical to an explicitly given distribution \(\mu\). Both distributions are supported over a set \(\Omega\). The caveat is \(\Omega\) is some high dimensional set (think \(\Omega = [k]^n\)) and that it has a size that grows exponentially in \(n\). In this case, identity testing has sample complexity \(\Omega(k^{n/2})\) even when \(\mu\) is the uniform distribution. In an attempt to overcome this apparent intractability of identity testing in high dimensions, this paper takes the following route: in addition to the standard sample access to \(\pi\), you also assume access to a *stronger sampling oracle* from \(\pi\). And now you would like to understand for which class of explicitly given distributions \(\mu\) can you expect algorithms with efficient sample complexity (assuming the algorithm is equipped with this *stronger sampling oracle*). For any \(i \in [n]\) and \(\omega \in \Omega\), the stronger oracle considered in this work allows you to sample \(x \sim \pi_{\omega(-i)}\) where \(\pi_{\omega(-i)}\) denotes the conditional marginal distribution of \(\pi\) over the \(i\)-th coordinate when the remaining coordinates have been fixed according to \(\omega\).
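Here is a small Python sketch of this coordinate oracle for an explicitly given \(\pi\) over \([k]^n\), with coordinates taking values in \(\{0, \ldots, k-1\}\). The helper names are invented for this post, and an actual tester would only have black-box access to `sample`.

```python
import random

def coordinate_oracle(pi, k, seed=0):
    """Coordinate oracle for pi, a dict mapping n-tuples over range(k) to probabilities.

    Hypothetical helper: returns the exact conditional marginal and a sampler.
    """
    rng = random.Random(seed)

    def conditional_marginal(i, omega):
        # Mass of each configuration that agrees with omega outside coordinate i.
        weights = [pi.get(omega[:i] + (v,) + omega[i + 1:], 0.0)
                   for v in range(k)]
        total = sum(weights)
        if total <= 0:
            raise ValueError("conditioning on a zero-probability event")
        return [w / total for w in weights]

    def sample(i, omega):
        # Draw the i-th coordinate with the other coordinates pinned to omega.
        probs = conditional_marginal(i, omega)
        return rng.choices(range(k), weights=probs, k=1)[0]

    return conditional_marginal, sample

# Two correlated bits: equal values are more likely.
pi = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
marginal, sample = coordinate_oracle(pi, k=2)
print(marginal(1, (0, 0)))   # law of the second bit given that the first bit is 0
```

Note that the value of `omega` at coordinate `i` itself is ignored, matching the notation \(\omega(-i)\).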

The paper shows if the known distribution \(\mu\) satisfies some approximate tensorization of entropy criterion, then identity testing with such distributions \(\mu\) can be done with \(\tilde{O}(n/\varepsilon)\) queries. Thanks to the spectral independence toolkit pioneered by Anari et al, it turns out that the approximate tensorization property holds for a rich class of distributions. (*A side note to self*: It looks like I am running out of reasons to postpone learning about the new tools like Spectral Independence.)

**Near-Optimal Bounds for Testing Histogram Distributions** (by Clément L. Canonne, Ilias Diakonikolas, Daniel M. Kane, and Sihan Liu)(arXiv) Histograms comprise one of the most natural and widely used ways for summarizing some relevant aspects of massive datasets. Let \(\Omega\) denote an \(n\)-element dataset (with elements being \(\{1,2, \ldots, n \}\)). A \(k\)-histogram is a function that is piecewise constant over \(k\) interval pieces. This paper studies the sample complexity of the following fundamental task: given a distribution \(\mathcal{P}\) supported over \(\Omega\), is \(\mathcal{P}\) a \(k\)-histogram or is \(\mathcal{P}\) far from being a \(k\)-histogram? The main result of the paper is a (near) sample optimal algorithm for this problem. Specifically, this paper shows that \(k\)-histogram testing has sample complexity \(\Theta\left(\sqrt{nk}/\varepsilon + k/\varepsilon^2 + \sqrt{n}/\varepsilon^2\right)\).
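To pin down the definition, the following Python sketch decides whether an explicitly given probability vector is a \(k\)-histogram by counting its maximal constant runs. This is only the definition in code, with made-up function names; the testing problem in the paper is much harder because you see samples rather than the vector itself.

```python
def num_pieces(p):
    """Number of maximal intervals on which the vector p is constant."""
    if not p:
        return 0
    pieces = 1
    for prev, cur in zip(p, p[1:]):
        if cur != prev:
            pieces += 1  # a new interval piece starts here
    return pieces

def is_k_histogram(p, k):
    """True if p is piecewise constant over at most k interval pieces."""
    return num_pieces(p) <= k

p = [0.1, 0.1, 0.3, 0.3, 0.2]   # pieces {1,2}, {3,4}, {5}
print(num_pieces(p), is_k_histogram(p, 3), is_k_histogram(p, 2))
```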

**Comments on “Testing Conditional Independence of Discrete Distributions”** (by Ilmun Kim)(arXiv) Probability is full of subtleties and conditional probability is perhaps the biggest landmine of subtleties in this venerable discipline. The featured paper closely examines some subtleties in Theorem 1.3 of the CDKS18 paper on testing conditional independence of discrete distributions. Essentially, this theorem undertakes the following endeavor: you would like to test whether a bivariate discrete distribution has independent marginals conditioned on values assumed by a third random variable. Theorem 1.3 of CDKS18 asserts that there exists a computationally efficient tester for conditional independence with small sample complexity. The featured paper fixes the sample complexity bound claimed in Theorem 1.3 of CDKS18.

**Cryptographic Hardness of Learning Halfspaces with Massart Noise** (by Ilias Diakonikolas, Daniel M. Kane, Pasin Manurangsi, and Lisheng Ren)(arXiv) The study of robust supervised learning in high dimensions has seen a lot of impressive progress in the last few years. The paper under review presents sample complexity lower bounds for the task of learning halfspaces in this overarching framework. Let us unpack this paper slowly. So, let us recall the classic task of learning halfspaces in \(\mathbb{R}^n\). You know the drill. I have a known concept class \(\mathcal{C}\) (comprising boolean functions) in my hand. Unbeknownst to you, I have a boolean function \(f \in \mathcal{C}\). You get as input a multiset \(\{(x_i, f(x_i))\}_{i \in [s]}\) of labeled examples from a distribution \(\mathcal{D}\) where \(x_i \sim \mathcal{D}_x\) and \(\mathcal{D}_x\) is fixed but arbitrary. Your goal is to develop an algorithm that returns a hypothesis with a small misclassification rate. The classic stuff.

Now, consider the same setup with a little twist: the so-called Massart noise setup. The labels \(f(x_i)\) are no longer reliable and the label on each \(x_i\) gets flipped adversarially with probability \(\eta_i \leq \eta < 1/2\). In a breakthrough, Diakonikolas, Gouleakis, and Tzamos made the first algorithmic progress on this problem and gave algorithms with running time \(poly(n/\varepsilon)\) and misclassification rate \(\eta + \varepsilon\). The current paper shows a lower-bound result. Assuming the hardness of the so-called “Learning With Errors” problem, this paper shows that under Massart Noise, it is not possible for a polynomial time learning algorithm to achieve a misclassification rate of \(o(\eta)\).
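The noise model is easy to simulate. In the Python sketch below, labels are \(\pm 1\), and the adversary is modeled as a function `eta_of` that picks a flip probability for each point, capped by \(\eta\). The names `massart_examples` and `eta_of` and the example halfspace are assumptions for illustration only.

```python
import random

def massart_examples(xs, f, eta_of, eta, seed=0):
    """Label each x by f(x) in {-1, +1}, flipped with probability eta_of(x) <= eta < 1/2."""
    rng = random.Random(seed)
    examples = []
    for x in xs:
        rate = eta_of(x)  # adversarially chosen, may depend on x
        if not (0.0 <= rate <= eta < 0.5):
            raise ValueError("Massart noise requires 0 <= eta_i <= eta < 1/2")
        label = f(x)
        if rng.random() < rate:
            label = -label
        examples.append((x, label))
    return examples

# A halfspace in R^2 and an adversary that only corrupts points near the boundary.
w = (1.0, -1.0)
f = lambda x: 1 if w[0] * x[0] + w[1] * x[1] >= 0 else -1
eta_of = lambda x: 0.3 if abs(x[0] - x[1]) < 0.5 else 0.0
xs = [(0.0, 2.0), (1.0, 0.9), (3.0, 1.0)]
print(massart_examples(xs, f, eta_of, eta=0.3))
```

Setting `eta_of` to a constant recovers the more benign random classification noise model.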

**Locally-iterative (Δ+1)-Coloring in Sublinear (in Δ) Rounds** (by Xinyu Fu, Yitong Yin, and Chaodong Zheng)(arXiv) A time-honored problem in Distributed Computing is Distributed graph coloring. Let us first understand what problem this paper studies. So, you are given a graph \(G = (V,E)\) with maximum degree \(\Delta\). In a seminal work, Szegedy and Vishwanathan introduced the framework of *locally-iterative algorithms* as a natural family of distributed graph coloring algorithms. These algorithms proceed in \(r\) rounds. In each round, you update the color of a vertex \(v\) where the new color of \(v\) is a function of the current color of \(v\) and the current color of its neighbors. The current paper shows that, in the locally-iterative framework, you can in fact obtain a proper coloring of \(G\) with \(\Delta(G) + 1\) colors in \(r = O(\Delta^{3/4} \log \Delta) + \log^* n\) rounds.
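To get a feel for the framework, here is a Python simulation of a very simple (and very slow) locally-iterative rule. Start from the proper coloring given by vertex identifiers shifted above the palette; in each round, any vertex whose color is outside \(\{0, \ldots, \Delta\}\) and larger than all of its neighbors' colors moves to the smallest palette color unused by its neighbors. This folklore rule can take up to \(n\) rounds, so it is only a sketch of the framework and emphatically not the algorithm from the paper.

```python
def locally_iterative_coloring(adj):
    """Simulate a simple locally-iterative (Delta+1)-coloring.

    adj: {vertex: set_of_neighbors}, vertices are distinct non-negative integers.
    Returns (coloring with colors in 0..Delta, number of rounds).
    """
    delta = max((len(nbrs) for nbrs in adj.values()), default=0)
    # Initial proper coloring: distinct colors, all outside the final palette.
    color = {v: delta + 1 + v for v in adj}
    rounds = 0
    while any(c > delta for c in color.values()):
        new_color = dict(color)
        for v in adj:
            nbr_colors = {color[u] for u in adj[v]}
            # The update depends only on v's color and its neighbors' colors.
            if color[v] > delta and all(color[v] > c for c in nbr_colors):
                new_color[v] = min(c for c in range(delta + 1)
                                   if c not in nbr_colors)
        color = new_color  # all vertices update simultaneously
        rounds += 1
    return color, rounds

cycle = {i: {(i - 1) % 5, (i + 1) % 5} for i in range(5)}
coloring, rounds = locally_iterative_coloring(cycle)
print(coloring, rounds)
```

The coloring stays proper after every round because two adjacent vertices can never both be strict local maxima, so no two neighbors recolor at the same time.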

**Learning Hierarchical Structure of Clusterable Graphs** (by Michael Kapralov, Akash Kumar, Silvio Lattanzi, Aida Mousavifar)(arXiv) [*Disclaimer: I am one of the authors of this paper.*] Hierarchical clustering of graph data is a fundamentally important task in the current big data era. In 2016, Dasgupta introduced the notion of Dasgupta cost which essentially allows one to measure the quality of a hierarchical clustering. This paper presents algorithms that can estimate the Dasgupta Cost of a graph coming from a special family of \(k\)-clusterable graphs in the semi-supervised setting. These graphs have \(k\) clusters. These clusters are essentially subsets of vertices that induce expanders and these clusters are sparsely connected to each other. We are given query access to the adjacency list of \(G\). Also, for an initial “warmup” set of randomly chosen vertices, we are told the clusters they belong to. Armed with this setup, this paper presents algorithms that run in time \(\approx \sqrt{n}\) and return an estimate to the Dasgupta Cost of \(G\) which is within a \(\approx \sqrt{\log k}\) factor of the optimum cost.

**Finding a Hidden Edge** (by Ron Kupfer and Noam Nisan)(arXiv) Let us consider as a warmup (as done in the paper) the following toy problem. You have a graph on \(n\) vertices whose edge set \(E\) is hidden from you. Your objective is to return any \((i,j) \in E\). The only queries you are allowed are of the following form. You may consider any subset \(Q \subseteq V \times V\) and you can ask whether \(Q\) contains any edge. A simple binary search solves this question with \(\log m\) queries (where \(m = {n \choose 2}\)). However, if you want a non-adaptive algorithm for this problem (unlike binary search) you can show that any deterministic algorithm must issue \(m\) non-adaptive queries. Turns out randomness can help you get away with only \(O(\log^2m)\) non-adaptive queries for this special toy problem. Now, let me describe the problem considered in this work in earnest. Suppose the only queries you are allowed are of the following form: you may pick any \(S \subseteq V\) and you may ask whether the graph induced on \(S\) contains an edge. The paper’s main result is that there is an algorithm for finding an edge in \(G\) which issues nearly linear in \(n\) many non-adaptive queries. The paper also presents an almost matching lower bound.
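The adaptive binary search for the toy problem can be sketched in a few lines of Python. The oracle `contains_edge(Q)` answers whether the set of pairs \(Q\) contains a hidden edge; that name and `find_edge_adaptive` are made up for this post. One extra query is spent up front to check that the graph has an edge at all.

```python
from itertools import combinations

def find_edge_adaptive(n, contains_edge):
    """Find a hidden edge with about log2(m) adaptive queries, m = C(n, 2).

    contains_edge(Q) must return True iff the set of pairs Q contains an edge.
    Returns (edge or None, number of queries used).
    """
    candidates = list(combinations(range(n), 2))
    queries = 1
    if not contains_edge(set(candidates)):
        return None, queries  # the graph has no edges
    # Invariant: candidates always contains at least one hidden edge.
    while len(candidates) > 1:
        mid = len(candidates) // 2
        queries += 1
        if contains_edge(set(candidates[:mid])):
            candidates = candidates[:mid]
        else:
            candidates = candidates[mid:]
    return candidates[0], queries

hidden = {(2, 4), (0, 5)}
oracle = lambda Q: bool(Q & hidden)
print(find_edge_adaptive(6, oracle))
```

Every query here depends on the previous answers, which is precisely what a non-adaptive algorithm is not allowed to do.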

**On One-Sided Testing Affine Subspaces** (by Nader Bshouty)(ECCC) Dictatorship testing is one of the classics in property testing of boolean functions. A more generalized problem considers testing whether the presented function is a \(k\)-monomial. If you are a regular reader of the posts on PTReview, you might have seen that this problem essentially asks you to test whether a boolean function \(f \colon \mathcal{F}^n \to \{0,1\}\) is an indicator of an \((n-k)\) dimensional affine/linear subspace of \(\mathcal{F}^n\) (here \(\mathcal{F}\) denotes a finite field). Namely, you would like to test whether the set \(f^{-1}(1)\) is an \((n-k)\) dimensional affine subspace of \(\mathcal{F}^n\). The paper under review improves the state-of-the-art query complexity for this problem from a previous value of \(O\left(|\mathcal{F}|/\varepsilon\right)\) to \(\tilde{O}\left(1/\varepsilon\right)\).

**Non-Adaptive Edge Counting and Sampling via Bipartite Independent Set Queries** (by Raghavendra Addanki, Andrew McGregor, and Cameron Musco)(arXiv) If you have been around the PTReview corner for a while, you know that sublinear time estimation of graph properties is one of our favorite pastimes here. Classic work in this area considers the following queries: vertex degree queries, \(i\)-th neighbor queries, and edge existence queries. This classic query model has received a lot of attention and thanks to the work of Eden and Rosenbaum we know algorithms for near-uniform edge sampling with query complexity \(O(n/\sqrt{m}) \cdot poly(\log n) \cdot poly(1/\varepsilon)\). Motivated by a desire to obtain more query-efficient algorithms, Beame et al. introduced an augmented query model where you are also allowed the following queries: you may pick \(L, R \subseteq V\) and you get a yes/no response indicating whether there exists an edge in \(E(L, R)\). These are also called the bipartite independent set *(BIS)* queries. The featured paper shows that with (*BIS*) queries you get *non-adaptive* algorithms for near-uniform edge sampling with query complexity being a mere \(\widetilde{O}(\varepsilon^{-4} \log^6 n)\). The main result of the paper gives a non-adaptive algorithm for estimating the number of edges in \(G\) with query complexity (under *BIS*) being a mere \(\widetilde{O}(\varepsilon^{-5} \log^5 n)\).

**A Query-Optimal Algorithm for Finding Counterfactuals** (by Guy Blanc, Caleb Koch, Jane Lange, Li-Yang Tan)(arXiv) Given an abstract space \(X^d\), an instance \(x^* \in X^d\) and a model \(f\) (which you think of as a boolean function over \(X^d\)), a point \(x' \in X^d\) is called a counterfactual to \(x^*\) if \(x^*, x'\) differ in few features (i.e., have a small Hamming distance) and \(f(x^*) \neq f(x')\). Ideally, you would like to find counterfactuals that are as close as possible to \(x^*\) in Hamming distance. The main result of this paper is the following: Take a monotone model \(f \colon \{0,1\}^d \to \{0,1\}\) and an instance \(x^* \in \{0,1\}^d\) with small sensitivity (say \(\alpha\)). Then there exists an algorithm that makes at most \(\alpha^{\Delta(x^*)}\) queries to \(f\) and returns all optimal counterfactuals of \(x^*\). Here \(\Delta(x^*) = \min_{x \in \{0,1\}^d} \{\Delta_H(x, x^*) \colon f(x) \neq f(x^*) \}\). The paper also proves a matching lower bound on query complexity, which is attained by some monotone model \(f\).
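For intuition about \(\Delta(x^*)\) and optimal counterfactuals, here is a brute-force Python sketch that searches Hamming balls of increasing radius around \(x^*\). It can make exponentially many queries in \(d\), so it is only a baseline for small examples and not the query-optimal algorithm from the paper; the function name is invented for this post.

```python
from itertools import combinations

def optimal_counterfactuals(f, x_star):
    """Return (Delta(x*), list of all closest points x with f(x) != f(x*)).

    Brute force over Hamming spheres of radius 1, 2, ..., d around x_star.
    """
    d = len(x_star)
    base = f(x_star)
    for r in range(1, d + 1):
        found = []
        for flips in combinations(range(d), r):
            x = list(x_star)
            for i in flips:
                x[i] = 1 - x[i]  # flip the chosen features
            x = tuple(x)
            if f(x) != base:
                found.append(x)
        if found:
            return r, found  # first radius with a label change is optimal
    return None, []  # f is constant

majority = lambda x: int(sum(x) >= 2)   # a monotone model on 3 bits
print(optimal_counterfactuals(majority, (0, 0, 0)))
```

For the majority model at the all-zeros instance, no single flip changes the label, so the optimal counterfactuals are the three points with exactly two ones.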

**A Sublinear-Time Quantum Algorithm for Approximating Partition Functions** (by Arjan Cornelissen and Yassine Hamoudi)(arXiv) For a classical Hamiltonian \(H \colon \Omega \to \{0,1, \ldots, n\}\) at inverse temperature \(\beta\), the probability that the so-called Gibbs distribution assigns to a state \(x \in \Omega\) is proportional to \(\exp(-\beta H(x))\). The partition function is given by \(Z(\beta) = \sum_{x \in \Omega} \exp(-\beta H(x))\). At high temperatures (or low values of \(\beta\)) the partition function is typically easy to compute. However, the low-temperature regime is often challenging. A standard approach uses MCMC methods to estimate \(Z(\infty)\). In particular, you write it as the telescoping product \(Z(\infty) = Z(0) \cdot \prod_{i = 0}^{\ell - 1} \frac{Z(\beta_{i+1})}{Z(\beta_i)}\), where \(0 = \beta_0 < \beta_1 < \ldots < \beta_{\ell} = \infty\) is an increasing sequence of inverse temperatures with limited fluctuations in the Gibbs distribution between two consecutive values, and you use MCMC methods to estimate each of the \(\ell\) ratios in the product. The main result of this paper is a quantum algorithm that, given a Gibbs distribution generated by a Markov chain with a large spectral gap, performs sublinearly many steps (in the logarithm of the size of the state space) of the quantum walk operator and returns an additive \(\pm \varepsilon Z(\infty)\) estimate of \(Z(\infty)\).
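
The classical telescoping idea can be checked on a toy example. The sketch below uses the identity \(Z(\beta')/Z(\beta) = \mathbb{E}_{x \sim \pi_\beta}[\exp(-(\beta' - \beta) H(x))]\), where \(\pi_\beta\) is the Gibbs distribution at inverse temperature \(\beta\). Several things here are assumptions for illustration: the Hamiltonian (number of 1-bits in a 4-bit string), the finite temperature schedule (it stops at \(\beta = 4\) rather than \(\infty\)), and the fact that each expectation is computed exactly by enumeration, where a real algorithm would estimate it from MCMC samples. Nothing here is quantum.

```python
import itertools
import math


def hamiltonian(x):
    """Toy Hamiltonian (assumption): the number of 1-bits in x."""
    return sum(x)


def partition_function(states, beta):
    """Z(beta) computed directly by summing over all states."""
    return sum(math.exp(-beta * hamiltonian(x)) for x in states)


def ratio_as_gibbs_expectation(states, beta_cur, beta_next):
    """Z(beta_next) / Z(beta_cur), written as an expectation under the
    Gibbs distribution at beta_cur. Computed exactly here; an MCMC
    method would replace the sum by an average over samples."""
    z_cur = partition_function(states, beta_cur)
    return sum(
        (math.exp(-beta_cur * hamiltonian(x)) / z_cur)
        * math.exp(-(beta_next - beta_cur) * hamiltonian(x))
        for x in states
    )


def telescoped_estimate(states, betas):
    """Multiply Z(0) = |Omega| by the consecutive ratios; betas[0] must be 0."""
    estimate = float(len(states))
    for beta_cur, beta_next in zip(betas, betas[1:]):
        estimate *= ratio_as_gibbs_expectation(states, beta_cur, beta_next)
    return estimate


states = list(itertools.product([0, 1], repeat=4))
betas = [0.0, 0.5, 1.0, 2.0, 4.0]
telescoped = telescoped_estimate(states, betas)
direct = partition_function(states, betas[-1])
```

For this product-form toy model, \(Z(\beta) = (1 + e^{-\beta})^4\), so both `telescoped` and `direct` should agree with that closed form up to floating-point error.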

**A Near-Cubic Lower Bound for 3-Query Locally Decodable Codes from Semirandom CSP Refutation** (by Omar Alrabiah, Venkatesan Guruswami, Pravesh Kothari, and Peter Manohar)(ECCC) If you made it this far, it is time for a treat. Let us close (hopefully, I did not miss any papers this time!) with a breakthrough in Locally Decodable Codes. For 2-query LDCs, we know fairly tight bounds on the block length. For 3-query LDCs, on the other hand, we know a sub-exponential upper bound on the block length, but the best-known lower bound on the block length was merely quadratic. The featured paper improves this to a near-cubic lower bound on the block length. The main tool used to achieve this is a surprising connection between the existence of locally decodable codes and the refutation of Boolean CSP instances with limited randomness. This looks like a fantastic read to close off this month’s report!

**Beating Greedy Matching in Sublinear Time**, by Soheil Behnezhad, Mohammad Roghani, Aviad Rubinstein, and Amin Saberi (arXiv). Designing sublinear-time algorithms to estimate the size of maximum matching in a graph is a well-studied problem. This paper gives the first \(\frac{1}{2} + \Omega(1)\) approximation algorithm that runs in time sublinear in the size of the input graph. Specifically, given a graph on \(n\) vertices and maximum degree \(\Delta\) in the adjacency list model, and a parameter \(\epsilon >0\), the algorithm runs in time \(\tilde{O}(n + \Delta^{1+\epsilon})\) and produces a \(\frac{1}{2} + f(\epsilon)\) approximation to the maximum matching for some function \(f\). It must be noted that a seminal work of Yoshida, Yamamoto and Ito (STOC, 2009) also gives a better than \(\frac{1}{2}\) approximation sublinear-time algorithm for the same problem. However, the result of Yoshida et al. requires assumptions on the maximum degree of the input graph. An additional point worth mentioning is that the authors do not believe that their techniques will yield an approximation guarantee better than \(0.51\), i.e., \(f(\epsilon) < 0.01\) for all \(\epsilon\).

**Sublinear-Time Clustering Oracle for Signed Graphs**, by Stefan Neumann and Pan Peng (arXiv). Consider a large *signed graph* on \(n\) vertices where vertices represent users of a social network and signed edges (+/-) denote the type of interactions (friendly or hostile) between users. Assume that the vertices of the social network can be partitioned into \(O(\log n)\) large clusters, where each cluster has a sparse cut with the rest of the graph. Further, each cluster is a minimal set (w.r.t. inclusion) that can be partitioned into roughly equal-sized opposing sub-communities, where a sub-community opposes another sub-community if most of the edges going across are negatively signed and most of the edges within the sub-communities are positively signed. This work provides a local oracle that, given probe access to a signed graph with such a hidden cluster structure, answers queries of the form “What cluster does vertex \(v\) belong to?” in time \(\tilde{O}(\sqrt{n} \cdot \text{poly}(1/\epsilon))\) per query. This result is a generalization of the same problem studied for unsigned graphs (Peng, 2020). The authors additionally show that their method works well in practice using both synthetic and real-world datasets. They also provide the first public real-world datasets of large signed graphs with a small number of large ground-truth communities having this property.

**Sublinear Algorithms for Hierarchical Clustering**, by Arpit Agarwal, Sanjeev Khanna, Huan Li, and Prathamesh Patil (arXiv). Consider a weighted graph \(G = (V,E,w)\), where the set \(V\) of vertices denotes datapoints and the weight \(w(e) > 0\) of edge \(e \in E\) denotes the similarity between the endpoints of \(e\). A hierarchical clustering of \(V\) is a tree \(T\) whose root is the set \(V\) and whose leaves are the singleton sets corresponding to individual vertices. An internal node of the tree corresponds to a cluster containing all the leaf vertices that are descendants of that node. A hierarchical clustering tree provides us with a scheme to cluster datapoints at multiple levels of granularity. The cost of a hierarchical clustering tree is \(\sum_{(u,v) \in E} |T_{u,v}| \cdot w(u,v)\), where \(T_{u,v}\) denotes the cluster at the lowest common ancestor of the leaves \(u\) and \(v\), so that \(|T_{u,v}|\) is the number of leaves below that ancestor. In this paper, the authors present sublinear algorithms for determining a hierarchical clustering tree with the minimum cost. In the query model with degree queries and neighbor queries to the graph, they give an algorithm that outputs an \(\tilde{O}(1)\)-approximate hierarchical clustering and makes \(\tilde{O}(n^{4-2\gamma})\) queries, when the number of edges is \(m = \Theta(n^{\gamma})\) for \(4/3 < \gamma \leq 1.5\). When the input graph is sparse, i.e., \(\gamma \leq 4/3\), the algorithm makes \(\tilde{O}(\max\{n, m\})\) queries, and when the graph is dense, i.e., \(\gamma > 1.5\), the algorithm makes \(\tilde{O}(n)\) queries. They complement their upper bounds with nearly tight lower bounds. In order to obtain their upper bounds, they design a sublinear-time algorithm for the problem of obtaining a *weak* cut sparsifier that approximates cut sizes up to an additive term in addition to the usual multiplicative factor. They also design sublinear algorithms for hierarchical clustering in the MPC and streaming models of computation.
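
As a small illustration of the cost function, the Python sketch below evaluates \(\sum_{(u,v) \in E} |T_{u,v}| \cdot w(u,v)\) for a given tree, reading \(|T_{u,v}|\) as the number of leaves under the lowest common ancestor of \(u\) and \(v\). The nested-tuple tree encoding and the function names are assumptions made for the example; this only evaluates the objective and has nothing sublinear about it.

```python
def leaves(tree):
    """Set of leaf labels under a node; a tree is a nested tuple of labels."""
    if not isinstance(tree, tuple):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def lca_cluster_size(tree, u, v):
    """Number of leaves under the lowest common ancestor of leaves u and v."""
    node = tree
    while isinstance(node, tuple):
        for child in node:
            below = leaves(child)
            if u in below and v in below:
                node = child  # both endpoints are still together: descend
                break
        else:
            break  # no single child contains both: node is the LCA
    return len(leaves(node))


def clustering_cost(tree, weighted_edges):
    """Cost of the tree: sum over edges of weight times LCA cluster size."""
    return sum(
        w * lca_cluster_size(tree, u, v) for (u, v), w in weighted_edges.items()
    )


# Toy input: the unit-weight path a - b - c - d, and two candidate trees.
edges = {("a", "b"): 1, ("b", "c"): 1, ("c", "d"): 1}
good_tree = (("a", "b"), ("c", "d"))
bad_tree = (("a", "c"), ("b", "d"))
good_cost = clustering_cost(good_tree, edges)
bad_cost = clustering_cost(bad_tree, edges)
```

The first tree keeps two of the three edges inside clusters of size 2 and pays 2 + 4 + 2 = 8, while the second separates every edge at the root and pays 4 + 4 + 4 = 12, matching the intuition that similar points should be merged low in the tree.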

**Sharp Constants in Uniformity Testing via the Huber Statistic**, by Shivam Gupta and Eric Price (arXiv). This paper revisits the fundamental problem of uniformity testing, i.e., deciding whether an unknown distribution over \(n\) elements is uniform or \(\epsilon\)-far from uniform. This problem is known to be solvable optimally with probability at least \(1 - \delta\) using \(s = \Theta\left(\frac{\sqrt{n \log (1/\delta)}}{\epsilon^2} + \frac{\log (1/\delta)}{\epsilon^2}\right)\) independent samples from the unknown distribution. Multiple testers are known for the problem, and they all compute a statistic of the form \(\sum_{i \in [n]} f(s_i)\), where \(s_i\) is the number of times element \(i \in [n]\) appears in the sample and \(f\) is some function, and make their decision based on whether the value of the statistic is above or below a threshold. For instance, the earliest known uniformity tester (Batu, Fortnow, Rubinfeld, Smith and White 2000; Goldreich and Ron 2011), also called the *collisions tester*, uses \(f(k) = \frac{k(k-1)}{2}\). The current paper proposes a new tester based on the Huber loss. For \(\beta > 0\), let \(h_\beta(x) := x^2\) if \(|x| \leq \beta\) and \(h_\beta(x) := 2\beta |x| - \beta^2\) otherwise. The statistic that the authors use in their test is defined by the function \(f(k) := h_\beta(k - s/n)\), where \(s\) is the number of samples and \(n\) is the support size of the distribution. The authors show that their tester is better than all previously known testers as they achieve the best constants in the sample complexity.
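
For a feel of how such statistics are computed, here is a short Python sketch of the collisions statistic and a Huber-type statistic. It assumes the standard Huber loss (quadratic for \(|x| \leq \beta\), linear beyond) applied to the centered counts \(s_i - s/n\). The function names are made up for the example, and the choice of \(\beta\) and of the decision threshold, which is where the paper's analysis lives, is not shown.

```python
from collections import Counter


def counts(samples, n):
    """Occurrence counts s_0, ..., s_{n-1} for samples drawn from {0, ..., n-1}."""
    c = Counter(samples)
    return [c.get(i, 0) for i in range(n)]


def collisions_statistic(samples, n):
    """Sum of k(k-1)/2 over the counts: the number of colliding sample pairs."""
    return sum(k * (k - 1) // 2 for k in counts(samples, n))


def huber(x, beta):
    """Standard Huber loss: quadratic near zero, linear in the tails."""
    return x * x if abs(x) <= beta else 2 * beta * abs(x) - beta * beta


def huber_statistic(samples, n, beta):
    """Sum of the Huber loss of the centered counts s_i - s/n."""
    s = len(samples)
    return sum(huber(k - s / n, beta) for k in counts(samples, n))


# Toy sample of size s = 8 over n = 4 elements; the counts are [3, 1, 2, 2].
samples = [0, 0, 0, 1, 2, 2, 3, 3]
collisions = collisions_statistic(samples, 4)
huber_small_beta = huber_statistic(samples, 4, beta=0.5)
huber_large_beta = huber_statistic(samples, 4, beta=2.0)
```

With a large \(\beta\) every centered count falls in the quadratic regime and the statistic is just the sum of squared deviations, while a small \(\beta\) dampens the contribution of elements whose counts deviate a lot.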

Codes! Distributed computing! Probability distributions!

**Improved local testing for multiplicity codes**, by Dan Karliner and Amnon Ta-Shma (ECCC). Take the Reed–Muller code with parameters \(m, d\), whose codewords are the evaluation tables of all degree-\(m\) polynomials over \(\mathbb{F}^d\). RM codes are great, they are everywhere, and they are *locally testable*: one can test whether a given input \(x\) is a valid codeword (or far from every codeword) with only very few queries to \(x\). Now, take the *multiplicity code*: instead of just the evaluation table of the polynomial itself, a codeword includes the evaluations of all its derivatives, up to order \(s\). These beasts generalize RM codes: are they *also* locally testable? Yes they are! And this work improves our understanding of this aspect, by providing better bounds on the locality (how few queries are necessary to test), and simplifies the argument from previous work by Karliner, Salama, and Ta-Shma (2022).

**Overcoming Congestion in Distributed Coloring,** by Magnús M. Halldórsson, Alexandre Nolin, Tigran Tonoyan (arXiv). Two of the main distributed computing models, LOCAL and CONGEST, differ in how they model the bandwidth constraints. In the former, nodes can send messages of arbitrary size, and the limiting quantity is the number of rounds of communication; while in the latter, each node can only send a logarithmic number of bits at each round. This paper introduces a new technique that allows for communication-efficient distributed (coordinated) sampling, which as a direct application enables porting several LOCAL algorithms to the CONGEST model at a small cost: for instance, \((\Delta+1)\)-List Coloring. This new technique also has applications beyond these distributed models, to graph property testing – in a slightly non-standard setting where we define farness from the property in a “local” sense (detect vertices or edges which contribute to many violations, i.e., are “locally far” from the property considered).

**Robust Testing in High-Dimensional Sparse Models,** by Anand Jerry George and Clément L. Canonne (arXiv). In the Gaussian mean testing problem, you are given samples from a high-dimensional Gaussian \(N(\mu, I_d)\), where \(\mu\) is either zero or has \(\ell_2\) norm greater than \(\varepsilon\), and you want to decide which of the two holds. This “mean testing” is equivalent (due to, erm, “standard facts”) to testing in total variation distance, and captures the setting where one wants to figure out whether an underlying signal \(\mu\), subject to white noise, is null or significant. Now, what if this \(\mu\) was promised to be \(s\)-sparse? Can we test more efficiently? But what if a small fraction of the samples were arbitrarily corrupted: how much harder does the testing task become? For some related tasks, it is known that being robust against adversarial corruptions makes testing as hard as learning… This paper addresses this “robust sparse mean testing” question, providing matching upper and lower bounds; as well as the related question of (robust, sparse) linear regression.

**Sequential algorithms for testing identity and closeness of distributions,** by Omar Fawzi, Nicolas Flammarion, Aurélien Garivier, and Aadil Oufkir (arXiv). Consider the two “usual suspects” of distribution testing, *identity* and *closeness* testing, where we must test if an unknown distribution is equal to some reference one or \(\varepsilon\)-far (in total variation distance) from it; or, the same thing, but with two unknown distributions (no reference one). These are, by now, quite well understood… but the algorithms for them take a worst-case number of samples, a function of the distance parameter \(\varepsilon\). But if the two distributions are much further apart than \(\varepsilon\), fewer samples should be required! This is the focus of this paper, showing that with a sequential test one can achieve this type of guarantee: a number of samples which, in the “far” case, depends on the actual distance, not on its worst-case lower bound \(\varepsilon\). One could achieve this by combining known algorithms with a “doubling search;” however, this would still lose some constant factors in the sample complexity. The authors provide sequential tests which improve on this “doubling search technique” by constant factors, and back this up with empirical evaluations of their algorithms.

**Estimation of Entropy in Constant Space with Improved Sample Complexity,** by Maryam Aliakbarpour, Andrew McGregor, Jelani Nelson, and Erik Waingarten (arXiv). Suppose that, given samples from an unknown distribution \(p\) over \(n\) elements, your task is to estimate its (Shannon) entropy \(H(p)\) up to \(\pm\Delta\). You’re in luck! We know that \(\Theta(n/(\Delta\log n)+ (\log^2 n)/\Delta^2)\) samples are necessary and sufficient. *But what if you had to do that under strict memory constraints?* Say, using only a *constant* number of words of memory? Previous work by Acharya, Bhadane, Indyk, and Sun (2019) shows that it is still possible, but the number of samples required shoots up, with their algorithm now requiring (up to polylog factors) \(n/\Delta^3\) samples. This work improves upon the dependence on \(\Delta\), providing a constant-memory algorithm with sample complexity \(O(n/\Delta^2 \cdot \log^4(1/\Delta))\); they further conjecture this to be optimal, up to the polylog factors.

As the overview below outlines (from the book’s website), the book covers a wide range of topics, and should give anyone interested a great overview of scope, techniques, and results in testing.

This book introduces important results and techniques in property testing, where the goal is to design algorithms that decide whether their input satisfies a predetermined property in sublinear time, or even in constant time – that is, time is independent of the input size.

This book consists of three parts. The first part provides an introduction to the foundations of property testing. The second part studies the testing of specific properties on strings, graphs, functions, and constraint satisfaction problems. Vectors and matrices over real numbers are also covered. The third part is more advanced and explains general conditions, including full characterizations, under which properties are constant-query testable.

The first and second parts of the book are intended for first-year graduate students in computer science. They should also be accessible to undergraduate students with the adequate background. The third part can be used by researchers or ambitious graduate students who want to gain a deeper theoretical understanding of property testing.