# News for September 2021

A nice array of papers for this month, ranging from distribution testing, dense graph property testing, sublinear graph algorithms, and various mixtures of them. Let us now sample our spread! (Adding another semistreaming paper who achieved similar results to another posted paper. -Ed)

A Lower Bound on the Complexity of Testing Grained Distributions by Oded Goldreich and Dana Ron (ECCC). A discrete distribution is called $$m$$-grained if all probabilities are integer multiples of $$1/m$$. This paper studies the complexity of testing this property of distributions. For simplicity, consider the property of being $$n/2$$-grained, where the support size is $$n$$. The classic lower bound for testing uniformity shows that $$\Omega(\sqrt{n})$$ samples are required to distinguish the uniform distribution from a distribution uniform on $$n/2$$ elements. Thus, we get a lower bound of $$\Omega(\sqrt{n})$$ for testing $$n/2$$-grainedness (if I am permitted to use that word). This paper proves a lower bound of $$\Omega(n^c)$$, for all constant $$c < 1$$. It is conjectured that the lower bound is actually $$\Omega(n/\log n)$$, which would match the upper bound (for any label-invariant property).

Testing Distributions of Huge Objects by Oded Goldreich and Dana Ron (ECCC). This paper introduced a new model that marries distribution testing with property testing on strings. The “object” of interest is a distribution $$\mathcal{D}$$ over strings of length $$n$$. We wish to test if $$\mathcal{D}$$ possesses some property. The tester can get a random string $$x$$ from the distribution, and can query any desired index of $$x$$. The distance between distributions is defined using the earthmover distance (where we use the Hamming distance between strings). This model is called the DoHO (Distributions of Huge Objects) model. There are many questions posed and connections drawn to classical property testing and distribution testing. What I find interesting is a compelling application: the distribution $$\mathcal{D}$$ may represent noisy or perturbed versions of a single object. The DoHO model gives a natural generalization of standard property testing to noisy objects. This paper considers problems such as testing if $$\mathcal{D}$$ is: a random perturbation of a string, or a random cyclic shift, or a random isomorphism of a graph.

Sublinear Time and Space Algorithms for Correlation Clustering via Sparse-Dense Decompositions by Sepehr Assadi and Chen Wang (arXiv). Correlation clustering is a classic problem where edges in a graph are labeled ‘+’ or ‘-‘, denoting whether these edges should be uncut or cut. The aim is to cluster the graph minimizing the total “disagreements” (cut ‘+’ edges or uncut ‘-‘ edges). This paper gives an $$O(1)$$-approximation algorithm that runs in $$O(n\log^2n)$$ time; this is the first sublinear time approximation algorithm for this problem. Correlation clustering has seen results for the property testing/sublinear algorithms community, first by Bonchi, Garcia Soriano, and Kutzkov. But previous results were essentially on the dense graph model, giving $$O(\epsilon n^2)$$ error assuming adjacency matrix input. This paper considers access to the adjacency list of ‘+’ edges. Interestingly (from a technical standpoint), the key tool is a new sparse-dense decomposition. Such decompositions emerged from the seminal work of Assadi-Chen-Khanna for sublinear $$(\Delta+1)$$-colorings, and it is great to see applications beyond coloring.

Sublinear-Time Computation in the Presence of Online Erasures by Iden Kalemaj, Sofya Raskhodnikova, and Nithin Varma (arXiv). Can property testing be done when portions of the input are hidden? This question was first raised by Dixit-Raskhodnikova-Thakurta-Varma, who gave a model of erasure-resilient testing. There is an adversary who hides (erases) part of the input function; queries to those parts just yield a dummy symbol. This paper defines an online version of this model. There is an erasure parameter $$t$$. On each query by the property tester, the adversary can erase $$t$$ values of the function. Consider the property of classic linearity of functions $$f:\{0,1\}^d \rightarrow \{0,1\}$$. The BLR tester queries triples of pairs $$(x,y, x \oplus y)$$. Observe how this tester is easily defeated by our adversary, by erasing the value $$f(x\oplus y)$$. One of the main results of this paper is a $$O(1/\varepsilon)$$-query tester for linearity, that works for any constant erasure parameter $$t$$. Note that this matches the bound for the standard setting. There are a number of results for other classic properties, such as monotonicity (sortedness) and Lipschitz.

Sublinear Time Eigenvalue Approximation via Random Sampling by Rajarshi Bhattacharjee, Cameron Musco, and Archan Ray (arXiv). Consider the problem of estimating all the eigenvalues of a real, symmetric $$n \times n$$ matrix $$M$$ with bounded entries, in sublinear time. The main result shows that the eigenvalues of a uniform random $$O(\epsilon^{-4}\log n)$$ principal submatrix can be used to approximate all eigenvalues of $$M$$ up to additive error $$\epsilon n$$. One can think of this as a sort of concentration inequality for eigenvalues. This result follows (and builds upon) work of Bakshi-Chepurko-Jayaram on property testing semidefiniteness. The key idea is that eigenvectors corresponding to large eigenvalues have small infinity norm: intuitively, since all entries in $$M$$ are bounded, such an eigenvector must have its mass spread out among many coordinates. Hence, we can get information about it by randomly sampling a few coordinates. The paper also shows that approach of taking principal submatrices requires taking $$\Omega(\epsilon^{-2})$$ columns/rows.

Deterministic Graph Coloring in the Streaming Model by Sepehr Assadi, Andrew Chen, and Glenn Sun (arXiv). This is technically not a sublinear algorithms paper (well ok, streaming is sublinear, but we tend not to cover the streaming literature. Maybe we should? – Ed.) But, I think that the connections are of interest to our community of readers. The main tool of sublinear $$(\Delta+1)$$-coloring algorithm of Assadi-Chen-Khanna is the palette sparsification lemma ($$\Delta$$ is the maximum degree). This lemma shows that vertices can randomly shorten their ”palette” of colors, after which all colorings from these palettes lead to very few monochromatic edges. This is an immensely powerful tool, since one can get immediately sublinear complexity algorithms in many models: adjacency list, streaming, distributed. Is the randomness necessary? Note that these algorithms run in $$\Omega(n)$$ time/space, so it is conceivable that deterministic sublinear algorithms exists. This paper shows that randomization is necessary in the semi-streaming model (space $$O(n poly(\log n))$$). Indeed, there exist no deterministic semi-streaming algorithms that can achieve even $$\exp(\Delta^{o(1)})$$ colorings.

Adversarially Robust Coloring for Graph Stream by Amit Chakrabarti, Prantar Ghosh, and Manuel Stoeckl (arXiv). This paper studies the same problem as the above, but presents the results in a different way. In a randomized algorithm, we normally think of an adversary that fixes a (hard) input, and the algorithm then makes its random decisions. An adaptive adversary is one that changes the input (stream) based on the decisions of an algorithm. In this definition, a robust algorithm is one that can give correct answers, even for adversarially generated output. A deterministic algorithm is automatically robust. This paper show that there do not exist semi-streaming algorithms that can achieve $$(\Delta+1)$$-colorings. The quantitative lower bound is weaker ($$\Omega(\Delta^2)$$ colors), but it is against a stronger adversary.

# News for June 2021

A quieter month after last month’s bonanza. One (applied!) paper on distribution testing, a paper on tolerant distribution testing, and a compendium of open problems. (Ed: alas, I missed the paper on tolerant distribution testing, authored by one of our editors. Sorry, Clément!)

Learning-based Support Estimation in Sublinear Time by Talya Eden, Piotr Indyk, Shyam Narayanan, Ronitt Rubinfeld, Sandeep Silwal, and Tal Wagner (arXiv). A classic problem in distribution testing is that of estimating the support size $$n$$ of an unknown distribution $$\mathcal{D}$$. (Assume that all elements in the support have probability at least $$1/n$$.) A fundamental result of Valiant-Valiant (2011) proves that the sample complexity of this problem is $$\Theta(n/\log n)$$. A line of work has emerged in trying to reduce this complexity, with additional sources of information. Canonne-Rubinfeld (2014) showed that, if one can query the exact probabilities of elements, then the complexity can be made independent of $$n$$. This paper studies a robust version of this assumption: suppose, we can get constant factor approximations to the probabilities. Then, the main result is that we can get a query complexity of $$n^{1-1/\log(\varepsilon^{-1})} \ll n/\log n$$ (where the constant $$\varepsilon$$ denotes the additive approximation to the support size). This paper also does empirical experiments to show that the new algorithm is indeed better in practice. Moreover, it shows that existing methods degraded rapidly with poorer probability estimates, while the new algorithm maintains its accuracy even with such estimates.

The Price of Tolerance in Distribution Testing by Clément L. Canonne, Ayush Jain, Gautam Kamath, and Jerry Li (arXiv). While we have seen many results in distribution testing, the subject of tolerance is one that hasn’t received as much attention. Consider the problem of testing if unknown distribution $$\mathcal{p}$$ (over domain $$[n]$$) is the same as known distribution $$\mathcal{q}$$. We wish to distinguish $$\varepsilon_1$$-close from $$\varepsilon_2$$-far, under total variation distance. When $$\varepsilon_1$$ is zero, this is the standard property testing setting, and classic results yield $$\Theta(\sqrt{n})$$ sample complexity. If $$\varepsilon_1 = \varepsilon_2/2$$, then we are looking for a constant factor approximation to the distance. And the complexity is $$\Theta(n/\log n)$$. Surprisingly, nothing was known in better. Until this paper, that is. The main result gives a complete characterization of sample complexity (up to log factors), for all values of $$\varepsilon_1, \varepsilon_2$$. Remarkably, the sample complexity has an additive term $$(n/\log n) \cdot (\varepsilon_1/\varepsilon^2_2)$$. Thus, when $$\varepsilon_1 > \sqrt{\varepsilon_2}$$, the sample complexity is $$\Theta(n/\log n)$$. When $$\varepsilon_1$$ is smaller, the main result gives a smooth dependence on the sample complexity. One the main challenges is that existing results use two very different techniques for the property testing vs constant-factor approximation regimes. The former uses simpler $$\ell_2$$-statistics (e.g. collision counting), while the latter is based on polynomial approximations (estimating moments). The upper bound in this paper shows that simpler statistics based on just the first two moments suffice to getting results for all regimes of $$\varepsilon_1, \varepsilon_2$$.

Open Problems in Property Testing of Graphs by Oded Goldreich (ECCC). As the title clearly states, this is a survey covering a number of open problems in graph property testing. The broad division is based on the query model: dense graphs, bounded degree graphs, and general graphs. A reader will see statements of various classic open problems, such as the complexity of testing triangle freeness for dense graphs and characterizing properties that can be tested in $$poly(\varepsilon^{-1})$$ queries. Arguably, there are more open problems (and fewer results) for testing in bounded degree graphs, where we lack broad characterizations of testable properties. An important, though less famous (?), open problem is that of the complexity of testing isomorphism. It would appear that the setting of general graphs, where we know the least, may be the next frontier for graph property testing. A problem that really caught my eye: can we transform testers that work for bounded degree graphs into those that work for bounded arboricity graphs? The latter is a generalization of bounded degree that has appeared in a number of recent results on sublinear graph algorithms.

# News for March 2021

A somewhat quieter month by recent standards. Three Two papers: graph property testing and quantum distribution testing. (Ed: The distribution testing paper was a revision of a paper we already covered in Sept 2020.)

Robust Self-Ordering versus Local Self-Ordering by Oded Goldreich (ECCC). In Nov 2020, we covered a paper that uses a tool called self-ordered graphs, that transferred bit string function lower bounds to graph property testing. Consider a labeled graph. A graph is self-ordered if its automorphism group only contains the identity element (it has no non-trivial isomorphisms). A graph is robustly self-ordered, if every permutation of the vertices leads to a (labeled) graph that is sufficiently “far” according to edit distance. Given a self-ordered graph $$G$$, a local self-ordering procedure is the following. Given access to a copy $$G’$$ of $$G$$ and a vertex $$v \in V(G’)$$, this procedure determines the (unique) vertex in $$V(G)$$ that corresponds to $$v$$ with sublinear queries to $$G$$. In other words, it can locally “label” the graph. Intuitively, one would think that more robustly self-ordered graphs will be easier to locally label. This paper studies the relation between robust and local self-ordering. Curiously, this paper refutes the above intuition for bounded-degree graphs, and (weakly) confirms it for dense graphs. Roughly speaking, there are bounded degree graphs that are highly robustly self-ordered, for which any local self-ordering procedure requires $$\omega(\sqrt{n})$$ queries. Moreover, there are bounded degree graphs with $$O(\log n)$$-query local self-ordering procedures, yet are not robustly self-ordered even for weak parameters. For dense graphs, the existence of fast non-adaptive local self-ordering procedures implies robust self-ordering.

Testing identity of collections of quantum states: sample complexity analysis by Marco Fanizza, Raffaele Salvia, and Vittorio Giovannetti (arXiv). This paper takes identity testing to the quantum setting. One should think of a $$d$$-dimensional quantum state as a $$d \times d$$ density matrix (with some special properties). To learn the state entirely up to error $$\varepsilon$$ would require $$O(\varepsilon^{-2} d^2)$$ samples/measurements. A recent result of Badescu-O’Donnell-Wright proves that identity testing to a known state can be done significantly faster using $$O(\varepsilon^{-2} d)$$ measurements. This paper takes this result a step further by consider a set of $$N$$ quantum states. A “sample” is like a classical sample, where one gets a sample from a distribution of quantum states. The YES (“uniform”) case is when all the states are identical. The NO (“far from uniform”) case is when they are “far” from being the same state. This paper proves that $$O(\varepsilon^{-2}\sqrt{N}d)$$ samples suffices for distinguishing these cases.

# News for December 2020

Happy 2021! We now cover the last papers of 2020: a (smaller) spread with graph property testing, distribution property testing, and circuit complexity.

Almost Optimal Bounds for Sublinear-Time Sampling of k-Cliques: Sampling Cliques is Harder Than Counting by Talya Eden, Dana Ron, and Will Rosenbaum (arXiv). In many settings, sampling and counting are tightly coupled; the classic example being permanent estimation and sampling random bipartite matchings.This paper investigates the problem of sampling uniform random cliques, and unearths an interesting separation of approximately uniform sampling from approximate counting. Let’s consider the fundamental problem of edge and triangle counting, in graphs of constant arboricity. This class, which contains all minor-closed families, has deep connections to clique counting; clique counting can be done in linear time for bounded arboricity graphs.) The input connected graph $$G$$ has $$n$$ vertices, $$m$$ edges, and $$t$$ triangles. We assume the standard query model, with uniform random vertices, neighbor, and degree queries. Our aim is to get $$(1+\varepsilon)$$-approximations for $$m$$ and $$t$$ (in general, any $$k$$-clique count). It is known from previous work that edge estimation can be done in $$O(1)$$ time and triangle estimation can be done in $$O(n/t)$$ time (ignoring log and $$\varepsilon$$ dependencies). Moreover, these bounds is optimal. Can one get similar bounds for approximately sampling a uniform random edge/triangle? Previous work of Eden-Rosenbaum achieve such a bound for sampling edges. Surprisingly, this paper shows that triangle sampling is fundamentally harder, and requires $$\Omega(n^{3/4}/\sqrt{t})$$ queries. The main result gives a precise and optimal bound for uniform sampling $$k$$-cliques in graphs of arboricity $$\alpha$$.

Testing and reconstruction via decision trees by Guy Blanc, Jane Lange, and Li-Yang Tan (arXiv). A property tester is an (sublinear) algorithm that distinguishes an object with a property from those that are far. Reconstruction is a much stronger sublinear algorithm, which actually provides query access to the closest object. Suppose we have query access to an $$n$$-variate Boolean function $$f$$, and the property $$\mathcal{P}$$ of interest is size-$$s$$ decision trees. A reconstructor would give query access to a function $$g \in \mathcal{P}$$, such that $$dist(f,g) = O(OPT_f)$$, where $$OPT_f$$ is the distance of $$f$$ to $$\mathcal{P}$$. Observe that a reconstructor automatically gives a distance estimator (just estimate $$dist(f,g)$$ by random sampling). One of the main results of this paper is a (weaker, bicriteria) reconstructor, where $$g$$ is guaranteed to have decision-tree complexity $$s^{O(\log^2s)}$$. Each query to $$g$$ is computed in $$poly(\log s, 1/\varepsilon)\cdot n$$ time. This reconstructor leads to a number of testing results, and reconstructors for other properties like low Fourier degree. Another interesting result proven: if there exists an efficient reconstructor that gets $$g$$ of size $$s$$, then there exists an efficient proper learner for decision trees (a big open problem).

Testing Product Distributions: A Closer Look by Arnab Bhattacharyya, Sutanu Gayen, Saravanan Kandasamy, and N. V. Vinodchandran (arXiv). This paper is on distribution testing, where all distributions involved are promised to be product distributions. Let $$P, Q$$ be distributions over $$\Sigma^n$$, where $$\Sigma$$ is some (small) finite set. The problems studied are identity testing ($$Q$$ is known/fixed, and one gets samples from unknown $$P$$) and closeness testing (both unknown). It is known from previous work that these problems can basically be solved with $$O(n|\Sigma|\varepsilon^{-2})$$ samples. Note that the domain size is exponential in $$n$$; thus, these bounds give an exponential improvement over identity/closeness testing in the vanilla (non-product) setting. Tolerant testing is of special significance here. Because the domain is so large, it makes sense to allow for some distance even in the “yes” case. The main result of this paper is to completely map out the complexities for identity/closeness testing where we wish to distinguish $$d_1(P,Q) < \varepsilon_1$$ from $$d_2(P,Q) > \varepsilon_2$$, where $$d_1, d_2$$ are potentially different distance measures. For example, the paper considers total variation (TV), KL, Hellinger, and $$\chi^2$$ distances. An important improvement achieved is for identity testing where $$d_1$$ is Hellinger and $$d_2$$ is TV distance: for certain values of $$\varepsilon_1, \varepsilon_2$$, one can get a tester of optimal sample complexity $$O(\sqrt{n|\Sigma|})$$. Previous bounds only gave a $$O(n|\Sigma|)$$ sample complexity. There are a number of new lower bounds for different combinations of $$d_1, d_2$$ distance pairs.

# News for September 2020

Apologies dear readers for the late posting. The beginning of the school year is always frenzied, and the pandemic has only added to that frenzy. We have an exciting September, with four papers on graph property testing, one two papers on distribution testing, and one paper that connects both topics.

(Ed: we normally scan through ECCC and arXiv, but are happy to post about papers that appear elsewhere. Thanks to the reader who pointed out a relevant COLT 2020 paper.)

Estimation of Graph Isomorphism Distance in the Query World by Sourav Chakraborty, Arijit Ghosh, Gopinath Mishra, and Sayantan Sen (ECCC). Graph isomorphism is about as fundamental as it gets, and this papers studies approximating the graph isomorphism distance for dense graphs. There is a known graph $$G_k$$ (with $$n$$ vertices). The algorithm is given query access to an input graph $$G_u$$ and needs to approximate the number of edge inserts/deletes required to make the graphs isomorphic. This is the tolerant testing version; the property testing version is known to be doable in $$\widetilde{O}(\sqrt{n})$$ queries (Matsliah-Fischer). The main insight of this paper is to relate the tolerant testing complexity to a distribution testing problem. Consider distributions over the $$\{0,1\}^n$$ defined by multisets of $$n$$ hypercube points. Our aim is to estimate the earthmover distance between a known distribution and an unknown distribution. Interestingly, the query model is different: one can sample the underlying multisets without replacement. It turns out that the optimal complexity of this problem is (upto polylog factors) is the same as the optimal complexity of tolerant testing of graph isomorphism. A direct corollary is that the isomorphism distance can be approximated upto additive $$\epsilon n^2$$ using $$\widetilde{O}(n)$$ samples. This equivalence also gives an alternate proof for lower bounds for property testing graph isomorphism.

Robustly Self-Ordered Graphs: Constructions and Applications to Property Testing by Oded Goldreich and Avi Wigderson (ECCC). Let’s start from the application. The aim is to prove the following property testing lower bounds for the bounded-degree graph setting: an exponential separation between tolerant and vanilla testing, and finding an efficiently decidable property (in polynomial time) that cannot be property tested in sublinear time. For binary strings, results of this form are known. Can these be “ported” to the bounded-degree graph world? Can we construct graphs such that adjacency queries reduce to bit queries in strings? Naturally, one can simply represent the adjacency list as a string and treat graph queries as bit queries. But the problem is that of isomorphisms: different bit strings could represent the same graph and therefore, the different bit strings must have the same status with respect to the underlying property. The key insight in this paper is to introduce robustly self-ordered graphs, as a tool to port bit string property testing lower bounds to bounded-degree graphs. Such graphs essentially have a unique (identity) automorphism, even after a few edge insert/deletes. The actual definition is more technical, but that is the essence. The main result is an explicit construction of such graphs, from which the lower bound can be ported directly through a convenient lemma.

Modifying a Graph’s Degree Sequence and the Testablity of Degree Sequence Properties by Lior Gishboliner (arXiv). A sequence of numbers $$D = (d_1, d_2, \ldots, d_n)$$ is graphic if there exists an undirected graph on $$n$$ vertices whose degrees are precisely the numbers of the sequence. Graphical sequences have been characterized by classic results of Erdös-Gállai and Havel-Hakimi. This paper first proves the following theorem. Suppose a graphic sequence $$D’$$ has $$l_1$$-distance at most $$\delta$$ from the degree sequence $$D$$ of a graph $$G$$. Then, there exists a graph $$G’$$ with degree sequence $$D’$$ such that the (dense graph) distance between $$G$$ and $$G’$$ is $$O(\sqrt{\delta})$$. This theorem is used to prove an interesting property testing result. Let $$\mathcal{D}$$ be a subset of graphic sequences that are closed under permutation. Let $$\mathcal{G}$$ be the set of graphs that have a degree sequence in $$\mathcal{D}$$. Then $$\mathcal{G}$$ can be tested in $$poly(1/\epsilon)$$ queries.

Sampling an Edge Uniformly in Sublinear Time by Jakub Têtek (arXiv). In the general model for sublinear algorithms on graphs, an important choice is whether one allows uniform random edge queries. A natural question is whether such queries can simulated efficiently, using only random vertex, degree, and neighbor queries. This problem appears somewhat implicitly in previous sublinear subgraph counting algorithms, and Eden-Ron-Rosenbaum study it explicitly. They prove that one can sample from an $$\epsilon$$-approximate uniform distribution (over edges) using $$O(n/\sqrt{\epsilon m})$$ samples. The problem of sampling from exactly the uniform distribution is left open. Until this paper. The main result shows that by modifying the Eden-Ron-Rosenbaum algorithm parameters, one can generate edge samples from an $$\epsilon$$-approximate uniform distribution using $$O((n/\sqrt{m})\log \epsilon^{-1})$$ samples. The exact uniform distribution is achieved by setting $$\epsilon = 1/n$$, to get a sample complexity of $$O((n\log n)/\sqrt{m})$$.

Faster Property Testers in a Variation of the Bounded Degree Model by Isolde Adler and Polly Fahey (arXiv). The setting of bounded-degree graph property testing naturally extends to bounded-degree relational databases, which can be thought of as “directed” hypergraphs. This is an interesting new direction of research, that combines property testing with database theory (see Adler-Harwath and Chen-Yoshida). One of the main contributions of this work is to consider another notion of distance: edge and vertex inserts/deletes. This is a natural extension, and we can now compare distances between graphs/databases with different numbers of vertices. The main result is that, under this notion of distance, a large class of properties can be tested in constant running time on databases with bounded degree and treewidth. Specifically, any property expressible in Counting Monadic Second-Order Logic (CMSO) can be tested in constant time. Previous results by Alder-Harwath showed that such properties can be tested (under the standard distance notion) in constant queries, but polylogarithmic time.

Optimal Testing of Discrete Distributions with High Probability by Ilias Diakonikolas, Themis Gouleakis, Daniel M. Kane, John Peebles, and Eric Price (arXiv, ECCC). The focus of this paper is distribution testing in the “high probability” regime, where we wish the error of the tester to be $$< \delta$$. Typically, most results just get an error of at most $$1/3$$, from which standard probabilistic boosting would tack on an extra $$O(\log 1/\delta)$$ factor. In standard TCS settings, one doesn’t focus on optimizing this dependence, but in statistics, there is significant focus on the optimal sample complexity. And indeed, for practical applications, it is crucial to have sharp bounds on the right number of samples required for hypothesis testing. The paper also argues that getting the optimal sample complexity requires new algorithms, even for uniformity testing. There are optimal results given for closeness and independence testing. The optimal sample complexity only pays a multiplicative factor of $$\log^{1/3} (1/\delta)$$ or $$\log^{1/2}(1/\delta)$$ over the optimal bound for constant error (with other additive terms depending on $$\log(1/\delta)$$).

Bessel Smoothing and Multi-Distribution Property Estimation by Yi Hao and Ping Li (COLT 2020). Let us consider some standard (tolerant) distribution testing questions, phrases as approximation algorithms. Given sample access to two distributions $$p$$ and $$q$$ over $$[n]$$, we may wish to estimate the $$l_1$$-distance, $$l_2$$-distance, relative entropy, etc. between these distributions. One can phrases this problem abstractly as estimating $$\sum_{i \in [n]} f(p_i, q_i)$$, where $$f$$ is some explicit function. This papers shows that for any 1-Lipschitz function $$f$$ that satisfies some “regularity” property, the sum $$\sum_{i \in [n]} f(p_i, q_i)$$ can be $$\epsilon$$-approximated with $$O(\epsilon^{-3}n/\sqrt{\log n})$$ samples (apologies to the authors to replacing their $$k$$ with the more familiar $$n$$ for our readers). Thus, we can get sublinear sampling complexity for a very general class of estimation problems. Moreover, this was actually the simplest setting consider in the paper. One can deal with such functions of $$d$$ distributions, not just two distributions. One of the corollaries of the theorems is a sublinear tolerant tester for the property of being a mixture of distributions.

# Policy on reporting papers

While we at PTReview always look through the posted papers, we do not check for correctness. We make a serious attempt to make sure the paper is reasonable. In a few instances, we have decided not to post a (topically relevant) paper, because it looks absolutely wrong. Our position is: the benefit of doubt goes to the author, and a borderline paper should be posted. We are only curating relevant tech reports, not passing judgment on results.

In some borderline cases, readers familiar with the subject complained to us that the paper should be not be considered a scientific contribution (because of, say, unspecified algorithms, blatantly incorrect or unverifiable central claims). These are cases where we were also unsure of the paper. We have usually removed/not posted such papers.

If the paper author(s) feels that his/her paper should nonetheless be posted, then they should email us at little.oh.of.n@gmail.com. As long as the paper is not complete nonsense and appears to cite relevant history, we will defer to the authors’ wishes.

# News for June 2020

Sublinear algorithms in times of social distancing…always something exciting. This month we have a slew of results on sublinear algorithms for classic graph problems.

(Ed: We have removed a previously posted paper due to correctness concerns raised by our readers. Please look at the post on our paper policy.)

Palette Sparsification Beyond (∆ + 1) Vertex Coloring by Noga Alon and Sepehr Assadi (arXiv). A basic fact from graph theory is that any graph has a $$(\Delta+1)$$-coloring, where $$\Delta$$ is the maximum degree. Followers of property testing are likely familiar with a fantastic result of Assadi-Chen-Khanna (ACK) on sublinear algorithms, that gives a sublinear algorithm for $$(\Delta+1)$$-coloring. (The running time is $$\widetilde{O}(n^{3/2})$$, where $$n$$ is the number of vertices.) The key tool is a palette sparsification theorem: suppose each vertex is given a “palette” of $$(\Delta+1)$$ colors. Each vertex randomly sparsifies its palette by sampling $$O(\log n)$$ colors, and is constrained to only use these colors. Remarkably, whp the graph can still be properly colored. This tool is at the heart of sublinear time/space algorithms for coloring. This paper gives numerous extensions to this theorem, where one can tradeoff a larger initially palette for a smaller final sample. Another extension is for triangle-free graphs, where the initial palette is of size $$O(\Delta/\ln \Delta)$$ and the sample is of size $$O(\Delta^\gamma + \sqrt{\ln n})$$ (for parameter $$\gamma < 1$$. This leads to an $$O(n^{3/2 + \gamma})$$ time algorithm for $$O(\Delta/\ln \Delta)$$ coloring of triangle-free graphs.

When Algorithms for Maximal Independent Set and Maximal Matching Run in Sublinear-Time by Sepehr Assadi and Shay Solomon (arXiv). Taking off from sublinear coloring algorithms, one can ask if there are sublinear time algorithms for Maximal Independent Set (MIS) and Maximal Matching (MM). Alas, ACK prove that this is impossible. This paper investigates when one can get a sublinear time algorithm for these problems. For graph $$G$$, let $$\beta(G)$$ be the “neighborhood independence number”, the size of the largest independent set contained in a vertex neighborhood. This papers shows that both problems can be solved in $$\widetilde{O}(n \beta(G))$$ time. Examples of natural classes of graphs where $$\beta(G)$$ is constant: line graphs and unit-disk graphs. An interesting aspect is that MIS algorithm is actually deterministic! It’s the simple marking algorithm that rules out neighborhoods of chosen vertices; the analysis shows that not much time is wasted in remarking the same vertex.

Sublinear Algorithms and Lower Bounds for Metric TSP Cost Estimation by Yu Chen, Sampath Kannan, and Sanjeev Khanna (arXiv). This paper studies sublinear algorithms for the metric TSP problem. The input is an $$n \times n$$ distance matrix. One can 2-approximate the TSP by computing the MST, and a result of Czumaj-Sohler gives a $$(1+\varepsilon)$$-approximation algorithm for the latter, running in $$O(n\varepsilon^{-O(1)})$$ time. The main question is: can one beat the 2-factor approximation in sublinear time? This paper considers the graphic TSP setting, where the distance matrix corresponds to the shortest path metric of an unweighted graph. One result is a $$(2-\varepsilon_0)$$-approximation algorithm (for an explicit constant $$\varepsilon_0$$) that runs in $$\widetilde{O}(n)$$ time. For the important $$(1,2)$$ TSP setting (all distances are either 1 or 2), the paper gives a $$O(n^{1.5})$$ time 1.63-approximation algorithm. Interestingly, there is a lower bound showing that $$(1+\varepsilon)$$-approximations, for arbitrarily small $$\varepsilon$$, cannot be achieved in $$o(n^2)$$ time. One of the key tools is sublinear algorithms for estimating the maximum matching size, itself a well-studied problem in the community.

# Welcome Akash Kumar!

Let’s welcome our latest editor, Akash Kumar. Akash will be taking the place of Gautam Kamath, who has decided to pass the torch on. Let’s also thank Gautam for all the help with PTReview.

# News for March 2020

I hope all of you are keeping safe and healthy in these difficult times. Thank heavens we have our research and our math to keep our sanity…

This month has seen two papers, one on testing variable partitions and one on distributed isomorphism testing.

Learning and Testing Variable Partitions by Andrej Bogdanov and Baoxiang Wang (arXiv). Consider a function $$f:\Sigma^n \to G$$, where $$G$$ is Abelian group. Let $$V$$ denote the set of variables. The function $$f$$ is $$k$$-separable if there is a partition $$V_1, V_2, \ldots, V_k$$ of $$V$$ such that $$f(V)$$ can be expressed as the sum $$f_1(V_1) + f_2(V_2) + \ldots + f_k(V_k)$$. This is an obviously natural property to study, though the specific application mentioned in the paper is high-dimensional reinforcement learning control. There are a number of learning results, but we’ll focus on the main testing result. The property of $$k$$-separability can be tested with $$O(kn^3/\varepsilon)$$ queries, for $$\Sigma = \mathbb{Z}_q$$ (and distance between functions is the usual Hamming distance). There is an analogous result (with different query complexity) for $$\Sigma = \mathbb{R}$$. It is also shown that testing 2-separability requires $$\Omega(n)$$ queries, even with 2-sided error. The paper, to its credit, also has empirical studies of the learning algorithm with applications to reinforcement learning.

Distributed Testing of Graph Isomorphism in the CONGEST model by Reut Levi and Moti Medina (arXiv). This result follows a recent line of research in distributed property testing algorithms. The main aim is to minimize the number of rounds of (synchronous) communication for a property testing problem. Let $$G_U$$ denote the graph representing the distributive network. The aim is to test whether an input graph $$G_K$$ is isomorphic to $$G_U$$. The main property testing results are as follows. For the dense graph case, isomorphism can be property tested (with two-sided error) in $$O(D + (\varepsilon^{-1}\log n)^2)$$ rounds, where $$D$$ is the diameter of the graph and $$n$$ is the number of nodes. (And, as a reader of this blog, you probably know what $$\varepsilon$$ is already…). There is a standard $$\Omega(D)$$ lower bound for distributed testing problems. For various classes of sparse graphs (like bounded-degree minor-free classes), constant time isomorphism (standard) property testers are known. This paper provides a simulation argument showing that standard/centralized $$q$$-query property testers can be implemented in the distributed model, in $$O(Dq)$$ rounds (this holds for any property, not just isomorphism). Thus, these simulations imply $$O(D)$$-round property testers for isomorphism for bounded-degree minor-free classes.

# News for December 2019

Happy new year! And now for the last post of 2019 papers. We have found a diverse collection of six papers, ranging from classic property testing topics to new perspectives on sublinear computation.

Sublinear Optimal Policy Value Estimation in Contextual Bandits by Weihao Kong, Gregory Valiant, Emma Brunskill (arXiv). This isn’t our usual sublinear paper, but it is definitely of interest to us sublinear folk. Let’s start with a stripped down definition of the problem (or rather, game). There are $$K$$ “arms”, where the $$i$$th arm is represented by an unknown vector in $$\beta_i \in \mathbb{R}^d$$. We are presented with a “context”, which is a vector $$x \in \mathbb{R}^d$$. Our job is to choose an arm $$i \in [K]$$. We get the reward $$x \cdot \beta_i$$ (with some noise added). The contexts appears from a known distribution. To aid us, we observe the rewards of $$N$$ iid contexts, so we observe a total of $$T = KN$$ rewards. There has been much work on figuring out the minimum value of $$T$$ required to learn the optimal policy. One requires at least $$d$$ (the dimension) samples to estimate any of the arm vectors. This papers shows that one can actually estimate the expected reward of the optimal policy, without being able to describe it, with sublinear in $$d$$ (technically, $$\widetilde{O}(\sqrt{d})$$) samples. We see this a lot in property testing, where producing the “optimal” solution for a problem requires linear-in-dimension samples, but estimating the optimal value is much cheaper (consider, for example, the situation of linearity testing, where we wish to find the closest linear function).

Sublinear Time Numerical Linear Algebra for Structured Matrices by Xiaofei Shi and David P. Woodruff (arXiv). This follows the recent linear of advances in sublinear time linear algebra. Given a matrix $$A \in \mathbb{R}^{n \times d}$$, the aim is to get algorithms that only look at $$o(nnz(A))$$ entries (where $$nnz(A)$$ is the number of non-zeroes, or the support). Consider the classic talk of low rank approximation. Unfortunately, suppose one entry is extremely large, and the others are extremely small. One has to find this large entry for any reasonable approximation, which (in the worst-case) requires $$nnz(A)$$ queries into $$A$$. Thus, previous papers make structural assumption (such as, $$A$$ being a Vandermonde matrix) to get sublinear bounds. This paper gives a clean black box method to get a variety of such results. Basically, one can replace the usual $$nnz(A)$$ term in many algorithms, by $$T(A)$$, which is the time to compute the matrix-vector product $$Ay$$, for $$y \in \mathbb{R}^d$$. In many cases $$T(A) = \widetilde{O}(n)$$, which can be significantly smaller than $$nnz(A)$$. This paper gives such results for low-rank approximations and many regression problems.

Robust and Sample Optimal Algorithms for PSD Low-Rank Approximation by Ainesh Bakshi, Nadiia Chepurko, and David P. Woodruff. Consider the low rank problem discussed above. As mentioned in the previous paragraph, we need structural assumptions on $$A$$. Previous results gave sublinear time low-rank approximations assuming that $$A$$ is positive semidefinite (PSD). The aim is to get a rank $$k$$ matrix $$B$$ such that $$\|A-B\|^2_2$$ is at most $$(1+\epsilon)$$-times the optimal such approximation. The previous algorithm of Musco-Woodruff makes $$\widetilde{O}(nk/\epsilon^{2.5})$$ queries in to $$A$$, while there is a lower bound of $$\Omega(nk/\epsilon)$$. This gap between the complexities is resolved in this paper with an upper bound of $$\widetilde{O}(nk/\epsilon)$$ queries.

Constructive derandomization of query algorithms by Guy Blanc, Jane Lange, and Li-Yang Tan (arXiv). This paper discusses an intriguing angle to sublinear question: when can they be derandomized? Abstractly, consider a randomized algorithm $$R$$ that makes $$q$$ queries. Think of $$R$$ as a function $$R(x,r)$$, where $$x$$ is the input, and $$r$$ is the randomness. We would like to design a deterministic algorithm $$D$$ making, ideally, close to $$q$$ queries and approximates $$\mathop{E}_r[R(x,r)]$$. For starters, consider some distribution over $$x$$, and suppose we want $$\mathbb{E}_x[D(x) – \mathbb{E}_r[R(x,r)]] < \epsilon$$. By (the easy direction of) Yao’s minimax lemma, one can show the existence of such an algorithm $$D$$ that makes $$O(q/\epsilon)$$ queries. But how to explicitly construct it? Indeed, the first result of this paper gives a “meta-algorithm” that takes as input the description of $$R$$ (which is of size $$N$$), has running time $$poly(N)2^{O(q/\epsilon)}$$ and outputs a description of $$D$$. When $$R$$ satisfies the stronger property of “bounded error”, one can get a $$O(q^3)$$-query algorithm $$D$$ that approximates $$\mathop{E}_r[R(x,r)]$$ for all $$x$$ (again, the existence is proven by a classic theorem of Nisan). Overall, this paper gives a method to derandomize sublinear time algorithms, and I wonder if there could be some applications of this method for proving lower bounds. After all, Yao’s minimax theorem is the tool for property testing lower bounds, and any perspective on Yao’s theorem is likely of relevance.

Testing Membership for Timed Automata by Richard Lassaigne and Michel de Rougemont (arXiv). Property testing for regular languages is a classic result in sublinear algorithms. This paper focuses on the more complex notion of timed automata. The technical definition is quite complicated, but here’s an overview. There is a finite automaton and a collection of “clocks”. Imagine a string being processed, where each alphabet symbol appears with a new timestamp. Thus, the input word is called a “timed word”. The transitions of the automaton involve the new symbol read, as well as constraints involving the clock times and the timestamp. Thus, we can enforce conditions like “only transition if another symbol is read within a single time unit”. In general, deciding whether a timed word is accepted by a timed automaton is NP-complete. This papers studies the property testing viewpoint. The paper gives a new definition of “timed edit distance” between timed words. The main result shows that one can distinguish time words accepted by a timed automaton from words that are far (according to timed edit distance), by querying a constant number of word positions.

On the query complexity of estimating the distance to hereditary graph properties by Carlos Hoppen, Yoshiharu Kohayakawa, Richard Lang, Hanno Lefmann, and Henrique Stagni (arXiv). This paper concerns the classic setting of property testing of dense graphs. It is well-known that all hereditary graph properties are testable, and are moreover, one can estimate the distance to the property in time that only depends on $$\varepsilon$$. Unfortunately, the queries complexities have large tower dependencies on $$\varepsilon$$, arising from the use of the Szemeredi regularity lemma. The question of property testing in dense graphs can be reduced to finding “removal” lemmas (such as the classic triangle remove lemma). Such a lemma states that if at least $$\varepsilon n^2$$ edges need to be removed from $$G$$ to destroy all “forbidden subgraphs”, then there must be “many” forbidden subgraphs in $$G$$. There is much recent research on finding families of forbidden subgraphs, where the “many” (in the above statement) is at least $$poly(\varepsilon)$$ times the trivial upper bound. This paper shows that one can also estimate the distance to any hereditary property, in a query complexity that depends directly on the corresponding removal lemma parameters. As a compelling example, one can estimate the distance to being chordal in $$\exp(1/\varepsilon)$$ queries, a significant improvement over standard tower bounds.