News for September 2020

Apologies dear readers for the late posting. The beginning of the school year is always frenzied, and the pandemic has only added to that frenzy. We have an exciting September, with four papers on graph property testing, one two papers on distribution testing, and one paper that connects both topics.

(Ed: we normally scan through ECCC and arXiv, but are happy to post about papers that appear elsewhere. Thanks to the reader who pointed out a relevant COLT 2020 paper.)

Estimation of Graph Isomorphism Distance in the Query World by Sourav Chakraborty, Arijit Ghosh, Gopinath Mishra, and Sayantan Sen (ECCC). Graph isomorphism is about as fundamental as it gets, and this papers studies approximating the graph isomorphism distance for dense graphs. There is a known graph $$G_k$$ (with $$n$$ vertices). The algorithm is given query access to an input graph $$G_u$$ and needs to approximate the number of edge inserts/deletes required to make the graphs isomorphic. This is the tolerant testing version; the property testing version is known to be doable in $$\widetilde{O}(\sqrt{n})$$ queries (Matsliah-Fischer). The main insight of this paper is to relate the tolerant testing complexity to a distribution testing problem. Consider distributions over the $$\{0,1\}^n$$ defined by multisets of $$n$$ hypercube points. Our aim is to estimate the earthmover distance between a known distribution and an unknown distribution. Interestingly, the query model is different: one can sample the underlying multisets without replacement. It turns out that the optimal complexity of this problem is (upto polylog factors) is the same as the optimal complexity of tolerant testing of graph isomorphism. A direct corollary is that the isomorphism distance can be approximated upto additive $$\epsilon n^2$$ using $$\widetilde{O}(n)$$ samples. This equivalence also gives an alternate proof for lower bounds for property testing graph isomorphism.

Robustly Self-Ordered Graphs: Constructions and Applications to Property Testing by Oded Goldreich and Avi Wigderson (ECCC). Let’s start from the application. The aim is to prove the following property testing lower bounds for the bounded-degree graph setting: an exponential separation between tolerant and vanilla testing, and finding an efficiently decidable property (in polynomial time) that cannot be property tested in sublinear time. For binary strings, results of this form are known. Can these be “ported” to the bounded-degree graph world? Can we construct graphs such that adjacency queries reduce to bit queries in strings? Naturally, one can simply represent the adjacency list as a string and treat graph queries as bit queries. But the problem is that of isomorphisms: different bit strings could represent the same graph and therefore, the different bit strings must have the same status with respect to the underlying property. The key insight in this paper is to introduce robustly self-ordered graphs, as a tool to port bit string property testing lower bounds to bounded-degree graphs. Such graphs essentially have a unique (identity) automorphism, even after a few edge insert/deletes. The actual definition is more technical, but that is the essence. The main result is an explicit construction of such graphs, from which the lower bound can be ported directly through a convenient lemma.

Modifying a Graph’s Degree Sequence and the Testablity of Degree Sequence Properties by Lior Gishboliner (arXiv). A sequence of numbers $$D = (d_1, d_2, \ldots, d_n)$$ is graphic if there exists an undirected graph on $$n$$ vertices whose degrees are precisely the numbers of the sequence. Graphical sequences have been characterized by classic results of Erdös-Gállai and Havel-Hakimi. This paper first proves the following theorem. Suppose a graphic sequence $$D’$$ has $$l_1$$-distance at most $$\delta$$ from the degree sequence $$D$$ of a graph $$G$$. Then, there exists a graph $$G’$$ with degree sequence $$D’$$ such that the (dense graph) distance between $$G$$ and $$G’$$ is $$O(\sqrt{\delta})$$. This theorem is used to prove an interesting property testing result. Let $$\mathcal{D}$$ be a subset of graphic sequences that are closed under permutation. Let $$\mathcal{G}$$ be the set of graphs that have a degree sequence in $$\mathcal{D}$$. Then $$\mathcal{G}$$ can be tested in $$poly(1/\epsilon)$$ queries.

Sampling an Edge Uniformly in Sublinear Time by Jakub Têtek (arXiv). In the general model for sublinear algorithms on graphs, an important choice is whether one allows uniform random edge queries. A natural question is whether such queries can simulated efficiently, using only random vertex, degree, and neighbor queries. This problem appears somewhat implicitly in previous sublinear subgraph counting algorithms, and Eden-Ron-Rosenbaum study it explicitly. They prove that one can sample from an $$\epsilon$$-approximate uniform distribution (over edges) using $$O(n/\sqrt{\epsilon m})$$ samples. The problem of sampling from exactly the uniform distribution is left open. Until this paper. The main result shows that by modifying the Eden-Ron-Rosenbaum algorithm parameters, one can generate edge samples from an $$\epsilon$$-approximate uniform distribution using $$O((n/\sqrt{m})\log \epsilon^{-1})$$ samples. The exact uniform distribution is achieved by setting $$\epsilon = 1/n$$, to get a sample complexity of $$O((n\log n)/\sqrt{m})$$.

Faster Property Testers in a Variation of the Bounded Degree Model by Isolde Adler and Polly Fahey (arXiv). The setting of bounded-degree graph property testing naturally extends to bounded-degree relational databases, which can be thought of as “directed” hypergraphs. This is an interesting new direction of research, that combines property testing with database theory (see Adler-Harwath and Chen-Yoshida). One of the main contributions of this work is to consider another notion of distance: edge and vertex inserts/deletes. This is a natural extension, and we can now compare distances between graphs/databases with different numbers of vertices. The main result is that, under this notion of distance, a large class of properties can be tested in constant running time on databases with bounded degree and treewidth. Specifically, any property expressible in Counting Monadic Second-Order Logic (CMSO) can be tested in constant time. Previous results by Alder-Harwath showed that such properties can be tested (under the standard distance notion) in constant queries, but polylogarithmic time.

Optimal Testing of Discrete Distributions with High Probability by Ilias Diakonikolas, Themis Gouleakis, Daniel M. Kane, John Peebles, and Eric Price (arXiv, ECCC). The focus of this paper is distribution testing in the “high probability” regime, where we wish the error of the tester to be $$< \delta$$. Typically, most results just get an error of at most $$1/3$$, from which standard probabilistic boosting would tack on an extra $$O(\log 1/\delta)$$ factor. In standard TCS settings, one doesn’t focus on optimizing this dependence, but in statistics, there is significant focus on the optimal sample complexity. And indeed, for practical applications, it is crucial to have sharp bounds on the right number of samples required for hypothesis testing. The paper also argues that getting the optimal sample complexity requires new algorithms, even for uniformity testing. There are optimal results given for closeness and independence testing. The optimal sample complexity only pays a multiplicative factor of $$\log^{1/3} (1/\delta)$$ or $$\log^{1/2}(1/\delta)$$ over the optimal bound for constant error (with other additive terms depending on $$\log(1/\delta)$$).

Bessel Smoothing and Multi-Distribution Property Estimation by Yi Hao and Ping Li (COLT 2020). Let us consider some standard (tolerant) distribution testing questions, phrases as approximation algorithms. Given sample access to two distributions $$p$$ and $$q$$ over $$[n]$$, we may wish to estimate the $$l_1$$-distance, $$l_2$$-distance, relative entropy, etc. between these distributions. One can phrases this problem abstractly as estimating $$\sum_{i \in [n]} f(p_i, q_i)$$, where $$f$$ is some explicit function. This papers shows that for any 1-Lipschitz function $$f$$ that satisfies some “regularity” property, the sum $$\sum_{i \in [n]} f(p_i, q_i)$$ can be $$\epsilon$$-approximated with $$O(\epsilon^{-3}n/\sqrt{\log n})$$ samples (apologies to the authors to replacing their $$k$$ with the more familiar $$n$$ for our readers). Thus, we can get sublinear sampling complexity for a very general class of estimation problems. Moreover, this was actually the simplest setting consider in the paper. One can deal with such functions of $$d$$ distributions, not just two distributions. One of the corollaries of the theorems is a sublinear tolerant tester for the property of being a mixture of distributions.