We’re seeing lots of papers in the summer. I guess the heat and sun (and more time?) are bringing out the good stuff. Distribution testing, from testing to streaming, hypergraph testing, and a bunch of papers on graph testing. And just for the record: in what follows, \(n\) is the support size in distribution testing, it’s the number of vertices in graph testing, and it’s the length when the input is a string. And \(m\) is the number of edges when the input is a graph. Amen. (Update: added a paper on testing Dyck languages. 11 papers this month, I think that’s a PTReview record.)
Which Distribution Distances are Sublinearly Testable?, by Constantinos Daskalakis, Gautam Kamath, and John Wright (arXiv). Distribution testing is seeing much progress these days. Many questions in this area can be formulated as: given a known distribution \(q\), how many samples from an unknown distribution \(p\) are required to determine the distance between \(p\) and \(q\)? For the “identity testing” setting, we are given a discrete distribution \(q\), two distance measures \(d_1, d_2\), and proximity parameters \(\varepsilon_1, \varepsilon_2\). How many samples from the unknown distribution \(p\) are required to distinguish \(d_1(p,q) \leq \varepsilon_1\) from \(d_2(p,q) \geq \varepsilon_2\)? Observe the asymmetric nature of the question. The choices for the distances are most of the standard ones: TV, KL-divergence, Hellinger, \(\chi^2\), and \(\ell_2\). The paper basically completes the entire picture by giving matching upper and lower bounds (in terms of the support size \(n\)) and showing that certain settings are untestable with sublinear sample complexity. The results also cover the “equivalence/closeness testing” case, where \(q\) is also unknown. There are many interesting aspects of this collection of results. For example, when \(d_1\) is KL-divergence instead of \(\chi^2\), the complexity can jump from \(\Theta(\sqrt{n})\) to \(\Theta(n/\log n)\).
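To make the identity-testing setup concrete, here is a minimal sketch of the standard \(\chi^2\)-type identity statistic (a folklore tester, not taken from this paper): under Poissonized sampling its expectation is roughly the number of samples times \(\chi^2(p,q)\), so thresholding it separates \(p = q\) from \(p\) far from \(q\). The constants and threshold below are purely illustrative.

```python
import numpy as np

def chi2_identity_statistic(samples, q):
    # samples: symbols in {0, ..., n-1}; q: the known distribution,
    # assumed to have full support. Under Poissonized sampling,
    # E[Z] = (number of samples) * chi^2(p, q).
    q = np.asarray(q, dtype=float)
    n, m = len(q), len(samples)
    counts = np.bincount(samples, minlength=n)
    return np.sum(((counts - m * q) ** 2 - counts) / (m * q))

def identity_test(samples, q, eps):
    # Illustrative threshold; the paper pins down the right sample sizes
    # and thresholds for each pair of distances (d_1, d_2).
    m = len(samples)
    return chi2_identity_statistic(samples, q) <= m * eps ** 2 / 2
```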
We have two independent papers with similar results (and titles!).
Differentially Private Identity and Closeness Testing of Discrete Distributions, by Maryam Aliakbarpour, Ilias Diakonikolas, and Ronitt Rubinfeld (arXiv). Differentially Private Testing of Identity and Closeness of Discrete Distributions, by Jayadev Acharya, Ziteng Sun, and Huanyu Zhang (arXiv). Both papers study identity testing with respect to TV distance, in the differentially private setting. In essence, the output of the testing algorithm should be (nearly) insensitive to changing any single sample in its input, so that no individual sample can be inferred from the outcome. The main result, achieved in both papers, is an algorithm for identity (and closeness) testing whose sample complexity is quite close to that of the best non-private algorithms. At a high level, the approaches are also similar: both use a result of Goldreich on black-box reductions from identity to uniformity testing, and design a private version of Paninski’s uniformity tester.
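For intuition, here is a toy sketch of how one might privatize a Paninski-style uniformity statistic (the number of symbols seen exactly once): changing one sample moves this count by at most 2, so adding Laplace noise of scale \(2/\alpha\) makes the released statistic \(\alpha\)-differentially private. This only illustrates the low-sensitivity-plus-noise idea; the actual calibration in the papers is more delicate.

```python
import numpy as np

def private_uniformity_statistic(samples, n, alpha, rng=None):
    # Paninski-style statistic: the number of symbols appearing exactly once.
    # Its sensitivity is at most 2 (changing one sample affects at most two
    # symbols' counts), so Laplace(2/alpha) noise suffices for alpha-DP.
    rng = rng or np.random.default_rng()
    counts = np.bincount(samples, minlength=n)
    singletons = np.count_nonzero(counts == 1)
    return singletons + rng.laplace(scale=2.0 / alpha)
```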
Estimating parameters associated with monotone properties, by Carlos Hoppen, Yoshiharu Kohayakawa, Richard Lang, Hanno Lefmann, and Henrique Stagni (arXiv). The study of monotone graph properties has a rich history in graph property testing. This paper focuses on constant-query estimation of monotone graph parameters. A monotone graph property is one that is closed under taking subgraphs, and a monotone parameter is one that is non-increasing under taking subgraphs. Given a family of (constant-size) graphs \(\mathcal{F}\), a graph is \(\mathcal{F}\)-free if it contains no subgraph in \(\mathcal{F}\). For any graph \(G\), define the parameter \(z_\mathcal{F}(G)\) to be (essentially) the log of the number of \(\mathcal{F}\)-free spanning subgraphs of \(G\). This parameter has been studied in the context of subgraph counting problems. This paper studies the complexity of estimating \(z_\mathcal{F}\) up to additive error \(\varepsilon\). Most results of this flavor involve complexities that are towers in \(1/\varepsilon\), but this result builds connections with the breakthrough graph removal lemma bounds of Fox, to yield complexities that are towers in \(\log(1/\varepsilon)\).
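As a toy example of the parameter (under the loose definition above, ignoring any normalization): take \(\mathcal{F} = \{K_3\}\) and \(G = K_3\). Then \(7\) of the \(2^3 = 8\) spanning subgraphs (edge subsets) of \(G\) are triangle-free, so \(z_\mathcal{F}(G) = \log 7\).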
Testable Bounded Degree Graph Properties Are Random Order Streamable, by Morteza Monemizadeh, S. Muthukrishnan, Pan Peng, and Christian Sohler (arXiv). This paper proves a nice connection between sublinear and streaming algorithms. A number of properties, such as connectivity, subgraph-freeness, and minor-closed properties, are constant-time testable in the bounded-degree graph setting. The main result here shows that any such property can also be tested using constant space (in terms of words) when the graph arrives as a random-order edge stream. In contrast, existing work by Huang and Peng proves \(\Omega(n)\) space lower bounds for some of these properties when the edge order is adversarial. The main technical ingredient is an algorithm that maintains the distribution of constant-sized rooted subgraphs in the random-order streaming setting.
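To give a (much simplified) flavor of maintaining rooted-subgraph statistics in small space, here is a toy sketch that tracks only the simplest such statistic, the degree distribution of a few randomly chosen roots, from an edge stream in \(O(k)\) words. The paper’s algorithm handles constant-radius neighborhoods and crucially exploits the random ordering of the edges; none of that subtlety is captured here.

```python
import numpy as np

def root_degree_distribution(edge_stream, n, d, k=100, rng=None):
    # Toy sketch, not the paper's algorithm: fix k uniformly random roots up
    # front and count their incident edges as the stream goes by. Degrees are
    # the simplest "constant-size rooted subgraph" statistic. n = number of
    # vertices (IDs 0..n-1), d = degree bound.
    rng = rng or np.random.default_rng()
    roots = set(rng.choice(n, size=min(k, n), replace=False).tolist())
    deg = {v: 0 for v in roots}
    for u, v in edge_stream:
        if u in deg:
            deg[u] += 1
        if v in deg:
            deg[v] += 1
    hist = np.bincount(list(deg.values()), minlength=d + 1)
    return hist / len(deg)  # empirical distribution of root degrees
```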
On Testing Minor-Freeness in Bounded Degree Graphs With One-Sided Error, by Hendrik Fichtenberger, Reut Levi, Yadu Vasudev, and Maximilian Wötzel (arXiv). A major open problem in testing of bounded-degree graphs is the one-sided testability of minor-closed properties. Benjamini, Schramm, and Shapira first proved that minor-closed properties are two-sided testable, and conjectured that the one-sided complexity is \(O(\sqrt{n})\). Czumaj et al proved this for the specific property of being \(C_k\)-minor-free (where \(C_k\) is the \(k\)-cycle). This paper gives an \(O(n^{2/3})\) algorithm for testing whether a graph is \(k \times 2\)-grid minor-free. The latter class includes graphs that are cycles with non-intersecting chords and cactus graphs. A common approach in one-sided graph testing is to use the existence of a “nice” decomposition of the graph into small well-behaved pieces, and argue that the tester only has to look within these pieces. This result employs a recent partitioning scheme of Lenzen and Levi into pieces of size \(O(n^{1/3})\). Unfortunately, this partitioning may remove many edges, and the tester is then forced to find minors among these cut edges. The combinatorics of the \(k \times 2\)-grid minor ensures that any such large cut always contains such a minor, and that it can be detected efficiently.
Testing bounded arboricity, by Talya Eden, Reut Levi, and Dana Ron (arXiv). The arboricity of a graph is the minimum number of forests into which its edges can be partitioned. It is within a factor of two of the maximum average degree over all subgraphs, and is thus a measure of whether any “locally” dense piece exists. For example, the arboricity of graphs in any minor-closed family is constant. The main result in this paper is a property tester for arboricity. Formally, the paper provides an algorithm that accepts if the graph is \(\varepsilon\)-close to having arboricity \(\alpha\), and rejects if the graph is \(c\varepsilon\)-far from having arboricity \(3\alpha\) (for a large constant \(c\)). Ignoring dependencies on \(\varepsilon\), the running time is \(O(n/\sqrt{m} + n\alpha/m)\), which is proven to be optimal. The result builds on distributed graph algorithms for forest decompositions. The constant of \(3\) can be brought down arbitrarily close to \(2\), and it is an intriguing question whether it can be reduced further.
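For the record, the connection to density is via the Nash–Williams formula (a classical fact, not from the paper): the arboricity of \(G\) equals \(\max_{H \subseteq G, |V(H)| \geq 2} \lceil |E(H)|/(|V(H)|-1) \rceil\), which is within a constant factor of the maximum subgraph density \(|E(H)|/|V(H)|\), i.e., half the maximum average degree.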
On Approximating the Number of k-cliques in Sublinear Time, by Talya Eden, Dana Ron, and C. Seshadhri (arXiv). This paper studies the classic problem of approximating the number of \(k\)-cliques of a graph, in the adjacency list representation. The main result is an algorithm that \((1+\varepsilon)\)-approximates \(C_k\) (the number of \(k\)-cliques) in time \(O(n/C_k^{1/k} + m^{k/2}/C_k)\). (This expression ignores polylogarithmic factors and the dependence on \(\varepsilon\).) Somewhat surprisingly, this strange expression is the optimal complexity. It generalizes previous results of Goldreich-Ron (average degree/edge counting, \(k=2\)) and Eden et al (triangle counting, \(k=3\)). The algorithm is quite intricate, and builds on various sampling techniques from these earlier results.
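To see how the earlier results are recovered (a quick sanity check on the expression): for \(k=2\) we have \(C_2 = m\), so the bound reads \(O(n/\sqrt{m} + 1)\), the complexity of average degree estimation; for \(k=3\) it reads \(O(n/C_3^{1/3} + m^{3/2}/C_3)\), the complexity of triangle counting.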
A characterization of testable hypergraph properties, by Felix Joos, Jaehoon Kim, Daniela Kühn, and Deryk Osthus (arXiv). A fundamental result of Alon, Fischer, Newman, and Shapira characterized all constant-time testable (dense) graph properties, through a connection to the regularity lemma. This paper essentially extends that characterization to \(k\)-uniform hypergraph properties. The main result shows that a property \(\bf{P}\) is testable if and only if it is “regular-reducible”, a rather involved definition that captures properties for which hypergraph regularity can be used for testing. Furthermore, the characterization can also be used to prove that any testable property has a constant-time distance estimation algorithm.
Distributed Testing of Conductance, by Hendrik Fichtenberger and Yadu Vasudev (arXiv). There has been recent interest in property testing in the distributed graph framework. The problem studied here is graph conductance: we wish to accept a graph with conductance at least \(\Phi\) and reject a graph that is far from having conductance \(\Phi^2\). The setting is the CONGEST framework, where each vertex has a processor and can only communicate \(O(\log n)\) bits to each of its neighbors in a single round. The main result gives an algorithm with \(O(\log n)\) rounds, which is shown to be optimal. Standard testing algorithms for conductance perform a collection of random walks from randomly chosen seed vertices. An important insight in this paper is that such algorithms can be simulated in the CONGEST model without maintaining all the random walks explicitly: it suffices to pass around a small amount of information tracking overall statistics of the walks, which is enough for the testing problem.
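For reference, here is a toy sketch of the classic centralized conductance tester that the paper simulates: run short lazy random walks from random seeds and use endpoint collisions to estimate the \(\ell_2\) norm of the endpoint distribution, since walks mix quickly on high-conductance graphs. The walk length and the acceptance threshold below are purely illustrative, and this sketch ignores the actual CONGEST simulation.

```python
import numpy as np

def conductance_test_sketch(adj, d, num_seeds, walks_per_seed, walk_len,
                            rng=None):
    # adj: list of neighbor lists (bounded-degree graph, degree bound d).
    # Runs lazy random walks from random seeds and counts endpoint
    # collisions; many collisions suggest the walk is stuck in a small,
    # poorly expanding piece. Constants are illustrative only.
    rng = rng or np.random.default_rng()
    n = len(adj)
    collisions = 0
    for _ in range(num_seeds):
        seed = rng.integers(n)
        endpoints = []
        for _ in range(walks_per_seed):
            v = seed
            for _ in range(walk_len):
                # Lazy step: move to each neighbor w.p. 1/(2d), else stay put.
                if adj[v] and rng.random() < len(adj[v]) / (2 * d):
                    v = adj[v][rng.integers(len(adj[v]))]
            endpoints.append(v)
        counts = np.bincount(np.array(endpoints), minlength=n)
        collisions += int(np.sum(counts * (counts - 1)) // 2)
    # On a fast-mixing graph, each seed contributes roughly
    # (walks_per_seed choose 2) / n collisions in expectation.
    threshold = 2.0 * num_seeds * walks_per_seed ** 2 / n  # illustrative
    return collisions <= threshold
```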
Improved bounds for testing Dyck languages, by Eldar Fischer, Frédéric Magniez, and Tatiana Starikovskaya (arXiv). Testing membership in languages is a classic problem in property testing. Fix a language \(L\) over an alphabet \(\Sigma\). Given a string \(x\) of length \(n\), the tester should accept if \(x \in L\) and reject if \(x\) is far (in Hamming distance) from \(L\). This paper focuses on the simple, yet fundamental, setting where \(L\) is a Dyck language, the set of strings of perfectly balanced parentheses. Let \(D_k\) be the Dyck language with \(k\) types of parentheses. The classic paper of Alon et al on regular language testing also gives a constant-time tester for \(k=1\), and subsequent work of Parnas, Ron, and Rubinfeld proves, for \(k \geq 2\), an upper bound of \(O(n^{2/3})\) and a lower bound of \(\Omega(n^{1/11})\). This paper gives significant improvements on both fronts: an upper bound of \(O(n^{2/5+\delta})\) (for any \(\delta > 0\)) and a lower bound of \(\Omega(n^{1/5})\). The lower bound proof introduces new techniques for proving the similarity of the transcript distributions of a deterministic algorithm over two input distributions (the key step in any application of Yao’s minimax lemma). The proof constructs a third transcript distribution using a special “wild” transcript that represents the (low probability) event of the algorithm “learning” too much. A careful construction of this distribution can be used to upper bound the distance between the original transcript distributions.
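As a reminder of what the language is (this is just the exact, linear-time membership check, not the sublinear tester discussed above):

```python
def in_dyck(s):
    # Exact membership check for D_3 (three types of brackets); D_k for other
    # k is analogous. The testers above read far fewer than n characters.
    match = {')': '(', ']': '[', '}': '{'}
    stack = []
    for c in s:
        if c in match.values():            # opening bracket: push it
            stack.append(c)
        elif c in match:                   # closing bracket: must match top
            if not stack or stack.pop() != match[c]:
                return False
        else:
            return False                   # not a bracket symbol at all
    return not stack                       # balanced iff nothing left open

# Example: in_dyck("([]{})") == True, in_dyck("([)]") == False.
```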