# Open Problems From the Workshop on Data Summarization and the Workshop on Local Algorithms

A blast from the (recent) past: the open problems from the Workshop on Data Summarization held in March 2018 in the University of Warwick have been posted online, and are available here.

Fast-forward to this summer: the recently concluded 3rd Workshop on Local Algorithms, which took place last month at ETH Zurich, featured an open problems session. The list of open problems can be found in this link, or here for a PDF version.

# News for July 2019

We saw quite an activity last month, with a total of 9 papers on property testing or related areas.

Revisiting Alphabet Reduction in Dinur’s PCP, by Venkatesan Guruswami, Jakub Opršal, and Sai Sandeep (ECCC). As mentioned in the title, this work provides an alternate proof of one of the two parts of the Dinur’s PCP theorem proof. In that proof, the argument goes by alternating two main steps: (i) amplifying the soundness gap, at the price of increasing the alphabet size (and the instance size); (ii) reducing the alphabet size, at the price of increasing the instance size. From the observation that step (i) creates some structure in the instances that can be leveraged, the authors provide an alternative construction for (ii), leading to an arguably simpler analysis of the soundness of the underlying test. A nice throwback to some of the roots of property testing.

Then, four papers on distribution testing:
Testing Mixtures of Discrete Distributions, by Maryam Aliakbarpour, Ravi Kumar, Ronitt Rubinfeld (arXiv). One of the most studied problems in distribution testing is identity testing (a.k.a., goodness-of-fit): given a reference distribution $$q$$ over a domain of size $$n$$ and i.i.d. samples from an unknown $$p$$, decide between $$p=q$$ and $$d_\textrm{TV}(p,q) > \varepsilon$$. While this is now well-understood and can be done with a strongly sublinear number of samples (namely, $$\sqrt{n})$$), the case of tolerant testing is much sadder: Valiant and Valiant proved that $$\Theta(n/\log n)$$ samples were required for distinguishing $$d_\textrm{TV}(p,q) < \varepsilon/2$$ from $$d_\textrm{TV}(p,q) > \varepsilon$$. So, noise-tolerance is almost as hard as it gets… except, as this works suggests and studies, if one has some guarantees on the noise! What if the noise in the completeness case was just that $$q$$ is perturbed by some (known, or available via samples as well) noise coming from a second distribution $$\eta$$? I.e., the question is now to test whether $$p$$ if of the form $$\alpha q + (1-\alpha) \eta$$ for some $$\alpha\in [0,1]$$, or if it is $$\varepsilon$$-far from all mixtures of $$q$$ and $$\eta$$. The authors have various results on this question and some extensions, but the upshot is: “with structured noise, strongly sublinear sample complexity is once again possible.”

Can Distributed Uniformity Testing Be Local?
, by Uri Meir, Dor Mintzer, and Rotem Oshman (ACM PODC Proceedings). More on identity testing, specifically uniformity testing. No noise here, but another constraint: the samples are distributed. There are $$k$$ players, each holding $$q$$ i.i.d. samples from a distribution $$p$$ over $$[n]$$; each can send $$\ell=1$$ bit to a central referee, in a non-interactive fashion. Is $$p$$ the uniform distribution, or is it $$\varepsilon$$-far from it? The authors here consider the case where $$k,n,\varepsilon$$ are fixed, and prove lower bounds on the number $$q=q(k,n,\varepsilon)$$ of samples each player must hold in order to perform this uniformity testing task. (Note: their lower bounds hold even in the public-randomness setting.) Motivated by the LOCAL model, they focus on 3 cases of interest, for which they obtain tight or near-tight bounds on $$q$$ (in view of upper bounds from previous work of a subset of the authors): (1) when the referee’s decision has to be the $$\textsf{AND}$$ of all $$n$$ bits she receives; (2) when the referee can do something more involved, and use a threshold function; and (3) when the referee can use an arbitrary function of those $$n$$ messages. Underlying those lower bounds is a neat application of Boolean Fourier analysis, recasting the analysis of the standard “Paninski” lower bound instance and the player’s decision functions as a question on Boolean functions.

Domain Compression and its Application to Randomness-Optimal Distributed Goodness-of-Fit, by Jayadev Acharya, Clément Canonne, Yanjun Han, Ziteng Sun, and Himanshu Tyagi (arXiv,ECCC). Remember the paper just above? Now, focus on the case (3), where the referee can use any function of the message, allow each player to send $$\ell$$ bits instead of one, but give only one sample (instead of $$q$$) to each player. This becomes a setting we have talked about before on this blog (specifically, with three papers on these three months). Those three papers established the optimal number of player $$k$$, in terms of the domain size $$n$$, the distance parameter $$\varepsilon$$, and the number of bits per player $$\ell$$, in both the private-coin and public-coin settings. This new work interpolates between those two extremes, and gives the tight answer to the general question: if there are $$0\leq s \leq \log n$$ bits of public randomness available, what is the optimal number of players to perform identity testing? Behind the upper bounds is a new notion of (randomness-efficient) domain compression they introduce: how to reduce the domain size of the probability distributions while preserving the $$\ell_1$$ distances between them as much as possible?

Towards Testing Monotonicity of Distributions Over General Posets, by Maryam Aliakbarpour, Themis Gouleakis, John Peebles, Ronitt Rubinfeld, and Anak Yodpinyanee (arXiv). Away from identity testing, let’s now consider another central question, testing monotonicity of distributions. Specifically, a distribution $$p$$ over a given poset $$(\mathcal{X}, \prec)$$ is said to be monotone if $$p(x) \leq p(y)$$ whenever $$x\prec y$$. While the question is fully understood for the familiar case of the line $$\mathcal{X} = \{1,2,\dots,n\}$$, the case of general posets is much murkier, and more or less uncharted. This paper initiates a general study of the question, improving on previously known bounds for the matching poset and Boolean hypercube, and providing several new results and techniques for the general case, to establish both upper and lower bounds on the sample complexity.

And now… graphs!
Expansion Testing using Quantum Fast-Forwarding and Seed Sets, by Simon Apers (arXiv). This paper is concerned with (bicriteria) testing of vertex expansion of a given graph $$G=(V,E)$$ in the bounded-degree graph model. Loosely speaking, given a value $$\Phi$$ and query access (i.e., sampling a node u.a.r., querying the degree of a node, and querying the $$i$$-th neighbor of a node) to a graph $$G$$ with maximum degree $$d=\Theta(1)$$, the goal is to distinguish between (i) $$G$$ has vertex expansion at most $$\Phi$$ and (ii) $$G$$ is $$\varepsilon$$-far from any $$G’$$ with vertex expansion at most $$\Phi^2$$ (simplifying and dropping some technicalities). The previous state-of-the-art was a $$\tilde{O}_{\varepsilon}(n^{1/2}/\Phi^2)$$ query complexity for classical algorithms, and a $$\tilde{O}_{\varepsilon}(\min(n^{1/3}/\Phi^2),n^{1/2}/\Phi^2))$$ query complexity upper bound for quantum testers. The current work improves on this by providing a $$\tilde{O}_{\varepsilon}(n^{1/3}/\Phi)$$ quantum tester. Graphs expand—and so does the gap between classical and quantum testing.

Walking Randomly, Massively, and Efficiently, by Jakub Łącki, Slobodan Mitrović, Krzysztof Onak, and Piotr Sankowski (arXiv). One very useful primitive, in many graph property testing results and of course well beyond property testing, is the ability to perform random walks on graphs. In the case of large, distributed (or merely impractically massive) graphs, however, it is not clear how to implement this primitive efficiently. This work addresses this question in the MPC (Massive Parallel Computation) model, and shows how to perform independently, from all of the $$n$$ nodes of a graph (i.e., machine), an $$\ell$$-length random walk with only $$O(\log \ell)$$ rounds of communication and $$O(n^\alpha)$$ space per machine, for any constant $$\alpha \in(0,1)$$. This round complexity matches a known (conditional) lower bound, and thus the result is as good as it gets — further, the authors obtain such MPC algorithms for both undirected and directed graph, somehow leveraging the latter case to handle the (significantly more complicated) directed case. As an application of this efficient random walk primitive, they provide MPC property testing algorithms for both bipartiteness and vertex expansion (same definition as in the paper above: vertex expansion $$\Phi$$ vs. $$\varepsilon$$-far from any $$G’$$ with vertex expansion $$\Phi^2$$) in the general graph model. (Beyond property testing, another application—the main one in the paper— is computing PageRank in both graphs and directed graphs.)

A Lower Bound on Cycle-Finding in Sparse Digraphs
, by Xi Chen, Tim Randolph, Rocco Servedio, and Timothy Sun (arXiv). In this paper, the authors tackle the problem of one-sided testing of acyclicity of directed graphs, in the bounded-degree graph model (equivalently, on the task of finding a cycle in a digraph promised to be far from acyclic). Previously, an $$\Omega(\sqrt{n})$$ lower bound for this task had been shown by Bender and Ron, based on a birthday-paradox-type argument; this work improves on this hardness result, showing that $$\tilde{\Omega}(n^{5/9})$$ queries are necessary. Interestingly, whether achieving any $$o(n)$$ query complexity is possible remains open.

Constant-Time Dynamic $$(\Delta+1)$$-Coloring and Weight Approximation for Minimum Spanning Forest: Dynamic Algorithms Meet Property Testing, by Monika Henzinger and Pan Peng (arXiv). Finally, the last paper covered this month is concerned with dynamic graph algorithms: where updates to a graph (which can be both edge additions and deletions) come sequentially, and the objective is to maintain, e.g., a coloring, or an approximation of some function of the graph, at any time using as little time per update. This paper advocates the use of techniques from property testing (loosely speaking, as their locality and query-efficiency translates to fast update times) to design better dynamic graph algorithms. It illustrates this idea by obtaining several new results, in particular an (amortized) $$O(1)$$-update-time randomized algorithm to maintain a $$(\Delta+1)$$-coloring in a graph with degree bound $$\Delta$$. The algorithm relies on a random ranking of the nodes which, combined with a suitable randomized update rule for recoloring a vertex when needed, ensure that the recolorings will not (with high probability) “propagate” to much. This is the first time that this idea, previously used in several testing and sublinear-time papers, is used in the context of dynamic graph algorithms—adding an edge between the two areas, hopefully only the first of many.

Update: we did miss a paper! Make it ten in July…
Nearly optimal edge estimation with independent set queries, by Xi Chen, Amit Levi, Erik Waingarten (arXiv). Consider the following type of access to an unknown graph $$G=(V,E)$$ on $$n=|V|$$ nodes: on query $$S\subseteq V$$, the oracle returns 1 if $$S$$ is is an independent set, and 0 otherwise. (This sounds quite a bit like what is allowed in the group testing literature.) How many such queries are necessary to get a multiplicative estimate of the number of edges, $$m=|E|$$? In this paper, the authors improve on the previous bound of $$\min(\sqrt{m}, n^2/m)$$ (ignoring logarithmic factors), obtaining an $$\min(\sqrt{m}, n/m)$$ upper bound — complemented by a (nearly, up to those pesky log factors) matching lower bound.

If we missed a paper this month, or represented some result above, please mention it in the comments.