The crazy numbers from last month are not quite gone: we have five papers this month, not bad at all!
Codes! Distributed computing! Probability distributions!
Improved local testing for multiplicity codes, by Dan Karliner and Amnon Ta-Shma (ECCC). Take the Reed–Muller code with parameters \(m, d\), whose codewords are the evaluation tables of all degree-\(d\) polynomials over \(\mathbb{F}^m\). RM codes are great, they are everywhere, and they are locally testable: one can test whether a given input \(x\) is a valid codeword (or far from every codeword) with only very few queries to \(x\). Now, take the multiplicity code: instead of just the evaluation table of the polynomial itself, a codeword includes the evaluations of all its derivatives, up to order \(s\). These beasts generalize RM codes: are they also locally testable? Yes they are! And this work improves on our understanding of this aspect, by providing better bounds on the locality (how few queries are necessary to test) and simplifying the argument from previous work by Karliner, Salama, and Ta-Shma (2022).
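To make the "local testing" part concrete, here is a minimal sketch (in Python, with made-up parameters) of the classic line test for Reed–Muller codes: restrict the purported codeword to a random line and check that the restriction is a univariate polynomial of degree at most \(d\). This is only meant to illustrate what a local test looks like; it is the RM-code line test, not the plane test for multiplicity codes analyzed in the paper.

```python
import random

p, m, d = 13, 3, 4          # field size, number of variables, degree bound (illustrative values)

def add(u, v): return tuple((x + y) % p for x, y in zip(u, v))
def scale(c, v): return tuple((c * x) % p for x in v)

def lagrange_eval(xs, ys, t):
    """Evaluate the unique degree <= len(xs)-1 polynomial through (xs, ys) at t, over F_p."""
    total = 0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        num, den = 1, 1
        for j, xj in enumerate(xs):
            if j != i:
                num = num * (t - xj) % p
                den = den * (xi - xj) % p
        total = (total + yi * num * pow(den, p - 2, p)) % p
    return total

def line_test(f):
    """One iteration of the line test: p queries along a random line in F_p^m.

    Accepts iff the restriction of f to the line is a univariate polynomial of
    degree at most d, checked by interpolating through d+1 points and comparing
    with the remaining p - d - 1 points on the line."""
    a = tuple(random.randrange(p) for _ in range(m))
    b = tuple(random.randrange(p) for _ in range(m))
    values = [f(add(a, scale(t, b))) for t in range(p)]   # the p queries
    xs, ys = list(range(d + 1)), values[:d + 1]
    return all(lagrange_eval(xs, ys, t) == values[t] for t in range(d + 1, p))

# Example: a genuine degree-4 polynomial in 3 variables passes every iteration.
f = lambda x: (x[0]**2 * x[1]**2 + 3 * x[2] + 1) % p
print(all(line_test(f) for _ in range(20)))
```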
Overcoming Congestion in Distributed Coloring, by Magnús M. Halldórsson, Alexandre Nolin, and Tigran Tonoyan (arXiv). Two of the main distributed computing models, LOCAL and CONGEST, differ in how they model bandwidth constraints. In the former, nodes can send messages of arbitrary size, and the limiting quantity is the number of rounds of communication; while in the latter, each node can only send a logarithmic number of bits along each edge per round. This paper introduces a new technique that allows for communication-efficient distributed (coordinated) sampling, which as a direct application enables porting several LOCAL algorithms to the CONGEST model at a small cost: for instance, \((\Delta+1)\)-List Coloring. This new technique also has applications beyond these distributed models, to graph property testing – in a slightly non-standard setting where we define farness from the property in a “local” sense (detect vertices or edges which contribute to many violations, i.e., are “locally far” from the property considered).
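For a flavor of why coloring is so natural in these models, here is a toy centralized simulation of the classic random-trial step underlying many randomized \((\Delta+1)\)-coloring algorithms. It is not the paper's algorithm, just an illustration that in this basic step each message is a single color name, i.e., \(O(\log n)\) bits, so it already fits CONGEST; the hard part (and the point of the paper) is making the more sophisticated LOCAL routines fit too.

```python
import random

def greedy_trial_coloring(adj):
    """Toy synchronous simulation of the random-trial (Delta+1)-coloring step.

    adj: dict mapping each node to the set of its neighbors.
    Each round, every uncolored node proposes a uniform color from its current
    palette ({0,...,deg(v)} minus colors already fixed by neighbors), and keeps
    the proposal if no neighbor proposed the same color that round."""
    color, uncolored, rounds = {}, set(adj), 0
    while uncolored:
        rounds += 1
        proposal = {}
        for v in uncolored:
            palette = set(range(len(adj[v]) + 1)) - {color[u] for u in adj[v] if u in color}
            proposal[v] = random.choice(sorted(palette))
        for v in list(uncolored):
            if all(proposal.get(u) != proposal[v] for u in adj[v]):
                color[v] = proposal[v]
                uncolored.discard(v)
    return color, rounds

# Example: a 5-cycle gets a proper 3-coloring in a few rounds.
adj = {i: {(i - 1) % 5, (i + 1) % 5} for i in range(5)}
coloring, rounds = greedy_trial_coloring(adj)
assert all(coloring[u] != coloring[v] for u in adj for v in adj[u])
print(coloring, "in", rounds, "rounds")
```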
Robust Testing in High-Dimensional Sparse Models, by Anand Jerry George and Clément L. Canonne (arXiv). In the Gaussian mean testing problem, you are given samples from a high-dimensional Gaussian \(N(\mu, I_d)\), where \(\mu\) is either zero or has \(\ell_2\) norm greater than \(\varepsilon\), and you want to decide which of the two holds. This “mean testing” is equivalent (due to, erm, “standard facts”) to testing in total variation distance, and captures the setting where one wants to figure out whether an underlying signal \(\mu\), subject to white noise, is null or significant. Now, what if this \(\mu\) was promised to be \(s\)-sparse? Can we test more efficiently? And what if a small fraction of the samples were arbitrarily corrupted — how much harder does the testing task become? For some related tasks, it is known that being robust against adversarial corruptions makes testing as hard as learning… This paper addresses this “robust sparse mean testing” question, providing matching upper and lower bounds; as well as the related question of (robust, sparse) linear regression.
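As a reference point, here is a sketch of the vanilla (non-sparse, non-robust) Gaussian mean tester, based on thresholding the squared norm of the sum of the samples; the constants and sample sizes below are illustrative, and this is of course not the robust sparse tester of the paper.

```python
import numpy as np

def gaussian_mean_test(samples, eps):
    """Vanilla (non-sparse, non-robust) Gaussian mean tester.

    samples: (n, d) array, assumed drawn from N(mu, I_d).
    Uses the statistic T = ||sum_i X_i||^2 - n*d, whose expectation is
    n^2 * ||mu||^2; with n of order sqrt(d)/eps^2 samples, the fluctuations
    of T are small compared to n^2 eps^2, so thresholding at n^2 eps^2 / 2
    separates mu = 0 from ||mu|| >= eps with high probability."""
    n, d = samples.shape
    T = np.linalg.norm(samples.sum(axis=0)) ** 2 - n * d
    return "far: ||mu|| >= eps" if T > n * n * eps * eps / 2 else "null: mu = 0"

rng = np.random.default_rng(0)
d, eps = 1000, 0.5
n = int(10 * np.sqrt(d) / eps**2)                 # ~ sqrt(d)/eps^2 samples
print(gaussian_mean_test(rng.normal(size=(n, d)), eps))         # null case
mu = np.zeros(d); mu[0] = eps                     # a (1-)sparse mean of norm eps
print(gaussian_mean_test(rng.normal(size=(n, d)) + mu, eps))    # far case
```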
Sequential algorithms for testing identity and closeness of distributions, by Omar Fawzi, Nicolas Flammarion, Aurélien Garivier, and Aadil Oufkir (arXiv). Consider the two “usual suspects” of distribution testing, identity and closeness testing, where we must test if an unknown distribution is equal to some reference one or \(\varepsilon\)-far (in total variation distance) from it; or, the same thing, but with two unknown distributions (no reference one). These are, by now, quite well understood… but the algorithms for them take a worst-case number of samples, a function of the distance parameter \(\varepsilon\). But if the two distributions are much further apart than \(\varepsilon\), fewer samples should be required! This is the focus of this paper, showing that with a sequential test one can achieve this type of guarantee: a number of samples which, in the “far” case, depends on the actual distance, not on its worst-case lower bound \(\varepsilon\). One could achieve this by combining known algorithms with a “doubling search;” however, this would still lose some constant factors in the sample complexity. The authors provide sequential tests which improve on this “doubling search technique” by constant factors, and back this up with empirical evaluations of their algorithms.
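Here is a sketch of the “doubling search” baseline mentioned above, instantiated (for concreteness) with a naive plug-in identity tester; the interfaces and constants are made up, and a careful version would also shrink the failure probability at each level so a union bound controls the overall error. The point is only to show where the adaptivity to the actual distance comes from.

```python
import numpy as np

def plugin_identity_tester(samples, q, eps):
    """Naive identity tester: accept iff the empirical TV distance to q is < eps/2.
    Not sample-optimal (needs ~ |support|/eps^2 samples), but simple and correct."""
    counts = np.bincount(samples, minlength=len(q)) / len(samples)
    return "far" if 0.5 * np.abs(counts - q).sum() >= eps / 2 else "close"

def doubling_sequential_test(draw_samples, q, eps, c=8):
    """Baseline "doubling search": run the fixed-eps tester at eps_i = 1/2, 1/4, ...
    with ~ c*|support|/eps_i^2 fresh samples each, stopping as soon as it reports
    "far" or eps_i drops below eps. If the unknown distribution is much farther
    than eps from q, the loop exits early, so the total number of samples scales
    with the actual distance rather than with eps."""
    i = 0
    while True:
        i += 1
        eps_i = 2.0 ** (-i)
        samples = draw_samples(int(c * len(q) / eps_i ** 2))
        if plugin_identity_tester(samples, q, eps_i) == "far":
            return "far", eps_i
        if eps_i <= eps:
            return "close", eps_i

# Example: reference q is uniform over 10 elements; p puts extra mass on element 0,
# so TV(p, q) = 0.3 and the test stops long before reaching eps = 0.01.
rng = np.random.default_rng(1)
q = np.full(10, 0.1)
p = np.array([0.4] + [0.6 / 9] * 9)
print(doubling_sequential_test(lambda k: rng.choice(10, size=k, p=p), q, eps=0.01))
```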
Estimation of Entropy in Constant Space with Improved Sample Complexity, by Maryam Aliakbarpour, Andrew McGregor, Jelani Nelson, and Erik Waingarten (arXiv). Suppose that, given samples from an unknown distribution \(p\) over \(n\) elements, your task is to estimate its (Shannon) entropy \(H(p)\) up to \(\pm\Delta\). You’re in luck! We know that \(\Theta(n/(\Delta\log n)+ (\log^2 n)/\Delta^2)\) samples are necessary and sufficient. But what if you had to do that under strict memory constraints? Say, using only a constant number of words of memory? Previous work by Acharya, Bhadane, Indyk, and Sun (2019) shows that it is still possible, but the number of samples required shoots up, with their algorithm requiring (up to polylog factors) \(n/\Delta^3\) samples. This work improves the dependence on \(\Delta\), providing a constant-memory algorithm with sample complexity \(O(n/\Delta^2 \cdot \log^4(1/\Delta))\); the authors further conjecture this to be optimal, up to the polylog factors.
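As a (very) rough illustration of the constant-memory regime, here is a sketch of the “store one sample, estimate \(\log(1/p(x))\) with a single counter” idea; it ignores the bias corrections and careful parameter choices the actual algorithms need, and is only meant to show what “a constant number of words of memory” can look like.

```python
import math, random

def naive_constant_memory_entropy(sample_stream, num_trials, batch_size):
    """Crude illustration of the constant-memory regime: estimate
    H(p) = E_{x ~ p}[log(1/p(x))] by repeatedly (i) drawing one sample x,
    (ii) estimating p(x) with a single counter over a fresh batch of samples,
    and (iii) averaging log(1/p_hat(x)). Memory: one element, one counter,
    one running sum. This ignores the bias corrections real algorithms need
    (e.g. batches where x never reappears), so it only sketches the memory
    footprint, not the accuracy guarantees."""
    total = 0.0
    for _ in range(num_trials):
        x = next(sample_stream)                       # the one stored element
        count = sum(next(sample_stream) == x for _ in range(batch_size))
        p_hat = max(count, 1) / batch_size            # clamp to avoid log(0)
        total += math.log(1 / p_hat)
    return total / num_trials

# Example: uniform distribution over 64 elements, true entropy log(64) ~ 4.16 nats.
def stream(n=64):
    while True:
        yield random.randrange(n)

print(naive_constant_memory_entropy(stream(), num_trials=2000, batch_size=2000))
```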