# News for December 2020

Happy 2021! We now cover the last papers of 2020: a (smaller) spread with graph property testing, distribution property testing, and circuit complexity.

Almost Optimal Bounds for Sublinear-Time Sampling of k-Cliques: Sampling Cliques is Harder Than Counting by Talya Eden, Dana Ron, and Will Rosenbaum (arXiv). In many settings, sampling and counting are tightly coupled; the classic example being permanent estimation and sampling random bipartite matchings.This paper investigates the problem of sampling uniform random cliques, and unearths an interesting separation of approximately uniform sampling from approximate counting. Let’s consider the fundamental problem of edge and triangle counting, in graphs of constant arboricity. This class, which contains all minor-closed families, has deep connections to clique counting; clique counting can be done in linear time for bounded arboricity graphs.) The input connected graph $$G$$ has $$n$$ vertices, $$m$$ edges, and $$t$$ triangles. We assume the standard query model, with uniform random vertices, neighbor, and degree queries. Our aim is to get $$(1+\varepsilon)$$-approximations for $$m$$ and $$t$$ (in general, any $$k$$-clique count). It is known from previous work that edge estimation can be done in $$O(1)$$ time and triangle estimation can be done in $$O(n/t)$$ time (ignoring log and $$\varepsilon$$ dependencies). Moreover, these bounds is optimal. Can one get similar bounds for approximately sampling a uniform random edge/triangle? Previous work of Eden-Rosenbaum achieve such a bound for sampling edges. Surprisingly, this paper shows that triangle sampling is fundamentally harder, and requires $$\Omega(n^{3/4}/\sqrt{t})$$ queries. The main result gives a precise and optimal bound for uniform sampling $$k$$-cliques in graphs of arboricity $$\alpha$$.

Testing and reconstruction via decision trees by Guy Blanc, Jane Lange, and Li-Yang Tan (arXiv). A property tester is an (sublinear) algorithm that distinguishes an object with a property from those that are far. Reconstruction is a much stronger sublinear algorithm, which actually provides query access to the closest object. Suppose we have query access to an $$n$$-variate Boolean function $$f$$, and the property $$\mathcal{P}$$ of interest is size-$$s$$ decision trees. A reconstructor would give query access to a function $$g \in \mathcal{P}$$, such that $$dist(f,g) = O(OPT_f)$$, where $$OPT_f$$ is the distance of $$f$$ to $$\mathcal{P}$$. Observe that a reconstructor automatically gives a distance estimator (just estimate $$dist(f,g)$$ by random sampling). One of the main results of this paper is a (weaker, bicriteria) reconstructor, where $$g$$ is guaranteed to have decision-tree complexity $$s^{O(\log^2s)}$$. Each query to $$g$$ is computed in $$poly(\log s, 1/\varepsilon)\cdot n$$ time. This reconstructor leads to a number of testing results, and reconstructors for other properties like low Fourier degree. Another interesting result proven: if there exists an efficient reconstructor that gets $$g$$ of size $$s$$, then there exists an efficient proper learner for decision trees (a big open problem).

Testing Product Distributions: A Closer Look by Arnab Bhattacharyya, Sutanu Gayen, Saravanan Kandasamy, and N. V. Vinodchandran (arXiv). This paper is on distribution testing, where all distributions involved are promised to be product distributions. Let $$P, Q$$ be distributions over $$\Sigma^n$$, where $$\Sigma$$ is some (small) finite set. The problems studied are identity testing ($$Q$$ is known/fixed, and one gets samples from unknown $$P$$) and closeness testing (both unknown). It is known from previous work that these problems can basically be solved with $$O(n|\Sigma|\varepsilon^{-2})$$ samples. Note that the domain size is exponential in $$n$$; thus, these bounds give an exponential improvement over identity/closeness testing in the vanilla (non-product) setting. Tolerant testing is of special significance here. Because the domain is so large, it makes sense to allow for some distance even in the “yes” case. The main result of this paper is to completely map out the complexities for identity/closeness testing where we wish to distinguish $$d_1(P,Q) < \varepsilon_1$$ from $$d_2(P,Q) > \varepsilon_2$$, where $$d_1, d_2$$ are potentially different distance measures. For example, the paper considers total variation (TV), KL, Hellinger, and $$\chi^2$$ distances. An important improvement achieved is for identity testing where $$d_1$$ is Hellinger and $$d_2$$ is TV distance: for certain values of $$\varepsilon_1, \varepsilon_2$$, one can get a tester of optimal sample complexity $$O(\sqrt{n|\Sigma|})$$. Previous bounds only gave a $$O(n|\Sigma|)$$ sample complexity. There are a number of new lower bounds for different combinations of $$d_1, d_2$$ distance pairs.

# News for September 2020

Apologies dear readers for the late posting. The beginning of the school year is always frenzied, and the pandemic has only added to that frenzy. We have an exciting September, with four papers on graph property testing, one two papers on distribution testing, and one paper that connects both topics.

(Ed: we normally scan through ECCC and arXiv, but are happy to post about papers that appear elsewhere. Thanks to the reader who pointed out a relevant COLT 2020 paper.)

Estimation of Graph Isomorphism Distance in the Query World by Sourav Chakraborty, Arijit Ghosh, Gopinath Mishra, and Sayantan Sen (ECCC). Graph isomorphism is about as fundamental as it gets, and this papers studies approximating the graph isomorphism distance for dense graphs. There is a known graph $$G_k$$ (with $$n$$ vertices). The algorithm is given query access to an input graph $$G_u$$ and needs to approximate the number of edge inserts/deletes required to make the graphs isomorphic. This is the tolerant testing version; the property testing version is known to be doable in $$\widetilde{O}(\sqrt{n})$$ queries (Matsliah-Fischer). The main insight of this paper is to relate the tolerant testing complexity to a distribution testing problem. Consider distributions over the $$\{0,1\}^n$$ defined by multisets of $$n$$ hypercube points. Our aim is to estimate the earthmover distance between a known distribution and an unknown distribution. Interestingly, the query model is different: one can sample the underlying multisets without replacement. It turns out that the optimal complexity of this problem is (upto polylog factors) is the same as the optimal complexity of tolerant testing of graph isomorphism. A direct corollary is that the isomorphism distance can be approximated upto additive $$\epsilon n^2$$ using $$\widetilde{O}(n)$$ samples. This equivalence also gives an alternate proof for lower bounds for property testing graph isomorphism.

Robustly Self-Ordered Graphs: Constructions and Applications to Property Testing by Oded Goldreich and Avi Wigderson (ECCC). Let’s start from the application. The aim is to prove the following property testing lower bounds for the bounded-degree graph setting: an exponential separation between tolerant and vanilla testing, and finding an efficiently decidable property (in polynomial time) that cannot be property tested in sublinear time. For binary strings, results of this form are known. Can these be “ported” to the bounded-degree graph world? Can we construct graphs such that adjacency queries reduce to bit queries in strings? Naturally, one can simply represent the adjacency list as a string and treat graph queries as bit queries. But the problem is that of isomorphisms: different bit strings could represent the same graph and therefore, the different bit strings must have the same status with respect to the underlying property. The key insight in this paper is to introduce robustly self-ordered graphs, as a tool to port bit string property testing lower bounds to bounded-degree graphs. Such graphs essentially have a unique (identity) automorphism, even after a few edge insert/deletes. The actual definition is more technical, but that is the essence. The main result is an explicit construction of such graphs, from which the lower bound can be ported directly through a convenient lemma.

Modifying a Graph’s Degree Sequence and the Testablity of Degree Sequence Properties by Lior Gishboliner (arXiv). A sequence of numbers $$D = (d_1, d_2, \ldots, d_n)$$ is graphic if there exists an undirected graph on $$n$$ vertices whose degrees are precisely the numbers of the sequence. Graphical sequences have been characterized by classic results of Erdös-Gállai and Havel-Hakimi. This paper first proves the following theorem. Suppose a graphic sequence $$D’$$ has $$l_1$$-distance at most $$\delta$$ from the degree sequence $$D$$ of a graph $$G$$. Then, there exists a graph $$G’$$ with degree sequence $$D’$$ such that the (dense graph) distance between $$G$$ and $$G’$$ is $$O(\sqrt{\delta})$$. This theorem is used to prove an interesting property testing result. Let $$\mathcal{D}$$ be a subset of graphic sequences that are closed under permutation. Let $$\mathcal{G}$$ be the set of graphs that have a degree sequence in $$\mathcal{D}$$. Then $$\mathcal{G}$$ can be tested in $$poly(1/\epsilon)$$ queries.

Sampling an Edge Uniformly in Sublinear Time by Jakub Têtek (arXiv). In the general model for sublinear algorithms on graphs, an important choice is whether one allows uniform random edge queries. A natural question is whether such queries can simulated efficiently, using only random vertex, degree, and neighbor queries. This problem appears somewhat implicitly in previous sublinear subgraph counting algorithms, and Eden-Ron-Rosenbaum study it explicitly. They prove that one can sample from an $$\epsilon$$-approximate uniform distribution (over edges) using $$O(n/\sqrt{\epsilon m})$$ samples. The problem of sampling from exactly the uniform distribution is left open. Until this paper. The main result shows that by modifying the Eden-Ron-Rosenbaum algorithm parameters, one can generate edge samples from an $$\epsilon$$-approximate uniform distribution using $$O((n/\sqrt{m})\log \epsilon^{-1})$$ samples. The exact uniform distribution is achieved by setting $$\epsilon = 1/n$$, to get a sample complexity of $$O((n\log n)/\sqrt{m})$$.

Faster Property Testers in a Variation of the Bounded Degree Model by Isolde Adler and Polly Fahey (arXiv). The setting of bounded-degree graph property testing naturally extends to bounded-degree relational databases, which can be thought of as “directed” hypergraphs. This is an interesting new direction of research, that combines property testing with database theory (see Adler-Harwath and Chen-Yoshida). One of the main contributions of this work is to consider another notion of distance: edge and vertex inserts/deletes. This is a natural extension, and we can now compare distances between graphs/databases with different numbers of vertices. The main result is that, under this notion of distance, a large class of properties can be tested in constant running time on databases with bounded degree and treewidth. Specifically, any property expressible in Counting Monadic Second-Order Logic (CMSO) can be tested in constant time. Previous results by Alder-Harwath showed that such properties can be tested (under the standard distance notion) in constant queries, but polylogarithmic time.

Optimal Testing of Discrete Distributions with High Probability by Ilias Diakonikolas, Themis Gouleakis, Daniel M. Kane, John Peebles, and Eric Price (arXiv, ECCC). The focus of this paper is distribution testing in the “high probability” regime, where we wish the error of the tester to be $$< \delta$$. Typically, most results just get an error of at most $$1/3$$, from which standard probabilistic boosting would tack on an extra $$O(\log 1/\delta)$$ factor. In standard TCS settings, one doesn’t focus on optimizing this dependence, but in statistics, there is significant focus on the optimal sample complexity. And indeed, for practical applications, it is crucial to have sharp bounds on the right number of samples required for hypothesis testing. The paper also argues that getting the optimal sample complexity requires new algorithms, even for uniformity testing. There are optimal results given for closeness and independence testing. The optimal sample complexity only pays a multiplicative factor of $$\log^{1/3} (1/\delta)$$ or $$\log^{1/2}(1/\delta)$$ over the optimal bound for constant error (with other additive terms depending on $$\log(1/\delta)$$).

Bessel Smoothing and Multi-Distribution Property Estimation by Yi Hao and Ping Li (COLT 2020). Let us consider some standard (tolerant) distribution testing questions, phrases as approximation algorithms. Given sample access to two distributions $$p$$ and $$q$$ over $$[n]$$, we may wish to estimate the $$l_1$$-distance, $$l_2$$-distance, relative entropy, etc. between these distributions. One can phrases this problem abstractly as estimating $$\sum_{i \in [n]} f(p_i, q_i)$$, where $$f$$ is some explicit function. This papers shows that for any 1-Lipschitz function $$f$$ that satisfies some “regularity” property, the sum $$\sum_{i \in [n]} f(p_i, q_i)$$ can be $$\epsilon$$-approximated with $$O(\epsilon^{-3}n/\sqrt{\log n})$$ samples (apologies to the authors to replacing their $$k$$ with the more familiar $$n$$ for our readers). Thus, we can get sublinear sampling complexity for a very general class of estimation problems. Moreover, this was actually the simplest setting consider in the paper. One can deal with such functions of $$d$$ distributions, not just two distributions. One of the corollaries of the theorems is a sublinear tolerant tester for the property of being a mixture of distributions.

# Policy on reporting papers

While we at PTReview always look through the posted papers, we do not check for correctness. We make a serious attempt to make sure the paper is reasonable. In a few instances, we have decided not to post a (topically relevant) paper, because it looks absolutely wrong. Our position is: the benefit of doubt goes to the author, and a borderline paper should be posted. We are only curating relevant tech reports, not passing judgment on results.

In some borderline cases, readers familiar with the subject complained to us that the paper should be not be considered a scientific contribution (because of, say, unspecified algorithms, blatantly incorrect or unverifiable central claims). These are cases where we were also unsure of the paper. We have usually removed/not posted such papers.

If the paper author(s) feels that his/her paper should nonetheless be posted, then they should email us at little.oh.of.n@gmail.com. As long as the paper is not complete nonsense and appears to cite relevant history, we will defer to the authors’ wishes.

# News for June 2020

Sublinear algorithms in times of social distancing…always something exciting. This month we have a slew of results on sublinear algorithms for classic graph problems.

(Ed: We have removed a previously posted paper due to correctness concerns raised by our readers. Please look at the post on our paper policy.)

Palette Sparsification Beyond (∆ + 1) Vertex Coloring by Noga Alon and Sepehr Assadi (arXiv). A basic fact from graph theory is that any graph has a $$(\Delta+1)$$-coloring, where $$\Delta$$ is the maximum degree. Followers of property testing are likely familiar with a fantastic result of Assadi-Chen-Khanna (ACK) on sublinear algorithms, that gives a sublinear algorithm for $$(\Delta+1)$$-coloring. (The running time is $$\widetilde{O}(n^{3/2})$$, where $$n$$ is the number of vertices.) The key tool is a palette sparsification theorem: suppose each vertex is given a “palette” of $$(\Delta+1)$$ colors. Each vertex randomly sparsifies its palette by sampling $$O(\log n)$$ colors, and is constrained to only use these colors. Remarkably, whp the graph can still be properly colored. This tool is at the heart of sublinear time/space algorithms for coloring. This paper gives numerous extensions to this theorem, where one can tradeoff a larger initially palette for a smaller final sample. Another extension is for triangle-free graphs, where the initial palette is of size $$O(\Delta/\ln \Delta)$$ and the sample is of size $$O(\Delta^\gamma + \sqrt{\ln n})$$ (for parameter $$\gamma < 1$$. This leads to an $$O(n^{3/2 + \gamma})$$ time algorithm for $$O(\Delta/\ln \Delta)$$ coloring of triangle-free graphs.

When Algorithms for Maximal Independent Set and Maximal Matching Run in Sublinear-Time by Sepehr Assadi and Shay Solomon (arXiv). Taking off from sublinear coloring algorithms, one can ask if there are sublinear time algorithms for Maximal Independent Set (MIS) and Maximal Matching (MM). Alas, ACK prove that this is impossible. This paper investigates when one can get a sublinear time algorithm for these problems. For graph $$G$$, let $$\beta(G)$$ be the “neighborhood independence number”, the size of the largest independent set contained in a vertex neighborhood. This papers shows that both problems can be solved in $$\widetilde{O}(n \beta(G))$$ time. Examples of natural classes of graphs where $$\beta(G)$$ is constant: line graphs and unit-disk graphs. An interesting aspect is that MIS algorithm is actually deterministic! It’s the simple marking algorithm that rules out neighborhoods of chosen vertices; the analysis shows that not much time is wasted in remarking the same vertex.

Sublinear Algorithms and Lower Bounds for Metric TSP Cost Estimation by Yu Chen, Sampath Kannan, and Sanjeev Khanna (arXiv). This paper studies sublinear algorithms for the metric TSP problem. The input is an $$n \times n$$ distance matrix. One can 2-approximate the TSP by computing the MST, and a result of Czumaj-Sohler gives a $$(1+\varepsilon)$$-approximation algorithm for the latter, running in $$O(n\varepsilon^{-O(1)})$$ time. The main question is: can one beat the 2-factor approximation in sublinear time? This paper considers the graphic TSP setting, where the distance matrix corresponds to the shortest path metric of an unweighted graph. One result is a $$(2-\varepsilon_0)$$-approximation algorithm (for an explicit constant $$\varepsilon_0$$) that runs in $$\widetilde{O}(n)$$ time. For the important $$(1,2)$$ TSP setting (all distances are either 1 or 2), the paper gives a $$O(n^{1.5})$$ time 1.63-approximation algorithm. Interestingly, there is a lower bound showing that $$(1+\varepsilon)$$-approximations, for arbitrarily small $$\varepsilon$$, cannot be achieved in $$o(n^2)$$ time. One of the key tools is sublinear algorithms for estimating the maximum matching size, itself a well-studied problem in the community.

# Welcome Akash Kumar!

Let’s welcome our latest editor, Akash Kumar. Akash will be taking the place of Gautam Kamath, who has decided to pass the torch on. Let’s also thank Gautam for all the help with PTReview.

# News for March 2020

I hope all of you are keeping safe and healthy in these difficult times. Thank heavens we have our research and our math to keep our sanity…

This month has seen two papers, one on testing variable partitions and one on distributed isomorphism testing.

Learning and Testing Variable Partitions by Andrej Bogdanov and Baoxiang Wang (arXiv). Consider a function $$f:\Sigma^n \to G$$, where $$G$$ is Abelian group. Let $$V$$ denote the set of variables. The function $$f$$ is $$k$$-separable if there is a partition $$V_1, V_2, \ldots, V_k$$ of $$V$$ such that $$f(V)$$ can be expressed as the sum $$f_1(V_1) + f_2(V_2) + \ldots + f_k(V_k)$$. This is an obviously natural property to study, though the specific application mentioned in the paper is high-dimensional reinforcement learning control. There are a number of learning results, but we’ll focus on the main testing result. The property of $$k$$-separability can be tested with $$O(kn^3/\varepsilon)$$ queries, for $$\Sigma = \mathbb{Z}_q$$ (and distance between functions is the usual Hamming distance). There is an analogous result (with different query complexity) for $$\Sigma = \mathbb{R}$$. It is also shown that testing 2-separability requires $$\Omega(n)$$ queries, even with 2-sided error. The paper, to its credit, also has empirical studies of the learning algorithm with applications to reinforcement learning.

Distributed Testing of Graph Isomorphism in the CONGEST model by Reut Levi and Moti Medina (arXiv). This result follows a recent line of research in distributed property testing algorithms. The main aim is to minimize the number of rounds of (synchronous) communication for a property testing problem. Let $$G_U$$ denote the graph representing the distributive network. The aim is to test whether an input graph $$G_K$$ is isomorphic to $$G_U$$. The main property testing results are as follows. For the dense graph case, isomorphism can be property tested (with two-sided error) in $$O(D + (\varepsilon^{-1}\log n)^2)$$ rounds, where $$D$$ is the diameter of the graph and $$n$$ is the number of nodes. (And, as a reader of this blog, you probably know what $$\varepsilon$$ is already…). There is a standard $$\Omega(D)$$ lower bound for distributed testing problems. For various classes of sparse graphs (like bounded-degree minor-free classes), constant time isomorphism (standard) property testers are known. This paper provides a simulation argument showing that standard/centralized $$q$$-query property testers can be implemented in the distributed model, in $$O(Dq)$$ rounds (this holds for any property, not just isomorphism). Thus, these simulations imply $$O(D)$$-round property testers for isomorphism for bounded-degree minor-free classes.

# News for December 2019

Happy new year! And now for the last post of 2019 papers. We have found a diverse collection of six papers, ranging from classic property testing topics to new perspectives on sublinear computation.

Sublinear Optimal Policy Value Estimation in Contextual Bandits by Weihao Kong, Gregory Valiant, Emma Brunskill (arXiv). This isn’t our usual sublinear paper, but it is definitely of interest to us sublinear folk. Let’s start with a stripped down definition of the problem (or rather, game). There are $$K$$ “arms”, where the $$i$$th arm is represented by an unknown vector in $$\beta_i \in \mathbb{R}^d$$. We are presented with a “context”, which is a vector $$x \in \mathbb{R}^d$$. Our job is to choose an arm $$i \in [K]$$. We get the reward $$x \cdot \beta_i$$ (with some noise added). The contexts appears from a known distribution. To aid us, we observe the rewards of $$N$$ iid contexts, so we observe a total of $$T = KN$$ rewards. There has been much work on figuring out the minimum value of $$T$$ required to learn the optimal policy. One requires at least $$d$$ (the dimension) samples to estimate any of the arm vectors. This papers shows that one can actually estimate the expected reward of the optimal policy, without being able to describe it, with sublinear in $$d$$ (technically, $$\widetilde{O}(\sqrt{d})$$) samples. We see this a lot in property testing, where producing the “optimal” solution for a problem requires linear-in-dimension samples, but estimating the optimal value is much cheaper (consider, for example, the situation of linearity testing, where we wish to find the closest linear function).

Sublinear Time Numerical Linear Algebra for Structured Matrices by Xiaofei Shi and David P. Woodruff (arXiv). This follows the recent linear of advances in sublinear time linear algebra. Given a matrix $$A \in \mathbb{R}^{n \times d}$$, the aim is to get algorithms that only look at $$o(nnz(A))$$ entries (where $$nnz(A)$$ is the number of non-zeroes, or the support). Consider the classic talk of low rank approximation. Unfortunately, suppose one entry is extremely large, and the others are extremely small. One has to find this large entry for any reasonable approximation, which (in the worst-case) requires $$nnz(A)$$ queries into $$A$$. Thus, previous papers make structural assumption (such as, $$A$$ being a Vandermonde matrix) to get sublinear bounds. This paper gives a clean black box method to get a variety of such results. Basically, one can replace the usual $$nnz(A)$$ term in many algorithms, by $$T(A)$$, which is the time to compute the matrix-vector product $$Ay$$, for $$y \in \mathbb{R}^d$$. In many cases $$T(A) = \widetilde{O}(n)$$, which can be significantly smaller than $$nnz(A)$$. This paper gives such results for low-rank approximations and many regression problems.

Robust and Sample Optimal Algorithms for PSD Low-Rank Approximation by Ainesh Bakshi, Nadiia Chepurko, and David P. Woodruff. Consider the low rank problem discussed above. As mentioned in the previous paragraph, we need structural assumptions on $$A$$. Previous results gave sublinear time low-rank approximations assuming that $$A$$ is positive semidefinite (PSD). The aim is to get a rank $$k$$ matrix $$B$$ such that $$\|A-B\|^2_2$$ is at most $$(1+\epsilon)$$-times the optimal such approximation. The previous algorithm of Musco-Woodruff makes $$\widetilde{O}(nk/\epsilon^{2.5})$$ queries in to $$A$$, while there is a lower bound of $$\Omega(nk/\epsilon)$$. This gap between the complexities is resolved in this paper with an upper bound of $$\widetilde{O}(nk/\epsilon)$$ queries.

Constructive derandomization of query algorithms by Guy Blanc, Jane Lange, and Li-Yang Tan (arXiv). This paper discusses an intriguing angle to sublinear question: when can they be derandomized? Abstractly, consider a randomized algorithm $$R$$ that makes $$q$$ queries. Think of $$R$$ as a function $$R(x,r)$$, where $$x$$ is the input, and $$r$$ is the randomness. We would like to design a deterministic algorithm $$D$$ making, ideally, close to $$q$$ queries and approximates $$\mathop{E}_r[R(x,r)]$$. For starters, consider some distribution over $$x$$, and suppose we want $$\mathbb{E}_x[D(x) – \mathbb{E}_r[R(x,r)]] < \epsilon$$. By (the easy direction of) Yao’s minimax lemma, one can show the existence of such an algorithm $$D$$ that makes $$O(q/\epsilon)$$ queries. But how to explicitly construct it? Indeed, the first result of this paper gives a “meta-algorithm” that takes as input the description of $$R$$ (which is of size $$N$$), has running time $$poly(N)2^{O(q/\epsilon)}$$ and outputs a description of $$D$$. When $$R$$ satisfies the stronger property of “bounded error”, one can get a $$O(q^3)$$-query algorithm $$D$$ that approximates $$\mathop{E}_r[R(x,r)]$$ for all $$x$$ (again, the existence is proven by a classic theorem of Nisan). Overall, this paper gives a method to derandomize sublinear time algorithms, and I wonder if there could be some applications of this method for proving lower bounds. After all, Yao’s minimax theorem is the tool for property testing lower bounds, and any perspective on Yao’s theorem is likely of relevance.

Testing Membership for Timed Automata by Richard Lassaigne and Michel de Rougemont (arXiv). Property testing for regular languages is a classic result in sublinear algorithms. This paper focuses on the more complex notion of timed automata. The technical definition is quite complicated, but here’s an overview. There is a finite automaton and a collection of “clocks”. Imagine a string being processed, where each alphabet symbol appears with a new timestamp. Thus, the input word is called a “timed word”. The transitions of the automaton involve the new symbol read, as well as constraints involving the clock times and the timestamp. Thus, we can enforce conditions like “only transition if another symbol is read within a single time unit”. In general, deciding whether a timed word is accepted by a timed automaton is NP-complete. This papers studies the property testing viewpoint. The paper gives a new definition of “timed edit distance” between timed words. The main result shows that one can distinguish time words accepted by a timed automaton from words that are far (according to timed edit distance), by querying a constant number of word positions.

On the query complexity of estimating the distance to hereditary graph properties by Carlos Hoppen, Yoshiharu Kohayakawa, Richard Lang, Hanno Lefmann, and Henrique Stagni (arXiv). This paper concerns the classic setting of property testing of dense graphs. It is well-known that all hereditary graph properties are testable, and are moreover, one can estimate the distance to the property in time that only depends on $$\varepsilon$$. Unfortunately, the queries complexities have large tower dependencies on $$\varepsilon$$, arising from the use of the Szemeredi regularity lemma. The question of property testing in dense graphs can be reduced to finding “removal” lemmas (such as the classic triangle remove lemma). Such a lemma states that if at least $$\varepsilon n^2$$ edges need to be removed from $$G$$ to destroy all “forbidden subgraphs”, then there must be “many” forbidden subgraphs in $$G$$. There is much recent research on finding families of forbidden subgraphs, where the “many” (in the above statement) is at least $$poly(\varepsilon)$$ times the trivial upper bound. This paper shows that one can also estimate the distance to any hereditary property, in a query complexity that depends directly on the corresponding removal lemma parameters. As a compelling example, one can estimate the distance to being chordal in $$\exp(1/\varepsilon)$$ queries, a significant improvement over standard tower bounds.

# News for Sept 2019

Five Six papers this month: results on testing separations, linearity testing in $$\mathbb{R}^n$$, testing for regular languages, graph property testing, topological property testing, and Boolean rank.

Hard properties with (very) short PCPPs and their applications, by Omri Ben-Eliezer, Eldar Fischer, Amit Levi, and Ron D. Rothblum (arXiv). Probably, the most significant takeaway from this work is a (largest possible) separation between standard and tolerant property testing. PCPPs (Probabilistically Checkable Proofs of Proximity) are the “NP” variant of property testing, where the tester is aided by a proof string. Consider property $$\mathcal{P}$$. If $$x \in \mathcal{P}$$, there must be a proof string that makes the tester accept (with probability 1). If $$x$$ is far from $$\mathcal{P}$$ (in the usual property testing sense), for any proof string, the tester must reject with sufficiently high probability. PCPPs have played a role in the classical constructions of PCPs, but have also found uses in getting a better understanding of property testing itself. And this paper shows how PCPP constructions can be used to get property testing separations. The main result in this paper is a property $$\mathcal{P}$$ that (basically) requires $$\Omega(n)$$ queries to “property test”, but has a PCPP system where the proof length is $$\widetilde{O}(n)$$. ($$n$$ is the input length.) The main construction uses collections of random linear codes. Significantly, these constructions show a strong separation between standard vs tolerant property testing, and standard vs erasure-resilient property testing. (The latter is a recent variant by Dixit et al, where certain parts of the input are hidden from the tester.) There is a property that is testable in a constant number of queries, but requires $$\widetilde{\Omega}(n)$$ queries to test tolerantly (for any non-trivial choice of parameters). An analogous result holds for erasure-resilient testing.

Distribution-Free Testing of Linear Functions on R^n, by Noah Fleming and Yuichi Yoshida (arXiv). Linearity testing is arguably the canonical problem in property testing, yet there is still much to be learned about it. This paper considers functions $$f: \mathbb{R}^n \to \mathbb{R}$$, and the distribution-free setting. (In this setting, distance is measured according is an unknown distribution $$\mathcal{D}$$ over the input, and the tester can access samples from this distribution. For $$\mathbb{R}^n$$, the “standard” distribution would the $$n$$-dimensional Gaussian.) The main result is that linearity testing can be done in the distribution-free setting with $$\widetilde{O}(1/\varepsilon)$$ queries, assuming that the distribution is continuous. The primary technical tool, an interesting result in its own right, is that additivity $$(f(x+y) = f(x) + f(y))$$ can be tested in $$\widetilde{O}(1/\varepsilon)$$ queries. The significance of the testing result is cemented by an $$\Omega(n)$$ lower bound for sample-based testers.

Sliding window property testing for regular languages by Moses Ganardi, Danny Hucke, Markus Lohrey, Tatiana Starikovskaya (arXiv). Fix a regular language $$\mathcal{R}$$. Consider the streaming model, and the basic question of recognizing whether the string (being streamed) is in $$\mathcal{R}$$. Simple, you will say! Run the DFA recognizing $$\mathcal{R}$$ in constant space. Now, suppose there is a sliding window length of $$n$$. The aim is to determine if the past $$n$$ symbols (the “active window”) form a string in $$\mathcal{R}$$. Suprisingly (at least to me), there is a full characterization of the space required for randomized algorithms, and (depending on $$\mathcal{R}$$), it is either $$\Theta(1)$$, $$\Theta(\log\log n)$$, $$\Theta(\log n)$$, or $$\Theta(n)$$. In the interest of beating these lower bounds, suppose we wish to property test on the active window. It turns out the answer is quite nuanced. There are deterministic $$O(\log n)$$-space testers and randomized two-sided $$O(1/\varepsilon)$$-space testers for all regular languages. For randomized one-sided testers, there are multiple possibilities for the optimal space complexity, and there is a full characterization of these regular languages.

A characterization of graph properties testable for general planar graphs with one-sided error (It’s all about forbidden subgraphs) by Artur Czumaj and Christian Sohler (arXiv). Property testing of sparse graphs has been receiving more attention, but most results focus on the bounded degree setting. Unfortunately, many of these results break quite dramatically on sparse graphs with unbounded degrees. This paper focuses on property testing, within the class of unbounded degree planar graphs. (Meaning, the input is always assumed to be planar.) The results achieve a significant goal: as the title suggests, there is a complete characterization of properties that are constant-query testable with one-sided error. The easier part is in showing that all such properties can be reduced to testing $$H$$-freeness. The harder (remarkable) result is $$H$$-freeness can be tested in general planar queries with constant queries. (This is non-trivial even for triangle-freeness.) And, as is easy to conjecture but hard to prove, these results carry over for all minor-closed families. As a small indication of the challenge, most testers for bounded-degree graphs work by doing constant depth BFSes. When high degree vertices are present, this method fails, and we really need new ideas to deal with such graphs.

Near Coverings and Cosystolic Expansion – an example of topological property testing by Irit Dinur and Roy Meshulam (ECCC). In most algebraic settings, property testing results can be seen as local to global theorems. When do local constraints on a large object imply a global condition? This paper gives a topological instantiation of this phenomenon. We need to define the cover of a simplicial complex $$X$$. For concreteness, think of a 2D simplicial complex $$X$$, which is a hypergraph with hyperedges of size at most 3, where subsets of hyperedges are also present. A 2-cover is a simplicial complex $$X’$$ with the following property. It has two copies of each vertex of $$X$$. Each hyperedge of $$X$$ must have two “corresponding” disjoint copies in $$X’$$. Let the copies of vertex $$v$$ be $$v_0, v_1$$. Then, for every hyperedge (say) $$(u,v,w)$$ of $$X$$, there must be two disjoint hyperedges in $$X’$$ involving copies of the corresponding vertices. One can consider the property testing twist: if the neighborhoods of “most” vertices $$v$$ in $$X$$ satisfy these condition (with respect to the neighborhoods of the copies of $$v$$ in $$X’$$), then is $$X’$$ close to being a cover of $$X$$? Indeed, this paper proves that such a “property testing condition” holds iff $$X$$ is a high-dimensional expander.

Property testing of the Boolean and binary rank by Michal Parnas, Dana Ron, and Adi Shraibman (arXiv). The Boolean rank of a matrix $$M$$ is a fundamental quantity that appears in many lower bound constructions. (Recall that an $$n \times m$$ Boolean matrix $$M$$ has a rank $$r$$ if $$M$$ can be expressed as $$X \cdot Y$$, where $$X \in \mathbb{F}_2^{n \times d}$$ and $$Y \in \mathbb{F}_2^{d \times m}$$.) In the real-valued setting, results show that one can property test rank in $$poly(d/\varepsilon)$$ queries. This paper proves an analogous result for the Boolean rank. There is a surprise element here: over reals, the rank can be computed in polynomial time, and many of the geometric intuitions can be brought over to the property testing problem. On the other hand, the Boolean rank is NP-hard to compute exactly, yet we can still get a tester with $$poly(d)$$ query complexity. The paper also gives results for binary rank. For the binary rank, we require the component matrices $$X, Y$$ to be Boolean, but algebraic operations are over the reals. In the case, the tester has query complexity $$2^{2d}$$ (with varying dependencies on $$\varepsilon$$ for adaptive/non-adaptive testers). The intriguing open problem is whether $$poly(d)$$-query testers exist for binary rank.

# News for June 2019

We’ve got four papers this month. A mix of distribution testing, matrix problems, and a different paper on the power of sampling.

On the Complexity of Estimating the Effective Support Size, by Oded Goldreich (ECCC). In distribution testing, a classic problem is that of approximating the support size. By a (now) classic result of Valiant-Valiant, the complexity of this problem is $$\Theta(n/\log n)$$. This paper raises the question of approximating the “effective” support, and that too with more than just samples from the distribution. The $$\epsilon$$-support of discrete distribution $$\mathcal{D}$$ is the smallest support among any distribution $$\mathcal{D}’$$ such that $$\|\mathcal{D} – \mathcal{D}’\|_1 \leq \epsilon$$ (or TV-distance). Denote this as $$supp_\epsilon(\mathcal{D})$$. One can also consider a bicriteria version. Given approximation parameter $$f$$ and thresholds $$\epsilon_1 < \epsilon_2$$, we need to provide a number in the range $$[supp_{\epsilon_2}(\mathcal{D}), f\cdot supp_{\epsilon_1}(\mathcal{D})]$$. The primary model studied allows for random samples and evaluation queries (where one gets the probability of any known element of the domain). In this model, for arbitrary $$\epsilon_1, \epsilon_2$$, there is a continuum of algorithms, trading off query complexity with approximation. At one end, for $$f = O(\log n)$$, the query complexity is $$\widetilde{O}(1/\epsilon_1)$$. At the other end, for $$f=1$$, the query complexity is $$O(\log^* n)$$. (Here, $$n = supp_{\epsilon_1}(\mathcal{D})$$.) There are lower bounds showing the necessity of evaluation queries for subpolynomial query complexities.

Communication and Memory Efficient Testing of Discrete Distributions, by Ilias Diakonikolas, Themis Gouleakis, Daniel M. Kane, and Sankeerth Rao (arXiv). This paper adds additional computational constraints to the distribution testing problem. In the streaming setting, the algorithm is only allowed a single pass over the random samples. It has $$m$$ bits of storage, much smaller than $$n$$, the distribution size. The aim is to minimize the number of samples required for uniformity (and closeness) testing. For uniformity testing in the streaming setting, the paper gives an algorithm that uses $$\widetilde{O}(m + n/m)$$ samples. The standard collision algorithm requires $$\Theta(\sqrt{n})$$ storage (to store the samples), while this result gives a non-trivial bound for $$m \ll \sqrt{n}$$. There are lower bounds showing that these bounds are basically tight. In the distributed distribution testing problem, there are a number of processors, each holding $$\ell$$ samples. A referee asks a question to the processor, whose answers are broadcast to everyone. The aim is to minimize communication cost. For uniformity testing in this setting, there is a protocol using $$O(\sqrt{n\log n/\ell})$$ bits of communication. As a sanity check, note that for $$\ell = \sqrt{n}$$, one processor could simply run the collision algorithm locally to report the result.

Querying a Matrix through Matrix-Vector Products by Xiaoming Sun, David P. Woodruff, Guang Yang, and Jialin Zhang (arXiv). Consider $$n \times d$$ matrix $$M$$ over some field $$\mathbb{F}$$. One gets access to this matrix through matrix-vector products, and wishes to test some matrix property. This is a natural model in many settings, and generalizes the classic setting of query access. A subtle point is that one can only right multiply with “query” vectors, and there are problems where left multiplication can change the complexity. (A nice example in the paper is testing if a square matrix is symmetric. With both left and right multiplications, this is easy, since we can directly access rows and columns. By only accessing columns, this is non-trivial.) This paper studies a number of problems, broadly classified as linear algebra problems, statistics problems, and graph problems. Some highlights are: testing symmetry can be done in $$O(1)$$ queries, and the maximum eigenvalue can be approximated in $$O(\log n)$$ queries adaptively (but there is an $$\Omega(n)$$ non-adaptive lower bound). For graph problems, here’s an interesting discovery. If $$M$$ is the adjacency matrix, connectivity requires $$\Omega(n/\log n)$$ queries. But if $$M$$ is the signed edge-vertex matrix, then this can be done in $$poly(\log n)$$ queries. This paper provides a number of interesting directions and problems to study.

The Adversarial Robustness of Sampling by Omri Ben-Eliezer and Eylon Yogev (arXiv). This isn’t a property testing paper, but how can one ignore a paper on understanding the power of sampling? It’s convenient to think of the following setting. An algorithm gets a stream of $$n$$ numbers, and has to answer some questions about the stream (say, the median or other quantiles). The simplest strategy is to take a small random sample of the stream. But what if an adversary was generating the stream, depending on the current sample? Under what circumstances is the sample still “representative” of the stream? The specific results of this paper require getting into set systems and VC dimension, which I’ll leave for the sake of simplicity. Let’s go back to the median question. To get an approximate median, a constant number of samples suffice. A consequence of the main result is that if one takes $$O(\log n)$$ samples, then this is robust to adversarial streams. On the other hand, lower bounds show that constant sized samples can be fooled by an adversary. The paper is a really interesting read, and is a nice take on “standard facts” that we take for granted.

# News for February 2019

A lean month by our (now high) standards. Three papers, on local computational algorithms for spanners, sublinear set cover, and quantum distribution testing.

Local Computation Algorithms for Spanners, by Merav Parter, Ronitt Rubinfeld, Ali Vakilian, Anak Yodpinyanee (arXiv). Consider the problem of computing a $$k$$-spanner for a graph $$G$$. This is a subgraph $$H$$, where (for any pair of vertices) the shortest path distance in $$H$$ is at most $$k$$ times the corresponding distance in $$G$$. A Local Computation Algorithm (LCA) essentially gives sublinear query access to (some spanner) $$H$$, without any preprocessing on $$G$$. In other words, an LCA takes as input an edge $$(i,j)$$ in $$G$$, and in sublinear time says “yes” or “no”. Despite no preprocessing, it is guaranteed that all the “yes” answers form a spanner, and the number of “yes” answers is small. One of the main challenges in LCAs for graph problems is getting a polynomial dependence on $$\Delta$$, the maximum degree. (Many results end up with an exponential dependence on $$\Delta$$, essentially assuming bounded-degree.) This paper give an LCA outputting spanners of optimal size for $$k=3,5$$, with query complexity $$n^{1-\alpha}$$ (for constant $$\alpha$$ depending on $$k$$). The second result works for general $$k$$, but has a query complexity that is $$O(\Delta^4n^{2/3})$$.

Set Cover in Sub-linear Time, by Piotr Indyk, Sepideh Mahabadi, Ronitt Rubinfeld, Ali Vakilian, Anak Yodpinyanee (arXiv). Set cover needs no introduction. This paper considers sublinear algorithms for set cover. It is convenient to think of the instance as a binary $$m \times n$$ matrix, for $$m$$ sets over a universe of size $$n$$. The query model gives access to the $$j$$th non-zero entry in a desired column/row. The aim is this paper is to actually construct a set cover, and not just approximate its size (the latter problem with initiated in Nguyen-Onak). The first result shows a black-box conversion of any set cover algorithm into a sublinear query complexity algorithm. Any $$\alpha$$-approximation for set cover can be converted into an $$O(\alpha)$$-approximation, that makes $$O(m(n/k)^{\beta} + nk)$$ (where $$k$$ is the optimal set cover size, and $$\beta < 1$$ depends on $$\alpha$$). For large $$k$$, the paper gives an $$O(mn/k)$$ query, $$O(\log n)$$-approximation algorithm. There are various lower bounds given, that show these dependencies are necessary.

Distributional Property Testing in a Quantum World by András Gilyén and Tongyang Li (arXiv). There are two aspects to quantum distribution testing. First, we can think of faster distribution testing (for classical problems) using quantum computation. Second, we can generalize the problem of distribution testing to its quantum equivalent, called density operator testing (this papers gives the first result for this quantum generalization). A distribution is a non-negative vector in $$\mathbb{R}^n$$ with unit $$l_1$$-norm. The quantum generalization, a density operator, is a PSD matrix in $$\mathbb{C}^n$$ with unit trace. This paper has a number of results, along both aspects. For example, for the well-known problem of testing equality of distributions under total variation distance, the papers gives an $$\widetilde{O}(\sqrt{n}/\epsilon)$$ tester (an improvement over classical lower bounds). For testing equality of density operators under the same distance, the paper gives an $$O(n/\epsilon)$$ tester (note that the trivial bound is $$O(n^2)$$, since the operator is a matrix).