# News for June 2022

We have four papers this month — three on sublinear-time graph algorithms and one on distribution testing!

Beating Greedy Matching in Sublinear Time, by Soheil Behnezhad, Mohammad Roghani, Aviad Rubinstein, and Amin Saberi (arXiv). Designing sublinear-time algorithms to estimate the size of maximum matching in a graph is a well-studied problem. This paper gives the first $$\frac{1}{2} + \Omega(1)$$ approximation algorithm that runs in time sublinear in the size of the input graph. Specifically, given a graph on $$n$$ vertices and maximum degree $$\Delta$$ in the adjacency list model, and a parameter $$\epsilon >0$$, the algorithm runs in time $$\tilde{O}(n + \Delta^{1+\epsilon})$$ and produces a $$\frac{1}{2} + f(\epsilon)$$ approximation to the maximum matching for some function $$f$$. It must be noted that a seminal work of Yoshida, Yamamoto and Ito (STOC, 2009) also gives a better than $$\frac{1}{2}$$ approximation sublinear-time algorithm for the same problem. However, the result of Yoshida et al. requires assumptions on the maximum degree of the input graph. An additional point worth mentioning is that the authors do not believe that their techniques will yield an approximation guarantee better than $$0.51$$, i.e., $$f(\epsilon) < 0.01$$ for all $$\epsilon$$.

Sublinear-Time Clustering Oracle for Signed Graphs, by Stefan Neumann and Pan Peng (arXiv). Consider a large signed graph on $$n$$ vertices where vertices represent users of a social network and signed edges (+/-) denote the type of interactions (friendly or hostile) between users. Assume that the vertices of the social network can be partitioned into $$O(\log n)$$ large clusters, where each cluster has a sparse cut with the rest of the graph. Further, each cluster is a minimal set (w.r.t. inclusion) that can be partitioned into roughly equal-sized opposing sub-communities, where a sub-community opposes another sub-community if most of the edges going across are negatively signed and most of the edges within the sub-communities are positively signed. This work provides a local oracle that, given probe access to a signed graph with such a hidden cluster structure, answers queries of the form “What cluster does vertex $$v$$ belong to?” in time $$\tilde{O}(\sqrt{n} \cdot \text{poly}(1/\epsilon))$$ per query. This result is a generalization of the same problem studied for unsigned graphs (Peng, 2020). The authors additionally show that their method works well in practice using both synthetic and real-world datasets. They also provide the first public real-world datasets of large signed graphs with a small number of large ground-truth communities having this property.

Sublinear Algorithms for Hierarchical Clustering, by Arpit Agarwal, Sanjeev Khanna, Huan Li, and Prathamesh Patil (arXiv). Consider a weighted graph $$G = (V,E,w)$$, where the set $$V$$ of vertices denotes datapoints and the weight $$w(e) > 0$$ of edge $$e \in E$$ denotes the similarity between the endpoints of $$e$$. A hierarchical clustering of $$V$$ is a tree $$T$$ whose root is the set $$V$$ and leaves are the singleton sets corresponding to individual vertices. An internal node of the tree corresponds to a cluster containing all the leaf vertices that are descendants of that node. A hierarchical clustering tree provides us with a scheme to cluster datapoints at multiple levels of granularity. The cost of a hierarchical clustering tree is $$\sum_{(u,v) \in E} |T_{u,v}| \cdot w(u,v)$$, where $$T_{u,v}$$ denotes the lowest common ancestor of the leaves $$u$$ and $$v$$. In this paper, the authors present sublinear algorithms for determining a hierarchical clustering tree with the minimum cost. In the query model with degree queries and neighbor queries to the graph, they give an algorithm that outputs an $$\tilde{O}(1)$$-approximate hierarchical clustering and makes $$\tilde{O}(n^{4-2\gamma})$$ queries, when the number of edges $$m = \Theta(n^{\gamma})$$ for $$1.5 \geq \gamma > 4/3$$. When the input graph is sparse, i.e., $$\gamma \leq 4/3$$, the algorithm makes $$\tilde{O}(\max\{n, m\})$$ queries, and when the graph is dense, i.e., $$\gamma >1.5$$, the algorithm makes $$\tilde{O}(n)$$ queries. They complement their upper bounds with nearly tight lower bounds. In order to obtain their upper bounds, they design a sublinear-time algorithm for the problem of obtaining a weak cut sparsifier that approximates cuts sizes upto an additive term in addition to the usual multiplicative factor. They also design sublinear algorithms for hierarchical clustering in the MPC and streaming models of computation.

Sharp Constants in Uniformity Testing via the Huber Statistic, by Shivam Gupta and Eric Price (arXiv). This paper revisits the fundamental problem of uniformity testing — i.e., to decide whether an unknown distribution over $$n$$ elements is uniform or $$\epsilon$$-far from uniform. This problem is known to be solvable optimally with probability at least $$1 – \delta$$ using $$s = \Theta\left(\frac{\sqrt{n \log (1/\delta)}}{\epsilon^2} + \frac{\log (1/\delta)}{\epsilon^2}\right)$$ independent samples from the unknown distribution. Multiple testers are known for the problem and they all compute a statistic of the form $$\sum_{i \in [n]} f(s_i)$$, where $$s_i$$ for $$i \in [n]$$ and $$f$$ is some function and make their decision based on whether or not the value of the statistic is above or below a threshold. For instance, the earliest known uniformity tester (Batu, Fortnow, Rubinfeld, Smith and White 2000; Goldreich and Ron 2011), also called the collisions tester, uses $$f(k) = \frac{k(k-1)}{2}$$. The current paper proposes a new tester based on the Huber loss. For $$\beta > 0$$, let $$h_\beta(x) := \min\{x^2, 2\beta x – \beta^2\}$$. The statistic that the authors use in their test is defined by the function $$f(k) := k – s/n$$, where $$s$$ is the number of samples and $$n$$ is the support size of the distribution. The authors show that their tester is better than all previously known testers as they achieve the best constants in the sample complexity.

# News for February 2022

This month has seen a flurry of activity in sublinear algorithms and a diverse collection of papers have come up, with topics ranging from differentially private sublinear algorithms to local testers for multiplicity codes. Apologies to the readers for the delay in putting this post together!

Almost-Optimal Sublinear-Time Edit Distance in the Low Distance Regime by Karl Bringmann, Alejandro Cassis, Nick Fischer and Vasileios Nakos (arXiv)

This paper considers the problem of gap edit distance, i.e., of determining if the edit distance between two strings $$x$$ and $$y$$ is at most $$k$$ or at least $$K$$. Their main result is an algorithm that runs in time $$O(n/k + \text{poly}(k))$$ and solves the problem for $$K = k \cdot 2^{\tilde{O}(\sqrt{\log k})}$$. The paper improves upon earlier results of Goldenberg, Krauthgamer and Saha (2019) and Kociumaka and Saha (2020) who solved the problem for $$K = k^2$$ with the same asymptotic guarantee on the query complexity.

One of the interesting takeaways from the paper is that the complexity of solving the gap Hamming distance and gap edit distance are similar in the low distance regime. For both, the complexity of solving the $$(k, k^{1+o(1)})$$-gap problem is $$n/k^{1 \pm o(1)}$$. This needs to be contrasted with the fact that solving $$(k, \Omega(n))$$-gap edit distance requires $$\Omega(\sqrt{k})$$ queries as shown by Batu, Ergun, Kilian, Magen and Raskhodnikova (2003), whereas $$(k, \Omega(n))$$-gap Hamming distance can be solved in $$O(1)$$ time.

These results are incomparable to those obtained by Goldenberg, Kociumaka, Krauthgamer and Saha (which was discussed in our November 2021 post), where they give a nonadaptive algorithm with complexity $$O(n/k^{1.5})$$ for $$(k, k^2)$$-gap edit distance problem. The algorithm in the present paper is adaptive and works faster for smaller values of $$k$$.

Privately Estimating Graph Parameters in Sublinear time by Jeremiah Blocki, Elena Grigorescu, Tamalika Mukherjee (arXiv)

Differentially private approximation algorithms for optimization problems on graphs is a well-studied topic. This paper opens up an exciting research direction by initiating a systematic study of the design of differentially private sublinear-time algorithms. The setting is that graphs are viewed as databases and two graphs are neighboring if they differ in an edge (or a node). An algorithm $$A$$ is $$\epsilon$$-differentially private if for every pair of edge-neighboring(or node-neighboring) graphs $$G, G’$$ and for every subset $$S$$ of outputs, $$\Pr[A(G) \in S] \leq \exp(\epsilon) \cdot \Pr[A(G’) \in S]$$.

The paper presents $$\epsilon$$-differentially private sublinear-time algorithms for well-studied problems such as estimating the average degree, the size of a min vertex cover and the size of a maximum matching. These algorithms access the input graphs via neighbor queries and degree queries.

In addition to providing a strong privacy guarantee, their algorithms nearly match the approximation and complexity guarantees of their non-differentially private counterparts. The main idea seems to be the formalization of a sensitivity notion, which they refer to as Global Coupled Sensitivity, and bounding it for the known sublinear-time algorithms for the aforementioned problems. Finally, they add Laplace noise calibrated with this sensitivity value to the output of the algorithms to make them differentially private.

Testability and Local Certification of Monotone Properties in Minor-closed Classes by Louis Esperet And Sergey Norin (arXiv)

One of the major interests in graph property testing is to characterize which properties are testable, i.e, can be $$\epsilon$$-tested with query complexity that depends only on the parameter $$\epsilon$$. The question of testability is well-understood in the dense graph model as well as the bounded degree model. This paper concerns itself with testability questions in the general model or the sparse model of graph property testing, where graphs are represented as adjacency lists with no bound on the maximum degree.

The authors prove that every monotone property of minor-closed graph classes is testable with one-sided error, where a property is monotone if it is closed under taking subgraphs and a graph class is minor-closed if it is closed under taking minors. A crucial fact to be noted here is that a tester is allowed to make only uniformly random nerighbor queries.

This result is a significant generalization of a 2019 result by Czumaj and Sohler, who proved that for every finite set of graphs $$\mathcal{H}$$, every $$\mathcal{H}$$-free property of minor-closed graph classes is testable with one-sided error, where a graph satisfies $$\mathcal{H}$$-freeness if none of its subgraphs belong to $$\mathcal{H}$$.

They show an interesting consequence of their results to designing a short local certification scheme for monotone properties of minor-closed graph classes. Roughly speaking, they show the existence of a prover-verifier system for the aforementioned testing problem where proofs of length $$O(\log n)$$ are assigned to each vertex and that verifier needs to observe only the proofs assigned to a vertex and its neighbors.

The plane test is a local tester for Multiplicity Codes by Dan Karliner, Roie Salama, and Amnon Ta-Shma (ECCC)

Multiplicity codes are a generalization of Reed-Muller codes and was first studied by Guruswami and Wang (2013) and Kopparty, Saraf and Yekhanin (2014). The messages here are polynomials of degree $$d$$ over $$m$$ variables and the codeword corresponding to a polynomial $$p$$ is the evaluation of $$p$$ and of all of its directional derivatives of order upto $$s$$ over all the points in $$\mathbb{F}_q^m$$, where $$q$$ is a prime power.

Even though multiplicity codes are known to be locally decodable, it was open whether they are locally testable. Local testers for Reed Muller codes work by restricting the evaluations to a uniformly random line in $$\mathbb{F}_q^m$$ and checking whether it corresponds to the evaluations of a degree $$d$$ univariate polynomial. The authors first show that such a tester does not work for the case of multiplicity codes when $$d$$ is large. They then show that a plane test is a good local tester for multiplicity codes even for larger values of $$d$$. Specifically, a plane test checks whether the restriction of a given word, which is purportedly the evaluation of a polynomial of degree $$d$$ and of its derivatives, to a uniformly random plane in $$\mathbb{F}_q^m$$ is a bivariate multiplicity code of degree $$d$$.

We conclude the post with a short note from Nader Bshouty and Oded Goldreich on a fundamental characterization result in property testing.

On properties that are non-trivial to test by Nader H. Bshouty and Oded Goldreich (ECCC)

A property on binary strings is nontrivial if for infinitely many $$n$$, the property contains at least one string of length $$n$$ and at most $$2^{n – \Omega(n)}$$ strings of length $$n$$. The note shows that every nontrivial property requires $$\Omega(1/\epsilon)$$ queries to $$\epsilon$$-test.