After either really busy months or really quiet months, we’re now on a moderately busy month (which is the way it should be). A collection of 7 papers, now showing an increased interest in sublinear graph algorithms. Of course, we still have some distribution testing, and a paper on sparse functional representations.
A note on approximating the average degree of bounded arboricity graphs by Talya Eden, C. Seshadhri (arXiv). Let us begin with this simple note, since it will set the stage for the other papers. Consider the basic problem of estimating the average degree of a (simple) graph \(G = (V,E)\), under the standard adjacency list model. We assume that \(V = [n]\), and an algorithm can get the degree of a vertex and query for uar neighbors. Classic results prove that there are \(\widetilde{O}(\varepsilon^{-2} n/\sqrt{m})\) query (and time) algorithms. Moreover, when the graph arboricity is \(\alpha\), one can get \(\widetilde{O}(\varepsilon^{-2} \alpha)\) query algorithms. These algorithms and results are unfortunately buried deep inside papers, or have extraneous \(\varepsilon\) and \(\log n\) factors. This note give a simple, clean presentation of these results; the main result just takes two pages.
Almost-Uniform Edge Sampling: Leveraging Independent-Set and Local Graph Queries by Tomer Adar and Amit Levi (arXiv). Same setting as the previous blurb. A result of Eden-Rosenbaum (among others) showed that the problem of sampling a uniform random edge can be done in the same query complexity as estimating the number of edges. This is quite interesting, since these problems (at least from a sublinear perspective) are not similar, and the underlying algorithms are quite different. This paper studies this phenomenon in more general models. Very recent results (from Jan 2026!) by Adar-Hotam-Levi showed that average degree estimation can be done much faster with Independent Set (IS) queries together with standard adjacency list queries. Essentially, for sparse graphs, one can get \(O(n^{1/4})\) queries, as opposed to the usual \(\sqrt{n}\) bound. This paper shows that edge sampling can be done the same time as edge estimation (with nearly matching lower bounds). This “complexity equivalence” is also proven for the model with only IS queries.
Testing Properties of Edge Distributions by Yumou Fei (arXiv). Let us bring in some distributions to the (dense) graph setting. Consider an input graph \(G\) with \(n\) vertices, represented as an adjacency matrix. Suppose there is an unknown distribution \(\mu\) over \([n]^2\) that we can sample from. Our aim is to test if (say) the graph is bipartite, where distance is defined according to \(\mu\). If \(\mu\) is uniform over all edges, then this is standard property testing. With arbitrary \(\mu\), we are looking at bipartiteness testing in the distribution-free setting. The classic Goldreich-Goldwasser-Ron result shows that bipartiteness testing (in the standard, uniform setting) can be done in \(poly(\varepsilon^{-1})\) queries. They also prove an \(\Theta(n)\) bound when the distribution \(\mu\) is uniform over an arbitrary subset of edges. This papers gives a bipartiteness tester for any \(\mu\), with the same asymptotic complexity. Moreover, there are results for testing triangle-freeness and square-freeness in this setting; the complexities are \(\Theta(n^{4/3})\) and \(\Theta(n^{9/8})\) respectively. (Note that in the standard setting, we get complexities independent of \(n\).)
Optimal Prediction-Augmented Algorithms for Testing Independence of Distributions by Maryam Aliakbarpour, Alireza Azizi, Ria Stevens (arXiv). Now onto distribution testing. Consider the problem of determining if a distribution \(p\) with support \([n_1] \times [n_2] \times \ldots [n_d]\) is a product distribution. This is a fundamental problem that goes back to the birth of statistics and Pearson’s \(\chi^2\) test. This paper studies the problem in the learning augmented setting. Suppose we are also given a complete description of a predicted approximate distribution \(\widehat{p}\), and a tolerance \(\alpha\). If \(\|\widehat{p} – p\|_{TV} \geq \alpha\), then the algorithm can output “Inaccurate Information”. So, if the predictor \(\widehat{p}\) is accurate, then the algorithm needs to give a correct output. The main result shows that one can improve on the optimal sample complexity by \(\alpha^{1/3}\) factors, so the approximate prediction can help avoid the lower bounds for independence testing.
Improved Local Computation Algorithms for Greedy Set Cover via Retroactive Updates by Slobodan Mitrović, Srikkanth Ramachandran, Ronitt Rubinfeld, Mihir Singhal (arXiv). Consider the classic set cover problem; given a collection of sets \(\mathcal{S}\) over a universe \(\mathcal{E} = \bigcup_{S \in \mathcal{S}} S\), find the smallest subcollection that covers \(\mathcal{S}\). For the sublinear perspective, assume that the input is represented as a bipartite incidence graph between \(\mathcal{S}\) and \(\mathcal{E}\), with adjacency list access. The aim is to get a Local Computation Algorithm (LCA) that computes a \(O(\log \Delta)\)-approximate set cover (where \(\Delta\) is the largest size in \(\mathcal{S}\)). An LCA is an algorithm that, given some \(S \in \mathcal{S}\), outputs (in sublinear time) whether this set is in the output subcollection. The main challenge is to bound the query complexity of this operation. The guarantee is that all the output sets should have the desired guarantees. There are distributed protocols that can output an \(O(\log \Delta)\)-approximate set cover in \(r = O(\log \Delta \cdot \log f)\) rounds (\(f\) is the largest number of sets that an element participates in). Parnas-Ron showed that one can get an LCA using \(\Delta^r\) queries. This paper gives an LCA that makes \(2^r\) queries, which avoids numerous roadblocks in previous work.
Approximate Butterfly Counting in Sublinear Time by Chi Luo, Jiaxin Song, Yuhao Zhang, Kai Wang, Zhixing He, Kuan Yang (arXiv). This is an interesting applied algorithms paper, that at its core, uses a sublinear algorithm. A significant amount of graph data is really a bipartite graph (like actor-movie, author-paper, etc.). A butterfly is a \(K_{2,2}\), which is also a 4-cycle. Counting butterflies is a common data mining problem, since a high density of \(K_{2,2}\)’s is usually indicative of some structure or clustering in the bipartite graph. There have been a number of practical algorithms to compute (or estimate) butterfly counts in bipartite graphs. This paper gives a theoretical sublinear algorithm that runs in \(\widetilde{O}(w\sqrt{m}/b + m/b^{1/4})\), where \(w\) is the wedge (2-path) count and \(b\) is butterfly count. This algorithm is implemented and compared with previous results. Interestingly, the comparison is done on queries and raw running time, and even the latter shows orders of magnitude improvement. This kind of work bridges the theory/practice divide, and shows the power of pure theoretical ideas (like heavy-light vertex partitioning, various edge sampling strategies) for practical algorithmics. Hopefully this inspires even more work along these lines!
Testing Sparse Functions over the Reals by Vipul Arora, Arnab Bhattacharyya, Philips George John, Sayantan Sen (arXiv). The problems of linearity/low degree testing are well-known to our readers. There is a recent line of work generalized such results to the real-valued setting. Consider \(f: \mathbb{R}^n \to \mathbb{R}\). Distance is measured in \(l_1\) over the Gaussian measure. This paper studies a number of properties parameterized by sparsity. Consider linear functions, where \(f(x)\) is linear iff \(f(x) = \sum_i c_i x_i\), for some constants \(c_i\). A function is \(k\)-linear if only \(k\) of the \(c_i\) coefficients are non-zero. So the function has a small representation, independent of the dimension \(n\). This paper gives a tester that makes \(\widetilde{O}(k\log k + 1/\varepsilon)\) queries. Similar results are given for \(k\)-sparse low-degree polynomials and the general problem of \(k\)-juntas. There is a significant subtlety in querying a real-valued function. One can imagine that the value \(f(x)\) is only obtained with some precision \(\eta\). There are results for \(k\)-linearity for the standard Boolean setting, but the precision and \(l_1\)-distance introduce challenges for generalizing to the real valued setting.