Classical machine learning algorithms often face scalability bottlenecks when they are applied to large-scale data. Such algorithms were designed to work with small datasets that are assumed to fit in the memory of a single machine. In this report, we analyze different methods for computing an important machine learning algorithm, namely Principal Component Analysis (PCA), and we comment on their limitations in supporting large datasets. The methods are analyzed and compared across two important metrics: time complexity and communication complexity. We consider the worst-case scenarios for both metrics, and we identify the software libraries that implement each method. The analysis in this report helps researchers and engineers in (i) understanding the main bottlenecks for scalability in different PCA algorithms, (ii) choosing the most appropriate method and software library for a given application and dataset characteristics, and (iii) designing new scalable PCA algorithms.
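For reference, the PCA computation analyzed throughout the report can be obtained from a singular value decomposition of the centered data matrix. Below is a minimal single-machine sketch (assuming NumPy; the function and variable names are illustrative and not taken from any of the surveyed libraries):

import numpy as np

def pca(X, k):
    # Project the rows of X (n_samples x n_features) onto the top-k principal components.
    Xc = X - X.mean(axis=0)                              # center each feature at zero mean
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)    # thin SVD of the centered data
    components = Vt[:k]                                  # top-k principal directions
    scores = Xc @ components.T                           # data projected onto those directions
    explained_variance = S[:k] ** 2 / (X.shape[0] - 1)
    return scores, components, explained_variance

# Example: 1000 samples with 50 features, reduced to 5 components.
rng = np.random.default_rng(0)
scores, components, var = pca(rng.normal(size=(1000, 50)), k=5)

This in-memory formulation is exactly what breaks down at scale: the full data matrix (or its covariance) must fit on one machine, which motivates the distributed methods compared in the report.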
Randomized algorithms provide solutions to two ubiquitous problems: (1) the distributed calculation of a principal component analysis or singular value decomposition of a highly rectangular matrix, and (2) the distributed calculation of a low-rank approximation (in the form of a singular value decomposition) to an arbitrary matrix. Carefully honed algorithms yield results that are uniformly superior to those of the stock, deterministic implementations in Spark (the popular platform for distributed computation); in particular, whereas the stock software will without warning return left singular vectors that are far from numerically orthonormal, a significantly burnished randomized implementation generates left singular vectors that are numerically orthonormal to nearly the machine precision.
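The randomized approach referred to here can be sketched as a randomized range finder followed by an SVD of a small projected matrix (the standard Halko-Martinsson-Tropp scheme). The sketch below is a generic single-machine illustration in NumPy, not the paper's Spark implementation:

import numpy as np

def randomized_svd(A, k, oversample=10, power_iters=2, seed=0):
    # Approximate rank-k SVD of A via a randomized range finder.
    rng = np.random.default_rng(seed)
    m, n = A.shape
    Omega = rng.normal(size=(n, k + oversample))   # Gaussian test matrix
    Y = A @ Omega                                  # sample the range of A
    for _ in range(power_iters):                   # power iterations sharpen the captured subspace
        Y = A @ (A.T @ Y)
    Q, _ = np.linalg.qr(Y)                         # orthonormal basis for the sampled range
    B = Q.T @ A                                    # small (k + oversample) x n matrix
    Ub, S, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], S[:k], Vt[:k]

Because the left factor is assembled from the explicitly orthonormalized basis Q, its columns remain numerically orthonormal up to rounding, which illustrates the kind of orthonormality the abstract contrasts with the stock Spark routines.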
Distributed optimization algorithms are widely used in many industrial machine learning applications. However, choosing the appropriate algorithm and cluster size is often difficult for users, as the performance and convergence rate of optimization algorithms vary with the size of the cluster. In this paper we make the case for an ML-optimizer that can select the appropriate algorithm and cluster size to use for a given problem. To do this, we propose building two models: one that captures the system-level characteristics of how computation and communication change as we increase the cluster size, and another that captures how convergence rates change with cluster size. We present preliminary results from our prototype implementation, called Hemingway, and discuss some of the challenges involved in developing such a system.
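A system of this kind might combine the two fitted models into a single time-to-accuracy estimate and pick the cluster size that minimizes it. The following is a purely hypothetical sketch of that selection step (the cost formulas, constants, and names are illustrative assumptions, not Hemingway's models or API):

def time_per_iteration(n_workers, compute_cost, comm_cost):
    # System model: parallelizable compute plus communication overhead that grows with the cluster.
    return compute_cost / n_workers + comm_cost * n_workers

def iterations_to_accuracy(n_workers, base_iters, stat_penalty):
    # Convergence model: larger effective batches may need more iterations to reach the target accuracy.
    return base_iters * (1 + stat_penalty * n_workers)

def pick_cluster_size(candidates, compute_cost=100.0, comm_cost=0.5,
                      base_iters=1000, stat_penalty=0.02):
    # Choose the candidate minimizing estimated total time to the target accuracy.
    def total_time(n):
        return (time_per_iteration(n, compute_cost, comm_cost)
                * iterations_to_accuracy(n, base_iters, stat_penalty))
    return min(candidates, key=total_time)

print(pick_cluster_size([2, 4, 8, 16, 32, 64]))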
In this paper we study fractional coloring from the angle of distributed computing. Fractional coloring is the linear relaxation of the classical notion of coloring, and has many applications, in particular in scheduling. It was proved by Hasemann, Hirvonen, Rybicki and Suomela (2016) that for every real $\alpha>1$ and integer $\Delta$, a fractional coloring of total weight at most $\alpha(\Delta+1)$ can be obtained deterministically in a single round in graphs of maximum degree $\Delta$, in the LOCAL model of computation. However, a major issue of this result is that the output of each vertex has unbounded size. Here we prove that even if we impose the more realistic assumption that the output of each vertex has constant size, we can find fractional colorings of total weight arbitrarily close to known tight bounds for the fractional chromatic number in several cases of interest. More precisely, we show that for any fixed $\epsilon > 0$ and $\Delta$, a fractional coloring of total weight at most $\Delta+\epsilon$ can be found in $O(\log^* n)$ rounds in graphs of maximum degree $\Delta$ with no $K_{\Delta+1}$, while finding a fractional coloring of total weight at most $\Delta$ in this case requires $\Omega(\log \log n)$ rounds for randomized algorithms and $\Omega(\log n)$ rounds for deterministic algorithms. We also show how to obtain fractional colorings of total weight at most $2+\epsilon$ in grids of any fixed dimension, for any $\epsilon>0$, in $O(\log^* n)$ rounds. Finally, we prove that in sparse graphs of large girth from any proper minor-closed family we can find a fractional coloring of total weight at most $2+\epsilon$, for any $\epsilon>0$, in $O(\log n)$ rounds.
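For context, the linear relaxation referred to above can be stated as the following standard linear program (generic background, not quoted from the paper): writing $\mathcal{I}(G)$ for the family of independent sets of a graph $G$, a fractional coloring assigns a weight $x_I \ge 0$ to each $I \in \mathcal{I}(G)$ so that every vertex is covered with total weight at least one, and its total weight is $\sum_{I} x_I$. The fractional chromatic number is then
$$\chi_f(G) \;=\; \min\Big\{ \sum_{I \in \mathcal{I}(G)} x_I \;:\; \sum_{I \ni v} x_I \ge 1 \ \text{ for all } v \in V(G),\ \ x_I \ge 0 \Big\}.$$
Restricting the weights $x_I$ to $\{0,1\}$ recovers the usual chromatic number, which is why this is called the linear relaxation of coloring.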
More than two decades ago, combinatorial topology was shown to be useful for analyzing distributed fault-tolerant algorithms in shared memory systems and in message passing systems. In this work, we show that combinatorial topology can also be useful for analyzing distributed algorithms in failure-free networks of arbitrary structure. To illustrate this, we analyze consensus, set-agreement, and approximate agreement in networks, and derive lower bounds for these problems under classical computational settings, such as the LOCAL model and dynamic networks.
In this paper we describe the research and development activities in the Center for Efficient Exascale Discretization within the US Exascale Computing Project, targeting state-of-the-art high-order finite-element algorithms for high-order applications on GPU-accelerated platforms. We discuss the GPU developments in several components of the CEED software stack, including the libCEED, MAGMA, MFEM, libParanumal, and Nek projects. We report performance and capability improvements in several CEED-enabled applications on both NVIDIA and AMD GPU systems.