Source: Top 11 Tools For Distributed Machine Learning – Analytics India Magazine.
There are two fundamentally different and complementary ways of accelerating machine learning workloads:
- By vertical scaling, or scaling up, where one adds more resources to a single machine
- By horizontal scaling, or scaling out, where one adds more nodes to the system
When it comes to the degree of distribution, machine learning systems are classified as:
- Centralised
- Decentralised
- Fully Distributed
Centralised systems employ a strictly hierarchical approach, whereas a fully distributed system consists of a network of independent nodes in which no node is assigned a specific role.
A centralised solution is not the right choice when data is inherently distributed or too big to store on a single machine. For instance, think about astronomical data that is too large to move and centralise.
In a recent paper, researchers at Delft University of Technology, Netherlands, wrote in detail about the current state-of-the-art distributed ML models and how they affect computation latency and other attributes.
The advantages of distributed ML are many, and covering them all is beyond the scope of this article. Here we list popular toolkits and techniques that enable distributed machine learning:
MapReduce and Hadoop
MapReduce is a framework developed by Google for processing data in a distributed setting. First, all data is split into tuples during the map phase. In the reduce phase that follows, these tuples are grouped to generate a single output value per key. MapReduce and Hadoop rely heavily on the distributed file system in every phase of execution.
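The map and reduce phases described above can be sketched in a few lines of plain Python. This is only a single-process illustration of the idea, using word counting as the example task. It is not the Hadoop API, and the function names (map_phase, shuffle, reduce_phase, word_count) are made up for this sketch. A real MapReduce job would run these phases on many machines and pass intermediate results through the distributed file system.

```python
from collections import defaultdict


def map_phase(documents):
    """Split the input into (key, value) tuples: one (word, 1) per word."""
    for doc in documents:
        for word in doc.lower().split():
            yield (word, 1)


def shuffle(pairs):
    """Group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each group into a single output value per key."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(documents):
    """Chain the phases: map, then group by key, then reduce."""
    return reduce_phase(shuffle(map_phase(documents)))


counts = word_count(["spark and hadoop", "hadoop and mapreduce"])
print(counts)
```

In this sketch the grouping step happens in memory. In a cluster, that same step is where tuples are moved between nodes so that all values for one key end up on the same reducer.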
Apache Spark
To read more… Top 11 Tools For Distributed Machine Learning.