Why the Machine Learning industry can’t grow without Open Source

Toward the end of 2016, Google DeepMind made its machine learning platform, DeepMind Lab, publicly available. Despite warnings from experts like Professor Stephen Hawking, Google’s decision to expose its software to other developers is part of a movement to further develop the capabilities of machine learning. Google isn’t the only one, though. Facebook made its deep learning software public last year, and Elon Musk’s non-profit organization OpenAI released Universe, an open software platform that can be used to train AI systems. So, why have Google, OpenAI, and others made their platforms public, and how will this affect the adoption of machine learning?

Open source machine learning… Why?

The examples mentioned give us a better picture. If you look closely, machine learning has always been open source, and open R&D is the fundamental reason why machine learning is where it is today.

By making its machine learning platform available to the public, Google has drawn greater attention to its AI research. Making the software accessible has several advantages, such as finding new talent and capable startups to add to the Alphabet Inc. family. At the same time, developers can access DeepMind Lab, which will help address one of the key issues with ML research: the dearth of training environments. OpenAI has introduced a new virtual school for AI, Universe, which uses games and websites to train AI systems.

Making machine learning platforms publicly available is a much-needed move now.

5 advantages of open source in machine learning projects

Reproducing scientific results and fair comparison of algorithms: In machine learning, numerical simulations are frequently used to provide experimental validation and comparison of methods. Preferably, such a comparison between methods is based on a rigorous theoretical analysis. Open source tools and technology offer an opportunity to thoroughly conduct research using publicly available source code without depending on the vendor.

Quick bug finding and fixing: When you carry out machine learning projects using open source software, bugs become easier to detect and resolve, because anyone can inspect and patch the source code.

Accelerating scientific development through low-cost reuse of methods: Scientific progress is always built on existing methods and discoveries, and the machine learning field is no exception. With open source technologies available, researchers and project teams can reuse existing resources instead of rebuilding them from scratch.

Long-term availability and support: For an individual researcher, developer, or data scientist, open source can ensure that they and everyone else can keep using their research or discovery even after they change jobs. Releasing code under an open source license therefore increases the chances of long-term support.

Faster adoption of Machine Learning by various industries: There are notable examples of open source software that have supported the creation of multi-billion dollar machine learning companies and industries. The main reason researchers and developers have adopted machine learning is the easy availability of high-quality, free open source implementations.

Accelerating the adoption curve of open source machine learning

The advancement of open source machine learning will enable a steeper adoption curve for Artificial Intelligence, encouraging developers and startups to work towards making AI smarter. The availability of software platforms is changing the way businesses develop AI, encouraging them to follow in the footsteps of Google, Facebook, and OpenAI by being more transparent about their research.

The shift toward open machine learning platforms is an important phase in ensuring that AI works for everyone, instead of just a handful of tech giants.

From my perspective, there are three reasons for tech giants to release open-source machine learning projects:

To hire engineers who have already started to engage with the open source community and have built an understanding via an open-source project
To control a machine learning platform that fits best into their broader SDK or cloud-platform strategy
To grow the entire market because their market share has reached a saturation point

When a startup releases an open source project, it triggers awareness, some of which converts into paid customers and recruitment. Startups, by their very definition, are trying to get a foothold in a specific market instead of growing an existing market. Open source is frictionless. It costs nothing to serve another organic user, and it enables organizations to solve real problems, thus allowing the code to have a greater impact.

Instead of disrupting the startups that build proprietary technologies, open source has given the world a taller pair of shoulders to stand on. One of the knock-on effects may be a shift in where the value lies. With the commoditization of the entire AI technology stack, the focus shifts from core machine learning technologies to building the best models, and this requires a vast amount of data and domain experts to create and train the models. Large incumbent businesses with an existing network effect have a natural advantage.

Best frameworks in open source machine learning

There is a wide range of open source machine learning frameworks available in the market, which enable machine learning engineers to:

Build, implement and maintain machine learning systems
Generate new projects
Create new impactful machine learning systems

Some of the important frameworks include:

Apache SINGA is a general, distributed deep-learning platform for training big deep-learning models over large datasets. It is designed with an intuitive programming model based on layer abstraction. A variety of popular deep learning models are supported, namely feed-forward models including convolutional neural networks (CNN), energy models like the restricted Boltzmann machine (RBM), and recurrent neural networks (RNN). Many built-in layers are provided for users.
Shogun is among the oldest and most revered machine learning libraries. Shogun was created in 1999 and written in C++, but isn’t limited to working in C++. Thanks to the SWIG library, Shogun can be used in languages and environments such as:
- Java
- Python
- C#
- Ruby
- R
- Lua
- Octave
- Matlab

Shogun is designed for unified large-scale learning for a broad range of feature types and learning settings, like classification, regression, dimensionality reduction, clustering, etc. It contains several exclusive state-of-the-art algorithms, such as a wealth of efficient SVM implementations, multiple kernel learning, kernel hypothesis testing, Krylov methods, etc.

TensorFlow is an open source software library for numerical computation using data flow graphs. These graphs describe the mathematical computations as a directed graph of nodes and edges. Nodes implement mathematical operations and can also represent endpoints to feed in data, push out results, or read/write persistent variables. Edges describe the input/output relationships between nodes. Data edges carry dynamically-sized multi-dimensional data arrays, or tensors.
Scikit-learn leverages Python’s breadth by building on top of several existing Python packages (NumPy, SciPy, and matplotlib) for math and science work. The resulting library can be used for interactive “workbench” applications or embedded into other software and reused. The kit is available under a BSD license, and therefore, it’s fully open and reusable. Scikit-learn includes tools for many of the standard machine-learning tasks (such as clustering, classification, regression, etc.). Since scikit-learn was developed by a large community of developers and machine-learning experts, promising new techniques tend to be included in short order.
MLlib (Spark) is Apache Spark’s machine learning library. Its goal is to make practical machine learning scalable and easy. It consists of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, as well as lower-level optimization primitives and higher-level pipeline APIs. Spark MLlib is regarded as a distributed machine learning framework on top of the Spark Core which, mainly due to the distributed memory-based Spark architecture, is almost nine times as fast as the disk-based implementation used by Apache Mahout.
Amazon Machine Learning is a service that makes it easy for developers of all skill levels to use machine learning technology. Amazon Machine Learning provides visualization tools and wizards that guide you through the process of creating machine learning (ML) models without having to learn complex ML algorithms and technology. It connects to data stored in Amazon S3, Redshift, or RDS, and can run binary classification, multiclass categorization, or regression on that data to create a model.
Apache Mahout is a free and open source project of the Apache Software Foundation. Its goal is to develop free distributed or scalable machine learning algorithms for diverse areas like collaborative filtering, clustering, and classification. Mahout provides Java libraries and Java collections for various kinds of mathematical operations. Apache Mahout is implemented on top of Apache Hadoop using the MapReduce paradigm. Once Big Data is stored on the Hadoop Distributed File System (HDFS), Mahout provides the data science tools to automatically find meaningful patterns in these Big Data sets, turning them into ‘big information’ quickly and easily.
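
To make the data flow graph idea from the TensorFlow entry more concrete, here is a minimal sketch in plain Python. It is not TensorFlow's API: the names Node, placeholder, add, multiply, and run are hypothetical and exist only for illustration, and plain Python lists stand in for tensors. Nodes are either input endpoints or mathematical operations, and each node's inputs are the edges that carry data into it.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Sequence


@dataclass
class Node:
    """A graph node: an input endpoint (op is None) or a mathematical operation."""
    name: str
    op: Optional[Callable[..., List[float]]] = None
    inputs: Sequence["Node"] = field(default_factory=tuple)  # incoming edges


def placeholder(name: str) -> Node:
    """An endpoint used to feed data into the graph."""
    return Node(name)


def add(a: Node, b: Node) -> Node:
    """Element-wise addition of the values arriving on two edges."""
    return Node("add", lambda x, y: [p + q for p, q in zip(x, y)], (a, b))


def multiply(a: Node, b: Node) -> Node:
    """Element-wise multiplication of the values arriving on two edges."""
    return Node("mul", lambda x, y: [p * q for p, q in zip(x, y)], (a, b))


def run(node: Node, feed: Dict[str, List[float]], cache: Optional[dict] = None) -> List[float]:
    """Evaluate a node by pulling values along the edges from its inputs."""
    if cache is None:
        cache = {}
    key = id(node)  # cache per node object so shared nodes are computed once
    if key not in cache:
        if node.op is None:
            cache[key] = feed[node.name]  # input endpoint: read the fed value
        else:
            args = [run(parent, feed, cache) for parent in node.inputs]
            cache[key] = node.op(*args)
    return cache[key]


# Build the graph first, then feed data into it: y = x * w + b
x, w, b = placeholder("x"), placeholder("w"), placeholder("b")
y = add(multiply(x, w), b)

result = run(y, {"x": [1.0, 2.0, 3.0], "w": [2.0, 2.0, 2.0], "b": [0.5, 0.5, 0.5]})
print(result)  # [2.5, 4.5, 6.5]
```

The point of the sketch is the separation between describing a computation as a graph and executing it with concrete data, which is what lets a framework decide where and how each node runs.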

The final say

If machine learning is to solve real scientific and technological problems, the community needs to build on each other’s open source software tools. We believe there is an urgent need for open source machine learning software that fulfills several concurrent roles, including:

Better means for reproducing results
Mechanism for providing academic recognition for quality software implementations
Acceleration of the research process by letting researchers stand on the shoulders of others (not necessarily tech giants!)

Chandrashekhar Deshpande

Chandrashekhar has been working in the PR and Marketing industry for 5 years and likes to believe that every day is his first day at work. Idolizes Jim Morrison. Also, is hopelessly in love with music and travelling.