The Way to Applied Machine Learning

Soujanya Syamal
Apr 23, 2021

Machine learning researchers astonish us with new discoveries and inventions every year. A dozen artificial intelligence conferences exist where researchers push the limits of science and demonstrate how neural networks and deep learning architectures can tackle new problems in areas including computer vision and natural language processing.

However, applying machine learning to real-world applications and business problems, often called "applied machine learning" or "applied AI," poses challenges that aren't present in academic and scientific research. Applied machine learning requires tools, expertise, and knowledge that extend beyond data science, so that AI algorithms can be integrated into applications used by thousands or millions of people every day.

In their latest book Real World AI: A Practical Guide for Responsible Machine Learning, Alyssa Simpson Rochwerger and Wilson Pang, two accomplished practitioners of applied machine learning, address these challenges. Rochwerger, a former director of product at IBM Watson, and Pang, the CTO of Appen, use their personal experiences and expertise to provide several examples of how businesses have successfully or unsuccessfully integrated machine learning into their products and business models.

Real World AI discusses how product leaders can avoid repeating the mistakes of others by understanding the common problems and pitfalls of machine learning strategies. Here are four of the major challenges Rochwerger and Pang raise in their book.

Defining the Problem

Identifying the problem you want to solve is a challenge that all software engineering efforts face. Any seasoned developer will tell you that "doing things right" is not the same as "doing the right thing." The problem definition is critical in applied machine learning because it determines the technology, the data sources, and the people who will work on your product.

In their book Real World AI, Rochwerger and Pang write,

“Only 20% of AI in pilot stages at major companies makes it to market, and many struggle to satisfy their customers as well as they could.” “Sometimes it's because they're attempting to solve the incorrect issue. Others fail to account for all the variables—or latent biases—that are critical to the success or failure of a model.”

Consider the problem of image classification. Deep neural networks can perform such tasks with astonishing accuracy. However, if you want to use them in a real-world application, you'll need a clear definition of the problem to figure out what kind of model, data, talent, and investment you'll need.

If you just want a neural network to label the files in your image folder, there are plenty of pre-trained convolutional neural networks (e.g., ResNet, Inception) and public datasets (e.g., ImageNet and Microsoft COCO) that you can use out of the box. You can run your images through a deep learning model that you set up on your own server. You can also get started with an API-based service like Amazon Rekognition or Microsoft Azure Computer Vision. In that scenario, inference takes place on the service provider's servers.

However, suppose you work for a large agriculture company and want to create an image classifier that can detect weeds in crops using drones. Ideally, the technology would let your company move to precision herbicide application, reducing costs, waste, and chemical side effects. You'll need a more advanced approach in this situation. You'll need to think about the constraints on both the machine learning model and the data. You'll need a neural network that's light enough to run on the limited computing resources of edge devices. You'll also need a custom dataset of labelled photos of weed and non-weed plants.

Determining how well you want to solve the problem is part of identifying the problem in machine learning. In the case of image archive labelling, for example, you shouldn't have much of a problem if your machine learning model mislabels five out of every hundred images. However, if you're building a cancer-detection neural network, you'll need to set a far higher bar. Any case that goes unnoticed can have life-altering implications.
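The difference between those two bars can be written down before any model is trained. The following is a minimal sketch, not something taken from the book: the `Requirement` class, the threshold values, and the sample predictions are all hypothetical and only illustrate how the same model can pass one problem definition and fail another.

```python
from dataclasses import dataclass


@dataclass
class Requirement:
    """Hypothetical acceptance criteria agreed on before modelling starts."""
    min_accuracy: float  # share of all cases that must be classified correctly
    min_recall: float    # share of true positive cases the model must catch


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    predictions_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not predictions_for_positives:
        return 1.0  # nothing to miss
    return sum(p == positive for p in predictions_for_positives) / len(predictions_for_positives)


def meets(req, y_true, y_pred):
    return (accuracy(y_true, y_pred) >= req.min_accuracy
            and recall(y_true, y_pred) >= req.min_recall)


# Illustrative thresholds only; real values come from the business problem.
photo_archive = Requirement(min_accuracy=0.95, min_recall=0.0)
cancer_screening = Requirement(min_accuracy=0.95, min_recall=0.99)

# 100 cases, 10 of them positive. The model misses 2 positives
# and wrongly flags 3 negatives, so 95 of 100 answers are correct.
y_true = [1] * 10 + [0] * 90
y_pred = [1] * 8 + [0] * 2 + [1] * 3 + [0] * 87

print(meets(photo_archive, y_true, y_pred))     # acceptable for tagging photos
print(meets(cancer_screening, y_true, y_pred))  # two missed cases are too many
```

Five errors in a hundred satisfy the archive-labelling requirement, but the two missed positives put recall at 80 percent, far below what the screening requirement in this sketch demands.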

Collecting "Training Data"

Gathering and organising the data required to train models is one of the most difficult aspects of applied machine learning. By contrast, in academic research, training data is usually readily available, and the aim is to build the best machine learning model possible.

In their book Real World AI, Rochwerger and Pang write,

"When developing AI in the real world, the data used to train the model is much more critical than the model itself. This is a reversal of the traditional academic paradigm, in which data science PhDs spend the majority of their time and resources developing new models. However, the data used to train models in academia is only intended to demonstrate the model's functionality, not to solve real-world problems. High-quality and reliable data that can be used to train a working model is extremely difficult to come by in the real world.”

In many applied machine learning applications, public datasets are not useful for training models. You either have to collect your own data or purchase it from a third party. Both options come with their own set of difficulties.

In the herbicide application scenario, for example, the company would need to collect a large number of photographs of crops and weeds. For the machine learning model to work reliably, the engineers would need to take photographs under a variety of lighting, environmental, and soil conditions. Once they've gathered the data, they'll need to label the images as "crop" or "weed." Data labelling requires manual labour, is a tedious task, and has spawned an entire industry. Hundreds of platforms and businesses provide data labelling services for AI applications.

In other settings, such as healthcare and banking, the training data includes confidential information. Outsourcing labelling tasks in these situations can be tricky, and the product team must be careful not to violate privacy and security regulations.

Other applications have data that is fragmented and scattered across many databases, servers, and networks. As companies pull data from several sources, they run into issues such as inconsistent database schemas, mismatched conventions, incomplete data, and obsolete data. In such cases, one of the key challenges of the machine learning strategy is to clean the data and integrate the various sources into a data lake that can support the training and maintenance of ML models.
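To make the integration problem concrete, here is a small sketch of mapping two sources with different schemas and conventions onto one record format. Everything in it is an assumption for illustration: the "CRM" and "billing" sources, their field names, the date formats, and the `COUNTRY_CODES` table are hypothetical, and a real pipeline would need far more validation.

```python
from datetime import date, datetime

# Hypothetical lookup used to reconcile country naming conventions.
COUNTRY_CODES = {"germany": "DE", "india": "IN"}


def from_crm(row):
    """Assumed CRM export: DD/MM/YYYY dates and full country names."""
    return {
        "customer_id": str(row["cust_id"]).strip(),
        "signup_date": datetime.strptime(row["signup"], "%d/%m/%Y").date(),
        "country": COUNTRY_CODES.get(row["country"].strip().lower()),
    }


def from_billing(row):
    """Assumed billing export: ISO dates, two-letter codes, fields may be missing."""
    raw_date = row.get("signup_date")
    country = row.get("country_code")
    return {
        "customer_id": str(row["customerId"]).strip(),
        "signup_date": date.fromisoformat(raw_date) if raw_date else None,
        "country": country.upper() if country else None,
    }


def merge(records):
    """Keep one record per customer, filling missing fields from other sources."""
    merged = {}
    for rec in records:
        target = merged.setdefault(rec["customer_id"], dict(rec))
        for field, value in rec.items():
            if target.get(field) is None and value is not None:
                target[field] = value
    return merged


def incomplete(merged):
    """Customer ids that still have at least one missing field after merging."""
    return sorted(cid for cid, rec in merged.items()
                  if any(v is None for v in rec.values()))


crm_rows = [{"cust_id": " 17", "signup": "03/02/2021", "country": "Germany"}]
billing_rows = [
    {"customerId": "17", "signup_date": None, "country_code": "de"},
    {"customerId": "42", "signup_date": "2021-04-23", "country_code": None},
]

unified = merge([from_billing(r) for r in billing_rows]
                + [from_crm(r) for r in crm_rows])
print(unified["17"])
print(incomplete(unified))
```

Customer 17 ends up with a complete record because the CRM supplies the signup date that billing lacks, while customer 42 is reported as incomplete instead of being silently passed on to training.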

Because the data comes from several databases, verifying its accuracy and provenance is often critical to the quality of machine learning models. Rochwerger and Pang warn that "it's extremely normal in an organisation to find data dispersed around databases in various departments without any details about where it came from or how it got there." “It's quite likely the data has been modified or manipulated in some way when it travels from the point where it's obtained to the database where you find it. You could end up with a useless model if you make assumptions about how the data you're using got there.”

Maintaining ML Models

Machine learning models are prediction devices that look for trends in data collected from the outside world and predict future outcomes based on current data. When the world around us evolves, so do the data trends, and models based on historical data begin to fail.

“AI isn't a 'set it and forget it' device that will continue to produce results without human interference. To continue to provide useful, desired performance, it needs constant maintenance, management, and course correction,” Rochwerger and Pang write in Real World AI.

The COVID-19 pandemic, for example, led to lockdowns around the world and changed many daily habits, disrupting many machine learning models. Models used in supply chain management and sales forecasting, for instance, became outdated as shopping shifted from brick-and-mortar stores to online retailers, and they needed to be retrained.
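One simple way to notice this kind of shift is to compare recent values of an input feature with the values the model was trained on. The sketch below is a basic heuristic chosen for illustration, not a method described in the book: it measures how far the recent mean has moved in units of the baseline standard deviation. The feature (share of orders placed online), the sample values, and the threshold of 2.0 are all hypothetical.

```python
from statistics import mean, stdev


def drift_score(baseline, recent):
    """Shift of the recent mean, measured in baseline standard deviations."""
    spread = stdev(baseline)
    if spread == 0:
        return 0.0 if mean(recent) == mean(baseline) else float("inf")
    return abs(mean(recent) - mean(baseline)) / spread


def needs_retraining(baseline, recent, threshold=2.0):
    # The threshold is an assumption; tune it per feature and per use case.
    return drift_score(baseline, recent) > threshold


# Hypothetical weekly share of orders placed online.
baseline_online_share = [0.18, 0.20, 0.22, 0.19, 0.21, 0.20]
stable_weeks = [0.19, 0.21, 0.20]
lockdown_weeks = [0.55, 0.60, 0.58]

print(needs_retraining(baseline_online_share, stable_weeks))
print(needs_retraining(baseline_online_share, lockdown_weeks))
```

A check like this only flags that the inputs have changed. Deciding whether the model's predictions have actually degraded still requires fresh labelled data.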

As a result, a critical component of any effective machine learning strategy is having the infrastructure and processes in place to collect new data and update your models on a regular basis. If you're using supervised machine learning models, you'll also need to figure out how to label the new data. In some cases, you can do this by providing tools that let users give feedback on the models' predictions. In other cases, you'll have to label the new data manually.
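The two labelling routes described above can be sketched as a small triage step. This is an illustrative design under stated assumptions, not a prescribed workflow: the `Prediction` record, the `user_label` field, and the `review_below` confidence cut-off are hypothetical names and values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Prediction:
    item_id: str
    predicted_label: str
    confidence: float
    user_label: Optional[str] = None  # set when a user confirms or corrects the prediction


def split_for_retraining(predictions, review_below=0.7):
    """Separate user-labelled examples from items that need manual labelling."""
    labelled, needs_manual_review = [], []
    for p in predictions:
        if p.user_label is not None:
            # User feedback becomes a new training example.
            labelled.append((p.item_id, p.user_label))
        elif p.confidence < review_below:
            # No feedback and the model was unsure: send to human labellers.
            needs_manual_review.append(p.item_id)
    return labelled, needs_manual_review


predictions = [
    Prediction("a", "weed", 0.95, user_label="crop"),  # corrected by a user
    Prediction("b", "weed", 0.40),                     # low confidence, no feedback
    Prediction("c", "crop", 0.90),                     # confident, no feedback
]

labelled, needs_manual_review = split_for_retraining(predictions)
print(labelled)
print(needs_manual_review)
```

Confident predictions without feedback are left out of both lists here. Whether to sample some of them for spot checks is a separate design decision.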

According to Rochwerger and Pang:

“Don't forget to set aside funds for your model's continuing preparation. Models must be updated on a regular basis, or they will become less reliable as the real world shifts around them."

Making a Perfect Team

In applied machine learning, your models affect people's jobs and lives (as well as your company's bottom line). As a result, a small group of data scientists can seldom execute an effective machine learning strategy on their own.

“It's rare for a business issue to be solved solely by a blueprint. The majority of problems are multifaceted and necessitate a diverse set of skills—data pipelines, technology, user experience, and business risk analysis,” Rochwerger and Pang write in Real World AI. “To put it another way, machine learning is only useful if it is integrated into a business process, a customer interface, or a product and then released.”

Applied machine learning needs a cross-functional team of people from various disciplines and backgrounds, and not all of them are technical.

Subject matter experts need to verify the accuracy of the training data and the model's conclusions. Product managers need to define the business goals and expected outcomes of the machine learning strategy. User researchers help validate the model's results through interviews with, and feedback from, the system's end users. In addition, an ethics team is needed to identify sensitive areas where machine learning models could cause harm.

Rochwerger and Pang write that “the nontechnical components of a good AI solution are just as important, if not more important, than the strictly technical skills required to construct a model.”

Applied machine learning also requires engineering support in addition to data science expertise. Software engineers are needed to help integrate the models into the organization's other software. Data engineers need to set up the data infrastructure and plumbing that feeds the models during training and maintenance. Furthermore, the IT department is responsible for providing the computing, network, and storage resources needed to train and serve the machine learning models.

“Without access to the data, resources, and infrastructure required to ingest each dataset, save it, transfer it to the right location, and manipulate it,” Rochwerger and Pang write, “it will be impossible to achieve success even with a wonderful business plan, a well-articulated, unique issue, and a great team.”

Developing the "Right Strategy" for Machine Learning

These are only a few of the biggest challenges you'll face when using machine learning in the real world. To make your machine learning strategy work, you'll need a few more components. In their book, Rochwerger and Pang also address pilot projects, the "build vs. buy" debate, production problems, security and privacy concerns, and the ethical challenges of applied machine learning. They offer many real-world examples of how to do it correctly and avoid botching your machine learning project.

“There's no need to be scared of artificial intelligence. It isn't magic, and it certainly isn't rocket science. You can do this, and you can do it well, with hard work and the right team working together collaboratively,” Rochwerger and Pang write.