Compared to even just a few years ago, the tools available for data scientists and machine learning engineers today are of remarkable variety and ease of use. However, the availability and sophistication of such tools belie the ongoing challenges in implementing end-to-end data analytics use cases in the enterprise and in production.
It is common knowledge that the biggest lift in applying machine learning in production is variously “data engineering,” “data modeling,” “building a data pipeline,” or some other logistical effort notionally divorced from the algorithms and objective functions of data science. To me, however, data logistics and data science are not separate compartmentalized efforts, and neither are the functions of data labeling and algorithmic feedback. What the end-users of data analytics want actually meshes with what AI engineers need--“collaborative AI” systems that are designed from the ground up to engage users (whether enterprise or consumer) to provide feedback that iteratively tunes and improves algorithm performance.
Many enterprise use cases involve photo labeling, text translation, and other common image and natural language use cases. For these, there is an embarrassment of choices, both of services (Google Cloud AI, Azure ML Studio, Amazon ML Service, and Dask) and of development frameworks (TensorFlow, Torch, Caffe, and scikit-learn). In some cases, little knowledge of algorithmic internals and tuning is required, just a sufficiently large labeled data set and some engineering and devops know-how. Make no mistake: this is a tremendous stride forward in bringing cutting-edge algorithm use to the enterprise. But that one phrase, “a sufficiently large labeled data set,” can be a show-stopper.
Figure: Image Caption Examples (https://arxiv.org/pdf/1411.5654.pdf)
Relatively mature applications such as image classification and machine translation presuppose standard data formats and, often, some level of data normalization, and their algorithms have been under development by a large community for many years. Other types of data can be exploited by machine learning, but the same standardization and normalization steps are essential and must still be addressed. Data standardization can be handled as an engineering task, but it must target analytics use cases and incorporate domain knowledge. For new or novel algorithmic uses, moreover, some method to actively engage users in providing interactive feedback on an ongoing basis is essential so that the algorithm can be iteratively improved.
At a high level, machine learning methods involve specifying two things: a model or function that maps input data to output data and a way to measure error in the mapping for particular settings of the model (or parameters). The exercise then becomes finding the best model parameters for the selected model, given the available data (input and output). This presupposes that there is actually appropriate input data, all in the right format, and in the same format, without problems such as missing data, or errors, or other data regularity problems. It also presupposes that there is an appropriate matching set of output data for each input. There are models that do not require a matching set of output data, but the vast majority of current machine learning models--especially those that are “off the shelf”--are supervised learning models that do require a specified output for every input datum.
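To make the two ingredients concrete, here is a minimal sketch in plain Python. The straight-line model, the mean squared error measure, the gradient descent settings, and the toy data are all assumptions chosen for illustration, not something prescribed above; real projects would normally lean on one of the frameworks mentioned earlier.

```python
from typing import List, Tuple

Params = Tuple[float, float]  # (slope, intercept) of a hypothetical line model


def predict(params: Params, x: float) -> float:
    """The model: maps an input to an output for one setting of the parameters."""
    slope, intercept = params
    return slope * x + intercept


def mean_squared_error(params: Params, inputs: List[float], outputs: List[float]) -> float:
    """The error measure: how badly the model maps inputs to the known outputs."""
    total = sum((predict(params, x) - y) ** 2 for x, y in zip(inputs, outputs))
    return total / len(inputs)


def fit(inputs: List[float], outputs: List[float],
        learning_rate: float = 0.05, steps: int = 2000) -> Params:
    """Search for the parameters with the lowest error using gradient descent."""
    slope, intercept = 0.0, 0.0
    n = len(inputs)
    for _ in range(steps):
        errors = [predict((slope, intercept), x) - y for x, y in zip(inputs, outputs)]
        grad_slope = 2.0 * sum(e * x for e, x in zip(errors, inputs)) / n
        grad_intercept = 2.0 * sum(errors) / n
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept


# Toy supervised data: every input has a matching output (here y = 2x + 1).
inputs = [0.0, 1.0, 2.0, 3.0, 4.0]
outputs = [1.0, 3.0, 5.0, 7.0, 9.0]

best_params = fit(inputs, outputs)
final_error = mean_squared_error(best_params, inputs, outputs)
```

Note how much the sketch takes for granted: every input is a clean number, nothing is missing, and each input already has its output. Those are exactly the presuppositions described above.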
As already mentioned, for common use cases like image labeling, the tasks of model selection, input data selection, and output labeling are all relatively straightforward. But what about market survey data, IoT telemetry, police incident reports, or security logs? Certainly there are natural language and other tools that can be brought to bear to assist in processing these and other unstructured data types. The first task, however, is to turn all this unstructured data into structured data so it can be modeled.
This is not the same challenge as ingesting unstructured data for search and query. There are other well-developed techniques and technologies for doing this in a way that makes minimal assumptions about the structure of the data, such as word2vec and other unsupervised learning algorithms. Such methods might be useful for providing structure to data for downstream learning, but they are poor at surfacing insights on their own.
Although there can be an aversion to making assumptions about data when building a processing pipeline, one always makes assumptions in data processing. For example, one might make the assumption that there is a sufficient amount of labeled data to uniquely constrain a parametric model to the best fit separation of the input data by label--i.e., to properly train a classifier algorithm. But with real-world data, class imbalance, missing data, and other data issues, coupled with poor model selection, this assumption may not hold. Given this, it may be less of a stretch to make other assumptions about one’s data that can potentially improve modeling performance.
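One quick way to test the "enough labeled data" assumption is simply to count labels per class before training. The sketch below is illustrative only: the label names and counts are made up, and the inverse-frequency weighting is one common heuristic for compensating for imbalance, not a recommendation from the discussion above.

```python
from collections import Counter
from typing import Dict, List


def class_weights(labels: List[str]) -> Dict[str, float]:
    """Inverse-frequency weights: n_samples / (n_classes * count_of_class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Hypothetical security-log labels: 95 benign events for every 5 incidents.
labels = ["benign"] * 95 + ["incident"] * 5

counts = Counter(labels)
weights = class_weights(labels)
```

With these made-up numbers the rare class gets a weight of 10.0 and the common class roughly 0.53, which makes the imbalance hard to overlook during model selection.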
Probably the best advice for a data scientist or data engineer working with unstructured data or non-image data would be to work closely with knowledgeable end-users to develop and document a baseline data model. Specifically, focus on:
Standardizing the data (schema, field ranges, etc.)
How to handle data issues like data from different sources, missing data, etc.
Enrichment, including what some data can inform about other data
Feature selection, which is an essential data reduction step that’s much better for domain experts to inform directly
It is tempting to go overboard with any or all of these, but it is always better to at least start with minimal assumptions, such as those regarding data availability, links among data, etc.
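The four focus areas above can be sketched as a small, documented baseline data model. Everything specific here is hypothetical: the two vendor formats, the field names, the temperature bounds, the site lookup table, and the selected features are stand-ins for what knowledgeable end-users would actually supply.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class Reading:
    device_id: str
    temperature_c: Optional[float]  # None marks a missing or rejected value
    site_type: str = "unknown"      # filled in later by enrichment


def standardize(raw: Dict[str, Any], source: str) -> Reading:
    """Map source-specific field names and units onto one schema."""
    if source == "vendor_a":        # assumed to report Celsius as "temp_c"
        temp = raw.get("temp_c")
    elif source == "vendor_b":      # assumed to report Fahrenheit as "tempF"
        temp_f = raw.get("tempF")
        temp = None if temp_f is None else (temp_f - 32.0) * 5.0 / 9.0
    else:
        raise ValueError(f"unknown source: {source}")
    # Field range agreed with domain experts (assumed bounds).
    if temp is not None and not (-50.0 <= temp <= 150.0):
        temp = None
    return Reading(device_id=str(raw["id"]), temperature_c=temp)


def fill_missing(readings: List[Reading]) -> List[Reading]:
    """Minimal missing-data policy: use the mean of the observed values."""
    observed = [r.temperature_c for r in readings if r.temperature_c is not None]
    if not observed:
        return readings
    mean = sum(observed) / len(observed)
    return [
        Reading(r.device_id, mean if r.temperature_c is None else r.temperature_c, r.site_type)
        for r in readings
    ]


def enrich(readings: List[Reading], site_by_device: Dict[str, str]) -> List[Reading]:
    """Enrichment: let one data set (a site registry) inform another."""
    return [
        Reading(r.device_id, r.temperature_c, site_by_device.get(r.device_id, "unknown"))
        for r in readings
    ]


# Feature selection informed directly by domain experts (assumed choice).
SELECTED_FEATURES = ("temperature_c", "is_outdoor")


def to_features(r: Reading) -> Dict[str, Optional[float]]:
    candidates = {
        "temperature_c": r.temperature_c,
        "is_outdoor": 1.0 if r.site_type == "outdoor" else 0.0,
        "id_length": float(len(r.device_id)),  # computed but deliberately not selected
    }
    return {name: candidates[name] for name in SELECTED_FEATURES}


raw_records = [
    ({"id": "a1", "temp_c": 20.0}, "vendor_a"),
    ({"id": "b7", "tempF": 212.0}, "vendor_b"),
    ({"id": "a2"}, "vendor_a"),  # temperature missing
]
readings = [standardize(raw, source) for raw, source in raw_records]
readings = fill_missing(readings)
readings = enrich(readings, {"a1": "indoor", "b7": "outdoor"})
features = [to_features(r) for r in readings]
```

Each function encodes one explicit, reviewable assumption, which keeps the baseline minimal and makes it easy for end-users to challenge any single step.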
The machine learning algorithm actually sits after all this--after the data standardization, after the enrichment, and after the feature selection. This is why many production data scientists and data engineers report that upwards of 90% of their effort is on data logistics (https://youtu.be/fZXQZNKFUVE). Because it is. And one cannot shortcut it. Or rather one can, but leaving this work out or doing it incorrectly leads to common failures like: “it worked on the test data,” and “it worked well initially, but performance fell off over time.” These sorts of production failures can be truly challenging to overcome, but careful design and architecture can remove systemic causes.
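As one illustration of designing against the "performance fell off over time" failure, a pipeline can compare recent production inputs with the data the model was trained on. This is a rough sketch under stated assumptions: the single numeric feature, the sample values, and the alert threshold of 3 standard deviations are all invented for the example, and production systems typically use more robust drift tests.

```python
import statistics
from typing import List


def mean_shift_in_std_units(reference: List[float], recent: List[float]) -> float:
    """How far the recent mean has moved, in units of the reference spread."""
    ref_mean = statistics.fmean(reference)
    ref_std = statistics.pstdev(reference)
    if ref_std == 0:
        raise ValueError("reference data has no spread; cannot scale the shift")
    return abs(statistics.fmean(recent) - ref_mean) / ref_std


ALERT_THRESHOLD = 3.0  # assumed value; tune per feature and application

# Hypothetical feature values seen at training time and later in production.
training_values = [10.0, 11.0, 9.0, 10.0, 12.0, 8.0, 10.0, 11.0, 9.0, 10.0]
stable_batch = [10.0, 9.0, 11.0, 10.0]
drifted_batch = [14.0, 15.0, 13.0, 14.0]

stable_shift = mean_shift_in_std_units(training_values, stable_batch)
drifted_shift = mean_shift_in_std_units(training_values, drifted_batch)
needs_review = drifted_shift > ALERT_THRESHOLD
```

A check like this does not fix drift, but it turns a silent decline into an explicit signal that the data model or the training set needs another look.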
End-user facilitated data logistics design can yield well-structured input data and may also suggest appropriate downstream machine learning models, but this only solves one part of the problem. There is another, potentially more engaging area for man-machine collaboration: iterative data labeling.
As already noted, most learning models require both structured input data and corresponding output data. A well-engineered data pipeline can help address the input side of the equation, but even assuming an effort to kickstart with an initial set of labeled data (labeled by beta users, for example), some source of ongoing output data (i.e., labels) is required for supervised learning models. Even if one is using an unsupervised model (which doesn’t require data labels), user feedback is still essential to gauge model accuracy.
This need not be an intractable problem for the data science team or an overwhelming burden for the project. In fact, this situation contains within it the seed of a better design pattern for machine learning: engaging users in the interactive feedback of model performance. Users will rarely give explicit feedback, but a rich amount of implicit feedback is available via instrumented interfaces. The nature of this feedback is entirely application- and model-dependent, and is another way to encode domain knowledge while engaging and delighting users. Online recommender systems offer some guidance here on user actions such as selection, linger time, compilation (for sets), removal, etc. Models can be constructed or modified to incorporate such feedback to represent category membership, binary indicators, and many other points for tuning in the model. There is a rich set of literature and industry know-how on handling such noisy feedback mechanisms, including user bias and improving interface design.
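As a rough sketch of how instrumented user actions might become training signal, the example below turns selection, linger time, compilation, and removal events into weak binary labels. The event names, the weights, the linger threshold, and the labeling margin are all hypothetical; in practice they would be tuned per application and corrected for user bias.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Event:
    item_id: str
    action: str           # "select", "linger", "add_to_set", or "remove"
    seconds: float = 0.0  # only meaningful for "linger" events


# Assumed weights: how strongly each action suggests the item was relevant.
ACTION_WEIGHTS = {"select": 1.0, "add_to_set": 2.0, "remove": -2.0}
LINGER_THRESHOLD_S = 10.0  # assumed: shorter views carry no signal
LINGER_WEIGHT = 0.5


def score_events(events: List[Event]) -> Dict[str, float]:
    """Accumulate a relevance score per item from implicit feedback."""
    scores: Dict[str, float] = {}
    for event in events:
        if event.action == "linger":
            delta = LINGER_WEIGHT if event.seconds >= LINGER_THRESHOLD_S else 0.0
        else:
            delta = ACTION_WEIGHTS.get(event.action, 0.0)
        scores[event.item_id] = scores.get(event.item_id, 0.0) + delta
    return scores


def to_weak_labels(scores: Dict[str, float], margin: float = 1.0) -> Dict[str, int]:
    """Convert scores to binary labels, leaving ambiguous items unlabeled."""
    labels: Dict[str, int] = {}
    for item_id, score in scores.items():
        if score >= margin:
            labels[item_id] = 1
        elif score <= -margin:
            labels[item_id] = 0
    return labels


events = [
    Event("doc1", "select"),
    Event("doc1", "linger", seconds=30.0),
    Event("doc2", "select"),
    Event("doc2", "remove"),
    Event("doc3", "linger", seconds=3.0),
]
scores = score_events(events)
weak_labels = to_weak_labels(scores)
```

Here doc1 is labeled relevant, doc2 is labeled not relevant, and doc3 stays unlabeled because a three-second view says too little. Leaving ambiguous items out is a deliberate choice: noisy labels are easier to manage when the pipeline abstains rather than guesses.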
Figure: Typical non-integrated or add-on algorithmic workflow
Figure: Integrated and collaborative algorithmic workflow
The ongoing and rapid surge in machine learning and AI performance and implementation options, although relatively narrowly focused on a few specific use cases and data types, can be extended to the vast amount of accumulating enterprise data and, frankly, desperate data consumers (aren’t we all?). The tedious tasks of data grooming and correlation that are inherently domain and task focused should be fully committed to during design and development, rather than performed as an ad-hoc post-query search function. This bottom-up approach enables human-machine interfaces that can engage users interactively and collaboratively with the algorithm so that working together, as a team, they can achieve their data-enabled goals.