Half of the work actually has to do with data collection and preparation: data cleansing, parsing, atomization, and so on. Many people quickly sign up for a freemium ML training service and, after a small success with the vendor’s sample data, feel confident that they will hit the ground running within weeks. But the reality is much more daunting. Take Google’s cloud ML service or Amazon Comprehend, for example. Bringing your own data seems like a brilliant idea until you realize that the integration is much more complicated than you first imagined. Here is the problem. What would the data preparation process look like? How would the training data feed into your ML engine? Do you expect business users to prepare the training data for you in Excel or Word documents? Probably not, so now you will need a UI.
“The success of NLP projects is not just about ML, nor just about training.”
Oh, sweet UI. Now more problems surface. How do you track your business users’ training progress and effort? How do you keep them engaged? How do you help them divide and conquer the training tasks? You are probably not going to have just one label.
Then, from a technology standpoint: Which data framework should you use? How would continuous integration work? How would people interact with the data? What about labeling? How would labels be stored and used? Do you need versioning? How do you design the training feedback loops? And last, who would do all of this?
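To make the questions about label storage, versioning, and progress tracking more concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the design of any particular project: the names `LabelEvent` and `LabelStore`, the sample document IDs, and the labels are all hypothetical. The idea is an append-only log in which a correction is stored as a new entry rather than an overwrite, so the history of every label stays recoverable.

```python
from collections import Counter
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class LabelEvent:
    """One labeling decision (hypothetical schema); events are never edited."""
    text_id: str
    label: str
    annotator: str
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


class LabelStore:
    """Append-only log, so earlier versions of a label remain available."""

    def __init__(self):
        self._events = []

    def add(self, text_id, label, annotator):
        event = LabelEvent(text_id, label, annotator)
        self._events.append(event)
        return event

    def history(self, text_id):
        """All decisions for one text, oldest first (its version history)."""
        return [e for e in self._events if e.text_id == text_id]

    def current(self):
        """Latest label per text, i.e. what a training run would consume."""
        latest = {}
        for e in self._events:  # later events overwrite earlier ones
            latest[e.text_id] = e.label
        return latest

    def progress(self):
        """Number of labeling decisions made by each annotator."""
        return Counter(e.annotator for e in self._events)


store = LabelStore()
store.add("doc-1", "complaint", "alice")
store.add("doc-2", "inquiry", "bob")
store.add("doc-1", "refund_request", "bob")  # a correction becomes version 2
```

Here `store.current()` answers "how would labels be used", `store.history("doc-1")` answers "do you need versioning", and `store.progress()` is one simple way to track who has labeled how much. A real system would also need persistence, access control, and a UI on top.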
See, all of a sudden, the straightforward project seems much more daunting now. But that is not all.
Unlike image labeling, domain-specific NLP projects require much more.
The problem with most ML projects is that people feel the most critical part is “ML.” Hence, they hire plenty of ML engineers, only to realize that weeks have gone by and not much has happened. People discount the effort required from everyone else for success. Every ML engineer knows that training depends on having a large enough dataset of useful labeled data. Yet, somehow, IT managers and business managers each feel that the others are responsible for data labeling.
You need to align the expectations
The misaligned expectations now lead to the core problem for success: how do you convince people to label your data when the outcome will likely replace them? After all, the subtext of an ML project is automating the inefficient parts of the organization.
For NLP projects to succeed, you need to build an education component to create the “aha” moments for your business users.
For NLP projects to succeed, the incentives need to be well aligned. Telling business teams that the goal is not to replace them but to make them better is easy; getting them to believe it is very hard. You will have to design the system so that they can see it for themselves. That is what we did on one of our projects: we built an education component into the process so that business users got better while they trained the bot. If you wish to learn how we did it, feel free to reach out at firstname.lastname@example.org.