
AI technologies may have several decades of history, but one piece has long been missing: access to large amounts of data. Things have changed. For the first time, thanks to digital capabilities, we have not only all types of data but also real-time, online, always-available access to a wide array of data streams.
Teaching a computer to understand information, such as what objects an image contains and how things should be categorized, requires large volumes of data. The convergence of data and artificial intelligence is putting us on a promising path toward maximizing the benefits of AI.
Data to Artificial Intelligence – “Garbage in, garbage out”
Industrial development is a result of international competition; industry-leading companies are often the main force driving an industry to maturity.
“Poor data quality is enemy number one to the widespread, profitable use of machine learning,” says Thomas C. Redman — aka “The Data Doc” — one of the original pioneers of data quality management. We all know that any application of AI and ML will only be as good as the quality of its data; what you get out depends heavily on what you put in.
What has flawed data cost us?
In the race to develop AI technologies, the quality of data collection and annotation directly determines training results and iteration speed. Face recognition has improved dramatically in recent years, but no matter how good the machine is, if it keeps being fed the wrong data, it will not produce good outputs. Here are two examples:
The “mistaken recognition of Dong Mingzhu” incident: Dong Mingzhu, president of China’s biggest air-conditioning maker, had her image flashed up on a public display screen in the city of Ningbo, with a caption saying she had illegally crossed the street on a red light. It turned out the camera had mistaken a bus-side advertisement for her actual face.
A few months before Dong Mingzhu’s false accusation, Amazon’s face recognition tool “Rekognition” incorrectly matched 28 members of Congress with mugshots, identifying lawmakers as lawbreakers. It raised the discussion of how unreliable face recognition is at the current stage and to what degree we should let such technologies intervene in our systems. If innovative technology means putting innocent people behind bars, we can do much better without it.
Workflow of Awakening Vector – a data service provider that stands out for its data quality
The current data annotation workflow relies too heavily on manual labor. Based on two years of experience as co-founder of Awakening Vector, I see the value of this industry in efficient coordination among all team members, which breaks down into agile data distribution, collection, and quality assurance on a platform that balances quality, efficiency, and data security.
We have developed an online SaaS platform, Labelhub, which offers a flexible workflow with automatic data distribution, annotator performance tracking, a centralized database, and templates for nearly all common use cases.
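To make the idea concrete, here is a minimal, purely illustrative sketch of how automatic task distribution and annotator performance tracking could work in such a pipeline. The class and function names are hypothetical and are not Labelhub’s actual API; they only demonstrate the general pattern of assigning batches, sampling each annotator’s work for review, and tracking reviewer agreement.

```python
import random
from collections import defaultdict

# Hypothetical sketch of an annotation-distribution and QA loop.
# None of these names come from Labelhub; they only illustrate the idea.

class AnnotationDispatcher:
    def __init__(self, annotators, review_rate=0.1):
        self.annotators = annotators          # list of annotator IDs
        self.review_rate = review_rate        # fraction of items double-checked
        self.accuracy = defaultdict(lambda: [0, 0])  # id -> [agreed, reviewed]

    def distribute(self, items, batch_size=50):
        """Split incoming items into batches and assign them round-robin."""
        assignments = defaultdict(list)
        for i in range(0, len(items), batch_size):
            annotator = self.annotators[(i // batch_size) % len(self.annotators)]
            assignments[annotator].extend(items[i:i + batch_size])
        return assignments

    def sample_for_review(self, labeled_items):
        """Pick a random subset of an annotator's work for QA review."""
        k = max(1, int(len(labeled_items) * self.review_rate))
        return random.sample(labeled_items, k)

    def record_review(self, annotator, agreed):
        """Track reviewer agreement to estimate per-annotator accuracy."""
        correct, reviewed = self.accuracy[annotator]
        self.accuracy[annotator] = [correct + int(agreed), reviewed + 1]

    def accuracy_report(self):
        return {a: (c / r if r else None) for a, (c, r) in self.accuracy.items()}


# Example usage with fake data
dispatcher = AnnotationDispatcher(["alice", "bob"], review_rate=0.2)
batches = dispatcher.distribute([f"image_{i}.jpg" for i in range(200)])
for annotator, items in batches.items():
    for item in dispatcher.sample_for_review(items):
        dispatcher.record_review(annotator, agreed=random.random() > 0.1)
print(dispatcher.accuracy_report())
```

The point of the sketch is the coordination pattern itself: every batch has a known owner, a fixed share of each annotator’s output is independently reviewed, and the resulting agreement rates feed back into who gets assigned what next.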
Small and medium-sized AI companies with limited budgets may be stuck with whatever data they can get, but top players are more likely to turn to high-quality data service providers.
All in all
Instead of diving straight into refining ML algorithms, it is best to first take steps to mitigate the risks posed by flawed or incomplete data. Essentially, the better you understand your data, the more likely your model is to produce successful results.
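As a small illustration of what “understanding your data” can mean in practice, the sketch below runs a few basic sanity checks on a labeled dataset before any training starts. The column names and the rare-class threshold are assumptions made for this example, not part of any particular pipeline.

```python
import pandas as pd

# Illustrative pre-training data audit; column names ("image_path", "label")
# and the 1% rare-class threshold are assumptions for this example.

def audit_labels(df, label_col="label", id_col="image_path"):
    report = {}
    report["rows"] = len(df)
    report["missing_labels"] = int(df[label_col].isna().sum())
    report["duplicate_ids"] = int(df[id_col].duplicated().sum())
    # Class balance: flag classes with fewer than 1% of all examples.
    counts = df[label_col].value_counts(normalize=True)
    report["rare_classes"] = counts[counts < 0.01].index.tolist()
    return report

if __name__ == "__main__":
    data = pd.DataFrame({
        "image_path": ["a.jpg", "b.jpg", "b.jpg", "c.jpg"],
        "label": ["cat", "dog", "dog", None],
    })
    print(audit_labels(data))
```

Catching missing labels, duplicated items, and badly imbalanced classes before training is a cheap first step toward the kind of data understanding this article argues for.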