Author: Chao Jiang | February 27, 2020
Data labeling is one of the most critical steps in building an AI model. You may already have an overall idea of how to manage a data annotation project. Now it’s time to decide whether to label your data in-house or outsource it to a third-party service provider. Every one of my clients asks me this question, and my answer is always “It depends.”
Each option has its advantages and disadvantages. A sound decision takes into account both a clear understanding of the project, including its ML requirements and the company's resources, and experience in labeling data for AI models.
The best way to share my knowledge with you, the AI engineer, is probably to answer the questions my clients ask most often, since you may be looking for the same answers.
First things first, what is data labeling or data annotation?
Data labeling (also called data annotation) is the process in which human annotators manually tag data such as text, video, images, and audio using software tools. The manually labeled data set is then packaged and fed to a machine learning algorithm to train an AI model.
There are two ways to get data labeling done. One is to do it in-house, which requires the company to build a labeling tool (usually a piece of software) or to customize and deploy a suitable open source product. The other is to outsource the work to a professional company such as Awakening Vector.
How do you build a data labeling tool in-house?
This might sound expensive and time-consuming, especially when your product roadmap is already set up with a deadline. The good news is that you do not need to drag half of your software engineering team into overtime to build labeling software from scratch. Many handy tools already exist, and all you need to do is customize one of them to suit your specific data requirements. Commonly used labeling tools include Labelme, LabelImg, LabelHub, VGG Image Annotator (VIA), CVAT, and Labelbox. These tools differ slightly in their label sets, formats, and compatibility for data export. Make sure you choose the right labeling tool based on the requirements of your project.
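Export compatibility is often where the customization work ends up. As a minimal sketch, assuming a tool that writes Labelme-style JSON (a `shapes` list whose entries carry `label`, `shape_type`, and two corner `points` for rectangles) and a training pipeline that expects COCO-style `[x, y, width, height]` boxes, a small converter might look like the following. The function name and the sample record are hypothetical, and the field names should be checked against the actual output of whichever tool you choose.

```python
def labelme_rectangles_to_coco_boxes(labelme_doc):
    """Convert rectangle shapes from a Labelme-style dict into
    COCO-style boxes of the form [x, y, width, height].

    Assumption: each rectangle stores two opposite corners in "points".
    Non-rectangle shapes (polygons, lines, ...) are skipped.
    """
    boxes = []
    for shape in labelme_doc.get("shapes", []):
        if shape.get("shape_type") != "rectangle":
            continue
        (x1, y1), (x2, y2) = shape["points"]
        # The corners may be stored in any order, so normalize them.
        left, top = min(x1, x2), min(y1, y2)
        width, height = abs(x2 - x1), abs(y2 - y1)
        boxes.append({"label": shape["label"],
                      "bbox": [left, top, width, height]})
    return boxes


# Hypothetical sample record, for illustration only.
sample = {
    "imagePath": "street_001.jpg",
    "shapes": [
        {"label": "car", "shape_type": "rectangle",
         "points": [[120.0, 80.0], [40.0, 30.0]]},
        {"label": "lane", "shape_type": "polygon",
         "points": [[0.0, 0.0], [10.0, 0.0], [10.0, 10.0]]},
    ],
}

converted = labelme_rectangles_to_coco_boxes(sample)
print(converted)
```

Only the rectangle is converted here; the polygon is ignored. A real project would also need to decide how to handle the other shape types and how to map label names to category IDs.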
What are the advantages and disadvantages of in-house labeling vs outsourced labeling?
The main advantage of building an in-house labeling tool is flexibility. If your project has highly specific labeling requirements and none of the labeling tools or platforms mentioned above suits your needs, you have no choice but to build your own labeling tool. This is especially true for start-ups, where the labeling requirements might change frequently as the product evolves.
In-house data labeling has disadvantages as well. The main challenge is capacity and resources. If you have a limited number of software engineers on your team, building an annotation tool might delay your product roadmap drastically. Additionally, what two developers build in-house in three weeks is probably not as sophisticated as a labeling tool built by 20 developers and iterated on over two to three years. Third-party data annotation tools are usually more sophisticated and come with experienced annotators ready to work for you.
What kind of projects are suitable for outsourced data labeling?
A project is well suited to outsourcing when you are clear about the rules and standards for your training data. Such projects are typically characterized by large data volume and relatively simple scenes, and the engineers know exactly how the data should be handled. For this type of project, it is generally recommended to use a third-party data annotation service for speed and cost efficiency.
When is the best time to engage third-party labeling services?
You might start with a muddy picture of what you need for data labeling, and your needs will change as your AI model evolves. In the beginning, your labeling project might involve a lot of subjective judgment and complex scenes. At this point, it is better to keep the annotation in-house to stay agile. As your labeling requirements become clearer and your data volume increases, you should consider adding outsourced services to increase the capacity of your labeling operation. Awakening Vector has helped many clients make this transition through a fully integrated team.
In summary, neither in-house nor outsourced data labeling is absolutely better or worse for an AI model. It is a judgment call that might need to be revisited as your project evolves. When choosing a data labeling method for a product, the first step is to analyze the nature of your project and the resources at your disposal.
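One possible way to put a number on how "clear" the labeling rules have become is to have two in-house annotators label the same small batch and measure how often they agree beyond chance. The sketch below computes Cohen's kappa for that purpose. This is an illustration rather than a method the article prescribes; the function name and sample labels are hypothetical, and any cutoff for "clear enough to outsource" is a project-specific assumption.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items.

    1.0 means perfect agreement; 0.0 means agreement no better than chance.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty batch")
    n = len(labels_a)
    # Fraction of items on which the two annotators gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label for every item.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two annotators on the same four images.
annotator_1 = ["car", "car", "truck", "truck"]
annotator_2 = ["car", "car", "truck", "car"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(kappa)  # 0.5
```

A low score on a pilot batch suggests the guidelines still leave room for subjective judgment, which is the situation where keeping the work in-house tends to pay off.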
If you have specific questions regarding your labeling project, please feel free to reach out to us at: service@awkvect.com
About the Author:
Chao Jiang
Product Manager at Awakening Vector
A serial entrepreneur, Chao Jiang joined Awakening Vector in May 2018. Drawing on a deep understanding of data labeling, he has developed numerous customized solutions for artificial intelligence companies and has helped many of them improve data efficiency and reduce costs in the data labeling process.