Image annotation is the process of selecting objects in images and labeling them with classes, attributes, and tags to build training data for machine learning models. Preparing image data in this fashion is the backbone of computer vision in AI. For example, to build a computer vision model that recognizes roof types in satellite images, one needs to annotate tens of thousands to millions of images of roofs across different cities, weather conditions, etc.
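For concreteness, a single annotation is typically stored as a structured record pairing an image with a label and geometry. The sketch below shows a hypothetical, COCO-style record; the field names are illustrative, and real schemas vary by tool:

```python
# Hypothetical, COCO-style annotation record (field names are illustrative;
# real schemas vary by tool and project).
annotation = {
    "image_id": 42,                        # which satellite tile this labels
    "category": "gable_roof",              # class label, e.g. a roof type
    "bbox": [120, 64, 200, 150],           # [x, y, width, height] in pixels
    "attributes": {"material": "shingle"}, # free-form attributes/tags
}
```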
Beyond aerial imagery, annotated data is used extensively in autonomous driving, security and surveillance, medical imaging, robotics, retail automation, AR/VR, and more. The growth of image data and computer vision applications demands ever-larger volumes of training data; data preparation and engineering tasks represent over 80% of the time spent on AI and machine learning projects. As a result, many data annotation services and tools have emerged over the last few years to serve this market.
Many aerial imaging companies are tackling some of the hardest problems in the world in areas such as deforestation, agriculture, home insurance, construction, and security. In most of these applications, the objects in satellite or drone footage are far from rectangular. Rather than rectangular localization or object counting with bounding boxes, these companies often need tools that identify the exact pixels an object occupies in the aerial image data.
While there is a significant need for pixel precision in aerial images, the most common data labeling technique remains the bounding box: it is relatively straightforward, and many object detection algorithms (YOLO, Faster R-CNN, etc.) were developed with it in mind. However, box annotations are not only ill-suited to aerial imagery tasks; they also prevent models from reaching superhuman detection accuracy regardless of how much training data is used, mainly because of the background noise the box includes around the object. Instance segmentation algorithms trained on the same backbone neural network score 3-5% higher (mAP) than the same models trained only on bounding boxes.
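To make the noise argument concrete, the short NumPy sketch below (with illustrative numbers, unrelated to the mAP figures above) measures how much background even a tight bounding box encloses around a round object:

```python
import numpy as np

# How much background does a tight bounding box include around a round object?
h = w = 100
yy, xx = np.ogrid[:h, :w]
mask = (yy - 50) ** 2 + (xx - 50) ** 2 <= 40 ** 2  # a disk-shaped object

ys, xs = np.nonzero(mask)
box_area = (ys.max() - ys.min() + 1) * (xs.max() - xs.min() + 1)
noise = 1 - mask.sum() / box_area
print(f"background inside the tight box: {noise:.0%}")  # roughly 1 - pi/4, ~20-25%
```

For elongated or diagonal objects such as roads and rivers, the background fraction inside an axis-aligned box is far higher still.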
Pixel precision in aerial imagery
Pixel accuracy can provide tremendous advantages for aerial imagery computer vision applications. Yet the most common tools for such annotations rely heavily on slow, point-by-point selection, where the annotator has to trace the edges of each object. This is not only extremely time-consuming and costly but also highly error-prone. For comparison, pixel-accurate annotation tasks usually take up to 10x longer than drawing simple bounding boxes. As a result, many companies are stuck using bounding boxes, while others struggle to gather large amounts of pixel-accurate annotations.
AI segmentation-based approaches
Given the significant human effort required to annotate images, the research community has worked extensively on more efficient pixel-accurate annotation methods. Some approaches use classical segmentation algorithms (e.g., SLIC superpixels, GrabCut) for pixelwise annotation. However, these methods segment based on pixel colors and often perform poorly in real-life scenarios such as aerial imagery, so they are not commonly used for such annotation tasks.
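For reference, here is what the color-based family looks like in practice: a minimal SLIC superpixel example with scikit-image (the image and parameters are illustrative stand-ins for an aerial tile):

```python
from skimage import data, segmentation

# SLIC groups pixels purely by color and position similarity, which is why
# it struggles when object boundaries are not marked by clear color changes.
image = data.astronaut()  # any RGB image; stand-in for an aerial tile
labels = segmentation.slic(image, n_segments=200, compactness=10)

print(labels.shape, labels.max() + 1)  # per-pixel segment ids, ~200 segments
```

GrabCut works similarly from a user-drawn box, iteratively separating foreground from background based on color statistics.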
Over the last four years, NVIDIA has done extensive research with the University of Toronto on pixel-accurate annotation. Their work concentrates on generating pixel-accurate polygons from a given bounding box and includes the papers Polygon-RNN, Polygon-RNN++, Curve-GCN, and Deformable Grid, published at CVPR 2017, 2018, and 2019 and at ECCV 2020, respectively. In the best case, generating a polygon with these tools requires at least two precise clicks (to draw the bounding box), after which the annotator must hope the proposed polygon captures the target object accurately. In practice, the proposed polygons are often inaccurate, and fixing them can take much longer than expected.
Another problem with such polygon-based approaches is the difficulty of selecting donut-like objects (regions with holes, topologically speaking), which require at least two polygons to describe: one for the outer boundary and one for each hole.
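To see why, consider a binary mask of a donut-shaped object: extracting its boundary yields two contours, so no single polygon can represent it. A minimal OpenCV sketch:

```python
import numpy as np
import cv2

# A donut-shaped object: one region whose boundary needs TWO polygons
# (the outer ring and the inner hole).
mask = np.zeros((100, 100), np.uint8)
cv2.circle(mask, (50, 50), 40, 255, -1)  # filled outer disk
cv2.circle(mask, (50, 50), 15, 0, -1)    # punch out the hole

contours, _ = cv2.findContours(mask, cv2.RETR_CCOMP, cv2.CHAIN_APPROX_SIMPLE)
print(len(contours))  # 2: the outer boundary and the hole boundary
```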
A novel approach to pixelwise annotation
We realized that the easiest and fastest way to do pixelwise annotation would be to select objects with just one click, using a method that also handles the scenarios the pixelwise annotation approaches described above miss. This led us to develop our Smart Segmentation technology, which takes a novel approach to edge detection, allowing the user to select objects with one click while overcoming the limitations of the algorithms above.
Our experiments showed that with Smart Segmentation, pixelwise annotation can be accelerated by 10x without compromising annotation quality.
We also analyzed the advantages of our solution compared to other AI- and segmentation-based approaches:
- The segmentation is precomputed offline, allowing images of up to 10 megapixels to be annotated in real time.
- Unlike SLIC superpixels, our segmentation accurately captures non-homogeneous regions, allowing users to select both large and small objects with just one click.
- The number of segments can be changed instantly, which makes it possible to select even the smallest objects.
- A self-learning feature improves segmentation accuracy even further: after a few hundred annotations, accuracy improves dramatically, accelerating the annotation process still more.
- Unlike the box-to-polygon techniques discussed above, the software selects donut-shaped objects with just one click.
- As the amount of annotated data grows, the software enables automatic pixel-accurate annotation, further accelerating image labeling.
Even compared to basic bounding box annotation, which requires at least two precise clicks per object, we need only one approximate click anywhere within the segment, at times making the process faster than drawing a bounding box.
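As a rough sketch of this interaction model (not SuperAnnotate's actual implementation), assume a per-pixel segment map has already been computed; a single approximate click then resolves to a full object mask:

```python
import numpy as np

def select_object(labels, click_yx):
    """Hypothetical one-click selection: given a precomputed per-pixel
    segment map, return the mask of the segment under the click."""
    y, x = click_yx
    return labels == labels[y, x]

# Toy 4x4 label map with four segments; a click anywhere inside a
# segment selects all of its pixels.
labels = np.array([[0, 0, 1, 1],
                   [0, 0, 1, 1],
                   [2, 2, 3, 3],
                   [2, 2, 3, 3]])
print(select_object(labels, (0, 2)).astype(int))  # mask of segment 1
```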
With Smart Segmentation, we are able to bring the speed of pixelwise annotation up to that of bounding boxes, finally allowing computer vision teams to build models that reach superhuman detection accuracy not achievable with bounding boxes. Furthermore, since pixel precision removes the noise inherent in bounding boxes, far less data is needed to reach similar levels of accuracy.
Concluding remarks
We're only beginning to scratch the surface of computer vision applications and the problems our industry can solve. As these problems become more complex and accuracy requirements stricter, training data quality will have to improve to meet these performance demands. Moving from bounding boxes to pixel-accurate annotations, and finding scalable ways to do so, is key to achieving such high-quality data. With today's tremendous computational power and the advance of new algorithms, pixel-accurate annotations are becoming the new norm, giving rise to more precise and sophisticated vision and analytics.
SuperAnnotate overview
SuperAnnotate is an annotation platform that enables computer vision teams to rapidly complete even the most complex pixel-accurate annotation projects. Our platform leverages ML and workflow-based features to help companies increase annotation speeds by up to 10x while dramatically improving training data quality and the efficiency of managing annotation projects. We also offer integrated services on the platform, giving our customers access to thousands of professionally managed outsourced annotators armed with our lightning-fast tooling. With SuperAnnotate, companies can build the fastest and most scalable computer vision data pipelines.
SuperAnnotate's platform is particularly effective in aerial imagery, where pixel-accurate and polygon annotations are heavily used. Our tools for pixel-accurate annotations, including our best-in-class automated edge detection feature, provide rapid acceleration of annotation times while delivering higher-quality annotations.
We are venture-backed by leading VCs such as Point Nine Capital, Runa Capital, Fathom Capital, Berkeley SkyDeck, Plug and Play Ventures, and SmartGateVC.