AI in image recognition: practical application of technologies with examples – Part 2

Máté Kasó

2023-10-16

In this article, we will look at practical applications of AI, focusing mainly on state-of-the-art image recognition neural networks. In a follow-up article, we will also cover large language models (LLMs).

Image recognition and text analysis are just two examples of how AI can be used in practice. The following cases illustrate how these technologies work in real life!

Artificial intelligence-based image recognition and processing

This broad area includes the classification of image content into known, predefined classes, the detection of specific objects in an image, and pixel-level segmentation. Such networks can also describe the content of an image in natural-language text, or even perform universal, prompt-based free-text object searches! And we can not only analyse images: with today's very popular image generators, we can also create or modify images based on human imagination!

There are methods for 3D reconstruction based on multiple viewing angles, or even inferred from a single (monocular) image. By combining images from several cameras at once, we can turn them into a cheap 3D scanner! One of the best solutions is NeRF (Neural Radiance Fields), but distance can also be calculated well from the images of stereoscopic cameras (placed side by side, like our eyes).

Image recognition is extremely versatile: we can build smart cameras that don't just look, they see! They can not only recognise objects but also evaluate them, e.g. whether a product is intact or damaged, what kind of car is in the picture, or how many people (and who) appear in it…

One of the biggest advantages of neural image processing is its modest computational requirements: with the right expertise, it can run locally and offline on almost any computer or even a mobile phone, and it can be trained to recognise new categories with moderate computational power!

Below we present neural networks that are open source, freely available and offer state-of-the-art performance:

Object detection and segmentation with You Only Look Once (YOLO) v8
Image recognition with YOLO (You Only Look Once) neural networks:

This network is one of the most accurate and fastest image recognition solutions available today, and it can also segment (outline) objects! By default it can segment 80 common object types and classify the presence of 1,000 object types in an image. The latter is classification: the network does not return where the object is in the image, only the percentage probability that it is visible somewhere.

On a computer accelerated with a powerful GPU, it runs in a few milliseconds, so it can even be used on live video! Even on a Raspberry Pi, it finishes within seconds.
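To illustrate, here is a minimal inference sketch using the ultralytics Python package; the file name factory_line.jpg is a placeholder:

```python
# Minimal YOLOv8 sketch with the ultralytics package (pip install ultralytics).
from ultralytics import YOLO

# Pretrained segmentation model: detects and outlines 80 COCO object classes.
model = YOLO("yolov8n-seg.pt")

results = model("factory_line.jpg")  # run inference on one image (placeholder path)

for r in results:
    for box in r.boxes:
        cls_name = model.names[int(box.cls)]       # class label
        conf = float(box.conf)                     # detection confidence
        print(f"{cls_name}: {conf:.2f} at {box.xyxy.tolist()}")
```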

It can also be fine-tuned to recognise new categories in a few hours (although this requires a sufficiently powerful GPU)! In this case you need a few hundred images per category, but fewer can suffice if you augment them well: by combining different image distortion, colour and contrast adjustment operations, you can create orders of magnitude more variations of the images, which prevents overfitting, so the network doesn't just "memorise" the training images.

During training, we should preferably feed the network images that reflect real use cases and clearly show at least part of the object to be recognised. The more varied the backgrounds and settings in which the object appears, the better, as this helps the network correctly locate the relevant details (i.e. the object) in each image.

As always, the result of training, i.e. the accuracy, must be continuously checked on examples the network has never seen before!
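A fine-tuning sketch, again with the ultralytics package; my_dataset.yaml stands for a hypothetical dataset description file listing your image folders and class names:

```python
# Fine-tuning sketch with ultralytics; "my_dataset.yaml" is a hypothetical
# dataset config (image paths + class names) that you must provide.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # start from pretrained weights

# Train on the custom categories; augmentations (flips, HSV shifts, mosaics)
# are applied by default, which helps prevent overfitting.
model.train(data="my_dataset.yaml", epochs=100, imgsz=640)

# Always validate on images the network has never seen.
metrics = model.val()
print(metrics.box.map)  # mean average precision on the validation set
```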

GroundingDINO: "zero-shot" object detection without fine-tuning
Image recognition without fine-tuning, using the GroundingDINO project:
Picture: search prompt: person, boat, wheel
Picture: search prompt: sail, sailboat, bottom part, water
Picture: search prompt: long hair

This is the result of some very interesting research, in which the categories to be recognised were not taught to the neural network in the usual way (until now, one category corresponded to one output neuron with a given index); instead, a detailed free-text description was given for each training image. The neural network learned by itself the relationships between the words or phrases and the objects, positions and events visible in the image. With this method there is no need to pre-select the objects to be recognised: the network figures out by itself during training which tokens are associated with which sets of pixels.

This allows both the recognition of a huge number of categories and a better understanding of the relationships between the objects in an image. To use it, all you have to do is type in free text what you want to find, and the network will highlight it for you! It is not as accurate as its traditional counterparts, but it allows much more extensive, out-of-the-box operation! It doesn't necessarily require long and slow fine-tuning (which would be more cumbersome here) or reference training images: with some prompt engineering, you can instantly recognise new objects.
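A sketch following the inference helpers published in the GroundingDINO repository; the config and weight paths follow the project's README and may differ in your setup, and the image name is a placeholder:

```python
# Zero-shot detection sketch using GroundingDINO's own inference helpers.
from groundingdino.util.inference import load_model, load_image, predict, annotate
import cv2

model = load_model(
    "groundingdino/config/GroundingDINO_SwinT_OGC.py",   # repo-shipped config
    "weights/groundingdino_swint_ogc.pth",               # downloaded weights
)
image_source, image = load_image("sailboat.jpg")  # placeholder image

# Free-text prompt; categories are separated by " . "
boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption="sail . sailboat . bottom part . water",
    box_threshold=0.35,
    text_threshold=0.25,
)

annotated = annotate(image_source=image_source, boxes=boxes,
                     logits=logits, phrases=phrases)
cv2.imwrite("annotated.jpg", annotated)
```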

3D model reconstruction

If we have pictures of an object or scene from multiple angles, we can reconstruct it in 3D! The easiest way to do this is to use two stereoscopically placed cameras. This technique works the same way as our vision: we calculate the distance of each pixel from two side-by-side images taken by two cameras. The resulting image is called a depth map. Several solutions exist for this: the current leader is Nvidia's StereoDNN technology, which also relies on neural networks, but a traditional, much faster algorithmic solution using OpenCV (or even a custom one) is also an option:

Picture: left and right camera image, below the original and filtered depth map calculated from them.
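For reference, a classic algorithmic depth map can be computed with OpenCV's semi-global block matching; this sketch assumes already rectified left/right images, with placeholder file names:

```python
# Classic stereo depth sketch with OpenCV (pip install opencv-python).
import cv2

left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)    # placeholder inputs,
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)  # assumed rectified

# numDisparities must be divisible by 16; blockSize must be odd.
stereo = cv2.StereoSGBM_create(minDisparity=0, numDisparities=128, blockSize=5)
disparity = stereo.compute(left, right)  # raw fixed-point disparity map

# Normalise for viewing; metric depth also needs the calibrated
# camera baseline and focal length.
disp_view = cv2.normalize(disparity, None, 0, 255, cv2.NORM_MINMAX).astype("uint8")
cv2.imwrite("depth_map.png", disp_view)
```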

If we have not just side-by-side images but several images taken from significantly different angles, the whole scene can be reconstructed, although this requires different methods. Until recently, this was only possible with photogrammetry, which required hours of computation for good quality, with no guarantee of a usable result. This has now changed with Neural Radiance Field (NeRF) reconstruction!

Not only can a scene be reconstructed in 3D from multiple angles, depth can now even be inferred from a single photo! A neural network such as MiDaS can "imagine" 3D depth based on learned object knowledge and visual cues, just as a human would when looking at a picture:
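A minimal monocular depth sketch using the MiDaS models published via torch.hub; photo.jpg is a placeholder:

```python
# Monocular depth estimation sketch with MiDaS via torch.hub.
import torch
import cv2

model = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")  # smallest variant
model.eval()
transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
transform = transforms.small_transform  # matching preprocessing

img = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)  # placeholder

with torch.no_grad():
    prediction = model(transform(img))
    # Resize the prediction back to the input resolution.
    depth = torch.nn.functional.interpolate(
        prediction.unsqueeze(1), size=img.shape[:2],
        mode="bicubic", align_corners=False,
    ).squeeze()

print(depth.shape)  # per-pixel relative depth values
```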

In the future, AI-driven 3D reconstruction will be an integral part of the expanding VR and AR technologies, both in converting old footage and in better recognising the environment!

Body pose estimation

Along these lines, we can locate not only whole objects but also their parts! Perhaps the most obvious and most difficult example is localising the dynamically changing body parts of people, but there are now AI-based solutions for this too, such as OpenPose:

These networks can determine facial features, body pose and even finger positions in a matter of seconds! They have significant application potential, especially for video recordings or in combination with 3D reconstruction! They can be used for gesture control, behaviour analysis, automatic surveillance, person identification and tracking…
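OpenPose itself is a C++ project with separate Python bindings; as a lighter illustration of the same idea, here is a sketch using Google's MediaPipe Pose library instead (a different model, named plainly as a substitute; person.jpg is a placeholder):

```python
# Pose estimation illustration with MediaPipe (pip install mediapipe).
# Note: this is not OpenPose, just a lightweight demo of the same concept.
import cv2
import mediapipe as mp

mp_pose = mp.solutions.pose

image = cv2.cvtColor(cv2.imread("person.jpg"), cv2.COLOR_BGR2RGB)  # placeholder

with mp_pose.Pose(static_image_mode=True) as pose:
    results = pose.process(image)

if results.pose_landmarks:
    # Normalised (0..1) x/y coordinates of each body keypoint.
    for i, lm in enumerate(results.pose_landmarks.landmark):
        print(i, round(lm.x, 3), round(lm.y, 3))
```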

Universal object segmentation

Meta's latest, recently open-sourced and locally fast-running Segment Anything Model (SAM) neural network has been trained to recognise objects in a generalised way. While it has no knowledge of what they are, it has a good understanding of the logically distinct details in a picture! This is of limited use on its own, but it is great as a filtering step in almost any image processing pipeline, increasing speed and accuracy! It is especially useful when combined with a 3D depth map for better localisation of contiguous objects, and also for fast annotation of custom datasets!
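A sketch using Meta's segment-anything Python package; the checkpoint file must be downloaded separately (the "vit_b" variant is the smallest), and scene.jpg is a placeholder:

```python
# Universal segmentation sketch with Meta's segment-anything package.
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

# Checkpoint downloaded from the SAM repository releases.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

image = cv2.cvtColor(cv2.imread("scene.jpg"), cv2.COLOR_BGR2RGB)  # placeholder
masks = mask_generator.generate(image)  # one mask per logically distinct region

print(len(masks), "segments found")
print(masks[0].keys())  # each mask dict has "segmentation", "area", "bbox", ...
```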

Image generation and modification with Stable Diffusion

The latest generative AIs can produce almost any image, in any style (photo, painting, graphics, 2D, 3D…) from a natural-language text prompt. This has an unprecedented number of applications in almost every image-related industry! Several paid services are built on this (DALL·E, Midjourney…), but the top option at the moment is the open source, free and locally runnable Stable Diffusion.

Not only does it rival all the others in quality, it far surpasses them in controllability! Images can be generated not only from textual descriptions: used in tandem with another neural network called ControlNet, the image generation process can be precisely steered based on reference images, 3D depth maps, edges, drawings, colours, body poses or semantic labels (labelled regions)!
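A sketch of depth-guided generation with Hugging Face's diffusers library; the model identifiers are public checkpoints, a CUDA GPU is assumed, and depth.png (e.g. produced by MiDaS above) is a placeholder:

```python
# Depth-guided Stable Diffusion sketch with diffusers (pip install diffusers).
import torch
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from PIL import Image

# Public checkpoints; fp16 + CUDA assumed for reasonable speed.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

depth_map = Image.open("depth.png")  # placeholder conditioning image
image = pipe("a photorealistic house at sunset", image=depth_map).images[0]
image.save("house.png")
```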

Among the countless uses, I would like to highlight one: it can turn an existing low-quality 3D rendered image into a photo-quality image in minutes! You can even add completely new elements with simple drawings and prompts! The example below shows a 3D render of a house converted into a realistic, photo-like version with Stable Diffusion:
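A render-to-photo sketch using the img2img pipeline from diffusers; render.png stands for the low-quality 3D render, and the prompt and strength values are illustrative assumptions:

```python
# Image-to-image sketch: re-render a 3D image as a photo-like picture.
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

init = Image.open("render.png").convert("RGB").resize((768, 512))  # placeholder

# strength controls how far the result may drift from the input render.
result = pipe("a photo of a modern house, realistic lighting",
              image=init, strength=0.6).images[0]
result.save("photo_like.png")
```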

It is so versatile and simple that it can even make old video-game graphics look real! As an example, a screenshot from GTA Vice City was transformed, to varying degrees, on a local computer with custom settings, quasi re-rendered neurally:

Picture: image generation and modification running locally with Stable Diffusion

While generative AI is still in its infancy, it is already a technology that touches the foundations of the entire entertainment and arts industry, and in the right hands it can produce remarkable results in minutes!

Summary:

In this article we explored practical applications of AI, focusing on image recognition and processing. We presented recent developments in neural networks, such as YOLO v8 and GroundingDINO, and their capabilities in object detection and segmentation. We covered the possibilities of 3D reconstruction, where depth information can now be extracted even from a single image. The article also highlighted the extent to which AI can transform image processing and the role it will play in future VR and AR technologies.

If you would like to integrate artificial intelligence into the operation of your business, company, startup or product, contact us: our experts will help you all the way from the initial consultation, through consulting and mapping the possibilities, to implementation.

Join our newsletter!

If you liked our article, subscribe to our newsletter!