Whether it’s hands with too many fingers or palms that are far too long, AI just doesn’t seem to get it right.
This year was a milestone for AI: from winning an award at an art exhibition (Kuta, 2022) to passing highly difficult exams, it has proved to be a technology that can change the future of human lives. However, AI image generators still cannot render a human hand accurately. Although this has been a concern since the success of Dall-E 2 and its competitors, the issue gained attention when a Twitter user posted a collection of photographs that he made using the AI generator Midjourney.
Midjourney is getting crazy powerful—none of these are real photos, and none of the people in them exist. pic.twitter.com/XXV6RUrrAv— Miles (@mileszim) January 13, 2023
Miles Zimmerman, a 31-year-old programmer from San Francisco, was using Midjourney, an AI image generator. Curious to see how accurately an image could be generated, he used ChatGPT to come up with a very detailed description: “A candid photo of some happy 20-something year-olds in 2018 dressed up for a night out, enjoying themselves mid-dance at a house party in some apartment in the city, photographed by Nan Goldin, taken with a Fujifilm Instax Mini 9, flash, candid, natural, spontaneous, youthful, lively, carefree, --ar 3:2.” Within seconds, Midjourney came up with images that match the description closely. In the post above, Miles himself remarks on how eerily accurate they are.
At a glance, these images look very good, with a close-to-reality portrayal of faces, skin and hair. Although they look a little plastic, the AI does a good job of showing how such a scene would look. But when Miles took a closer look at the generated image, the woman smiling with an instant camera in her hand had an abnormal number of fingers: nine, to be precise. The other girl in the image has the correct number of fingers, but they are extremely long. Almost all of the people have abnormal teeth too. Algorithms are the major players behind this, and they work on an input-throughput-output model. The throughput stage is responsible for processing the text it receives and converting it into images (Just & Latzer, 2017). Depending on what the system is fed, it goes through heaps of data, which Kate Crawford describes as computationally intense datasets, to find the image that matches a caption (Crawford, 2021). Generating hands is something AI systems can’t get their heads around, simply because hands are so complex to draw.
Hand impressions on cave walls are the first known form of art produced by Homo sapiens (Vergano, 2014), and hands are among the most challenging things to depict in a drawing or painting. Hands have a crucial role in the world of art. Even so, artworks from ancient Greece and mediaeval Europe still show human hands as flat and lacking in detail. Given our history with these depictions, it makes sense that something newly created, like AI, would struggle with the same problems we faced as a human race.
So why is such a simple task causing such a big issue?
“These are 2D image generators that have absolutely no concept of the three-dimensional geometry of something like a hand,” says Prof Peter Bentley, a computer scientist and author based at University College London (Hughes, 2023). These programmes function because they have been trained to identify connections between the countless images taken from the web and the text descriptions that go along with them. Eventually, the programme “understands” that, for example, the word “dog” relates to the photo of a canine. These collections of pictures and their descriptions are called “datasets” (Dixit, 2023).
As we can see in the above image generated by Dall-E 2, generating hands wouldn’t pose a big problem if all you wanted was a very general picture of a hand. But as soon as you give the models context, the problem arises. An AI system will find it difficult to reproduce something accurately if it cannot comprehend the 3D nature of a hand or the context of a scene. For example, a hand holding a knife or a man throwing a ball at a game will instantly confuse the system, because it has no 3D representation of the hand or of the structure of the object being held. “It’s generally understood that within AI datasets, human images display hands less visibly than they do faces,” a Stability AI spokesperson told BuzzFeed News (Dixit, 2023). “Hands also tend to be much smaller in the source images, as they are relatively rarely visible in large form.” Judging by the output of the various image generators, it is safe to say that not just Dall-E 2 but almost every image-generating AI model struggles with creating hands.
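To put the spokesperson’s point about size into rough numbers, here is a small sketch that compares how much of a picture a face and a hand might cover. The image size and both box sizes are made-up values chosen purely for illustration; they are not measurements from any real dataset.

```python
def area_fraction(box_width, box_height, image_width, image_height):
    """Share of the image's pixels covered by a rectangular region."""
    return (box_width * box_height) / (image_width * image_height)


# Hypothetical numbers: a 512x512 training image in which the face
# fills a 160x200 pixel box and a hand fills a 40x50 pixel box.
IMAGE_SIZE = (512, 512)
face = area_fraction(160, 200, *IMAGE_SIZE)
hand = area_fraction(40, 50, *IMAGE_SIZE)

print(f"face covers {face:.1%} of the image")
print(f"hand covers {hand:.1%} of the image")
print(f"the face has {face / hand:.0f} times as many pixels as the hand")
```

With these assumed sizes the face gets 16 times as many pixels as the hand, so a model learning from such a picture simply sees far less detail about fingers than about eyes or mouths.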
What we are seeing is an early stage in the development of these algorithms. AI is just an algorithm with a few extra steps, and we more or less have to agree with Kate Crawford when she says, ‘AI is neither Artificial nor Intelligent’. At this stage the algorithm is just replicating what it has seen rather than understanding the meaning or the complex structure behind it. Given that it is an early stage, we can cut these models some slack. It is an inevitable journey that these models have to make in order to reach a stage where they can really blow our minds, just as computers started out as systems for simple tasks and have come to a point where they can run an AI.
How can we overcome this brick wall for AI?
These models are therefore capable, fantastic even, but they are still far from producing flawless pictures. What would have to happen in order to fix this issue and eventually produce a hand that doesn’t resemble something out of an alien movie?
“This could all change in the future. These networks are slowly being trained on 3D geometries so they can understand the shape behind images. This will give us a more coherent image, even with complicated prompts,” says Professor Bentley. “Getting enough 3D design data could take time. At the moment, we’re getting the easy results in the form of these 2D images. It is easy to trawl the internet and get a million images without the context,” he added.
While researching the topic further, I stumbled upon a paper published by OpenAI in December 2022 that explores the idea of generating 3D models of objects from mere text prompts. The system, named Point-E, relies on GPU computation to make 3D figures of the objects that users want to view.
A high-level overview of our pipeline. First, a text prompt is fed into a GLIDE model to produce a synthetic rendered view. Next, a point cloud diffusion stack conditions on this image to produce a 3D RGB point cloud (Nichol, 2022). This shows how the algorithm goes about creating the 3D point clouds: it uses a text-to-image model called GLIDE to create a reference image for the rest of the system to draw from.
As we can see from these images, Point-E doesn’t actually create a complete, continuous structure the way a conventional 3D model does. Instead, it produces a point cloud (hence the name) (Nichol, 2022). This is only a scattering of 3D points in space that roughly resembles the object. On its own that would not look like much, which is why the model has a second stage that converts these points into meshes. This results in something that actually has shapes and edges.
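To make the idea of a coloured point cloud more concrete, here is a minimal sketch of the data structure itself. It is not Point-E’s code: the function names, the choice of a sphere as the object, the colouring rule and the point count are all assumptions made for illustration. The only thing it is meant to show is that a point cloud is a list of positions with colours and nothing connecting them.

```python
import math
import random


def sample_sphere_point_cloud(num_points, radius=1.0, seed=0):
    """Scatter coloured points on a sphere: a list of (x, y, z, r, g, b)."""
    rng = random.Random(seed)
    cloud = []
    for _ in range(num_points):
        # Pick a uniformly random direction on the sphere surface.
        z = rng.uniform(-1.0, 1.0)
        angle = rng.uniform(0.0, 2.0 * math.pi)
        ring = math.sqrt(1.0 - z * z)
        x, y = ring * math.cos(angle), ring * math.sin(angle)
        # Colour by height: red near the top, blue near the bottom.
        red = int(255 * (z + 1.0) / 2.0)
        cloud.append((radius * x, radius * y, radius * z, red, 0, 255 - red))
    return cloud


def bounding_box(cloud):
    """Smallest axis-aligned box containing every point's position."""
    xs, ys, zs = zip(*[(p[0], p[1], p[2]) for p in cloud])
    return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))


cloud = sample_sphere_point_cloud(1024)  # hypothetical point count
lower, upper = bounding_box(cloud)
print(len(cloud), "points, with no edges or faces between them")
print("bounding box:", lower, upper)
```

Nothing in this list says which points are neighbours, which is why a separate step is needed to turn the points into a mesh with surfaces and edges.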
Point-E comes with its own limitations that need to be improved
When a new system is created, there are always factors to consider before implementing it on a large scale. As OpenAI says in the research paper, the system tends to miss some 3D points while generating an object, resulting in blocky, disarranged shapes.
The team had to train the model in order for it to work. As with Dall-E 2, the text-to-image portion of the algorithm was trained on written cues, that is, photographs with alt-text that help the model decipher the contents of an image. The image-to-3D model then had to be trained in a similar way: it was given a collection of photos paired with 3D models so that Point-E could learn the connections between the two. This training was repeated millions of times using a vast array of datasets. In its initial tests, Point-E was able to produce coloured rough approximations of the requests, but these representations were still far from correct (Nichol, 2022).
These models show how difficult it is to bring such technology to life. But are these technologies a boon for humans? So far we can only see the tip of the iceberg of what these systems are capable of. As this technology is still in its infancy, it will probably take some time before we see Point-E produce realistic 3D representations, and much longer before the general public can engage with it in the way that Dall-E 2 or ChatGPT allows.
Future of AI and Humans
The purpose of the majority of contemporary artificial intelligence systems quickly comes into question. There are growing concerns that platforms like ChatGPT and Dall-E 2 may replace creatives and artists. The same issues will probably surface with Point-E. There is a sizeable market for 3D design, and while Point-E can’t yet match a 3D artist’s work, it may be able to do so in the future. Whether we see Point-E on its own or as a technology that aids Dall-E 2, AI image generation clearly has a long way to go, but a long way doesn’t necessarily mean a long time. With the rise of these technologies, I feel the futures of AI and humans have to intertwine to make the most of them. Humans may end up losing jobs, but the evolution of these new technologies will create a new job sector of people maintaining and improving the things we can create with them. Yet, given the rumours that OpenAI spends well into the millions of dollars each month to maintain these projects, it is probable that the cost of using and operating this sort of software will be high, especially for something as complex as 3D rendering.
During the next ten years, applications of artificial intelligence are anticipated to have a significant influence on our society and economy. We are currently in the early stages of what many reliable experts consider to be the most promising period for technological innovation and value creation in the near future.
Chayka, K. (2023, March 9). The uncanny failures of A.I.-generated hands. The New Yorker. https://www.newyorker.com/culture/rabbit-holes/the-uncanny-failures-of-ai-generated-hands
Crawford, K. (2021). The Atlas of AI: Power, Politics, and the Planetary Costs of Artificial Intelligence. Yale University Press, pp. 1-21.
Dixit, P. (2023, January 31). AI image generators keep messing up hands. Here’s why. BuzzFeed News. https://www.buzzfeednews.com/article/pranavdixit/ai-generated-art-hands-fingers-messed-up
Hughes, A. (2023, February 4). Why AI-generated hands are the stuff of nightmares, explained by a scientist. BBC Science Focus Magazine. https://www.sciencefocus.com/future-technology/why-ai-generated-hands-are-the-stuff-of-nightmares-explained-by-a-scientist/
Just, N., & Latzer, M. (2017). Governance by algorithms: Reality construction by algorithmic selection on the Internet. Media, Culture & Society, 39(2), 238-258.
Kuta, S. (2022, September 6). Art made with artificial intelligence wins at state fair. Smithsonian Magazine. https://www.smithsonianmag.com/smart-news/artificial-intelligence-art-wins-colorado-state-fair-180980703/
Nichol, A. (2022, December 16). Point·E: A System for Generating 3D Point Clouds from Complex Prompts. Arxiv.Org. https://arxiv.org/pdf/2212.08751.pdf
Stability AI. (n.d.). Stability AI. Retrieved April 9, 2023, from https://stability.ai
Vergano, D. (2014, October 8). Cave paintings in Indonesia redraw picture of earliest art. National Geographic. https://www.nationalgeographic.com/science/article/141008-cave-art-sulawesi-hand-science