ChatGPT now also understands images and voice commands

OpenAI is constantly improving its ChatGPT chatbot. The new version lets users interact with ChatGPT using voice and images, and with it come new questions and concerns. So what does the new version bring, and when?

Most of the changes that OpenAI makes to ChatGPT relate to what the AI-powered bot can do: what questions it can answer, what information it can access, and so on. This time, however, the company is also changing the way you use ChatGPT. It is introducing a new version of the service that lets you interact with the bot not only by typing sentences into a text field, but also by talking to it or simply uploading a picture. The new features will be available to Plus subscribers in the coming weeks, while others will receive the new functionality "soon after".

The voice command part is not shockingly new: you tap a button and speak your question, ChatGPT converts it to text and passes it to a large language model, retrieves the answer, converts it back to speech, and answers you by voice. This should resemble a conversation with Alexa or the Google Assistant, except that, so OpenAI hopes, the answers will be better because of the improved underlying technology. Most virtual assistants appear to be in the process of being rebuilt around large language models, and OpenAI is one step ahead for now.
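The voice flow described above can be sketched as three chained stages. This is only an illustration of the general pattern, not OpenAI's implementation: the class, the function names, and the stand-in stages are all hypothetical, and no real speech or language model is called.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoiceAssistant:
    """Chains three stages: speech-to-text, language model, text-to-speech.

    All three stages are hypothetical placeholders supplied by the caller.
    """
    transcribe: Callable[[bytes], str]   # spoken question (audio) -> text
    generate: Callable[[str], str]       # question text -> answer text
    synthesize: Callable[[str], bytes]   # answer text -> spoken answer (audio)

    def answer(self, audio_in: bytes) -> bytes:
        question = self.transcribe(audio_in)   # step 1: speech to text
        reply = self.generate(question)        # step 2: ask the language model
        return self.synthesize(reply)          # step 3: text back to speech


# Stand-in stages so the sketch runs without any real models:
# "audio" is just UTF-8 text in bytes form here.
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_generate(question: str) -> str:
    return f"You asked: {question}"


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")


assistant = VoiceAssistant(fake_transcribe, fake_generate, fake_synthesize)
spoken_reply = assistant.answer(b"What is the weather?")
```

In a real assistant each stand-in would be replaced by an actual model or service; the point of the sketch is only that the language model itself sees and produces text, with the speech conversion happening on either side of it.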

OpenAI's excellent Whisper model does much of the speech-to-text conversion, and the company is also introducing a new text-to-speech model that is said to be able to create "human-like audio from just text and a few seconds of sample speech." You'll be able to choose a voice for ChatGPT from five options, but OpenAI seems to think the model has much more potential. For example, OpenAI is working with Spotify to translate podcasts into other languages while preserving the sound of the host's voice. There are many interesting uses for synthetic voices, and OpenAI could be a big part of that industry.

Still, the fact that you can create a decent synthetic voice from just a few seconds of audio opens the door to all sorts of potentially problematic use cases. In the blog post announcing the new features, the company acknowledges that these capabilities present new threats, such as malicious actors impersonating public figures. For this very reason, the model is not available for wider use and will be much more tightly controlled and limited to specific use cases and partnerships.

The image search feature is somewhat similar to Google Lens. You take a photo, and ChatGPT tries to figure out what you're asking and responds accordingly. You can also use the drawing tool in the app to make the question as clear as possible, or speak or type questions related to the picture. This is where the conversational nature of ChatGPT comes in particularly handy: instead of running a search, getting the wrong answer, and then running a new search, you can prompt the bot again and refine the answer as you go. This is very similar to what Google is doing with multimodal search.

Obviously, the inclusion of images in ChatGPT also has its drawbacks. One of them concerns what happens when you ask ChatGPT about a person: OpenAI says it has deliberately limited "ChatGPT's ability to analyze and make direct statements about people", both for accuracy and for privacy. This means that one of the most sci-fi visions of artificial intelligence, the ability to look at someone and tell who they are, isn't going to come true any time soon. Which is probably a good thing.

Almost a year after ChatGPT's launch, it seems that OpenAI is still trying to figure out how to give its model more functions and capabilities without creating new problems and downsides. With its new releases, the company has tried to walk that fine line by consciously limiting what its new models can do. But this approach will not always work. As more and more people use voice control and image search, and as ChatGPT gets closer to becoming a truly multimodal, useful virtual assistant, it will become increasingly difficult to maintain all these safeguards.