We're launching three audio models in the API

AI • OpenAI • May 7, 2026 • 4:05


INTRO

OpenAI has introduced real-time audio models capable of live multilingual translation and voice-driven task execution, with reasoning and integration with external systems.

KEY POINTS

Launch of real-time audio models

OpenAI unveiled new real-time audio capabilities in its API, highlighting two systems: GPT Realtime Translate and GPT Realtime 2. The models process speech as it is spoken, enabling both live translation and interactive voice assistance. The release signals an intent to make voice a primary interface for digital systems.

Live translation across 70 languages

GPT Realtime Translate can translate speech in real time across 70 languages. It begins translating mid-sentence once it has heard key linguistic elements such as the verb, which makes the output feel close to natural conversation. The system stays fluent even when speakers abruptly switch languages or use technical terms.
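
The idea of emitting output before the sentence ends can be illustrated with a toy chunker. This is only a sketch of the buffering behavior described above, not a description of how the model works internally: the function name chunk_on_verb and the tiny verb list are assumptions made up for the example, and no translation is performed.

```python
from typing import Iterable, Iterator

# Toy verb list: an assumption for illustration only.
TOY_VERBS = {"présente", "traduit", "écoute"}


def chunk_on_verb(
    tokens: Iterable[str], verbs: frozenset = frozenset(TOY_VERBS)
) -> Iterator[list]:
    """Yield the buffered tokens as soon as a verb arrives, then flush any tail."""
    buffer = []
    for token in tokens:
        buffer.append(token)
        if token.lower() in verbs:
            # A key word has arrived: release the chunk without waiting
            # for the end of the sentence.
            yield buffer
            buffer = []
    if buffer:
        # Whatever follows the last verb is released at end of input.
        yield buffer
```

Calling list(chunk_on_verb("je présente le modèle".split())) returns [["je", "présente"], ["le", "modèle"]]: the first chunk is available to a downstream translator two tokens before the sentence is complete.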

Natural conversational flow

Unlike conventional tools that wait for pauses, the model produces continuous output, creating the effect of a live dialogue. It preserves sentence structure and tone, which eases communication in multilingual settings such as international presentations, customer support, or education. It can also switch languages dynamically without interruption.

Voice agents with reasoning and actions

GPT Realtime 2 brings reasoning capabilities to voice assistants, allowing them to interpret requests, access data, and carry out tasks. In the demo, it retrieved calendar details, identified the meeting participant, and replied conversationally, all while keeping track of context and timing.

Integration with external systems

The model can connect to external tools such as calendars, CRM systems, dashboards, and connected devices. In one example, it updated a CRM entry with a meeting brief and next steps after retrieving the relevant context. This integration lets voice agents act directly inside existing workflows.
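
One common way to expose external systems to an agent is a registry that maps tool names to functions. The sketch below assumes a tool call arrives as a dictionary with "name" and "arguments" keys; that shape, the update_crm function, and the in-memory CRM dictionary are all hypothetical stand-ins, not the OpenAI API.

```python
from typing import Callable, Dict

# Registry of callable tools, keyed by the name the agent would use.
TOOLS: Dict[str, Callable[..., str]] = {}

# In-memory stand-in for a real CRM (assumption for the example).
CRM: Dict[str, dict] = {}


def tool(name: str):
    """Decorator that registers a function under a tool name."""
    def register(fn: Callable[..., str]) -> Callable[..., str]:
        TOOLS[name] = fn
        return fn
    return register


@tool("update_crm")
def update_crm(account: str, brief: str, next_steps: str) -> str:
    # Store the meeting brief and next steps on the account record.
    CRM[account] = {"brief": brief, "next_steps": next_steps}
    return f"updated {account}"


def dispatch(call: dict) -> str:
    """Route a tool call to the registered function, if any."""
    fn = TOOLS.get(call["name"])
    if fn is None:
        return f"unknown tool: {call['name']}"
    return fn(**call["arguments"])
```

Keeping the registry separate from the dispatch step means a new system (a dashboard, a device) can be added by registering one more function, without touching the routing code.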

Parallel tool use and user feedback

GPT Realtime 2 supports parallel tool calls, so several background actions can run at the same time. While these processes run, the system keeps the user informed of its progress through short updates, or "preambles", which keeps things transparent when tasks take several seconds.
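
The pattern of speaking a preamble and then running several tools at once can be sketched with Python's standard asyncio library. The two tool functions, the say callback, and the wording of the closing message are hypothetical; only asyncio.gather and asyncio.run are real library calls.

```python
import asyncio


async def fetch_calendar() -> str:
    # Hypothetical tool: the sleep stands in for a real calendar lookup.
    await asyncio.sleep(0.01)
    return "calendar: meeting in 12 minutes"


async def fetch_crm_notes() -> str:
    # Hypothetical tool: the sleep stands in for a real CRM query.
    await asyncio.sleep(0.01)
    return "crm: security review is the blocker"


async def run_tools_with_preamble(say, tools):
    # Speak a short preamble first so the user knows work has started.
    say("Let me pull the latest context and update your CRM.")
    # Run every tool concurrently; gather returns results in input order.
    results = await asyncio.gather(*(t() for t in tools))
    say("All done.")
    return list(results)


def demo():
    spoken = []  # collects what the agent would say aloud
    results = asyncio.run(
        run_tools_with_preamble(spoken.append, [fetch_calendar, fetch_crm_notes])
    )
    return spoken, results
```

Because the preamble is emitted before the tools are awaited, the user hears an acknowledgement immediately even if the slowest tool takes several seconds.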

Continuous listening without interruption

A key feature is persistent listening: the assistant stays aware of the conversation's context without interrupting until it is addressed. Users can speak naturally, pause, and resume without resetting the system, which reinforces the impression of a continuous dialogue.
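
The behavior shown in the demo, where the assistant keeps listening but stays silent until it hears "back to demo", can be modeled as a small state machine. The class below is a hypothetical sketch over text utterances; the ListeningGate name, the "stay quiet" trigger, and the reply strings are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ListeningGate:
    resume_phrase: str = "back to demo"
    muted: bool = False
    context: List[str] = field(default_factory=list)

    def hear(self, utterance: str) -> Optional[str]:
        """Record every utterance; reply only when not muted."""
        # Context is kept even while muted, so nothing is lost.
        self.context.append(utterance)
        text = utterance.lower()
        if self.muted:
            if self.resume_phrase in text:
                self.muted = False
                return "I'm here when you're ready to continue."
            return None  # keep listening, say nothing
        if "stay quiet" in text:
            self.muted = True
            return None
        return f"(reply to: {utterance})"
```

The mute check runs before the resume check only when the gate is open, so a request such as "stay quiet until I say back to demo" mutes the assistant rather than immediately resuming it.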

Context retention and adaptability

The models retain conversational context over time, which lets them handle multi-step interactions and evolving instructions. This supports more complex use cases such as preparing meeting briefs, tracking tasks, or coordinating across several applications.

Implications for voice interfaces

These advances make voice a more viable primary interface for digital interaction. By combining translation, reasoning, and action execution in real time, the models reduce the friction between human communication and software execution.

CONCLUSION

These new real-time audio models illustrate a shift toward fluid, voice-driven computing that combines instant translation with intelligent task execution inside connected systems.

Full transcript

Hey everyone, we're introducing new real-time audio models in the OpenAI API. In this demo, I'll show two of them: GPT Realtime Translate for live translations and GPT Realtime 2 for voice agents that can follow instructions and take actions. Let's start with translations, cuz that one feels so magical. I speak French, but say I need to present to an audience around the world. The English you'll hear is the model's live audio output captured directly from this laptop with transcriptions. Now, as I start speaking in French, we'll lower the volume of my mic and increase the one from the model so you can have a real feel for it. No edit to the audio. Let's give it a try.

What's really impressive is that the model can listen to me and translate while I'm speaking. It waits for the key word, like the verb, starts translating right away, and the result is a much more natural conversation, just like a dialogue between two people. I can even interrupt in German and the model switches effortlessly between my German and your French, and we can even include technical terms like GPT Realtime, OpenAI, or computer use, and the model has no trouble handling that. Isn't that amazing?

The model can translate across 70 different languages in real time, really following the shape of every sentence. So whether you're building a media platform or tools for customer support or education, we believe that this can help you break down the language barriers. And this model is just one of the ways we're improving voice intelligence. So for the next demo, let's talk about GPT Realtime 2, our new model that brings intelligent reasoning to voice agents. So, let's bring up my phone and take a look at my personal voice assistant.
>> Hi there.
>> Hi again. What's up?
>> Yeah, I have, uh, a customer meeting coming up. Can you take a look at my calendar?
>> You have a meeting with Sable Crust Robotics in 12 minutes. And you're meeting with Alex Kim, their CTO.
>> Great. Thank you.
Oh, please stay quiet for a second until I say back to demo.

Ramon, don't forget, now that these models have things like reasoning and parallel tool calling, it's even more important to use things like preambles. This way, the model can explain itself and update the user.

Thank you, Jason, for the great reminder. Very important. Actions can of course take a few seconds, and so it's very important for the model to acknowledge those. With GPT Realtime 2, you can communicate directly during the reasoning and the tool calling so the user stays informed. And by the way, what makes voice agents so natural now is that they stay in the conversation. Jason and I have been chatting. Uh, the model has been listening and it's still listening now, but not interrupting us until I say back to demo.
>> I'm here when you're ready to continue the demo.
>> Pretty cool, right? So now let's highlight what Jason just mentioned with preambles. Let's ask another task and say: hey, could you, uh, now, uh, update the CRM and, um, put the meeting of today as a brief and the next steps?
>> Let me pull the latest context and update your CRM. Sable Crust launched warehouse automation this morning. Expansion is active. Security review is the blocker.
>> All right. Thank you. I'm all set. Uh, please stay quiet again as I wrap this up.

What's exciting here is that you can now connect the model to any kind of system. It could be your dashboards, the services you're using, even connected devices, and so much more. So, that was a quick preview of our new real-time audio models coming to the OpenAI API. You can now create agents that keep the conversation going as they think in the background. They can translate live across 70 languages. They can preserve context, and they can even act inside the products you're already using. Voice can truly become the primary interface now, and we can't wait to see what you build with these new models. Thanks for watching. Hey, back to demo. How was that?
>> Smooth and clear. It felt natural and demo-friendly.
