Implementing a Real-Time Screen-Sharing Assistant with Voice and Transcript Using Gemini 2.0 and Gemini 1.5 Models
In this article, we will explore how to implement a real-time screen-sharing assistant with voice and transcript using Gemini 2.0 and Gemini 1.5 models. The Gemini multi-modal live API has been used in previous videos to demonstrate its capabilities, including real-time interactions through text, voice, and camera, along with screen sharing.
Introduction to Gemini Multi-Modal Live API
The Gemini multi-modal live API is a powerful tool that enables real-time interactions through text, voice, and camera, along with screen sharing. However, one critical issue still blocks real-world applications: the API cannot provide both real-time text and audio responses in the same session.
Remaining Issue of Gemini API
The "response_modality" parameter in the library is allowed to be set as a list of "audio" plus "text," but this setting is not working as expected, with only an error throughout.
Project Architecture
The application uses two Gemini models in a streamlined process. First, the client sends visual and audio inputs to the server. The server then uses the Gemini 2.0 Flash model for real-time audio streaming generation. Finally, the server transcribes that audio output into text using the Gemini 1.5 Flash 8B model.
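The second stage can be sketched as follows, assuming the google-genai Python SDK. The model names follow the article, while the WAV wrapping, sample rate, and transcription prompt are illustrative assumptions rather than the author's exact code.

```python
# Sketch of stage two: transcribe the audio produced by the Gemini 2.0 Flash live
# session with Gemini 1.5 Flash 8B. Sample rate and prompt are assumptions.
import io
import wave
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")  # placeholder key

def pcm_to_wav(pcm: bytes, rate: int = 24000) -> bytes:
    """Wrap raw 16-bit mono PCM from the live API into a WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(rate)
        wf.writeframes(pcm)
    return buf.getvalue()

def transcribe_audio(pcm: bytes) -> str:
    """Send the generated speech to Gemini 1.5 Flash 8B and return a text transcript."""
    response = client.models.generate_content(
        model="gemini-1.5-flash-8b",
        contents=[
            "Transcribe the following speech exactly as spoken.",
            types.Part.from_bytes(data=pcm_to_wav(pcm), mime_type="audio/wav"),
        ],
    )
    return response.text
```

The split keeps the low-latency audio path on Gemini 2.0 Flash, while the cheaper Gemini 1.5 Flash 8B model handles the non-interactive transcription step.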
Code Walkthrough
The server code is responsible for handling the client's configuration message, connecting to the Gemini 2.0 multi-modal live API, and sending and receiving data. The "gemini_session_handler" function handles the WebSocket connection and the data exchange with the Gemini 2.0 multi-modal live API.
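The outline below is a hedged sketch of what such a handler can look like rather than the author's exact code. It assumes the websockets and google-genai packages; the JSON field names (setup, realtime_input, media_chunks) and the model name are illustrative.

```python
# Illustrative gemini_session_handler: bridge one browser client to the live API.
# Field names and model name are assumptions, not the repository's exact code.
import asyncio
import base64
import json
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")  # placeholder key
MODEL = "gemini-2.0-flash-exp"  # assumed live-API model name

async def gemini_session_handler(websocket):
    """Read the client's config, open a live session, and relay data both ways."""
    # 1. The first client message carries the session configuration.
    setup = json.loads(await websocket.recv()).get("setup", {})
    setup["response_modalities"] = ["AUDIO"]  # audio only; text comes from stage two

    async with client.aio.live.connect(model=MODEL, config=setup) as session:

        async def client_to_gemini():
            # 2. Forward screen frames and microphone chunks to Gemini.
            async for message in websocket:
                data = json.loads(message)
                for chunk in data.get("realtime_input", {}).get("media_chunks", []):
                    await session.send(input=chunk)

        async def gemini_to_client():
            # 3. Relay the streamed audio back to the browser.
            async for response in session.receive():
                if response.data:
                    payload = base64.b64encode(response.data).decode()
                    await websocket.send(json.dumps({"audio": payload}))

        await asyncio.gather(client_to_gemini(), gemini_to_client())
```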
Run the App
The entire screen-sharing flow was already implemented with the Gemini 2.0 multi-modal live API in the previous video. The key improvement, returning both text and audio, lives in the backend server, and the frontend code can be copied from the GitHub repository.
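As a hedged illustration, a minimal entry point for such a backend could look like the snippet below; the port number is an assumption, and the frontend page from the repository would simply open a WebSocket connection to it from the browser.

```python
# Illustrative server entry point using the websockets package.
# The port (9083) and the handler import path are assumptions for this sketch.
import asyncio
import websockets

from server import gemini_session_handler  # hypothetical module containing the handler

async def main():
    async with websockets.serve(gemini_session_handler, "localhost", 9083):
        print("WebSocket server running on ws://localhost:9083")
        await asyncio.Future()  # run until interrupted

if __name__ == "__main__":
    asyncio.run(main())
```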
Conclusion
In conclusion, this video addresses the limitation of the Gemini multi-modal live API by providing both readable transcriptions and audio feedback for the real-time screen-sharing assistant. The application uses two Gemini models in a streamlined process, and the server code handles the client's configuration message, connects to the Gemini 2.0 multi-modal live API, and exchanges data with it.
Future Development
The Gemini multi-modal live API is still in its early stages, and there are many opportunities for future development. Using two different models, Gemini 2.0 Flash and Gemini 1.5 Flash 8B, provides an efficient and cost-effective solution for real-time audio streaming generation and transcription.