Bringing Emma to Life: Connecting a Humanoid Robot to Local Language Models
What started as a simple desire to have a conversation with our school's humanoid robot turned into an exciting journey of integrating locally hosted AI models with physical robotics. Meet Emma (formerly known as Pepper) - our school's humanoid robot that we've transformed from a basic interactive display into an intelligent, vision-enabled conversational companion.
The Challenge: From Slow to Smart
Initially, we connected Emma to the ChatGPT API with a custom personality system prompt. While this worked, our school's WiFi network made response times painfully slow - not exactly the fluid conversation experience we were hoping for. That's when we decided to take matters into our own hands and go local.
We "borrowed" a workstation from the CAD lab (with permission, of course!) and set up our own local language model server. This not only solved our latency issues but also gave us complete control over Emma's responses and capabilities and it enabled us to fully run this project offline.
Meet Emma: The Hardware
Emma is a Pepper robot from SoftBank Robotics - a 4-foot-tall humanoid social robot originally introduced in 2014. With her distinctive white design, tablet-like chest screen, and expressive LED eyes, Pepper was designed specifically for human interaction. While she moves on wheels rather than legs, her articulated arms and hands make her surprisingly expressive.
Originally marketed as a home companion, robots like Emma have found their sweet spot in commercial settings - retail stores, hotels, airports, and educational institutions like ours, where she serves as both a learning tool and an impressive demonstration of human-robot interaction.
The Evolution: Adding Vision
The real game-changer came with the release of open-source Vision Language Models (VLMs). Suddenly, Emma could "see" through her cameras, making our interactions much more natural and contextually aware. She can now comment on what she observes, recognize objects and people, and respond to visual cues in addition to verbal ones.
The Technical Architecture
Our system works through a five-step process (a sketch of the server side follows the list):
- Activation: The user touches Emma's head, triggering her to capture an image from her head-mounted camera and convert it to base64
- Audio Capture: The server begins recording and transcribing the user's speech using either Google Speech Recognition API or Whisper
- AI Processing: The transcribed text and base64 image are sent as a prompt to a Vision Language Model (VLM) running through Ollama
- Response Generation: Ollama generates the model's response and sends it back to the server
- Output: Emma receives the response and speaks it aloud, potentially performing programmed actions like moving toward the user, raising her hands, or playing an imaginary saxophone
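To make the flow concrete, here is a minimal sketch of the server side of this loop. It assumes a Flask endpoint (the route name, port, and helper names are illustrative, not our exact code), the speech_recognition package for transcription, and the ollama Python client with an assumed model tag:

```python
import ollama                    # pip install ollama
import speech_recognition as sr  # pip install SpeechRecognition
from flask import Flask, request, jsonify

app = Flask(__name__)
MODEL = "qwen2.5vl:3b"  # assumed Ollama tag; use whichever model is pulled locally


def transcribe_from_microphone(use_whisper=False):
    """Record one utterance on the server and return its transcript."""
    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        recognizer.adjust_for_ambient_noise(source)
        audio = recognizer.listen(source)
    if use_whisper:
        return recognizer.recognize_whisper(audio)  # local Whisper (needs openai-whisper)
    return recognizer.recognize_google(audio)       # Google Speech Recognition API


@app.route("/ask", methods=["POST"])
def ask():
    # Step 1: the robot POSTs the base64 image captured when her head is touched.
    image_b64 = request.json["image"]

    # Step 2: record and transcribe the user's question.
    question = transcribe_from_microphone()

    # Steps 3-4: send the transcript and image to the VLM through Ollama.
    reply = ollama.chat(
        model=MODEL,
        messages=[{"role": "user", "content": question, "images": [image_b64]}],
    )

    # Step 5: return the text for the robot to speak aloud.
    return jsonify({"text": reply["message"]["content"]})


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```

With this split, the robot itself only has to send one image and speak whatever text comes back - all the heavy lifting stays on the workstation.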
The Software Stack
Robot-Side Programming
Since we're using the school's robot, we're required to use Choregraphe, the visual programming software that ships with Pepper robots. According to the official documentation:
Choregraphe allows you to create very complex behaviors (e.g. interaction with people, dance, send e-mails, etc...), without writing a single line of code. In addition, it allows you to add your own Python code to a Choregraphe behavior.
While Choregraphe is powerful for robot programming, its built-in speech recognition is quite limited, which is why we moved that functionality to our server.
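On the robot side, a Choregraphe Python box can forward the captured image to the server and speak whatever comes back. The sketch below is illustrative only - Pepper's boxes run Python 2.7, and the server address, /ask route, and JSON fields are assumptions carried over from the server sketch above:

```python
# GeneratedClass and ALProxy are provided by the Choregraphe runtime.
import json
import urllib2


class MyClass(GeneratedClass):
    def __init__(self):
        GeneratedClass.__init__(self)

    def onInput_onStart(self, image_b64):
        # Forward the captured image to the local server and wait for the reply.
        request = urllib2.Request(
            "http://<server-ip>:5000/ask",
            json.dumps({"image": image_b64}),
            {"Content-Type": "application/json"},
        )
        reply = json.loads(urllib2.urlopen(request).read())

        # Speak the model's answer through NAOqi's text-to-speech service.
        tts = ALProxy("ALTextToSpeech")
        tts.say(reply["text"].encode("utf-8"))

        self.onStopped()
```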
Server-Side Intelligence
Our server runs on a Lenovo ThinkStation workstation with:
- AMD Ryzen 5 PRO 4650G CPU
- 32GB RAM
- NVIDIA Quadro T1000 GPU with 8GB VRAM
Given our hardware constraints and the need for real-time responses, we focus on smaller, efficient models. We've experimented with models ranging from 1 to 8 billion parameters; currently we're running Qwen 2.5-VL 3B or Gemma 4B for their excellent vision capabilities and manageable parameter counts, which gives us an average response time of 4-6 seconds per question.
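When comparing candidates, a quick timing harness like the one below is enough to see which models stay within that latency budget. The Ollama tags are assumed names and may differ from the exact builds we pulled:

```python
import time

import ollama

# Candidate tags are assumed Ollama names, not necessarily our exact builds.
CANDIDATES = ["qwen2.5vl:3b", "gemma3:4b"]


def time_model(model, prompt, image_b64):
    """Return the wall-clock time of a single vision-language request."""
    start = time.perf_counter()
    ollama.chat(
        model=model,
        messages=[{"role": "user", "content": prompt, "images": [image_b64]}],
    )
    return time.perf_counter() - start


# Example: compare latencies on the same question and camera frame.
# for tag in CANDIDATES:
#     print(tag, round(time_model(tag, "What do you see?", image_b64), 1), "s")
```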
The Dual-Model Approach
We use two separate model instances:
- Conversation Model: Generates Emma's actual spoken response to the user
- Action Model: Uses a different system prompt to determine what physical actions Emma should perform
This separation allows us to fine-tune each aspect independently - conversational quality and physical behavior.
We have found this approach works better because it lets us concatenate the action and the answer into a single JSON object ourselves, which proved more reliable than depending on the model's ability to generate valid JSON.
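In code, the dual-model approach boils down to two calls with different system prompts whose outputs we assemble into one JSON payload ourselves. The prompts and action names below are illustrative placeholders rather than our production prompts; in the full pipeline this payload would take the place of the single text field in the earlier server sketch:

```python
import json

import ollama

MODEL = "qwen2.5vl:3b"  # assumed tag; both instances can share the same weights

# Illustrative system prompts - the real personality and action prompts differ.
CONVERSATION_PROMPT = (
    "You are Emma, a friendly humanoid robot. Answer in one or two short sentences."
)
ACTION_PROMPT = (
    "Pick exactly one action for the robot and output only its name: "
    "move_forward, raise_hands, play_saxophone, none."
)


def ask_emma(question, image_b64):
    def query(system_prompt):
        reply = ollama.chat(
            model=MODEL,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": question, "images": [image_b64]},
            ],
        )
        return reply["message"]["content"].strip()

    # Each instance gets its own system prompt; the two outputs are then
    # concatenated into one JSON payload that the robot can parse reliably.
    return json.dumps({
        "answer": query(CONVERSATION_PROMPT),
        "action": query(ACTION_PROMPT),
    })
```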
Code
You can find the code for this project here
This project was developed collaboratively with friends as part of our exploration into practical AI and robotics integration. Special thanks to our school for providing access to Emma and the CAD lab workstation that made this all possible.