I've been attempting to build JARVIS from Iron Man in real life as a personal project. Where a cosplay fan might dress as Tony Stark, my inner engineer decided to go a slightly...different route. It's my belief that the Marvel films have inspired a whole new generation of engineers, computer scientists and 'builders', and it's a great honour to be one of them. Anyway, let's get into today's topic.
I've wanted to document my project for a while. I figured the best way to do this would be in video form, but video leaves out a significant amount of detail, which is why I'm now launching a sort of 'blog' containing links, new methods and regularly updated content.
Each post will break down a different part of the project, but I should warn you: the project never stands still and things change fast.
To have a conversation with anyone, human or computer, they must first know you're speaking to them! The first step is knowing that someone is speaking at all, and for this we can use something called Voice Activity Detection (VAD). This technology has evolved over the last fifty years from hard-coded pattern matching to lightweight classifiers built around neural network models.
For my JARVIS I wanted to use something open source but reliable. With the mindset of "Don't re-invent the wheel" I chose WebRTC VAD, a wrapper around the voice activity detector in Google's WebRTC implementation. It's a fantastic project that supports both Python AND JavaScript.
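To give a feel for what using it looks like, here is a minimal sketch, assuming the `webrtcvad` package from PyPI and raw 16-bit mono PCM audio. The helper names `split_frames` and `make_webrtc_detector`, the 16 kHz sample rate and the aggressiveness of 2 are my own choices for illustration, not settings from the project. The library only accepts frames of 10, 20 or 30 ms at 8, 16, 32 or 48 kHz, so the audio has to be cut into fixed-size chunks first.

```python
SAMPLE_RATE = 16000    # Hz; WebRTC VAD accepts 8000, 16000, 32000 or 48000
FRAME_MS = 30          # frame length in ms; WebRTC VAD accepts 10, 20 or 30
BYTES_PER_SAMPLE = 2   # 16-bit mono PCM


def split_frames(pcm: bytes, sample_rate: int = SAMPLE_RATE, frame_ms: int = FRAME_MS):
    """Yield fixed-size frames from raw PCM, dropping any incomplete tail."""
    frame_bytes = sample_rate * frame_ms // 1000 * BYTES_PER_SAMPLE
    for start in range(0, len(pcm) - frame_bytes + 1, frame_bytes):
        yield pcm[start:start + frame_bytes]


def make_webrtc_detector(aggressiveness: int = 2, sample_rate: int = SAMPLE_RATE):
    """Return a function mapping one frame to True/False, backed by webrtcvad."""
    import webrtcvad  # third-party: pip install webrtcvad

    # Mode 0 is the most permissive, 3 filters out non-speech most aggressively.
    vad = webrtcvad.Vad(aggressiveness)
    return lambda frame: vad.is_speech(frame, sample_rate)
```

With the package installed, `detect = make_webrtc_detector()` followed by `[detect(f) for f in split_frames(pcm)]` gives one speech/no-speech flag per 30 ms frame of `pcm`.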
The idea for my version of JARVIS is to keep the front-end client as light as possible, using the Python WebRTC VAD library. Once it detects speech, the client will conditionally look for a wake word and ship the audio clip off to the server, where JARVIS will convert the words into text and respond accordingly.
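The client flow described above can be sketched roughly as follows. This is an illustration under assumptions, not the actual JARVIS client: `collect_utterances` groups per-frame speech decisions into clips using a sliding window (the same general pattern as the example script shipped with py-webrtcvad), and `has_wake_word` and `send_to_server` are hypothetical callables standing in for the wake word check and the upload.

```python
from collections import deque
from typing import Callable, Iterable, Iterator


def collect_utterances(
    frames: Iterable[bytes],
    is_speech: Callable[[bytes], bool],
    padding_frames: int = 10,
    trigger_ratio: float = 0.9,
) -> Iterator[bytes]:
    """Group frames into clips, starting and stopping on a sliding-window vote."""
    window = deque(maxlen=padding_frames)  # recent (frame, is_speech) pairs
    triggered = False
    voiced = []
    for frame in frames:
        window.append((frame, is_speech(frame)))
        if not triggered:
            # Start a clip once most of the recent window is speech.
            if sum(1 for _, s in window if s) > trigger_ratio * padding_frames:
                triggered = True
                voiced.extend(f for f, _ in window)
                window.clear()
        else:
            voiced.append(frame)
            # End the clip once most of the recent window is silence.
            if sum(1 for _, s in window if not s) > trigger_ratio * padding_frames:
                triggered = False
                yield b"".join(voiced)
                voiced = []
                window.clear()
    if voiced:  # flush a clip that was still open when the audio ended
        yield b"".join(voiced)


def handle_clip(
    clip: bytes,
    has_wake_word: Callable[[bytes], bool],   # hypothetical wake word check
    send_to_server: Callable[[bytes], None],  # hypothetical upload function
) -> bool:
    """Ship the clip to the server only if the wake word was heard."""
    if has_wake_word(clip):
        send_to_server(clip)
        return True
    return False
```

The window vote means a single noisy frame neither starts nor ends a clip, at the cost of a few frames of trailing silence in each clip. Keeping the wake word check and the upload as plain callables keeps the client thin: the heavy speech-to-text work stays on the server.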
There are likely other, more optimised implementations of this, but it is broadly the same way that Alexa, Google Home and similar assistants work.
If you liked this post or have questions you can join the conversation on Discord!