Started this project with the motivation of playing GTA V using voice commands (after watching DougDoug's YouTube video, specifically: https://www.youtube.com/watch?v=XvK4py0EE5s&t=1s ). I thought this would be a very easy task, because there would be so many APIs and working codebases that I could just take one of them and use it. That wasn't the case.

I started off with the SpeechRecognition (SR) library, but it has some major problems. Its default, non-cloud Google API is honestly bad (not the fault of the library's developers; the non-cloud Google API is itself bad). It cannot detect long sentences, and it even finds it difficult to understand short phrases (2-3 words). Also, there is no keyword feature (keywords being specific words the model should focus on). Then I tried to use the SR library's IBM and Google Cloud functions, but there were so many problems with authentication (for example, IBM doesn't use username and password for authentication anymore, but the SR library only supports that). So I thought it would be better to just use those APIs directly, without the SR library.

First I tried the IBM Watson API and, boy oh boy, even their own repo wasn't working (the API needs an IBM account, by the way). But then I found the issue: the service endpoint URL was missing in the original code, and when I added it (after wasting several hours) it worked. The API was working well, but my goal was to use it for GTA, and for that I needed keywords. Even though the API has a feature for telling the model to concentrate on given keywords, it wasn't actually doing so (at least that wasn't the result I got). So the result was not what I wanted. One thing to note here is that this is a streaming API, which means it continuously returns output (which is good).

Then I moved to the Google Cloud API. It needs an account, the first few months are free, and you can only keep the websocket open for 305 seconds at a time. This API worked beautifully and the documentation was great.
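The Watson endpoint fix described above can be sketched as follows. This is a minimal sketch, not the exact code from this repo: the API key, service URL, filename, and the keyword list are placeholders, and the audio format is assumed to be WAV.

```python
# Sketch: IBM Watson Speech to Text with an explicit service endpoint.
# The API key and URL below are placeholders -- substitute your own from
# the IBM Cloud service credentials page.

def build_recognize_kwargs(keywords, threshold=0.5):
    """Build the keyword-spotting parameters for stt.recognize().

    The threshold is the minimum confidence Watson needs before it
    reports that a keyword was spotted.
    """
    return {
        "content_type": "audio/wav",  # assumption: WAV input
        "keywords": list(keywords),
        "keywords_threshold": threshold,
    }

if __name__ == "__main__":
    from ibm_watson import SpeechToTextV1
    from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

    authenticator = IAMAuthenticator("YOUR_API_KEY")
    stt = SpeechToTextV1(authenticator=authenticator)
    # This was the missing piece: without set_service_url() the SDK
    # doesn't know which service instance to talk to and requests fail.
    stt.set_service_url(
        "https://api.us-south.speech-to-text.watson.cloud.ibm.com"
    )

    with open("command.wav", "rb") as audio:
        response = stt.recognize(
            audio=audio,
            **build_recognize_kwargs(["drive", "shoot", "run"]),
        )
        print(response.get_result())
```

The `set_service_url()` call is the part that was missing from the sample code; everything else is standard `ibm-watson` SDK usage.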
Well, except for a few minor library and environment problems. Similar to Watson, it returns results continuously, but unlike Watson it produces a refined output (correcting mistakes in the interim outputs, if any) after a few seconds of break (silence). One more advantage of the Google Cloud API was that it used the keywords almost perfectly: the keywords I provided were almost always predicted correctly, whereas the same keywords were not being predicted by Watson's API.
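For reference, keywords are passed to Google Cloud Speech-to-Text as "speech contexts" (phrase hints) on the recognition config. The sketch below shows roughly how that looks; the sample rate, encoding, and command words are assumptions, and it assumes your `GOOGLE_APPLICATION_CREDENTIALS` environment variable points at a service-account key.

```python
# Sketch: Google Cloud Speech-to-Text streaming config with phrase hints.
# Sample rate and encoding are assumptions -- match them to your mic capture.

def phrase_hints(commands):
    """Build speech-context dicts that boost the given command words."""
    return [{"phrases": list(commands)}]

if __name__ == "__main__":
    from google.cloud import speech

    client = speech.SpeechClient()
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
        # This is the keyword feature: phrases here are far more likely
        # to be recognized correctly.
        speech_contexts=[
            speech.SpeechContext(**ctx)
            for ctx in phrase_hints(["drive", "stop", "shoot"])
        ],
    )
    streaming_config = speech.StreamingRecognitionConfig(
        config=config,
        interim_results=True,  # get the continuous (unrefined) results too
    )
    # Feed mic audio chunks to client.streaming_recognize(...) with this
    # config; remember the stream has to be restarted before the
    # 305-second websocket limit.
```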
For comparison, I ran all these programs together and read a paragraph from Reddit; the results are shown in the picture named "Speech recognition APIs comparison.png". (Open it in a new tab if all three results are not showing; the picture is large, so GitHub cuts the image short, don't know why.) Feel free to discuss the results, by the way (email: mumertbutt@gmail.com ).
So, considering the results, I used the Google Cloud API for my GTA program, and the result was... beautiful. Here is the video: https://www.youtube.com/watch?v=Hb1-bNAV2Zs
Note: You might need to install some libraries first. Sorry, I couldn't keep track of every library. First and foremost among them is PyAudio (it's for mic use). Others include (there may be more; these are only the ones I remember installing): pyttsx3, speech_recognition (if you want), etc. You will also need to install the Google Cloud and IBM Watson libraries, but they are all covered on their sites, and hopefully you won't face any problems. Whenever I faced a problem, I just googled it and the first-page results always helped me, so you won't have to dig deep.
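Since PyAudio is the one library everything here depends on, here is a minimal sketch of capturing mic audio in the chunked form the streaming APIs expect. The 16 kHz rate and 100 ms chunk size are assumptions; match them to whatever you set in the API's recognition config.

```python
# Sketch: reading mic audio with PyAudio in small chunks for streaming.
# Rate and chunk duration are assumptions -- keep them consistent with
# the sample rate you declare to the speech API.

def frames_per_chunk(rate_hz, chunk_ms=100):
    """Number of audio frames per buffer for a given chunk duration."""
    return rate_hz * chunk_ms // 1000

if __name__ == "__main__":
    import pyaudio

    RATE = 16000  # assumption: 16 kHz, mono, 16-bit
    pa = pyaudio.PyAudio()
    stream = pa.open(
        format=pyaudio.paInt16,
        channels=1,
        rate=RATE,
        input=True,
        frames_per_buffer=frames_per_chunk(RATE),
    )
    try:
        while True:
            chunk = stream.read(frames_per_chunk(RATE))
            # send `chunk` to the streaming recognize request here
    except KeyboardInterrupt:
        stream.stop_stream()
        stream.close()
        pa.terminate()
```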