You can use Automatic Speech Recognition (ASR) with SpeechRecognizer to recognize
specific utterances from your user and turn them into text.
SpeechRecognizer is built into Android (requiring no additional libraries) and
works even when offline.
For SpeechRecognizer to convert your user's speech into text, the user needs
to grant your app the RECORD_AUDIO permission. To learn how to request
this permission for your app, see Request hardware permissions.
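For reference, a minimal sketch of requesting RECORD_AUDIO at runtime with the Activity Result API might look like the following; the activity name and the point at which the request is launched are placeholders for your own flow, and the linked guide covers the details.
// AndroidManifest.xml must also declare:
// <uses-permission android:name="android.permission.RECORD_AUDIO" />
import android.Manifest
import android.content.pm.PackageManager
import android.os.Bundle
import androidx.activity.ComponentActivity
import androidx.activity.result.contract.ActivityResultContracts
import androidx.core.content.ContextCompat

// "RunActivity" is a hypothetical activity name used only for illustration.
class RunActivity : ComponentActivity() {

    // Register the launcher once; the callback runs when the user responds to the prompt.
    private val requestRecordAudio =
        registerForActivityResult(ActivityResultContracts.RequestPermission()) { granted ->
            if (granted) {
                // Safe to instantiate SpeechRecognizer and start listening.
            } else {
                // Degrade gracefully, for example by disabling voice commands.
            }
        }

    override fun onCreate(savedInstanceState: Bundle?) {
        super.onCreate(savedInstanceState)
        val alreadyGranted = ContextCompat.checkSelfPermission(
            this, Manifest.permission.RECORD_AUDIO
        ) == PackageManager.PERMISSION_GRANTED
        if (!alreadyGranted) {
            requestRecordAudio.launch(Manifest.permission.RECORD_AUDIO)
        }
    }
}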
Instantiate SpeechRecognizer
Instantiate the SpeechRecognizer in your AI glasses activity's
onCreate() method so that it's available for the lifetime of the
activity:
override fun onCreate(savedInstanceState: Bundle?) {
    super.onCreate(savedInstanceState)
    // The RECORD_AUDIO permission must be granted to your app before instantiation.
    speechRecognizer = SpeechRecognizer.createOnDeviceSpeechRecognizer(this)
    speechRecognizer?.setRecognitionListener(recognitionListener)
    ...
}
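On-device recognition isn't guaranteed to be available on every device, and the recognizer holds resources that should be released when the activity goes away. A minimal sketch of both checks, assuming speechRecognizer is a nullable property of your activity, might look like this:
override fun onCreate(savedInstanceState: Bundle?) {
    super.onCreate(savedInstanceState)
    // Only create the recognizer if on-device recognition is supported (API 31+).
    if (SpeechRecognizer.isOnDeviceRecognitionAvailable(this)) {
        speechRecognizer = SpeechRecognizer.createOnDeviceSpeechRecognizer(this)
        speechRecognizer?.setRecognitionListener(recognitionListener)
    }
}

override fun onDestroy() {
    // Release the recognizer's resources when the activity is destroyed.
    speechRecognizer?.destroy()
    speechRecognizer = null
    super.onDestroy()
}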
Configure your RecognitionListener
The setRecognitionListener() method lets you specify the object that receives
important callbacks, such as RecognitionListener.onResults(),
which the system calls after it recognizes spoken language.
val recognitionListener = object : RecognitionListener {
    override fun onResults(results: Bundle?) {
        // The matches and their confidence scores share the same indices.
        val matches = results?.getStringArrayList(SpeechRecognizer.RESULTS_RECOGNITION)
        val confidences = results?.getFloatArray(SpeechRecognizer.CONFIDENCE_SCORES)
        if (matches.isNullOrEmpty() || confidences == null) return
        // Use the match the recognizer is most confident about.
        val mostConfidentIndex = confidences.indices.maxByOrNull { confidences[it] }
        if (mostConfidentIndex != null) {
            val spokenText = matches[mostConfidentIndex]
            if (spokenText.equals("Start my Run", ignoreCase = true)) {
                // User indicated they want to start a run
            }
        }
    }
    ...
}
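RecognitionListener also requires other callbacks, indicated by the ... above, including onError(). How you respond is specific to your app; one possible sketch, placed inside the same listener object, might look like this (the recovery choices here are assumptions, not part of the original example):
override fun onError(error: Int) {
    when (error) {
        SpeechRecognizer.ERROR_NO_MATCH,
        SpeechRecognizer.ERROR_SPEECH_TIMEOUT -> {
            // Nothing usable was heard; consider prompting the user to try again.
        }
        SpeechRecognizer.ERROR_INSUFFICIENT_PERMISSIONS -> {
            // RECORD_AUDIO wasn't granted; surface the permission request again.
        }
        else -> {
            // Other errors (audio, client, or service); decide whether to retry or give up.
        }
    }
}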
Key points about the code
- The bundle is queried for two arrays. The first array includes all of the matches and the second is the speech recognizer's confidence in what was heard. The indices of these arrays correspond to each other.
- The match with the highest confidence value (mostConfidentIndex) is used.
- A case-insensitive string match is performed to determine what action the user wants to take.
Alternative approaches when matching
In the preceding example, only the match with the highest confidence value is checked. This means the recognizer's top hypothesis must contain the phrase you're looking for, or no match is flagged, so you might get false negatives.
Another approach is to look through all of the matches, regardless of confidence, and accept any match that fits the input you're looking for. This approach could instead lead to more false positives. The approach you should take largely depends on your use case.
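As a rough sketch of that second approach, assuming the same results bundle and target phrase as in the earlier onResults() example, the scan over all matches could look like this:
override fun onResults(results: Bundle?) {
    val matches = results?.getStringArrayList(SpeechRecognizer.RESULTS_RECOGNITION) ?: return
    // Accept the command if any hypothesis matches, regardless of its confidence score.
    if (matches.any { it.equals("Start my Run", ignoreCase = true) }) {
        // User indicated they want to start a run
    }
}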
Start listening
To start listening to the user, pass an ACTION_RECOGNIZE_SPEECH intent to startListening().
override fun onStart() {
    super.onStart()
    val intent = Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH).apply {
        putExtra(RecognizerIntent.EXTRA_LANGUAGE_MODEL, RecognizerIntent.LANGUAGE_MODEL_FREE_FORM)
    }
    speechRecognizer?.startListening(intent)
}
Key points about the code
- When using ACTION_RECOGNIZE_SPEECH, you must also specify the EXTRA_LANGUAGE_MODEL extra. LANGUAGE_MODEL_FREE_FORM is intended for conversational speech.
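If you start listening in onStart(), it's usually worth stopping in the matching lifecycle callback so the microphone isn't held while the activity isn't visible. A minimal sketch, assuming the recognizer was started as shown above:
override fun onStop() {
    // Stop consuming audio when the activity is no longer visible.
    speechRecognizer?.stopListening()
    super.onStop()
}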