[ros_speech_recognition] Add WordInfo to SpeechRecognitionCandidates message. #320
@@ -30,6 +30,106 @@ This package uses Python package [SpeechRecognition](https://pypi.python.org/pyp

    print result # => 'Hello, world!'
    ```

If you are using `ros_speech_recognition` with `~continuous` set to `True`, you can subscribe to the `/Tablet/voice` topic (`speech_recognition_msgs/SpeechRecognitionCandidates`).

1. Launch the sample launch file.
> **Review comment:** How about adding the google cloud speech-to-text URL to README.md too?

> **Review comment:** Thank you very much for the documentation. In addition to the …
   ```bash
   roslaunch ros_speech_recognition sample_ros_speech_recognition.launch
   ```

> **Review comment:** We may need to set … I think … is more helpful.
2. Echo the message.

   ```bash
   $ rostopic echo /Tablet/voice
   transcript:
   - may I help you
   confidence: [0.9286448955535889]
   sentences:
   -
     header:
       seq: 0
       stamp:
         secs: 1641425262
         nsecs: 268165588
       frame_id: ''
     words:
     -
       start_time: 0.0
       end_time: 0.2
       word: "may"
       confidence: 0.91376436
       speaker_tag: 0
     -
       start_time: 0.2
       end_time: 0.4
       word: "I"
       confidence: 0.9366196
       speaker_tag: 0
     -
       start_time: 0.4
       end_time: 0.5
       word: "help"
       confidence: 0.9531065
       speaker_tag: 0
     -
       start_time: 0.5
       end_time: 0.8
       word: "you"
       confidence: 0.9110889
       speaker_tag: 0
   ---
   transcript:
   - pick up the red kettle
   confidence: [0.9499567747116089]
   sentences:
   -
     header:
       seq: 0
       stamp:
         secs: 1641425268
         nsecs: 58182954
       frame_id: ''
     words:
     -
       start_time: 0.0
       end_time: 0.4
       word: "pick"
       confidence: 0.953269
       speaker_tag: 0
     -
       start_time: 0.4
       end_time: 0.6
       word: "up"
       confidence: 0.95326656
       speaker_tag: 0
     -
       start_time: 0.6
       end_time: 0.8
       word: "the"
       confidence: 0.96866167
       speaker_tag: 0
     -
       start_time: 0.8
       end_time: 1.1
       word: "red"
       confidence: 0.98762906
       speaker_tag: 0
     -
       start_time: 1.1
       end_time: 1.5
       word: "kettle"
       confidence: 0.8869578
       speaker_tag: 0
   ```
The `word` field contains a recognized word, and its `confidence` estimates how likely that word was recognized correctly (a higher number means a greater estimated likelihood). `start_time` is the time offset, relative to the beginning of the audio (the header timestamp), of the start of the spoken word; `end_time` is the corresponding offset of its end.

> **Review comment:** It would be useful to put these descriptions in the message definition.
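As a rough illustration of how a subscriber might consume these fields, the sketch below converts each word's relative offsets into absolute timestamps using the header stamp. This is plain Python with the message represented as dicts mirroring the `rostopic echo` output above; the helper name is ours, not part of the package:

```python
def absolute_word_times(header_stamp_sec, words):
    """Convert per-word offsets (seconds, relative to the header stamp)
    into absolute (word, start, end) timestamps in seconds."""
    return [(w["word"],
             header_stamp_sec + w["start_time"],
             header_stamp_sec + w["end_time"])
            for w in words]

# Example data mirroring the first message above.
words = [
    {"word": "may",  "start_time": 0.0, "end_time": 0.2},
    {"word": "I",    "start_time": 0.2, "end_time": 0.4},
    {"word": "help", "start_time": 0.4, "end_time": 0.5},
    {"word": "you",  "start_time": 0.5, "end_time": 0.8},
]
print(absolute_word_times(1641425262.268165588, words))
```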
## Interface

### Publishing Topics

@@ -38,6 +138,11 @@ This package uses Python package [SpeechRecognition](https://pypi.python.org/pyp

  Action client to play sound on events. If the action server is not available or `~enable_sound_effect` is `False`, no sound is played.

* `/Tablet/voice` (`speech_recognition_msgs/SpeechRecognitionCandidates`)

  Publishes recognized results when `~continuous` is `True`.

### Subscribing Topics

* `audio` (`audio_common_msgs/AudioData`)
@@ -0,0 +1,23 @@ | ||
<launch> | ||
|
||
<arg name="google_cloud_credentials_json" default="''" doc="Credential JSON is only used when the engine is GoogleCloud." /> | ||
<arg name="engine" default="GoogleCloud" doc="Speech to text engine. TTS engine, Google, GoogleCloud, Sphinx, Wit, Bing Houndify, IBM" /> | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. In my understanding, If you think |
||
<arg name="launch_sound_play" default="false" doc="Launch sound_play node to speak" /> | ||
|
||
<!-- launch bag file --> | ||
<node name="rosbag_play" | ||
pkg="rosbag" type="play" | ||
args="$(find ros_speech_recognition)/sample/data/may_i_help_you.bag --loop"> | ||
</node> | ||
|
||
<include file="$(find ros_speech_recognition)/launch/speech_recognition.launch"> | ||
<arg name="launch_sound_play" value="$(arg launch_sound_play)" /> | ||
<arg name="engine" value="$(arg engine)" /> | ||
<arg name="launch_audio_capture" value="false" /> | ||
<arg name="audio_topic" value="/speech_audio" /> | ||
<arg name="language" value="en-US" /> | ||
<arg name="continuous" value="true" /> | ||
<arg name="google_cloud_credentials_json" value="$(arg google_cloud_credentials_json)" /> | ||
</include> | ||
|
||
</launch> |
@@ -8,13 +8,18 @@
from ros_speech_recognition.recognize_google_cloud import RecognizerEx
import json
import array
import os
import sys
from threading import Lock

import numpy as np
import std_msgs.msg
from audio_common_msgs.msg import AudioData
from sound_play.msg import SoundRequest, SoundRequestAction, SoundRequestGoal

from speech_recognition_msgs.msg import SentenceInfo
from speech_recognition_msgs.msg import SpeechRecognitionCandidates
from speech_recognition_msgs.msg import WordInfo
from speech_recognition_msgs.srv import SpeechRecognition
from speech_recognition_msgs.srv import SpeechRecognitionResponse
from std_srvs.srv import Empty
@@ -219,22 +224,32 @@ def play_sound(self, key, timeout=5.0):

    def recognize(self, audio):
        recog_func = None
        self.enable_diarization = False
        if self.engine == Config.SpeechRecognition_Google:
            if not self.args:
                self.args = {'key': rospy.get_param("~google_key", None)}
            recog_func = self.recognizer.recognize_google
        elif self.engine == Config.SpeechRecognition_GoogleCloud:
            if not self.args:
                credentials_path = rospy.get_param("~google_cloud_credentials_json", None)
                if credentials_path is not None and len(credentials_path) > 0:
                    if os.path.exists(credentials_path) is False:
                        rospy.logerr(
                            'google_cloud_credentials_json '
                            '{} not exists.'.format(credentials_path))
                        sys.exit(1)
                    with open(credentials_path) as j:
                        credentials_json = j.read()
                else:
                    credentials_json = None
                self.args = {'credentials_json': credentials_json,
                             'preferred_phrases': rospy.get_param('~google_cloud_preferred_phrases', None),
                             'show_all': True}
                if rospy.has_param('~diarizationConfig'):
                    diarizationConfig = rospy.get_param('~diarizationConfig')
                    self.args.update({'user_config': {'diarizationConfig': diarizationConfig}})
                    self.enable_diarization = diarizationConfig.get(
                        'enableSpeakerDiarization', False)
            recog_func = self.recognizer.recognize_google_cloud
        elif self.engine == Config.SpeechRecognition_Sphinx:
            recog_func = self.recognizer.recognize_sphinx
@@ -251,17 +266,72 @@ def recognize(self, audio):

        return recog_func(audio_data=audio, language=self.language, **self.args)
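The credential handling added in this hunk (treat an unset or empty path as "fall back to default credentials", fail fast when a given path does not exist) can be sketched as a standalone helper. This is an illustration only, not the node's actual code: the node logs with `rospy.logerr` and calls `sys.exit(1)` instead of raising.

```python
import os


def load_credentials(credentials_path):
    """Return the credentials JSON text, or None when no usable path is set.

    Mirrors the branch above: an unset (None) or empty path means
    "use default credentials"; a non-empty path that does not exist
    is a hard error.
    """
    if credentials_path is None or len(credentials_path) == 0:
        return None
    if os.path.exists(credentials_path) is False:
        raise IOError(
            'google_cloud_credentials_json '
            '{} not exists.'.format(credentials_path))
    with open(credentials_path) as j:
        return j.read()
```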
    def make_result_message_from_result(self, result, header=None):
        if header is None:
            header = std_msgs.msg.Header(stamp=rospy.Time.now())
        if self.engine == Config.SpeechRecognition_GoogleCloud:
            if "results" not in result or len(result["results"]) == 0:
                raise SR.UnknownValueError()
            transcript = []
            confidence = []
            sentences = []
            for res in result["results"]:
                sent_info_msg = SentenceInfo(header=header)
                if self.enable_diarization is False:
                    transcript.append(
                        res["alternatives"][0]["transcript"].strip())
                    confidence.append(res["alternatives"][0]['confidence'])
                prev_speaker = None
                trans = ''
                confs = []
                for word in res["alternatives"][0]['words']:
                    speaker = word.get('speakerTag', 0)
                    conf = word.get('confidence', 0.0)
                    # for more details, please see
                    # https://cloud.google.com/speech-to-text/docs/reference/rest/v1/speech/recognize#wordinfo
                    word_info_msg = WordInfo(
                        start_time=float(word.get(
                            'startTime', '0.0s').rstrip('s')),
                        end_time=float(word.get(
                            'endTime', '0.0s').rstrip('s')),
                        word=word.get('word', ''),
                        confidence=conf,
                        speaker_tag=speaker)
                    sent_info_msg.words.append(word_info_msg)
                    if self.enable_diarization is True \
                            and prev_speaker != speaker:
                        trans += "[{}]".format(speaker)
                        prev_speaker = speaker
                    trans += ' ' + word['word']
                    confs.append(conf)
                if self.enable_diarization is True:
                    transcript.append(trans)
                    confidence.append(np.mean(confs))
                sentences.append(sent_info_msg)
            msg = SpeechRecognitionCandidates(
                transcript=transcript,
                confidence=confidence,
                sentences=sentences)
            transcript = " ".join(transcript)
        else:
            transcript = result
            msg = SpeechRecognitionCandidates(
                transcript=[transcript])
        return msg, transcript
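Two pieces of the logic above can be isolated for clarity: Google Cloud returns word offsets as duration strings such as `'0.8s'`, and with diarization enabled the transcript is rebuilt word by word with a `[tag]` marker inserted at every speaker change. A minimal standalone sketch, assuming diarization is enabled (function names are ours, not part of the node):

```python
def parse_duration(value):
    """Parse a Google Cloud duration string such as '0.8s' into seconds."""
    return float(value.rstrip('s'))


def diarized_transcript(words):
    """Rebuild a transcript from word dicts carrying a 'speakerTag' field,
    inserting '[tag]' whenever the speaker changes, e.g. '[1] may I[2] help'."""
    trans = ''
    prev_speaker = None
    for word in words:
        speaker = word.get('speakerTag', 0)
        if prev_speaker != speaker:
            trans += '[{}]'.format(speaker)
            prev_speaker = speaker
        trans += ' ' + word['word']
    return trans
```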
    def audio_cb(self, _, audio):
        if not self.enable_audio_cb:
            return
        try:
            rospy.logdebug("Waiting for result... (Sent %d bytes)" % len(audio.get_raw_data()))
            header = std_msgs.msg.Header(stamp=rospy.Time.now())
            result = self.recognize(audio)
            self.play_sound("recognized", 0.05)
            msg, transcript = self.make_result_message_from_result(
                result, header=header)
            rospy.loginfo("Result: %s" % transcript)
            self.pub.publish(msg)
            self.play_sound("success", 0.1)
            return
        except SR.UnknownValueError as e:
            if self.dynamic_energy_threshold:
@@ -322,11 +392,14 @@ def speech_recognition_srv_cb(self, req):

        rospy.loginfo("Waiting for result... (Sent %d bytes)" % len(audio.get_raw_data()))

        try:
            header = std_msgs.msg.Header(stamp=rospy.Time.now())

> **Review comment:** I think this timestamp may be misleading (not the start time of the original audio data, but the time of speech recognition).

            result = self.recognize(audio)
            rospy.loginfo("Result: %s" % result.encode('utf-8'))
            if not req.quiet:
                self.play_sound("success", 0.1)
            msg, _ = self.make_result_message_from_result(
                result, header=header)
            res.result = msg
            return res
        except SR.UnknownValueError:
            if self.dynamic_energy_threshold:
@@ -1,4 +1,8 @@
<launch>

  <arg name="google_cloud_credentials_json" default="''" doc="Credential JSON, only used when the engine is GoogleCloud." />
  <arg name="engine" default="Google" doc="Speech-to-text engine: Google, GoogleCloud, Sphinx, Wit, Bing, Houndify, IBM" />

  <!-- launch speech recognition -->

  <arg name="launch_sound_play" default="false" doc="Launch sound_play node to speak" />

@@ -9,10 +13,12 @@
  </rosparam>
  <include file="$(find ros_speech_recognition)/launch/speech_recognition.launch">

> **Review comment:** How about integrating the content of this test to …

    <arg name="launch_sound_play" value="$(arg launch_sound_play)" />
    <arg name="engine" value="$(arg engine)" />
    <arg name="launch_audio_capture" value="false" />
    <arg name="audio_topic" value="/speech_audio" />
    <arg name="language" value="ja" />
    <arg name="continuous" value="true" />
    <arg name="google_cloud_credentials_json" value="$(arg google_cloud_credentials_json)" />
  </include>

  <!-- launch bag file -->
> **Review comment:** @iory
> You changed the `speech_recognition_msgs/SpeechRecognitionCandidates` content in jsk-ros-pkg/jsk_common_msgs#28.
> In my opinion, we should create a new message type like `speech_recognition_msgs/GoogleCloudSpeechRecognitionCandidates` in addition to the existing `speech_recognition_msgs/SpeechRecognitionCandidates`, because the new fields (e.g. `WordInfo`) seem to be specific to the google cloud speech-to-text.
> This is my preference, but how about wrapping the common message with a message for each speech-to-text service? The advantage of this method is that we would not need to change the current `speech_recognition_msgs/SpeechRecognitionCandidates`.
> For example, …
> In addition, please consider that `julius_ros` also publishes `speech_recognition_msgs/SpeechRecognitionCandidates`:
> https://github.com/jsk-ros-pkg/jsk_3rdparty/tree/master/julius_ros#gmm-version
> I'd like to hear your thoughts. @iory

> **Review comment:** This is not specific to the google cloud speech-to-text. For example, azure cognitive service has similar functions:
> https://docs.microsoft.com/ja-jp/python/api/azure-cognitiveservices-speech/azure.cognitiveservices.speech.speechconfig?view=azure-python#request-word-level-timestamps--
> I think the start time and end time of each word are general values in a speech recognition framework. However, this is a good way to avoid affecting other users, so I'll take this direction.
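The wrapping idea discussed above (leave `SpeechRecognitionCandidates` untouched and let each service-specific message embed it together with its extra fields) can be sketched with plain Python dataclasses standing in for hypothetical `.msg` definitions. All structures here are illustrative, not actual message types in this PR:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SpeechRecognitionCandidates:
    # Existing common message: unchanged fields only.
    transcript: List[str] = field(default_factory=list)
    confidence: List[float] = field(default_factory=list)


@dataclass
class WordInfo:
    # Word-level detail, as proposed in this PR.
    start_time: float = 0.0
    end_time: float = 0.0
    word: str = ''
    confidence: float = 0.0
    speaker_tag: int = 0


@dataclass
class GoogleCloudSpeechRecognitionCandidates:
    # Service-specific wrapper: embeds the common result and adds
    # the Google-Cloud-specific word-level details, so the common
    # message never has to change.
    result: SpeechRecognitionCandidates = field(
        default_factory=SpeechRecognitionCandidates)
    words: List[WordInfo] = field(default_factory=list)
```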