[ros_speech_recognition] Add WordInfo to SpeechRecognitionCandidates message. #320

Open
wants to merge 14 commits into base: master
105 changes: 105 additions & 0 deletions ros_speech_recognition/README.md
@@ -30,6 +30,106 @@ This package uses Python package [SpeechRecognition](https://pypi.python.org/pyp
print result # => 'Hello, world!'
```

If you are running `ros_speech_recognition` with `~continuous` set to `True`, you can subscribe to the `/Tablet/voice` (`speech_recognition_msgs/SpeechRecognitionCandidates`) topic.
Member commented:
@iory

You changed the content of speech_recognition_msgs/SpeechRecognitionCandidates in jsk-ros-pkg/jsk_common_msgs#28.

In my opinion, we should create a new message type such as speech_recognition_msgs/GoogleCloudSpeechRecognitionCandidates in addition to the existing speech_recognition_msgs/SpeechRecognitionCandidates.

This is because the new fields (e.g. wordinfo) seem to be specific to Google Cloud Speech-to-Text.

This is just my preference, but how about wrapping the common message in a message for each speech-to-text service?
The advantage of this approach is that we would not need to change the current speech_recognition_msgs/SpeechRecognitionCandidates.

For example,

```bash
$ rosmsg show speech_recognition_msgs/GoogleCloudSpeechRecognitionCandidates
Header header
speech_recognition_msgs/SpeechRecognitionCandidates candidates
speech_recognition_msgs/SentenceInfo[] sentences
```

In addition, please consider that julius_ros also publishes speech_recognition_msgs/SpeechRecognitionCandidates:
https://github.com/jsk-ros-pkg/jsk_3rdparty/tree/master/julius_ros#gmm-version

I'd like to hear your thoughts. @iory

Member (author) replied:

> This is because the new fields (e.g. wordinfo) seem to be specific to Google Cloud Speech-to-Text.

This is not specific to Google Cloud Speech-to-Text.
For example, Azure Cognitive Services has similar functionality:
https://docs.microsoft.com/ja-jp/python/api/azure-cognitiveservices-speech/azure.cognitiveservices.speech.speechconfig?view=azure-python#request-word-level-timestamps--

I think the start time and end time of each word are general values for a speech recognition framework.

> This is my preference, but how about wrapping the common message by each speech-to-text service message?
> The advantage of this method is we do not need to change the current speech_recognition_msgs/SpeechRecognitionCandidates.

However, this is a good way to avoid affecting other users, so I'll take this direction.


1. Launch the sample launch file.
Member commented:

How about adding the Google Cloud Speech-to-Text URL to README.md too?
It would be helpful:
https://cloud.google.com/speech-to-text/docs/reference/rest/v1/speech/recognize#wordinfo
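For context, here is a minimal sketch of the REST response shape described at that URL, with the fields the node in this PR reads (`results`, `alternatives`, `words`, `startTime`/`endTime`); the values are illustrative only, copied from the sample output later in this README.

```python
# Sketch (assumption): shape of a Google Cloud Speech-to-Text `recognize`
# response that includes WordInfo entries. Values are illustrative.
result = {
    "results": [{
        "alternatives": [{
            "transcript": "may I help you",
            "confidence": 0.9286449,
            "words": [
                # durations are strings with a trailing "s"
                {"startTime": "0s", "endTime": "0.2s",
                 "word": "may", "confidence": 0.91376436},
            ],
        }],
    }],
}

# The node takes the first (best) alternative of each result.
best = result["results"][0]["alternatives"][0]
print(best["transcript"])        # may I help you
print(best["words"][0]["word"])  # may
```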

Member commented:

Thank you very much for the documentation.

In addition to the engine:=GoogleCloud example, it would be very helpful if you could add an engine:=Google example.



```bash
roslaunch ros_speech_recognition sample_ros_speech_recognition.launch
```

Member commented:

We may need to set the `google_cloud_credentials_json:=xxx` arg? I got the following error:

```
[ERROR] [1641453200.346079]: Unexpected error: (<class 'oauth2client.client.ApplicationDefaultCredentialsError'>, ApplicationDefaultCredentialsError('The Application Default Credentials are not available. They are available if running in Google Compute Engine. Otherwise, the environment variable GOOGLE_APPLICATION_CREDENTIALS must be defined pointing to a file defining the credentials. See https://developers.google.com/accounts/docs/application-default-credentials for more information.',), <traceback object at 0x7f980f9a6960>)
```

I think

```bash
roslaunch ros_speech_recognition sample_ros_speech_recognition.launch google_cloud_credentials_json:=xxx.json
```

is more helpful.
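As the credentials error above notes, Application Default Credentials also reads the `GOOGLE_APPLICATION_CREDENTIALS` environment variable, so exporting it is an alternative to the launch argument. A sketch (the path is a placeholder):

```shell
# Alternative (sketch): point the standard Google env var at the
# service-account key file before launching, instead of passing the arg.
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json  # placeholder

# then launch as usual:
# roslaunch ros_speech_recognition sample_ros_speech_recognition.launch
```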

2. Echo the message.


```bash
$ rostopic echo /Tablet/voice
transcript:
- may I help you
confidence: [0.9286448955535889]
sentences:
-
header:
seq: 0
stamp:
secs: 1641425262
nsecs: 268165588
frame_id: ''
words:
-
start_time: 0.0
end_time: 0.2
word: "may"
confidence: 0.91376436
speaker_tag: 0
-
start_time: 0.2
end_time: 0.4
word: "I"
confidence: 0.9366196
speaker_tag: 0
-
start_time: 0.4
end_time: 0.5
word: "help"
confidence: 0.9531065
speaker_tag: 0
-
start_time: 0.5
end_time: 0.8
word: "you"
confidence: 0.9110889
speaker_tag: 0
---
transcript:
- pick up the red kettle
confidence: [0.9499567747116089]
sentences:
-
header:
seq: 0
stamp:
secs: 1641425268
nsecs: 58182954
frame_id: ''
words:
-
start_time: 0.0
end_time: 0.4
word: "pick"
confidence: 0.953269
speaker_tag: 0
-
start_time: 0.4
end_time: 0.6
word: "up"
confidence: 0.95326656
speaker_tag: 0
-
start_time: 0.6
end_time: 0.8
word: "the"
confidence: 0.96866167
speaker_tag: 0
-
start_time: 0.8
end_time: 1.1
word: "red"
confidence: 0.98762906
speaker_tag: 0
-
start_time: 1.1
end_time: 1.5
word: "kettle"
confidence: 0.8869578
speaker_tag: 0
```

`word` is the recognized word, and a higher `confidence` indicates a greater estimated likelihood that the recognized word is correct.
`start_time` is the time offset relative to the beginning of the audio (the timestamp in the header), corresponding to the start of the spoken word.
`end_time` is the time offset relative to the beginning of the audio, corresponding to the end of the spoken word.
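The timing fields above can be checked with a small standalone sketch (plain dicts stand in for the ROS `WordInfo` messages; values are copied from the `rostopic echo` output above):

```python
# Sketch: per-word durations from WordInfo-like timing fields.
# Field names match the message shown above; data is from the sample output.
words = [
    {"word": "may",  "start_time": 0.0, "end_time": 0.2, "confidence": 0.91376436},
    {"word": "I",    "start_time": 0.2, "end_time": 0.4, "confidence": 0.9366196},
    {"word": "help", "start_time": 0.4, "end_time": 0.5, "confidence": 0.9531065},
    {"word": "you",  "start_time": 0.5, "end_time": 0.8, "confidence": 0.9110889},
]

for w in words:
    # duration of each spoken word, relative offsets within the audio
    duration = w["end_time"] - w["start_time"]
    print("%-5s %.1f-%.1fs (%.1fs, conf %.2f)"
          % (w["word"], w["start_time"], w["end_time"], duration, w["confidence"]))
```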
Contributor commented:

It would be useful to put these descriptions in the message definition too.



## Interface

### Publishing Topics
@@ -38,6 +138,11 @@ This package uses Python package [SpeechRecognition](https://pypi.python.org/pyp

Action client to play sound on events. If the action server is not available or `~enable_sound_effect` is `False`, no sound is played.


* `/Tablet/voice` (`speech_recognition_msgs/SpeechRecognitionCandidates`)

Publishes recognized results when `~continuous` is `True`.

### Subscribing Topics

* `audio` (`audio_common_msgs/AudioData`)
6 changes: 6 additions & 0 deletions ros_speech_recognition/launch/speech_recognition.launch
@@ -11,6 +11,8 @@
<arg name="engine" default="Google" doc="Speech to text engine: Google, GoogleCloud, Sphinx, Wit, Bing, Houndify, IBM" />
<arg name="language" default="en-US" doc="Speech to text language. For Japanese, set ja-JP." />
<arg name="continuous" default="false" doc="If false, /speech_recognition service is published. If true, /Tablet/voice topic is published." />
<arg name="google_cloud_credentials_json" default="''" doc="Credential JSON is only used when the engine is GoogleCloud." />
<arg name="enable_speaker_diarization" default="false" doc="Enable speaker diarization option when the engine is GoogleCloud." />

<!-- sound play -->
<node name="sound_play" pkg="sound_play" type="soundplay_node.py"
@@ -35,6 +37,7 @@
<node name="speech_recognition"
pkg="ros_speech_recognition" type="speech_recognition_node.py"
respawn="true"
clear_params="true"
output="screen">
<rosparam subst_value="true">
audio_topic: $(arg audio_topic)
@@ -46,6 +49,9 @@
language: $(arg language)
continuous: $(arg continuous)
enable_sound_effect: $(arg launch_sound_play)
google_cloud_credentials_json: $(arg google_cloud_credentials_json)
diarizationConfig:
enableSpeakerDiarization: $(arg enable_speaker_diarization)
</rosparam>
</node>

Binary file not shown.
23 changes: 23 additions & 0 deletions ros_speech_recognition/sample/sample_ros_speech_recognition.launch
@@ -0,0 +1,23 @@
<launch>

<arg name="google_cloud_credentials_json" default="''" doc="Credential JSON is only used when the engine is GoogleCloud." />
<arg name="engine" default="GoogleCloud" doc="Speech to text engine: Google, GoogleCloud, Sphinx, Wit, Bing, Houndify, IBM" />
Member commented:

In my understanding, GoogleCloud needs credentials (it costs money).
I think the default engine should be Google so that we can try this ROS package for free.

If you think GoogleCloud should be the default for ros_speech_recognition, I think it's OK to keep this change.

<arg name="launch_sound_play" default="false" doc="Launch sound_play node to speak" />

<!-- launch bag file -->
<node name="rosbag_play"
pkg="rosbag" type="play"
args="$(find ros_speech_recognition)/sample/data/may_i_help_you.bag --loop">
</node>

<include file="$(find ros_speech_recognition)/launch/speech_recognition.launch">
<arg name="launch_sound_play" value="$(arg launch_sound_play)" />
<arg name="engine" value="$(arg engine)" />
<arg name="launch_audio_capture" value="false" />
<arg name="audio_topic" value="/speech_audio" />
<arg name="language" value="en-US" />
<arg name="continuous" value="true" />
<arg name="google_cloud_credentials_json" value="$(arg google_cloud_credentials_json)" />
</include>

</launch>
87 changes: 80 additions & 7 deletions ros_speech_recognition/scripts/speech_recognition_node.py
@@ -8,13 +8,18 @@
from ros_speech_recognition.recognize_google_cloud import RecognizerEx
import json
import array
import os
import sys
from threading import Lock

import numpy as np
import std_msgs.msg
from audio_common_msgs.msg import AudioData
from sound_play.msg import SoundRequest, SoundRequestAction, SoundRequestGoal

from speech_recognition_msgs.msg import SentenceInfo
from speech_recognition_msgs.msg import SpeechRecognitionCandidates
from speech_recognition_msgs.msg import WordInfo
from speech_recognition_msgs.srv import SpeechRecognition
from speech_recognition_msgs.srv import SpeechRecognitionResponse
from std_srvs.srv import Empty
@@ -219,22 +224,32 @@ def play_sound(self, key, timeout=5.0):

def recognize(self, audio):
recog_func = None
self.enable_diarization = False
if self.engine == Config.SpeechRecognition_Google:
if not self.args:
self.args = {'key': rospy.get_param("~google_key", None)}
recog_func = self.recognizer.recognize_google
elif self.engine == Config.SpeechRecognition_GoogleCloud:
if not self.args:
credentials_path = rospy.get_param("~google_cloud_credentials_json", None)
if credentials_path is not None:
if credentials_path is not None and len(credentials_path) > 0:
if os.path.exists(credentials_path) is False:
rospy.logerr(
'google_cloud_credentials_json '
'{} not exists.'.format(credentials_path))
sys.exit(1)
with open(credentials_path) as j:
credentials_json = j.read()
else:
credentials_json = None
self.args = {'credentials_json': credentials_json,
'preferred_phrases': rospy.get_param('~google_cloud_preferred_phrases', None)}
'preferred_phrases': rospy.get_param('~google_cloud_preferred_phrases', None),
'show_all': True}
if rospy.has_param('~diarizationConfig') :
self.args.update({'user_config': {'diarizationConfig': rospy.get_param('~diarizationConfig') }})
diarizationConfig = rospy.get_param('~diarizationConfig')
self.args.update({'user_config': {'diarizationConfig': diarizationConfig}})
self.enable_diarization = diarizationConfig.get(
'enableSpeakerDiarization', False)
recog_func = self.recognizer.recognize_google_cloud
elif self.engine == Config.SpeechRecognition_Sphinx:
recog_func = self.recognizer.recognize_sphinx
@@ -251,17 +266,72 @@ def recognize(self, audio):

return recog_func(audio_data=audio, language=self.language, **self.args)

def make_result_message_from_result(self, result, header=None):
if header is None:
header = std_msgs.msg.Header(stamp=rospy.Time.now())
if self.engine == Config.SpeechRecognition_GoogleCloud:
if "results" not in result or len(result["results"]) == 0:
raise SR.UnknownValueError()
transcript = []
confidence = []
sentences = []
for res in result["results"]:
sent_info_msg = SentenceInfo(header=header)
if self.enable_diarization is False:
transcript.append(
res["alternatives"][0]["transcript"].strip())
confidence.append(res["alternatives"][0]['confidence'])
prev_speaker = None
trans = ''
confs = []
for word in res["alternatives"][0]['words']:
speaker = word.get('speakerTag', 0)
conf = word.get('confidence', 0.0)
# for more details, please see
# https://cloud.google.com/speech-to-text/docs/reference/rest/v1/speech/recognize#wordinfo
word_info_msg = WordInfo(
start_time=float(word.get(
'startTime', '0.0s').rstrip('s')),
end_time=float(word.get(
'endTime', '0.0s').rstrip('s')),
word=word.get('word', ''),
confidence=conf,
speaker_tag=speaker)
sent_info_msg.words.append(word_info_msg)
if self.enable_diarization is True \
and prev_speaker != speaker:
trans += "[{}]".format(speaker)
prev_speaker = speaker
trans += ' ' + word['word']
confs.append(conf)
if self.enable_diarization is True:
transcript.append(trans)
confidence.append(np.mean(confs))
sentences.append(sent_info_msg)
msg = SpeechRecognitionCandidates(
transcript=transcript,
confidence=confidence,
sentences=sentences)
transcript = " ".join(transcript)
else:
transcript = result
msg = SpeechRecognitionCandidates(
transcript=[transcript])
return msg, transcript
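For reference, a minimal standalone sketch of the duration-string conversion used in `make_result_message_from_result` above (the `parse_offset` helper name is mine, not part of the PR):

```python
# Sketch: Google Cloud returns word offsets as duration strings like "0.4s".
# The node strips the trailing "s" and converts to float, defaulting to 0.0.
def parse_offset(word, key):
    """Return the time offset in seconds for `key` ('startTime'/'endTime')."""
    return float(word.get(key, '0.0s').rstrip('s'))

word = {"word": "kettle", "startTime": "1.1s", "endTime": "1.5s"}
start = parse_offset(word, "startTime")
end = parse_offset(word, "endTime")
print(start, end)  # 1.1 1.5
```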

def audio_cb(self, _, audio):
if not self.enable_audio_cb:
return
try:
rospy.logdebug("Waiting for result... (Sent %d bytes)" % len(audio.get_raw_data()))
header = std_msgs.msg.Header(stamp=rospy.Time.now())
result = self.recognize(audio)
self.play_sound("recognized", 0.05)
rospy.loginfo("Result: %s" % result.encode('utf-8'))
self.play_sound("success", 0.1)
msg = SpeechRecognitionCandidates(transcript=[result])
msg, transcript = self.make_result_message_from_result(
result, header=header)
rospy.loginfo("Result: %s" % transcript)
self.pub.publish(msg)
self.play_sound("success", 0.1)
return
except SR.UnknownValueError as e:
if self.dynamic_energy_threshold:
@@ -322,11 +392,14 @@ def speech_recognition_srv_cb(self, req):
rospy.loginfo("Waiting for result... (Sent %d bytes)" % len(audio.get_raw_data()))

try:
header = std_msgs.msg.Header(stamp=rospy.Time.now())
Contributor commented:

I think this timestamp may be misleading (it is not the start time of the original audio data, but the time of speech recognition).

result = self.recognize(audio)
rospy.loginfo("Result: %s" % result.encode('utf-8'))
if not req.quiet:
self.play_sound("success", 0.1)
res.result = SpeechRecognitionCandidates(transcript=[result])
msg, _ = self.make_result_message_from_result(
result, header=header)
res.result = msg
return res
except SR.UnknownValueError:
if self.dynamic_energy_threshold:
@@ -74,6 +74,7 @@ def recognize_google_cloud(self, audio_data, credentials_json=None, language="en
speech_config["speechContext"] = {"phrases": preferred_phrases}
if show_all:
speech_config["enableWordTimeOffsets"] = True # some useful extra options for when we want all the output
speech_config["enable_word_confidence"] = True
request = speech_service.speech().recognize(body={"audio": {"content": base64.b64encode(flac_data).decode("utf8")}, "config": speech_config})

try:
@@ -1,4 +1,8 @@
<launch>

<arg name="google_cloud_credentials_json" default="''" doc="Credential JSON is only used when the engine is GoogleCloud." />
<arg name="engine" default="Google" doc="Speech to text engine: Google, GoogleCloud, Sphinx, Wit, Bing, Houndify, IBM" />

<!-- launch speech recognition -->

<arg name="launch_sound_play" default="false" doc="Launch sound_play node to speak" />
@@ -9,10 +13,12 @@
</rosparam>
<include file="$(find ros_speech_recognition)/launch/speech_recognition.launch">
Member commented:

How about integrating the content of this test into sample_ros_speech_recognition.launch?
Or how about moving the content of this test to sample_ros_speech_recognition_ja.launch?

<arg name="launch_sound_play" value="$(arg launch_sound_play)" />
<arg name="engine" value="$(arg engine)" />
<arg name="launch_audio_capture" value="false" />
<arg name="audio_topic" value="/speech_audio" />
<arg name="language" value="ja" />
<arg name="continuous" value="true" />
<arg name="google_cloud_credentials_json" value="$(arg google_cloud_credentials_json)" />
</include>

<!-- launch bag file -->