
[ros_speech_recognition] Add WordInfo to SpeechRecognitionCandidates message. #320

Open · wants to merge 14 commits into master
Conversation

@iory (Member) commented Jan 5, 2022

What is this?

This PR enables publishing start_time, end_time, confidence, and speaker_tag for each recognized word.
It requires the following PR for the new message: jsk-ros-pkg/jsk_common_msgs#28

Example

If you run ros_speech_recognition with ~continuous set to True, you can subscribe to the /Tablet/voice (speech_recognition_msgs/SpeechRecognitionCandidates) topic.

  1. Launch the sample launch file.

    roslaunch ros_speech_recognition sample_ros_speech_recognition.launch
  2. Echo the message.

    $ rostopic echo /Tablet/voice
    transcript:
      - may I help you
    confidence: [0.9286448955535889]
    sentences:
      -
        header:
          seq: 0
          stamp:
            secs: 1641425262
            nsecs: 268165588
          frame_id: ''
        words:
          -
            start_time: 0.0
            end_time: 0.2
            word: "may"
            confidence: 0.91376436
            speaker_tag: 0
          -
            start_time: 0.2
            end_time: 0.4
            word: "I"
            confidence: 0.9366196
            speaker_tag: 0
          -
            start_time: 0.4
            end_time: 0.5
            word: "help"
            confidence: 0.9531065
            speaker_tag: 0
          -
            start_time: 0.5
            end_time: 0.8
            word: "you"
            confidence: 0.9110889
            speaker_tag: 0
    ---
    transcript:
      - pick up the red kettle
    confidence: [0.9499567747116089]
    sentences:
      -
        header:
          seq: 0
          stamp:
            secs: 1641425268
            nsecs:  58182954
          frame_id: ''
        words:
          -
            start_time: 0.0
            end_time: 0.4
            word: "pick"
            confidence: 0.953269
            speaker_tag: 0
          -
            start_time: 0.4
            end_time: 0.6
            word: "up"
            confidence: 0.95326656
            speaker_tag: 0
          -
            start_time: 0.6
            end_time: 0.8
            word: "the"
            confidence: 0.96866167
            speaker_tag: 0
          -
            start_time: 0.8
            end_time: 1.1
            word: "red"
            confidence: 0.98762906
            speaker_tag: 0
          -
            start_time: 1.1
            end_time: 1.5
            word: "kettle"
            confidence: 0.8869578
            speaker_tag: 0

word is the recognized word; confidence indicates the estimated likelihood that the recognized word is correct, with higher values meaning more likely.
start_time is the time offset of the start of the spoken word, relative to the beginning of the audio (the header timestamp).
end_time is the corresponding offset of the end of the spoken word.
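For downstream consumers, the per-word offsets can be converted to absolute times by adding them to the sentence header stamp. A minimal pure-Python sketch (the dict field names simply mirror the `rostopic echo` output above, not an actual message class):

```python
# Convert per-word offsets to absolute times by adding them to the
# sentence header stamp (the beginning of the audio). Field names
# mirror the rostopic echo output above; this is a sketch, not ROS code.

def absolute_word_times(header_stamp_sec, words):
    """words: iterable of dicts with 'word', 'start_time', 'end_time'."""
    return [(w["word"],
             header_stamp_sec + w["start_time"],
             header_stamp_sec + w["end_time"])
            for w in words]

# Example using the first sentence from the echo output above:
stamp = 1641425262.268165588
words = [
    {"word": "may",  "start_time": 0.0, "end_time": 0.2},
    {"word": "I",    "start_time": 0.2, "end_time": 0.4},
    {"word": "help", "start_time": 0.4, "end_time": 0.5},
    {"word": "you",  "start_time": 0.5, "end_time": 0.8},
]
for word, t_start, t_end in absolute_word_times(stamp, words):
    print("%-5s %.3f -> %.3f" % (word, t_start, t_end))
```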

@iory iory force-pushed the add-time-information branch from b6edf13 to ae3f3eb on January 5, 2022 12:30
@iory iory requested review from mqcmd196 and 708yamaguchi January 5, 2022 23:39
@708yamaguchi (Member) left a comment

Thank you very much for the useful features.

I left some reviews.

@@ -30,6 +30,106 @@ This package uses Python package [SpeechRecognition](https://pypi.python.org/pyp
print result # => 'Hello, world!'
```

If you are using `ros_speech_recognition` with `~continuous` is `True`, you can subscribe `/Tablet/voice` (`speech_recognition_msgs/SpeechRecognitionCandidates`) message.
@iory

You changed the speech_recognition_msgs/SpeechRecognitionCandidates content in jsk-ros-pkg/jsk_common_msgs#28.

In my opinion, we should create a new message type such as speech_recognition_msgs/GoogleCloudSpeechRecognitionCandidates in addition to the existing speech_recognition_msgs/SpeechRecognitionCandidates.

This is because the new fields (e.g. WordInfo) seem to be specific to Google Cloud Speech-to-Text.

This is my preference, but how about wrapping the common message in a message per speech-to-text service?
The advantage of this method is that we do not need to change the current speech_recognition_msgs/SpeechRecognitionCandidates.

For example,

$ rosmsg show speech_recognition_msgs/GoogleCloudSpeechRecognitionCandidates

Header header
speech_recognition_msgs/SpeechRecognitionCandidates candidates
speech_recognition_msgs/SentenceInfo[] sentences
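
For illustration, the SentenceInfo / WordInfo layout this wrapping assumes might look like the following sketch; the field names are inferred from the echo output above, and the actual definitions are proposed in jsk-ros-pkg/jsk_common_msgs#28:

```
# WordInfo.msg (sketch; fields inferred from the example output)
float64 start_time   # offset from the audio start [s]
float64 end_time     # offset from the audio start [s]
string word
float32 confidence
int32 speaker_tag

# SentenceInfo.msg (sketch)
Header header        # stamp marks the beginning of the audio
WordInfo[] words
```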

In addition, please keep in mind that julius_ros also publishes speech_recognition_msgs/SpeechRecognitionCandidates.
https://github.com/jsk-ros-pkg/jsk_3rdparty/tree/master/julius_ros#gmm-version

I'd like to hear your thoughts. @iory

@iory (Member, Author) replied:

This is because the new fields (e.g. wordinfo) seems to be specific to the google cloud speech-to-text.

This is not specific to Google Cloud Speech-to-Text.
For example, Azure Cognitive Services has a similar feature:
https://docs.microsoft.com/ja-jp/python/api/azure-cognitiveservices-speech/azure.cognitiveservices.speech.speechconfig?view=azure-python#request-word-level-timestamps--

I think that the start time and end time for each word are general values as a framework for speech recognition.

This is my preference, but how about wrapping the common message by each speech-to-text service message?
The advantage of this method is we do not need to change the current

However, this is a good way to avoid affecting other users, so I'll take this direction.



```bash
roslaunch ros_speech_recognition sample_ros_speech_recognition.launch
```

We may need to set the google_cloud_credentials_json:=xxx arg.
I got the following error:

[ERROR] [1641453200.346079]: Unexpected error: (<class 'oauth2client.client.ApplicationDefaultCredentialsError'>, ApplicationDefaultCredentialsError('The Application Default Credentials are not available. They are available if running in Google Compute Engine. Otherwise, the environment variable GOOGLE_APPLICATION_CREDENTIALS must be defined pointing to a file defining the credentials. See https://developers.google.com/accounts/docs/application-default-credentials for more information.',), <traceback object at 0x7f980f9a6960>)

I think

roslaunch ros_speech_recognition sample_ros_speech_recognition.launch google_cloud_credentials_json:=xxx.json

is more helpful.

<launch>

<arg name="google_cloud_credentials_json" default="''" doc="Credential JSON is only used when the engine is GoogleCloud." />
<arg name="engine" default="GoogleCloud" doc="Speech to text engine. TTS engine, Google, GoogleCloud, Sphinx, Wit, Bing Houndify, IBM" />

In my understanding, GoogleCloud needs credentials (and it costs money).
I think the default engine should be Google so that we can try this ROS package for free.

If you think GoogleCloud should be the default for ros_speech_recognition, I think it's ok to keep this change.

@@ -9,10 +13,12 @@
</rosparam>
<include file="$(find ros_speech_recognition)/launch/speech_recognition.launch">
How about integrating the content of this test into sample_ros_speech_recognition.launch?
or
How about moving the content of this test to sample_ros_speech_recognition_ja.launch?

@@ -30,6 +30,106 @@ This package uses Python package [SpeechRecognition](https://pypi.python.org/pyp
print result # => 'Hello, world!'
```

If you are using `ros_speech_recognition` with `~continuous` is `True`, you can subscribe `/Tablet/voice` (`speech_recognition_msgs/SpeechRecognitionCandidates`) message.

1. Launch sample launch file.

How about adding the Google Cloud Speech-to-Text WordInfo URL to README.md too?
It would be helpful.
https://cloud.google.com/speech-to-text/docs/reference/rest/v1/speech/recognize#wordinfo


Thank you very much for the documentation.

In addition to the engine:=GoogleCloud example, it would be very helpful if you could add an engine:=Google example.


The `word` is the recognized word; `confidence` indicates the estimated likelihood that the recognized word is correct, with higher values meaning more likely.
`start_time` is the time offset of the start of the spoken word, relative to the beginning of the audio (the header timestamp).
`end_time` is the corresponding offset of the end of the spoken word.

It would be useful to put these descriptions into the message definition.

@@ -322,11 +392,14 @@ def speech_recognition_srv_cb(self, req):
rospy.loginfo("Waiting for result... (Sent %d bytes)" % len(audio.get_raw_data()))

try:
header = std_msgs.msg.Header(stamp=rospy.Time.now())

I think this timestamp may be misleading: it is not the start time of the original audio data, but the time at which speech recognition ran.
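
One conceivable fix, assuming the raw audio parameters (byte length, sample rate, sample width) are available at that point, would be to back-date the stamp by the audio duration so that header.stamp marks the audio start. A sketch, not the PR's actual code:

```python
# Sketch: back-date the stamp so it marks the start of the captured audio
# rather than the moment recognition finished. Assumes the raw audio byte
# length, sample rate, and sample width are known; hypothetical helper.

def audio_start_stamp(now_sec, raw_byte_len, sample_rate, sample_width):
    """Return now_sec minus the duration of the captured audio, in seconds."""
    duration = raw_byte_len / float(sample_rate * sample_width)
    return now_sec - duration

# 16 kHz, 16-bit mono audio lasting exactly 2 s, "now" at t = 100 s:
stamp = audio_start_stamp(100.0, 2 * 16000 * 2, 16000, 2)
print("%.1f" % stamp)  # 98.0
```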

@sktometometo (Contributor) commented:

@iory Great work! I have also left some comments on jsk-ros-pkg/jsk_common_msgs#28. I think we need more discussion about the start_time and end_time representation.
