bikutaa-dev/Shorthand-SSML-BikuEdition

Shorthand SSML for Bikubot


What is this

This is a custom, shortened way to control the TTS voices of Bikubot. It uses AWS Polly SSML tags to control how the voice sounds, but shortens and simplifies them so they are quicker and easier to write.

How it works

Any change to how something is spoken starts with # followed by the modifications you want to apply to the voice. Each modification is represented by a letter (for example, p for pitch), and some modifications also need a number that sets the scale of the change. Finally, the text you want the modification to apply to is enclosed in [ and ]. Because of this, the characters [ and ] are reserved, and if they are used within a voice modification they must form a matching pair.
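
The matching-pair rule can be illustrated with a small bracket matcher. This is only a sketch of one way to find where a tag's text ends; the function name find_closing_bracket is hypothetical and is not taken from the bot's code.

```python
def find_closing_bracket(text: str, open_index: int) -> int:
    """Return the index of the ] that closes the [ at open_index, or -1."""
    if open_index >= len(text) or text[open_index] != "[":
        raise ValueError("open_index must point at a '['")
    depth = 0
    for i in range(open_index, len(text)):
        if text[i] == "[":
            depth += 1  # entering a (possibly nested) bracket pair
        elif text[i] == "]":
            depth -= 1  # leaving a bracket pair
            if depth == 0:
                return i
    return -1  # unbalanced: the opening [ is never closed
```

A result of -1 would correspond to a tag that is not correctly enclosed.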

For example, the SSML <prosody pitch="+50%" rate="200%">This is a test</prosody> would in shorthand be #p150r200[this is a test]. Note that the mapping is not one-to-one for everything: pitch in normal SSML ranges from -30% to +50%, but the shorthand only works with positive numbers, so a conversion is done in which shorthand pitch starts at 100 instead of 0 (70 corresponds to -30% and 150 to +50%).
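
The pitch conversion described above amounts to subtracting 100 and adding a sign. A minimal sketch, assuming the 70 to 150 range listed in the Pitch section; the function name shorthand_pitch_to_ssml is made up for illustration.

```python
def shorthand_pitch_to_ssml(value: int) -> str:
    """Convert a shorthand pitch number (100 = unchanged) to a signed SSML percentage."""
    clamped = max(70, min(150, value))  # keep within the documented min/max
    return f"{clamped - 100:+d}%"       # always emit an explicit + or - sign
```

With this sketch, 150 becomes +50% and 70 becomes -30%.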

You can also mix modifications. For example, to add a whisper to the example above, the shorthand would be #wp150r200[this is a test]. The order of the modification characters does not matter, so #p150wr200[this is a test] works the same.

If the same modification appears more than once in a tag, as in #wr20r200[this is a test], only the last occurrence is used. That tag is treated the same as #wr200[this is a test]; the r20 is discarded.

The shorthand also supports nested tags, so you can write something like #p150[this is a #w[test]]. All modifications are also case insensitive, so #P150L(Sv-Se)[test] is the same as #p150l(sv-se)[test].
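
The ordering, duplicate, and case rules above can be sketched as a small parser for the part between # and [. This is an illustration only, not the bot's actual parser: the regular expression and the name parse_modifiers are assumptions.

```python
import re
from typing import Dict, Optional

# One letter, optionally followed by (...) , ++ , -- , + , - , or a number.
_MODIFIER = re.compile(r"([a-z])(\([^)]*\)|\+\+|--|\+|-|\d*\.?\d+)?", re.IGNORECASE)


def parse_modifiers(mod_part: str) -> Dict[str, Optional[str]]:
    """Map each modification letter to its value (None when it has no value)."""
    mods: Dict[str, Optional[str]] = {}
    for letter, value in _MODIFIER.findall(mod_part):
        letter = letter.lower()
        if value.startswith("("):
            value = value[1:-1]      # drop the surrounding parentheses
        if letter != "i":
            value = value.lower()    # IPA symbols are kept exactly as typed
        mods[letter] = value or None  # a later duplicate overwrites an earlier one
    return mods
```

Because the result is a dictionary keyed by letter, the order of modifications does not affect it, and a repeated letter keeps only its last value.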

The bot also does its best to fix any issues; for example, if a value is too high it is set to the highest possible value for that modification.
The possible modifications and their values can be found next.


Short Notes

  • A voice modification starts with # followed by one or more of the modifications listed below, and ends with the speech you want modified enclosed in [ and ].
  • The characters [ and ] are reserved; if used outside their intended purpose (marking what to modify), they must be used in matching pairs.
  • You can do nested modifications.
    • Example:
      • #p150[this is a nested pitch #w[whisper test]]
      • #p150[this is #w[deeply #r120s[nested and #t120[going deeper], and] now] back up]
      • #v11[#w[testing #s[softly] whispering] with a bit higher volume, #t50[ending with some timbre]]
  • You can add more than one modification per voice modification; the order does not matter.
    • Example:
      • #p150w[this is a modified pitch with whisper]
      • #wst50l(sv-se)[this is a soft and whispering Swedish language voice with modified timbre]
      • #b.5t50p150r180[This starts with a 0.5s break and modified pitch, rate and timbre]
  • The modification part is case insensitive.
  • Any modification value outside its min or max range will be set to its min or max (whichever is closest).
  • Any modification value that is not valid will be set to a normalized default value.
  • Any characters that do not represent a modification are ignored if they appear in the modification part.
  • A faulty voice modification, such as one with a space in the modification part or one that is not correctly enclosed, is read out as normal text.
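
The clamping and default rules in the notes above could look roughly like this. The ranges are copied from the Modifications sections below; treating each section's listed default as the "normalized default value" is an assumption, and the names RANGES and normalize_value are hypothetical.

```python
# letter: (min, max, default), taken from the Numeric lists in this document
RANGES = {
    "b": (0.0, 10.0, 1.0),       # break, seconds
    "d": (0.0, 60.0, 1.0),       # max duration, seconds
    "p": (70.0, 150.0, 100.0),   # pitch
    "r": (20.0, 200.0, 100.0),   # rate
    "t": (50.0, 200.0, 100.0),   # timbre
    "v": (4.0, 14.0, 10.0),      # volume
}


def normalize_value(letter: str, raw: str) -> float:
    """Clamp a numeric value to its range, or fall back to the default if invalid."""
    low, high, default = RANGES[letter.lower()]
    try:
        value = float(raw)
    except ValueError:
        return default  # assumed handling of a value that is not a number
    return max(low, min(high, value))
```

For example, a pitch of 500 would be lowered to 150 and a rate of 5 raised to 20.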

Modifications


Break

Break is represented by the letter b and supports either a following numeric value or +, ++, -, --. The SSML equivalent is the <break time=""> tag. The break happens before the given text, if there is any inside the enclosing [].

  • Effect: Creates a break in the speech at the point of the tag, for the given amount of time in seconds.

  • Characters:
    These represent the same preset values that normal SSML has.

    • ++ = x-strong
    • + = strong
    • - = weak
    • -- = x-weak
  • Numeric:

    • default: 1.0
    • max: 10.0
    • min: 0.0
  • Example:

    • Characters:
      #b+[] is equal to <break strength="strong" />
    • Numeric:
      #b1.2[A test] is equal to <break time="1200ms" />A test
      #b.5[] is equal to <break time="500ms" />
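
As a rough sketch of the break conversion, numeric values in seconds become a time in milliseconds, and the character presets become a strength. Only the + = strong mapping is confirmed by the example above; the other strength names are assumed from the standard SSML break strengths, and break_to_ssml is a hypothetical name.

```python
# Assumed preset mapping; only "+" -> "strong" appears in the examples above.
BREAK_STRENGTHS = {"++": "x-strong", "+": "strong", "-": "weak", "--": "x-weak"}


def break_to_ssml(value: str) -> str:
    """Turn the value after the letter b into an SSML <break> tag."""
    if value in BREAK_STRENGTHS:
        return f'<break strength="{BREAK_STRENGTHS[value]}" />'
    seconds = max(0.0, min(10.0, float(value)))  # documented range: 0.0 to 10.0
    return f'<break time="{round(seconds * 1000)}ms" />'
```

Any text inside the [] would then be appended after the generated tag, since the break happens before the text.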

Emphasis

Emphasis is represented by the letter m and needs a following -, +, or ++. The SSML equivalent is the <emphasis level=""> tag.

  • Effect: Tries to emphasize or de-emphasize the word/sentence.

  • Characters:
    These represent the same preset values that normal SSML has.

    • ++ = strong
    • + = moderate
    • - = reduced
  • Example:
    #m++[A test] is equal to <emphasis level="strong">A test</emphasis>
    #m-[A test] is equal to <emphasis level="reduced">A test</emphasis>


Expletive/Beep

Expletive/beep is represented by the letter e and does not need any additional data. The SSML equivalent is the <say-as interpret-as="expletive"> tag.

  • Effect: Beeps out the content.

  • Example:
    #e[A test] is equal to <say-as interpret-as="expletive">A test</say-as>


IPA (International Phonetic Alphabet)

IPA is represented by the letter i, followed by the phonetic symbols for the pronunciation enclosed in (). The SSML equivalent is the <phoneme alphabet="ipa" ph=""> tag.

  • Effect: Changes how the word(s) enclosed in [] are spoken.

  • Example:
    #i(pɪˈkɑːn)[pecan] is equal to <phoneme alphabet="ipa" ph="pɪˈkɑːn">pecan</phoneme>


Language

Language is represented by the letter l, followed by the language code of the language you want to use enclosed in (). The SSML equivalent is the <lang xml:lang=""> tag, for example <lang xml:lang="fr-FR">.

  • Effect: Changes what language the voice will use to try to speak the words.

  • Language codes:

    Language                  Code
    Arabic                    arb
    Arabic (Gulf)             ar-ae
    Catalan                   ca-es
    Chinese (Cantonese)       yue-cn
    Chinese (Mandarin)        cmn-cn
    Danish                    da-dk
    Dutch                     nl-nl
    English (Australian)      en-au
    English (British)         en-gb
    English (Indian)          en-in
    English (New Zealand)     en-nz
    English (South African)   en-za
    English (US)              en-us
    English (Welsh)           en-gb-wls
    Finnish                   fi-fi
    French                    fr-fr
    French (Canadian)         fr-ca
    Hindi                     hi-in
    German                    de-de
    German (Austrian)         de-at
    Icelandic                 is-is
    Italian                   it-it
    Japanese                  ja-jp
    Korean                    ko-kr
    Norwegian                 nb-no
    Polish                    pl-pl
    Portuguese (Brazilian)    pt-br
    Portuguese (European)     pt-pt
    Romanian                  ro-ro
    Russian                   ru-ru
    Spanish (European)        es-es
    Spanish (Mexican)         es-mx
    Spanish (US)              es-us
    Swedish                   sv-se
    Turkish                   tr-tr
    Welsh                     cy-gb
  • Example:
    #l(ja-jp)[A test] is equal to <lang xml:lang="ja-JP">A test</lang>
    #l(en-us)[A test] is equal to <lang xml:lang="en-US">A test</lang>

Max Duration

Max duration is represented by the letter d and needs a following numeric value. The SSML equivalent is the <prosody amazon:max-duration=""> tag. There are limits on how much the speech can be sped up, and if it already fits within the duration no changes are made.

  • Effect: Tries to speed up the speech so it fits within the given time.

  • Numeric:

    • default: 1.0
    • max: 60.0
    • min: 0.0
  • Example:
    #d5.3[A test] is equal to <prosody amazon:max-duration="5300ms">A test</prosody>
    #d.5[A test] is equal to <prosody amazon:max-duration="500ms">A test</prosody>


Pitch

Pitch is represented by the letter p and supports either a following numeric value or +, ++, -, --. The SSML equivalent is the <prosody pitch=""> tag.

  • Effect: Changes the pitch at which the words are spoken.

  • Characters:
    These represent the same preset values that normal SSML has.

    • ++ = x-high
    • + = high
    • - = low
    • -- = x-low
  • Numeric:

    • default: 100
    • max: 150
    • min: 70
  • Example:

    • Characters:
      #p++[A test] is equal to <prosody pitch="x-high">A test</prosody>
    • Numeric:
      #p150[A test] is equal to <prosody pitch="+50%">A test</prosody>

Soft

Soft speech is represented by the letter s and does not need any additional data. The SSML equivalent is the <amazon:effect phonation="soft"> tag.

  • Effect: Makes the speech being spoken sound softer.

  • Example:
    #s[A test] is equal to <amazon:effect phonation="soft">A test</amazon:effect>


Rate

Rate is represented by the letter r and supports either a following numeric value or +, ++, -, --. The SSML equivalent is the <prosody rate=""> tag.

  • Effect: Changes the speed at which the words are spoken.

  • Characters:
    These represent the same preset values that normal SSML has.

    • ++ = x-fast
    • + = fast
    • - = slow
    • -- = x-slow
  • Numeric:

    • default: 100
    • max: 200
    • min: 20
  • Example:

    • Characters:
      #r--[A test] is equal to <prosody rate="x-slow">A test</prosody>
    • Numeric:
      #r150[A test] is equal to <prosody rate="150%">A test</prosody>

Timbre

Timbre is represented by the letter t and supports either a following numeric value or +, ++, -, --. The SSML equivalent is the <amazon:effect vocal-tract-length=""> tag.

  • Effect: Changes the timbre of voice.

  • Characters:

    • ++ = 200%
    • + = 150%
    • - = 75%
    • -- = 50%
  • Numeric:

    • default: 100
    • max: 200
    • min: 50
  • Example:

    • Characters:
      #t--[A test] is equal to <amazon:effect vocal-tract-length="50%">A test</amazon:effect>
    • Numeric:
      #t50[A test] is equal to <amazon:effect vocal-tract-length="50%">A test</amazon:effect>

Volume

Volume is represented by the letter v and supports either a following numeric value or +, ++, -, --. The SSML equivalent is the <prosody volume=""> tag.

  • Effect: Changes the volume of the speech.

  • Characters:
    These represent the same preset values that normal SSML has.

    • ++ = x-loud
    • + = loud
    • - = soft
    • -- = x-soft
  • Numeric:

    • default: 10
    • max: 14
    • min: 4
  • Example:

    • Characters:
      #v+[A test] is equal to <prosody volume="loud">A test</prosody>
    • Numeric:
      #v4[A test] is equal to <prosody volume="-6dB">A test</prosody>
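
The numeric volume scale appears to be an offset from the default of 10, so that 4 maps to -6dB. A minimal sketch under that assumption: the linear mapping is inferred from the single example and the listed default, and shorthand_volume_to_ssml is a hypothetical name.

```python
from xml.sax.saxutils import escape


def shorthand_volume_to_ssml(value: int, text: str) -> str:
    """Wrap text in a <prosody volume> tag, assuming 10 means no change in dB."""
    clamped = max(4, min(14, value))  # documented min/max
    decibels = clamped - 10           # assumed mapping: 4 -> -6dB, 14 -> +4dB
    return f'<prosody volume="{decibels:+d}dB">{escape(text)}</prosody>'
```

The text is XML-escaped here so that characters such as & do not break the generated SSML.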

Whisper

Whisper is represented by the letter w and does not need any additional data. The SSML equivalent is the <amazon:effect name="whispered"> tag.

  • Effect: Makes the words be spoken in a whispering voice.

  • Example:
    #w[A test] is equal to <amazon:effect name="whispered">A test</amazon:effect>


Special Effects

There are a few special effects that the shorthand supports. These sounds are represented by the effect name enclosed in --, like --effectname--. Some of these are affected by modifications, as they are created with SSML and TTS; where that is the case it is noted. A future plan is to allow streamers to add their own sounds to this system. Effect names are all case insensitive.


Breath

These are created using the SSML <amazon:breath> tag. All breaths use volume="loud" or volume="x-loud" so they have a chance of being heard.

  • --BXL-- = <amazon:breath duration="x-long" volume="x-loud"/>
  • --BL-- = <amazon:breath duration="long" volume="x-loud"/>
  • --B-- = <amazon:breath duration="medium" volume="loud"/>
  • --BS-- = <amazon:breath duration="short" volume="loud"/>
  • --BXS-- = <amazon:breath duration="x-short" volume="loud"/>
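
The breath effects listed above can be expanded with a simple case-insensitive lookup. This is a sketch only; the names BREATHS and expand_effects are hypothetical, and unknown effect names are left untouched here as an assumption.

```python
import re

# Effect name (lowercase) -> SSML, copied from the list above
BREATHS = {
    "bxl": '<amazon:breath duration="x-long" volume="x-loud"/>',
    "bl": '<amazon:breath duration="long" volume="x-loud"/>',
    "b": '<amazon:breath duration="medium" volume="loud"/>',
    "bs": '<amazon:breath duration="short" volume="loud"/>',
    "bxs": '<amazon:breath duration="x-short" volume="loud"/>',
}

_EFFECT = re.compile(r"--([a-z]+)--", re.IGNORECASE)


def expand_effects(text: str) -> str:
    """Replace each known --effectname-- token with its SSML tag."""
    def replace(match):
        # Fall back to the original token when the name is not a known effect
        return BREATHS.get(match.group(1).lower(), match.group(0))
    return _EFFECT.sub(replace, text)
```

Because the lookup lowercases the name first, --BXL-- and --bxl-- produce the same tag.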

Tones

This is a cheat sheet, created by community member Nowrench, on how to create tone-like sounds using the Expletive/Beep tag: TTS-Melody-Guide

