This is a custom, shortened way to control the TTS voices of Bikubot. It uses AWS Polly SSML tags to control how the voice sounds, but shortens and simplifies the tags to make them easier and quicker to write.
Any change to how something is spoken starts with # followed by the modifications you want to apply to the voice. Each modification is represented by a letter (for example, p for pitch), and some modifications also need a number to represent the scale of the modification. Finally, the text you want the modification to apply to is enclosed in [ and ]. Because of this, the characters [ and ] are reserved, and if they are used within a voice modification they must form a matching pair.
For example, the SSML <prosody pitch="+50%" rate="200%">This is a test</prosody> becomes #p150r200[this is a test] in shorthand. Note that some values do not map one-to-one: pitch in normal SSML ranges from -30% to +50%, but the shorthand only works with positive numbers, so a conversion is done where the shorthand pitch scale starts at 100 instead of 0 (p150 means +50% and p70 means -30%).
You can also mix modifications. For example, to add a whisper to the example above, the shorthand would be #wp150r200[this is a test]. The order of the modification characters does not matter, so #p150wr200[this is a test] works the same.
If the same modification appears more than once in one tag, as in #wr20r200[this is a test], only the last occurrence is used. In this case the tag is treated the same as #wr200[this is a test], and the r20 is discarded.
The shorthand also supports nested tags, so you can write something like #p150[this is a #w[test]]. All modifications are also case insensitive, so #P150L(Sv-Se)[test] is the same as #p150l(sv-se)[test].
The bot also does its best to fix any issues; for example, if a value is too high, it is set to the highest possible value for that modification.
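The order-independence, last-one-wins and case-insensitivity rules described above can be illustrated with a small Python sketch. This is not Bikubot's actual parser; the function name parse_modifiers and the regular expression are assumptions made for illustration only.

```python
import re

# One modification: a known letter, optionally followed by a (...) argument,
# a +/- preset, or a number such as 150, 1.2 or .5.
# The letter set and this pattern are assumptions, not Bikubot's real grammar.
TOKEN = re.compile(r"([bdeilmprstvw])(\([^)]*\)|\+\+|\+|--|-|\d*\.?\d+)?")


def parse_modifiers(mods: str) -> dict:
    """Return {letter: raw value}; a repeated letter keeps its last value."""
    result = {}
    # Lowercasing first makes the whole modification part case insensitive.
    for letter, value in TOKEN.findall(mods.lower()):
        result[letter] = value  # later occurrences overwrite earlier ones
    return result


print(parse_modifiers("wp150r200"))
print(parse_modifiers("wr20r200"))
```

Because the result is a dictionary keyed by letter, wp150r200 and p150wr200 compare equal, and in wr20r200 only r200 survives.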
The possible modifications and their values can be found next.
- A voice modification starts with # followed by one or more of the modifications listed below, and ends with the speech you want modified enclosed in [ and ].
- The characters [ and ] are reserved; when used outside their intended purpose (marking what to modify), they must appear in matching pairs.
- You can nest modifications.
  - Example:
    - #p150[this is a nested pitch #w[whisper test]]
    - #p150[this is #w[deeply #r120s[nested and #t120[going deeper], and] now] back up]
    - #v11[#w[testing #s[softly] whispering] with a bit higher volume, #t50[ending with some timbre]]
- You can add more than one modification per voice modification; the order does not matter.
  - Example:
    - #p150w[this is a modified pitch with whisper]
    - #wst50l(sv-se)[this is a soft and whispering Swedish language voice with modified timbre]
    - #b.5t50p150r180[This starts with a 0.5s break and modified pitch, rate and timbre]
- The modification part is case insensitive.
- Any modification value outside its min or max range is set to its min or max (whichever is closest).
- Any modification value that is not valid is set to a normalized default value.
- Any character in the modification part that does not represent a modification is ignored.
- A faulty voice modification, such as one with a space in the modification part or one that is not correctly enclosed, is read as normal text.
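Two of the rules above, matching bracket pairs and clamping values into range, can be sketched in Python as follows. Both helper names are hypothetical and the code only illustrates the rules; it is not how Bikubot itself is implemented.

```python
def find_matching_bracket(text: str, open_index: int) -> int:
    """Return the index of the ] that closes the [ at open_index, or -1."""
    if text[open_index] != "[":
        raise ValueError("open_index must point at a [ character")
    depth = 0
    for i in range(open_index, len(text)):
        if text[i] == "[":
            depth += 1      # entering a (possibly nested) modification
        elif text[i] == "]":
            depth -= 1      # leaving one level
            if depth == 0:
                return i
    return -1               # not correctly enclosed


def normalize(raw: str, minimum: float, maximum: float, default: float) -> float:
    """Clamp a numeric value into range; fall back to default if invalid."""
    try:
        value = float(raw)
    except ValueError:
        return default
    return max(minimum, min(maximum, value))


tag = "#p150[this is a #w[test]]"
print(find_matching_bracket(tag, tag.index("[")))
print(normalize("999", 20, 200, 100))
```

A return value of -1 corresponds to the "not correctly enclosed" case, where the text would simply be read as normal.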
Break is represented by the letter b and supports either a following numeric value or +, ++, -, --. The SSML equivalent is the <break time=""> tag (<break strength=""> for the character presets). The break happens before the given text, if there is any inside the enclosing [].
- Effect: Creates a break in the speech at the point of the tag for the given amount of time in seconds.
- Characters: These represent the same preset values that normal SSML has.
  - ++ = x-strong
  - + = strong
  - - = weak
  - -- = x-weak
- Numeric:
  - default: 1.0
  - max: 10.0
  - min: 0.0
- Example:
  - Characters:
    - #b+[] is equal to <break strength="strong" />
  - Numeric:
    - #b1.2[A test] is equal to <break time="1200ms" />A test
    - #b.5[] is equal to <break time="500ms" />
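As a rough illustration of the numeric form, the sketch below converts a break value in seconds to a <break> tag using the time attribute named in the section intro, clamping to the 0.0 to 10.0 range listed above. The function name break_tag is hypothetical.

```python
def break_tag(seconds: float) -> str:
    """Build a <break> tag from a shorthand value given in seconds."""
    # Clamp to the documented range before converting to milliseconds.
    seconds = max(0.0, min(10.0, seconds))
    return f'<break time="{round(seconds * 1000)}ms" />'


print(break_tag(1.2))         # value from #b1.2[...]
print(break_tag(float(".5")))  # value from #b.5[...]
```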
Emphasis is represented by the letter m and needs a following -, + or ++. The SSML equivalent is the <emphasis level=""> tag.
- Effect: Tries to emphasize or de-emphasize the word or sentence.
- Characters: These represent the same preset values that normal SSML has.
  - ++ = strong
  - + = moderate
  - - = reduced
- Example:
  - #m++[A test] is equal to <emphasis level="strong">A test</emphasis>
  - #m-[A test] is equal to <emphasis level="reduced">A test</emphasis>
Expletive/beep is represented by the letter e and does not need any additional data. The SSML equivalent is the <say-as interpret-as="expletive"> tag.
- Effect: Beeps out the content.
- Example:
  - #e[A test] is equal to <say-as interpret-as="expletive">A test</say-as>
IPA is represented by the letter i, followed by the phonetic symbols for the pronunciation enclosed in (). The SSML equivalent is the <phoneme alphabet="ipa" ph=""> tag.
- Effect: Changes how the word(s) enclosed in [] are spoken.
- Example:
  - #i(pɪˈkɑːn)[pecan] is equal to <phoneme alphabet="ipa" ph="pɪˈkɑːn">pecan</phoneme>
Language is represented by the letter l, followed by the language code for the language you want to use enclosed in (). The SSML equivalent is the <lang xml:lang=""> tag.
- Effect: Changes what language the voice will use to try to speak the words.
- Language codes:
Language | Code | Language | Code | Language | Code |
Arabic | arb | Arabic (Gulf) | ar-ae | Catalan | ca-es |
Chinese (Cantonese) | yue-cn | Chinese (Mandarin) | cmn-cn | Danish | da-dk |
Dutch | nl-nl | English (Australian) | en-au | English (British) | en-gb |
English (Indian) | en-in | English (New Zealand) | en-nz | English (South African) | en-za |
English (US) | en-us | English (Welsh) | en-gb-wls | Finnish | fi-fi |
French | fr-fr | French (Canadian) | fr-ca | Hindi | hi-in |
German | de-de | German (Austrian) | de-at | Icelandic | is-is |
Italian | it-it | Japanese | ja-jp | Korean | ko-kr |
Norwegian | nb-no | Polish | pl-pl | Portuguese (Brazilian) | pt-br |
Portuguese (European) | pt-pt | Romanian | ro-ro | Russian | ru-ru |
Spanish (European) | es-es | Spanish (Mexican) | es-mx | Spanish (US) | es-us |
Swedish | sv-se | Turkish | tr-tr | Welsh | cy-gb |
- Example:
  - #l(ja-jp)[A test] is equal to <lang xml:lang="ja-JP">A test</lang>
  - #l(en-us)[A test] is equal to <lang xml:lang="en-US">A test</lang>
Max duration is represented by the letter d and needs a following numeric value. The SSML equivalent is the <prosody amazon:max-duration=""> tag. There are limits on how much the speech can be sped up, and if it already fits within the duration no changes are made.
- Effect: Tries to speed up the speech so it fits within the given time.
- Numeric:
  - default: 1.0
  - max: 60.0
  - min: 0.0
- Example:
  - #d5.3[A test] is equal to <prosody amazon:max-duration="5300ms">A test</prosody>
  - #d.5[A test] is equal to <prosody amazon:max-duration="500ms">A test</prosody>
Pitch is represented by the letter p and supports either a following numeric value or +, ++, -, --. The SSML equivalent is the <prosody pitch=""> tag.
- Effect: Changes the pitch at which the words are spoken.
- Characters: These represent the same preset values that normal SSML has.
  - ++ = x-high
  - + = high
  - - = low
  - -- = x-low
- Numeric:
  - default: 100
  - max: 150
  - min: 70
- Example:
  - Characters:
    - #p++[A test] is equal to <prosody pitch="x-high">A test</prosody>
  - Numeric:
    - #p150[A test] is equal to <prosody pitch="+50%">A test</prosody>
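The numeric pitch conversion can be sketched as follows: the shorthand value is clamped to the 70 to 150 range and then shifted by 100 to get the signed SSML percentage. The function name pitch_attr is hypothetical, and this is only an illustration of the documented mapping.

```python
def pitch_attr(value: int) -> str:
    """Convert a shorthand pitch value (100 = unchanged) to an SSML percentage."""
    value = max(70, min(150, value))  # clamp to the documented min and max
    return f"{value - 100:+d}%"       # 150 -> +50%, 70 -> -30%


print(f'<prosody pitch="{pitch_attr(150)}">A test</prosody>')
```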
Soft speech is represented by the letter s and does not need any additional data. The SSML equivalent is the <amazon:effect phonation="soft"> tag.
- Effect: Makes the speech sound softer.
- Example:
  - #s[A test] is equal to <amazon:effect phonation="soft">A test</amazon:effect>
Rate is represented by the letter r and supports either a following numeric value or +, ++, -, --. The SSML equivalent is the <prosody rate=""> tag.
- Effect: Changes the speed at which the words are spoken.
- Characters: These represent the same preset values that normal SSML has.
  - ++ = x-fast
  - + = fast
  - - = slow
  - -- = x-slow
- Numeric:
  - default: 100
  - max: 200
  - min: 20
- Example:
  - Characters:
    - #r--[A test] is equal to <prosody rate="x-slow">A test</prosody>
  - Numeric:
    - #r150[A test] is equal to <prosody rate="150%">A test</prosody>
Timbre is represented by the letter t and supports either a following numeric value or +, ++, -, --. The SSML equivalent is the <amazon:effect vocal-tract-length=""> tag.
- Effect: Changes the timbre of the voice.
- Characters:
  - ++ = 200%
  - + = 150%
  - - = 75%
  - -- = 50%
- Numeric:
  - default: 100
  - max: 200
  - min: 50
- Example:
  - Characters:
    - #t--[A test] is equal to <amazon:effect vocal-tract-length="50%">A test</amazon:effect>
  - Numeric:
    - #t50[A test] is equal to <amazon:effect vocal-tract-length="50%">A test</amazon:effect>
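The timbre presets and numeric range listed above can be combined in one small Python sketch. The names TIMBRE_PRESETS and timbre_tag are hypothetical, the fallback to 100 for invalid input follows the general "normalized default" rule, and the text is inserted without XML escaping to keep the example short.

```python
# Preset characters and their percentages, as listed in this section.
TIMBRE_PRESETS = {"++": 200, "+": 150, "-": 75, "--": 50}


def timbre_tag(value: str, text: str) -> str:
    """Wrap text in a vocal-tract-length effect for a preset or numeric value."""
    if value in TIMBRE_PRESETS:
        percent = TIMBRE_PRESETS[value]
    else:
        try:
            percent = int(value)
        except ValueError:
            percent = 100  # assumed default for invalid values
        percent = max(50, min(200, percent))  # clamp to the documented range
    return f'<amazon:effect vocal-tract-length="{percent}%">{text}</amazon:effect>'


print(timbre_tag("--", "A test"))
print(timbre_tag("50", "A test"))
```

Both calls produce the same tag, matching the two examples in the list above.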
Volume is represented by the letter v and supports either a following numeric value or +, ++, -, --. The SSML equivalent is the <prosody volume=""> tag.
- Effect: Changes the volume of the speech.
- Characters: These represent the same preset values that normal SSML has.
  - ++ = x-loud
  - + = loud
  - - = soft
  - -- = x-soft
- Numeric:
  - default: 10
  - max: 14
  - min: 4
- Example:
  - Characters:
    - #v+[A test] is equal to <prosody volume="loud">A test</prosody>
  - Numeric:
    - #v4[A test] is equal to <prosody volume="-6dB">A test</prosody>
Whisper is represented by the letter w and does not need any additional data. The SSML equivalent is the <amazon:effect name="whispered"> tag.
- Effect: Makes the words be spoken in a whispering voice.
- Example:
  - #w[A test] is equal to <amazon:effect name="whispered">A test</amazon:effect>
There are a few special effects that the shorthand supports. These sounds are represented by the effect name enclosed by --, like --effectname--. Some of these are affected by modifications because they are created with SSML and TTS; where that is the case, it is noted. The plan for the future is to allow streamers to add their own sounds to this system. These are all case insensitive.
These are created using the SSML <amazon:breath> tag. All breaths use a loud volume setting so that they have a chance of being heard.
- --BXL-- = <amazon:breath duration="x-long" volume="x-loud"/>
- --BL-- = <amazon:breath duration="long" volume="x-loud"/>
- --B-- = <amazon:breath duration="medium" volume="loud"/>
- --BS-- = <amazon:breath duration="short" volume="loud"/>
- --BXS-- = <amazon:breath duration="x-short" volume="loud"/>
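The breath effects listed above amount to a case-insensitive lookup table. The sketch below shows one way such a replacement could work; the names BREATHS, EFFECT and expand_effects are hypothetical, and leaving unknown effect names untouched is an assumption rather than documented behaviour.

```python
import re

# Effect name -> SSML, copied from the list above (keys lowercased).
BREATHS = {
    "bxl": '<amazon:breath duration="x-long" volume="x-loud"/>',
    "bl": '<amazon:breath duration="long" volume="x-loud"/>',
    "b": '<amazon:breath duration="medium" volume="loud"/>',
    "bs": '<amazon:breath duration="short" volume="loud"/>',
    "bxs": '<amazon:breath duration="x-short" volume="loud"/>',
}

# Matches --effectname-- in any letter case.
EFFECT = re.compile(r"--([a-z]+)--", re.IGNORECASE)


def expand_effects(text: str) -> str:
    """Replace known --name-- effects with SSML; leave unknown ones as-is."""
    def replace(match):
        return BREATHS.get(match.group(1).lower(), match.group(0))
    return EFFECT.sub(replace, text)


print(expand_effects("Hello --b-- there"))
```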
This is a cheat sheet on how to create tone-like sounds by using the Expletive/Beep tag. It was created by community member Nowrench.