[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Emacspeak] TTS Server Implementation Questions

To: Tim Cross <theophilusx@xxxxxxxxx>
Subject: Re: [Emacspeak] TTS Server Implementation Questions
From: Rob Hill <robmichill@xxxxxxxxx>
Date: Wed, 10 Apr 2024 09:52:37 +1200
In-reply-to: <87le5ml3m7.fsf@gmail.com>
Speaking as someone who has relied on emacspeak for many and varied
tasks, both work and pleasure, though isn't from the programming
world,  I have never missed indexing.  The ability to easily navigate
by chunks does the job, and also prevents me from falling asleep if
reading late at night.  It's just a different way of doing the job.

Rob




"Tim Cross" (via emacspeak Mailing List) writes:
 > 
 > I could be missing something, but as I see it, what voice indexing would
 > provide is for the ability to have a 'voice cursor' (which may or may
 > not be the same as your emacs cursor) tracking of location where the TTS
 > engine is up to when generating speech from the submitted text.
 > 
 > The would, for example, enable for the pausing and then subsequent
 > resuming of speech whereby the resumed speech would start from where the
 > speech was previously paused. In some systems, this is very important
 > becasue the system only sends large chunks of speech at a time. For
 > example, I've seen a simple TTS interface for reading files where it
 > will just start reading the file. You odn't have the ability to ask for
 > just a page, paragraph, sentence, line, word. You just ask for it to
 > start speaking and then yuou can pause and resume speech. The other
 > thing you may get is cursor tracking of speech. A cursor might move
 > through the text as it is spoken so that when you pause speech, the
 > cursor is at that point in yhour text. This can be useful for people who
 > want to read  along with the speech i.e. the speech is an aid to visual
 > reading.
 > 
 > While I can see the potential benefits in having the ability to get and
 > use speech index information, I've not found it very high on my wishlist
 > for emacspeak. This is primarily because emacspeak provides very fined
 > grained control over the size or chunks of speech I send at a
 > time. Depending on what I'm doing, I'll read/browse the data using a
 > movement/chank size which suits my need. For example, I'f I have a large
 > buffer of text I want to read, I'm unlikely to ask emacspeak to just
 > read the whole buffer. Instead, I'm more likely to as it to read by
 > page, paragraph or perhaps sentence.
 > 
 > With emacspeak, I find it is very much about moving around using the
 > unit (letter, word, sentence, paragraph, page, buffer) best sutied for
 > what I'm doing. I find this provides an adequate balance between my use
 > case and complexity/consistency across speech servers. This has also
 > enabled me to experiment with different TTS engines. For example, many
 > years ago, I wrote speech servers for the Cepstral TTS engines. These
 > were a commercial TTS engine that at the time, had high quality
 > voices. The additional complexity and overheads involved in a TTS
 > interface model which supported voice indexing would likely hav made
 > this much harder to implement and discouraged the type of experimentation
 > which is at the heart of emacspeak. Likewise, I wonder if we would have
 > had the other TTS engines, some of which have come and gone, like the
 > flite and festival servers or the server written in C, or the existing
 > mac, swiftmac servers or the experimental windows, speech-dispatcher and
 > JS servers that are out there currently in various stages of
 > development.
 > 
 > I personally don't see the amount of required effort justifyhing the
 > benefits given we already have the capability to work with varying
 > chunks of speech. Yes, it would provide some convenience, but at a high
 > cost which I feel is hard to justify. However, provided someone can
 > implement something which does not require changes to the existing
 > servers or their design, I would say go for it. A lot can be learnt from
 > implementing a TTS server. In fact, I've learnt a lot from failed
 > attempts to implement TTS servers as there is a considerable amount of
 > subtle and non-obvious aspects to a TTS server which only become clear
 > when you try implementing one, making it a great learning experience. At
 > least it was for me. 
 > 
 > 
 > Victor Tsaran <vtsaran@xxxxxxxxx> writes:
 > 
 > > I guess, the question stands: what user-facing problem are we trying to solve?
 > >
 > > On Tue, Apr 9, 2024 at 3:14 AM Parham Doustdar <emacspeak@xxxxxxxxxxxxx> wrote:
 > >
 > >  That's true, Emacspeak doesn't currently "read" from the speech server process as far as I've seen, it only "writes" to it.
 > >  Fixing that isn't impossible, but definitely time consuming.
 > >  The other concrete issue is that last time I checked, console screen readers read all the text in one chunk. They don't use the
 > >  audio CSS (forgive me if I don't use the correct name here) that Emacspeak has, which requires you to play audio icons,
 > >  speak text with different pitch, and pauses. All of this means that you have to do extra heavy-lifting to really track the index,
 > >  because the index you get back from the TTS engine isn't simply a position in the buffer -- it is just the position in the
 > >  current chunk of text it has recently received.
 > >  So that's why I'm curious if we really think it's worth it. It could be, or not, I'm not opinionated, but I'm also realizing that in
 > >  our community, we don't really have a good mechanism to discuss and decide on things like this.
 > >
 > >  On Tue, Apr 9, 2024 at 8:35 AM Tim Cross <theophilusx@xxxxxxxxx> wrote:
 > >
 > >  You are overlooking one critical component which explains why adding
 > >  indxing support is a non-trivial exercise which would require a complete
 > >  redesign of the existing TTS interface model.
 > >
 > >  For indexing information to be of any use, it has to be fed back into the
 > >  client and used by the client. For example, tell the client to
 > >  update/move the cursor to the last position spoken.
 > >
 > >  There is absolutely no support for this data to be fed back into the
 > >  current system. The current TTS interface has data flowing in only one
 > >  direction, from emacs to emacpseak and from emacspeak to the TTS server
 > >  and form the tts server to the tts synthesizer. There is no existing
 > >  mechanism to feed information (i.e. index positions) back from the TTS
 > >  engine to emacs. While getting this information from the TTS engine into
 > >  the TTS server is probably reasonably easy, there is no existing channel
 > >  to feed that information up into Emacspeak. 
 > >
 > >  Not only would it be necessary to define and implement a whole new model
 > >  to incorporate this feedback, in addition to also working with TTS
 > >  engines which do not provide indexing information, you would also likely
 > >  need to implement some sort of multi speech cursor tracking so that the
 > >  system can track cursor positions in different buffers.
 > >
 > >  The reason this sort of functionality seems easy in systems like speakup
 > >  or speech-dispatcher is because those systems were designed with this
 > >  functionality. It is incprporated into the base design and part of the
 > >  various communication protocols the design implement. Adding this
 > >  functionality is not something which can just be 'tacked on'. 
 > >
 > >  The good news of course is that being open source, anyone can go ahead
 > >  and define a new interface model and add indexing capability. However,
 > >  it may be worth considering that it has taken 30 years of development to
 > >  get the current model to where it is at, so I think you can expect a
 > >  pretty steep climb initially!
 > >
 > >  John Covici <covici@xxxxxxxxxxxxxx> writes:
 > >
 > >  > Its a lot simpler -- indexing is supposed to simply arrange things so
 > >  > that when reading a buffer, and you stop reading, the cursor will be
 > >  > at or near the point where you stopped.  Speakup has had this for a
 > >  > long time and that is why I use it on Linux.  But its only good for
 > >  > the virtual console.  Now speech dispatcher has indexinng built in, so
 > >  > if you connect to that and use one of the supported synthesizers,
 > >  > indexing works correctly and I don't see any performance hit.  I think
 > >  > all the client has to do is connect to speech dispatcher, but check me
 > >  > on this.
 > >  >
 > >  > On Mon, 08 Apr 2024 08:25:27 -0400,
 > >  > Robert Melton wrote:
 > >  >> 
 > >  >> Is indexing supposed to be like per reading block, or like one global?  Is the idea 
 > >  >> that you can be reading a buffer, go to another buffer, read some of it, then come 
 > >  >> back and continue? IE: Index per "reading block"? 
 > >  >> 
 > >  >> Assuming it is global for simplicity, it is still a heavy lift for implementation on 
 > >  >> Mac and Windows. 
 > >  >> 
 > >  >> As they do not natively report back as words are spoken, now
 > >  >> you can get this behavior at an "Utterance" level, by installing hooks and callbacks 
 > >  >> and tracking those. With that you would need to additionally keep copies of the future 
 > >  >> utterances, even if they already where queued with the TTS.
 > >  >> 
 > >  >> Considered from the POV of index per reading block, then you need to find ways to ident 
 > >  >> each one and its position and index them and continue reading.
 > >  >> 
 > >  >> Sounds neat, but at least for my servers, right now, the juice isn't worth the sqeeze, I
 > >  >> am still trying to get basic stuff like pitch multipliers working on windows via wave 
 > >  >> mangling and other basic features, hehe. 
 > >  >> 
 > >  >> > On Apr 8, 2024, at 05:20, Parham Doustdar <parham90@xxxxxxxxx> wrote:
 > >  >> > 
 > >  >> > I understand. My question isn't whether it's possible though, or how difficult it
 > >  >> > would be, or the steps we'd have to take to implement it.
 > >  >> > My question is more about whether the use cases we have today make it worth it to
 > >  >> > reconsider. All other questions we can apply the wisdom of the community to solve, if
 > >  >> > we were convinced that the effort would be worth it.
 > >  >> > For me, the way I've got around this is to use the next/previous paragraph
 > >  >> > commands. The chunks are good small enough that I can "zoom in" if I want, and yet
 > >  >> > large enough that I don't have to constantly hit next-line.
 > >  >> > Sent from my iPhone
 > >  >> > 
 > >  >> >> On 8 Apr 2024, at 11:13, Tim Cross <theophilusx@xxxxxxxxx> wrote:
 > >  >> >> 
 > >  >> >> 
 > >  >> >> This is extremely unlikely to be implemented. It is non-trivial and
 > >  >> >> would require a significant re-design of the whole interface and model
 > >  >> >> of operation. It isn't as simple as just getting index information from
 > >  >> >> the TTS servers which support it. That information has to then be fed
 > >  >> >> backwards to Emacs through some mechanism which currently does not
 > >  >> >> exist and would result in a far more complicated interface/model.
 > >  >> >> 
 > >  >> >> As Raman said, the decision not to have this was not simply an oversight
 > >  >> >> or due to lack of time. It was a conscious design decision. What your
 > >  >> >> asking for isn't simply an enhancement, it is a complete redesign of the
 > >  >> >> TTS interface model.
 > >  >> >> 
 > >  >> >> "Parham Doustdar" (via emacspeak Mailing List) <emacspeak@xxxxxxxxxxxxx> writes:
 > >  >> >> 
 > >  >> >>> I agree. I'm not sure which TTS engines support it. Maybe, just like notification streams
 > >  >> >>> are supported in some servers, we can implement this feature for engines that support it?
 > >  >> >>> Sent from my iPhone
 > >  >> >>> 
 > >  >> >>>>> On 8 Apr 2024, at 10:24, John Covici <emacspeak@xxxxxxxxxxxxx> wrote:
 > >  >> >>>> 
 > >  >> >>>> I know this might be contraversial, but, indexing would be very useful
 > >  >> >>>> to me,  sometimes I read long buffers and when I stop the reading, the
 > >  >> >>>> cursor is still where I started, so no real  way to do this adequately
 > >  >> >>>> -- I would not mind if it were just down to the line, rather than
 > >  >> >>>> individual words, but it would make emacspeak lots nicer for me.
 > >  >> >>>> 
 > >  >> >>>>> On Fri, 05 Apr 2024 15:39:15 -0400,
 > >  >> >>>>> "T.V Raman" (via emacspeak Mailing List) wrote:
 > >  >> >>>>> 
 > >  >> >>>>> [1  <text/plain; us-ascii (7bit)>]
 > >  >> >>>>> as a single call is that it ensures  atomicity i.e. all of the state
 > >  >> >>>>> gets set at one shot from the perspective of the elisp layer, so you
 > >  >> >>>>> hopefully never get TTS that has its state  partially set.
 > >  >> >>>>> note that the other primary benefit of tts_sync_state
 > >  >> >>>>> 
 > >  >> >>>>> Robert Melton writes:
 > >  >> >>>>>> On threading. It is all concurrent, lots of fun protecting of the state.
 > >  >> >>>>>> 
 > >  >> >>>>>> On language and voice, I was thinking of them as a tree, language/voice,
 > >  >> >>>>>> as this is how Windows and MacOS seem to provide them.
 > >  >> >>>>>> 
 > >  >> >>>>>> ----
 > >  >> >>>>>> 
 > >  >> >>>>>> Oh, one last thing. Should TTS Server implementations be returning a \n
 > >  >> >>>>>> after command is complete, or is just returning nothing acceptable?
 > >  >> >>>>>> 
 > >  >> >>>>>> 
 > >  >> >>>>>>> On Apr 5, 2024, at 14:01, T.V Raman <raman@xxxxxxxxxx> wrote:
 > >  >> >>>>>>> 
 > >  >> >>>>>>> And do spend some time thinking of atomicity and multithreaded systems,
 > >  >> >>>>>>> e.g. ask yourself the question "how many threads of execution are active
 > >  >> >>>>>>> at any given time"; Hint: the answer isn't as simple as "just one
 > >  >> >>>>>>> because my server doesn't use threads". > Raman--
 > >  >> >>>>>>>> 
 > >  >> >>>>>>>> Thanks so much, that clarifies a bunch. A few questions on the
 > >  >> >>>>>>>> language / voice support.
 > >  >> >>>>>>>> 
 > >  >> >>>>>>>> Does the TTS server maintain an internal list and switch through
 > >  >> >>>>>>>> it or does it send the list the lisp in a way I have missed?
 > >  >> >>>>>>>> 
 > >  >> >>>>>>>> Would it be useful to have a similar feature for voices, being
 > >  >> >>>>>>>> first you pick right language, then you pick preferred voice
 > >  >> >>>>>>>> then maybe it is stored in a defcustom and sent next time as
 > >  >> >>>>>>>> (set_lang lang:voice t)
 > >  >> >>>>>>>> 
 > >  >> >>>>>>>> 
 > >  >> >>>>>>>>> On Apr 5, 2024, at 13:10, T.V Raman <raman@xxxxxxxxxx> wrote:
 > >  >> >>>>>>>>> 
 > >  >> >>>>>>>>> If your TTS supports more than one language, the TTS API exposes these
 > >  >> >>>>>>>>> as a list; these calls loop through the list (dectalk,espeak, outloud)
 > >  >> >>>>>>>> 
 > >  >> >>>>>>>> --
 > >  >> >>>>>>>> Robert "robertmeta" Melton
 > >  >> >>>>>>>> lists@xxxxxxxxxxxxxxxx
 > >  >> >>>>>>>> 
 > >  >> >>>>>>> 
 > >  >> >>>>>> 
 > >  >> >>>>>> --
 > >  >> >>>>>> Robert "robertmeta" Melton
 > >  >> >>>>>> lists@xxxxxxxxxxxxxxxx
 > >  >> >>>>> 
 > >  >> >>>>> --
 > >  >> >>>>> [2  <text/plain; UTF-8 (8bit)>]
 > >  >> >>>>> Emacspeak discussion list -- emacspeak@xxxxxxxxxxxxx
 > >  >> >>>>> To unsubscribe send email to:
 > >  >> >>>>> emacspeak-request@xxxxxxxxxxxxx with a subject of: unsubscribe
 > >  >> >>>> 
 > >  >> >>>> --
 > >  >> >>>> Your life is like a penny.  You're going to lose it.  The question is:
 > >  >> >>>> How do
 > >  >> >>>> you spend it?
 > >  >> >>>> 
 > >  >> >>>>       John Covici wb2una
 > >  >> >>>>       covici@xxxxxxxxxxxxxx
 > >  >> >>>> Emacspeak discussion list -- emacspeak@xxxxxxxxxxxxx
 > >  >> >>>> To unsubscribe send email to:
 > >  >> >>>> emacspeak-request@xxxxxxxxxxxxx with a subject of: unsubscribe
 > >  >> >>> 
 > >  >> >>> Emacspeak discussion list -- emacspeak@xxxxxxxxxxxxx
 > >  >> >>> To unsubscribe send email to:
 > >  >> >>> emacspeak-request@xxxxxxxxxxxxx with a subject of: unsubscribe
 > >  >> 
 > >  >> --
 > >  >> Robert "robertmeta" Melton
 > >  >> lists@xxxxxxxxxxxxxxxx
 > >  >> 
 > >  >> 
 > >
 > >  Emacspeak discussion list -- emacspeak@xxxxxxxxxxxxx
 > >  To unsubscribe send email to:
 > >  emacspeak-request@xxxxxxxxxxxxx with a subject of: unsubscribe
 > Emacspeak discussion list -- emacspeak@xxxxxxxxxxxxx
 > To unsubscribe send email to:
 > emacspeak-request@xxxxxxxxxxxxx with a subject of: unsubscribe

--
References:
- Re: [Emacspeak] TTS Server Implementation Questions
  - From: Tim Cross
- Re: [Emacspeak] TTS Server Implementation Questions
  - From: Parham Doustdar
- Re: [Emacspeak] TTS Server Implementation Questions
  - From: Robert Melton
- Re: [Emacspeak] TTS Server Implementation Questions
  - From: John Covici
- Re: [Emacspeak] TTS Server Implementation Questions
  - From: Tim Cross
- Re: [Emacspeak] TTS Server Implementation Questions
  - From: Parham Doustdar
- Re: [Emacspeak] TTS Server Implementation Questions
  - From: Victor Tsaran
- Re: [Emacspeak] TTS Server Implementation Questions
  - From: Tim Cross
Prev by Date: [Emacspeak] Sharpwin First (hopefully) Useable Release
Next by Date: Re: [Emacspeak] TTS Server Implementation Questions
Previous by thread: Re: [Emacspeak] TTS Server Implementation Questions
Next by thread: Re: [Emacspeak] TTS Server Implementation Questions
Index(es):
- Date
- Thread
|Full archive May 1995 - present by Year|Search the archive|
If you have questions about this archive or had problems using it, please contact us.
Contact Info Page