Why using voice recognition software is an act of language philosophy

I use voice recognition software on a daily basis. I turn to it when I am too lazy to type on my computer or when I am busy driving but want to get my smartphone to do something.

On the computer I use dictation mainly when I am copying text for students’ handouts or online quizzes. This act requires me to look at a text and type at the same time. Although I am a touch typist, I am still quite clumsy at it, especially when I am reading someone else’s text rather than typing my own.

Now the mic on the computer is like a mic on any recording device, like my PCM recorder. It picks up sound. But, while my PCM recorder records the sound as pure audio data and does nothing with it, my computer, when I am dictating to it, does at least one extra step – it transforms that sound into written text. This extra step is like thinking; it needs to recognise the sounds to convert them into text. In other words, sound and text are not one but two things.
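
The difference between the two devices can be sketched in a few lines of Python. This is a toy model under loose assumptions: the “audio” is a placeholder string and the recogniser a plain lookup table, whereas a real recorder stores waveform samples and a real recogniser is a statistical model.

```python
# Toy contrast between recording and dictating.

def record(audio):
    # A PCM recorder: keep the sound as pure audio data, do nothing with it.
    return audio

def dictate(audio, recogniser):
    # Dictation adds one extra step: mapping a sound pattern to written text.
    return recogniser.get(audio, "[unrecognised]")

# Hypothetical mapping from sound patterns to text.
recogniser = {"sound-pattern-for-hello": "hello"}

print(record("sound-pattern-for-hello"))               # the raw sound, untouched
print(dictate("sound-pattern-for-hello", recogniser))  # the text "hello"
```

The point of the sketch is only that `dictate` contains a step `record` does not: the sound itself is never the text.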

When I dictate to my computer, the computer does not initially “know” the meaning of the words. It only recognises the sound patterns and then transforms them into corresponding written patterns. From experience, I know that I must pronounce the sounds clearly for the computer to be able to do its job – to transform the sounds into text. But even when I am being most careful the computer cannot do this accurately and returns its best guess at the sound patterns it heard. The problem could be

  • my pronunciation,
  • the mic’s inability to pick up the sound because of incidental noise,
  • a low sound level because I was too far away from the mic,
  • the quality of the mic,
  • the software’s inability to distinguish similar sounds.
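
That best-guess behaviour can be sketched with the standard library’s fuzzy string matching. It is a toy model: a real recogniser compares acoustic features, not spellings, and the vocabulary here is made up.

```python
import difflib

# Toy "best guess" recogniser: it only knows a fixed set of sound
# patterns and always returns the closest one, right or wrong.
KNOWN_PATTERNS = ["their", "there", "they're", "bear", "beer"]

def best_guess(heard):
    # cutoff=0.0 forces a guess even for a very poor match, which is
    # roughly what happens when I mumble at the computer.
    return difflib.get_close_matches(heard, KNOWN_PATTERNS, n=1, cutoff=0.0)[0]

print(best_guess("there"))   # a clear sound pattern: an exact match
print(best_guess("beeer"))   # an unclear one: the recogniser still guesses "beer"
```

Whatever it hears, it commits to some written pattern – which is why the output needs editing afterwards.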

Most of the time I am required to spend time going through the text and editing it. (But, in the end, it still saves me a lot of time.)

Most days, too, I would say to my smartphone things like, “Hey Siri, play music,” or “Hey Siri, what’s my schedule for today?” And it (it isn’t a “she” or “he”, for we can choose and change the voice we desire to hear) would play some songs it thinks I like or maybe tell me I have a doctor’s appointment at 9pm.

Like my computer, my smartphone also has a mic. Unlike my computer, it is constantly “listening”. Until it hears a sound pattern that matches Hey Siri it does nothing (it isn’t converting every sound pattern to text, for that would use too much power). But once it does, it is instructed to

  • listen for new sound patterns,
  • transform the sound patterns into text,
  • decide on the meaning of the text, and then
  • do something (like play music or find and read out the day’s schedule).
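
Those four steps can be sketched as a loop in Python. Everything here is a hypothetical stand-in: real wake-word detection uses a low-power acoustic model, not string comparison, and real interpretation is far more than keyword spotting.

```python
# Toy sketch of a wake-word loop.

WAKE_PATTERN = "hey siri"

def sounds_like_wake_word(sound):
    # Cheap check: compare the sound pattern against the wake pattern
    # without doing full speech-to-text.
    return sound == WAKE_PATTERN

def transcribe(sound):
    # Expensive step: turn a sound pattern into written text.
    # Only runs after the wake word has been detected.
    return sound

def interpret(text):
    # Decide on the meaning of the text and pick an action.
    if "play music" in text:
        return "playing music"
    if "schedule" in text:
        return "reading out today's schedule"
    return "sorry, I didn't understand"

def assistant(sound_stream):
    results = []
    awake = False
    for sound in sound_stream:
        if not awake:
            # Do nothing until the wake pattern is heard.
            awake = sounds_like_wake_word(sound)
        else:
            text = transcribe(sound)         # sound pattern -> text
            results.append(interpret(text))  # text -> meaning -> action
            awake = False                    # back to passive listening
    return results

print(assistant(["doing dishes", "hey siri", "play music"]))
# -> ['playing music']
```

Note that the loop ignores everything – including meaningful speech – until the one sound pattern it is waiting for arrives.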

What my computer or smartphone is doing first is not recognising meaning. It is listening for sound patterns. And before the advent of voice recognition, it was looking for written text patterns.

The Hey Siri process is essentially no different from the text conversion process, except that the sound pattern is not converted into text but mapped directly to another action – the act of listening for what to convert into text next.

As I sit here my wife is doing things like carrying laundry, washing the cups, turning taps on and off, and removing plates from the dishwasher. But none of these are sounds that I need to convert to text (except insofar as I am translating them here into a description of what is physically happening). These are sound patterns of non-words. But they nonetheless all “translate” into meaning (that I should feel guilty and start helping her with the day’s work) because I am processing them.

Sound or text patterns as received form (sensation) are undoubtedly separate from meaning (perception). The two are as distinct as ear and nerve are from brain. The failure of the ear or nerve necessarily means the brain will not receive form from the outside. A person given the gift of sound for the first time through a prosthetic ear will likely break down in tears, especially when sound matches vision. The world is a richer place with sensation. It has more meaning.