Machines and humans: stirred, not shaken, for a perfect captioning recipe
Posted: Thu Dec 30, 2021 9:53 am
The early days of 3Play Media included extensive research into the various methods of transcribing audio and video content in order to create accurate and well-timed captions. One of the main points of interest was the use of Automatic Speech Recognition (ASR). Specifically, could ASR be used to generate quality captions? As a stand-alone solution: no. But the speed and scale of generating a "decent" draft were compelling.
As we pondered the problem (potentially millions of hours of content needing to be captioned), we focused on designing a process and system that could evolve, recognizing that a human was needed to achieve acceptable levels of accuracy for this use case.
Our ultimate conclusion is still relevant as the basic process for transcribing and captioning audio and video assets today: 1) generate an ASR draft, 2) have a human review and edit every second of it, and 3) run a human quality assurance check.
At the time, we couldn't identify any other vendor that was successfully combining speech recognition with human correction in this way. Had we built a better mousetrap? Or were we really ahead of the market? [In my best John McEnroe voice] Did people even understand what we were talking about?!
The few vendors in the space at the time immediately rushed in, exclaiming that it was not possible to use speech recognition in the process and still achieve legitimate levels of accuracy. Several noted that they had tried it themselves and could confidently claim that a hybrid human/technology solution couldn't possibly speed up the process, and that anyone suggesting otherwise was deluding themselves. I'm not going to lie, that was a pretty funny answer. Can you imagine taxi companies saying that a rideshare app would never work when it was happening right in front of them? Ah, OK…
Other vendors, however, doubled down on their anti-technology messaging. Competitive blog posts and articles on the realities of speech recognition accuracy, and on how “overuse” of the technology equates to “bad”, scared some customers and prospects. Some tenders even began to specify the need for “human” captions and state that processes using speech recognition would not be tolerated, rather than focusing on measurements of output accuracy.
We found ourselves playing around with language to minimize the automated aspects of the transcription process. Instead, we focused all of our automation messaging on aspects of the workflow. We started to focus on the result rather than the exact process, noting that through our use of technology we were often able to achieve higher levels of accuracy than fully manual solutions (and still do today).
We learned that going into the details of how we created the captions can become extremely confusing for people new to these products and concepts. Did it even matter how we transcribed the content? Well, maybe. Automation was seen by some as the holy grail: cheap, fast, and infinitely scalable. However, many also misunderstood the balance of ASR's capabilities and limitations then (and perhaps still do today). At the same time, working with a new company in the space that had found a way to innovate was exciting, and learning the process was both interesting and part of the assessment. And even then, some customers said that the closed captioning we provided was automated in nature.