766K+ views   |   26K+ likes   |   598 dislikes   |  
18:54   |   Dec 31, 2018




  • >> Oh, hey there! What's up? Cary Knife Holder here.
  • If you read the title,
  • then you know this video is about my lip reading AI, or rather
  • I should say *our* lip reading AI, because I did this project with my friendo James WoMa.
  • >> JAMES— >> See, in winter 2018,
  • he roped me into— (coughs) —I mean encouraged me into taking this class at our college called CS 230: Deep Learning.
  • What really hyped us up was that this class was literally taught by Andrew Ng. Yeah,
  • that's the guy who co-founded Google Brain and the guy who left Baidu's AI group, causing the company's market cap to drop by
  • 2.7 billion dollars. Anyway, the class was mostly taught online through pre-recorded lecture videos on Coursera.
  • So, over the course of 10 weeks, I got very used to the sound of
  • Andrew Ng's voice played at two times speed.
  • (Sped up 2x) >> Hehe, am I giving any of you former students nightmare flashbacks of watching lecture videos? Haha, serves you right!
  • >> Oh,
  • but back to the subject of this video.
  • The students of CS 230 were supposed to form groups of 2 to 4 and create something
  • **AI-y.**
  • So, I teamed up with James and we had trouble coming up with a project idea quickly.
  • We tried to make an autoencoder that converted traditional Chinese characters into simplified ones,
  • but it overfit like crazy, and we also didn't want to bow down to stereotypes,
  • so that idea had to go. >> Goodbye!
  • With less than three weeks left, we were getting desperate. So we settled on a lip-reading algorithm.
  • Can we get this creepy guy out of here? The goal: give the algorithm a silent video of someone talking and
  • it should spit out the sounds it thinks they said.
  • For our first crack at this problem,
  • we were a little ambitious.
  • Whereas most lip-reading programs just output text subtitles of what was said,
  • >> Place blue in m 1 soon
  • (hi im here because i want to seem like a contribution dont mind me)
  • >> We wanted to go the extra mile and output full audio of what was said. That way,
  • we'd include features that text would otherwise miss, like voice inflection, pauses, breathing noises and lip smacking.
  • Huh, what's that? Yeah, of course people want to hear the lip smacking!
  • >> (Lip smacking...)
  • ...Hello— >> Now, for our data set, we wanted a well-framed video with consistent lighting and good enunciation.
  • So, I just went on YouTube's trending page and found this 15 minute video.
  • (One of the few actual uses of trending am I right?)
  • As I mentioned in an old video, raw audio files, like WAV, are
  • information dense, with around 40,000 data points per second.
  • We're gonna convert all the audio into spectrograms using
  • ARSS to drop the temporal resolution by 400 times and save us from the headache.
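In code, that conversion step might look roughly like this. The project itself used ARSS; this is a plain NumPy stand-in, and the sample rate, window size, and hop of 400 samples are illustrative numbers chosen to match the "400 times" figure above:

```python
import numpy as np

sample_rate = 40_000       # ~40,000 raw data points per second, as mentioned above
hop = 400                  # one spectrogram column per 400 samples -> 400x fewer time steps
win = 1024                 # analysis window length
audio = np.random.randn(sample_rate * 2)   # placeholder: 2 seconds of noise

# Slice the signal into hop-spaced windows and FFT each one.
starts = np.arange(0, len(audio) - win, hop)
windows = np.stack([audio[s:s + win] * np.hanning(win) for s in starts])
spectrogram = np.abs(np.fft.rfft(windows, axis=1)).T   # (freq_bins, time_columns)
print(spectrogram.shape)   # 513 frequency bins, ~100 columns per second of audio
```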
  • >> This is the hardest and most emotionally
  • confusing video that I've ever had to make. Everyone that watches this video—
  • (repeated) This is the hardest and most emotionally
  • confus—
  • >> Since the x-axis of a spectrogram corresponds to time, every frame of the video
  • corresponds to a single column of the spectrogram. That transforms our problem into this:
  • Given a silent video frame and a surrounding neighborhood of, say, 9 other frames to give context and velocity information,
  • can we get a neural network to generate the spectrogram column that corresponds with the correct phonemes spoken? For those of you
  • who care about the technical details,
  • here's the neural network architecture.
  • But for the rest of us,
  • let's just see if this algorithm can figure out what the heck this guy is saying. So Computer,
  • here's just the visuals of this guy talking. What sounds do you think he's making?
  • (computer generated-voice dub) This is the hardest and most— [x1]
  • (computer generated-voice dub) [This is the hardest and most— [x2]
  • (computer generated-voice dub) [This is the hardest and most— [x3]
  • That was actually pretty good, but James and I purposely trained our model on only the first two seconds as a proof of concept.
  • So, we didn't know how well it would do on the rest of the video.
  • We'd better check, because after all, the whole point of a lip-reading program is to read lips that we don't yet know the words to.
  • So Computer, take a look at the rest of the video, would you please?
  • (computer generated-voice dub) [This is the hardest and most—
  • (more computer-generated voice dub)
  • (Computery's gibberish)
  • (Computery's hellish gibberish)
  • >> Uhm, I don't know what you thought about that, but that wasn't quite what we had planned.
  • OK Jimbo, if we don't get this to work, we'll fail the class!
  • It's time to bring out the big guns! Train the neural network on not just two seconds, but 14 minutes of the video!
  • (Computery's hellish gibberish)
  • Increase the neuron count per layer!
  • (more Computery's hellish gibberish)
  • Include the history of past phonemes in the model's input for even more context!
  • (gibberish, once again)
  • Use Adam Geitgey's face recognition Python library to find the bounds of this guy's lips and crop all images to show just that!
  • (insert witty comment about gibberish)
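The lip-cropping step from that list is worth a sketch. The `face_recognition` library's `face_landmarks` function returns lip outlines as lists of (x, y) points; given points in that shape, the crop is just a padded bounding box. The example points and padding below are made up:

```python
import numpy as np

# Hypothetical lip landmarks in (x, y) pixel coordinates, shaped like the
# 'top_lip' / 'bottom_lip' point lists that face_recognition.face_landmarks returns.
image = np.zeros((480, 640, 3), dtype=np.uint8)
lip_points = [(300, 250), (340, 245), (380, 252), (340, 280)]

xs = [p[0] for p in lip_points]
ys = [p[1] for p in lip_points]
pad = 10  # small margin around the lips
top, bottom = max(min(ys) - pad, 0), min(max(ys) + pad, image.shape[0])
left, right = max(min(xs) - pad, 0), min(max(xs) + pad, image.shape[1])
mouth_crop = image[top:bottom, left:right]
print(mouth_crop.shape)   # (55, 100, 3) for these example points
```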
  • As you can see, things were looking dire. And to make things worse, finals week was approaching, so I started having to pull all-nighters
  • studying for other classes. >> It's 7:00 AM, and I've stayed up all night working on CS 107's Binary Bomb and three hours
  • memorizing this script. >> We had some serious reconsidering to do. With one week left,
  • here's what we decided. You know how instead of text subtitles, we had wanted to go the extra mile and create full audio?
  • Yeah, let's not do that. It may not be as impressive, but we might make more progress working with simple phonemes instead.
  • Fortunately, we found Gentle (gentle!), which is a robust yet lenient forced aligner built on Kaldi, by lowerquality.com.
  • (Oof!)
  • If you give it an audio file, it will align CMU phonemes to the audio with appropriate timestamps.
  • Unfortunately, you need to provide a text transcript to assist the aligner, and this video didn't have one! At this point,
  • there were three changes
  • we wanted to make to our dataset:
  • More exaggerated lip motions, closer to an hour's worth of video instead of 15 minutes, and like I just said, a pre-written
  • transcript. So, we might as well create our own dataset! Can't possibly take longer than the desired hour of video, right?
  • We just need to pick the right...
  • ...transcript...
  • (melancholic music)
  • On March the 11th,
  • 2018, at 11 PM, I did the unthinkable.
  • I read the entire Bee Movie script on camera.
  • >> "According to all known laws of
  • aviation, there is no way a bee should be able to fly." >> To reiterate, I used Jerry Seinfeld's cinematic
  • masterpiece to create 70 minutes, or
  • 126 thousand frames, of training data for an academic project in a graduate-level computer science course at Stanford. Now, I know
  • Bee Movie memes are dead, but remember that we did this assignment ten months ago,
  • so spare me the pitchforks? The key improvement of this data set over
  • the last one is that I could synchronize the correct words and
  • phonemes with the video using the help of Gentle, and the publicly available
  • Bee Movie transcript. And sure, I did misspeak a few times... >> "...and begins your career at Honey
  • Indu—Industries!!!" >> ...and Gentle did a few oopsies as well...
  • >> "Artie growing a moustache, looks good!" >> ...but I bet 95% of the data set is accurate. At this point,
  • you might be starting to see why these phoneme labels make this problem so much easier.
  • Now, we can forget about the audio file and focus solely on the phonemes, which are one four-thousandth the size.
  • Oh, and like before, we'll use Adam's face recognition library to crop it on my lips.
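To make the phoneme-label idea concrete: Gentle emits JSON with per-word start times and per-phone durations, which can be rasterized into one label per video frame. The exact field names below mimic Gentle's output but are simplified assumptions, and the timings are made up:

```python
# Turning forced-alignment output into per-frame phoneme labels.
fps = 30
alignment = [
    {"word": "according", "start": 0.50, "phones": [
        {"phone": "ah", "duration": 0.08},
        {"phone": "k", "duration": 0.10},
    ]},
]

labels = {}
for word in alignment:
    t = word["start"]
    for phone in word["phones"]:
        # Every video frame inside this phone's time span gets its label.
        for frame in range(int(t * fps), int((t + phone["duration"]) * fps)):
            labels[frame] = phone["phone"]
        t += phone["duration"]
print(labels)   # frame index -> phoneme, e.g. {15: 'ah', 16: 'ah', 17: 'k', ...}
```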
  • Simply put, every image fits into one of 41
  • categories (one category per phoneme). Upon receiving a brand new image, our new and improved
  • neural network just has to figure out which of the 41 categories this new image falls into. This is called a classification problem,
  • and it's drastically easier to get working than our first attempt. In technical terms,
  • we trained a convolutional neural network with a softmax output on the first
  • 110,000 frames of me speaking; that took eight hours.
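The output side of that classifier boils down to a softmax over 41 scores. A minimal NumPy sketch, with random numbers standing in for the CNN's final layer:

```python
import numpy as np

n_phonemes = 41
rng = np.random.default_rng(0)
logits = rng.normal(size=n_phonemes)     # stand-in for the CNN's last-layer scores

def softmax(z):
    e = np.exp(z - z.max())              # subtract the max for numerical stability
    return e / e.sum()

probs = softmax(logits)
print(probs.sum())                       # 1.0: a probability over all 41 phonemes
print(np.argsort(probs)[::-1][:5])       # the net's top-5 phoneme guesses
```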
  • Then we tested it on the remaining 16,000 frames to evaluate how well it could do on data
  • it's never seen before. And what was the verdict? Well, I'll let you be the judge!
  • (epic music)
  • OK, it might not be super clear what's going on,
  • so I'll explain.
  • This is the input image the neural network's looking at. These phonemes on the left are the ground truth for what I'm actually saying.
  • >> "Barry, what happened?
  • Wait, I think we were on autopilot the whole time!" >> Or at least what Gentle predicted I was saying. The neural network's
  • goal is to guess this phoneme correctly, but it can't see any of this list.
  • It only sees this image and its frame neighbourhood. With only that to go on, the neural network makes its prediction, shown
  • here. Because it's never 100%
  • confident, it guesses probabilities of each of the 41 phonemes being spoken. Here,
  • we are seeing the neural net's top guesses, ranked from most confident to least.
  • Sometimes, its mind is split between a handful of different phonemes,
  • but other times it's almost certain it knows the truth. Now, the true answer is highlighted in pink,
  • so when the leftmost bar is pink, that means the neural network got it right! This happened for 47% of frames,
  • so that's our final score, if you will. Now, that might sound like a failing grade,
  • but remember this: many phonemes look alike, such as /f/ and /v/.
  • In those cases, the neural network pretty much has to make a coin toss, and if by chance
  • it ends up ranking the correct phoneme as number 2, well, I still think that's a success, but it technically counts as a failure.
  • That was all we could finish by the final project deadline.
  • This was our poster, by the way.
  • But I'm not satisfied, and you shouldn't be satisfied either! You and I both know that this was supposed to be a lip-reading project.
  • "Reading" implies we get words out of it, and this algorithm clearly isn't giving us any words.
  • In fact, this bar graph thingy doesn't seem to have any uses for humans... at all.
  • So, nine months after we submitted our project, I set out to finish what we'd started.
  • Alone. Because James WoMa had died.
  • (more melancholic music)
  • Just kidding. I set out to create what we had initially envisioned: a tool that could read aloud
  • what somebody was saying by just looking at their lips. It was actually pretty easy!
  • All I had to do was download the CMU Pronouncing Dictionary, which describes every word's
  • pronunciation in CMU phonemes. Then, I matched up each
  • pronunciation with the phoneme probabilities across time that the neural network outputted, to give each word a
  • probability score. Then, I just chose the highest-scoring word, moved the playhead to the end of that word, and repeated the process.
  • After running this process through ten minutes of a test set, it outputted a script of around 800
  • time-aligned words, so I just used the Google text-to-speech Python library to read them aloud. Want to hear it? Go ahead!
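That greedy decoding loop can be sketched in a few lines. Everything here is a toy: the two-word dictionary stands in for the CMU Pronouncing Dictionary, and the probability matrix stands in for the neural network's per-frame output:

```python
import numpy as np

phonemes = ["b", "iy", "m", "uw", "v"]
dictionary = {"bee": ["b", "iy"], "movie": ["m", "uw", "v", "iy"]}

# Fake network output: one probability column per frame (rows = phonemes).
spoken = ["b", "iy", "m", "uw", "v", "iy"]          # what was "really" said
probs = np.full((len(phonemes), len(spoken)), 0.05)
for t, p in enumerate(spoken):
    probs[phonemes.index(p), t] = 0.8

def score(word_phones, start):
    # Average probability the net assigned to this pronunciation at this position.
    if start + len(word_phones) > probs.shape[1]:
        return -np.inf
    return float(np.mean([probs[phonemes.index(p), start + i]
                          for i, p in enumerate(word_phones)]))

# Pick the best word, advance the playhead past it, repeat.
playhead, decoded = 0, []
while playhead < probs.shape[1]:
    best = max(dictionary, key=lambda w: score(dictionary[w], playhead))
    decoded.append(best)
    playhead += len(dictionary[best])
print(decoded)   # ['bee', 'movie']
```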
  • So for comparison, I'm gonna give you the actual first two lines of the test set.
  • >> "That's the bee way! We're not made of jello." >> Now, just looking at my lips,
  • what did the neural network think I said? >> "That a be weigh. Will upp matte fuss
  • aloe." >> I think it did pretty well on the first four words given that 'T' and 'S' look identical, and the other two mismatches are
  • literally homophones. I don't think I could have asked for anything better.
  • But for the second half, not so much. >> "We are not made of jello." >> "Will up matte fuss aloe."
  • Distinctive phonemes like the 'M' in the middle and the 'LO' at the end were understood, but not much else was.
  • Notice that it mixed up the 'RE' in "we're" as an 'L' in "will". So,
  • even though we tried to avoid racial stereotypes in the beginning, eventually, um, it caught up t— actually, nevermind,
  • let me just jump to another example;
  • this one's a famous quote from the movie. >> "Thinking bee. Thinking bee. Thinking bee." >> "Thing make ee. Thing ee be. Thing be."
  • Haha. It's funny because I got the first syllable of thinking right each time,
  • but the real takeaway from this clip is to never lick your lips! When I unnecessarily press my lips together after the first 'bee'
  • silently, the lip reader misinterpreted that as an 'm'. (My dermatologist was right all along.)
  • >> "Of course, I saw the flower."
  • >> "Eye of owns, es unfair." >> "That was genius!" >> "That wish tee is!" >> "Thank you, but we're not done yet."
  • >> "They you owe, bore Tai Allah tan."
  • >> "Listen, everyone" >> "This, everyone." >> "This runway is covered with the last pollen." >> "This rowe weigh staph Earl the last paw le."
  • >> At this point, you might be starting to think that my algorithm isn't performing too
  • well, and I agree... but for fairness's sake, let's play a game.
  • I'll show you my lips with no sound and you'll guess what I said. Ready?
  • Not so easy, is it? Here it is again, two more times.
  • Now, it's the AI's turn.
  • >> "Vanessa,
  • mode oohs oven enter!" >> Obviously that wasn't right, but at least it's something. Let's see what the correct answer was.
  • >> "Vanessa, pull yourself together!"
  • >> Oh, hey, look at that! The AI got a three-syllable word right!
  • ...although the AI is kinda cheating, because
  • Vanessa is a main character, whose name appeared a few times in the training set, and you might not have known that.
  • ...you didn't know that? She's that woman. She likes jazz.
  • Let's play the game again, but on a less cherry-picked line, like this one. I'll play it twice.
  • Here's what the AI thought that was. >> "After
  • item omit?" "After item omit?" >> That looks like it matches my mouth pretty well,
  • but what was the true answer? >> "Have you got a moment?"
  • "Have you got a moment?"
  • >> Ehh, both of you did pretty bad. Boring. Something fun
  • we can do is
  • simultaneously play the real and AI
  • audio in different ears, and even though they're quite different,
  • you can sorta hear how they say open and closed vowels at the same time.
  • (simultaneous audio) [Transcriber note: I will not transcribe this section; the words spoken are already highlighted on the video.]
  • Well, that was a little overwhelming.
  • But if I'm being honest, making machine learning projects like this is fun! Speaking from experience though,
  • it's hard to get started,
  • especially when you don't have the option of taking a college course about machine learning. Now, if you haven't heard of it before,
  • [paid promotion] Brilliant.org is a great alternative! They cover a wide variety of STEM fields,
  • including AI, of course.
  • And the way they present the material is a lot more beginner-friendly
  • than the lists of formulas you tend to see on AI sites. You get to answer hands-on questions that
  • gradually challenge you more and more. >> I think it tests... >> Oh, that's perfect! Yeah, it does test like then.
  • >> Boom! >> And personally, that enticed me to keep going! Here's one thing I really like about Brilliant.
  • There's a community producing a constant stream of new problems and solutions, and if you're adventurous enough, you can dive in!
  • Do you know the answer to this? Well, I do!
  • >> Wow, we're the first to get this right! >> Wow!
  • >> No, no, no, no, there's no way!
  • >> Man, it feels good to be the first solver. If you want that rush of dopamine,
  • head over to brilliant.org/carykh and sign up for free now!
  • The first 200 people to sign up will have their annual
  • premium subscription cost decimated twice. In other words, it drops by twenty percent.
  • So, that's pretty cool! And thanks to Brilliant.org for sponsoring this video.
  • What a brilliant idea! [end of paid promotion] But here are some of my final thoughts about the whole video.
  • Overall, I'd say this is in my top three favourite AI projects. The final algorithm clearly isn't capable of figuring out what I'm saying,
  • but it's still fun to see its predictions, and I'm pleased with how often those predictions do align visually with my lip motions.
  • There's more that could be done, though. This neural net is only trained on my mouth with specific lighting and exaggerated lip motions.
  • So, it's not going to work on every person's mouth.
  • That is, unless, I train it on a more diverse data set, which I don't think should be too hard because image classification
  • is getting really good these days.
  • Furthermore, the word selection at the end just chooses words in isolation.
  • It's not taking into account grammar and context, and perhaps we could get it to perform better
  • if it did know these things, such as after the word "the", there should always be a noun!
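One cheap version of that grammar idea is to rescore candidate words with a bigram prior given the previous word. The probabilities below are entirely made up for illustration:

```python
# Multiply each candidate's lip-reading score by a (hypothetical) bigram prior.
bigram = {("the", "flower"): 0.6, ("the", "fly"): 0.3, ("the", "of"): 0.01}

def rescored(prev_word, candidates):
    # candidates: word -> lip-reading score from the decoder
    return max(candidates,
               key=lambda w: candidates[w] * bigram.get((prev_word, w), 0.05))

print(rescored("the", {"of": 0.9, "flower": 0.5}))   # 'flower', despite its lower lip score
```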
  • What's fun about this is that we can further tweak the vocabulary to lean towards more meme-y words or academic words or political words.
  • I don't know. That sounds like it could be dangerous though!
  • Like, if you put hateful words onto a legit video of a political leader speaking,
  • even though I highly doubt anyone would believe it, it still wouldn't be very nice.
  • But it's not all bad applications, because maybe this tool could be used for good, too,
  • like helping the hearing-impaired figure out what words are being said or recovering damaged videos that have lost their audio components.
  • I'd only trust this once the accuracy goes way up, because right now it's abysmally low. Alright, two last things before I go.
  • One— I uploaded most of the code for this project to a GitHub link in the description,
  • although the lip reader probably won't work for you because I'm not gonna upload my data set of
  • 126,000 images, but you can get a rough picture of how I did things.
  • Two— you might have noticed that Liza animated the first few minutes of this video!
  • It was really fun working with her, and you should follow her on Instagram. Related to that,
  • I think I could release videos like these faster if I had more animators.
  • So, if you want to work with me on these
  • projects, reach out to me in my Twitter DMs or something. If you send a 20-second video of your animation set to my voice,
  • perhaps this animated clip of me talking here, I can gauge if our styles match up or not.
  • Oh, also, thanks for a great 2018! Happy New Year, and I'll see you in 2019!
  • Anyway, that's all I've got to say for now.
  • I think I've turned off all the programs running on my computer, but I'm not sure.
  • Ehh, doesn't matter because in the law— >> "...eye gunmen aids manager..."
  • >> Whoa, AI, didn't know you were still on, but geez, can you tone it down? I want this video to stay monetized!
  • >> "...fur hi pee gay slick they ee..."
  • >> Please stop that kind of language right now! >> "...this tuohy MIT (censored) know binding
  • ee ay uhh ma utt that..." >> How do I get this stupid thing off‽
  • >> "...low hannah lower alright..." >> Oh! What are you and Hannah getting up to? >> What the heck? I didn't even say that!
  • >> I saw your lips
  • mouthing it! >> Oh, for Pete's sake—!



Check out Brilliant.org for fun STEMmy courses online! First 200 people to sign up here get 20% off their annual premium subscription cost: https://brilliant.org/CaryKH/

Thanks to Liza for animating the beginning of this video: https://www.instagram.com/lizadesya/?hl=en

GitHub repo for this project:
(I haven't uploaded all files here yet, especially the December ones. They'll be coming soon!)

James WoMa's channel:

If you wanna help animate for my videos, here's my Twitter I suppose: https://twitter.com/realCarykh

Raw output of the lip-reading AI: /watch?v=7LCUN7tMbRI

Original video of me reading the Bee Movie Script: /watch?v=AJCfgXhA5fc

Gentle Python library: https://github.com/lowerquality/gentle

Adam Geitgey's face recognition Python library: https://github.com/ageitgey/face_recognition

*** MUSIC ***
Everything here is licensed under a Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/)

Lee Rosevere - Wireless

Lobo Loco - Railroad (ID 1003)

Nikolai Rimsky-Korsakov - Flight of the Bumblebee (surprisingly fitting. It's also in the public domain)

BODYSURFER - Call Your Grandma

"Childhood Memories of Winter" from: Music4YourVids.co.uk

"Skyline" by JujuMas

Long Road Ahead by Kevin MacLeod (incompetech.com)
Licensed under Creative Commons: By Attribution 3.0 License

Final Count by Kevin MacLeod (incompetech.com)
Licensed under Creative Commons: By Attribution 3.0 License

Sippie Jepper - Branchless

Song: Fredji - Happy Life (Vlog No Copyright Music)
Music provided by Vlog No Copyright Music.
Video Link: /watch?v=KzQiRABVARk