
AI Learns to Write Rap Lyrics!

806K+ views   |   34K+ likes   |   1K+ dislikes
16:03   |   Mar 09, 2019


Transcription

  • [computer-generated gibberish]
  • Yeah, alright, so hi everybody, it's me Cary /khhh/
  • Now, I've always thought of myself as a musical person
  • [loud singing, recorder screeching, and rubber chicken shrieking]
  • Isn't it amazing?
  • [sigh] No. No, Cary, that isn't amazing.
  • Anyway, given that I've used AI to compose Baroque music
  • [Computery's Baroque music]
  • And I've used AI to compose jazz music
  • [Computery's jazz music]
  • I think it just makes sense for me to fast-forward the musical clock another 60 years to compose some rap music
  • But before I do that,
  • I gotta give credit to Siraj Raval, who actually did this first.
  • homie grows on E like Leone totin inspired enough
  • But you know what they say: No rap battle is complete without two contenders
  • So what did I do to build my own digital rap god?
  • Well, I used Andrej Karpathy's recurrent neural network code again
  • An RNN is just an ordinary neural network
  • But we give it a way to communicate with its future self with this hidden state meaning it can store memory.
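A single step of such a network can be sketched in plain Python. The weight names `Wxh`, `Whh`, and bias `bh` are illustrative, not taken from Karpathy's code:

```python
import math

def rnn_step(x, h, Wxh, Whh, bh):
    """One step of a vanilla RNN: the new hidden state mixes the current
    input x with the previous hidden state h, so information can persist
    across time steps -- the 'memory' described above."""
    return [math.tanh(sum(w * xi for w, xi in zip(Wxh[i], x)) +
                      sum(w * hi for w, hi in zip(Whh[i], h)) +
                      bh[i])
            for i in range(len(h))]

# tiny example: 1 input unit, 2 hidden units, zero recurrent weights
h = rnn_step([1.0], [0.0, 0.0],
             Wxh=[[0.5], [-0.5]],
             Whh=[[0.0, 0.0], [0.0, 0.0]],
             bh=[0.0, 0.0])
```

Feeding `h` back in on the next character is what lets the network remember things like an unclosed bracket.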
  • Now I've done this countless times before so I won't dive too deep into what an RNN is
  • Instead I want to focus more on a twist I implemented that makes this quote/unquote "algorithm" more musical.
  • Before I do that though
  • I need to introduce you to Dave from boyinaband.
  • He's, um, a tad bit good at rapping I guess
  • [definitely more than "a tad bit good" rapping]
  • So when I first trained Karpathy's RNN to generate rap lyrics in 2017
  • I invited him over to read the lyrics my algorithm had written
  • but then I lost the footage and then he lost the footage and
  • Well, long story short, there's no footage of it ever happening. That made me bummed for a bit
  • But then I realized this could be interpreted as a sign from above
  • Perhaps the AI prevented us humans from rapping its song because it wanted to do the rap itself!
  • Well Computery if you insist.
  • To give Computery a voice,
  • I downloaded this Python module that lets us use Google's text-to-speech software directly
  • I'm pretty sure you've heard this text-to-speech voice before.
  • Now, as we hear Computery's awesome rap
  • I'm gonna show the lyrics on screen. If you're up for it, you viewers out there can sing along too!
  • Alright, let's drop this track
  • Wait, why aren't you singing along?
  • WHY AREN'T YOU-
  • The reason it performed so badly is because it hasn't had any training data to learn from.
  • So let's go find some training data. With my brother's help,
  • I used a large portion of the Original Hip-Hop Lyrics Archive as my data set to train my algorithm on.
  • This includes works by rap giants like Kendrick Lamar and Eminem
  • We stitched around 6,000 songs into one giant text file
  • (Separated with line breaks) to create our final data set of 17 million text characters
  • Wait, that's only 17 megabytes. A single 4-minute video typically takes up more space than that.
  • Yeah, it turns out that text as a data type is incredibly dense.
  • You can store a lot of letters in the same amount of space as a short video. Let's see what the algorithm learned.
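That density claim is easy to check: in ASCII, one character is one byte. The video bitrate used for comparison below is a rough assumption, not a figure from the video:

```python
chars = 17_000_000                  # characters in the stitched lyrics file
size_mb = chars * 1 / 1_000_000     # ASCII text: 1 byte per character -> 17 MB

# for comparison: 4 minutes of compressed video at an assumed ~1 MB/s
video_mb = 4 * 60 * 1.0             # already 240 MB for one short video
```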
  • Okay, ready? Go-stop
  • As you can see, after just 200 milliseconds, less than a blink of an eye,
  • It learned to stop putting spaces everywhere
  • In the data set, you'll rarely see more than two spaces in a row
  • So it makes sense that the AI would learn to avoid doing that too
  • However, I can see it still putting in uncommon patterns like double I's and capital letters in the middle of words
  • So let's keep training to see if it learns to fix that
  • We're half a second into training now and the pesky double I's seem to have vanished
  • The AI has also drastically shortened the length of its lines.
  • But behind the scenes, that's actually caused by an increase in the frequency of the line break character.
  • For the AI, the line break is just like any other text character
  • However, to match the data set
  • we need a good combination of both line breaks and spaces
  • Which we actually get in the next iteration!
  • And here we see the AI's first well-formatted word: "it"
  • Wait, does "eco" count as a word? Not sure about that.
  • Oh my god, you guys. Future Cary here
  • I realize that's not an uppercase I, it's a lowercase L. Major 2011 vibes.
  • Now at one full second into training,
  • We see the AI has learned that commas are often not followed by letters directly
  • There should be a space or a line break afterwards.
  • By the way, the average human reads at 250 words per minute
  • So a human learning how to rap alongside the AI has currently read...
  • Four words.
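The arithmetic behind that joke, assuming roughly one second of training elapsed so far:

```python
words_per_minute = 250          # average human reading speed, per the video
training_seconds = 1            # assumed elapsed training time
words_read = words_per_minute / 60 * training_seconds   # just over 4 words
```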
  • I'm gonna let it run in the background as I talk about other stuff
  • So one thing I keep getting asked is "what is loss?"
  • Basically, when a neural network makes a guess about what the next letter is gonna be,
  • it assigns a probability to each letter type
  • And loss just measures how far away those probabilities were from the true answer given by the data set on average
  • So lower loss usually means the model can predict true rap lyrics better
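That description matches cross-entropy loss: the negative log of the probability the network assigned to the true next character, averaged over the data set. A minimal per-character sketch:

```python
import math

def char_loss(probs, true_char):
    # how 'surprised' the model was by the true character:
    # 0 when it predicted it with certainty, large when it gave it low odds
    return -math.log(probs[true_char])

confident = {"a": 0.9, "b": 0.05, "c": 0.05}
uncertain = {"a": 0.34, "b": 0.33, "c": 0.33}
```

A confident, correct guess gets a much lower loss than a near-uniform one, which is why the loss curve falling means the model predicts real lyrics better.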
  • Now I'm playing the training time-lapse 10 times faster
  • The loss function actually held pretty constant for the first 18 seconds
  • Then it started to drop.
  • That big drop corresponds to the text looking much more English,
  • With the lines finally beginning to start with capital letters (took long enough)
  • And common words like "you," "I," and "the" making their first appearance
  • By 54 seconds I'd say about half of the words are real
  • So rudimentary grammar rules can start forming
  • "Of the" is one of the most common bigrams in the English language, and here it is.
  • Also, apostrophes are starting to be used for contractions and we're seeing the origins of one word interjections
  • Over a minute in, we see the square bracket format start showing up.
  • In the data set, square brackets were used to denote which rapper was speaking at any given time
  • So that means our baby AI's choice of rappers are Guhe Comi, Moth, and Berse Dog Rlacee
  • I also want to quickly point out how much doing this relies on the memory I described earlier.
  • As Andrej's article shows, certain neurons of the network
  • have to be designated to fire only when you're inside the brackets to remember that you have to close them
  • at some point to avoid bracket imbalance
  • [sigh] Okay, this is the point in the video where I have to discuss swear words
  • I know a good chunk of my audience is children. So typically I'd censor this out
  • However, given the nature of our rap data set
  • I don't think it's possible to accurately judge the neural network's performance if we were to do that.
  • Besides I've included swears in my videos before; people just didn't notice.
  • But that means if you're a kid under legal swearing age, I'm kindly asking you to leave to preserve your precious ears
  • But if you won't leave I'll have to scare you away
  • Ready?
  • Shit [gasp] fuck [GASP] bitch [GASP!] Peter Ruette [AAAH!]
  • But with that being said,
  • There is one word that's prevalent in raps that- ah- that I don't think I'm in the position to say and- ah
  • Dang it. Why is this glue melting? Okay. Well, I'm pretty sure we all know what word I'm talking about
  • So in the future, I'm just going to replace all occurrences of that word with ninja
  • After two minutes, it's learned to consistently put two line breaks in between stanzas
  • and the common label "chorus" is starting to show up (correctly)
  • Also, did you notice the mysterious line "Typed by OHHLA webmaster DJ Flash"?
  • That doesn't sound like a rap lyric! Well, it's not.
  • It appeared 1172 times in the data set as part of the header of every song that the webmaster transcribed.
  • Now over the next 10 minutes the lyrics gradually got better
  • It learned more intricate grammar rules like that "motherfuckin'" should be followed by a noun,
  • but the improvements became less and less significant
  • So what you see around 10 minutes is about as good as it's gonna get
  • After all, I set the number of synapses to a constant 5 million
  • And there's only so much information you can fit into 5 million synapses
  • Anyway, I ran the training overnight and got it to produce this 600-line file
  • If you don't look at it too long, you could be convinced they're real lyrics
  • Patterns shorter than a sentence are replicated pretty well
  • But anything longer is a bit iffy
  • There are a few one-liners that came out right, like "now get it off" and "if you don't give a fuck about me"
  • The lines that are a little wonky like "a bust in the air" could be interpreted as poetic
  • Oh, I also like it when it switches into shrieking mode
  • But anyway, we can finally feed this into Google's text-to-speech to hear it rap once and for all
  • Hold on! That was actually pretty bad.
  • The issue here is we gave our program no way to implement rhythm
  • Which in my opinion is the most important element to making a rap flow.
  • So, how do we implement this rhythm?
  • Well, this is the twist I mentioned earlier in the video.
  • There are two methods. Method one would be to manually time-stretch and time-squish syllables
  • To match a pre-picked rhythm using some audio editing software
  • For this I picked my brother's song "3000 subbies"
  • And I also used Melodyne to auto-tune each syllable to the right pitch. So it's more of a song.
  • Although that's not required for rap.
  • So how does the final result actually sound? I'll let you be the judge
  • I think that sounded pretty fun and I'm impressed with Google's vocal range.
  • However, it took me two hours to time align everything
  • And the whole reason we used AI was to have a program to automatically generate our rap songs.
  • So we've missed the whole point!
  • That means we should focus on method two: automatic algorithmic time alignment.
  • How do we do that?
  • Well firstly notice that most rap background tracks are in the time signature 4/4 or some multiple of it
  • Subdivisions of beats as well as full stanzas also come in powers of two
  • So all rhythms seem to depend closely on this exponential series
  • My first approach was to detect the beginning of each spoken syllable
  • And quantize or snap that syllable to the nearest half beat
  • That means syllables will sometimes fall on the beat
  • just. like. this.
  • But even if it fell off the beat we'd get cool syncopation, just. like. this. which is more groovy
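Snapping an onset time to the nearest half beat is one line of arithmetic. The 120 BPM default here is a made-up example tempo, not taken from the video:

```python
def quantize(t, bpm=120, subdivision=2):
    """Snap a time t (in seconds) to the nearest 1/subdivision of a beat.
    At 120 BPM with subdivision=2, the grid is every 0.25 s."""
    grid = 60.0 / bpm / subdivision
    return round(t / grid) * grid
```

A syllable at 0.26 s snaps onto the beat at 0.25 s; one at 0.62 s snaps to the off-beat at 0.5 s, giving the syncopation described above.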
  • Does this work? Actually, no.
  • Because it turns out detecting the beginning of syllables from waveforms is not so easy.
  • Some sentences, like "come at me, bro"
  • Are super clear, but others like
  • "Hallelujah our auroras are real"
  • Are not so clear.
  • And I definitely don't want to have to use phoneme extraction. It's too cumbersome
  • So here's what I actually did: I cut corners
  • Listening to lots of real rap,
  • I realized the most important syllables to focus on were the first and last syllables of each line
  • Since they anchor everything in place
  • The middle syllables can fall haphazardly
  • And the listener's brain will hopefully find some pattern in there to cling to
  • Fortunately human brains are pretty good at finding patterns where there aren't any
  • So to find where the first syllable started,
  • I analyzed where the audio amplitude first surpassed 0.2
  • And for the last syllable I found when the audio amplitude last surpassed 0.2 and literally subtracted a fifth of a second from it
  • That's super janky and it doesn't account for these factors, but it worked in general
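The heuristic can be sketched directly, assuming the samples are normalized to [-1, 1]:

```python
def landmarks(samples, rate, threshold=0.2):
    """Crude first/last-syllable finder, as described in the video:
    the first time the amplitude exceeds the threshold, and the last
    such time minus a fifth of a second."""
    first = next(i for i, s in enumerate(samples) if abs(s) > threshold)
    last = len(samples) - 1 - next(
        i for i, s in enumerate(reversed(samples)) if abs(s) > threshold)
    return first / rate, last / rate - 0.2
```

It really is janky -- a breath or a click above 0.2 would fool it -- but for clean text-to-speech audio it finds the anchors well enough.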
  • From here I snapped those two landmarks to the nearest beat
  • Time-dilating or contracting as necessary
  • Now if you squish audio the rudimentary way, you also affect its pitch, which I don't want.
  • So I instead used the phase vocoder of the Python library Audio TSM to edit timing without affecting pitch
  • Now instead of this:
  • We get this:
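To see why the phase vocoder matters: squishing audio "the rudimentary way" means resampling, which shifts pitch along with duration. A toy illustration of that naive approach (the one the video avoids):

```python
def naive_stretch(samples, factor):
    """Resample by index: factor=2 doubles the length, but played back at
    the same rate every frequency drops an octave -- exactly the artifact
    a phase vocoder is designed to avoid."""
    n = int(len(samples) * factor)
    return [samples[min(int(i / factor), len(samples) - 1)] for i in range(n)]
```

A phase vocoder instead works on short overlapping windows in the frequency domain, changing how fast the windows advance while keeping each window's pitch content intact.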
  • That's pretty promising. We're almost at my final algorithm, but there's one final fix.
  • Big downbeats, which occur every 16 normal beats, are especially important
  • Using our current method,
  • Google's TTS will just run through them like this:
  • Not only is that clunky, it's just plain rude
  • So I added a rule that checks if the next book-end line will otherwise run through the big downbeat.
  • And if so, it will instead wait for that big downbeat to start before speaking.
  • This is better, but we've also created awkward silences.
  • So to fix that I introduced a second speaker
  • When speaker one encounters an awkward silence,
  • Speaker two will fill in by echoing the last thing speaker one said, and vice versa.
  • What we get from this is much more natural.
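The big-downbeat rule can be sketched as a tiny scheduler. This is a hypothetical sketch: the line length in beats is invented for illustration, and the real system operates on time-stretched audio rather than abstract beat counts:

```python
def schedule(lines, line_beats=5, big_every=16):
    """If the next line would run through a big downbeat (every 16 beats),
    wait and start it on that downbeat instead."""
    t, out = 0, []
    for line in lines:
        boundary = (t // big_every + 1) * big_every  # next big downbeat
        if boundary < t + line_beats:                # line would straddle it
            t = boundary                             # wait for the downbeat
        out.append((t, line))
        t += line_beats
    return out
```

Each wait leaves a gap (here, beat 15 to beat 16), which is where the second speaker's echo would be dropped in to keep things from sounding empty.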
  • Alright, so that's pretty much all I did for a rhythm alignment, and it vastly improves the flow of our raps.
  • I think it's time for you to hear a full-blown song that this algorithm generated.
  • Are you ready to experience Computery's first single?
  • I know I sure am.


Description

o wow, this is the exact 2-year anniversary of my baroque-AI video!

AI baroque music: /watch?v=SacogDL_4JU
AI jazz music: /watch?v=nA3YOFUCn4U

Siraj Raval: https://www.youtube.com/channel/UCWN3xxRkmTPmbKwht9FuE5A
Boyinaband: https://www.youtube.com/channel/UCQ4FyiI_1mWI2AtLS5ChdPQ
Peter Ruette: https://www.youtube.com/channel/UCL2A9Ncbz1BAb4dCXLwY2dg

Source code: https://github.com/carykh/rapLyrics

Drone footage in the back of the song at the end:
/watch?v=xTaND4JKVv4

Music:
*** MUSIC ***
Everything here is licensed under a Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/)

Lee Rosevere - Wireless
http://freemusicarchive.org/search/?sort=track_date_published&d=1&quicksearch=lee+wireless

Lobo Loco - Railroad (ID 1003)
http://freemusicarchive.org/search/?sort=track_date_published&d=1&quicksearch=lobo+loco+railroad

"Rushhour 1, 2, and more!" - Cary Huang (2005)
/watch?v=B7oejTqJoPE

Music: "Permafrost"
Music by Dan-O at DanoSongs.com
http://www.danosongs.com/#music

"Skyline" by JujuMas
I believe this is the right JujuMas? Not sure though:
https://www.youtube.com/channel/UC9x3tPxNuXC-38CqcxY1yFA/featured

Background tracks:
RAP BEAT by Made2Make (intro and rap tests before any time alignment)
/watch?v=5Ys4AbHDtMI

Sadness by OZsound ("Just tell me..." quantized section)
/watch?v=OdEwOyQt4JE

Track for 2-minute "full song" at the end:
/watch?v=NoSKBTZfI84