AI Learns to Write Rap Lyrics!

806K+ views   |   34K+ likes   |   1K+ dislikes
16:03   |   Mar 09, 2019




  • [computer-generated gibberish]
  • Yeah, alright, so hi everybody, it's me Cary /khhh/
  • Now, I've always thought of myself as a musical person
  • [loud singing, recorder screeching, and rubber chicken shrieking]
  • Isn't it amazing?
  • [sigh] No. No, Cary that isn't amazing.
  • Anyway given that I've used AI to compose Baroque music
  • [Computery's Baroque music]
  • And I've used AI to compose jazz music
  • [Computery's jazz music]
  • I think it just makes sense for me to fast-forward the musical clock another 60 years to compose some rap music
  • But before I do that,
  • I gotta give credit to Siraj Raval, who actually did this first.
  • homie grows on E like Leone totin inspired enough
  • But you know what they say: No rap battle is complete without two contenders
  • So what did I do to build my own digital rap god?
  • Well, I used Andrej Karpathy's recurrent neural network code again
  • An RNN is just an ordinary neural network
  • But we give it a way to communicate with its future self with this hidden state meaning it can store memory.
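That hidden-state handoff is the whole trick. Here's a minimal sketch of a single vanilla RNN step in plain Python (toy sizes and made-up weights for illustration, not Karpathy's actual code):

```python
import math

def rnn_step(x, h_prev, Wxh, Whh, bh):
    """One step of a vanilla RNN: the new hidden state mixes the current
    input x with the previous hidden state h_prev, which is how the
    network carries memory of earlier characters forward."""
    h_new = []
    for i in range(len(h_prev)):
        s = bh[i]
        for j in range(len(x)):
            s += Wxh[i][j] * x[j]        # contribution from current input
        for j in range(len(h_prev)):
            s += Whh[i][j] * h_prev[j]   # contribution from past state
        h_new.append(math.tanh(s))
    return h_new

# Toy setup: 2-dim one-hot "characters", 2-dim hidden state.
Wxh = [[0.5, -0.2], [0.1, 0.3]]
Whh = [[0.4, 0.0], [-0.1, 0.2]]
bh  = [0.0, 0.1]
h = [0.0, 0.0]
for x in ([1, 0], [0, 1]):   # feed two characters in sequence
    h = rnn_step(x, h, Wxh, Whh, bh)
print(h)  # the final state depends on BOTH characters, not just the last
```

The same input character produces a different hidden state depending on what came before it, which is exactly the "memory" being described.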
  • Now I've done this countless times before so I won't dive too deep into what an RNN is
  • Instead I want to focus more on a twist I implemented that makes this quote/unquote "algorithm" more musical.
  • Before I do that though
  • I need to introduce you to Dave from boyinaband.
  • He's, um, a tad bit good at rapping I guess
  • [definitely more than "a tad bit good" rapping]
  • So when I first trained Karpathy's RNN to generate rap lyrics in 2017
  • I invited him over to read the lyrics my algorithm had written
  • but then I lost the footage and then he lost the footage and
  • Well, long story short, there's no footage of it ever happening. That made me bummed for a bit
  • But then I realized this could be interpreted as a sign from above
  • Perhaps the AI prevented us humans from rapping its song because it wanted to do the rap itself!
  • Well Computery if you insist.
  • To give Computery a voice,
  • I downloaded this Python module that lets us use Google's text-to-speech software directly
  • I'm pretty sure you've heard this text-to-speech voice before.
  • Now, as we hear Computery's awesome rap
  • I'm gonna show the lyrics on screen. If you're up for it, you viewers out there can sing along too!
  • Alright, let's drop this track
  • Wait, why aren't you singing along?
  • The reason it performed so badly is because it hasn't had any training data to learn from.
  • So let's go find some training data. With my brother's help,
  • I used a large portion of the Original Hip-Hop Lyrics Archive as my data set to train my algorithm on.
  • This includes works by rap giants like Kendrick Lamar and Eminem
  • We stitched around 6,000 songs into one giant text file
  • (Separated with line breaks) to create our final data set of 17 million text characters
  • Wait, that's only 17 megabytes. A single 4-minute video typically takes up more space than that.
  • Yeah, it turns out that text as a data type is incredibly dense.
  • You can store a lot of letters in the same amount of space as a short video. Let's see what the algorithm learned.
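The stitching and the size claim are easy to verify: ASCII text stores one character per byte, so 17 million characters really is about 17 MB. A tiny sketch with two stand-in songs (the real data set had ~6,000):

```python
# Stand-ins for the ~6,000 songs stitched into one training file.
songs = ["First song lyrics here", "Second song lyrics here"]
dataset = "\n".join(songs)  # separated with line breaks

# ASCII text is one byte per character, so character count == byte count.
print(len(dataset), "characters")
print(len(dataset.encode("utf-8")), "bytes")
```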
  • Okay, ready? Go-stop
  • As you can see, after just 200 milliseconds (less than a blink of an eye)
  • it learned to stop putting spaces everywhere.
  • In the data set, you'll rarely see more than two spaces in a row
  • So it makes sense that the AI would learn to avoid doing that too
  • However, I can see it still putting in uncommon patterns like double I's and capital letters in the middle of words
  • So let's keep training to see if it learns to fix that
  • We're half a second into training now and the pesky double I's seem to have vanished
  • The AI has also drastically shortened the length of its lines.
  • But behind the scenes, that's actually caused by an increase of the frequency of the line break character.
  • For the AI, the line break is just like any other text character
  • However, to match the data set
  • we need a good combination of both line breaks and spaces
  • Which we actually get in the next iteration!
  • And here we see the AI's first well-formatted word: "it"
  • Wait, does "eco" count as a word? Not sure about that.
  • Oh my god, you guys. Future Cary here
  • I realize that's not an uppercase I, it's a lowercase L. Major 2011 vibes.
  • Now at one full second into training,
  • We see the AI has learned that commas are often not followed by letters directly
  • There should be a space or a line break afterwards.
  • By the way, the average human reads at 250 words per minute
  • So a human learning how to rap alongside the AI has currently read...
  • Four words.
  • I'm gonna let it run in the background as I talk about other stuff
  • So one thing I keep getting asked is "what is loss?"
  • Basically, when a neural network makes a guess about what the next letter is gonna be,
  • it assigns a probability to each letter type
  • And loss just measures how far away those probabilities were from the true answer given by the data set on average
  • So lower loss usually means the model can predict true rap lyrics better
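Concretely, that's average cross-entropy: take the probability the model gave to the character that actually came next, and average the negative logs. A minimal sketch (toy 3-letter alphabet, made-up probabilities):

```python
import math

def average_loss(predictions, true_chars):
    """Cross-entropy loss: -log of the probability the model assigned
    to the true next character, averaged over all positions."""
    total = 0.0
    for probs, true_char in zip(predictions, true_chars):
        total += -math.log(probs[true_char])
    return total / len(true_chars)

predictions = [
    {"a": 0.7, "b": 0.2, "c": 0.1},   # model fairly sure "a" comes next
    {"a": 0.1, "b": 0.1, "c": 0.8},   # fairly sure "c" comes next
]
print(average_loss(predictions, ["a", "c"]))  # good guesses -> low loss
print(average_loss(predictions, ["b", "a"]))  # bad guesses -> high loss
```

When every prediction is confident and correct, the loss approaches zero; confidently wrong predictions blow it up.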
  • Now I'm playing the training time-lapse 10 times faster
  • The loss function actually held pretty constant for the first 18 seconds
  • Then it started to drop.
  • That big drop corresponds to the text looking much more English,
  • With the lines finally beginning to start with capital letters (took long enough)
  • And common words like "you," "I," and "the" making their first appearance
  • By 54 seconds I'd say about half of the words are real
  • So rudimentary grammar rules can start forming
  • "Of the" is one of the most common bigrams in the English language, and here it is.
  • Also, apostrophes are starting to be used for contractions and we're seeing the origins of one word interjections
  • Over a minute in, we see the square bracket format start showing up.
  • In the data set, square brackets were used to denote which rapper was speaking at any given time
  • So that means our baby AI's choice of rappers is Guhe Comi, Moth, and Berse Dog Rlacee
  • I also want to quickly point out how much doing this relies on the memory I described earlier.
  • As Andrej's article shows, certain neurons of the network
  • have to be designated to fire only when you're inside the brackets to remember that you have to close them
  • at some point to avoid bracket imbalance
  • [sigh] Okay, this is the point in the video where I have to discuss swear words
  • I know a good chunk of my audience is children. So typically I'd censor this out
  • However, given the nature of our rap data set
  • I don't think it's possible to accurately judge the neural network's performance if we were to do that.
  • Besides I've included swears in my videos before; people just didn't notice.
  • But that means if you're a kid under legal swearing age, I'm kindly asking you to leave to preserve your precious ears
  • But if you won't leave I'll have to scare you away
  • Ready?
  • Shit [gasp] fuck [GASP] bitch [GASP!] Peter Ruette [AAAH!]
  • But with that being said,
  • There is one word that's prevalent in raps that- ah- that I don't think I'm in the position to say and- ah
  • Dang it. Why is this glue melting? Okay. Well, I'm pretty sure we all know what word I'm talking about
  • So in the future, I'm just going to replace all occurrences of that word with ninja
  • After two minutes, it's learned to consistently put two line breaks in between stanzas
  • and the common label "chorus" is starting to show up (correctly)
  • Also, did you notice the mysterious line "Typed by OHHLA webmaster DJ Flash"?
  • That doesn't sound like a rap lyric! Well, it's not.
  • It appeared 1172 times in the data set as part of the header of every song that the webmaster transcribed.
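A header like that could have been filtered out of the data set before training with a simple line filter. A hypothetical cleanup sketch (the exact header wording here is illustrative):

```python
def strip_headers(text):
    """Drop transcription-credit lines (like the OHHLA webmaster header),
    which appear in every song but aren't lyrics."""
    kept = [line for line in text.splitlines()
            if not line.startswith("Typed by")]
    return "\n".join(kept)

song = "Typed by OHHLA webmaster DJ Flash\nFirst real lyric line\nSecond line"
print(strip_headers(song))
```

Left in, a line repeated 1,172 times is a pattern the network will happily memorize, which is exactly what happened.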
  • Now over the next 10 minutes the lyrics gradually got better
  • It learned more intricate grammar rules like that "motherfuckin'" should be followed by a noun,
  • but the improvements became less and less significant
  • So what you see around 10 minutes is about as good as it's gonna get
  • After all, I set the number of synapses to a constant 5 million
  • And there's only so much information you can fit into 5 million synapses
  • Anyway, I ran the training overnight and got it to produce this 600-line file
  • If you don't look at it too long, you could be convinced they're real lyrics
  • Patterns shorter than a sentence are replicated pretty well
  • But anything longer is a bit iffy
  • There are a few one-liners that came out right, like "now get it off" and "if you don't give a fuck about me"
  • The lines that are a little wonky like "a bust in the air" could be interpreted as poetic
  • Oh, I also like it when it switches into shrieking mode
  • But anyway, we can finally feed this into Google's text-to-speech to hear it rap once and for all
  • Hold on! That was actually pretty bad.
  • The issue here is we gave our program no way to implement rhythm
  • Which in my opinion is the most important element to making a rap flow.
  • So, how do we implement this rhythm?
  • Well, this is the twist I mentioned earlier in the video.
  • There are two methods. Method one would be to manually time-stretch and time-squish syllables
  • To match a pre-picked rhythm using some audio editing software
  • For this I picked my brother's song "3000 subbies"
  • And I also used Melodyne to auto-tune each syllable to the right pitch. So it's more of a song.
  • Although that's not required for rap.
  • So how does the final result actually sound? I'll let you be the judge
  • I think that sounded pretty fun and I'm impressed with Google's vocal range.
  • However, it took me two hours to time align everything
  • And the whole reason we used AI was to have a program to automatically generate our rap songs.
  • So we've missed the whole point!
  • That means we should focus on method two: automatic algorithmic time alignment.
  • How do we do that?
  • Well firstly notice that most rap background tracks are in the time signature 4/4 or some multiple of it
  • Subdivisions of beats as well as full stanzas also come in powers of two
  • So all rhythms seem to depend closely on this exponential series
  • My first approach was to detect the beginning of each spoken syllable
  • And quantize or snap that syllable to the nearest half beat
  • That means syllables will sometimes fall on the beat
  • just. like. this.
  • But even if it fell off the beat we'd get cool syncopation, just. like. this. which is more groovy
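The snapping itself is simple arithmetic; here's a minimal sketch of quantizing a syllable's start time to the nearest half beat (tempo and times are made up):

```python
def quantize_to_half_beat(time_sec, bpm):
    """Snap a time (in seconds) to the nearest half beat at a given tempo."""
    beat = 60.0 / bpm     # seconds per beat
    half = beat / 2.0
    return round(time_sec / half) * half

# At 120 BPM a beat is 0.5 s, so half beats fall every 0.25 s.
print(quantize_to_half_beat(0.23, 120))  # -> 0.25 (off-beat: syncopation)
print(quantize_to_half_beat(0.98, 120))  # -> 1.0  (on the beat)
```

The hard part, as the video goes on to explain, isn't the snapping but knowing *which* times to snap.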
  • Does this work? Actually, no.
  • Because it turns out detecting the beginning of syllables from waveforms is not so easy.
  • Some sentences, like "come at me, bro"
  • Are super clear, but others like
  • "Hallelujah our auroras are real"
  • Are not so clear.
  • And I definitely don't want to have to use phoneme extraction. It's too cumbersome
  • So here's what I actually did: I cut corners
  • Listening to lots of real rap,
  • I realized the most important syllables to focus on were the first and last syllables of each line
  • Since they anchor everything in place
  • The middle syllables can fall haphazardly
  • And the listener's brain will hopefully find some pattern in there to cling to
  • Fortunately human brains are pretty good at finding patterns where there aren't any
  • So to find where the first syllable started,
  • I analyzed where the audio amplitude first surpassed 0.2
  • And for the last syllable I found when the audio amplitude last surpassed 0.2 and literally subtracted a fifth of a second from it
  • That's super janky and it doesn't account for these factors, but it worked in general
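The janky heuristic can be sketched directly. This toy version scans a sample list for the first and last amplitude above the threshold (a real waveform would be at 44,100 Hz; 4 samples per second here just keeps the numbers readable):

```python
def line_landmarks(samples, rate, threshold=0.2, tail_trim=0.2):
    """Rough start/end times of a spoken line: first and last samples
    whose amplitude exceeds the threshold, with a fifth of a second
    trimmed off the end to land nearer the last syllable's onset."""
    loud = [i for i, s in enumerate(samples) if abs(s) > threshold]
    if not loud:
        return None
    start = loud[0] / rate
    end = loud[-1] / rate - tail_trim
    return start, end

# Toy "waveform": silence, then speech, then silence.
wave = [0.0, 0.05, 0.5, 0.6, 0.4, 0.3, 0.05, 0.0]
print(line_landmarks(wave, rate=4))
```

It fails on quiet consonants and breathy lines, but for locating just two anchor points per line, it's close enough.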
  • From here I snapped those two landmarks to the nearest beat
  • Time-dilating or contracting as necessary
  • Now if you squish audio the rudimentary way, you also affect its pitch, which I don't want.
  • So I instead used the phase vocoder of the Python library Audio TSM to edit timing without affecting pitch
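The speed factor handed to the time-scale modification step is just the ratio of the clip's actual landmark span to the span it must occupy on the beat grid. A sketch of that calculation (the phase vocoder, e.g. audiotsm's, then applies the factor without shifting pitch):

```python
def stretch_speed(start, end, target_start, target_end):
    """Speed factor mapping a clip's landmarks onto the beat grid:
    > 1 squishes the clip (speeds it up), < 1 dilates it (slows it down)."""
    return (end - start) / (target_end - target_start)

# A 0.55 s line must fit into a 0.5 s slot between beat-grid landmarks.
print(stretch_speed(0.50, 1.05, 0.5, 1.0))  # -> 1.1 (play 10% faster)
```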
  • Now instead of this:
  • We get this:
  • That's pretty promising. We're almost at my final algorithm, but there's one final fix.
  • Big downbeats, which occur every 16 normal beats, are especially important
  • Using our current method,
  • Google's TTS will just run through them like this:
  • Not only is that clunky, it's just plain rude
  • So I added a rule that checks if the next book-end line will otherwise run through the big downbeat.
  • And if so, it will instead wait for that big downbeat to start before speaking.
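That rule amounts to a one-line check against the next big downbeat. A hypothetical sketch of the scheduling logic (times in seconds, tempo made up):

```python
import math

def schedule_line(start, end, beat_sec, beats_per_big=16):
    """If a line would run through the next big downbeat (every 16 beats),
    push its start back so it begins exactly on that downbeat instead."""
    big = beat_sec * beats_per_big
    next_big = math.floor(start / big + 1) * big  # first big downbeat after start
    if end > next_big:
        return next_big   # wait for the big downbeat
    return start          # line fits before it; start as planned

# 0.5 s per beat -> a big downbeat every 8 s.
print(schedule_line(7.0, 9.0, 0.5))  # -> 8.0 (would cross the downbeat)
print(schedule_line(7.0, 7.8, 0.5))  # -> 7.0 (fits before it)
```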
  • This is better, but we've also created awkward silences.
  • So to fix that I introduced a second speaker
  • When speaker one encounters an awkward silence,
  • Speaker 2 will fill in by echoing the last thing speaker one said, and vice versa.
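A simplified sketch of the echo-fill idea (one main rapper plus an echo voice; the real version alternates both directions, and the gap flags here are made up):

```python
def fill_gaps(lines, gap_before):
    """Speaker 1 raps every line; wherever an awkward silence would occur
    before a line, speaker 2 fills it by echoing the last word speaker 1
    said."""
    schedule = []
    for line, gap in zip(lines, gap_before):
        if gap and schedule:
            echo = schedule[-1][1].split()[-1]   # last word just rapped
            schedule.append(("speaker 2", echo))
        schedule.append(("speaker 1", line))
    return schedule

lines = ["now get it off", "a bust in the air"]
print(fill_gaps(lines, gap_before=[False, True]))
```

The echo lands in the dead air between lines, turning a flaw of the scheduling rule into a call-and-response effect.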
  • What we get from this is much more natural.
  • Alright, so that's pretty much all I did for rhythm alignment, and it vastly improves the flow of our raps.
  • I think it's time for you to hear a full-blown song that this algorithm generated.
  • Are you ready to experience Computery's first single?
  • I know I sure am.



o wow, this is the exact 2-year anniversary of my baroque-AI video!

AI baroque music: /watch?v=SacogDL_4JU
AI jazz music: /watch?v=nA3YOFUCn4U

Siraj Raval: https://www.youtube.com/channel/UCWN3xxRkmTPmbKwht9FuE5A
Boyinaband: https://www.youtube.com/channel/UCQ4FyiI_1mWI2AtLS5ChdPQ
Peter Ruette: https://www.youtube.com/channel/UCL2A9Ncbz1BAb4dCXLwY2dg

Source code: https://github.com/carykh/rapLyrics

Drone footage in the back of the song at the end:

*** MUSIC ***
Everything here is licensed under a Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/)

Lee Rosevere - Wireless

Lobo Loco - Railroad (ID 1003)

"Rushhour 1, 2, and more!" - Cary Huang (2005)

Music: "Permafrost"
Music by Dan-O at DanoSongs.com

"Skyline" by JujuMas
I believe this is the right JujuMas? Not sure though:

Background tracks:
RAP BEAT by Made2Make (intro and rap tests before any time alignment)

Sadness by OZsound ("Just tell me..." quantized section)

Track for 2-minute "full song" at the end: