/bat

🦇 Beat As Token - a transformer architecture for Western music theory

bat 🦇

A Beat As Token transformer architecture designed from scratch to fail to enhance beat token embeddings with pairwise beat comparisons.

Idea

  • Let's train BERT since we don't want to generate anything.
  • Let's take a MIDI file. If you don't have one, take your audio, run demucs + basic-pitch and pack tracks into one
  • Run beat tracking to know exact measure/beat onsets
  • Discard your file if it's not 4/4
  • Write a function that gives you all notes participating in a single beat
  • Now write a function that extracts binary features from a single beat. The more the better. Think relatively. Hey, music isn't about absolute pitches.
    • Is there a bass onset at this beat?
    • Eleven features for presence of all intervals above the bass?
    • Does the melody have one/two/four notes? Where do they go?
    • Is there a chord strumming?
    • Which quarter of the measure is it?
    • Does it have 16th hi-hats? Snare/kick/crash?
    • etc etc
  • Make a bit vector out of these features. As we multiply it by a learnable matrix B, we'll get beat embeddings - these will be our tokens.
  • By the way, we'll go for the context of 512 tokens - because most of the tracks fit under 128 measures.
  • Now for the weird part. We need to make a first self-attention. A key-query-value dance. Let's calculate relative pairwise features between two beats and use it there.
  • So, write a function that extracts binary features from a pair of beats. Again, no absolute pitches allowed:
    • Is the bass/chord/melody the same?
    • Is the melody from beat 1 tranposed to beat 2?
    • Are the bass/chord/melody starting from these two beats equal for the next 4/16/64 beats? (A bit of pre-compute for faster convergence.)
    • What's the interval between leftmost bass notes of two beats?
    • Is the second beat an exact transposition of all notes one tone up?
    • Encode relative distance in measures in beats between these two. Look up rotary position encoding and invent your own.
    • etc etc
  • Then multiply this on Key/Query/Value matrices, make self-attention, add MLP.
  • Stack more layers. Don't repeat the pairwise embeddings again - hope that a residual connection will help if need be. Use some standard self-attention with some relative distances.

Where to start

Start learning this on tiny sizes with a handful of features and just 2 self-attention layers. Maybe try a tiny embedding length - 32?

Oh, how do we train? What should [MASK] token be equal to? Well, let's do binary prediction of all beat features via a sigmoid. A loss will sum all sigmoids.

Fine-tuning

We can try to fine-tune on answering music theory questions about every beat:

  • is this a tonic chord
  • is this a true measure start, a 4-measure phrasing start
  • is this a start of a direct modulation up
  • what's the likely local scale
  • is this blues

Literature