From noobs to nerds, everyone has been told about neural networks. But have you ever played with one?
Let's skip the MNIST tutorial (handwritten digit recognition), too boring, and pick something more exciting: music recognition! Style, instrument, notes, effects: it would be nice to extract all of this from an mp3!
Step 1: find data
We need a lot of data for any kind of recognition task. Of course we need music, and also a massive pack of instrument samples. Let's start by organizing our data: each instrument in a folder of its name.
Step 2: generate more data
1,500 samples is not too bad, but what about getting a lot more? In visual recognition we would slightly distort our sample images; let's do the same with our musical samples.
Then apply chorus, reverb and filters to get more realistic samples.
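An augmentation pass like this can be done with SoX alone. Here is a minimal sketch; the effect parameters and file paths below are placeholders I made up, to be tuned by ear:

```shell
#!/bin/sh
# Hypothetical augmentation sketch: derive extra training samples from an
# existing one by pitch-shifting, then adding chorus and reverb.
# Effect parameters are placeholder values to tune by ear.
augment() {
    in=$1; out=$2; cents=$3
    sox "$in" "$out" pitch "$cents" chorus 0.7 0.9 55 0.4 0.25 2 -t reverb 20
}

# Example: one semitone up and down (runs only if sox and the file exist)
if command -v sox >/dev/null 2>&1 && [ -f samples/piano/C4.wav ]; then
    augment samples/piano/C4.wav samples/piano/C4_up.wav 100
    augment samples/piano/C4.wav samples/piano/C4_down.wav -100
fi
```

Each call produces one new distorted sample, so a handful of pitch offsets and effect settings multiplies the dataset quickly.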
Step 3: prepare data
Since working on raw wav files would be painful, let's transform our samples into representative images, say spectrograms. As this project has never been documented, we will have to try and combine different visualisation parameters.
Using SoX, we will generate a DFT spectrogram, with a Hamming window for the frequency analysis; the Dolph window is worth trying for dynamic analysis. To a human eye, the "peak spectrometer" view also looks representative, so let's keep an eye on it.
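For reference, SoX can render such a spectrogram in one command; the `-w` flag selects the window function, so comparing Hamming and Dolph is a one-word change. Sizes and paths below are my own guesses:

```shell
#!/bin/sh
# Hypothetical rendering step: SoX's spectrogram effect with a Hamming
# window; swap "Hamming" for "Dolph" to compare windows.
# -x sets the width in pixels, -r drops axes and legend (cleaner NN input).
spectro() {
    sox "$1" -n spectrogram -x 512 -w Hamming -r -o "$2"
}

if command -v sox >/dev/null 2>&1 && [ -f samples/piano/C4.wav ]; then
    spectro samples/piano/C4.wav C4.png
fi
```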
I'm currently using a shell script to crawl the music folder; I'll release it here soon.
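Until that script is published, a minimal sketch of such a crawler could look like this; the directory layout and the samples-to-spectrograms mapping are assumptions of mine, not the author's actual script:

```shell
#!/bin/sh
# Hypothetical crawler sketch: mirror a tree of wav samples into a tree
# of spectrogram images. Directory layout is an assumption.

# Map an input wav path to its output png path.
specpath() {
    rel=${1#"$2"/}
    printf '%s/%s.png' "$3" "${rel%.wav}"
}

crawl() {
    src=$1; dst=$2
    find "$src" -name '*.wav' | while read -r f; do
        out=$(specpath "$f" "$src" "$dst")
        mkdir -p "$(dirname "$out")"
        sox "$f" -n spectrogram -w Hamming -o "$out"
    done
}
```

Called as `crawl samples spectrograms`, it keeps the per-instrument folders intact, so the folder name can later serve as the training label.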
Step 4: Build a brain I: Skull
Now we have to set up the tools to contain and manage our neural network.
I tried TensorFlow, which works fine but is let down by its user interface, in combination with mnisten to package custom data in MNIST format.
But even geeks need some kindness, so we will use NVIDIA's DIGITS and their Caffe fork, which implement several sample neural networks and tools for data formatting, the whole thing managed through a local web page.
Building DIGITS and Caffe wasn't so easy; here's a short list of commands to get them working on a fresh Ubuntu 14.04.
Without CUDA:
sed -i '$aCPU_ONLY := 1' Makefile.config
CUDA_REPO_PKG=cuda-repo-ubuntu1404_7.5-18_amd64.deb && wget http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1404/x86_64/$CUDA_REPO_PKG && sudo dpkg -i $CUDA_REPO_PKG
ML_REPO_PKG=nvidia-machine-learning-repo_4.0-2_amd64.deb && wget http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1404/x86_64/$ML_REPO_PKG && sudo dpkg -i $ML_REPO_PKG
sudo apt-get update
Warning: training new models in Caffe/DIGITS without CUDA is very slow. Our case took more than a week on ImageNet!
If you don't own a CUDA-compatible NVIDIA graphics card, you'll have to spend from $50 to $25,000 to go further, or use pretrained models.
Step 5: Build a brain II: external wiring
Let's look at our goals again: take an mp3 file as input, output a MIDI file. Meanwhile, Caffe takes images as input and outputs raw logs.
We will basically need two programs. The first, let's call it the inputter, will spectrogram an mp3 and launch our tests. The second, the outputter, will handle the conversion to MIDI format.
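As a sketch, the inputter could lean on SoX's trim/newfile/restart pseudo-effects to chop the track into fixed windows before spectrographing each one. The slice length, names and paths here are my own assumptions:

```shell
#!/bin/sh
# Hypothetical "inputter" sketch: split an audio file into half-second
# slices, then render one spectrogram per slice for classification.
inputter() {
    mp3=$1; outdir=$2
    mkdir -p "$outdir"
    # trim + newfile + restart: cut the input into consecutive chunks,
    # written as slice001.wav, slice002.wav, ...
    sox "$mp3" "$outdir/slice.wav" trim 0 0.5 : newfile : restart
    for w in "$outdir"/slice*.wav; do
        [ -e "$w" ] || continue
        sox "$w" -n spectrogram -w Hamming -r -o "${w%.wav}.png"
    done
}
```

The resulting images would then be fed to the trained model, whose per-slice classifications the outputter turns into timed MIDI events.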
Step 6: Build a brain III, or 470!
Take a look at the parameters we want to extract for each instrument note:
– Note duration (attack – sustain – release events)
– Pitch (or note)
– Velocity / Volume
– Instrument sub-category / style
Many of these parameters can easily be computed from monophonic tracks, but in our polyphonic case we will rely on our neural networks from A to Z; that's what we came for! Sadly, the number of combinations still exceeds what a poor computer like ours can handle (it would take a 47-million-category model to achieve the resolution aimed at below).
// Hypothetical plan: let's split the job.
// First, what about using one model per instrument? Better yet, let's split each one's job in 3:
// score, instrumentation and effects.
// Score will detect around: 40 notes * 3 events * 8 volumes = 960 cats # may vary for polyphonic instruments
// Instrumentation: 20 sub-instruments * 50 filters = 1000 cats
// Effects: 12 delays * 5 reverbs * 15 filters = 900 cats
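Whatever the exact total, the gain from splitting is easy to check with the category counts above: a single combined model would need one category per (score, instrumentation, effect) triple, while separate models only need the sum.

```shell
#!/bin/sh
# One combined model: a category for every (score, instrumentation,
# effect) triple.
combined=$((960 * 1000 * 900))
# Three separate models: the category counts simply add up.
split=$((960 + 1000 + 900))
echo "combined: $combined categories, split: $split categories"
# prints "combined: 864000000 categories, split: 2860 categories"
```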
Let's first try to recognize drum and percussion hits. Results in 15 hours ^^
Still training, and already at 92% accuracy... Looks promising!