Neural Network – Let’s dub

From noobs to nerds, everyone has heard about neural networks. But have you ever played with one?

Let’s skip the MNIST tutorial (handwritten digit recognition), too boring, and choose something more exciting: music recognition! Style, instrument, notes, effects: all of this would be nice to extract from an MP3!

Step 1 : find data
We need a lot of data for any kind of recognition. Of course we need music, and also a massive pack of instrument samples. Let’s start by classifying our data, with each instrument in a folder named after it.
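A minimal sketch of how that folder layout can double as the labelling scheme: every sub-folder name becomes the label of the files inside it. The function name `list_labelled_samples` and the `.wav` extension filter are my own assumptions, not part of any tool used in this post.

```python
import os
import tempfile


def list_labelled_samples(root, extensions=(".wav",)):
    """Return sorted (path, label) pairs, using each sub-folder name as the label."""
    pairs = []
    for label in sorted(os.listdir(root)):
        folder = os.path.join(root, label)
        if not os.path.isdir(folder):
            continue  # ignore stray files sitting next to the instrument folders
        for name in sorted(os.listdir(folder)):
            if name.lower().endswith(extensions):
                pairs.append((os.path.join(folder, name), label))
    return pairs


if __name__ == "__main__":
    # Tiny demo on a throwaway folder tree (hypothetical instrument names).
    with tempfile.TemporaryDirectory() as root:
        for label, name in [("kick", "k1.wav"), ("snare", "s1.wav")]:
            os.makedirs(os.path.join(root, label), exist_ok=True)
            open(os.path.join(root, label, name), "wb").close()
        for path, label in list_labelled_samples(root):
            print(label, os.path.basename(path))
```

Keeping the label in the folder name means there is no separate index file to keep in sync when samples are added or moved.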
Step 2 : generate more data
1500 samples is not too bad, but what about getting a lot more? In visual recognition, we would slightly distort our sample images; let’s do the same with our musical samples.
Now apply chorus, reverb and then a filter to get more realistic samples.
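To give an idea of what "distorting" an audio sample can look like, here is a rough sketch in plain Python. It is only an illustration under my own assumptions: a single echo stands in for a real reverb, a one-pole low-pass stands in for the filter, chorus is left out, and samples are plain lists of floats. All function names and parameter ranges are hypothetical.

```python
import random


def echo(samples, delay, decay=0.4):
    """Add one delayed, attenuated copy of the signal (a crude stand-in for reverb)."""
    out = list(samples) + [0.0] * delay
    for i, s in enumerate(samples):
        out[i + delay] += s * decay
    return out


def low_pass(samples, alpha=0.2):
    """One-pole low-pass filter: y[n] = y[n-1] + alpha * (x[n] - y[n-1])."""
    out, y = [], 0.0
    for x in samples:
        y += alpha * (x - y)
        out.append(y)
    return out


def augment(samples, rng, n_variants=4, rate=44100):
    """Build n_variants randomly echoed and filtered copies of one sample."""
    variants = []
    for _ in range(n_variants):
        delay = rng.randint(rate // 100, rate // 20)  # 10 ms to 50 ms
        variant = echo(samples, delay, decay=rng.uniform(0.2, 0.5))
        variant = low_pass(variant, alpha=rng.uniform(0.1, 0.9))
        variants.append(variant)
    return variants


if __name__ == "__main__":
    click = [1.0] + [0.0] * 99  # a single impulse as a toy "sample"
    for v in augment(click, random.Random(0), n_variants=3, rate=1000):
        print(len(v), round(max(v), 3))
```

Passing the random generator in explicitly makes the augmented set reproducible from a seed, which helps when comparing training runs.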
Step 3 : prepare data
Since working directly on WAV files would be painful, let’s transform our samples into representative images, say spectrograms. As this kind of project has not been documented, we will have to try and combine different visualisation parameters.
Using SoX, we will get a DFT spectrogram, using Hamming windows for the frequency analysis; we will also have to try Dolph windows for dynamic analysis. To the human eye, the “peak spectrometer” also seems representative, so let’s keep an eye on it.
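For the curious, here is a small sketch of what a Hamming-windowed DFT spectrogram boils down to. It is written in plain Python for readability, not speed, and it is not how SoX implements it; the frame size, hop size and function names are assumptions of mine.

```python
import cmath
import math


def hamming(n):
    """Symmetric Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def dft_magnitudes(frame):
    """Magnitude of the DFT for bins 0..n/2 (naive O(n^2) version)."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame))
        mags.append(abs(acc))
    return mags


def spectrogram(samples, frame_size=64, hop=32):
    """Slice the signal into overlapping frames, window each one, take its DFT."""
    window = hamming(frame_size)
    columns = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = [s * w for s, w in zip(samples[start:start + frame_size], window)]
        columns.append(dft_magnitudes(frame))
    return columns


if __name__ == "__main__":
    # A pure tone sitting exactly on bin 8 of a 64-sample frame.
    tone = [math.sin(2 * math.pi * 8 * i / 64) for i in range(256)]
    columns = spectrogram(tone)
    print(len(columns), "columns,", len(columns[0]), "bins each")
```

Each column is one vertical slice of the image; the window tapers the frame edges so a steady tone shows up as a narrow line instead of smearing across the whole column.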

I’m currently using an sh script to crawl the music folder; I’ll release it here soon.
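Until then, here is a sketch of what such a crawler could look like. This is not the sh script mentioned above, just my guess at the idea: walk a folder and prepare one SoX `spectrogram` call per audio file, mirroring the folder layout on the output side. It only builds the commands; the function names and the extension list are assumptions, and reading MP3 needs a SoX build with MP3 support.

```python
import os
import tempfile


def sox_spectrogram_command(audio_path, png_path, window="Hamming"):
    """Argument list for SoX's spectrogram effect (-w picks the window, -o the PNG)."""
    return ["sox", audio_path, "-n", "spectrogram", "-w", window, "-o", png_path]


def plan_spectrograms(music_root, out_root, extensions=(".wav", ".mp3")):
    """Walk music_root and return one SoX command per audio file found."""
    commands = []
    for folder, _dirs, files in os.walk(music_root):
        for name in sorted(files):
            if not name.lower().endswith(extensions):
                continue
            source = os.path.join(folder, name)
            relative = os.path.relpath(source, music_root)
            png = os.path.join(out_root, os.path.splitext(relative)[0] + ".png")
            commands.append(sox_spectrogram_command(source, png))
    return commands


if __name__ == "__main__":
    # Demo on a throwaway tree; nothing is executed, commands are only printed.
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "drums"))
        open(os.path.join(root, "drums", "kick.wav"), "wb").close()
        for command in plan_spectrograms(root, "spectrograms"):
            print(" ".join(command))
```

To actually render the images, each command can be handed to `subprocess.run(command, check=True)` once the output folders exist. Swapping `window="Dolph"` gives the other window mentioned above.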

Step 4 : Build a brain I : Skull
Now we have to set up the tools to contain and manage our neural network.
I tried TensorFlow, which works fine but lacks a user interface, in combination with mnisten to package custom data in MNIST format.
But even geeks need some kindness, so we will use NVIDIA’s DIGITS and their Caffe fork, which implement several sample neural networks and data-formatting tools, the whole thing managed via a local web page.

Building DIGITS and Caffe wasn’t so easy, so here’s a short list of commands to get them working on a fresh Ubuntu 14.04.

 Without CUDA

cd $HOME
sudo apt-get install python-pil python-numpy python-scipy python-protobuf \
python-gevent python-flask gunicorn python-h5py \
libgflags-dev libgoogle-glog-dev libopencv-dev \
libleveldb-dev libsnappy-dev liblmdb-dev libhdf5-serial-dev \
libprotobuf-dev protobuf-compiler libatlas-base-dev \
python-dev python-pip python-numpy gfortran
sudo apt-get install --no-install-recommends libboost-all-dev
git clone digits
export DIGITS_HOME=$HOME/digits
sudo python
sudo pip install -r requirements.txt
cd $HOME
git clone --branch caffe-0.14
export CAFFE_HOME=${HOME}/caffe
sudo apt-get install
cat python/requirements.txt | xargs -n1 sudo pip install

sed -i '$aCPU_ONLY := 1' Makefile.config
make all
make test
make runtest
sudo pip install --upgrade Flask-WTF

 With CUDA


CUDA_REPO_PKG=cuda-repo-ubuntu1404_7.5-18_amd64.deb && wget$CUDA_REPO_PKG && sudo dpkg -i $CUDA_REPO_PKG

ML_REPO_PKG=nvidia-machine-learning-repo_4.0-2_amd64.deb && wget$ML_REPO_PKG && sudo dpkg -i $ML_REPO_PKG

sudo apt-get update
sudo apt-get install digits

Warning: training new models with Caffe/DIGITS without CUDA is very slow. In our case it took more than a week on ImageNet!
If you don’t own a CUDA-compatible NVIDIA graphics card, you’ll have to spend anywhere from $50 to $25,000 to go further, or use pretrained models.

Step 5 : Build a brain II : external wiring
Let’s review our goals: take an MP3 file as input and output a MIDI file. Meanwhile, Caffe takes images as input and outputs raw logs.
We will basically need two programs. The first, let’s call it inputter, turns an MP3 into spectrograms and launches our tests. The second, outputter, handles the conversion to MIDI format.
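As a taste of the outputter side, here is a minimal sketch that writes a format 0 Standard MIDI File from a list of notes, using only the standard library. How Caffe’s raw log gets decoded into notes is not covered; the `(start_tick, duration_ticks, pitch, velocity)` tuple layout and the function names are assumptions made for the example.

```python
import struct


def vlq(value):
    """Encode a non-negative integer as a MIDI variable-length quantity."""
    out = [value & 0x7F]
    value >>= 7
    while value:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    return bytes(reversed(out))


def notes_to_midi(notes, ticks_per_beat=480, channel=0):
    """Build a single-track MIDI file from (start_tick, duration_ticks, pitch, velocity)."""
    events = []
    for start, duration, pitch, velocity in notes:
        events.append((start, 1, bytes([0x90 | channel, pitch, velocity])))  # note on
        events.append((start + duration, 0, bytes([0x80 | channel, pitch, 0])))  # note off
    # Sort by time; at equal ticks, note-offs (order 0) come before note-ons (order 1).
    events.sort(key=lambda event: (event[0], event[1]))

    track = bytearray()
    now = 0
    for tick, _order, message in events:
        track += vlq(tick - now) + message  # delta time, then the channel message
        now = tick
    track += b"\x00\xff\x2f\x00"  # end-of-track meta event

    header = b"MThd" + struct.pack(">IHHH", 6, 0, 1, ticks_per_beat)
    return header + b"MTrk" + struct.pack(">I", len(track)) + bytes(track)


if __name__ == "__main__":
    # One quarter note of middle C, written next to the script.
    with open("demo.mid", "wb") as handle:
        handle.write(notes_to_midi([(0, 480, 60, 100)]))
```

Working in ticks instead of seconds keeps the writer trivial; converting detected onset times into ticks (and choosing a tempo) would be the job of whatever sits between the network and this function.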

Step 6 : Build a brain III, or 470 !
Take a look at the parameters we want to extract for each instrument note:
– Note duration (attack – sustain – release  events)
– Pitch (or note)
– Velocity / Volume
– Instrument sub-category / style
– Filter
– Reverb
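The parameters listed above could travel together as one record per detected note. The sketch below is just one possible shape: the field names, the use of seconds for attack and release, and the integer ids for filter and reverb categories are all assumptions of mine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NoteEvent:
    attack_s: float       # time of the attack event, in seconds
    release_s: float      # time of the release event, in seconds
    pitch: int            # MIDI note number, 0..127
    velocity: int         # MIDI velocity, 0..127
    style: str            # instrument sub-category / style
    filter_id: int        # index of the detected filter category
    reverb_id: int        # index of the detected reverb category

    def __post_init__(self):
        # Reject values that could never be written to a MIDI file.
        if not 0 <= self.pitch <= 127 or not 0 <= self.velocity <= 127:
            raise ValueError("pitch and velocity must be in 0..127")
        if self.release_s < self.attack_s:
            raise ValueError("release cannot happen before attack")

    @property
    def duration_s(self):
        """Note duration: from attack to release (sustain included)."""
        return self.release_s - self.attack_s


if __name__ == "__main__":
    note = NoteEvent(0.5, 1.25, 60, 100, "kick", filter_id=3, reverb_id=1)
    print(note, note.duration_s)
```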
Many of these parameters can easily be calculated when we use monophonic tracks, but in our polyphonic case we will rely on our neural networks from A to Z; that’s what we came for! Sadly, the number of combinations still exceeds what a poor computer like ours can handle (it would take a 47-million-category model to achieve the resolution aimed at below).

// hypothetical way
//Let’s split our job.
//First, what about using one model per instrument ? Better, let’s split their job in 3 :
// Score, instrumentation and effects.
//Score will detect around  : 40 notes * 3 events * 8 volumes = 960 cats  #may vary for polyphonic instrument
//Instrumentation : 20 sub instruments * 50 filters = 1000 cats
//Effects : 12 delays * 5 reverbs * 15 filters = 900 cats
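The counts in this hypothetical split are easy to check, and the same arithmetic gives a way to turn a (note, event, volume) triple into a single category number and back. The sketch below uses only the figures listed above; the function names and the index layout are my own assumptions, and the joint product it prints is just what those three counts multiply out to, not a reconstruction of the 47 million figure, whose breakdown is not given.

```python
SCORE = {"notes": 40, "events": 3, "volumes": 8}
INSTRUMENTATION = {"sub_instruments": 20, "filters": 50}
EFFECTS = {"delays": 12, "reverbs": 5, "filters": 15}


def categories(model):
    """Number of output categories: the product of every dimension of the model."""
    total = 1
    for size in model.values():
        total *= size
    return total


def encode_score(note, event, volume):
    """Flatten (note, event, volume) into one category index in 0..959."""
    return (note * SCORE["events"] + event) * SCORE["volumes"] + volume


def decode_score(index):
    """Inverse of encode_score."""
    rest, volume = divmod(index, SCORE["volumes"])
    note, event = divmod(rest, SCORE["events"])
    return note, event, volume


if __name__ == "__main__":
    sizes = [categories(m) for m in (SCORE, INSTRUMENTATION, EFFECTS)]
    print("per model:", sizes)
    print("three small models, total outputs:", sum(sizes))
    print("one joint model over the same figures:", sizes[0] * sizes[1] * sizes[2])
```

Splitting turns a product of category counts into a sum, which is the whole point of using three small models per instrument instead of one huge one.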

Let’s first try to recognize drum and percussion hits. Results in 15 hours ^^
Still training, and getting 92% accuracy... Looks promising!
