GANs N’ Roses: listen to visual art

AI pipeline to enable people to feel visual art through the sense of hearing

Guillermo García Cobo
Saturdays.AI

--

Context

This article describes the result of joint efforts by five people from different backgrounds, who came together to develop the final project of the 2021 Deep Learning course offered by Saturdays.AI, a non-profit organization.

Before going into further details, we believe it is important to answer two questions: who are we, and where do we come from?

With this short but necessary context in place, and before explaining the project and its motivation in depth, it may be useful to see what we have achieved. Take a look at the following video for an example:

Finally, the code and implementation can be found in this GitHub repository. Now, let’s go into the details! 🚀

Motivation

Once the context is set, let’s explain the main motivation for this project. To do that, we kindly ask you to close your eyes and, without opening them (and therefore without looking at the painting in Figure 1), reflect on what feelings this piece of art evokes in you.

Figure 1. Sample painting

You may be asking yourself: ‘How am I supposed to feel the painting if I cannot see it?’ Frustrating, isn’t it? Well, you have just experienced a situation similar to the one that people with special visual needs face in a museum. This is where our idea was born: the project intends to enable people to feel visual art through other senses, in this case the sense of hearing.

How can we reach this objective? And, equally important, can we automate the process? Our hypothesis was that Artificial Intelligence could give a positive answer to both questions, so we built an end-to-end process that aims to help in the situation described above.

From a painting to a melody — Project Pipeline

Having explained the situation we would like to help with, here is a general picture of the path a painting follows to become a melody, which should also give an idea of how we approached such a demanding goal.

Figure 2. Pipeline

As depicted in Figure 2, the data from the painting passes through three distinct steps before it is transformed into a melody.

Although these steps are explained in depth in the following sections, a brief, high-level description of the whole process is useful here. First of all, the image is fed into a CNN that predicts which emotions it is most likely to evoke in a viewer. After that, we cannot feed these detected emotions directly into our orchestra director (a robust pre-trained Transformer), since, just like a human, it would find it really difficult to compose something with 50% happiness, 30% love and 20% optimism, wouldn’t it? The inspiration the model needs to compose the final melody is produced by the black box, the component where the real transformation happens. But, just like good magicians, we are not going to reveal the trick so easily (… yet! Wait a few paragraphs).
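
To make the flow concrete, here is a purely illustrative sketch of the pipeline; the function names and return values are placeholders, not the project’s actual API, and each stage is detailed in the sections that follow.

```python
# Hypothetical sketch of the three-stage pipeline in Figure 2.
def predict_emotions(image_path):
    # Stage 1 (CNN): painting -> emotion scores (dummy values here)
    return {"happiness": 0.5, "love": 0.3, "optimism": 0.2}

def pick_inspiration(emotions):
    # Stage 2 (the "black box"): emotions -> a short seed piece
    return "seed.mid"

def compose(seed_midi):
    # Stage 3 (Transformer): seed -> final melody
    return "melody.mid"

def painting_to_melody(image_path):
    return compose(pick_inspiration(predict_emotions(image_path)))

print(painting_to_melody("painting.jpg"))  # -> "melody.mid"
```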

From paintings to emotions

As explained in the previous section, the first step is to detect emotions in the input painting. To fulfil this objective, we combined Transfer Learning with an extensive dataset of human annotations on 4k+ paintings, which resulted in a CNN that produced reasonable results. Let’s take a deeper look at each of the parts involved.

Regarding the data, we used The Wikiart Emotions Dataset, an extensive collection of paintings annotated with the emotions they evoke. It was constructed as follows: several pieces of art (more than 4k, as mentioned above) from different Western styles were presented to different observers, who annotated them against a set of 20 available emotions, along with other information not considered for this project. Observers could select more than one emotion for the same painting. The raw data is therefore the percentage of observers that annotated each piece of art with each emotion, so a particular painting could score, say, 0.65 in happiness and 0.5 in love at the same time. The authors provide several versions of the dataset, but we only considered the raw data, from which we extracted all the information we needed. With it, we built our own dataset, consisting of two parts: all the painting images, keeping their original names, and a CSV file mapping each image name to the percentage of each emotion it evokes. The emotions available to the observers can be seen in Figure 3:

Figure 3. Emotions available

Some of the issues this dataset posed are:

  • The percentages of emotions per painting did not add up to 100%. This is because people could report more than one emotion for a painting, so when averaging the annotations per painting the scores could sum to more than 1. Since something similar to a probability distribution is desired, these scores were transformed so that they sum to 1 per painting. This can be achieved with a function well known to Data Scientists: softmax. However, since this function uses exponentials, it never assigns probability 0, even when the initial score was 0. Is this desirable? At first sight it does not seem reasonable to assign a score to an emotion if we are sure the painting contains nothing related to it. Moreover, if you compare Figure 4 (before softmax) and Figure 5 (after a standard softmax), not only are emotions with a 0 score given a positive one, but this score ends up close to the scores of emotions that previously had non-zero values. The distance between all scores also shrinks, since there is less probability mass left to distribute. Because of this, we modified the computation to exclude 0s from the softmax, which results in Figure 6 (see the sketch after the figures below).
Figure 4. Before softmax
Figure 5. After normal softmax
Figure 6. After non-zero softmax
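
For clarity, here is a minimal sketch of the ‘non-zero softmax’ normalization described above, assuming NumPy; the input scores are illustrative.

```python
import numpy as np

def nonzero_softmax(scores: np.ndarray) -> np.ndarray:
    """Apply softmax only over the non-zero scores, so zero scores stay at 0."""
    out = np.zeros_like(scores, dtype=float)
    mask = scores > 0
    if mask.any():
        exp = np.exp(scores[mask] - scores[mask].max())  # subtract max for numerical stability
        out[mask] = exp / exp.sum()
    return out

print(nonzero_softmax(np.array([0.65, 0.5, 0.0, 0.1])))  # the third entry remains exactly 0
```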

Since this is a highly subjective task, a great effort was made to balance the final detected emotions. As you can see in Figure 7, the dataset was clearly unbalanced towards positive emotions, and during the first versions of the model we observed a clear tendency to predict mostly positive emotions.

Figure 7. Emotions distribution

In terms of the model, we used ResNet50 as the base, taking advantage of its pre-trained ability to extract features from images. Since our problem required further learning to better identify emotions, we added a few extra dense layers on top.
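
As a rough illustration, the architecture could look like the sketch below. We assume a Keras/TensorFlow implementation (the article does not state the framework), the head sizes are illustrative, and the output layer reflects the final version of the model, which, as explained later, regresses the two valence-arousal coordinates.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Pre-trained ResNet50 backbone used as a frozen feature extractor.
base = tf.keras.applications.ResNet50(
    include_top=False, weights="imagenet", input_shape=(224, 224, 3), pooling="avg"
)
base.trainable = False

# A few extra dense layers on top (sizes are illustrative assumptions).
model = models.Sequential([
    base,
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(2, activation="tanh"),  # valence and arousal, each in [-1, 1]
])
model.compile(optimizer="adam", loss="mse")
```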

Lastly, the target the model was trained towards had to be defined. The first attempt was to treat this as a classification problem, predicting the most ‘popular’ emotion in a painting and using softmax as the activation function of the last layer to obtain a probability distribution usable in the next steps. However, although it was a good first try to get a working model, we realized two aspects could be improved:

  • We were not making use of valuable information that was actually provided by the dataset about the percentage of each emotion.
  • If in the end we wanted a probabilistic distribution, why not predict it directly with the available data?

After considering these points, the classifier was turned into a regressor that directly predicted the score associated with each emotion. When this version of the model was working, we discovered two issues:

  • Because of the nature of the dataset (many evaluators with different, subjective criteria), the model found it hard to predict the percentages accurately. But are we really interested in the model predicting the exact percentages, or is it enough for it to determine the main ‘directions’ of the emotions (positiveness, activeness, …)? In fact, focusing on predicting the exact distribution was causing it not to capture these directions at all.
  • Once the predicted distribution of emotions was produced, how would we cope with the fact that there is no standard space of emotions? For instance, the datasets used in later steps were annotated with a different list of emotions.

Based on these two factors, we had to figure out a way of bringing different spaces of emotions together while keeping the basic ‘directions’ of the emotions. This key transformation is explained in the next section.

Valence-Arousal — The key transformation

Having described the need that arose, let’s explain our proposal and the solution we adopted: the valence-arousal scale.

The idea first came from the paper describing one of the datasets of emotion-annotated music we used — the vgmidi project. The basis of this scale is to describe emotions with two dimensions: valence (or positiveness) and arousal (or intensity). To simplify even further, vectors in the valence-arousal scale have norm less than or equal to 1 (that is, they land inside the circle of radius 1). For a better understanding, Figure 8 shows an example of this new representation.

Figure 8. Valence-Arousal scale

Having the new representation of emotions clear, now it is time to explain why it is so useful for us. Remember that we initially have an array with proportions of n emotions. With this, we intend to:

  • Reduce dimensionality so that the base ‘directions’ of emotions are correctly predicted by the model, and not the exact proportions themselves.
  • Easily map the space of n emotions to a space of m emotions that could have no relation to the original space.

The first step towards these objectives is to define where on the circle each emotion of the original space is located. As shown in Figure 9, which corresponds to the emotions found in the dataset used to train the CNN, each sentiment is assigned a location on the border of the circle according to its valence-arousal value. Mathematically, this is equivalent to assigning an angle to each emotion.

Figure 9. Base emotions mapped to circumference

To better understand the rest of the transformations needed, let’s define a simpler set of emotions: happiness, love, fear and sadness. As with the previous set, let’s assign each sentiment a location on the circle:

Figure 10. New emotions mapped to circumference

Now that the place of each emotion of the original space is clear, how do we define the coordinates of the input n-dimensional vector? Consider the following example:

Figure 11. Example input vector

First, as seen in Figure 12, each emotion (or each coordinate in the original vector) is associated with a vector in the new space.

Figure 12. Coordinates to vectors

This is done by using the score/proportion of each emotion as the modulus of the vector, and the predefined angle of the emotion as its angle. These are the polar coordinates of the vector, so to get the Cartesian coordinates we use cosines and sines, as recalled in Figure 13.

Figure 13. Polar to Cartesian coordinates

After getting one vector per emotion, how do we obtain the final representation of the original vector? We decided to simply sum all the previously computed vectors. By doing this, emotions with higher proportions have a greater impact on the final coordinates, yielding a good summary representation of the original input, as seen in Figure 14.

Figure 14. Final representation
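
Putting the pieces together, a minimal sketch of the mapping could look like this (NumPy assumed); the angles assigned to each emotion are illustrative placements, not the exact values used in the project.

```python
import numpy as np

# Illustrative angles (in radians) on the unit circle for the simplified emotion set.
EMOTION_ANGLES = {
    "happiness": np.pi / 4,      # positive valence, high arousal
    "love":      np.pi / 8,      # positive valence, moderate arousal
    "fear":      3 * np.pi / 4,  # negative valence, high arousal
    "sadness":   5 * np.pi / 4,  # negative valence, low arousal
}

def to_valence_arousal(scores):
    """Map emotion scores (summing to ~1) to a single 2-D valence-arousal vector."""
    vec = np.zeros(2)
    for emotion, score in scores.items():
        angle = EMOTION_ANGLES[emotion]
        # Polar (modulus = score, angle) -> Cartesian, then sum over emotions.
        vec += score * np.array([np.cos(angle), np.sin(angle)])
    return vec

print(to_valence_arousal({"happiness": 0.6, "love": 0.3, "sadness": 0.1}))
```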

As the reader may have observed, these transformations result in a reasonable, human-understandable procedure to map any set of emotions to a common two-dimensional space that correctly summarizes the main ‘directions’ of the emotions and makes it easy to work with different spaces. Thus, the CNN model is trained to predict the valence-arousal coordinates of a painting, which are then used in the next step to select how another AI model will generate the final melody.

From emotions to an initial inspiration

The previous sections show that we are able to determine the valence-arousal coordinates of an input painting. Once this information is retrieved from the image, it is time to take it to another dimension: music. As mentioned at the beginning, our orchestra director needs an initial inspiration to compose the final melody, and this is the section where that initialization is obtained. The initialization is a 5-second music piece that is believed to evoke emotions very similar to the ones detected in the input painting. From there, the creativity of the model generates the rest of the melody.

Two main tasks are quickly identified for finding the most suitable initialization:

  • Get a list of songs annotated with the emotions they evoke. This task is fulfilled thanks to the existence of several datasets. The first one is the vgmidi project, where MIDIs are already annotated with the valence-arousal scale. Another one we used is emotify, which contains music annotated with a different set of emotions, which we translated to valence-arousal following exactly the same process as with the painting annotations. Do you see now how versatile the valence-arousal scale is?
  • Define what ‘suitable’ means in this context. Given the valence-arousal vector of the image as well as that of each piece, a natural way of defining closeness is the Euclidean distance. We therefore just need to compute the distance from the painting’s vector to every piece’s vector and keep the closest one. To find this most suitable piece efficiently and simply, we used k-NN with k=1 (see the sketch after this list).
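
A minimal sketch of this nearest-neighbour lookup, assuming scikit-learn and illustrative seed data, could be:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Illustrative placeholders: valence-arousal annotations of the candidate seeds.
seed_vectors = np.array([[0.8, 0.3], [-0.5, 0.7], [0.1, -0.9]])
seed_files = ["happy_theme.mid", "tense_theme.mid", "calm_sad_theme.mid"]

knn = NearestNeighbors(n_neighbors=1, metric="euclidean").fit(seed_vectors)

painting_va = np.array([[0.7, 0.2]])   # valence-arousal predicted for the painting
_, idx = knn.kneighbors(painting_va)
print(seed_files[idx[0][0]])           # closest seed, e.g. "happy_theme.mid"
```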

A graphical representation of the process is found in Figure 15, where the painting’s vector and its corresponding initial piece’s vector are represented:

Figure 15. Painting to MIDI

From inspiration to the final melody through Transformers — The Orchestra Director

Transformers are trained by masking part of a sequence and predicting it. These models build features for each word using an attention mechanism that figures out how important all the other words in the sentence are with respect to that word. The words’ updated features are then simply the sum of linear transformations of the features of all the words, weighted by their importance.

The attention mechanism consists of one or more blocks of mathematical operations that give us the importance weight of each element in the sequence. Inside such a block, the network analyzes the whole sequence simultaneously and is in charge of finding relations between its different parts. The original tokens are represented in three different ways: queries, keys and values. The query of each token is compared with the existing keys (a vector multiplication whose result measures the degree of relationship between tokens); the result is normalized and a softmax function converts it into probabilities. With these scores, the value vectors are weighted, determining the importance of each part of the sequence when encoding the tokens. This strategy lets the model learn more efficiently which parts of the sequence are more important than others, and therefore which ones define it better.
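
For reference, the scaled dot-product attention just described can be written in a few lines; this is a generic NumPy sketch of the standard mechanism, not code from the project.

```python
import numpy as np

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V — each output row is a weighted sum of the values."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the keys
    return weights @ V

tokens = np.random.rand(5, 8)                   # 5 tokens with 8 features each
print(attention(tokens, tokens, tokens).shape)  # (5, 8)
```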

In this project, the Transformer is provided by Magenta: an open-source Python library powered by TensorFlow. The library includes utilities for manipulating source data (primarily music and images), using that data to train machine learning models, and finally generating new content from those models. This tool allowed us to generate new songs with Artificial Intelligence from an initial MIDI whose valence-arousal characteristics match the image we want to set to music.

Deployment

If only there was a way to share this work with the rest of the world… Wait! There is. And it’s as easy as pie using Streamlit and Docker containers.

The first thing the user sees is a Streamlit web application that asks them to upload a picture, expected to be a painting, which is then processed by the rest of the pipeline described above. Then, all the magic happens backstage.
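
The upload step itself needs very little code; here is a minimal Streamlit sketch (the widget labels are illustrative, not the exact ones in the app).

```python
import streamlit as st

uploaded = st.file_uploader("Upload a painting", type=["jpg", "jpeg", "png"])
if uploaded is not None:
    st.image(uploaded, caption="Input painting")
    # ...from here the image is handed to the emotion-detection pipeline...
```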

Since the image sentiment analysis pipeline and the MusicTransformer have conflicting dependencies, and the latter is much more memory-hungry, we considered two separate containers to be the best solution, both in terms of simplicity and of horizontal scalability. This tandem communicates through HTTP requests, as outlined in Figure 16.

Figure 16. Architecture of the solution

When a painting is fed to the pipeline, our image sentiment analysis model estimates its valence and arousal values and picks the best-matching seed. It then sends a GET request to the MusicTransformer container, providing the seed file name, the number of seconds to take from that file and whether accompaniment should be generated for the melody (which is more expensive in computation and time) or not.
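
A sketch of that request is shown below; the host name, endpoint and parameter names are illustrative assumptions, since the article only specifies that a GET request carries the seed file name, the seconds to use and the accompaniment flag.

```python
import requests

response = requests.get(
    "http://music-transformer:8080/generate",  # hypothetical container endpoint
    params={
        "seed_file": "happy_theme.mid",  # seed chosen by the k-NN step
        "seconds": 5,                    # how much of the seed to take
        "accompaniment": False,          # melody only (cheaper to compute)
    },
    timeout=300,                         # generation can take a while
)
with open("generated.mid", "wb") as f:
    f.write(response.content)            # the freshly generated MIDI file
```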

After several seconds of computation, the MusicTransformer replies to the pipeline with the freshly generated MIDI file. The process ends when the pipeline container transforms the resulting song and displays it for the client.

Conclusions & Future Work

As we have seen throughout this article, the building block of the GANs N’ Roses project is the valence-arousal model of emotions. This is the tool that enables us to build a practical example of translating between senses through emotions, as well as the adapter that lets several Artificial Intelligence models speak the same language.

We have built an end-to-end Artificial Intelligence project that relies on three well-known models: one of the most widely used CNN architectures (ResNet50) to extract relevant information from the paintings, a k-NN method to find the initial song most similar to the painting in terms of the emotions it evokes, and a Transformer to generate the music.

Finally, in terms of infrastructure, we built a web-based interface with Streamlit and wrapped the whole pipeline in a Docker-based architecture that scales by separating the most demanding procedures into two different machines, making the use of this project a reality.

To finish this publication, and since the project is open to new ideas and work in different areas, we have identified several improvements and related lines of work, such as:

  • Training the CNN image model with a larger number of WikiArt images. Only part of this dataset was used to build the model in the present project; the next step is to get a more powerful environment that allows us to process more images and fine-tune the model.
  • Creation of a more controllable music generation model, where feelings can be imprinted more accurately. For this we want to explore two different strategies:
      • Training one model per feeling. The current project uses a single Transformer trained with music covering all kinds of feelings. With this strategy we would train a different Transformer for each sentiment, gaining more control over the characteristics of the output music.
      • Training a model labelled directly with valence-arousal outputs. With this strategy the input of the music model would come straight from the output of the image model, simplifying the project pipeline.
  • From GANs N’ Roses we want this application to serve the needs of its users, so we plan to get in contact with organisations of visually impaired people and ask for their opinions in order to incorporate the necessary improvements into the tool.
  • Creation of a complementary application for people with hearing disabilities, so that they can visualise the feelings conveyed by sounds, and also to put ourselves in the shoes of people who experience auditory synaesthesia. These people automatically and involuntarily experience a visual perception in response to an auditory stimulus, whereby each sound is associated with a colour. As with other conditions such as colour blindness, visualising how other people perceive the world brings us closer to them and helps create a more sympathetic and tolerant society.
  • Generation of more realistic music by adding different instruments to the orchestra.

To finish this article, we provide two more examples with different emotions to see our pipeline in action:

If you liked the project and would like to collaborate, feel free to contact any of us. Also, share it with your friends and contacts!

Thank you very much for having made it this far 😃

--

Guillermo García Cobo
Saturdays.AI

Double Degree in Mathematics and Computer Science Student