In a convolutional network, it doesn’t make sense to talk about individual neurons. In this case, there are 8 convolutional layers (separated by dropout layers) with 32 (3×1) kernels (with ELU activation), followed by 2 dense tanh layers with 256 neurons each and a softmax output layer with 2 units.
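As a rough sketch of that architecture, the inference-time forward pass could look like this in plain NumPy. The weights are randomly initialized and the input dimensions (64 word vectors of dimensionality 100) are assumed purely for illustration; dropout is the identity at inference time, so it is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

def elu(x, alpha=1.0):
    # ELU: identity for x > 0, alpha * (exp(x) - 1) otherwise
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def conv1d(x, kernels):
    # x: (length, channels); kernels: (n_kernels, width, channels); valid padding
    n_k, width, _ = kernels.shape
    out_len = x.shape[0] - width + 1
    out = np.empty((out_len, n_k))
    for i in range(out_len):
        window = x[i:i + width]
        out[i] = np.tensordot(kernels, window, axes=([1, 2], [0, 1]))
    return out

def forward(x, n_conv_layers=8, n_kernels=32, width=3):
    # 8 convolutional layers: 32 kernels of width 3, ELU activation
    for _ in range(n_conv_layers):
        k = rng.standard_normal((n_kernels, width, x.shape[1])) * 0.1
        x = elu(conv1d(x, k))
    h = x.reshape(-1)
    # two dense tanh layers with 256 units each
    for _ in range(2):
        W = rng.standard_normal((256, h.shape[0])) * 0.01
        h = np.tanh(W @ h)
    # softmax output layer with 2 units
    W_out = rng.standard_normal((2, 256)) * 0.01
    return softmax(W_out @ h)

p = forward(rng.standard_normal((64, 100)))  # 64 word vectors of dim 100
```

The output `p` is a 2-element probability vector (e.g. positive/negative sentiment).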

The number of layers can be analyzed in many ways:

- Experience
- Validation
- Grid-search

In general, it’s helpful to start with smaller models, check the validation accuracy, the amount of overfitting, and so on, and then make a decision (e.g. adding new layers, increasing or decreasing the number of units, adding regularization, dropout, batch normalization, …). The golden rule (derived from Occam’s razor) is to find the smallest model which achieves the highest validation accuracy. An alternative (but more expensive) approach is based on a grid search. In this case, a set of models based on different parameters is trained sequentially (or in parallel, if you have enough resources) and the optimal configuration (corresponding to the highest accuracy/smallest loss) is selected. Normally this approach requires several iterations, because the initial grid is coarse-grained and is only used to determine the sub-space where the optimal parameter set is located. Then, several “zooms” are performed in order to fine-tune the search.
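A minimal sketch of that coarse-to-fine procedure. The surrogate `validation_accuracy` function stands in for an actual train-and-validate run, and the parameter names and grid values are purely illustrative:

```python
import itertools

# Toy surrogate for "train this configuration and return validation accuracy".
# In practice this would fit a model and evaluate it on a validation set.
def validation_accuracy(n_layers, n_units):
    return 1.0 - ((n_layers - 6) ** 2 + (n_units / 64 - 4) ** 2) / 100.0

def grid_search(layer_values, unit_values):
    # Evaluate every combination and keep the best-scoring one
    return max(itertools.product(layer_values, unit_values),
               key=lambda cfg: validation_accuracy(*cfg))

# Coarse grid to locate the promising sub-space...
coarse = grid_search([2, 6, 10], [64, 256, 1024])

# ...then "zoom in" around the coarse optimum with a finer grid
n_layers, n_units = coarse
fine = grid_search([n_layers - 2, n_layers, n_layers + 2],
                   [n_units // 2, n_units, n_units * 2])
```

Each zoom step shrinks the grid spacing around the current optimum, trading a few extra training runs for a finer estimate.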

In your architecture, how many hidden layers did you use? And is the number of neurons in each layer 32? Am I right?

How can you know how many layers would be beneficial for your model?

Thanks a lot

The word vector selector has not been detailed because I’m planning to post a complete working example. However, the idea is based on a “pseudo-attention” mechanism implemented with a simple MLP with a softmax output (the input length is fixed and the sentences are padded or truncated). Each value represents the probability that a specific word is representative of a context. The network is trained with labeled examples and, thanks to the word vectors, it is also very robust to synonyms.

Each training couple is made up of (wv1, wv2, …, wvN) -> (p1, p2, …, pN), where the probabilities are non-zero only for the representative vectors (an alternative approach is based on sigmoids, but the training speed was slower and the final accuracy worse). E.g. “The restaurant is nice but the food is quite bad” -> Word vectors -> Targets: “restaurant” and “food” (so the softmax output would be 0.0, 0.0, 0.5, …, 0.5, 0.0, 0.0). As we want to perform a “local” sentiment analysis, each “peak” in the softmax is surrounded by a set of additional words. Hence, in this case, for example, we want a peak for “restaurant” and a smaller value for “nice” (e.g. 0.3, 0.2).
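A minimal NumPy sketch of such a selector, with assumed sizes (padded sentence length N=10, word-vector dimensionality D=50, hidden width H=64) and random weights standing in for trained ones:

```python
import numpy as np

rng = np.random.default_rng(1)

N, D = 10, 50   # padded sentence length, word-vector dimensionality (assumed)
H = 64          # hidden units of the selector MLP (assumed)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Target distribution for a sentence whose representative words sit at
# positions 2 and 7: non-zero (here 0.5 each) only for those positions.
target = np.zeros(N)
target[[2, 7]] = 0.5

# Pseudo-attention selector: a plain MLP mapping the flattened, padded
# word-vector sequence to one probability per word position.
W1 = rng.standard_normal((H, N * D)) * 0.01
W2 = rng.standard_normal((N, H)) * 0.01

def selector(word_vectors):
    h = np.tanh(W1 @ word_vectors.reshape(-1))
    return softmax(W2 @ h)   # p_i: probability that word i is representative

p = selector(rng.standard_normal((N, D)))
# Cross-entropy between the softmax output and the target distribution
# would be the natural training loss here.
loss = -np.sum(target * np.log(p + 1e-12))
```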

Once this submodel has been trained, we freeze it and train the convolutional network. I hope this very brief explanation is helpful.
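Freezing simply means excluding the submodel’s weights from the optimizer’s updates. A toy sketch of the idea (parameter names and shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy parameters: the selector's weights (already trained, frozen) and the
# convolutional network's weights (still trainable).
params = {
    "selector_W":   {"value": rng.standard_normal((4, 4)), "trainable": False},
    "conv_kernels": {"value": rng.standard_normal((4, 4)), "trainable": True},
}

def sgd_step(params, grads, lr=0.1):
    # Freezing = skipping the update for non-trainable weights
    for name, p in params.items():
        if p["trainable"]:
            p["value"] -= lr * grads[name]

before = {k: p["value"].copy() for k, p in params.items()}
grads = {k: np.ones((4, 4)) for k in params}
sgd_step(params, grads)
```

After the step, the frozen selector weights are unchanged while the convolutional weights have moved.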

Your work looks so interesting.

I’d love to get a bit more insight into how the “word vector selector” works exactly, though.

Thanks

Antoine