EE Seminar: Geometric analysis of the self-attention mechanism at initialization
Registration for the seminar will take place at the beginning of the seminar by scanning the barcode into Moodle (please log in to Moodle beforehand via the website, NOT via the app).
Electrical Engineering Systems Seminar
Speaker: Roee Hop
M.Sc. student under the supervision of Dr. Anatoly Khina & Dr. Ido Nachum
Monday, 5th January 2026, at 15:00
Room 011, Kitot Building, Faculty of Engineering
Geometric analysis of the self-attention mechanism at initialization
Abstract
How does the application of neural layers affect the underlying geometry of the input dataset? Researchers have long sought to understand the dynamics of neural networks by analyzing the angles between data points. For randomly initialized fully connected linear layers, the Johnson-Lindenstrauss lemma ensures that angles are preserved, while adding a ReLU activation contracts angles according to a known function. Convolutional neural networks exhibit richer behavior: synthetic data behaves as in the fully connected case, but real data almost completely preserves angles even with ReLU activations. Both architectures exhibit some form of angle contraction, which we intuitively expect to be detrimental to training. In this work, we consider a single, randomly initialized multi-head self-attention layer and explore its effect on the angles between data points. For a simplified case in which the softmax function is omitted, we show that the output angles tend to grow rather than contract, driving the output sentence embeddings toward orthogonality. We then derive a relationship between the input and output angles in expectation and show that it becomes increasingly accurate as the number of heads grows. For the complete layer, the behavior is highly scale-dependent, and we demonstrate that when the inputs and layer weights are carefully normalized, the input-output angle relationship can be approximated by the no-softmax case with an additional rank-reducing term.
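As a quick illustration of the baseline facts the abstract builds on (not material from the talk itself), the following minimal NumPy sketch compares the angle between two random inputs before and after a randomly initialized fully connected layer: the random Gaussian projection approximately preserves the angle, while adding ReLU contracts it. All dimensions, names, and the choice of Gaussian initialization here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def angle(u, v):
    """Angle (in radians) between two vectors."""
    c = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.arccos(np.clip(c, -1.0, 1.0)))

d_in, d_out = 64, 4096  # wide output so the random projection concentrates
x = rng.standard_normal(d_in)
y = rng.standard_normal(d_in)

# Random fully connected layer at initialization (Gaussian weights, 1/sqrt(d_in) scaling).
W = rng.standard_normal((d_out, d_in)) / np.sqrt(d_in)

theta_in = angle(x, y)
theta_linear = angle(W @ x, W @ y)  # approximately theta_in: angles preserved
theta_relu = angle(np.maximum(W @ x, 0.0),  # smaller than theta_in: ReLU contracts angles
                   np.maximum(W @ y, 0.0))

print(theta_in, theta_linear, theta_relu)
```

For two roughly orthogonal inputs, the ReLU output angle concentrates well below the input angle, matching the contraction behavior described above; the self-attention layer studied in the talk is claimed to push angles in the opposite direction.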
This seminar counts as an attendance seminar for M.Sc. and Ph.D. students.

