EE Seminar: Geometric analysis of the self-attention mechanism at initialization

5 January 2026, 15:00
Room 011, Kitot Building (Electrical Engineering Classrooms)

Registration for the seminar takes place at the beginning of the seminar by scanning the barcode into Moodle (please log in to Moodle beforehand, not via the app).

 

Electrical Engineering Systems Seminar

 

Speaker: Roee Hop

M.Sc. student under the supervision of Dr. Anatoly Khina & Dr. Ido Nachum

 

Monday, 5th January 2026, at 15:00

Room 011, Kitot Building, Faculty of Engineering

 

Geometric analysis of the self-attention mechanism at initialization

Abstract

How does the application of neural layers affect the underlying geometry of the input dataset? Researchers have long sought to understand the dynamics of neural networks by analyzing the angles between data points. For randomly initialized fully connected linear layers, the Johnson-Lindenstrauss lemma ensures that angles are preserved, while adding a ReLU activation causes angle contraction according to a known function. Convolutional neural networks exhibit richer behavior: synthetic data behaves as in the fully connected case, whereas real data almost completely preserves the angles even with ReLU activations. Both architectures exhibit some form of angle contraction, which we intuitively expect to be detrimental to training. In this work, we consider a single, randomly initialized multi-head self-attention layer and explore its effect on the angles between data points. For a simplified case in which the softmax function is omitted, we show that the output angles tend to grow rather than contract, driving the output sentence embeddings towards orthogonality. We then derive a relationship between input and output angles in expectation and show that it becomes increasingly accurate as the number of heads increases. For the complete layer, the behavior becomes highly scale-dependent, and we demonstrate that when the inputs and layer weights are carefully normalized, the input-output angle relationship can be approximated by the no-softmax case with an additional rank-reducing term.
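For readers who want to see the no-softmax phenomenon numerically, the following is a minimal Python sketch, not taken from the talk: it samples one random multi-head self-attention layer with the softmax omitted, applies it to a pair of correlated input "sentences", and compares the angle between the flattened representations before and after the layer. The dimensions, the Gaussian initialization scale, and the use of flattened sequences as sentence embeddings are all illustrative assumptions; typically the output angle comes out closer to 90 degrees than the input angle, in line with the orthogonality tendency described above.

import numpy as np

rng = np.random.default_rng(0)

def make_attention_no_softmax(d_model, n_heads):
    """Sample one random multi-head self-attention layer with the softmax omitted."""
    d_head = d_model // n_heads
    heads = [
        {name: rng.normal(0, 1 / np.sqrt(d_model), (d_model, d_head))
         for name in ("Wq", "Wk", "Wv")}
        for _ in range(n_heads)
    ]
    Wo = rng.normal(0, 1 / np.sqrt(d_model), (n_heads * d_head, d_model))

    def apply(X):  # X: (seq_len, d_model)
        outs = []
        for h in heads:
            Q, K, V = X @ h["Wq"], X @ h["Wk"], X @ h["Wv"]
            outs.append((Q @ K.T / np.sqrt(d_head)) @ V)  # scores used directly, no softmax
        return np.concatenate(outs, axis=-1) @ Wo

    return apply

def angle_deg(u, v):
    """Angle in degrees between two flattened representations."""
    cos = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

seq_len, d_model, n_heads = 16, 64, 8
layer = make_attention_no_softmax(d_model, n_heads)  # same random layer for both inputs

X1 = rng.normal(size=(seq_len, d_model))
X2 = 0.8 * X1 + 0.2 * rng.normal(size=(seq_len, d_model))  # correlated pair of inputs

print("input angle :", angle_deg(X1.ravel(), X2.ravel()))
print("output angle:", angle_deg(layer(X1).ravel(), layer(X2).ravel()))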

This seminar is considered a hearing seminar for M.Sc./Ph.D. students.

 
