A draft note about deep learning. It mostly starts with what, why, and how!
Architecture
What is a Dense layer?
A fully feed-forward layer
A neural layer in which every unit connects to every unit of the previous layer
We typically see it when constructing layers
CNN
What hidden layers are in a CNN?
- Convolutional layers
- Pooling layers
- Fully-connected layers
- Normalization layers
What is the purpose of convolutional layers?
To extract features using kernels (typically via a dot product)
What is the purpose of pooling layers?
To reduce the number of neurons
Use max (or average) pooling to extract a single value from each grid region
What is an example flow applying those CNN hidden layers?
... and then flatten
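A minimal sketch of that flow, assuming a PyTorch-style stack and made-up sizes (1x28x28 inputs, 10 classes): convolution extracts features, pooling shrinks the grid, and flatten feeds a fully-connected layer.

```python
import torch
import torch.nn as nn

# Hypothetical CNN for 1x28x28 inputs (e.g. grayscale digits).
model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),   # convolutional layer: 1 -> 8 feature maps
    nn.ReLU(),
    nn.MaxPool2d(2),                             # pooling layer: 28x28 -> 14x14
    nn.Conv2d(8, 16, kernel_size=3, padding=1),  # convolutional layer: 8 -> 16 feature maps
    nn.ReLU(),
    nn.MaxPool2d(2),                             # pooling layer: 14x14 -> 7x7
    nn.Flatten(),                                # flatten: 16x7x7 -> 784
    nn.Linear(16 * 7 * 7, 10),                   # fully-connected layer -> 10 classes
)

x = torch.randn(4, 1, 28, 28)   # a batch of 4 fake images
print(model(x).shape)           # torch.Size([4, 10])
```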
RNN overview
What is an RNN?
Unlike a plain feed-forward neural network, which has a fixed input and output size, an RNN handles sequences
It uses past and current information to make predictions
How does an RNN keep track of the relationship between features in the sequence?
Parameter sharing gives the network the ability to look for a given feature anywhere in the sequence:
- Deals with variable length
- Maintains sequence order
- Keeps track of long-term dependencies
- Shares parameters across the sequence
How does an RNN actually use the shared parameters over time?
Through a feedback loop in the hidden layer (a short-term memory)
How does an RNN learn (backpropagation)?
Backprop is applied at every data point in the sequence
This is called backpropagation through time (BPTT)
What is the problem with RNNs?
The short-term memory of an RNN is caused by vanishing gradients
With a long sequence, we forget a lot
Example: we have no idea which words “it” and “was” refer to because the gradient has vanished => it is hard to predict from just “on Tuesday”
Why do vanishing gradients occur on long sequences with an RNN?
In backpropagation, each step's gradient is computed from the previous step's gradient (the chain rule multiplies them together)
If the per-step gradient factors are small, the product keeps shrinking until it vanishes
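A tiny numeric illustration of that chain-rule effect (the 0.5 and 1.5 per-step factors are made up): multiplying many factors below 1 drives the gradient toward zero, while factors above 1 blow it up.

```python
# Chain rule over T time steps multiplies T per-step gradient factors together.
T = 50
vanishing = 0.5 ** T   # factors < 1: the gradient shrinks toward 0
exploding = 1.5 ** T   # factors > 1: the gradient blows up
print(f"0.5^{T} = {vanishing:.3e}")   # ~8.9e-16 -> vanishing gradient
print(f"1.5^{T} = {exploding:.3e}")   # ~6.4e+08 -> exploding gradient
```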
RNN
Intuitive approach
How do we handle variable input sizes?
__
We can pad the input
But that does not work well when sequence lengths vary a lot
What does the linear function look like in this RNN sequence?
__
We can take as inputs the previous activation value ($a^{l-1}$) and the sequence input at that step ($x_{i,t}$),
with an individual weight and bias for each layer (???), and yes, the result still goes through an activation function.
How do we handle the missing layers when input sizes are not the same?
Pad the first layers with an activation value of 0
What happens when we use an individual weight and bias for each layer ($W^l$, $b^l$)?
We should instead share the weight matrix and bias across the whole sequence to avoid having too many weight matrices
With sharing, the RNN can have as many steps as we want because no extra parameters are required
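A rough numpy sketch of that parameter sharing, assuming a tanh activation and toy sizes: one shared set of weights (W_h, W_x, b) is reused at every time step, so sequences of any length need no extra parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, feat = 4, 3
W_h = rng.normal(size=(hidden, hidden)) * 0.1   # shared across all time steps
W_x = rng.normal(size=(hidden, feat)) * 0.1     # shared across all time steps
b = np.zeros(hidden)

def rnn_forward(xs):
    """xs: list of feature vectors, any length (the shared weights don't care)."""
    a = np.zeros(hidden)                         # initial activation (the zero-padded state)
    for x_t in xs:
        a = np.tanh(W_h @ a + W_x @ x_t + b)     # same W and b at every step
    return a

print(rnn_forward([rng.normal(size=feat) for _ in range(7)]))   # length 7 works
print(rnn_forward([rng.normal(size=feat) for _ in range(2)]))   # so does length 2
```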
Problems with RNNs?
- The sequence must be processed step by step, which makes it impossible to parallelize
- Multiplying many numbers over and over again causes vanishing or exploding gradients
What problem does a vanilla RNN have when it predicts a whole sentence without knowing its previous output time steps?
__
If predictions are not based on the previous outputs, the resulting sentence can be pointless because the words have no connection to each other
Future outputs must be based on past outputs to predict the next token
What problem does distributional shift describe?
How the heck do we know how to predict the right sentence when the first token is “I”?
Scheduled sampling
Scheduled Sampling Analogy: Learning to Play a Musical Piece
Imagine you are a music student (the model) learning to play a complex piece on the piano.
1. Traditional Training (Teacher Forcing):
- Your music teacher (ground truth) is sitting right next to you.
- Every time you play a note, even if you make a mistake, the teacher immediately tells you the correct next note to play. They guide your fingers precisely.
- The problem: You’re not really learning to recover from your mistakes. If the teacher suddenly vanished, and you hit a wrong note, you’d likely get completely lost because you’ve never practiced figuring out the correct sequence after an error. This is exposure bias. You’re exposed only to perfect sequences.
2. Schedule Sampling (The Smart Way to Learn):
Now, your teacher uses a new method:
- Early in your practice (high probability of teacher guidance): The teacher still helps you a lot. Maybe 90% of the time, they tell you the correct next note even if you messed up. This helps you learn the basic melody.
- Gradually, as you get better (decreasing probability of teacher guidance): The teacher starts to withdraw their help.
- Sometimes (say, 20% of the time), they’ll let you play your own wrong note and then make you figure out the next note based on your own incorrect position.
- They might still correct you on subsequent notes, but you’re getting practice navigating from a “bad state.”
- The percentage of times they force you to play your own note increases.
- Late in your practice (low probability of teacher guidance): You’re mostly playing the piece on your own. If you hit a wrong note, you have to try to recover and play the correct next note from that mistaken position.
The “Why” it’s Needed (and why Scheduled Sampling is good):
By gradually forcing you to play from your own (potentially wrong) previous notes, you learn:
- Resilience: You get better at recovering when you make a mistake.
- Robustness: Your performance doesn’t fall apart completely just because of one small error.
- Real-world Readiness: When you perform the piece by yourself (inference), you’re prepared to handle any missteps, because you’ve practiced playing without constant perfect guidance.
In summary: Scheduled sampling is like a smart teacher who gradually reduces their direct help, forcing the student to learn from and adapt to their own errors, making them a much more robust and independent performer.
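A hedged Python sketch of such a schedule (the fake one-word “model” and the linear decay of the teacher-forcing probability are purely illustrative): with probability p the decoder sees the ground-truth token, otherwise its own previous prediction.

```python
import random

def scheduled_sampling_inputs(gold_tokens, model_predict, p_teacher):
    """Choose, token by token, whether the decoder sees the ground truth
    or its own previous prediction. `model_predict` stands in for the model."""
    prev = "<START>"
    inputs = []
    for gold in gold_tokens:
        inputs.append(prev)
        pred = model_predict(prev)
        # With probability p_teacher behave like teacher forcing,
        # otherwise feed the model's own (possibly wrong) output.
        prev = gold if random.random() < p_teacher else pred
    return inputs

# Toy usage: a fake "model" that always predicts "la", and a schedule that
# decays the teacher-forcing probability across epochs.
gold = ["I", "like", "music", "<END>"]
for epoch in range(3):
    p = 1.0 - epoch / 3            # illustrative linear decay: 1.0, 0.67, 0.33
    print(p, scheduled_sampling_inputs(gold, lambda _: "la", p))
```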
Seq2Seq
Use this architecture
Can we do something more than just machine translation?
Generate text
The previous output is passed back as the next input
But we need to run the encoder before we can generate text
But how do we represent a word?
There are many ways. The simplest is a one-hot encoding
A more sophisticated option is a word embedding
How do we know a generated word is the end of the sentence? Each word needs the previous word, but what about the first word, what will its input be?
If we generate a whole new sentence, the start token is probably random or something we predefine
We use dedicated start and end tokens as separators
How do we make the language model learn to predict the sentence (not the same as how we run the actual model)?
We feed the tokens one at a time => the model starts predicting, but we don't sample from its prediction; we directly feed the next ground-truth word instead (teacher forcing)
How do we predict a blank word in the sentence?
How do we build an image captioning model?
CNN => vector used as the initial state of the RNN => decoder generates the sequence
The output of the CNN becomes the initial state vector
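A rough PyTorch-style skeleton of that wiring, with made-up module names and sizes: a (pretend) CNN feature vector is projected into the decoder's initial hidden state, then an RNN decodes the caption.

```python
import torch
import torch.nn as nn

class CaptionModel(nn.Module):
    """Illustrative captioning skeleton: CNN features -> initial RNN state -> word logits."""
    def __init__(self, feat_dim=512, hidden=256, vocab=1000, embed=128):
        super().__init__()
        self.init_state = nn.Linear(feat_dim, hidden)   # CNN output becomes the initial vector
        self.embed = nn.Embedding(vocab, embed)
        self.rnn = nn.GRU(embed, hidden, batch_first=True)
        self.readout = nn.Linear(hidden, vocab)

    def forward(self, cnn_features, tokens):
        h0 = torch.tanh(self.init_state(cnn_features)).unsqueeze(0)  # (1, batch, hidden)
        out, _ = self.rnn(self.embed(tokens), h0)                    # decode the caption so far
        return self.readout(out)                                     # logits over the vocabulary

feats = torch.randn(2, 512)                 # pretend CNN features for 2 images
tokens = torch.randint(0, 1000, (2, 5))     # pretend caption prefixes
print(CaptionModel()(feats, tokens).shape)  # torch.Size([2, 5, 1000])
```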
How do we build a translation machine?
Add another LSTM layer (the encoder) that produces the context information
How do we mark the start and end of a sentence in both the encoder and decoder of seq2seq?
The second <START> will be the start of the first sentence
What can we use Seq2Seq for?
Beam search for seq2seq
Why do we need beam search for machine translation?
For example, if we follow greedy search we can end up with a pointless sentence overall
We want the highest probability for the whole sequence
How do we evaluate how good a generated sequence is?
Each token depends on all the previous tokens and the input sequence: $P(y \mid x) = \prod_t P(y_t \mid y_{<t}, x)$
How large can the space of possible sequences be?
Tree search => a vocabulary of M words and sequences of length T give $M^T$ possibilities (we can use graph search, but it is not optimal)
But how do we actually perform beam search over a sequence?
Search for the k best words (hypotheses) at each step
A trick to better compute the goodness of the sequences we generate?
Use logs so we can sum instead of multiply
=> Compute the log probabilities, then choose the best k out of the $k^2$ candidates
When do we stop decoding?
Let’s say one of the highest-scoring hypotheses ends in <END>
Save it, along with its score, but do not pick it to expand further (there is nothing to expand)
Keep expanding the k remaining best hypotheses
Continue until either some cutoff length T, or until we have N hypotheses that end in <END>
How do we pick the final sequence?
The longer the sequence, the lower its total score (more negative numbers added together)
=> We can compute the average log probability per token instead
Overall, what is the process for performing beam search to generate a sequence?
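A minimal sketch tying those steps together, assuming a toy next-token model, beam width k = 2, and <END> as the stop token: sum log-probabilities, keep the top k hypotheses, set finished ones aside, and pick by average log-probability at the end.

```python
import math

# Toy "model": given a prefix (list of tokens), return log-probabilities over a tiny vocabulary.
VOCAB = ["I", "like", "cats", "<END>"]

def next_token_logprobs(prefix):
    if not prefix:
        probs = [0.7, 0.1, 0.1, 0.1]       # favor "I" as the first token
    elif len(prefix) >= 2:
        probs = [0.05, 0.05, 0.1, 0.8]     # eventually favor <END>
    else:
        probs = [0.1, 0.3, 0.3, 0.3]
    return {w: math.log(p) for w, p in zip(VOCAB, probs)}

def beam_search(k=2, max_len=5):
    beams = [([], 0.0)]                    # (tokens so far, sum of log-probs)
    finished = []
    for _ in range(max_len):               # cutoff length T
        candidates = []
        for tokens, score in beams:
            for w, lp in next_token_logprobs(tokens).items():
                candidates.append((tokens + [w], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:k]:
            if tokens[-1] == "<END>":
                finished.append((tokens, score))   # save it, don't expand further
            else:
                beams.append((tokens, score))      # keep expanding the remaining best
        if not beams:
            break
    # Length-normalize: pick by average log-prob per token, not the raw sum.
    return max(finished or beams, key=lambda c: c[1] / len(c[0]))

print(beam_search())
```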
Attention
What is the bottleneck problem of seq2seq?
All the information the decoder gets is the single hidden state the encoder passes to it
This does not work well for long sequences
Seq2seq is like trying to read a long sentence once and remember every word. What we remember best are the latest words, but we need some way to glance back at earlier, related information to understand the current sentence (attention)
But what do we want in order to do better?
Idea: what if we could somehow “peek” at the source sentence while decoding?
Can we look back for the important information?
Let's go through some attention terminology
What is a key?
A piece of activation from the encoder (the output at a specific time step)
What is a query?
A query represents what we want to look for in the encoder
=> We want to find the keys that are close to the query
The point is that keys and queries are learned by the network; we don't need to specify them manually!
Why do we need $h_t$ and not just $x_t$?
__
To share learned weights across the sequence, like an RNN does
We write $h$ for the encoder states, $s$ for the decoder states, and $k$ for the keys (the result of a function applied at a specific time step)
What is the relation between $k_t$ and $h_t$?
Think of it as applying a function $k$ at a specific time step
The $k$ function could be something like a linear function of $h_t$, e.g. $k_t = W_k h_t$
What is the relation between $q_l$ and $s_l$?
Apply a linear or non-linear $q$ function to the specific $s_l$, e.g. $q_l = W_q s_l$
How do we evaluate the attention between the query and the keys?
Take the dot product of $q_l$ and $k_t$:
$e_{t,l}=q_l \cdot k_t$
Then apply a softmax over the attention scores:
$\alpha_{t,l}=\mathrm{softmax}_t(e_{t,l})$
Then each encoder time step contributes its information $h_t$, weighted by the corresponding $\alpha_{t,l}$
What we send to the current decoder step is the weighted sum, i.e. the context vector $a_l = \sum_t \alpha_{t,l}\, h_t$
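A small numpy sketch of those formulas, assuming linear $k$ and $q$ functions ($k_t = W_k h_t$, $q_l = W_q s_l$) and random toy values for the encoder states and decoder state.

```python
import numpy as np

rng = np.random.default_rng(1)
T, d_enc, d_dec, d_att = 5, 8, 8, 4          # toy sizes (assumed)
H = rng.normal(size=(T, d_enc))              # encoder states h_1 .. h_T
s_l = rng.normal(size=d_dec)                 # current decoder state s_l
W_k = rng.normal(size=(d_att, d_enc))        # assumed learned linear k function
W_q = rng.normal(size=(d_att, d_dec))        # assumed learned linear q function

K = H @ W_k.T                                # k_t = W_k h_t, one key per encoder step
q = W_q @ s_l                                # q_l = W_q s_l
e = K @ q                                    # e_{t,l} = q_l . k_t
alpha = np.exp(e - e.max())
alpha /= alpha.sum()                         # softmax over the encoder steps
a_l = alpha @ H                              # context vector: sum_t alpha_{t,l} * h_t

print(alpha.round(3))                        # attention weights over the 5 encoder steps
print(a_l.shape)                             # (8,) context vector sent to the decoder
```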
How does the decoder produce its output based on the result of attention?
It combines the current activation value and the attention context to produce the output $\hat{y}$ (which is not the result of the activation function)
What is passed to the next decoder step?
Quick overview of the whole process
List all the ways to use the context vector $a_l$:
- Concatenate it to the hidden state: meaning we pass it into the next time step
- Use it for readout: e.g. $\hat{y}_l=f(s_l, a_l)$, used to produce the output at a time step (when we output a sequence of results)
- Concatenate it as input to the next RNN layer: it becomes input for the next RNN time step
Attention variants for the key and query choice?
- Simple => use the encoder and decoder hidden states directly as the keys and queries
- Multiplicative attention: combining two linear weight matrices is still a single linear weight, so one learned $W$ suffices
(the $W$ is always learnable, but keep in mind it is linear)
- Scaled attention => apply some function $v$ to change the scale of the attention score
Why is attention good?
Every decoder step can connect to every encoder step
Transformer
Can we use attention to replace the recurrent connection?
Yes, but attention only lets us access the previous inputs; how do we also access the previous outputs? => self-attention. E.g. at $s_2$ we want attention not only over the encoder states ($h_1$-$h_3$) but also over the previous outputs $s_0$ and $s_1$
The features used by the sequence model can even come from an image
Positional vector
What is the purpose of the positional vector?
To specify position, since attention on its own does not use the order of elements in the sequence
What are the dimension and values of the positional vector?
The same dimension as the embedding vector
We use sin for the even dimensions and cos for the odd dimensions
What is $t$ in the positional vector?
The time step in the sequence
What is the range of its values?
__
The values lie in $[-1, 1]$; the frequency changes with the index in the vector, up to the embedding dimension
Why does the positional vector use a frequency representation (sin and cos) rather than the absolute position?
It is better at implicitly encoding relative position
How do we combine the embedding vector with the positional vector?
We usually just add the two matrices together
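A numpy sketch of the usual sinusoidal encoding this describes (the 10000 base is the common convention, assumed here): sin on the even dimensions, cos on the odd ones, then simply added to the embeddings.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """P[t, 2i] = sin(t / 10000^(2i/d)), P[t, 2i+1] = cos(t / 10000^(2i/d))."""
    t = np.arange(seq_len)[:, None]             # time step in the sequence
    i = np.arange(0, d_model, 2)[None, :]       # even dimension indices
    angle = t / np.power(10000.0, i / d_model)
    P = np.zeros((seq_len, d_model))
    P[:, 0::2] = np.sin(angle)                  # sin on even dimensions
    P[:, 1::2] = np.cos(angle)                  # cos on odd dimensions
    return P

embeddings = np.random.randn(10, 16)            # toy word embeddings
x = embeddings + positional_encoding(10, 16)    # "just add the two matrices"
print(x.shape)                                  # (10, 16)
```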
Self attention
Why do we call it self-attention?
Each word computes attention with every other word, including itself, to gather the attention information
Attention? What does it mean for two vectors to be related to each other?
The relation is represented by the dot product of the two vectors
How do we compute the attention over a specific time step?
__
The key and query are extracted from the input at each time step
For example, to compute the attention for time step 2:
we compute how time step 2 relates to every other time step, take the softmax, and multiply by the corresponding values.
Here $d_k$ is the dimension of the key vector. To avoid exploding attention values in a high-dimensional space (e.g. 512), we divide by the square root of the dimension, $\sqrt{d_k}$.
What does multiplying the softmax by the value vectors mean?
$v_t$ extracts the value from the input; the softmax weight represents the attention (as a percentage) given to that information
The weighted sum is dominated by the arg-max entry, i.e. by the information we most need to focus on
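A numpy sketch of scaled dot-product self-attention as just described, with assumed linear maps for Q, K, V and toy sizes: scores are divided by $\sqrt{d_k}$, softmaxed, and used to weight the values.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
T, d_model, d_k = 4, 8, 8                    # toy sequence length and sizes
X = rng.normal(size=(T, d_model))            # the input time steps
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v          # queries, keys, values from the same input
scores = Q @ K.T / np.sqrt(d_k)              # scale by sqrt(d_k) to keep scores sane
A = softmax(scores, axis=-1)                 # each row: attention of one step over all steps
out = A @ V                                  # weighted sum of the values

print(A.round(2))                            # row t = how much step t attends to each step
```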
What are the dimensions of the keys, queries, and attention?
Why do we need masked attention?
To prevent attention from looking at future predictions, which would create circular dependencies and stop the attention from converging
How do we add more hidden layers, like in a feed-forward network, but for self-attention?
Pass the attention output on to the next layer, again and again
But in a transformer we need to add a non-linear function before passing it to the next layer
Self-attention lets us query for one specific piece of information; how do we query for many pieces of information and learn the relations between them?
__
We use multi-head attention to query multiple positions in each layer
E.g. one head attends to the subject and another to the verb
Is the output of self-attention linear or non-linear?
Apart from the softmax, the weighted sum itself is linear
That becomes a problem when we connect many multi-headed attention layers to each other
Multi-headed attention
What is multi-headed attention?
We have more than one self-attention head so each can learn to extract different information
We combine them together (by concatenation) to get the full attention vector
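A short numpy sketch of that combination with assumed toy sizes and two heads: run several self-attention heads, concatenate their outputs, and (as the next card notes) pass the combined result through a non-linearity.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def head(X, d_k, rng):
    """One self-attention head with its own (assumed linear) Q, K, V maps."""
    W_q, W_k, W_v = (rng.normal(size=(X.shape[1], d_k)) for _ in range(3))
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                            # 4 time steps, d_model = 8
n_heads, d_k = 2, 4
heads = [head(X, d_k, rng) for _ in range(n_heads)]    # each head attends to different things
full = np.concatenate(heads, axis=-1)                  # concatenation -> full attention vector
W_o = rng.normal(size=(n_heads * d_k, 8))
out = np.tanh(full @ W_o)                              # non-linear function on the output
print(out.shape)                                       # (4, 8)
```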
How do we make the result of each multi-headed attention (self-attention) block non-linear?
Apply a non-linear function to the output
Masked encoding
How does masked attention avoid looking up future outputs?
Assign $-\infty$ (in practice a very large negative number) to those positions before the softmax, so their attention weight becomes 0
Classic transformer
What is the overall architecture of the transformer?
The encoder is used to capture the meaning of the input
The decoder predicts from the input and the outputs produced so far
How do the encoder and decoder interact with each other?
TODO: