
Explaining Seq2Seq Encoding-Decoding Processes
The Sequence-to-Sequence (Seq2Seq) model is a deep learning architecture widely used in tasks like machine translation, text summarization, and chatbot responses. Fundamentally, the model consists of two core components:
Encoder: Processes the input sequence into a fixed-size context representation (also called a thought vector or context vector).
Decoder: Uses the encoded representation to generate an output sequence, step by step.
Sequence to Sequence Learning – LSTM
Both the Encoder and the Decoder are structured as Recurrent Neural Networks (RNNs). The model was proposed in a 2014 paper titled Sequence to Sequence Learning with Neural Networks by researchers at Google. At the time, the group was leveraging vast amounts of text data and found that this neural architecture was well suited to Machine Translation.
In the paper, the terminology refers to the Long Short-Term Memory (LSTM) architecture, but fundamentally this is still the same encoder-decoder approach. The contrast is with earlier Deep Neural Networks (DNNs), which required a fixed depth and a fixed-length input over which to perform inference. LSTMs, by contrast, can handle variable-length sequences because they learn how to “store” or “forget” information over arbitrary spans of time.
For my linguists, let’s stop here: Imagine you have a messenger (the first LSTM) whose job is to carefully listen to an entire spoken sentence, summarizing it into a compact form—like taking notes and condensing them into a single message. Then, a second messenger (the second LSTM) reads that condensed message and expands it back into a new sentence, possibly in a different language or format.
We’re no longer constrained by a static input or output shape, as we would be with a purely feed-forward DNN; instead, the LSTM’s recurrent structure handles any number of time steps (words, tokens), making it a better fit for tasks like translation or text summarization where sequence lengths can vary significantly.
This two-step process allows the system to handle variable-length sentences, because it no longer needs a fixed-size input or output. Instead, the first LSTM just keeps “listening” until it finishes the sentence, and then hands off its final “summary” (sometimes referred to as the “context vector”) to the second LSTM, which starts “speaking” one word at a time until the full output sequence is produced. This approach is often referred to as Sequence-to-Sequence because it elegantly takes us from an input sequence (like an English sentence) to an output sequence (like its French or Spanish translation).
A Brief Look Under the Hood
Underneath this “messenger” metaphor are the LSTM equations, which ensure the model can propagate and selectively retain information over many time steps:
$$ \begin{aligned} i_t &= \sigma(W_{ix} x_t + W_{ih} h_{t-1} + b_i), \\ f_t &= \sigma(W_{fx} x_t + W_{fh} h_{t-1} + b_f), \\ \tilde{C}_t &= \tanh(W_{cx} x_t + W_{ch} h_{t-1} + b_c), \\ C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t, \\ o_t &= \sigma(W_{ox} x_t + W_{oh} h_{t-1} + b_o), \\ h_t &= o_t \odot \tanh(C_t), \end{aligned} $$
where \( h_t \) is the hidden state, \( C_t \) is the cell state, and \( i_t, f_t, o_t \) are the input, forget, and output gates, respectively. Each gate is controlled by learned weight matrices \( W \) and biases \( b \), and \( \sigma \) denotes the sigmoid function.
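To make these equations concrete, here is a minimal numpy sketch of a single LSTM step. The weight matrices, the 4-dimensional sizes, and the parameter names are made-up, untrained placeholders chosen only to mirror the symbols above.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, params):
    """One LSTM step following the gate equations above (toy, untrained weights)."""
    W_ix, W_ih, b_i = params["i"]
    W_fx, W_fh, b_f = params["f"]
    W_cx, W_ch, b_c = params["c"]
    W_ox, W_oh, b_o = params["o"]

    i_t = sigmoid(W_ix @ x_t + W_ih @ h_prev + b_i)      # input gate
    f_t = sigmoid(W_fx @ x_t + W_fh @ h_prev + b_f)      # forget gate
    C_tilde = np.tanh(W_cx @ x_t + W_ch @ h_prev + b_c)  # candidate cell state
    C_t = f_t * C_prev + i_t * C_tilde                   # new cell state
    o_t = sigmoid(W_ox @ x_t + W_oh @ h_prev + b_o)      # output gate
    h_t = o_t * np.tanh(C_t)                             # new hidden state
    return h_t, C_t

# Toy setup: 4-dimensional embeddings and hidden states, random untrained weights.
rng = np.random.default_rng(0)
dim = 4
params = {gate: (rng.normal(size=(dim, dim)), rng.normal(size=(dim, dim)), np.zeros(dim))
          for gate in ("i", "f", "c", "o")}

h, C = np.zeros(dim), np.zeros(dim)
h, C = lstm_step(rng.normal(size=dim), h, C, params)  # one step over one toy input vector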
Encoder: One LSTM “reads” (encodes) the entire input sequence \( \{ x_1, x_2, \dots, x_T \} \), producing a final hidden (and cell) state that distills the content of the entire sentence into a single vector.
Decoder: Another LSTM “writes” (decodes) the output sequence \( \{ y_1, y_2, \dots, y_{T'} \} \), conditioned on that final hidden (and cell) state from the encoder.
This design allows a model to transduce an input sequence into a potentially different-length output sequence. By freeing the model from fixed input-output lengths, LSTMs elegantly solve many real-world problems that standard DNNs with fixed-size inputs could not handle.
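As a rough sketch of that handoff (reusing the toy lstm_step, params, and rng from the snippet above, which are illustrative stand-ins rather than trained components), the encoder consumes the source embeddings and its final hidden and cell state seed the decoder, which then rolls forward step by step:

def encode_lstm(source_embeddings, params, dim=4):
    """Run the encoder LSTM over the source; return its final hidden and cell state."""
    h, C = np.zeros(dim), np.zeros(dim)
    for x_t in source_embeddings:
        h, C = lstm_step(x_t, h, C, params)
    return h, C  # the "context" that summarizes the whole input sequence

def decode_lstm(h, C, start_embedding, params, steps=3):
    """Roll the decoder LSTM forward from the encoder's final state.
    In a real Seq2Seq model each h_t would be projected onto a vocabulary and the
    chosen token's embedding fed back in; here h_t itself stands in for that embedding."""
    x_t = start_embedding
    decoder_states = []
    for _ in range(steps):
        h, C = lstm_step(x_t, h, C, params)
        decoder_states.append(h)
        x_t = h  # stand-in for the embedding of the token just generated
    return decoder_states

# Purely illustrative usage with random "embeddings" for a 3-token source sentence.
source = [rng.normal(size=4) for _ in range(3)]       # e.g., "How", "are", "you"
h_enc, C_enc = encode_lstm(source, params)
decoder_states = decode_lstm(h_enc, C_enc, rng.normal(size=4), params)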
As implied, text content is first encoded and then later decoded for whatever purpose. In a typical sequence-to-sequence setup, one RNN encodes the input sequence (e.g., a sentence) into a compact representation, while another RNN decodes that representation into an output sequence.
Concretely, for machine translation, consider the input sequence ["How", "are", "you"] and the target sequence ["Como", "te", "va"].
- Encoder: Processes each word in [How, are, you] one by one. So, it requires 3 steps to encode the entire sentence.
- Decoder: Generates each word in [Como, te, va] one by one. Thus, it takes 3 steps to decode the sentence. (In practice, an extra start symbol, like ⟨start⟩, is also often used, so the decoder’s inputs might look like [⟨start⟩, Como, te, va]; a small sketch of this bookkeeping follows the list.)
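For concreteness, here is a small sketch of how that start symbol typically shifts the decoder’s inputs relative to its targets during training (standard “teacher forcing” bookkeeping; the token lists are illustrative and not taken from the paper):

# Decoder input/target alignment when start and end symbols are used.
target = ["Como", "te", "va"]
decoder_inputs = ["<start>"] + target       # ["<start>", "Como", "te", "va"]
decoder_targets = target + ["<end>"]        # ["Como", "te", "va", "<end>"]

# At step t the decoder reads decoder_inputs[t] and is trained to predict decoder_targets[t].
for t, (inp, tgt) in enumerate(zip(decoder_inputs, decoder_targets), start=1):
    print(f"decode step {t}: input={inp!r} -> predict {tgt!r}")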
Crucially, we no longer need to force our model into a single, rigid input or output shape—something we’d typically have to do with a standard feed-forward network. Thanks to the RNN’s recurrent design, we can naturally accommodate inputs and outputs of varying lengths. This makes RNNs (and by extension LSTMs and GRUs) far more suitable for tasks like machine translation or summarization, where sequence lengths can differ significantly from one example to the next.
No Isomorphism
There is something interesting about one of the points in that 2014 paper first introducing Seq2Seq: the authors reversed the order of the words in each source (English) sentence before feeding it to the encoder, and found that this simple trick improved translation quality.
French word order often differs from English in ways that can make the alignment between the two languages trickier for a left-to-right model. While both are mostly SVO (Subject–Verb–Object) languages, French often positions pronouns, adjectives, and certain modifiers in ways that can lead to different word alignments relative to English. By reversing the word order in the English source sentence, the model ends up reading the first English words last, so the words that tend to align with the earliest French outputs sit closest to the point where the decoder starts generating. This shortens the effective distance the model has to remember before predicting the corresponding French output tokens.
In other words, reversing the input can reduce long-range dependencies that the LSTM must learn, because tokens that need to align more closely are brought into closer proximity during training. This trick just happens to work very well for English→French (and can help in other translation pairs, depending on how much word order differs). It doesn’t necessarily mean English or French must have entirely reversed syntax in general—rather, the reversal helps the LSTM see more directly matched pairs without having to “reach” too far back in the input sequence.
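Concretely, the trick touches only the source side. A minimal sketch with the earlier toy sentence (the token lists are just for illustration):

# Source-reversal trick from the 2014 paper: reverse the source tokens, leave the target alone.
source = ["How", "are", "you"]
target = ["Como", "te", "va"]

encoder_input = list(reversed(source))   # ["you", "are", "How"]
decoder_target = target                  # unchanged: ["Como", "te", "va"]

# "How" is now the last token the encoder reads, so it sits closest in time
# to the first tokens the decoder must produce.
print(encoder_input, "->", decoder_target)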
However, it is important to note that the simplification in the earlier example (an equal number of encoding and decoding steps) does not generally hold for real-world translations. Natural language syntax and semantics frequently diverge, resulting in varying numbers of source and target tokens. For those engaging these relatively new methods and architectures, this is one simple fact to keep in mind: though two sentences might convey the same meaning, there is no strict requirement that the sequences be isomorphic in structure or length. At a semantic level, we can intuit that sentences mean the ‘same thing’, but as sequences of words, they do not pattern alike.
Attention Mechanism
Two sequences going in a single direction sounds simple enough. However, the attention mechanism, first introduced for neural machine translation and later extended by Google’s research arm into the self-attention of the Transformer, brought to the fore the idea that specific sequence states can look beyond their immediately preceding state. In other words, researchers recognized that information from words not immediately preceding the current word is valuable, and accounted for this by allowing connections between hidden states across the two sequences (encoder-decoder attention) and within a single sequence (‘self-attention’).
More specifically, at each decoding step the decoder’s current hidden state acts as a query: it is compared (by vector similarity) against the encoder’s hidden states, the resulting scores are normalized into weights, and the weighted sum of the encoder states becomes the context used to predict the next word.
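Here is a minimal numpy sketch of that scoring idea using simple dot-product attention; the encoder states and the query vector are made up, and real systems add learned projections, scaling, and a fresh query at every decoding step:

import numpy as np

def attention(query, encoder_states):
    """Score each encoder hidden state against the decoder query, softmax the scores,
    and return the weighted sum of encoder states (the context for this step)."""
    scores = encoder_states @ query          # one similarity score per source position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                 # softmax over source positions
    context = weights @ encoder_states       # weighted sum of encoder hidden states
    return context, weights

# Toy example: 3 encoder hidden states (one per source token) and one decoder query.
encoder_states = np.array([[1.0, 0.0, 0.0, 0.0],
                           [0.0, 1.0, 0.0, 0.0],
                           [0.0, 0.0, 1.0, 0.0]])
query = np.array([0.2, 0.9, 0.1, 0.0])

context, weights = attention(query, encoder_states)
print("attention weights:", weights)   # highest weight on the second source position
print("context vector:", context)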
While the Seq2Seq architecture is still relevant today, the more powerful generation of language models makes use of the Transformer architecture, which refines and expands on key aspects of Seq2Seq and extends crucial concepts around memory. Efficiency is also said to be enhanced, but the proofs around these claims are difficult for me to personally evaluate.
Anecdotally, training a model with several hundred billion parameters on the Transformer architecture is still extremely expensive. So, yeah, there’s that!
The idea is that attention mechanisms use the encoded states and context vectors to infer what to predict next. For now, the Seq2Seq RNN discussion helps us understand how sequential data can be transduced across domains for a task; in this case, a language translation task that pairs input and output sequences.
Toy Example
Returning to the concept of sequence-to-sequence encoding and decoding, we have a toy example below that will transform English into Spanish. We invoke just the numpy library, but under other conditions we would likely use something like PyTorch, large text samples, and a sufficiently rich encoder-decoder architecture, like BERT.
At any rate, the Seq2Seq example below shows how each individual word token is represented as an array, or embedding. At every step the current embedding is added to the hidden state and squashed with tanh, so similar inputs lead to similar hidden states based on their vector representations. Programmatically, we split the work into two phases: “encode” the source sentence, then “decode” the target sentence step by step, updating the hidden state with each generated token’s embedding.
import numpy as np

# ------------------------------------------------------------------------------
# A tiny dictionary for token embeddings (completely made up)
# ------------------------------------------------------------------------------
token_embeddings = {
    "How": np.array([1.0, 0.0, 0.0, 0.0]),
    "are": np.array([0.0, 1.0, 0.0, 0.0]),
    "you": np.array([0.0, 0.0, 1.0, 0.0]),
    "como": np.array([0.9, 0.9, 0.1, 0.1]),
    "estas": np.array([0.1, 0.8, 0.8, 0.1]),
    "tu": np.array([0.2, 0.2, 0.2, 0.9]),
}


def rnn_step(hidden_state, token_vector):
    """
    A toy 'RNN step' that just sums and applies a tanh, simulating a state update.
    """
    return np.tanh(hidden_state + token_vector)


def encode(input_sequence):
    """
    Encodes the input sequence using a simple RNN and returns the final hidden state
    (the 'context vector').
    """
    hidden_state = np.zeros(4)  # initial hidden state (4D for this toy example)
    print("=== ENCODING ===")
    for i, token in enumerate(input_sequence):
        hidden_state = rnn_step(hidden_state, token_embeddings[token])
        print(f"Encoder step {i+1}, token='{token}': hidden_state={hidden_state}")
    return hidden_state


def decode(context_vector, output_length=3):
    """
    Decodes the final hidden state into an output sequence in 3 steps.
    In a real Seq2Seq, the decoder would feed each generated token back in, etc.
    """
    hidden_state = context_vector
    output_sequence = []
    candidate_outputs = ["como", "estas", "tu"]
    print("\n=== DECODING ===")
    for i in range(output_length):
        next_token = candidate_outputs[i]
        hidden_state = rnn_step(hidden_state, token_embeddings[next_token])
        output_sequence.append(next_token)
        print(f"Decoder step {i+1}, generated='{next_token}': hidden_state={hidden_state}")
    return output_sequence


if __name__ == "__main__":
    # Input: ["How", "are", "you"] - 3 strings long.
    input_seq = ["How", "are", "you"]

    # ENCODER: 3 steps
    context = encode(input_seq)

    # DECODER: 3 steps => ["como", "estas", "tu"]
    translation = decode(context, output_length=3)

    print(f"\nFinal translation: {translation}")
As shown, each new hidden state is computed as hidden_state = tanh(old_hidden_state + current_token_embedding): the toy RNN step just adds the current token’s embedding to the previous hidden state and squashes the result with tanh, one token at a time.
Your output will look like this below:
=== ENCODING ===
Encoder step 1, token='How': hidden_state=[0.76159416 0. 0. 0. ]
Encoder step 2, token='are': hidden_state=[0.64201499 0.76159416 0. 0. ]
Encoder step 3, token='you': hidden_state=[0.56626998 0.64201499 0.76159416 0. ]

=== DECODING ===
Decoder step 1, generated='como': hidden_state=[0.89886352 0.91245834 0.69707811 0.09966799]
Decoder step 2, generated='estas': hidden_state=[0.76111645 0.93694847 0.90461885 0.19705623]
Decoder step 3, generated='tu': hidden_state=[0.74477445 0.813384 0.802152 0.79943912]

Final translation: ['como', 'estas', 'tu']