Before Attention appeared, Encoder-Decoder models built on RNNs and LSTMs were the mainstream approach to sequence processing. This architecture splits the task into two parts. First, the Encoder encodes the input, in some way, into a fixed-length vector called the context vector. Second, the Decoder decodes the context vector, generating the target output sequence token by token. An RNN or LSTM feeds each step's output back in as the next step's input, looping until the sequence ends. But this is an inherently serial procedure, so performance suffers, and when the input is long, the fixed-length context vector struggles to capture all of the input information.
In 2014, to solve this context-vector problem, Dzmitry Bahdanau of the University of Montreal applied the Attention mechanism in NLP for the first time. This early form is known as Bahdanau Attention. Its premise is that the context vector should not be static. The original RNN uses hidden states to store memory, and the Encoder computes its hidden state as:
$$ h_t=\tanh (W_{encode}\cdot h_{t-1} + U_{encode}\cdot x_t + b) $$
This can be abbreviated as:
$$h_t = \text{RNN}(h_{t-1}, x_t)$$
Here $h_t$ is the hidden state at time $t$, and $x_t$ is the input at time $t$. The multiplications by $W$ and $U$ decide how much old memory and how much new information we keep. The Decoder likewise has a hidden state:
$$s_t = \tanh(W_{dec} \cdot s_{t-1} + U_{dec} \cdot [y_{t-1}, c_t] + b_{dec})$$
This can be abbreviated as:
$$s_t = \text{RNN}(s_{t-1}, y_{t-1}, c_t)$$
Here $c_t$ is the context vector. In a plain RNN this is just the Encoder's final hidden state $h_t$. But if the sequence is too long, the hidden state has to carry memory across too many steps, and information gets lost.
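To make this concrete, here is the encoder recurrence as a minimal NumPy sketch (sizes and weights are arbitrary); note that a plain RNN carries only the latest $h$ forward, which is exactly the limitation described above:

```python
import numpy as np

d_h, d_x = 4, 3                       # hidden and input sizes (arbitrary)
W = np.random.randn(d_h, d_h)         # weighs the previous hidden state
U = np.random.randn(d_h, d_x)         # weighs the current input
b = np.zeros(d_h)

def rnn_step(h_prev, x_t):
    """One encoder step: h_t = tanh(W h_{t-1} + U x_t + b)."""
    return np.tanh(W @ h_prev + U @ x_t + b)

h = np.zeros(d_h)                     # h_0
for x_t in np.random.randn(5, d_x):   # a toy 5-token input sequence
    h = rnn_step(h, x_t)              # only the latest h survives
# h is the fixed-length context vector a plain RNN hands to the Decoder
```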
Bahdanau Attention solves this as follows. First, instead of storing only a single $h$, it keeps every $h$ produced during encoding, stored together in a matrix. Second, it adds a small neural network in front of the Decoder: each time a token is to be emitted, it computes a score between the current decoding state (which summarizes what has been output so far) and every $h$:
$$score(s_{t-1}, h_j) = v^T \tanh(W s_{t-1} + U h_j)$$
Here $s_{t-1}$ is the Decoder's previous hidden state.
The scores are then normalized with a Softmax:
$$a_{tj} = \frac{\exp(score(s_{t-1}, h_j))}{\sum_k \exp(score(s_{t-1}, h_k))}$$
This gives the attention weight on each $h$. The context vector is then computed from these attention weights:
$$ c_t=\sum_{j}a_{tj}h_j $$
That is, a weighted sum over all the $h$. This context vector is handed to the Decoder, which generates the output and computes $s_t$.
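Putting the score, the softmax, and the weighted sum together, here is a minimal NumPy sketch of the attention computation at one decoding step (all weights random, shapes arbitrary):

```python
import numpy as np

d_h = d_s = 4                          # encoder/decoder hidden sizes (arbitrary)
hs = np.random.randn(6, d_h)           # all encoder hidden states h_1..h_6
s_prev = np.random.randn(d_s)          # decoder state s_{t-1}

Wa = np.random.randn(d_s, d_s)         # the small scoring network's weights
Ua = np.random.randn(d_s, d_h)
v = np.random.randn(d_s)

# score(s_{t-1}, h_j) = v^T tanh(W s_{t-1} + U h_j), one score per h_j
scores = np.array([v @ np.tanh(Wa @ s_prev + Ua @ h_j) for h_j in hs])

# softmax -> attention weights a_{tj}
a = np.exp(scores - scores.max())
a /= a.sum()

c_t = a @ hs                           # context vector: weighted sum of all h_j
```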
Because the score is formed by adding (projections of) $s$ and $h$ together inside the tanh, Bahdanau Attention is also known as additive attention.
In 2015, Minh-Thang Luong at Stanford improved the mechanism: instead of running a separate neural network to compute attention, simply use a dot product to measure similarity:
$$score(s_{t-1},h_j)=s_{t-1}^T\cdot h_j$$
The scores become plain matrix multiplications, which make full use of the GPU.
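In code, the scores for all encoder states collapse into a single matrix-vector product; this sketch assumes the encoder and decoder hidden sizes match so the dot product is defined:

```python
import numpy as np

d = 4
hs = np.random.randn(6, d)        # encoder states h_1..h_6
s_prev = np.random.randn(d)       # decoder state, same size as each h_j

scores = hs @ s_prev              # all scores s_{t-1}^T h_j in one matmul
a = np.exp(scores - scores.max())
a /= a.sum()                      # softmax, as before
c_t = a @ hs                      # context vector
```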
In 2017, Google published the famous paper Attention Is All You Need, which replaced the traditional RNN structure outright, using Attention alone. In the earlier designs, the RNN computed one $h$ per encoding step, and at each decoding step an attention score was computed against every $h$.
Attention Is All You Need argues that this roundabout way of computing attention is unnecessary. Instead it uses "self-attention": for each word in the input sequence, we directly compute its attention against every other word in the sequence.
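A minimal NumPy sketch of self-attention in the paper's scaled dot-product form, with learned query/key/value projections (random weights here; the $\sqrt{d_k}$ scaling is from the paper):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Every token attends to every token: softmax(Q K^T / sqrt(d_k)) V."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])      # (seq_len, seq_len)
    a = np.exp(scores - scores.max(axis=-1, keepdims=True))
    a /= a.sum(axis=-1, keepdims=True)           # row-wise softmax
    return a @ V                                 # new representation per token

seq_len, d_model = 5, 8
X = np.random.randn(seq_len, d_model)            # one embedding per input word
Wq, Wk, Wv = (np.random.randn(d_model, d_model) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)              # shape (5, 8)
```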
title: How to build a PDF Autofiller Agent?
tags: [agent, pdf]
categories: [agent]
date: 2026-01-15 17:15:00
index_img: /img/agent.png
cover: /img/agent.png
thumbnail: /img/agent.png
excerpt: Notes
Design a Copilot chatbox with the following functionality: the user uploads a PDF file that contains fields to fill in and gives the chatbox commands describing what to fill. The AI interprets each command, identifies the target fields and values, fills those fields with the values the user specified, and returns the completed form to the user.
| Tool | Printed Select | Printed Edit | Scanned Select | Scanned Edit | Comments |
|---|---|---|---|---|---|
| Adobe Acrobat PDF | ✅ | ❓ | ✅ | ❓ | Needs Pro subscription to edit |
| ABBYY FineReader PDF | | | | | Can’t install on Mac |
| PDFfiller | ✅ | ✅ | ❌ | ❌ | |
| LuminPDF | ✅ | ❓ | ✅ | ❓ | Needs Pro subscription to edit |
PDF form Template
Input: PDF raw data
The raw PDF stores each form field as a numbered object; a minimal text-field object looks roughly like this (field name and coordinates are illustrative):

```
12 0 obj
<< /FT /Tx /T (first_name) /Rect [100 650 300 670] >> endobj
```
Inputs and labels are not connected at the data-structure level: unlike HTML, where a label can be linked to an input by id, nothing in the PDF source ties them together. The only way to identify which label belongs to which input is to compare their coordinates.
Tool: pdf.js (parses the raw PDF in the browser).
Output: return a map between each object (input or label) and its coordinates.
For example (keys and coordinates illustrative):

```json
{ "t1": { "type": "input", "box": [320, 700, 480, 718] },
  "Date of Birth": { "type": "label", "box": [210, 702, 300, 716] } }
```
Given the coordinates of each object, find the matching ones; in particular, for every input field, find its matching label. Return the relationships as JSON.
Nested for-loops that compute the Euclidean distance between every input-label coordinate pair work well enough: each input is matched to its nearest label (the Python section below sketches the same logic).
Use the Vercel AI SDK to orchestrate the “Reasoning-Action” loop. The LLM does not modify the file directly; it acts as a router that decides which client-side tool to call.
- Tool: `ai` (Vercel AI SDK). The `useChat` hook intercepts the LLM’s tool call: when the LLM requests `fill_fields`, the browser executes the JavaScript logic to update the PDF.
- Tool: `pdf-lib` (client-side JavaScript). The PDF is held as a `Uint8Array` in memory. Text fields are filled with `form.getTextField(id).setText(value)`, checkboxes with `form.getCheckBox(id).check()`, followed by `form.updateFieldAppearances()` to ensure text is rendered visibly (generating the `/AP` stream).

The Python version works from the same input: PDF raw data (bytes). As in the JS version, inputs (widgets) and visual labels (text) are disconnected in the PDF structure, so we need to extract them separately.
Tool: PyMuPDF (import fitz)
Output: A map between each object and its coordinates.
```python
# Extracted using page.widgets() and page.get_text("words")
```
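A minimal extraction sketch under these assumptions (the file name is illustrative; `fitz` is PyMuPDF’s import name):

```python
import fitz  # PyMuPDF

doc = fitz.open("form.pdf")          # file name illustrative
page = doc[0]

# Inputs: AcroForm widgets, each with a field name and a bounding box
inputs = {w.field_name: tuple(w.rect) for w in page.widgets()}

# Labels: plain words with their coordinates
labels = [{"text": word, "box": (x0, y0, x1, y1)}
          for x0, y0, x1, y1, word, *_ in page.get_text("words")]
```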
Logic: Spatial Matching (Euclidean Distance). Given the coordinates of widgets and text blocks, match each widget to its nearest text block, as sketched below.
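A minimal matching sketch; the `inputs`/`labels` shapes follow the extraction sketch above, and the sample values are illustrative:

```python
import math

inputs = {"t1": (320, 700, 480, 718)}
labels = [{"text": "Date of Birth", "box": (210, 702, 300, 716)},
          {"text": "Full Name",     "box": (210, 652, 280, 666)}]

def center(box):
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)

def nearest_label(box, labels):
    """Pick the label whose center is closest to the widget's center."""
    cx, cy = center(box)
    return min(labels, key=lambda l: math.hypot(center(l["box"])[0] - cx,
                                                center(l["box"])[1] - cy))

matches = [{"id": fid, "label": nearest_label(box, labels)["text"]}
           for fid, box in inputs.items()]
# -> [{"id": "t1", "label": "Date of Birth"}]
```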
Output: a map from `field_id` to `label_text` (e.g., `{"id": "t1", "label": "Date of Birth"}`).

Tool: LangChain + Pydantic. Use Pydantic to define a strict schema for the LLM output (structured output), replacing the need for raw prompt parsing.
Workflow:
```python
class FieldUpdate(BaseModel):
    ...
```
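A fuller sketch of this step, assuming an OpenAI chat model via `langchain-openai`; the model name, prompt, and field names are illustrative:

```python
from pydantic import BaseModel
from langchain_openai import ChatOpenAI

class FieldUpdate(BaseModel):
    field_id: str   # e.g. "t1"
    value: str      # text the user wants written into that field

class FormFill(BaseModel):
    updates: list[FieldUpdate]

# with_structured_output makes the model return a validated FormFill object
llm = ChatOpenAI(model="gpt-4o").with_structured_output(FormFill)

result = llm.invoke(
    "Known fields: t1 = Date of Birth, t2 = Full Name.\n"
    "User command: fill in my name, Jane Doe, born 1990-05-01."
)
# result.updates -> e.g. [FieldUpdate(field_id="t2", value="Jane Doe"), ...]
```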
Tool: pypdf
Action:
Load PDF bytes using PdfReader.
Map the LLM’s Pydantic output to a dictionary: { "field_id": "value" }.
Execute filling:
```python
writer.update_page_form_field_values(
    writer.pages[0], {"field_id": "value"})  # arguments illustrative
```
Return the BytesIO stream to the user.
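Putting the pypdf step together, a minimal sketch; the `fill_pdf` wrapper and the `/Annots` guard are illustrative choices, assuming the form consists of standard AcroForm fields:

```python
from io import BytesIO
from pypdf import PdfReader, PdfWriter

def fill_pdf(pdf_bytes: bytes, updates: dict[str, str]) -> BytesIO:
    """updates maps field_id -> value, e.g. {"t2": "Jane Doe"}."""
    reader = PdfReader(BytesIO(pdf_bytes))
    writer = PdfWriter()
    writer.append(reader)                    # copy pages and the AcroForm
    for page in writer.pages:
        if "/Annots" in page:                # only pages that carry widgets
            writer.update_page_form_field_values(page, updates)
    out = BytesIO()
    writer.write(out)
    out.seek(0)
    return out
```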