LLM-Sim: Peeking Inside a Language Model, One Token at a Time
There are plenty of articles explaining what a Large Language Model does. Far fewer show you exactly how it does it — step by step, probability by probability. LLM-Sim is a fully observable simulation of the LLM pipeline, built to make every internal decision visible and explorable in the browser.
What Is LLM-Sim?
LLM-Sim is an educational Python project that simulates the structural mechanics of a Large Language Model. The goal is clarity and observability, not realism: every prompt construction step, tokenisation decision, reasoning hop, tool call, and token-by-token probability table is captured in a structured JSON trace and rendered in a purpose-built browser UI.
You type a question. The six-stage pipeline runs. You get back:
- the final answer, displayed as animated token pills colour-coded by confidence (🟢 ≥ 80% · 🟡 50–80% · 🔴 < 50%)
- a link to a full execution trace viewer where you can drill into every candidate token and its probability at every generation step
The entire simulation core — prompt builder, tokeniser, agent, tools, and LLM core — is pure Python stdlib. No PyTorch, no Transformers, no external ML dependency of any kind.
The Six-Stage Pipeline
The LLMPipeline class in src/pipeline.py wires six single-responsibility components together. Here is how data flows through them.
Stage 1 — Prompt Construction
PromptBuilder wraps the raw user input in a labelled template identical in structure to what real instruction-tuned models receive:
[SYSTEM]
You are a helpful AI assistant. You reason step-by-step and use tools when needed.
[USER]
What is 42 * 7 + 15?
[ASSISTANT]
The result is a frozen PromptComponents dataclass containing system_prompt, user_input, and full_prompt. The trace records all three fields so you can see exactly what the model "reads".
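As a rough sketch of Stage 1 — the field names come from the description above, but the builder function and the exact template string are illustrative, not the project's actual code:

```python
from dataclasses import dataclass

# Field names follow the article; the builder and template literal
# are assumptions of this sketch, not LLM-Sim's real source.
@dataclass(frozen=True)
class PromptComponents:
    system_prompt: str
    user_input: str
    full_prompt: str

def build_prompt(user_input: str) -> PromptComponents:
    system_prompt = ("You are a helpful AI assistant. "
                     "You reason step-by-step and use tools when needed.")
    full_prompt = f"[SYSTEM]\n{system_prompt}\n\n[USER]\n{user_input}\n\n[ASSISTANT]\n"
    return PromptComponents(system_prompt, user_input, full_prompt)
```

Freezing the dataclass means later stages can read the prompt but never mutate it, which keeps the trace an honest record of what the model "read".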
Stage 2 — Tokenisation
SimpleTokenizer splits text on the regex \w+|[^\w\s] (word characters and punctuation as separate tokens). Four special tokens occupy the first IDs: [PAD]=0, [UNK]=1, [BOS]=2, [EOS]=3. Real tokens are assigned IDs starting at 4, and the vocabulary grows dynamically — every unseen token is registered on demand.
encode() and decode() are exact inverses. Up to 60 token/ID pairs are written to the trace so you can verify the round-trip.
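A minimal sketch of this behaviour — the split regex and special-token IDs match the description above, while the class shape and the space-joined decode are assumptions:

```python
import re

# Regex and special-token IDs follow the article; the rest is a sketch.
class SimpleTokenizer:
    TOKEN_RE = re.compile(r"\w+|[^\w\s]")

    def __init__(self):
        self.token_to_id = {"[PAD]": 0, "[UNK]": 1, "[BOS]": 2, "[EOS]": 3}
        self.id_to_token = {v: k for k, v in self.token_to_id.items()}

    def encode(self, text: str) -> list[int]:
        ids = []
        for tok in self.TOKEN_RE.findall(text):
            if tok not in self.token_to_id:      # vocabulary grows on demand
                new_id = len(self.token_to_id)   # real tokens start at ID 4
                self.token_to_id[tok] = new_id
                self.id_to_token[new_id] = tok
            ids.append(self.token_to_id[tok])
        return ids

    def decode(self, ids: list[int]) -> str:
        # Inverse at the token level; whitespace is normalised to single spaces.
        return " ".join(self.id_to_token[i] for i in ids)
```

In this sketch the round-trip is exact per token (every ID maps back to the token that produced it), though original whitespace is not preserved.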
Stage 3 — Agent Reasoning
ReasoningAgent.reason() applies two sequential heuristics before any generation happens:
- Math detection — looks for trigger words ("calculate", "what is", "compute", …) or a bare numeric expression matching \d+\s*[\+\-\*\/]\s*\d+. If matched, the arithmetic sub-expression is extracted and routed to CalculatorTool.
- Factual detection — looks for question prefixes ("what is", "explain", "tell me about", …) and extracts the subject noun phrase, which is then passed to FakeSearchTool.
Every decision is appended to reasoning_steps and recorded in the trace as agent_reasoning. The resulting ReasoningResult dataclass carries tool_used, tool_input, tool_result, and the enriched llm_prompt that includes the tool output.
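The two heuristics can be sketched roughly as follows. The regex and the trigger phrases follow the article; the routing function itself and its crude subject extraction are hypothetical:

```python
import re

# MATH_EXPR and FACT_PREFIXES follow the article; route() is a sketch.
MATH_EXPR = re.compile(r"\d+\s*[\+\-\*\/]\s*\d+")
FACT_PREFIXES = ("what is", "explain", "tell me about")

def route(query: str):
    q = query.lower().strip()
    m = MATH_EXPR.search(q)
    if m:  # a bare numeric expression routes to the calculator
        return ("calculator", m.group())
    if q.startswith(FACT_PREFIXES):
        # Very rough subject grab, purely illustrative
        subject = q.split(maxsplit=2)[-1].rstrip("?")
        return ("search", subject)
    return ("none", None)
```

Note the ordering matters: "What is 42 * 7?" matches both heuristics, and checking for arithmetic first is what sends it to the calculator rather than the search tool.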
Stage 4 — Tools
CalculatorTool evaluates arithmetic safely using the Python ast module — eval() is never called:
- Input is validated against a strict character whitelist: ^[\d\s\+\-\*\/\(\)\.]+$
- ast.parse(mode='eval') builds a syntax tree
- A custom _eval_node() walker accepts only ast.Constant, ast.BinOp (+, -, *, /), and ast.UnaryOp (negation/positive sign)
- Result magnitude is bounded to 1e300; division by zero and NaN/Inf return a failed ToolResult with an error message
FakeSearchTool maintains an in-memory knowledge base of ~12 entries covering topics like "llm", "transformer", "tokenization", "softmax", "temperature", "attention", "docker", and more. Lookup normalises the topic to lowercase and scans for substring matches.
Both tools return a uniform ToolResult(tool_name, input, output, success, error) dataclass — making them trivially swappable.
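The whitelist-then-walk pattern can be sketched like this. The ToolResult fields, the whitelist regex, and the bounds come from the description above; the exact shape of the walker is an assumption:

```python
import ast
import math
import re
from dataclasses import dataclass

# ToolResult fields follow the article; everything else is a sketch.
@dataclass
class ToolResult:
    tool_name: str
    input: str
    output: str
    success: bool
    error: str = ""

WHITELIST = re.compile(r"^[\d\s\+\-\*\/\(\)\.]+$")
BIN_OPS = {ast.Add: lambda a, b: a + b, ast.Sub: lambda a, b: a - b,
           ast.Mult: lambda a, b: a * b, ast.Div: lambda a, b: a / b}

def _eval_node(node):
    # Accept only numeric constants, the four binary operators, and unary +/-.
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in BIN_OPS:
        return BIN_OPS[type(node.op)](_eval_node(node.left), _eval_node(node.right))
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, (ast.UAdd, ast.USub)):
        value = _eval_node(node.operand)
        return -value if isinstance(node.op, ast.USub) else value
    raise ValueError(f"disallowed syntax: {type(node).__name__}")

def calculate(expr: str) -> ToolResult:
    try:
        if not WHITELIST.match(expr):
            raise ValueError("invalid characters")
        result = _eval_node(ast.parse(expr, mode="eval").body)
        if not math.isfinite(result) or abs(result) > 1e300:
            raise ValueError("result out of range")
        return ToolResult("calculator", expr, str(float(result)), True)
    except (ValueError, ZeroDivisionError, SyntaxError) as exc:
        return ToolResult("calculator", expr, "", False, str(exc))
```

The walker is a whitelist, not a blacklist: any node type it does not explicitly recognise (names, calls, attribute access) raises, so there is no path to arbitrary code execution.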
Stage 5 — Token-by-Token Generation
This is the most instructive stage. LLMCore.generate() runs the following loop for each target token:
- Draw top_k (default: 6) candidates — the correct target token plus top_k - 1 random words from the vocabulary
- Assign pseudo-random base scores via random.Random(seed=42).uniform(0.2, 3.0)
- Apply a repetition penalty: any token appearing in the last 10 context IDs has its score multiplied by 0.25
- Boost the target token's score by 2.8 (this keeps the demo output coherent — a real sampler would draw from the distribution)
- Scale by temperature T and apply a numerically stable softmax (the maximum scaled score m is subtracted before exponentiation to avoid overflow):

$$\text{softmax}(x_i) = \frac{e^{x_i / T - m}}{\sum_j e^{x_j / T - m}}, \qquad m = \max_j \frac{x_j}{T}$$

- Log the complete sorted candidate table — (token, score, probability) — to the trace as generation_step_N
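The loop above can be condensed into one scoring pass. The constants (seed 42, uniform(0.2, 3.0), the 0.25 penalty, the 2.8 boost) follow the article; the function signature and data shapes are invented for this sketch:

```python
import math
import random

# Constants follow the article; the signature is an illustrative invention.
def step_probabilities(candidates, target, context_ids, vocab_ids, temperature=1.0):
    rng = random.Random(42)
    scores = {tok: rng.uniform(0.2, 3.0) for tok in candidates}
    recent = set(context_ids[-10:])
    for tok in candidates:                    # repetition penalty
        if vocab_ids.get(tok) in recent:
            scores[tok] *= 0.25
    scores[target] *= 2.8                     # deterministic target boost
    logits = [scores[t] / temperature for t in candidates]
    m = max(logits)                           # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return {t: e / total for t, e in zip(candidates, exps)}
```

Raising the temperature flattens the distribution: the boosted target still wins, but by a smaller margin, which is exactly the effect the trace viewer lets you observe step by step.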
Every generation step is visible in the trace viewer as a sortable probability table with the selected token highlighted in green.
Stage 6 — Answer Composition
LLMPipeline._build_target_answer() pre-specifies the answer string that drives the generation loop:
- Calculator result → "The result is {output} ."
- Search result → first 12 words of the knowledge base entry + " ."
- Fallback → "I can help you with that question ."
_compose_answer() combines the tool output (prefixed [Tool: {name}]) with the generated text (prefixed [Generated response]) into the final response returned to the browser.
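Sketched together, the two helpers might look like this — the templates and prefixes follow the article, while the function shapes are assumptions:

```python
# Templates follow the article; the function signatures are assumed.
def build_target_answer(tool_name: str, tool_output: str) -> str:
    if tool_name == "calculator":
        return f"The result is {tool_output} ."
    if tool_name == "search":
        return " ".join(tool_output.split()[:12]) + " ."
    return "I can help you with that question ."

def compose_answer(tool_name: str, tool_output: str, generated: str) -> str:
    return f"[Tool: {tool_name}]\n{tool_output}\n\n[Generated response]\n{generated}"
```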
The Trace JSON
After every pipeline run a full trace is written to data/traces/<session-uuid>.json. A condensed example:
{
"steps": [
{ "name": "input", "data": { "user_query": "What is 42 * 7 + 15?" } },
{ "name": "prompt_construction","data": { "full_prompt": "[SYSTEM]\n..." } },
{ "name": "tokenization", "data": { "token_count": 44, "vocabulary_size_after_encoding": 47 } },
{ "name": "agent_reasoning", "data": { "tool_used": "calculator", "tool_result": { "output": "309.0", "success": true } } },
{ "name": "generation_step_0", "data": {
"candidates": [
{ "token": "The", "score": 2.94, "probability": 0.91 },
{ "token": "Result", "score": 0.87, "probability": 0.04 },
{ "token": "So", "score": 0.61, "probability": 0.02 }
],
"selected_token": "The",
"selected_probability": 0.91
}},
{ "name": "final_answer", "data": { "final_answer": "[Tool: calculator]\n309.0\n\n[Generated response]\nThe result is 309 ." } }
]
}
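Because the trace is plain JSON, it is just as easy to explore outside the browser. A hypothetical helper — the step names and fields are taken from the condensed example above, not from the project's actual source:

```python
import json
from pathlib import Path

# Hypothetical trace helper; "generation_step_" and "selected_token"
# follow the condensed trace example in this article.
def selected_tokens(trace_path: str) -> list[str]:
    steps = json.loads(Path(trace_path).read_text())["steps"]
    return [step["data"]["selected_token"]
            for step in steps
            if step["name"].startswith("generation_step_")]
```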
The Web UI
Three static HTML pages (no external JS or CSS dependencies) share the same dark GitHub-inspired design:
- / — Query interface with a textarea, six example chips, a live stage animation during processing, and animated token pills
- /trace — Execution trace viewer: collapsible step cards, text filter, quick-filter tags (All · generation · agent · tokenization), step-type icons, and a full probability table renderer for generation steps
- /about — Educational article with interactive temperature/softmax visualisation and a Real LLM vs. Simulation comparison table
Running with Docker Compose
# 1. Clone the repository
git clone https://github.com/pernastefano/llm_sim.git
cd llm_sim
# 2. Create your .env from the example
cp config/.env.example config/.env
# 3. Set a real secret key (required — the app refuses to start with the placeholder)
# On Linux/macOS:
python3 -c 'import secrets; print("SECRET_KEY=" + secrets.token_hex(32))' >> config/.env
# 4. (Optional) Match the container UID/GID to your host user
# so that files written to ./data are owned by you, not root
echo "PUID=$(id -u)" >> config/.env
echo "PGID=$(id -g)" >> config/.env
# 5. Start the stack
docker compose up -d
# 6. Open the UI
xdg-open http://localhost:5000 # Linux
open http://localhost:5000 # macOS
To stop it:
docker compose down
Trace files and the audit log are persisted in ./data/ on your host via a bind mount, so they survive container restarts.
The Docker image is also published to GHCR on every push to main and on semver tags, so you can pull it directly without building:
docker pull ghcr.io/pernastefano/llm_sim:latest
What This Is Not
LLM-Sim deliberately omits embeddings, attention, and the neural network entirely — the biggest gaps versus a real LLM. The "probability" assigned to each token candidate is simulated (seeded random + deterministic target boost), not derived from learned weights.
What it does faithfully reproduce is the structural skeleton shared by all modern LLM deployments: prompt formatting, tokenisation, tool-augmented reasoning, temperature-scaled softmax sampling, and the full observability story around it. If you want to understand why a model produces the answer it does, structure is often more illuminating than weights.
Live DEMO: llm-sim.stefanoperna.it/
The full source code is on GitHub: github.com/pernastefano/llm_sim