XF Internal Linker V2 - The Vibe Coding journey begins!

I will come back with results and keep you updated, pinky promise. So far, I have a plan for a XenForo SEO internal-linking web app, Version 2, that runs locally (not that I made it myself; Claude Opus and Gemini Pro made it to help a noob, my goodness, such kindness). Let's see how it goes! My current AI stack is Claude Code, Codex, and Google Antigravity.

I am just a noob who has no idea what I am doing. It's all about the vibes, I guess. I am a dog chasing cars; I just do things. 😂




Also, Claude Opus suggested we use PageRank (from a paper published in 1998 by Sergey Brin and Lawrence Page), along with a pile of other stuff I have no idea how it works. I did a little prior research with Gemini Pro, asking it to explain like I am five, but I still can't understand why or how it will all add up (it's vibes, I guess). Claude suggested, e.g., Django, Angular, PostgreSQL, Docker, Redis, multi-qa-MiniLM-L6-cos-v1 (and GPT-OSS-120B later, once I buy a super cool GPU and a PC with tonnes of RAM, but RAM is expensive these days, innit?), PyTorch, spaCy, and NVIDIA CUDA for sentence transformers, whatever that means.


This was my initial prompt to Claude Opus after asking Gemini Pro about XenForo SEO internal linking. It was written by Gemini Pro after some back and forth, partly because my grammar is so bad; English isn't my first language, after all, but at least I try, with help from Grammarly.


Gemini Pro's prompt telling Claude Opus what to do. Call it "prompt chaining," if you will: asking one LLM to ask another LLM to complete a task.

Gemini's revised prompt:


Code:
I am incredibly excited to begin building this project using Django, Angular, and PostgreSQL. We will need to set up a new repository called XF Internal Linker V2 and establish a solid master plan. Because I am a beginner, I need step-by-step guidance utilizing purely GUI tools, so I never have to touch the command line. Please guide me with kindness while remaining brutally honest about maintaining excellent coding practices. I have already completed dozens of UI slices, designing the interface to mirror the layout and colors of Google Search Console, though features like websockets remain unfinished. I also uploaded a file to my server that must be synced with our new stack.

We must create an enterprise-grade, fast, and highly efficient architecture that runs smoothly on my current PC with 16 GB of RAM. The design must be forward-thinking and highly scalable right from the start, keeping in mind my plans to purchase a much more powerful computer to run heavy AI models like gpt-oss-120b. The application needs precise resource management, meaning it should completely shut down when not actively running. It must support two execution modes: a balanced mode that relies solely on the CPU and a high-performance mode that utilizes both the CPU and GPU. Portability is also crucial, so the app should be designed for easy migration via Docker to other local machines or web servers. Furthermore, the architecture should support modular plugins to ensure backward compatibility and be built to accommodate Elasticsearch in the future.

The primary purpose of this read-only application is to scan content and generate internal linking suggestions without directly modifying my XenForo or WordPress databases. The contextual linking strategy should scan only the first 600 words of a thread and suggest a maximum of three links per post. These suggestions should lean toward long-tail anchor text, utilizing an anchor policy engine that bans generic anchors, limits exact-match reuse, and prefers natural variants while clustering topics to avoid cannibalization. The app must seamlessly integrate with the XenForo REST API, WordPress REST API, and XenForo Media Gallery. It should heavily utilize XenForo webhooks to trigger instant ranking pipelines upon new thread creation, auto-verify applied links, and flag stale suggestions if a post is deleted. To ensure reviewers do not act on outdated content, the app must run live stale checks and duplicate checks before any approval.

The frontend must be as flexible as WordPress, allowing complete customization down to every specification through a theme panel. The Django admin side should be exceptionally user-friendly, offering organized content categories and intuitive configuration pages. For the actual review process, the app needs a Zen-like focus mode that displays one suggestion at a time with large keyboard-friendly controls. Reviewers should see a side-by-side live diff preview, pulling the latest raw post message from XenForo and showing the proposed final edit locally before they manually paste and save it elsewhere. We also need an item-level explorer that explains why no suggestion was made for specific content, along with excellent error reporting and live warnings for canonical URL issues or redirect failures. Redis should be implemented to ensure lightning-fast, real-time results, carefully caching only what is necessary and respecting strict age limits.

Comprehensive analytics and tracking are vital to evaluating the success of our internal linking strategy. The application should integrate seamlessly with the Google Search Console API and GA4 to pull actual search performance data without exceeding rate limits. We need visual tools like a unified PageRank graph spanning both the forum and the blog, a two-dimensional or three-dimensional link graph to identify orphan content and highly linked hubs, and a link density heatmap to prevent link fatigue. The dashboard should feature materialized views for underlinked content and review backlogs, alongside reviewer scorecards that track approval rates and rejection reasons. Furthermore, the app must provide actionable insights, such as an SEO gap analysis for missing keywords and before-and-after impact reports built with beautiful D3 visualizations to track ranking, views, and engagement changes following manual link applications.

Finally, we must maintain impeccable documentation and strict development workflows. A dedicated AI-context.md file must be utilized and continuously updated whenever tools like Codex, Antigravity, or Claude Code are involved. If it is safe to do so, commits and pushes to GitHub should happen automatically. The application should also provide regular prompts to check for errors, run updates, handle dependencies gracefully, and request new features, ensuring the entire ecosystem remains up-to-date, adaptive, and dynamic.
 
Claude Opus replied to Gemini's prompt and said:


Code:
FIRST ACTIONS (mandatory before writing a single line of code):
1. Read AI-CONTEXT.md in this repo
2. Read PROMPTS.md in this repo
3. Confirm you have read both before proceeding

---

PROJECT: XenForo Internal Linker V2
PURPOSE: A complete rebuild of a working V1 app into a professional,
enterprise-grade, Docker-first, GUI-driven internal linking tool for
XenForo forums — with Angular frontend, Django backend, and PostgreSQL database.

V1 REFERENCE (read-only, do not modify):
Location: ../xf-internal-linker
Status: Feature-complete, 485 tests passing
Stack: Flask + SQLite + spaCy + sentence-transformers + HTMX + Alpine.js
Everything that works in V1 must work in V2. The ML services are the heart
of the app and must be migrated with minimal changes.

---

TECH STACK (use these exact versions — always pick latest stable if a newer
patch version exists within the same major):

Backend:
- Python 3.12 (NOT 3.13 — ML libraries lag behind latest Python)
- Django 5.2 LTS (long-term support until April 2028 — do not use 5.0 or 5.1)
- Django REST Framework 3.15+
- Django Channels 4.1+ (WebSockets for real-time job progress)
- Celery 5.4+ with Redis broker (background ML jobs)
- Celery Beat (scheduled tasks like auto-resync)
- django-cors-headers (Angular ↔ Django communication)
- psycopg 3.x (PostgreSQL driver — use psycopg3, not psycopg2)
- django-environ (environment variable management)
- Pillow (image handling)
- paramiko (SSH sync)

ML / AI layer (migrated from V1, unchanged logic):
- spaCy 3.8+ with en_core_web_sm model
- sentence-transformers 3.x
- PyTorch 2.3+ (CPU default, CUDA optional)
- numpy 1.x (NOT 2.x — compatibility with sentence-transformers)
- scikit-learn (PageRank, scoring utilities)
- pgvector-python (store and search embeddings in PostgreSQL)

Frontend:
- Node.js 22 LTS
- Angular 19+ (latest stable — check angular.dev for current version)
- Angular Material 19+ (matching Angular version)
- Angular CDK 19+
- D3.js 7+ (link graph visualization)
- Socket.io-client or native WebSocket (real-time job progress)
- Chart.js or ngx-charts (analytics dashboards)

Database:
- PostgreSQL 17+ (latest stable)
- pgvector 0.7+ extension (embedding storage and similarity search)

Infrastructure:
- Redis 7.4+
- Docker Desktop (latest stable)
- Docker Compose v2 syntax (use "services:" not "version:")
- Nginx (Angular production builds)

---

ABOUT THE DEVELOPER (read carefully — this shapes everything):
- Complete beginner / noob vibe coder
- Uses GUI tools only: VS Code, GitHub Desktop, Docker Desktop, File Explorer
- Never touches the command line unless given an exact command to copy-paste
- AI assistants (Claude Code, Codex, Antigravity) do all the heavy lifting
- PC: Windows 11, 16 GB RAM, ~27 GB free disk space
- Plans to upgrade to a powerful PC later for large AI models (GPT-OSS-120B scale)
- Wants everything committed and pushed to GitHub automatically when safe
- Needs kind, clear explanations at every step
- Wants the app to look professional and be easy to use without technical knowledge

---

NON-NEGOTIABLE PRODUCT RULES:
- The app ONLY suggests links — it NEVER writes to XenForo or WordPress databases
- XenForo REST API and WordPress REST API: READ-ONLY access only
- Maximum 3 internal link suggestions per thread (enforced in pipeline)
- Suggestions scan only the FIRST 600 WORDS of a thread body
- Anchor text policy: prefer long-tail anchors, allow 1-2 word anchors,
  ban generic anchors (click here, read more, this post, etc.)
- Cap exact-match anchor reuse across the forum
- Manual human review ALWAYS stays in the loop before anything is applied
- App fully shuts down when not running (Docker Compose down = nothing runs)
- Two performance modes:
  * Balanced Mode = CPU only (default, safe for any PC)
  * High Performance Mode = CPU + GPU (CUDA, for when GPU is available)
- All data reads from XenForo via API or JSONL export — never direct DB access
- The app tracks and documents suggestions but the USER manually applies them

---

ARCHITECTURE OVERVIEW:

┌─────────────────────────────────────────────┐
│  Angular 19 Frontend (port 4200 dev)        │
│  Angular Material + GSC color theme         │
│  D3.js link graph + Chart.js analytics      │
└──────────────────┬──────────────────────────┘
                   │ REST API (JSON)
                   │ WebSockets (real-time)
┌──────────────────▼──────────────────────────┐
│  Django 5.2 LTS Backend (port 8000)         │
│  Django REST Framework API                  │
│  Django Channels (WebSockets)               │
│  Celery workers (background ML jobs)        │
│  Django Admin (customized, pretty)          │
└──────────────────┬──────────────────────────┘
                   │
┌──────────────────▼──────────────────────────┐
│  Python ML Services (migrated from V1)      │
│  spaCy + sentence-transformers + PyTorch    │
│  PageRank + Velocity scoring                │
│  Anchor extraction + pipeline               │
└──────────────────┬──────────────────────────┘
                   │
        ┌──────────┴──────────┐
        ▼                     ▼
┌───────────────┐    ┌────────────────┐
│ PostgreSQL 17 │    │   Redis 7.4    │
│ + pgvector    │    │ Cache + Queue  │
│ All app data  │    │ Job broker     │
│ + embeddings  │    │ WebSocket      │
└───────────────┘    └────────────────┘

Docker Compose starts/stops all 6 services with one command.

---

DJANGO APPS STRUCTURE (scaffold these apps from day one):

backend/
├── config/              # Django project settings (base/dev/prod)
├── apps/
│   ├── core/            # Shared models, base classes, utilities 
│   ├── content/         # content_items, posts, sentences, scope_items
│   ├── suggestions/     # suggestions, pipeline_runs, anchor policy
│   ├── pipeline/        # Celery tasks: import, embed, rank, sync
│   ├── analytics/       # GSC integration, GA4 integration, impact reports
│   ├── webhooks/        # XenForo webhooks (new thread, edit, delete)
│   ├── audit/           # Full audit trail, reviewer scorecards
│   ├── graph/           # Link graph data, PageRank, orphan detection
│   ├── plugins/         # Plugin registry, on/off toggle system
│   └── api/             # DRF routers, serializers, viewsets
├── services/            # ML services migrated from V1 (minimal changes)
│   ├── pipeline.py      # Retrieval + ranking (from V1)
│   ├── distiller.py     # Destination distillation (from V1)
│   ├── embeddings.py    # sentence-transformers wrapper (from V1)
│   ├── link_parser.py   # BBCode link extraction (from V1)
│   ├── sentence_splitter.py  # spaCy splitter (from V1)
│   ├── sync.py          # SSH sync service (from V1)
│   ├── ranker.py        # Scoring helpers (from V1)
│   └── anchor_extractor.py   # Anchor extraction (from V1)
└── manage.py

frontend/                # Angular 19 app (separate folder)
├── src/
│   ├── app/
│   │   ├── core/        # Auth, HTTP interceptors, WebSocket service
│   │   ├── shared/      # Reusable components, pipes, directives
│   │   ├── dashboard/   # Main dashboard page
│   │   ├── review/      # Suggestion review (focus mode + list mode)
│   │   ├── graph/       # D3.js link graph visualization
│   │   ├── analytics/   # GSC/GA4 stats, before/after reports
│   │   ├── settings/    # App settings + theme customizer
│   │   ├── jobs/        # Job management, progress, history
│   │   └── admin/       # Custom admin supplement
│   └── styles/          # GSC theme variables, global styles
└── angular.json

---

UI/UX REQUIREMENTS:

Color Theme (Google Search Console palette — exact hex values):
Primary blues:   #4285f4, #0b57d0, #1a73e8
Light blue:      #e8f0fe, #d3e3fd, #c2e7ff
Dark blue:       #001d35, #004a77
Success green:   #0f9d58, #0b8043, #146c2e
Light green:     #b7e1cd
Error red:       #d50000, #db4437, #b3261e
Light red:       #f4c7c3, #f9dedc
Warning yellow:  #f4b400, #f09300
Light yellow:    #fce8b2
Primary text:    rgba(0,0,0,0.87), #1f1f1f, #202124
Secondary text:  rgba(0,0,0,0.54), #5f6368
Disabled text:   rgba(0,0,0,0.38), #9e9e9e
Borders:         #e0e0e0, #c4c7c5
Backgrounds:     #ffffff, #fafafa, #f0f4f9, #f8fafd
Inverse:         #303030, #131314

Django Admin requirements:
- Clean, professional, easy to use without coding knowledge
- Custom branding with GSC colors
- Category-organized settings (general, ML/AI, sync, performance, API keys)
- Friendly labels and help text on every field
- No raw database IDs visible where avoidable
- Grouped sections, not one long list
- Dashboard overview on admin home

Theme Customizer (WordPress Appearance panel equivalent):
- Change primary color, accent color, font size, sidebar width
- Light/dark mode toggle
- Compact/comfortable layout density
- Save as named themes
- Live preview before applying

---

FEATURE LIST (implement across phases — do NOT do all at once):

Phase 0 — Scaffolding (TODAY):
[ ] Django project + all apps created
[ ] Docker Compose: postgres+pgvector, redis, django, celery, celery-beat, nginx
[ ] Dockerfiles for backend
[ ] Angular 19 project with Angular Material scaffolding
[ ] .env.example with all environment variables documented
[ ] Django settings: base.py, development.py, production.py
[ ] requirements.txt + requirements-dev.txt
[ ] package.json for frontend
[ ] .gitignore (node_modules, venv, .env, *.npy, __pycache__)
[ ] AI-CONTEXT.md updated with V2 architecture
[ ] Initial commit pushed to GitHub

Phase 1 — Django foundation:
[ ] PostgreSQL models for all tables (migrated from V1 SQLite schema)
[ ] pgvector fields on content_items for embeddings
[ ] Django admin registered and customized
[ ] DRF serializers and viewsets for core models
[ ] Redis + Celery connected and working
[ ] Django Channels WebSocket layer

Phase 2 — ML services migration:
[ ] Copy all V1 services/ files with minimal changes
[ ] Adapt file paths and DB calls for Django/PostgreSQL
[ ] Celery tasks wrapping pipeline, embed, sync jobs
[ ] CPU/GPU mode switching (Balanced vs High Performance)
[ ] Two Celery workers: one CPU-bound, one GPU-bound

Phase 3 — Data migration:
[ ] migration script: V1 linker.db → PostgreSQL
[ ] Verify all content, suggestions, history migrated correctly
[ ] pgvector: migrate .npy embedding files to vector columns

Phase 4 — Angular frontend core:
[ ] Routing, auth guard, HTTP interceptor
[ ] Dashboard with job status, recent suggestions, key metrics
[ ] Settings page with theme customizer
[ ] Jobs page with real-time Celery progress via WebSocket

Phase 5 — Suggestion review UI:
[ ] List mode: paginated suggestion cards with filters
[ ] Focus Mode: one suggestion at a time, keyboard shortcuts
[ ] Side-by-side diff preview (before/after post content)
[ ] Live duplicate check via XenForo API before approving
[ ] Batch approve/reject/skip actions

Phase 6 — XenForo webhooks:
[ ] Webhook receiver endpoint
[ ] Instant suggestion trigger on new thread
[ ] Auto-verify on post edit
[ ] Stale detection on post delete

Phase 7 — Link graph:
[ ] D3.js 2D visualization (hubs vs orphans)
[ ] 3D mode toggle
[ ] Topic silo visualization
[ ] Orphan content identification
[ ] Link density heatmap

Phase 8 — GSC + GA4 integration:
[ ] GSC OAuth2 flow (rate-limit safe)
[ ] GA4 connection
[ ] Before/after impact reports (views, clicks, ranking changes)
[ ] Top-performing and worst-performing suggestions dashboard

Phase 9 — Advanced features:
[ ] Plugin system (on/off without breaking changes, backward compatible)
[ ] WordPress cross-linking support (subdomain.example.com REST API)
[ ] Redirect/404 monitor for suggestion destinations
[ ] Anchor policy engine (full ruleset)
[ ] Elasticsearch-ready schema design
[ ] Reviewer scorecards
[ ] Full audit trail
[ ] "Why no suggestion?" explorer
[ ] SEO gap analysis (missing keyword alerts)

Phase 10 — Polish + production-ready:
[ ] Error reporting page (friendly, actionable)
[ ] Tooltip and inline help on every UI element
[ ] Good documentation prompts (errors, upgrades, new features)
[ ] Docker migration guide (move to new PC in under 30 minutes)
[ ] Web server deployment guide (DigitalOcean, Railway, etc.)

---

REDIS CACHING STRATEGY:
- Cache: pipeline run results, PageRank scores, distilled text (TTL: 1 hour)
- Cache: XenForo API responses for content that changes rarely (TTL: 15 min)
- Cache: GSC/GA4 API responses (TTL: 6 hours — respect rate limits)
- NEVER cache: live job progress, suggestion review state, user settings
- Cache keys must include a version prefix for easy invalidation
- Use Redis Pub/Sub for WebSocket job progress events
- Django cache framework with django-redis backend

---

XENFORO WEBHOOK EVENTS TO HANDLE:
- thread_create → trigger instant suggestion pipeline for new thread
- thread_edit / post_edit → re-check if suggestion still valid, auto-verify
- thread_delete / post_delete → mark related suggestions as stale immediately
- resource_create → trigger suggestion pipeline for new resource

---

DOCUMENTATION RULES (enforce in every session):
- Read AI-CONTEXT.md before every session — no exceptions
- Update AI-CONTEXT.md at the end of every session — no exceptions
- Every Django model field needs a help_text= string
- Every API endpoint needs a docstring
- Every Angular component needs a JSDoc comment block
- Every settings field in the UI needs a tooltip
- PROMPTS.md must contain prompts for: error checking, upgrades,
  new feature requests, and migration commands

---

GIT RULES:
- Commit after every completed phase
- Push to GitHub automatically if no conflicts
- Never force push
- Never commit: .env, node_modules/, venv/, *.npy, __pycache__/,
  linker.db, *.log, docker volumes
- Commit message format: "phase-X: brief description of what was built"
- If git is unsafe (conflicts, unexpected files), stop and explain — never push blind

---

PORTABILITY REQUIREMENTS (Docker-first):
- git clone + docker compose up = fully working app in under 20 minutes
- All secrets in .env file (never hardcoded)
- All file paths relative, never absolute
- Database connection via environment variable only
- Designed to run on: Windows laptop → powerful desktop PC → web server
- GPU support: optional, detected automatically, graceful CPU fallback

---

SCALABILITY NOTES (for future powerful PC + large AI models):
- Celery worker count configurable via .env (CELERY_WORKER_CONCURRENCY)
- GPU batch size configurable (EMBEDDING_BATCH_SIZE)
- pgvector HNSW indexes ready for millions of embeddings
- Schema designed to accommodate Elasticsearch later (no circular dependencies)
- The plugin system allows adding new AI models without breaking existing features
- Stateless Django workers (all state in PostgreSQL + Redis)

---

START TODAY — PHASE 0:
Do the following in order. Explain each step. Tell me when I need to
click something in Docker Desktop, GitHub Desktop, or VS Code:

1. Scaffold the full Django project structure with all apps listed above
2. Create docker-compose.yml with all 6 services
3. Create a Dockerfile for the Django backend
4. Create an Angular 19 project with Angular Material
5. Create .env.example with every environment variable documented
6. Create Django settings: base.py, development.py, production.py
7. Create requirements.txt and requirements-dev.txt
8. Create a .gitignore that excludes all the right things
9. Update AI-CONTEXT.md with the full V2 architecture
10. Commit and push everything to GitHub with the message "phase-0: project scaffolding."

After each step, tell me: what was just created, why it matters,
and what to click if I need to do anything in a GUI tool.


Anyway, here goes nothing. I will be back, pinky promise. 😁

Let the vibe coding games begin! On your marks....
 
It's shaping up. Codex did some math based on the Google and Bing patents.



Markdown (GitHub flavored):
# FR-006 - Weighted Link Graph / Reasonable Surfer Scoring

## Confirmation

- Active phase confirmed: `Phase 9 / FR-006 - Weighted Link Graph / Reasonable Surfer Scoring` is the exact next target in `AI-CONTEXT.md`.
- Backlog confirmed: `FR-006` is a real pending request in `FEATURE-REQUESTS.md`. It is not the `FR-016` template placeholder.
- Repo confirmed: no existing `weighted authority`, `weighted pagerank`, or `reasonable surfer` implementation is present in the codebase today.

## Current Repo Map

### Existing-link extraction and persistence

- `backend/apps/pipeline/services/link_parser.py`
  - Parses BBCode links, HTML anchors, and bare URLs.
  - Resolves internal targets and returns `LinkEdge`.
- `backend/apps/graph/services/graph_sync.py`
  - Reconciles parsed edges into `ExistingLink`.
- `backend/apps/graph/models.py`
  - Stores `ExistingLink(from_content_item, to_content_item, anchor_text, discovered_at)`.
- `backend/apps/pipeline/tasks.py`
  - Calls `extract_internal_links(...)`, `sync_existing_links(...)`, and `refresh_existing_links()` during import/sync.

### Graph or edge models and migrations

- `backend/apps/graph/models.py`
- `backend/apps/graph/migrations/0001_initial.py`
- `backend/apps/graph/admin.py`

### Authority or PageRank computation

- `backend/apps/pipeline/services/pagerank.py`
  - Loads `ExistingLink` into a sparse matrix.
  - Uses uniform outbound weights `1 / outdegree(source)`.
  - Persists into `ContentItem.pagerank_score`.
- `backend/apps/pipeline/tasks.py`
  - Runs `run_pagerank()` after sync/import.
- `backend/apps/content/models.py`
  - Stores `pagerank_score` on `ContentItem`.
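
For intuition, the uniform-weight computation described above can be sketched as a plain power iteration over an edge list. This is a minimal illustrative sketch, not the repo's actual `pagerank.py` (which uses a sparse matrix); function and variable names are assumptions:

```python
# Illustrative sketch of uniform-weight PageRank: every outgoing link from a
# source gets probability 1 / outdegree(source), matching the description above.

def pagerank(edges, num_nodes, damping=0.85, iterations=50):
    """Power iteration over a directed edge list [(src, dst), ...]."""
    outdegree = [0] * num_nodes
    for src, _ in edges:
        outdegree[src] += 1

    rank = [1.0 / num_nodes] * num_nodes
    for _ in range(iterations):
        # Teleport term keeps the chain irreducible.
        nxt = [(1.0 - damping) / num_nodes] * num_nodes
        # Redistribute mass from dangling nodes (no outgoing links).
        dangling = sum(r for r, d in zip(rank, outdegree) if d == 0)
        nxt = [x + damping * dangling / num_nodes for x in nxt]
        for src, dst in edges:
            # Uniform outbound weight: 1 / outdegree(source).
            nxt[dst] += damping * rank[src] / outdegree[src]
        rank = nxt
    return rank
```

With all edges weighted uniformly, a page with more inbound links accumulates more rank, and total mass stays a probability distribution.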

### Ranking feature assembly

- `backend/apps/pipeline/services/pipeline.py`
  - Loads `ContentRecord`.
  - Calls `score_destination_matches(...)`.
  - Persists `Suggestion` records.
- `backend/apps/pipeline/services/ranker.py`
  - Current final score uses semantic + keyword + node affinity + quality + silo.
  - Current `score_quality` is based on the host page's normalized `pagerank_score`.
  - Current destination `score_pagerank` and `score_velocity` are persisted for review, but are not part of `score_final`.

### Diagnostics or explanations returned to review

- `backend/apps/suggestions/models.py`
  - `Suggestion` stores score breakdown fields.
  - `PipelineDiagnostic` stores skip reasons.
- `backend/apps/suggestions/serializers.py`
  - `SuggestionDetailSerializer` returns score fields to review.
- `backend/apps/suggestions/admin.py`
  - Django admin exposes suggestion score breakdown.
- `frontend/src/app/review/suggestion.service.ts`
  - Frontend `SuggestionDetail` type includes `score_pagerank`.
- `frontend/src/app/review/suggestion-detail-dialog.component.html`
  - Review dialog shows score bars for semantic, keyword, node affinity, quality, PageRank, and velocity.

### Settings persistence and settings API

- `backend/apps/core/models.py`
  - `AppSetting` is the typed key/value settings store.
- `backend/apps/core/views.py`
  - Current settings APIs live here.
  - Current shipped endpoints are appearance, silos, and WordPress.
- `backend/apps/api/urls.py`
  - Wires `/api/settings/...` routes.
- `frontend/src/app/settings/silo-settings.service.ts`
  - Frontend service for current settings APIs.
- `frontend/src/app/settings/settings.component.ts`
- `frontend/src/app/settings/settings.component.html`

### Serializers, admin, or UI that already expose authority-like values

- `backend/apps/content/models.py`
  - `ContentItem.pagerank_score`
- `backend/apps/content/serializers.py`
  - Exposes `pagerank_score` in content list/detail serializers.
- `backend/apps/content/views.py`
  - Allows ordering by `pagerank_score`.
- `backend/apps/content/admin.py`
  - Shows `pagerank_score` in Django admin.
- `backend/apps/suggestions/models.py`
  - `Suggestion.score_pagerank`
- `backend/apps/suggestions/serializers.py`
  - Returns `score_pagerank` in suggestion detail.
- `backend/apps/suggestions/admin.py`
  - Shows `score_pagerank` in admin.
- `frontend/src/app/review/suggestion-detail-dialog.component.html`
  - Shows `PageRank` in review UI.

## Workflow Drift / Doc Mismatch Found During Inspection

- `AI-CONTEXT.md` and `FEATURE-REQUESTS.md` correctly point to `Phase 9 / FR-006` as the next real target.
- Important code-vs-doc drift: the continuity docs say hybrid scoring includes PageRank + velocity, but live `backend/apps/pipeline/services/ranker.py` does not use destination PageRank or velocity in `score_final` today.
- `PipelineRun.config_snapshot` exists in `backend/apps/suggestions/models.py`, but `backend/apps/suggestions/views.py` does not populate it when a run starts.
- `backend/apps/pipeline/services/link_parser.py` does not preserve true mixed-syntax link order today. It collects BBCode matches, then HTML matches, then bare URLs. That is fine for plain existence checks, but it is not safe for position-based weighting.
- `backend/apps/graph/services/graph_sync.py` currently creates and deletes edges, but does not update non-key edge fields for edges that still exist. FR-006 needs update-in-place behavior because edge features can change without the source/destination pair changing.
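
A position-safe extractor would sort all matches by character offset before assigning ordinals, instead of concatenating the regex families. A minimal sketch with simplified stand-in patterns (these are not the repo's actual regexes, and the function name is hypothetical):

```python
import re

# Simplified stand-in patterns for the three extraction methods.
BBCODE = re.compile(r"\[URL=(?P<url>[^\]]+)\](?P<anchor>.*?)\[/URL\]", re.I | re.S)
HTML = re.compile(r'<a\s+href="(?P<url>[^"]+)"[^>]*>(?P<anchor>.*?)</a>', re.I | re.S)
# Naive bare-URL pattern; the lookbehind skips URLs already inside [URL=...] or href="...".
BARE = re.compile(r'(?<![\"=\]])(?P<url>https?://\S+)')

def links_in_document_order(text):
    """Return (link_ordinal, extraction_method, url) tuples in true document order."""
    matches = []
    for method, pattern in (("bbcode_anchor", BBCODE),
                            ("html_anchor", HTML),
                            ("bare_url", BARE)):
        for m in pattern.finditer(text):
            matches.append((m.start(), method, m.group("url")))
    # Sorting by match span, not regex family, is what makes ordinals safe
    # for position-based weighting.
    matches.sort(key=lambda t: t[0])
    return [(ordinal, method, url)
            for ordinal, (_, method, url) in enumerate(matches)]
```

The key point is `m.start()`: ordinals come from where each link appears in the source text, so mixed BBCode/HTML/bare links interleave correctly.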

## Source Summary

### Source documents actually read

- [US7716225B1](https://patents.google.com/patent/US7716225B1/en)
- [US6285999B1](https://patents.google.com/patent/US6285999B1/en)
- [Brin and Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine" (1998)](https://snap.stanford.edu/class/cs224w-readings/Brin98Anatomy.pdf)

### Concepts used from the sources

- From `US7716225B1`:
  - a link can be assigned a weight tied to the probability that a user would choose it;
  - the weight can depend on link features, source-document features, and target-document features;
  - example feature families include link position, visual prominence, source link count, document type, and surrounding text.
- From `US6285999B1` and Brin/Page:
  - PageRank is a probability distribution over pages;
  - it is computed from a normalized link matrix by iterative power updates;
  - the random-surfer view is the right baseline mental model;
  - uniform outbound probability is the current standard case.

### What was clear

- A weighted link graph is meant to bias transitions by link-follow likelihood.
- The feature is separate from standard PageRank, not a replacement for it.
- Position and prominence are squarely inside scope for a reasonable-surfer style graph.

### What remained ambiguous

- `US7716225B1` describes weights as link-selection probabilities, but the patent's equation is written as a weight multiplied into the classic `1/outdegree` form.
- The patent allows a learned weighting model, but this repo has no click logs, no rendered DOM/CSS feature pipeline, and no user-behavior training data.
- The patent mentions source update frequency and surrounding-text topical relevance, but using those directly here would collide with later planned phases for freshness and phrase/field relevance.

## Math-Fidelity Note

### Directly supported by the sources

- Use the existing directed link graph.
- Assign per-edge weights that represent relative follow-likelihood.
- Run a PageRank-style iteration over those weighted edges.
- Keep standard `pagerank_score` as-is.

### Adapted for this repo

- No learned model is used.
  - Reason: this repo has no primary behavior data to train one safely.
- No font/color/rendered-DOM features are used.
  - Reason: the current extractor only sees BBCode/HTML strings, not computed layout.
- No freshness, update-frequency, phrase matching, field weighting, siloing, click-distance, reranking, clustering, or diversity features are used.
  - Reason: those belong to later phases or existing separate logic.
- Raw edge scores are normalized per source page into outbound probabilities before iteration.
  - Reason: this preserves a valid probability distribution, gives exact uniform-weight parity with today's `pagerank.py`, and is the narrowest low-regression interpretation of the patent.

### Alternatives considered

1. Patent-literal raw multiplier: `w_ij / outdegree(i)`
2. Per-source normalized transition probability: `P(i->j) = s_ij / sum_k s_ik`
3. Learned click-choice model

### Chosen interpretation

- Choose option 2.
- Treat persisted edge features as inputs to a positive raw score `s_ij`.
- Normalize `s_ij` across each source page's outgoing links to get the weighted transition matrix.
- This is the safest interpretation because:
  - it stays faithful to the "probability a user selects the link" language;
  - it preserves PageRank mass conservation;
  - it converges in the same sparse-matrix shape as the current implementation;
  - it falls back exactly to current PageRank when all outgoing links from every source have equal raw score.
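
The chosen normalization can be sketched in a few lines; `raw_scores` and the function name are illustrative, not the repo's API:

```python
from collections import defaultdict

def weighted_transitions(raw_scores):
    """Normalize positive raw edge scores s_ij per source page into
    transition probabilities P(i -> j) = s_ij / sum_k s_ik.
    `raw_scores` maps (source, destination) -> s_ij > 0."""
    totals = defaultdict(float)
    for (src, _), s in raw_scores.items():
        totals[src] += s
    return {(src, dst): s / totals[src]
            for (src, dst), s in raw_scores.items()}
```

When every outgoing link from a source has the same raw score, each probability collapses to `1 / outdegree(source)`, which is the exact-parity fallback claimed above.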

## Problem Definition

Simple version first.

Right now every outgoing internal link from a page counts the same. FR-006 adds a second authority score where links that look more like real editorial links count more, and links that look more like boilerplate count less.

Technical definition:

- compute a new destination metric called `weighted_pagerank_score`;
- base it on the same `ContentItem` node set and `ExistingLink` edge set already used by `pagerank_score`;
- persist stable edge features needed to derive reasonable-surfer weights;
- keep `pagerank_score` untouched;
- expose enough settings and review data to compare standard vs weighted authority safely;
- keep ranking impact opt-in by default to avoid silent regressions.

## Chosen Weighted-Link Interpretation

### Non-goals and phase boundary

FR-006 must stay separate from:

- freshness or growth signals (`FR-007`)
- phrase-based destination/context relevance (`FR-008`)
- learned anchor vocabulary (`FR-009`)
- field weighting (`FR-011`)
- structural click-distance priors (`FR-012`)
- reranking, clustering, and diversity (`FR-013` to `FR-015`)
- current silo logic (`FR-005`)

### Edge identity and dedup policy

- Weighted authority uses one logical edge per `source -> destination`.
- If a source links to the same destination more than once, keep the earliest resolved occurrence in true document order as the weighted representative edge.
- Keep current `ExistingLink` storage as the edge table, but the weighted graph loader must defensively coalesce accidental duplicate rows by `from_content_item` + `to_content_item` if any exist.

### Persisted edge features

Extend `ExistingLink` with these stable, extraction-time fields:

- `extraction_method`
  - choices: `bbcode_anchor`, `html_anchor`, `bare_url`
- `link_ordinal`
  - zero-based order of this resolved internal link inside the source content, after sorting in true document order
- `source_internal_link_count`
  - total number of resolved internal links on the source content after dedup policy
- `context_class`
  - choices: `contextual`, `weak_context`, `isolated`

Existing persisted field reused directly:

- `anchor_text`
  - already stored today
  - blank means "no visible anchor text was captured"

### Edge feature definitions

- `extraction_method`
  - `bbcode_anchor`: link came from `[URL=...]anchor[/URL]`
  - `html_anchor`: link came from `<a href="...">anchor</a>`
  - `bare_url`: link came from a plain naked URL in text
- `link_ordinal`
  - computed from ordered match spans, not regex-family order
- `source_internal_link_count`
  - count of resolved internal links on the source content after same-destination dedup
- `context_class`
  - `contextual`: stripped local text window has normal prose tokens on both sides of the link
  - `weak_context`: prose tokens appear on only one side
  - `isolated`: no prose tokens on either side, or the link appears standalone/list-like
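The three-way `context_class` split can be illustrated with a small classifier over the stripped text around a link. This is only a sketch: `classify_context`, the 80-character window, and the "at least two alphabetic tokens" prose test are illustrative choices, not the extractor's actual rules.

```python
import re

def classify_context(text: str, link_start: int, link_end: int, window: int = 80) -> str:
    """Classify a link's local context as contextual / weak_context / isolated.

    `window` (chars of surrounding text inspected) is an illustrative choice.
    """
    before = text[max(0, link_start - window):link_start]
    after = text[link_end:link_end + window]

    def has_prose(segment: str) -> bool:
        # "Prose" here means at least two alphabetic word tokens.
        return len(re.findall(r"[A-Za-z]{2,}", segment)) >= 2

    left, right = has_prose(before), has_prose(after)
    if left and right:
        return "contextual"   # normal prose on both sides
    if left or right:
        return "weak_context"  # prose on only one side
    return "isolated"          # standalone / list-like
```

A link in the middle of a sentence classifies as `contextual`, a trailing "see: URL" style link as `weak_context`, and a link on its own line as `isolated`.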

### Chosen raw edge scoring function

Let `anchor_blank(e)` be true when `anchor_text.strip()` is empty.

Let:

- `kind_factor(e)` =
  - `bare_url_factor` if `extraction_method == "bare_url"`
  - `empty_anchor_factor` if `anchor_blank(e)` is true for a non-bare anchor
  - `1.0` otherwise

- `position_ratio(e)` =
  - `0.0` if `source_internal_link_count <= 1`
  - otherwise `link_ordinal / (source_internal_link_count - 1)`

- `position_factor(e)` =
  - `1.0 - position_bias * position_ratio(e)`

- `context_factor(e)` =
  - `1.0` for `contextual`
  - `weak_context_factor` for `weak_context`
  - `isolated_context_factor` for `isolated`
  - `1.0` if the feature is missing

Raw edge score:

`raw_edge_score(e) = max(1e-6, kind_factor(e) * position_factor(e) * context_factor(e))`

Notes:

- This is intentionally small and conservative.
- It uses only features that are both source-backed in spirit and realistically extractable in this repo.
- `source_internal_link_count` is used only to derive relative position.
  - It is not used as its own multiplicative factor because any source-constant factor would cancel during per-source normalization.

### Outbound normalization method

For each source node `i` with active outgoing edges to non-deleted destinations:

- compute `raw_edge_score(i, j)` for every outgoing edge;
- if all scores are finite and the row sum is greater than zero:
  - `P_w(i -> j) = raw_edge_score(i, j) / sum_k raw_edge_score(i, k)`
- otherwise:
  - fallback to uniform weighting: `P_w(i -> j) = 1 / outdegree(i)`

This normalization is required.

It gives two good properties:

- the outbound probabilities from each source sum to `1`;
- if every outgoing edge on a source page has equal raw score, the weighted graph collapses back to standard PageRank behavior for that source.
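A minimal sketch of the per-source normalization with the uniform fallback (the function name is illustrative):

```python
import math

def normalize_outbound(raw_scores: list[float]) -> list[float]:
    """Turn raw outgoing-edge scores for one source into transition probabilities.

    Falls back to uniform weighting when any score is non-finite or the row
    sum is not positive, matching the fallback rule above.
    """
    n = len(raw_scores)
    if n == 0:
        return []
    if not all(math.isfinite(s) for s in raw_scores) or sum(raw_scores) <= 0.0:
        return [1.0 / n] * n  # uniform fallback, same as standard PageRank
    total = sum(raw_scores)
    return [s / total for s in raw_scores]
```

Note the collapse property: equal raw scores normalize to exactly `1 / outdegree`, which is the standard-PageRank row.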

### Missing-feature handling

- missing `context_class` => use neutral `context_factor = 1.0`
- missing `link_ordinal` or `source_internal_link_count` => treat like a single-link page, so `position_factor = 1.0`
- blank `anchor_text`
  - bare URL => `bare_url_factor`
  - non-bare anchor with blank text => `empty_anchor_factor`
- malformed or non-finite raw score => ignore that row's feature weighting and fallback that source row to uniform outbound probability
- source with no outgoing edges => treat as dangling exactly like current `pagerank.py`

### Weighted authority iteration

Use the same damping semantics and convergence strategy as the current `backend/apps/pipeline/services/pagerank.py`.

Definitions:

- `teleport = 0.15`
- `max_iter = 100`
- `tolerance = 1e-6`

Iteration:

- `link_mass = P_w @ ranks`
- `dangling_mass = sum(ranks[source] for dangling sources)`
- `next_ranks = (1 - teleport) * link_mass`
- `next_ranks += ((1 - teleport) * dangling_mass + teleport) / N`
- renormalize so `sum(next_ranks) == 1`
- stop when L1 delta is below `tolerance`, or after `max_iter`

Persist result into `ContentItem.weighted_pagerank_score`.

Deleted content handling:

- same as current PageRank
- non-deleted content participates
- deleted content gets reset to `0.0`
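The iteration above can be sketched in plain Python. The real implementation in `pagerank.py` uses a sparse matrix, so this dict-of-edges version only illustrates the damping, dangling-mass, and renormalization semantics:

```python
def weighted_pagerank(
    n: int,
    edges: dict[int, list[tuple[int, float]]],  # source -> [(dest, P_w(source->dest)), ...]
    teleport: float = 0.15,
    max_iter: int = 100,
    tolerance: float = 1e-6,
) -> list[float]:
    ranks = [1.0 / n] * n
    dangling = [i for i in range(n) if not edges.get(i)]
    for _ in range(max_iter):
        nxt = [0.0] * n
        # link_mass: push rank along weighted outbound probabilities
        for src, outs in edges.items():
            for dst, p in outs:
                nxt[dst] += (1.0 - teleport) * p * ranks[src]
        # dangling mass and teleport are spread uniformly
        dangling_mass = sum(ranks[i] for i in dangling)
        shared = ((1.0 - teleport) * dangling_mass + teleport) / n
        nxt = [v + shared for v in nxt]
        # renormalize so the ranks sum to 1 despite float drift
        total = sum(nxt)
        nxt = [v / total for v in nxt]
        delta = sum(abs(a - b) for a, b in zip(nxt, ranks))
        ranks = nxt
        if delta < tolerance:
            break
    return ranks
```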

## Stored Fields Required

### ExistingLink

Add fields to `backend/apps/graph/models.py`:

- `extraction_method: CharField`
- `link_ordinal: PositiveIntegerField(null=True, blank=True)`
- `source_internal_link_count: PositiveIntegerField(null=True, blank=True)`
- `context_class: CharField`

Recommended indexes:

- `Index(fields=["from_content_item", "link_ordinal"])`
- keep current `from_content_item` and `to_content_item` indexes

### ContentItem

Add field to `backend/apps/content/models.py`:

- `weighted_pagerank_score: FloatField(default=0.0, db_index=True)`

Recommended index:

- `Index(fields=["content_type", "weighted_pagerank_score"])`

### Suggestion

Add field to `backend/apps/suggestions/models.py`:

- `score_weighted_pagerank: FloatField(default=0.0)`

Reason:

- review needs to compare standard vs weighted destination authority side by side;
- this mirrors the existing `score_pagerank` pattern;
- it does not require a larger explanation-model rewrite in this phase.

## Settings, Defaults, Bounds, and Validation

### Settings storage

Persist through `AppSetting` in category `ml`.

Keys:

- `weighted_authority.ranking_weight`
- `weighted_authority.position_bias`
- `weighted_authority.empty_anchor_factor`
- `weighted_authority.bare_url_factor`
- `weighted_authority.weak_context_factor`
- `weighted_authority.isolated_context_factor`

### Defaults

- `ranking_weight = 0.0`
- `position_bias = 0.5`
- `empty_anchor_factor = 0.6`
- `bare_url_factor = 0.35`
- `weak_context_factor = 0.75`
- `isolated_context_factor = 0.45`

### Bounds

- `0.0 <= ranking_weight <= 0.25`
- `0.0 <= position_bias <= 1.0`
- `0.1 <= empty_anchor_factor <= 1.0`
- `0.1 <= bare_url_factor <= 1.0`
- `0.1 <= weak_context_factor <= 1.0`
- `0.1 <= isolated_context_factor <= 1.0`

### Validation rules

- every numeric setting must be finite
- context and kind factors must be positive
- `isolated_context_factor <= weak_context_factor <= 1.0`
- `bare_url_factor <= 1.0`
- saving settings does not change `pagerank_score`
- changed settings only affect `weighted_pagerank_score` after a weighted-authority recalculation or the next full graph refresh/import cycle
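The bounds and ordering rules can be checked with a small validator sketch. The `BOUNDS` table just restates the limits above, and the function name is hypothetical; the real check would live in the settings serializer or view.

```python
import math

BOUNDS = {
    "ranking_weight": (0.0, 0.25),
    "position_bias": (0.0, 1.0),
    "empty_anchor_factor": (0.1, 1.0),
    "bare_url_factor": (0.1, 1.0),
    "weak_context_factor": (0.1, 1.0),
    "isolated_context_factor": (0.1, 1.0),
}

def validate_weighted_authority_settings(values: dict[str, float]) -> list[str]:
    """Return a list of human-readable validation errors (empty list = valid)."""
    errors = []
    for key, (lo, hi) in BOUNDS.items():
        v = values.get(key)
        if v is None or not math.isfinite(v):
            errors.append(f"{key} must be a finite number")
        elif not (lo <= v <= hi):
            errors.append(f"{key} must be between {lo} and {hi}")
    if not errors:
        # Cross-field rule: isolated context must never outrank weak context.
        if values["isolated_context_factor"] > values["weak_context_factor"]:
            errors.append("isolated_context_factor must not exceed weak_context_factor")
    return errors
```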
 
More crunching...


## Ranking Feature Assembly



### Current-state constraint



Do not rewrite current `score_quality`, velocity handling, or destination PageRank handling as part of FR-006.



That would mix FR-006 with other backlog cleanup.



### Chosen ranking behavior



- Always compute and persist `weighted_pagerank_score`.

- Keep standard `pagerank_score` untouched and still exposed everywhere it already appears.

- Add `weighted_pagerank_score` as a new optional destination-level ranking signal.

- Gate its effect with `weighted_authority.ranking_weight`, default `0.0`.



### Exact ranker behavior



In `backend/apps/pipeline/services/pipeline.py` and `backend/apps/pipeline/services/ranker.py`:



- load `weighted_pagerank_score` into `ContentRecord`

- derive global min/max bounds for `weighted_pagerank_score`

- normalize it with the same log-minmax style already used for PageRank-derived quality signals

- compute:

  - `score_weighted_authority_component = normalized(destination.weighted_pagerank_score)`

- add to final score:

  - `score_final += ranking_weight * score_weighted_authority_component`
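A sketch of the gated component, assuming the existing log-minmax helper behaves like `log1p` followed by clamped min-max scaling (the exact transform lives in `ranker.py` and may differ):

```python
import math

def log_minmax(value: float, lo: float, hi: float) -> float:
    # Assumed transform: log1p then min-max scaling, clamped to [0, 1].
    if hi <= lo:
        return 0.0
    scaled = (math.log1p(max(value, 0.0)) - math.log1p(max(lo, 0.0))) / (
        math.log1p(max(hi, 0.0)) - math.log1p(max(lo, 0.0))
    )
    return min(1.0, max(0.0, scaled))

def apply_weighted_authority(
    score_final: float,
    weighted_pr: float,
    bounds: tuple[float, float],
    ranking_weight: float,
) -> float:
    """Add the gated weighted-authority component to an existing final score."""
    component = log_minmax(weighted_pr, *bounds)
    return score_final + ranking_weight * component
```

With the default `ranking_weight = 0.0`, the final score is unchanged by construction, which is exactly the opt-in behavior FR-006 requires.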



Persist on `Suggestion`:



- `score_pagerank = destination.pagerank_score`

- `score_weighted_pagerank = destination.weighted_pagerank_score`



Important:



- `score_weighted_pagerank` is for review comparison

- the normalized internal component does not need its own stored DB field in FR-006



Reason:



- this keeps the schema and UI change small;

- the review surface can still compare standard vs weighted authority directly;

- ranking impact remains opt-in and bounded.



## Diagnostics to Expose



### Review detail



Extend `SuggestionDetailSerializer` and the Angular review dialog to show:



- `score_pagerank`

- `score_weighted_pagerank`

- simple delta label: `weighted - standard`



Display label recommendation:



- `PageRank`

- `Weighted Authority`



### Content APIs and admin



Expose `weighted_pagerank_score` beside `pagerank_score` in:



- content list/detail serializers

- content admin list/detail

- suggestion admin score breakdown



### Edge inspection



Expose edge weighting features in `ExistingLinkAdmin`:



- `extraction_method`

- `link_ordinal`

- `source_internal_link_count`

- `context_class`



No new public ExistingLink API is required in FR-006.



### Run/config comparison



For reproducibility, FR-006 should start persisting its settings into `PipelineRun.config_snapshot` when a pipeline run starts.



Minimum snapshot payload addition:



- the six `weighted_authority.*` settings values



Reason:



- the field already exists;

- FR-006 explicitly asks for tuning and comparison;

- without a snapshot, later review cannot tell which weighted-authority settings produced a run.



## API, Admin, Review, and UI Impact



### Backend API



Add:



- `GET /api/settings/weighted-authority/`

- `PUT /api/settings/weighted-authority/`

- `POST /api/settings/weighted-authority/recalculate/`



Recalculate endpoint behavior:



- dispatch a Celery task that recomputes `weighted_pagerank_score` from the current graph and current FR-006 settings

- return `202` with a `job_id`

- do not force a full content resync



### Admin



Likely touched:



- `backend/apps/content/admin.py`

- `backend/apps/graph/admin.py`

- `backend/apps/suggestions/admin.py`



### Review UI



Likely touched:



- `frontend/src/app/review/suggestion.service.ts`

- `frontend/src/app/review/suggestion-detail-dialog.component.html`



Required change:



- add one new review row for `Weighted Authority`

- keep current `PageRank` row

- do not replace existing labels



### Settings UI



Likely touched:



- `frontend/src/app/settings/silo-settings.service.ts`

- `frontend/src/app/settings/settings.component.ts`

- `frontend/src/app/settings/settings.component.html`



Required controls:



- numeric fields for the six FR-006 settings

- save action

- separate `Recalculate Weighted Authority` action

- short helper text saying this does not replace standard PageRank



### Out of scope UI work



- no full `/graph` visualization work in FR-006

- no new diagnostics dashboard screen in FR-006



## Fallback Behavior When Disabled or Incomplete



- Standard `pagerank_score` keeps computing exactly as it does today.

- If `ranking_weight == 0.0`, suggestion ranking stays unchanged even though weighted authority is computed and stored.

- If FR-006 settings are saved but recalculation has not run yet, old `weighted_pagerank_score` values remain in place until recalculated.

- If graph rows still have legacy null FR-006 edge features, weighted computation uses neutral fallbacks and still produces a valid matrix.

- If a whole source row cannot produce valid weighted features, that source row falls back to uniform outbound probability.



## Regression Risks and Concrete Mitigations



### 1. Wrong link order because the current extractor does not order matches across syntax families



Mitigation:



- replace `_find_urls()` with ordered match-span extraction

- sort all candidate matches by source offset before dedup and feature assignment

- add mixed BBCode + HTML + bare URL tests
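The ordered match-span idea can be sketched as follows. The regex patterns are deliberately simplified stand-ins for whatever `link_parser.py` actually uses (for example, the bare-URL pattern here would swallow trailing punctuation); the point is collecting spans from every syntax family, filtering bare matches that overlap anchor matches, then sorting by offset before dedup:

```python
import re

# Illustrative patterns; the real extractor's regexes differ.
PATTERNS = {
    "bbcode_anchor": re.compile(r"\[URL=(?P<url>[^\]]+)\](?P<anchor>.*?)\[/URL\]", re.I | re.S),
    "html_anchor": re.compile(r'<a\s+href="(?P<url>[^"]+)"[^>]*>(?P<anchor>.*?)</a>', re.I | re.S),
    "bare_url": re.compile(r"(?P<url>https?://\S+)"),
}

def extract_links_in_document_order(body: str) -> list[dict]:
    """Collect matches from all syntax families, sort by source offset, dedup.

    link_ordinal then reflects true document order, not regex-family order.
    """
    matches = []
    for method in ("bbcode_anchor", "html_anchor"):
        for m in PATTERNS[method].finditer(body):
            matches.append({"start": m.start(), "end": m.end(), "url": m.group("url"),
                            "anchor": m.group("anchor").strip(), "extraction_method": method})
    # Bare URLs count only where no anchor match already covers that span.
    for m in PATTERNS["bare_url"].finditer(body):
        if any(a["start"] <= m.start() < a["end"] for a in matches):
            continue
        matches.append({"start": m.start(), "end": m.end(), "url": m.group("url"),
                        "anchor": "", "extraction_method": "bare_url"})
    matches.sort(key=lambda x: x["start"])
    # Same-destination dedup keeps the earliest occurrence.
    deduped, seen = [], set()
    for match in matches:
        if match["url"] not in seen:
            seen.add(match["url"])
            match["link_ordinal"] = len(deduped)
            deduped.append(match)
    return deduped
```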



### 2. Edge features go stale because the current graph sync does not update retained rows



Mitigation:



- update `sync_existing_links(...)` so it bulk-updates changed FR-006 feature fields on edges that still exist

- do not rely on create/delete only



### 3. Settings changes create stale derived scores



Mitigation:



- persist only stable extraction features on edges

- do not persist settings-dependent normalized probabilities on edges

- provide a dedicated recalculation task for `weighted_pagerank_score`



### 4. Scope creep into later phases



Mitigation:



- do not use source freshness, target freshness, surrounding-text-to-destination phrase relevance, field weights, silo features, or reranking logic in the FR-006 weight formula



### 5. Ranking regressions



Mitigation:



- keep `ranking_weight` default at `0.0`

- preserve `pagerank_score` and all existing ranking math

- add parity tests that prove unchanged ranking when FR-006 ranking weight is zero



### 6. Performance regression on large graphs



Mitigation:



- keep sparse-matrix approach

- reuse current node-loading shape from `pagerank.py`

- calculate per-source normalized weights once per run, not per iteration

- persist only small stable edge features



## Exact Repo Modules / Files Likely To Be Touched



### Graph extraction and persistence



- `backend/apps/pipeline/services/link_parser.py`

- `backend/apps/graph/services/graph_sync.py`

- `backend/apps/graph/models.py`

- `backend/apps/graph/admin.py`

- `backend/apps/graph/tests.py`

- `backend/apps/graph/migrations/<new migration>`



### Authority computation



- `backend/apps/pipeline/services/pagerank.py` or a new sibling service such as `backend/apps/pipeline/services/weighted_pagerank.py`

- `backend/apps/pipeline/tasks.py`

- `backend/apps/pipeline/tests.py`



### Content and ranking



- `backend/apps/content/models.py`

- `backend/apps/content/serializers.py`

- `backend/apps/content/views.py`

- `backend/apps/content/admin.py`

- `backend/apps/content/migrations/<new migration>`

- `backend/apps/pipeline/services/pipeline.py`

- `backend/apps/pipeline/services/ranker.py`



### Suggestions and review



- `backend/apps/suggestions/models.py`

- `backend/apps/suggestions/serializers.py`

- `backend/apps/suggestions/views.py`

- `backend/apps/suggestions/admin.py`

- `backend/apps/suggestions/migrations/<new migration>`

- `frontend/src/app/review/suggestion.service.ts`

- `frontend/src/app/review/suggestion-detail-dialog.component.html`



### Settings and API



- `backend/apps/core/views.py`

- `backend/apps/api/urls.py`

- `frontend/src/app/settings/silo-settings.service.ts`

- `frontend/src/app/settings/settings.component.ts`

- `frontend/src/app/settings/settings.component.html`



## Test Plan



### 1. Uniform-weight parity behavior



- Build a small synthetic graph where every outgoing link has the same FR-006 feature values.

- Assert `weighted_pagerank_score` matches `pagerank_score` within a tight tolerance.

- Assert ranking order is unchanged when `ranking_weight = 0.0`.
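The parity check can be sketched as a self-contained test: build one small graph twice, once with implicit uniform weights and once with equal raw scores normalized per source, and assert the two rank vectors match. The tiny `pagerank` helper below is illustrative, not the repo's implementation.

```python
def pagerank(n, edges, teleport=0.15, iters=100):
    """Tiny power iteration; `edges` maps source -> [(dest, probability), ...]."""
    ranks = [1.0 / n] * n
    for _ in range(iters):
        nxt = [teleport / n] * n
        dangling = sum(ranks[s] for s in range(n) if not edges.get(s))
        nxt = [v + (1 - teleport) * dangling / n for v in nxt]
        for s, outs in edges.items():
            for d, p in outs:
                nxt[d] += (1 - teleport) * p * ranks[s]
        ranks = nxt
    return ranks

def test_uniform_weight_parity():
    # Same 3-node graph, expressed once with implicit uniform weights and
    # once with equal raw scores normalized per source.
    uniform = {0: [(1, 0.5), (2, 0.5)], 1: [(2, 1.0)], 2: [(0, 1.0)]}
    raw = {0: [(1, 0.7), (2, 0.7)], 1: [(2, 0.3)], 2: [(0, 0.9)]}
    normalized = {
        s: [(d, score / sum(x for _, x in outs)) for d, score in outs]
        for s, outs in raw.items()
    }
    a = pagerank(3, uniform)
    b = pagerank(3, normalized)
    assert all(abs(x - y) < 1e-9 for x, y in zip(a, b))

test_uniform_weight_parity()
```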



### 2. Monotonicity



- In a small graph with one source linking to two targets, increase only one edge's favorable features:

  - earlier ordinal

  - contextual instead of isolated

  - anchor instead of bare URL

- Assert that edge's normalized outbound probability rises.

- Assert the favored destination's weighted authority does not decrease in the simple graph.



### 3. Boundedness



- Assert every outbound weighted row sums to `1.0` within tolerance.

- Assert all weighted-authority scores are non-negative.

- Assert all scores sum to `1.0` after convergence.



### 4. Normalization stability



- Feed rows with null FR-006 features.

- Feed rows with malformed data that would otherwise create non-finite raw scores.

- Assert the implementation falls back to uniform outbound probability for the affected source row.



### 5. Boilerplate down-weighting



- Create a source page with:

  - one early contextual anchor link

  - one late isolated bare URL

- Assert the early contextual anchor gets the higher normalized edge probability.



### 6. Contextual or editorial up-weighting



- Same source, same approximate position, different link types:

  - explicit anchor in prose

  - naked URL

- Assert the prose anchor wins.



### 7. Ordered extraction correctness



- Mixed BBCode, HTML, and bare URLs in one source body.

- Assert `link_ordinal` follows true source order, not regex-family order.

- Assert same-destination duplicates keep the earliest occurrence.



### 8. Existing-link persistence correctness



- Change an existing source post so an edge keeps the same source and destination but moves position or changes context class.

- Assert `sync_existing_links(...)` updates the FR-006 feature fields in place.



### 9. Coexistence with existing `pagerank_score`



- Assert standard `run_pagerank()` still produces the same values as before on unchanged fixtures.

- Assert `weighted_pagerank_score` is stored separately.

- Assert `pagerank_score` is never overwritten by the weighted run.



### 10. Diagnostics correctness



- Serializer tests:

  - `ContentItem` detail includes both `pagerank_score` and `weighted_pagerank_score`

  - `SuggestionDetail` includes both `score_pagerank` and `score_weighted_pagerank`

- Admin tests:

  - new fields are present and read-only where expected

- Frontend contract test or smoke test:

  - review dialog renders both labels and does not replace the existing PageRank row



### 11. Settings validation and recalculation flow



- Invalid values outside bounds return `400`.

- Saving valid settings persists to `AppSetting`.

- Recalculation endpoint returns `202`.

- After recalculation, `weighted_pagerank_score` changes while `pagerank_score` remains unchanged.



## Implementation Decision



Path chosen: **Path B**.



The source material is sufficient for a defensible FR-006 implementation specification if the repo uses:



- a conservative, deterministic link-weight heuristic;

- explicit per-source outbound normalization;

- separate persistence for standard and weighted authority;

- opt-in ranking influence by default;

- clear phase boundaries that keep later features out of this slice.
 