Wisdom — Discord Knowledge Harvester
Three-phase tool that extracts technical knowledge from the twinBASIC Discord server and drafts documentation additions for human review. Source under wisdom/.
- Phase 1 (Export): fetch Discord messages to data/raw/.
- Phase 2 (Process): convert raw JSON to structured Markdown in data/threads/.
- Phase 3 (Extract): run Claude agents over the threads, draft documentation additions, and produce a staging file for review.
Prerequisites
- Node.js 18+ (uses native fetch and ES modules).
- No external dependencies: the tool uses only Node.js built-in APIs. No npm install needed.
- Discord token: for Phases 1 and 2.
- Claude Code session: for Phase 3 (the extract workflow runs Claude agents).
Setup
Discord token
Create a wisdom/.token file with your Discord token on one line:
xMTEz...your-token-here
Lines starting with # are ignored. Alternatively, set the DISCORD_TOKEN environment variable.
The tool auto-detects whether the token is a bot token or a user token and applies the appropriate rate limits (bot: 5 req/s, 200/session; user: 2 req/s, 100/session).
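The token-file rule described above (one token per line, lines starting with # ignored, DISCORD_TOKEN as the alternative) could be sketched roughly as follows. The function names parseTokenFile and resolveToken are hypothetical, and the precedence of the file over the environment variable is an assumption; the source does not say which one wins.

```javascript
// Illustrative sketch only; not the tool's actual code.
// Returns the first non-blank, non-comment line of a .token file, or null.
function parseTokenFile(text) {
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === '' || line.startsWith('#')) continue; // skip blanks and # comments
    return line;
  }
  return null;
}

// Assumption: the file is consulted first, then DISCORD_TOKEN.
function resolveToken(fileText, env) {
  const fromFile = fileText == null ? null : parseTokenFile(fileText);
  return fromFile ?? env.DISCORD_TOKEN ?? null;
}

const sample = '# bot token for the harvester\nxMTEz...your-token-here\n';
console.log(parseTokenFile(sample));
```

Passing the file text and the environment object in as arguments keeps the sketch free of file-system calls, so it can be exercised without a real token on disk.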
Configuration
wisdom/config.jsonc controls which channels to include, concurrency, and rate limits. The defaults target the twinBASIC Discord server. Override per-run via CLI flags:
- --guild <id> overrides guild_id (or set DISCORD_GUILD_ID).
- --concurrency <n>, --rate-limit <n>, --cap <n> override the limits block.
Typical run
The default flow is incremental: each phase tracks its own watermark so only content that has changed since the last successful run is processed.
node wisdom/wisdom.mjs export
node wisdom/wisdom.mjs process
node wisdom/wisdom.mjs extract
Then tell Claude: “Run the wisdom extract workflow.” Claude invokes the workflow, writes the results, and runs the merge step, which grafts additions into staging.md and advances the watermark in extract-state.json.
If the workflow completes but the merge step was not run (e.g. the session was interrupted), merge manually — no agents run and no API costs are incurred:
node wisdom/wisdom.mjs extract --merge
First run (no watermark yet): when extract-state.json does not exist, the incremental filter has no baseline and treats every thread as new. For a large corpus this produces thousands of agent calls. Use --since to scope the first run to a manageable date range, or bootstrap the state file by running the extract workflow once with --all and then merging.
Date-scoped run (bypass the watermark, filter by thread creation date instead):
node wisdom/wisdom.mjs export --since 2025-06-01
node wisdom/wisdom.mjs process --since 2025-06-01
node wisdom/wisdom.mjs extract --since 2025-06-01
The --since mode writes to a sideband file (staging-since-<date>.md) and does not touch the canonical staging.md or the watermark.
The extract step automatically partitions large thread sets into batches of 200, so filtering is optional — but --since and --channel reduce the number of threads analysed (and therefore agent invocations and API costs).
Reviewing staging.md
staging.md is the long-lived review file. Each ## section is one proposed documentation addition, with structured metadata at the bottom (source thread IDs, confidence level, date range, optional reviewer note). Sections are grouped by target page and delimited by --- lines.
Removing sections. Delete any section that is not useful — the removal is stable. If the source thread is unchanged on the next run, the watermark filter skips it entirely and the section stays gone. If the thread later receives new messages, the thread re-enters the pipeline and the agent may produce a fresh finding that accounts for the new context; it reappears with a [REFINED?] marker so the reviewer knows it is a revision of something already triaged.
Locking sections. Append [LOCKED] to any section heading (e.g. ## docs/Reference/VB/Form/index.md · after-remarks [LOCKED]) to freeze it. The merger skips locked sections on every future run, even if a new extraction produces a matching key.
Markers. The merger may tag section headings with bracketed markers:
- [DUPLICATE?]: multiple threads produced overlapping findings for the same page and section.
- [REFINED?]: this finding was previously emitted and removed from staging (presumably reviewed and integrated); the source thread has since changed and the agent produced a new version.
- [LOCKED]: reviewer-applied; the merger will not touch this section.
CLI reference
All commands run from the repository root via node wisdom/wisdom.mjs <command> [options].
Phase 1 — Export
Fetches messages from Discord channels and forum threads.
node wisdom/wisdom.mjs export
Outputs raw JSON under wisdom/data/raw/. Supports incremental runs — a manifest tracks the highest message ID per channel, so re-running fetches only new messages. Use --force to re-fetch everything.
| Flag | Effect |
|---|---|
| --since <date> | Only fetch messages after this ISO 8601 date |
| --channel <id> | Restrict to one channel (repeatable) |
| --dry-run | Discover channels/threads; do not fetch messages |
| --force | Ignore manifest; re-fetch all history |
When the session request cap is reached, the tool exits with code 2 — re-run to continue where it left off.
Phase 2 — Process
Converts raw JSON into structured Markdown with YAML frontmatter.
node wisdom/wisdom.mjs process
Outputs one .md file per forum thread under wisdom/data/threads/<channel-name>/, plus one per text channel. Each file has frontmatter (thread ID, channel, tags, reaction counts, has_answer flag) and a rendered message body with timestamps and author names.
| Flag | Effect |
|---|---|
| --since <date> | Only process threads created after this date |
| --channel <id> | Restrict to threads from this channel ID (repeatable) |
| --force | Regenerate all files (skip modification-time check) |
Phase 3 — Extract
Two steps: a CLI prep step, then a Claude Code workflow.
Prepare
node wisdom/wisdom.mjs extract
Scans the processed thread files and the docs/Reference/ directory, builds a sitemap of all documented symbols, and writes prep data to wisdom/data/findings/. No Claude agents run yet.
- 200 threads or fewer: writes a single extract-prep.json.
- More than 200 threads: partitions into batch files (extract-batch-0.json, extract-batch-1.json, …) and writes an extract-manifest.json describing the batches.
The prep step also writes two shared reference files that workflow agents read from disk: package-summary.txt (the documented-symbols list) and page-index.json (a compact lookup from package/symbol to documentation file path).
| Flag | Effect |
|---|---|
| --since <date> | Only analyse threads created after this date |
| --channel <name> | Restrict to threads from this channel name (repeatable) |
| --min-confidence <level> | Skip findings below high, medium, or low (default: low) |
| --all | Bootstrap: process all threads, ignoring state and channel filter |
| --force | Re-process threads even if their watermark matches state |
| --dry-run | Write the prep file but do not invoke the workflow |
| --merge | Graft extract-results-*.json into staging.md and advance state (no agents) |
Run the extraction workflow
After the prep step finishes, tell Claude (in the same session or a new one):
Run the wisdom extract workflow.
Claude detects whether a manifest or a single prep file was produced and handles either case:
- Single batch: reads the prep file, invokes the workflow, writes the result file, runs the merge.
- Multi-batch: reads the manifest, runs one workflow per batch, saves each result to extract-results-{i}.json, then merges all results into staging.md. If interrupted, re-running picks up from the last incomplete batch.
Implementation
File layout
wisdom/
  wisdom.mjs          Entry point --- CLI parser, runExport(), dispatch
  config.mjs          Load config.jsonc, apply CLI overrides
  config.jsonc        Server/channel/rate-limit configuration
  discord/            Phase 1 --- Discord API layer
    api.mjs           HTTP client, auth, rate-limiter, snowflake utilities
    discover.mjs      Channel + thread discovery, member fetch
    messages.mjs      Paginated message fetch, manifest (watermark) I/O
  process/            Phase 2 --- raw JSON to structured Markdown
    thread.mjs        Orchestrator: iterate raw files, apply filters, write .md
    filter.mjs        Keep only user-content message types (default + reply)
    frontmatter.mjs   Build YAML frontmatter from thread/channel metadata
    render.mjs        Render messages to Markdown (timestamps, replies, reactions)
    slugify.mjs       Thread-name-to-filename slug
  extract/            Phase 3 --- Claude-agent knowledge extraction
    prep.mjs          runExtract() (prep + batch partitioning) and runMerge()
    state.mjs         extract-state.json I/O, watermark comparison, emission log
    sitemap.mjs       Walk docs/Reference/, build package-summary + page-index
    merger.mjs        Parse/merge/serialise staging.md
    schemas.mjs       JSON Schema definitions for findings and additions
    workflow.mjs      Claude Code Workflow script (runs inside the Workflow tool)
Entry point — wisdom.mjs
Parses process.argv into { command, flags } and dispatches to the handler for that command: runExport() (defined inline), runProcess() (imported from process/thread.mjs), or runExtract()/runMerge() (imported from extract/prep.mjs).
Also defines runConcurrent(items, concurrency, fn) — a simple worker-pool: spawns min(concurrency, items.length) async workers that pull from a shared index counter. If any worker throws CapReachedError, the pool drains without starting new items.
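A worker pool of the kind described above might look roughly like this. It is a sketch under stated assumptions, not the code in wisdom.mjs: the CapReachedError class body, the returned results array, and the choice to rethrow the cap error once in-flight work has finished are all guesses about details the description leaves open.

```javascript
// Illustrative sketch only; the real runConcurrent may differ.
class CapReachedError extends Error {}

async function runConcurrent(items, concurrency, fn) {
  const results = new Array(items.length);
  let next = 0;        // shared index counter that every worker pulls from
  let capError = null; // set once any worker hits the session cap

  async function worker() {
    // Stop claiming new items as soon as the cap has been reached.
    while (capError === null && next < items.length) {
      const i = next++;
      try {
        results[i] = await fn(items[i], i);
      } catch (err) {
        if (!(err instanceof CapReachedError)) throw err; // other errors propagate
        capError = err; // drain: items already in flight finish, no new ones start
      }
    }
  }

  const workerCount = Math.min(concurrency, items.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  if (capError) throw capError; // assumption: the caller maps this to exit code 2
  return results;
}
```

Because JavaScript runs each worker on one thread and only yields at await, the shared counter needs no locking: next++ is never interleaved with another worker's read.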
Discord API client — discord/api.mjs
createClient(config) returns { request, queryCount, tier, sessionCap }.
- Auth detection. Probes /users/@me with Bot <token> first; on failure, retries with the bare token (user-token auth). Sets tier to 'bot' or 'user', which selects the rate-limit profile from config.
- Rate limiter. Enforces requests_per_second via a minimum inter-request delay. User-tier adds +/-20% jitter. Also reads Discord’s X-RateLimit-Remaining / X-RateLimit-Reset-After headers and sleeps when a route bucket is exhausted.
- Session cap. Throws CapReachedError when queryCount reaches the configured cap. The caller catches this and exits with code 2; re-running resumes via the export manifest.
- 429 handling. On HTTP 429, reads retry_after from the response body and recursively retries.
- Snowflake utilities. snowflakeToTimestamp() and timestampToSnowflake() convert between Discord snowflake IDs and Unix-millis timestamps using BigInt arithmetic (shift by 22 bits + the Discord epoch).
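The snowflake conversion mentioned above can be illustrated with a short sketch. The function names come from the description; the argument and return types (string or BigInt in, Number or string out) are assumptions. The Discord epoch of 1420070400000 ms (2015-01-01T00:00:00Z) and the 22-bit shift are from Discord's published snowflake format.

```javascript
// Illustrative sketch only; types and rounding behaviour are assumptions.
const DISCORD_EPOCH_MS = 1420070400000n; // 2015-01-01T00:00:00Z

// The top bits of a snowflake hold milliseconds since the Discord epoch.
function snowflakeToTimestamp(snowflake) {
  return Number((BigInt(snowflake) >> 22n) + DISCORD_EPOCH_MS);
}

// Produces the smallest snowflake for a given Unix-millis timestamp
// (worker, process and sequence bits all zero), returned as a string.
function timestampToSnowflake(timestampMs) {
  return ((BigInt(timestampMs) - DISCORD_EPOCH_MS) << 22n).toString();
}

const cutoff = Date.UTC(2025, 5, 1); // 2025-06-01, e.g. for a --since filter
const boundary = timestampToSnowflake(cutoff);
console.log(boundary, new Date(snowflakeToTimestamp(boundary)).toISOString());
```

BigInt matters here because snowflakes are 64-bit values and exceed the range in which a JavaScript Number represents integers exactly.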
Phase 1 control flow
- Discover (discord/discover.mjs): fetch the full channel list from /guilds/{id}/channels, filter by type (text/forum) and exclude patterns. For forums, paginate /channels/{id}/threads/archived/public and (bot-only) /guilds/{id}/threads/active, deduplicate, and filter by min_message_count.
- Fetch members (discord/discover.mjs): paginate /guilds/{id}/members (bot-only; user tokens get an empty map). Write guild.json and members.json.
- Build target list: merge text channels and forum threads into a single list. Targets that previously returned 403 (tracked in denied.json) sort to the end.
- Fetch messages (discord/messages.mjs): run targets through runConcurrent. For each target:
  - Check manifest: if the target’s highest-seen snowflake is recorded and the output file exists, skip (up-to-date).
  - Call fetchMessages(client, channelId, afterSnowflake):
    - Incremental (afterSnowflake set): page forward with ?after=, collecting new messages.
    - Full (afterSnowflake null): page backward with ?before=, collecting all history.
    - Sort chronologically (ascending snowflake).
  - Write { channel | thread, messages } to raw/channels/{id}.json or raw/threads/{id}.json.
  - Update manifest with highestSnowflake(messages) and flush to disk after each target.
The export manifest (raw/manifest.json) is a flat { channelOrThreadId: highestSnowflake } object. It governs incremental fetches — on the next run, only messages newer than the stored snowflake are requested.
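The manifest bookkeeping above can be sketched as a few pure helpers. Only highestSnowflake is named in the description; planFetch, sortChronologically, updateManifest, and the shape of the returned plan object are hypothetical. Snowflakes are compared as BigInt because comparing them as strings or Numbers gives wrong answers for IDs of different lengths or beyond the safe-integer range.

```javascript
// Illustrative sketch only; helper names other than highestSnowflake are invented.
function compareSnowflakes(a, b) {
  const x = BigInt(a);
  const y = BigInt(b);
  return x < y ? -1 : x > y ? 1 : 0;
}

// Ascending snowflake order is chronological order.
function sortChronologically(messages) {
  return [...messages].sort((m, n) => compareSnowflakes(m.id, n.id));
}

function highestSnowflake(messages) {
  let max = null;
  for (const m of messages) {
    if (max === null || compareSnowflakes(m.id, max) > 0) max = m.id;
  }
  return max;
}

// Chooses the paging mode for one target from its manifest entry.
function planFetch(manifest, targetId) {
  const cursor = manifest[targetId] ?? null;
  return cursor
    ? { mode: 'incremental', pageParam: 'after', cursor }
    : { mode: 'full', pageParam: 'before', cursor: null };
}

// Records the new watermark; an empty fetch leaves the entry unchanged.
function updateManifest(manifest, targetId, messages) {
  const highest = highestSnowflake(messages);
  if (highest !== null) manifest[targetId] = highest;
  return manifest;
}

const manifest = {};
console.log(planFetch(manifest, 'c1').mode);
updateManifest(manifest, 'c1', [{ id: '100' }, { id: '99' }, { id: '1000' }]);
console.log(planFetch(manifest, 'c1'));
```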
Phase 2 control flow
- Load guild.json to build channelMap (id to channel object) and tagMap (forum tag id to tag name).
- Load members.json for author display-name lookup.
- Iterate raw/threads/*.json:
  - Skip if --channel filter does not match thread.parent_id.
  - Skip if --since and thread creation date is before the cutoff.
  - Skip if output .md exists and its mtime >= input mtime (unless --force).
  - filterMessages() (filter.mjs): keep only type 0 (DEFAULT) and type 19 (REPLY).
  - shouldSkipThread(): drop threads with one or fewer messages or all-bot authors.
  - buildFrontmatter() (frontmatter.mjs): extract thread_id, title, channel, tags, created/archived timestamps, message_count, last_message_id (highest snowflake; the Phase 3 watermark), reply_count, reaction aggregates, has_answer flag.
  - renderMessages() (render.mjs): format each message as **Author** _timestamp_ + content + attachments + reactions. Reply messages get a reply-to header with a truncated quote of the parent.
  - Write threads/<channel-name>/<thread_id>--<slug>.md.
- Iterate raw/channels/*.json: same pipeline but simpler frontmatter (channel_id, title, message_count). One output file per text channel.
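The two filtering steps in the list above could be sketched as follows. The function names match the description, and the type and author.bot fields are standard Discord message fields, but the exact signatures and whether shouldSkipThread runs on the already-filtered list are assumptions.

```javascript
// Illustrative sketch only; signatures are assumed.
const KEPT_TYPES = new Set([0, 19]); // 0 = DEFAULT, 19 = REPLY

// Drops system messages (joins, pins, thread renames, and so on).
function filterMessages(messages) {
  return messages.filter((m) => KEPT_TYPES.has(m.type));
}

// A thread is skipped when it has at most one message or only bot authors.
function shouldSkipThread(messages) {
  if (messages.length <= 1) return true;
  return messages.every((m) => m.author?.bot === true);
}

const raw = [
  { type: 0, author: { username: 'alice' }, content: 'How do I subclass a Form?' },
  { type: 6, author: { username: 'alice' }, content: '' }, // pin notification
  { type: 19, author: { username: 'bob' }, content: 'Use Implements...' },
];
const kept = filterMessages(raw);
console.log(kept.length, shouldSkipThread(kept));
```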
Phase 3 control flow
Phase 3 has three stages: prep (Node.js), workflow (Claude agents), merge (Node.js).
Prep
runExtract() in extract/prep.mjs:
- Build sitemap (extract/sitemap.mjs): walk docs/Reference/, parse frontmatter to collect { path, title, permalink, parent } for every documented page. From this build two reference files:
  - package-summary.txt: a compact "- Package > Module: Symbol1, Symbol2, ..." text listing. Read by Extract agents to know what symbols exist.
  - page-index.json: a flat { "Package/Title": repoPath } JSON object with bare-title shortcuts for unambiguous names. Read by Draft agents to resolve findings to file paths.
- Load state (extract/state.mjs): read extract-state.json, a { processedThreads: { threadId: { last_message_id, message_count, emitted } } } map. If absent, returns empty state (all threads appear new).
- Filter threads: scan every .md file under data/threads/, parse its frontmatter, and check isThreadChanged(state, threadId, last_message_id, message_count). A thread passes the filter if and only if either field differs from the stored value, or the thread is not in state at all.
- Partition: if 200 or fewer threads, write a single extract-prep.json. Otherwise, sort by (channel, created), chunk into batches of 200, and write extract-batch-{i}.json files plus an extract-manifest.json. Each batch file contains thread file paths, a parallel file-size array (for the workflow’s byte-budget grouping), config, and mode metadata.
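The filter and partition steps above can be sketched as two small functions. isThreadChanged uses the signature given in the description; partitionThreads, the thread record shape ({ channel, created, path }), and the assumption that created is an ISO 8601 string and IDs are stored as strings are all hypothetical.

```javascript
// Illustrative sketch only; record shapes are assumptions.
const BATCH_SIZE = 200;

// True when the thread is unknown to state or either watermark field differs.
function isThreadChanged(state, threadId, lastMessageId, messageCount) {
  const seen = state.processedThreads?.[threadId];
  if (!seen) return true;
  return seen.last_message_id !== lastMessageId || seen.message_count !== messageCount;
}

// 200 or fewer threads: one prep file. More: sorted batches plus a manifest.
function partitionThreads(threads) {
  if (threads.length <= BATCH_SIZE) return { kind: 'single', batches: [threads] };
  const sorted = [...threads].sort(
    (a, b) => a.channel.localeCompare(b.channel) || a.created.localeCompare(b.created)
  );
  const batches = [];
  for (let i = 0; i < sorted.length; i += BATCH_SIZE) {
    batches.push(sorted.slice(i, i + BATCH_SIZE));
  }
  return { kind: 'manifest', batches };
}

const state = { processedThreads: { t1: { last_message_id: '500', message_count: 12, emitted: [] } } };
console.log(isThreadChanged(state, 't1', '500', 12)); // unchanged thread
console.log(isThreadChanged(state, 't1', '500', 11)); // a message was deleted
```

The second call shows why the watermark carries two fields: a deletion can lower the count without moving the highest snowflake.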
Workflow
extract/workflow.mjs is a Claude Code Workflow script — it runs inside the Workflow tool engine, not as a standalone Node.js process. The engine provides agent(), pipeline(), parallel(), phase(), log(), and an args global.
- Parse args: the batch file contents arrive as args (JSON). Extract thread paths, sizes, and config.
- Group threads by byte size: threads above 15 KB get a solo group. The rest are packed into groups of ~25 KB cumulative size. This reduces agent count by batching small threads together.
- Two-stage pipeline over the groups:
  - Stage 1: Extract. One Sonnet agent per group. The agent reads the thread .md file(s) from disk, reads package-summary.txt to know the documented symbol surface, and returns { findings: [...] } via structured output. Each finding has: thread_path, package, symbol, kind (gotcha/workaround/example/clarification/deprecation), summary, detail, confidence (high/medium/low), date_earliest, date_latest.
  - Stage 2: Draft. One Sonnet agent per group (only if Stage 1 produced findings above the confidence threshold). The agent reads page-index.json to resolve each finding’s package/symbol to a documentation file path, reads the resolved target page(s) to understand current content, and returns { additions: [...] } via structured output. Each addition has: thread_path, target_page (repo-relative path or “UNMAPPED”), section (after-remarks/example/see-also/new-section), draft (exact Markdown to insert), confidence, date_earliest, date_latest, reviewer_note.
- Assemble: flatten all additions, extract thread IDs from file paths, and return { additions: [...] }.
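The byte-size grouping in the steps above might be implemented along these lines. This is a sketch under assumptions: the function name groupThreads is invented, 1 KB is taken as 1024 bytes, and packing is greedy in input order; the source states only the 15 KB solo threshold and the approximate 25 KB budget.

```javascript
// Illustrative sketch only; thresholds come from the text, the packing strategy is assumed.
const SOLO_THRESHOLD = 15 * 1024; // threads above this get their own group
const GROUP_BUDGET = 25 * 1024;   // approximate cumulative size per packed group

function groupThreads(paths, sizes) {
  const groups = [];
  let current = [];
  let currentBytes = 0;
  paths.forEach((path, i) => {
    const size = sizes[i];
    if (size > SOLO_THRESHOLD) {
      groups.push([path]); // large thread: one agent to itself
      return;
    }
    if (current.length > 0 && currentBytes + size > GROUP_BUDGET) {
      groups.push(current); // budget would overflow: close the open group
      current = [];
      currentBytes = 0;
    }
    current.push(path);
    currentBytes += size;
  });
  if (current.length > 0) groups.push(current);
  return groups;
}

console.log(groupThreads(['a', 'b', 'c', 'd', 'e'], [20000, 10000, 10000, 10000, 1000]));
```

With these sizes the 20,000-byte thread stands alone and the four small threads collapse into two groups, so five threads need three Extract agents instead of five.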
The pipeline runs both stages without a barrier — Stage 2 for group A starts as soon as group A’s Stage 1 completes, while other groups may still be in Stage 1.
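The barrier-free behaviour can be illustrated with plain promises. This sketch does not use the Workflow engine's pipeline() or agent() functions, whose signatures are not given here; runPipeline and the extract and draft callbacks are hypothetical stand-ins.

```javascript
// Illustrative sketch only; plain promises standing in for the Workflow engine.
async function runPipeline(groups, extract, draft) {
  // Each group advances through both stages on its own promise chain,
  // so no group waits for the others to finish Stage 1.
  const perGroup = groups.map(async (group) => {
    const findings = await extract(group);
    if (findings.length === 0) return []; // nothing to draft for this group
    return draft(group, findings);
  });
  return (await Promise.all(perGroup)).flat();
}
```

A barrier version would instead await Promise.all over every extract call before starting any draft call, which leaves Draft agents idle until the slowest Extract agent returns.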
Merge
runMerge() in extract/prep.mjs, invoked by node wisdom/wisdom.mjs extract --merge.
- Collect results: read all extract-results-*.json files and concatenate their additions arrays.
- Graft into staging.md (extract/merger.mjs, graftAdditions):
  - Parse existing staging.md into { preamble, sections[] }. The parser splits on --- delimiter lines, then parses each chunk into heading (target_page + section + optional marker), body lines, and trailing meta lines (source threads, confidence, date range, reviewer note).
  - For each addition, compute a match key: (target_page, section, sorted finding_ids).
    - Key exists in staging.md (and section is not [LOCKED]): replace the section body and meta in place.
    - Key not in staging, but in the emission log (from extract-state.json): this was previously emitted, reviewed, and removed. Insert with a [REFINED?] marker.
    - Key not in staging and never emitted: insert at the end of the target_page’s contiguous group (or create the group).
  - UNMAPPED sections sort before mapped sections.
  - Atomic write: backup existing staging.md as staging.md.bak, write to .tmp, rename.
- Advance state: for each thread ID that contributed additions, call recordEmission() to store the current watermark values and the (target_page, section, finding_ids) tuples in the emission log. Then write extract-state.json atomically.
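The three-way decision in the graft step can be sketched as a key computation plus a classifier. The names matchKey and classifyAddition, the use of a Map for parsed staging sections and a Set for the emission log, and the returned action labels are all hypothetical; only the key tuple and the three outcomes (plus the locked case) come from the description.

```javascript
// Illustrative sketch only; data structures are assumptions.
// Sorting the IDs makes the key independent of the order agents report them in.
function matchKey(addition) {
  const ids = [...addition.finding_ids].sort();
  return JSON.stringify([addition.target_page, addition.section, ids]);
}

// stagingSections: Map of key -> { marker }, emissionLog: Set of keys.
function classifyAddition(addition, stagingSections, emissionLog) {
  const key = matchKey(addition);
  const existing = stagingSections.get(key);
  if (existing) return existing.marker === 'LOCKED' ? 'skip-locked' : 'replace';
  if (emissionLog.has(key)) return 'insert-refined'; // reviewed and removed earlier
  return 'insert-new';
}

const addition = {
  target_page: 'docs/Reference/VB/Form/index.md',
  section: 'after-remarks',
  finding_ids: ['9876543210', '1234567890'],
};
const key = matchKey(addition);
console.log(classifyAddition(addition, new Map([[key, { marker: null }]]), new Set()));
console.log(classifyAddition(addition, new Map(), new Set([key])));
```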
staging.md format
## docs/Reference/VB/Form/index.md · after-remarks
> [!NOTE]
> Draft prose here...
_Source threads: 9876543210 · confidence: high_
_Date range: 2025-06-04_
_Reviewer note: Verify against current .twin source before publishing._
---
Each ## section is one addition. The heading encodes target_page · section plus an optional [marker] bracket (DUPLICATE?, REFINED?, LOCKED). Trailing italic lines are structured metadata the merger parses on re-read. Findings that do not map to any existing page use UNMAPPED as the target_page and are collected at the top of the file under an Unmapped Findings header.
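A heading in this format could be parsed with a single regular expression, as in the sketch below. parseHeading and the pattern are hypothetical; the real parser in merger.mjs may tolerate more variation (extra whitespace, unknown markers) than this does.

```javascript
// Illustrative sketch only; the real merger's parsing rules are not shown in this document.
// Matches: "## <target_page> · <section>" with an optional trailing "[MARKER]".
const HEADING_RE = /^## (.+?) · (\S+)(?: \[(DUPLICATE\?|REFINED\?|LOCKED)\])?$/;

function parseHeading(line) {
  const m = HEADING_RE.exec(line);
  if (!m) return null; // not a section heading
  return { target_page: m[1], section: m[2], marker: m[3] ?? null };
}

console.log(parseHeading('## docs/Reference/VB/Form/index.md · after-remarks [LOCKED]'));
console.log(parseHeading('## UNMAPPED · new-section'));
```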
Incremental watermarks
Each phase has its own incrementality mechanism:
| Phase | Watermark | Storage | Granularity |
|---|---|---|---|
| Export | Highest message snowflake per channel/thread | raw/manifest.json | Per target |
| Process | Output file mtime vs. input file mtime | Filesystem | Per file |
| Extract | (last_message_id, message_count) per thread | findings/extract-state.json | Per thread |
The extract watermark uses two fields so it catches both new messages (snowflake advances) and message deletions (count decreases without snowflake change).
Data directory
Almost everything under wisdom/data/ is gitignored. The exception is data/findings/staging.md, which is tracked — it is the long-lived review file and has value beyond any single session.
data/
  raw/                          Phase 1 output
    guild.json
    members.json
    manifest.json
    channels/*.json
    threads/*.json
  threads/                      Phase 2 output
    <channel-name>/
      <thread_id>--<slug>.md
    <channel-name>.md
  findings/                     Phase 3 output
    package-summary.txt         shared: symbol list for Extract agents
    page-index.json             shared: symbol-to-doc-path lookup for Draft agents
    extract-state.json          per-thread watermark (advanced on merge)
    extract-prep.json           single-batch prep (<=200 threads)
    extract-manifest.json       multi-batch manifest (>200 threads)
    extract-batch-{i}.json      per-batch thread-path lists
    extract-results-{i}.json    per-batch workflow results
    staging.md                  final review file (tracked in git)