Evolving Stock Copilot Into a Multi-Agent Trading Co-Pilot

I built Stock Copilot as a solo PM who codes. It runs on Firebase, processes 500+ stocks every evening through a scheduled pipeline, generates AI-powered audio briefings, and pushes alerts to users before they wake up. It works. The architecture is simple — five Cloud Functions firing in sequence, Firestore as the glue between stages, Pub/Sub for parallel fan-out when the batch gets large.

But I’ve hit a ceiling that more Cloud Functions won’t fix.

The problem the current architecture can’t solve

Today, a user opens the app and sees yesterday’s analysis. If they want to dig deeper — “Why is NVDA down? Is this a sector rotation or company-specific? What does the options flow say?” — they get a chatbot that answers from whatever is already cached. It can’t go research something new. It can’t pull a 10-K filing, cross-reference it with recent earnings commentary, and come back with a view.

The current system is a push system. It computes analysis on a schedule and delivers it. The user consumes what the system decided to produce. That was a fine v1. But the users I’m building for — active retail traders — don’t just want a morning briefing. They want a research partner that can go investigate a thesis on demand.

That’s a fundamentally different architecture. You can’t schedule curiosity.

What a multi-agent version would look like

The evolution is from a time-driven pipeline to an orchestrator-worker system where specialized agents pursue different research angles in parallel, then a synthesis agent assembles the findings.

The user says: “Analyze NVDA, 6-month horizon, focus on whether AI demand is sustainable.”

What happens:

A lead agent receives the request, confirms scope, and decomposes it into four research angles. Four specialist agents run in parallel — each with its own tools, its own prompt tuning, and its own output format:

A fundamentals agent pulls revenue growth, margin trajectory, and capex trends from financial data APIs. A market structure agent examines recent price action, options skew, and institutional positioning. A news and sentiment agent ingests analyst notes, earnings call transcripts, and social signals. A macro agent surveys the semiconductor cycle, regulatory landscape, and comparable AI infrastructure plays.

Each specialist returns structured findings. A synthesis agent integrates them into a coherent analysis: bull case, bear case, key risks, and scenarios. Critically — and this is a product decision with regulatory teeth — the system never produces a buy or sell recommendation. It produces structured analysis. The trader decides.
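"Structured findings" can be made concrete as a shared shape that every specialist emits and the synthesis agent consumes. A minimal TypeScript sketch — the field names and the trivial stand-in for synthesis are my assumptions for illustration, not the actual codebase:

```typescript
// Illustrative shapes for specialist output -- field names are assumptions.
type Finding = {
  agent: "fundamentals" | "market_structure" | "sentiment" | "macro";
  claims: { text: string; source: string; confidence: number }[];
};

type Analysis = {
  bullCase: string[];
  bearCase: string[];
  keyRisks: string[];
  // Deliberately no "recommendation" field: the schema itself
  // cannot express a buy/sell call.
};

// Stand-in for the synthesis step: sort high-confidence claims into
// cases using a caller-supplied stance classifier (the real version
// would be a model call, not a function argument).
function synthesize(
  findings: Finding[],
  stanceOf: (text: string) => "bull" | "bear" | "risk"
): Analysis {
  const out: Analysis = { bullCase: [], bearCase: [], keyRisks: [] };
  for (const f of findings) {
    for (const c of f.claims) {
      if (c.confidence < 0.5) continue; // drop weak signals
      const stance = stanceOf(c.text);
      if (stance === "bull") out.bullCase.push(c.text);
      else if (stance === "bear") out.bearCase.push(c.text);
      else out.keyRisks.push(c.text);
    }
  }
  return out;
}
```

Making "no recommendation" a property of the schema, not just the prompt, is the point: there is nowhere for a buy/sell call to live.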

A compliance agent runs last. It verifies every numerical claim against its source, scans the language for anything that sounds like advice, attaches disclaimers, and writes the full audit log.

The honest gap between today and that future

I want to be clear about where Stock Copilot actually is versus where this architecture points.

Today’s system has the raw ingredients: it already calls OpenAI for analysis, already pulls from TwelveData and Yahoo Finance, already runs batch processing through Pub/Sub workers. The AI chat feature already answers questions about stocks. What it doesn’t have is the agent loop — the ability for the AI to decide what to research next, call tools based on what it finds, and iterate until the analysis is complete.

The leap isn’t adding more API calls. It’s shifting control flow from “the developer wrote the sequence” to “the model decides the sequence.” That’s the difference between a workflow and an agent, and it’s the most consequential architectural decision in the evolution.

Three product decisions and the reasoning

1. Specialist agents, not one generalist — because domain-tuned tools produce meaningfully better output.

I tested this informally with the current chatbot. When I prompt GPT with a narrow role — “you are a fundamentals analyst, here are the financial statements, assess margin trajectory” — the output is materially better than asking a generalist “analyze this stock.” The model’s reasoning is sharper when the tool set is constrained and the prompt is domain-specific.

This maps directly to architecture. Each specialist agent gets its own tools (the fundamentals agent can pull 10-K data, the market structure agent can read options chains), its own few-shot examples, and its own output schema. The cost is coordination complexity — four agents running in parallel need an orchestrator that handles partial failures, timeouts, and aggregation. The payoff is that each angle of the research is actually good, not a shallow paragraph from a generalist that touched everything and nailed nothing.
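That coordination cost is concrete: four parallel agents mean the orchestrator has to survive one of them failing or timing out without losing the other three. A sketch of that fan-out and aggregation using `Promise.allSettled` — the agent runners themselves are stubbed, and the names are illustrative:

```typescript
type SpecialistResult =
  | { agent: string; ok: true; findings: string }
  | { agent: string; ok: false; error: string };

// Race each specialist against a deadline so one slow agent
// cannot stall the whole analysis.
function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  return Promise.race([
    p,
    new Promise<T>((_, reject) =>
      setTimeout(() => reject(new Error("timeout")), ms)
    ),
  ]);
}

async function fanOut(
  agents: Record<string, () => Promise<string>>, // stubbed agent runners
  timeoutMs: number
): Promise<SpecialistResult[]> {
  const names = Object.keys(agents);
  const settled = await Promise.allSettled(
    names.map((n) => withTimeout(agents[n](), timeoutMs))
  );
  // Partial failure is a normal outcome: synthesis proceeds with
  // whatever landed, and the failures are recorded, not swallowed.
  return settled.map((s, i) =>
    s.status === "fulfilled"
      ? { agent: names[i], ok: true, findings: s.value }
      : { agent: names[i], ok: false, error: String(s.reason) }
  );
}
```

`allSettled` rather than `all` is the design choice: a dead macro API should degrade the analysis, not abort it.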

2. Never produce buy/sell calls — enforced in the architecture, not just the prompt.

This isn’t a philosophical choice. If the system produces language that reads like investment advice, it’s potentially a regulatory violation. “Buy NVDA” from an automated system crosses a line that “here are the bull and bear cases for NVDA” does not.

The temptation is to handle this with prompt engineering — “never recommend buying or selling.” That works until it doesn’t. Prompts are probabilistic. Compliance needs to be deterministic. So the architecture has a dedicated compliance agent that runs after synthesis, scanning the output against a checklist: no imperative financial language, every number cited to a source, disclaimers attached. If the scan fails, the output goes back to synthesis for revision — it doesn’t reach the user.
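A deterministic version of that checklist can be plain pattern matching over the final text — nothing probabilistic sits between the draft and the user. A sketch; the patterns and disclaimer wording are illustrative, not an exhaustive regulatory list:

```typescript
type ComplianceVerdict = { ok: boolean; violations: string[]; text: string };

// Imperative financial language that must never reach the user.
// Illustrative patterns only -- a real list would be far longer.
const ADVICE_PATTERNS: [string, RegExp][] = [
  ["imperative buy/sell", /\b(buy|sell|short)\s+[A-Z]{1,5}\b/i],
  ["directive phrasing", /\byou should (buy|sell|hold)\b/i],
  ["promised returns", /\bguaranteed (return|profit)\b/i],
];

const DISCLAIMER = "This is analysis, not investment advice.";

function complianceScan(text: string): ComplianceVerdict {
  const violations = ADVICE_PATTERNS
    .filter(([, re]) => re.test(text))
    .map(([name]) => name);
  // Fail closed: any hit sends the draft back to synthesis for
  // revision; it never reaches the user.
  if (violations.length > 0) return { ok: false, violations, text };
  return { ok: true, violations: [], text: `${text}\n\n${DISCLAIMER}` };
}
```

The same deterministic scan also emits the audit record for each check, so compliance and the audit log come from one code path.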

This is the kind of decision that feels like overhead when you’re building and feels like the product when a user trusts you with real money.

3. The audit log is the product, not paperwork.

For a casual user, the analysis summary is the product. For a serious trader — and especially for anyone who manages other people’s money — the reasoning chain is the product. They want to see which sources the fundamentals agent pulled, what the sentiment agent found, how the synthesis agent weighed conflicting signals.

Architecturally, this means every agent writes structured output to a shared trace — not just the final answer, but intermediate findings, tool calls, and confidence signals. The trace is stored, queryable, and surfaceable in the UI. It’s not a debugging tool for me. It’s a trust mechanism for the user.
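A trace like that is essentially an append-only log keyed by run: each agent records its tool calls and findings as it works, and the UI reads the same structure back. A minimal in-memory sketch — Firestore would be the real store, and these names are my assumptions:

```typescript
type TraceEvent = {
  runId: string;
  agent: string;
  kind: "tool_call" | "finding" | "synthesis";
  detail: string;
  confidence?: number;
  at: number; // epoch millis
};

class Trace {
  private events: TraceEvent[] = [];

  append(e: Omit<TraceEvent, "at">): void {
    // Append-only: nothing is ever rewritten, so the same log
    // doubles as the audit record.
    this.events.push({ ...e, at: Date.now() });
  }

  // The same query the UI runs: "what did the fundamentals agent do?"
  byAgent(runId: string, agent: string): TraceEvent[] {
    return this.events.filter((e) => e.runId === runId && e.agent === agent);
  }
}
```

Because every intermediate step is an event, "where did this go wrong?" becomes a filter over the trace rather than a debugging session.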

This also solves a practical problem: when the analysis is wrong (and it will be), the trace tells the user exactly where it went wrong. Was it stale data? A bad source? A synthesis that overweighted one angle? The user can diagnose the failure themselves instead of losing trust in the entire system.

What I’d build first

If I were starting this evolution tomorrow, I wouldn’t build four specialist agents in parallel with an orchestrator. That’s the destination, not the starting point.

I’d start with one specialist — the fundamentals agent — wired into the existing chat interface. User asks a fundamentals question, the chat routes it to a focused agent with financial data tools, the agent researches and responds. No orchestration, no synthesis, no compliance agent. Just one specialist doing one job well.
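That first phase is small enough to sketch: a router in front of the existing chat that sends fundamentals questions to the one specialist and lets everything else fall through to the current generalist path. Keyword routing here is a deliberate simplification — in practice the classification could itself be a cheap model call:

```typescript
// Route chat messages: fundamentals questions go to the one specialist,
// everything else falls through to the existing generalist chatbot.
// The keyword list is illustrative.
const FUNDAMENTALS_HINTS = [
  /\b(revenue|margin|earnings|cash flow|capex|10-k|balance sheet)\b/i,
];

function route(message: string): "fundamentals" | "generalist" {
  return FUNDAMENTALS_HINTS.some((re) => re.test(message))
    ? "fundamentals"
    : "generalist";
}
```

The fall-through default matters: a misrouted question still gets the current behavior, so the experiment can only add quality, not remove it.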

If the output quality is meaningfully better than the current generalist chatbot — and I believe it will be — that validates the core thesis: domain-tuned agents with specific tools beat generalist prompts. Then I’d add the second specialist. Then the orchestrator. Then synthesis. Then compliance.

Each phase earns the right to the next. That’s how you evolve a product without rewriting it.

See the full Stock Co-Pilot case study →
