AI at the Speed of Silicon: Powering the Era of Intelligence Everywhere

KEY INSIGHTS:

  • The next era requires inference that is cheap and fast enough to embed intelligence into daily life: every surface, workflow, and device adapting in real time, and agents thinking longer without making users wait. This is the foundation for the next AI hyperscalers and entirely new categories of companies.
  • That future is delayed because inference remains too expensive and too slow. The GPU-centric stack is bottlenecked by memory bandwidth, pushing HBM and massive parallelism and driving a brute-force arms race of bigger chips, denser racks, and escalating power and complexity.
  • Taalas takes a fundamentally different approach by hardwiring models into dedicated inference chips, targeting orders-of-magnitude gains in cost, power, and latency, delivered in standard servers without exotic packaging.
  • When inference becomes abundant and affordable, the frontier shifts. Developers can weave AI into every feature and workflow, and models can spend more tokens thinking and checking their work while still responding instantly.

For the past several years, the technology industry has been rightfully obsessed with model intelligence. How do we build larger, more capable, more creative AI models? This pursuit has yielded incredible results, pushing the boundaries of what we thought was possible.

But the brilliance of a model is only potential. The true measure of progress isn’t just how intelligent our models are, but how widely and deeply they are applied.

Imagine a future where intelligence is embedded into the rhythm of daily life. The world should reshape itself around each of us. Every surface, application, and device should mold to our needs in real time, making life easier, safer, and more productive. Intelligence should be everywhere, and it should feel effortless.

Now picture models that can spend far more tokens thinking, exploring, and checking their work, and still respond instantly. Agents would reason their way to brilliance without making users wait, delivering answers and writing correct code in milliseconds. In that world, intelligence would not just be more available, it would be more reliable: deeper reasoning, fewer mistakes, and richer results, all at human speed.

Today’s models are capable enough to begin both transformations, yet only a tiny fraction of the digital world runs on AI. What holds developers back is not imagination. It is economics and speed. Cost per token keeps intelligence from being woven into every feature and workflow. Latency and throughput limits keep models from thinking longer without making users wait.

[Image: A city alive with streaming tokens guiding cars, robots, and people in real time]

The hardware bottleneck

Modern AI is running into a wall because of a fundamental mismatch between the demands of large models and the hardware and software that power them. LLMs are memory-bandwidth bound: they spend more time waiting for data to move in and out of memory than performing actual computation. This incompatibility between model architecture and compute-centric GPUs has forced the industry into heroic feats of engineering for diminishing returns. We’re chasing the next 2x improvement through brute force: designing massive, multi-die chips that test the limits of manufacturing, engineering racks that draw as much electricity as entire neighborhoods, and building data centers that resemble industrial plants, complete with their own power substations and advanced liquid cooling systems.
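The bandwidth bottleneck can be made concrete with a back-of-envelope roofline calculation. All numbers in the sketch below are illustrative assumptions (a hypothetical 70B-parameter model and a generic high-end HBM GPU), not figures from this essay:

```python
# Back-of-envelope: why single-stream LLM decoding is memory-bandwidth bound.
# Every number here is an illustrative assumption, not a measured spec.

PARAMS = 70e9           # weights in a hypothetical 70B-parameter model
BYTES_PER_PARAM = 2     # fp16/bf16 storage
HBM_BANDWIDTH = 3.3e12  # bytes/s, illustrative high-end HBM GPU
PEAK_FLOPS = 1.0e15     # FLOP/s at the same precision, illustrative

weight_bytes = PARAMS * BYTES_PER_PARAM   # bytes read for each decoded token
flops_per_token = 2 * PARAMS              # ~2 FLOPs per weight (multiply-add)

# Arithmetic intensity of batch-1 decoding vs. what the machine needs
# to stay busy (its "machine balance").
intensity = flops_per_token / weight_bytes        # 1.0 FLOP per byte
machine_balance = PEAK_FLOPS / HBM_BANDWIDTH      # ~303 FLOPs per byte

# With intensity far below machine balance, time per token is set by
# memory traffic, not by compute.
t_memory = weight_bytes / HBM_BANDWIDTH           # ~42 ms per token
t_compute = flops_per_token / PEAK_FLOPS          # ~0.14 ms per token

print(f"intensity={intensity:.1f} FLOP/B vs balance={machine_balance:.0f} FLOP/B")
print(f"memory-bound: {t_memory*1e3:.1f} ms/token, compute-bound: {t_compute*1e3:.2f} ms/token")
```

Under these assumptions the arithmetic units sit idle roughly 300x longer than they work, which is the mismatch the paragraph above describes: the GPU waits on memory, not math.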

This GPU arms race isn’t sustainable. We’re not just running up against the laws of physics, we’re allowing spiraling cost and complexity to block the true promise of intelligence everywhere. It’s time to go back to the drawing board with a simpler, smarter design that lets AI scale without limits.

The breakthrough: Taalas unlocks 1000x performance

Two years ago, entrepreneur and semiconductor veteran Ljubisa Bajic and the founding team at Taalas saw a clear path through these challenges. They knew that for intelligence to be ubiquitous, inference – the actual running of models in applications – couldn’t be just twice as good. It had to be orders of magnitude faster, cheaper, and more efficient.

So they made an incredibly bold bet. Instead of building a better general-purpose computer to run models, they asked: What if we could turn the models themselves into specialized computers?

Their approach takes specialization to its logical conclusion. Rather than building a general accelerator, Taalas turns a model into a dedicated inference chip. By hardwiring the model into silicon, they unlock performance jumps measured in orders of magnitude – not percentages – including output of 17,000 tokens per second per user, at 1/20th the cost and power of today’s GPUs. 

That capability is delivered on simple PCIe cards that drop into standard servers. There’s no HBM, no interposer, and no need for low-level kernel tuning or proprietary software stacks just to make the hardware run. An 8-card 4U system draws about 2.5 kW and runs on air cooling, removing power as the binding constraint: Taalas inference servers can be deployed in any data center built in the last 20 years. Even with this level of specialization, the system still supports fine-tuning, allowing customers to adapt models to specific domains or refine behavior through post-training and reinforcement learning.
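A rough deployment calculation shows why a 2.5 kW air-cooled box fits older facilities. The 2.5 kW and 8-card figures are quoted above; the 10 kW per-rack budget is an illustrative assumption about legacy air-cooled data centers, not a figure from the essay:

```python
# Rough power/deployment arithmetic for the 8-card 4U system described above.
# The 10 kW legacy-rack budget is an assumed figure for illustration only.

system_power_kw = 2.5
cards_per_system = 8
per_card_w = system_power_kw * 1000 / cards_per_system     # 312.5 W per card

legacy_rack_kw = 10   # assumed budget of an older air-cooled rack
systems_per_rack = int(legacy_rack_kw // system_power_kw)  # 4 systems (16U)

print(f"{per_card_w:.0f} W per card, {systems_per_rack} systems per legacy rack")
```

At roughly 300 W per card and four systems per modest rack, the hardware lands within ordinary air-cooled power envelopes rather than demanding purpose-built liquid-cooled halls.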

But Taalas didn’t stop there. The team also built an automated pipeline that can take any new model and turn it into data-center-ready inference chips in just eight weeks, for a fraction of the traditional costs. In an industry where new systems take years to tape out, Taalas flips the hardware bottleneck on its head, making chip creation feel like just another step in the software development lifecycle. 

They do this by “programming” the chip in its final wiring layers. The core chip stays the same, and a new model is created by changing only a small set of top layers, so it is a mask update rather than a full chip redesign. They even build wafers up to that point and hold them as inventory, then finish them once a model is ready. The economics are game-changing. Cloud operators can recoup their investment in weeks instead of years, transforming each new generation of state-of-the-art models from a massive capital expense into a routine upgrade as simple as swapping out one PCIe card for another.

We believe that a breakthrough of this magnitude could only have been achieved by this specific team. Ljubisa, COO Lejla Bajic, and CTO Drago Ignjatovic are quiet legends in the semiconductor industry, deeply respected for decades of foundational work at ATI, AMD, and Nvidia. Before starting Taalas, Ljubisa co-founded Tenstorrent, a multi-billion-dollar AI accelerator company. He is a true 0.01% founder, combining world-class technical depth across semiconductors and machine learning with hard-won real-world experience shipping complete systems, plus an unusually penetrating strategic view of where the market and technology are headed and the conviction to make ambitious bets. The entire Taalas team is a small, elite force of fewer than 30 seasoned engineers, many of whom have worked together across multiple companies for decades.

The future is already here

Talk is one thing; execution is another. In just 25 months from company founding, Taalas has turned a bold idea into a breakthrough chip that’s been proven in the real world. Its first system, the HC1, is already live in the company’s Toronto data center – making it one of the fastest and most capital-efficient bring-ups of a novel chip in modern semiconductor history. The future it promises is no longer theoretical. It’s here.

Today, Taalas opens HC1 to the public in two ways. First, a live demo chatbot lets anyone experience a step-change in AI inference firsthand. Second, a new sign-up-based API endpoint service goes live for developers who want to start building on the same infrastructure.

As an aside, this shift may mark a turning point for open-source AI. Historically, even the best open-source models faced a built-in economic disadvantage versus proprietary systems running on hyperscalers’ deeply optimized, vertically integrated infrastructure, making performance per dollar hard to match. Now that balance is changing: the cost and performance gap is closing, and for the first time open-source models, powered by Taalas, can challenge and even surpass the closed-source monopoly. We’re eager to see what the world’s most creative developers build with access to abundant, affordable compute.

It’s time to build (faster)

So what exactly is the future Taalas is building for? The opportunity is a world where we can throw 1000x more tokens at our hardest problems, where intelligence isn’t constrained by compute but unleashed by it. Here’s just a glimpse of what becomes possible:

  • Massive test-time compute scaling: Over the past year, most of the biggest gains in model intelligence have come from scaling test-time compute (i.e., giving AI models extra processing time and resources post-training to “think” more deeply). But as a result, the user experience has suffered, with inference often feeling slow and unresponsive. Taalas changes that equation. Its architecture enables orders-of-magnitude faster iteration, allowing teams to throw 10x more compute at a problem while still receiving a result 5x faster.
  • Agentic development: Coding agents will do far more, instantly. They will spawn many subagents in parallel, explore multiple implementation paths, run tests and checks continuously, fix issues as they appear, and loop through deep reasoning and verification, all while iterating with the user in real time.
  • Real-time applications: Voice interactions with AI-enabled devices or software will feel instantaneous and human. Personalized, dynamic experiences – from conversational language translation to adaptive interfaces that anticipate your intent – will become the norm.
  • Internet of Things and robotics: Every device around us will be intelligent, constantly learning, adapting, and anticipating our needs. Robots will respond with human-level speed and precision. Everyday objects like your toaster or thermostat could continuously stream tokens in real time, quietly spending compute to make your life easier, simpler, and safer.
  • Synthetic data generation: AI systems will generate vast, high-quality datasets that can be used to train the next generation of models. This data will help improve performance, safety, and generalization in domains like drug discovery and physics modeling. 
  • Utility inference: Thousands of tiny, invisible LLM calls will hum in the background of every piece of software we use – summarizing information, drafting responses, and predicting next actions – quietly removing the friction from our digital lives.

  • Multimodality: The heavy computational load of processing and generating rich media – video, audio, and beyond – will become manageable. People will generate videos of any length instantly, and edit or update them in real time. Content could even adapt dynamically to each viewer, responding to their reactions and emotions.
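The 10x-more-compute-yet-5x-faster framing in the test-time-compute bullet above implies a specific throughput multiplier, which a quick sketch makes explicit. The 1,000-token reasoning budget and 100 tok/s baseline are illustrative assumptions, not figures from the essay:

```python
# Sketch of the test-time-compute trade-off: think 10x longer, answer 5x
# sooner. Baseline numbers are illustrative assumptions.

baseline_tokens = 1_000  # reasoning tokens a model might spend today
baseline_tps = 100       # assumed interactive GPU serving speed (tok/s)
baseline_time = baseline_tokens / baseline_tps   # 10 s to answer

scaled_tokens = 10 * baseline_tokens   # spend 10x more tokens thinking
target_time = baseline_time / 5        # yet respond 5x sooner: 2 s

required_tps = scaled_tokens / target_time       # 5,000 tok/s
print(f"required throughput: {required_tps:.0f} tok/s "
      f"({required_tps / baseline_tps:.0f}x the baseline)")
```

Holding both promises at once requires a 50x jump in per-user token throughput, which is why incremental 2x hardware gains cannot deliver this experience.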

Quiet Capital’s role: Backing the bold from Day 0

At Quiet, we’re deeply grateful for the opportunity to partner with the Taalas team. It’s been a privilege to watch them execute an ambitious, world-changing vision at extraordinary speed. With every milestone, our respect for the team and our conviction in what they’re building has only grown.

We were the company’s largest seed investor, led the Series A, increased our position in the Series B led by Fidelity, and most recently quadrupled down upon the successful tape-out and validation of HC1.

We believe Taalas will be so transformational that we’ve sought out, funded, and in some cases incubated businesses that only become possible with this new kind of compute. For the past two years, Quiet has operated with this future in mind, and we intend to continue to be at the forefront as investors and as builders alongside the teams shaping what comes next.

Looking at the AI landscape in 2026 feels like gazing back at the internet in 1993. The foundational platforms are only now being forged, and the hyperscalers of the AI era have yet to emerge. When they do, they will have custom silicon at their core.

Ljubisa and the Taalas team are just getting started. They already have the next two generations of chips in development – each an equal feat of engineering promising yet another order-of-magnitude improvement across all metrics. We’re honored to be on this journey with them and can’t wait to see the future they’ll help all of us build.
