Bespoke Software in the Age of AI: Orchestrating Intelligence with Phonosyne
Software used to be about reuse. Build a module once, apply it everywhere. But the next wave isn’t about general-purpose tools; it’s about deeply personal systems. Not just customizable interfaces, but workflows encoded with your taste, your rhythm, your constraints. Tools that don’t just wait for commands but anticipate and adapt: intelligent collaborators that move with you.
Phonosyne is my personal experiment in this direction. It’s not a plugin or product, it’s a working system for sound design that interprets my text prompts and returns structured, validated .wav
files. But more than that, it’s a philosophical statement: a demonstration that software can be handcrafted again, not line-by-line, but by intent. Orchestrated through multi-agent AI. Tuned not to a market, but to a maker.
“Agentic AI is proactive, pursuing objectives end-to-end… where generative AI waits for prompts, agentic systems book the studio, mix the track, and release the album.”
Aalpha, Agentic AI vs. Generative AI
What Is Phonosyne?
Phonosyne isn’t just code, it’s a collaboration. A group of AI agents working together to turn musical ideas into actual sound. A prompt like “distorted pirate radio broadcast intercepted mid-transmission” becomes not just interpretable but actionable: mapped, expanded, synthesized, and rendered into .wav files with no manual coding.
The core system is composed of three dedicated agents. The Designer expands the prompt into a structured plan. The Analyzer deepens that plan into synthesis recipes. The Compiler translates those recipes into audio using SuperCollider. Each agent is specialized, modular, and coordinated through a central Orchestrator that handles sequencing, parallelization, and error recovery.
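To make the shape of that pipeline concrete, here is a minimal sketch of how the handoff might be wired together in Python. The names, data structures, and stubbed agent bodies are illustrative assumptions, not Phonosyne’s actual code: in the real system each function wraps an LLM call or a SuperCollider render rather than returning placeholders.

```python
# Minimal sketch of a Designer -> Analyzer -> Compiler pipeline coordinated by
# an orchestrator. All agent internals are stubbed; names are illustrative.
from dataclasses import dataclass
from concurrent.futures import ThreadPoolExecutor


@dataclass
class SoundSpec:
    name: str
    description: str   # expanded prose plan from the Designer
    recipe: str = ""    # synthesis recipe from the Analyzer
    wav_path: str = ""  # rendered file from the Compiler


def designer(prompt: str) -> list[SoundSpec]:
    """Expand a single text prompt into a structured plan of sounds."""
    return [SoundSpec(name=f"sound_{i}", description=f"{prompt} (variation {i})")
            for i in range(3)]


def analyzer(spec: SoundSpec) -> SoundSpec:
    """Deepen one planned sound into a concrete synthesis recipe."""
    spec.recipe = f"// sclang recipe for: {spec.description}"
    return spec


def compiler(spec: SoundSpec, max_retries: int = 2) -> SoundSpec:
    """Render the recipe to a .wav, retrying if the render step fails."""
    for attempt in range(max_retries + 1):
        try:
            # placeholder: the real step shells out to SuperCollider and can raise
            spec.wav_path = f"out/{spec.name}.wav"
            return spec
        except RuntimeError:
            if attempt == max_retries:
                raise
    return spec


def orchestrate(prompt: str) -> list[SoundSpec]:
    """Sequence the agents and fan the per-sound work out in parallel."""
    plan = designer(prompt)
    with ThreadPoolExecutor() as pool:
        analyzed = list(pool.map(analyzer, plan))
        rendered = list(pool.map(compiler, analyzed))
    return rendered


if __name__ == "__main__":
    for spec in orchestrate("distorted pirate radio broadcast intercepted mid-transmission"):
        print(spec.name, "->", spec.wav_path)
```

The point of the sketch is the shape, not the stubs: the Designer fans one prompt out into many sounds, each downstream agent sees only its own slice, and the Orchestrator is the only piece that knows the whole sequence.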
The agents don’t just relay information, they adapt. They coordinate, troubleshoot, and validate their own work. Phonosyne isn’t a plugin you tweak. It’s not a tool you operate. It’s a purpose-built team of collaborators, designed not to be flexible for everyone, but to be fluent in the specific, evolving vocabulary of my sound design practice.
But if the system does the work, who’s actually creating the sound?
Building With Behavior, Not Code
The Role of the Orchestrator
The challenge in building Phonosyne wasn’t generating the sounds, it was designing how the agents behave when things get vague, fail, or go off-script. That’s not traditional coding. That’s orchestration.
In a multi-agent system, you’re not specifying what to do step by step. You define the intent. You assign roles. You set the conditions for collaboration: agents that negotiate, retry, escalate, and adapt. The code isn’t static, it’s dialectical, shaped by an ongoing conversation between intent and execution.
“Developers are increasingly defining what they want the system to achieve, leaving the how to the emergent, collaborative intelligence of the orchestrated agent teams.”
Plaat et al., Agentic LLMs (arXiv, 2025)
System Design Through Prompts
This is the Orchestrator role: shaping interactions between autonomous workers, not lines of logic. You’re not implementing features, you’re building relationships under pressure. This is where prompt engineering becomes a form of systems design: not just writing the right string of text, but shaping how your agents perceive, plan, and respond.
Prompts become architecture. A role prompt sets the scope of agency. A system prompt defines the collective mission. Retry logic isn’t just a failsafe, it’s a contract negotiation. Debugging becomes less about syntax and more about resolving misunderstandings between cooperating minds.
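As a hedged illustration of what “prompts become architecture” can look like in practice, the snippet below writes role prompts, a shared mission, and a retry contract as plain configuration. The wording and fields are hypothetical, not Phonosyne’s actual prompts; the point is that system behavior lives in editable text and data rather than in control flow.

```python
# Hypothetical illustration of prompts-as-architecture: role prompts, a shared
# mission, and retry behavior expressed as configuration rather than code paths.
SYSTEM_PROMPT = (
    "You are part of a three-agent sound design ensemble. "
    "Your collective mission: turn one text prompt into validated .wav files."
)

ROLE_PROMPTS = {
    "designer": "Expand the user's prompt into a numbered plan of distinct sounds. "
                "Do not write synthesis code; describe intent, texture, and duration.",
    "analyzer": "Take one planned sound and produce a concrete synthesis recipe. "
                "Stay within the timbral palette described in the plan.",
    "compiler": "Translate a recipe into SuperCollider code and render it. "
                "If the render fails or the output is silent, report why and retry.",
}

# Retry logic as a contract: what counts as failure, how many attempts are
# allowed, and who gets the problem when an agent cannot recover on its own.
RETRY_CONTRACT = {
    "max_attempts": 3,
    "failure_conditions": ["render error", "silent waveform", "clipped output"],
    "escalate_to": "orchestrator",
}
```

Debugging, in this framing, means revising these strings and structures until the agents’ behavior converges on what you meant.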
Co-Creation Through Debugging
Phonosyne didn’t emerge from a static design doc. It came from iteration, long loops of failure and refinement. I’d sketch a behavior: “Retry rendering if the waveform is silent.” The Compiler would misread it. I’d revise the prompt. Adjust the retry logic. Watch again. Eventually, the system learned my edge cases, and I learned how to speak clearly to a team that wasn’t human, but was learning.
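For a sense of what one of those behaviors looks like once it stabilizes, here is a minimal sketch of the silent-render check, assuming the Compiler writes its output to disk as a .wav file and that numpy and soundfile are available. The threshold, retry count, and render_recipe() stub are assumptions for illustration, not the system’s actual implementation.

```python
# A sketch of "retry rendering if the waveform is silent": check the rendered
# file's peak level and re-render with a note appended to the recipe if it is
# effectively silent. Libraries, threshold, and the render stub are assumptions.
import numpy as np
import soundfile as sf


def render_recipe(recipe: str, out_path: str) -> None:
    """Hypothetical stand-in for the Compiler's SuperCollider render step."""
    sf.write(out_path, np.zeros(44100), 44100)  # placeholder: writes one second of silence


def is_effectively_silent(path: str, threshold_db: float = -60.0) -> bool:
    """Return True if the file's peak level is below the given dBFS threshold."""
    audio, _sr = sf.read(path, always_2d=True)
    peak = float(np.max(np.abs(audio)))
    if peak == 0.0:
        return True
    return 20.0 * np.log10(peak) < threshold_db


def render_with_silence_retry(recipe: str, out_path: str, attempts: int = 3) -> str:
    """Ask the Compiler to re-render when the result comes back silent."""
    for attempt in range(1, attempts + 1):
        render_recipe(recipe, out_path)
        if not is_effectively_silent(out_path):
            return out_path
        # feed the failure back into the recipe before trying again
        recipe += f"\n// attempt {attempt}: output was silent, raise gain or fix routing"
    raise RuntimeError(f"Render stayed silent after {attempts} attempts: {out_path}")
```

The interesting part isn’t the decibel math; it’s that each failed attempt feeds a note back into the recipe, which is exactly the kind of behavior that had to be negotiated through prompt revision rather than written once and trusted.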
What kind of authorship emerges when you debug through conversation instead of code?
Creative Systems, Not Creative Outputs
Why One Model Isn’t Enough
Even solo creative work is a form of internal collaboration, between drafts, tools, layers, and decisions. Composing music, designing sounds, or shaping a story isn’t a straight line. It loops back, forks off, and accumulates nuance through revision. These workflows are inherently complex, not because they’re inefficient, but because they’re alive.
That complexity is where traditional generative AI starts to fall apart. Most models are built to respond to prompts one at a time: one model, one input, one output. But real creative work isn’t a transaction. It’s a system.
Multi-Agent Collaboration in Practice
Multi-agent architectures offer something closer to that system logic. Instead of collapsing everything into one model, they distribute roles: planners who sketch intent, analyzers who add detail, renderers who produce results, critics who refine and respond.
Projects like LVAS-Agent use this structure to break down long-form video dubbing into storyboard segmentation, script synthesis, sound layering, and final mix. Audio-Agent pairs LLMs with diffusion engines to turn descriptions into editable audio, atomized by task. SonicRAG, the closest to Phonosyne, lets agents retrieve and blend samples from a library, mixing language and signal as modular inputs. These systems don’t just generate, they collaborate.
Specialization over Generalization
Phonosyne shares their DNA but leans in a different direction. It isn’t trying to cover every use case. It doesn’t generalize. It specializes. It’s built not for scalability but for intimacy, for the kind of creative work where the system learns your aesthetic logic, adapts to your pacing, and renders sound in a way that feels like your hands.
Are we still designing software, or assembling teams?
Designing for One: Phonosyne as Bespoke Tool
What Makes It Personal
Most software is built to scale. Phonosyne was built to fit.
It wasn’t made for “audio professionals.” It doesn’t support every DAW, format, or genre. It was designed around one person: me. The workflows mirror how I think. The defaults reflect what I prefer. The language model is tuned to how I describe sound. Not adaptable, but accurate. Not general-purpose, but deeply personal.
Language-Native, Not GUI-First
There’s no GUI. No knobs, no sliders, no parameter trees. You don’t sculpt the sound by hand, you describe it. Sonically, metaphorically, spatially. And the agents interpret that into action. This isn’t GUI-first design. It’s language-native orchestration.
Judgment Over Flexibility
What makes it bespoke isn’t the interface so much as the alignment. Each agent is trained or prompted with my aesthetic values. The Designer knows how I outline sonic ideas. The Analyzer knows which timbres I chase and which I avoid. The Compiler knows when to let a shimmer through, and when to try again. They don’t just follow instructions, they share my judgment.
If a system reflects one person’s taste perfectly, is that still “software”, or something else entirely?
Authorship in the Age of Agents
Who made the sound?
That’s the question people keep asking. But it misses the point. The sounds Phonosyne creates aren’t composed in any traditional sense. They’re not played or programmed. They’re orchestrated through intent, system behavior, and a back-and-forth between me and a machine ensemble trained to speak my sonic language.
“The intentionality gap between human creators and AI-generated content forces a critical reevaluation of authorship itself.”
— Harvard Law Review, Artificial Intelligence and the Creative Double Bind
Exactly. That gap is where authorship lives now.
This isn’t like AI image generation, where debates revolve around consent, appropriation, or stolen style. Sound, especially in experimental and electronic music, has always been a collage: samples, algorithms, found noise. Reuse is the baseline. What’s unusual isn’t that I use machine-generated samples. It’s that I use them intentionally, within a system I built to reflect my aesthetic.
Phonosyne’s outputs aren’t precious or sacred. They’re raw material, structured enough to feel like memory and flexible enough to break apart. What matters isn’t who technically generated them. It’s what I do with them.
And what I do is play.
Phonosyne doesn’t generate “songs.” It’s not trying to impress anyone with end-to-end genre emulation. It feeds my live rig: loopers, samplers, granular tools. That’s where meaning takes shape. In the way a warped radio fragment catches on tape heads. In the moment a failed synth glitch becomes the emotional center of a set. That’s not prompt engineering. That’s instrumental authorship.
Most generative music tools aren’t made for that. They’re built for clean outputs—one prompt, one product. But Phonosyne comes from a different lineage: algorithmic composition, procedural sound, interactive systems. It’s a spiritual cousin to Xenakis, Oval, Autechre, algorave. It’s about process, not product. Performance, not artifact.
So no, I didn’t write every waveform.
But I built the ensemble. I trained its behavior. Tuned it to my taste. Fed it my metaphors. Pushed it to fail in interesting ways. And from that, I shaped something playable, personal—mine.
That’s not just authorship. That’s agency.
That’s orientation.
That’s the whole fucking point.
Beyond Tools: A Personal Philosophy
Orchestration as Identity
Phonosyne isn’t just a sound design system, it’s a way of thinking about what software can become.
It says software doesn’t have to be universal to be powerful. It can be particular. Personal. It can fit the shape of one person’s thinking, not just the contours of a market. When language becomes the interface, taste becomes the architecture. Your sensibilities aren’t settings, they’re the system.
I didn’t build this to solve sound design. I built it to see what happens when the way I think creatively becomes executable. Not metaphorically, but literally. Phonosyne doesn’t care about the average user. It isn’t built to scale or generalize. It’s built to resonate, with one mind, one voice, one set of strange, recursive instincts.
There are no knobs or sliders here. No menus. No DAW integration roadmap. Just a group of agents who speak in my dialect of desire. One expands my prompt into a structure. Another deepens it into synthesis instructions. Another renders it into sound. Not just AI as collaborator, AI as ensemble. A system that knows how I mishear, how I revise, how I get it wrong on purpose.
So listen. Listen to what they reveal about the system that made them. Imagine a future where your tools don’t just take instructions. They learn your voice.
Not just UI. Not just UX. You.