2026/06/25

Does Seedance 2.0 Accept Voice Reference? Audio Control Guide for 2026

Does Seedance 2.0 accept voice reference? Yes — as audio input, but with important limits. Here is how audio-driven generation works, what Omni Reference means for voice sync, and 3 practical workflows for adding external voiceover when native voice control falls short.

You spent an hour recording the perfect voiceover — clean audio, correct pacing, every word timed just right. You upload it to Seedance 2, hit generate, and the output either ignores your voice entirely or produces something that sounds completely different.

This is not a bug. It is how Seedance 2 actually works — and the gap between what most creators expect and what the model delivers is where the confusion starts.

Based on hands-on testing across multiple Seedance 2 platforms (including Seedance2Pro and the official ByteDance demo) combined with analysis of user reports from Reddit, Discord, and creator communities, this guide gives you a clear answer and three workflows that actually work today.

The Short Answer

Yes, Seedance 2.0 accepts audio as a reference input — but not as a "voice reference" in the way most creators expect.

Here is the distinction that matters:

| What You Mean | What Seedance 2 Does | Does It Work? |
|---|---|---|
| Upload a voice recording to make characters speak my exact words | Uses audio as a timing and rhythm map, not a voice template | Partially: lip-sync dialogue works, but voice characteristics are not replicated |
| Upload music to sync video pacing to the beat | Matches visual cuts, motion energy, and scene transitions to the audio rhythm | Yes, this works well in Audio-Driven mode |
| Upload a narration track and get matching voiceover in the video | Generates new audio (including synthetic voice) aligned to the visual pacing, but does not clone or preserve your original voice | No: audio is regenerated, not preserved |
| Upload dialogue audio for a specific character to speak | Supports lip-sync to dialogue in Omni Reference mode, including multi-character scenes | Yes, but quality varies and voice timbre is not locked |

The practical takeaway: Seedance 2 uses uploaded audio primarily as a structural reference — for rhythm, timing, mood, and lip-sync alignment — not as a voice cloning or voice replacement tool. If your goal is "character X says exactly this sentence in their voice," you will need a post-production step.

Choose Your Workflow Based on What You Need

| If You Need... | Start With This Mode | Plus a Post-Production Step? |
|---|---|---|
| Beat-synced video with music | Audio-Driven Mode | No: native output works well |
| Single character speaking on screen with lip-sync | Omni Reference + workflow below | Optional: quality varies by generation |
| Specific voiceover narration in the final video | Audio-Driven (timing reference) | Yes: replace audio in an editor |
| Multi-character dialogue | Omni Reference | Yes: post-production recommended |
| Consistent character voice across multiple clips | External audio generation first | Yes: full external pipeline needed |

Rule of thumb: If what is said matters more to you than how tightly the video appears synced, plan for post-production audio replacement from the start. Seedance 2 delivers best when you let it own the visuals and timing, then layer your final audio on top.

What "Voice Reference" Actually Means in Seedance 2.0

The confusion starts with how Seedance 2 processes audio internally. Unlike image or video references — which the model tries to preserve visually — audio is treated differently.

Audio as Rhythm, Not as Source

When you upload an audio file to a Seedance 2 generation, the model does not attempt to preserve the original recording. Instead, it analyzes the audio for:

  • Rhythm and tempo — how fast or slow the pacing is
  • Energy curve — where the audio builds, peaks, and drops
  • Mood and atmosphere — whether the audio feels tense, calm, energetic, or somber
  • Beat structure — where transitions and sync points occur

The model then generates new audio alongside the video, matching these structural characteristics. The result is a video whose pacing and audio feel aligned with your reference, but the actual sound content — including any voice or speech — is newly generated by the model.
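To make the term "energy curve" concrete, here is a minimal sketch that computes a loudness envelope from raw audio samples. This is only an illustration of the concept. How Seedance 2 analyzes audio internally is not documented in this guide, and the function name `energy_curve` is hypothetical.

```python
import math


def energy_curve(samples, window):
    """Return the RMS level of each consecutive window of samples."""
    if window <= 0:
        raise ValueError("window must be positive")
    curve = []
    for start in range(0, len(samples), window):
        chunk = samples[start:start + window]
        # Root mean square: a simple measure of how loud this window is.
        curve.append(math.sqrt(sum(s * s for s in chunk) / len(chunk)))
    return curve
```

A rising sequence of values corresponds to a build, and a sudden fall corresponds to a drop.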

When Audio Becomes Voice: The Lip-Sync Exception

The one case where audio functions closer to a "voice reference" is lip-sync dialogue. In Omni Reference mode, Seedance 2 can analyze a spoken audio track and synchronize character mouth movements to it. The model generates a video where the character's lips move in sync with the dialogue.

But even here, the voice sound you hear in the output is generated by Seedance 2 — it is not your original recording preserved. The original audio acts as a timing guide for mouth movement, not as a voice sample the model reproduces.

Rule of thumb: Think of audio in Seedance 2 the way you would think of a metronome — it sets the pace and rhythm, but the actual musical notes (the specific sounds and voice) are composed fresh by the model.

What the Official Docs Actually Say

The official ByteDance Seedance 2.0 page lists audio as a supported input modality alongside images, video, and text. The system is described as supporting "audio-video joint generation," meaning the audio track is produced simultaneously with the visuals rather than being added afterward.

The Seedance2Pro prompt guide confirms an Audio-Driven mode whose purpose is to "sync video rhythm to music or voice." The recommended prompt structure for this mode is:

"Video synchronized to the provided audio: [describe the visual content]."

This wording confirms the role of audio as a synchronization reference, not a content preservation channel. The prompt still needs to describe what the viewer actually sees — the audio only controls how the visuals move through time.

The next question is practical: what modes does Seedance 2 actually offer for audio input, and what can each one do for your specific use case?

How Audio Reference Works in Practice

Seedance 2 offers two pathways for audio input, depending on whether you need rhythm control or multi-reference scene assembly.

Audio-Driven Mode

This is the simplest audio integration. Select Audio-Driven mode, upload one audio file (music or voice), and write a prompt describing the visual content.

What happens: The model analyzes your audio's tempo, energy curve, and mood. It generates video whose pacing, camera motion, and scene transitions sync to the audio. The output includes native audio (typically music + ambient sounds + optional synthesized voice) that matches the rhythm of your reference.

Best used for:

  • Music videos and beat-synced content
  • Montage-style videos where pacing follows the audio
  • Any generation where rhythm matters more than specific sound content

Limitation: The audio you hear in the output is not your original recording. If you upload a specific narration track, the model will not preserve it — it will generate new audio that shares the same timing and mood.

Omni Reference Mode

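Omni Reference is Seedance 2's multimodal pipeline. It accepts up to 12 reference files in a single generation, with per-type caps of 9 images, 3 video clips, and 3 audio tracks, all working together. The per-type caps add up to 15, so the 12-file total is the limit you hit first if you max out every type.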
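Before uploading, it can help to check a planned reference set against these limits. The sketch below assumes the numbers quoted in this guide (12 files total, 9 images, 3 videos, 3 audio tracks). Individual platforms may differ, and `check_reference_counts` is a hypothetical helper, not part of any Seedance API.

```python
# Limits as quoted in this guide; treat them as platform-dependent assumptions.
MAX_TOTAL = 12
MAX_PER_TYPE = {"image": 9, "video": 3, "audio": 3}


def check_reference_counts(counts):
    """Return a list of problems; an empty list means the set fits the limits."""
    problems = []
    for kind, count in counts.items():
        if kind not in MAX_PER_TYPE:
            problems.append(f"unknown reference type: {kind}")
        elif count > MAX_PER_TYPE[kind]:
            problems.append(
                f"too many {kind} references: {count} > {MAX_PER_TYPE[kind]}"
            )
    total = sum(counts.values())
    if total > MAX_TOTAL:
        problems.append(f"too many references in total: {total} > {MAX_TOTAL}")
    return problems
```

For example, `check_reference_counts({"image": 9, "video": 3, "audio": 3})` passes every per-type cap but still reports that 15 files exceed the 12-file total.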

In this mode, audio plays a more nuanced role. You can use the @-tag system to assign specific roles to each audio file:

  • @Audio1 as background music, matching mood and tempo
  • Lip-sync dialogue to @Audio2 for the main character
  • @Audio3 provides ambient environmental sound

The key difference from Audio-Driven mode: in Omni Reference, audio is one input among many, and you can combine it with character images, video references for motion, and text prompts for narrative direction.

Best used for:

  • Dialogue scenes with lip-sync
  • Multi-character scenes where different characters speak
  • Combining a specific visual reference (character image) with spoken audio
  • Complex scenes where audio is one component of a larger assembly

Limitation: The same audio preservation issue applies. The model generates new audio from the reference's structure; it does not play your original file in the output.

Given these limitations, the question most creators actually care about is whether Seedance 2 can handle the voiceover and narration scenarios they need for real projects.

Does Seedance 2.0 Support Voiceover and Narration?

This is the question behind most searches for a Seedance 2.0 voice reference, and the answer depends on what you mean by "support."

| Scenario | Native Support | Quality | Post-Production Needed? |
|---|---|---|---|
| Character speaks pre-written dialogue with lip-sync | ✅ Yes (Omni Reference) | Moderate: can feel jittery, improves with iteration | Often yes, for polish |
| Multi-character dialogue in one scene | ✅ Yes (Omni Reference) | Experimental: works but can be inconsistent | Usually yes |
| One character narrates while another appears on screen | ⚠️ Partial: lip-sync only works if the speaking character is visible | Variable | Yes |
| Upload a narration track and get the same voice in the output | ❌ No: Seedance generates new audio | N/A | Yes, replace in post |
| Character voice consistency across multiple clips | ❌ No voice cloning capability | N/A | Yes, use external audio tools |
| Generate video with a specific voice actor's timbre | ❌ Not supported | N/A | Yes, external pipeline needed |

The bottom line: Seedance 2 can generate lip-synced dialogue and rhythm-synced narration-style content, but if you need a specific voice saying specific words, you will need to replace the audio track in an external video editor.

Rule of thumb: Before building a full production pipeline, run one short test: generate a 5-second clip with your audio reference. If roughly 80% of the timing cues in your reference (pauses, beats, emphasis points) line up with visible changes in the output, the hybrid approach (Seedance 2 visuals + external audio) is viable. If the model produces completely mismatched pacing, simplify your audio reference by using a pure beat track or metronome instead of complex voiceover, then replace the full audio in post.
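One way to put a number on the 80% check is to note the timestamps of a few cues in your reference (pauses, beats, emphasis points) and the timestamps where you see a matching change in the generated clip, then count how many line up. This is a sketch under that assumption. The guide does not define how to measure the match, and the 0.25-second tolerance and both function names are illustrative choices.

```python
def timing_match_ratio(reference_cues, observed_cues, tolerance=0.25):
    """Fraction of reference cue times (in seconds) with an observed cue nearby."""
    if not reference_cues:
        raise ValueError("need at least one reference cue")
    matched = sum(
        1
        for ref in reference_cues
        if any(abs(ref - obs) <= tolerance for obs in observed_cues)
    )
    return matched / len(reference_cues)


def hybrid_workflow_viable(reference_cues, observed_cues, threshold=0.8):
    """Apply the rule of thumb: viable if enough cues line up."""
    return timing_match_ratio(reference_cues, observed_cues) >= threshold
```

Cue times can be read off by hand from an audio editor's waveform and a frame-by-frame playback of the clip; five or six cues are enough for a 5-second test.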

What Users Report: The Reality Gap

Reddit and creator community discussions reveal a consistent pattern. Users who upload a spoken audio file expect the output to contain their original recording perfectly synced to the video. When the model instead generates new audio that only vaguely matches the original timing, the result feels like a failure.

The most common complaints:

  • "The video doesn't follow reference audio properly" — The model interprets audio as a timing guide, not a literal track to overlay. This is working as designed, but user expectations differ from the actual behavior.
  • "Audio reference changes the lyrics" — Users uploading songs with lyrics find the output generates different words or vocalizations. Since the model creates new audio rather than preserving the original, this is expected behavior.
  • "Lip sync feels jittery and unnatural" — Multi-character lip-sync in particular is still maturing. The model can synchronize mouth movements, but the results improve noticeably with iteration rather than being production-ready on the first attempt.

Understanding this gap between user expectation and model behavior is the first step to building a workflow that actually delivers usable results.

The following three workflows cover the most common scenarios. Which one you choose depends on whether you need lip-sync, narration pacing, or full voice replacement — the decision table earlier in this guide can help you pick the right starting point.

How to Sync Seedance 2 with External Voiceover Audio

Given the native limitations, the most reliable workflow for achieving specific voiceover is a hybrid approach: use Seedance 2 for visual generation and timing, then replace the audio in post-production.

Workflow 1: Audio-Guided Generation + Post-Production Audio Swap (Most Reliable)

This is the simplest and most reliable method when you have a pre-recorded voiceover.

Step 1: Prepare your timing reference

Create a rough version of your voiceover audio — it does not need to be the final recording. Even a placeholder track with the correct pacing and pauses works. The model needs something to analyze for rhythm.

Step 2: Generate with Audio-Driven mode

Upload your timing reference and write a prompt that describes the visual content. Focus the prompt on what the viewer sees, not what the audio says:

"Video synchronized to the provided audio: A person walks through a morning market, camera tracking smoothly at waist height, warm golden lighting, produce stalls with vibrant colors, steam rising from food carts."

Step 3: Export the generated video

Seedance 2 outputs video with native audio. The generated audio will likely not match your voiceover, but the video timing and pacing will be aligned to your reference.

Step 4: Replace audio in post-production

Import the generated video into your video editor (DaVinci Resolve, Premiere Pro, CapCut, or any NLE). Mute the generated audio track. Import your actual voiceover file. Align it to the video timeline using the visual pacing cues the model generated.
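If you prefer the command line to an editor, the same swap can be sketched with ffmpeg driven from Python. This assumes ffmpeg is installed and on your PATH. The file names are placeholders, and the helper names are hypothetical. The voiceover is laid in from the start of the clip, so any fine alignment still has to be done beforehand or in an editor.

```python
import subprocess


def build_audio_swap_command(video_path, voiceover_path, output_path):
    """Build an ffmpeg command that keeps the video and swaps in new audio."""
    return [
        "ffmpeg",
        "-i", video_path,       # input 0: the generated clip
        "-i", voiceover_path,   # input 1: your real voiceover
        "-map", "0:v:0",        # take the picture from the generated clip
        "-map", "1:a:0",        # take the sound from the voiceover only
        "-c:v", "copy",         # do not re-encode the video stream
        "-c:a", "aac",          # encode audio as AAC for an MP4 container
        "-shortest",            # stop when the shorter stream ends
        output_path,
    ]


def swap_audio(video_path, voiceover_path, output_path):
    """Run the swap; raises CalledProcessError if ffmpeg fails."""
    command = build_audio_swap_command(video_path, voiceover_path, output_path)
    subprocess.run(command, check=True)
```

A call such as `swap_audio("seedance_clip.mp4", "final_voiceover.wav", "final.mp4")` drops the generated audio entirely, which matches the "mute the generated track" step above.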

Total time: 15–30 minutes per clip, including generation time.

When this approach fails: If the model's visual pacing does not match your voiceover timing closely enough, try generating 2–3 variants with slightly different prompts. The 5-second test described earlier will tell you within one generation whether this approach is viable for your specific audio.

Workflow 2: Lip-Sync Dialogue Generation with External Refinement

For dialogue scenes where a character needs to appear to speak specific words, Omni Reference mode provides a starting point.

Step 1: Prepare your dialogue audio

Record the dialogue as a clean audio file. Keep it under 15 seconds (Seedance 2's typical maximum for audio references). The cleaner the recording, the better the model can analyze timing.

Step 2: Set up your Omni Reference generation

Upload three types of references:

  • Character image: a clear image of the character who should speak
  • Dialogue audio: your recorded dialogue file, referenced in the prompt by its @-tag and given an explicit role
  • Optional background/context image: where the scene takes place

Prompt example:

"Omni Reference generation: @Image1 is the character speaking. Lip-sync dialogue to @Audio1. @Image2 sets the background environment. The character faces the camera and speaks naturally."

Step 3: Evaluate the output

Check three things:

  1. Does the lip-sync timing match your dialogue?
  2. Is the character's face consistent?
  3. Does the generated audio sound acceptable for your use case?

If the timing is close but the quality needs work, generate 2–3 more iterations. Users report that lip-sync quality improves noticeably across attempts.

Step 4: Replace audio in post (if needed)

If the generated audio voice does not match what you need — and for production work it often will not — mute the generated track and overlay your original dialogue recording. The lip-sync timing from the generation serves as a visual guide for aligning your audio.

Low-friction test: Before committing to a full dialogue scene, generate a single 5-second test with one character and one audio reference. If roughly 80% of the mouth movements line up with the dialogue on the first try, the workflow is viable. If the model ignores the audio or produces completely mismatched mouth movements, switch to Workflow 1.

Workflow 3: Narration-First with Prompt Structure

For content where a narrator describes what is happening (documentary-style, tutorials, product demos), the most efficient approach combines Seedance 2's Audio-Driven mode with a specific prompt structure designed for voiceover sync.

Step 1: Write a narration script

Write your narration as a timed script. Note the approximate duration of each section. This will guide your generation parameters.

Step 2: Record a pacing reference

Record a rough read-through of your script — even a monotone reading works. The model needs the timing information more than it needs vocal performance quality. Export as a clean audio file (MP3 or WAV, under 15 seconds per clip — longer narration requires splitting into multiple generations).

Step 3: Generate with segment-specific prompts

For each segment, write a prompt that describes the visual corresponding to that narration section. Use Audio-Driven mode with your pacing reference:

"Video synchronized to the provided audio: [describe the visual content for this segment]. Camera movement matches audio energy — steady during explanation, dynamic during emphasis."

Step 4: Assemble in post

After generating all segments, import them into your video editor. Mute each generated clip's audio. Overlay your final recorded narration. The video segments should align roughly with the narration timeline because the model paced the visuals to match your reference.

Step 5: Add crossfades and transitions

Since the segments were generated independently, add crossfades or transitions between them to smooth the visual flow.

Rule of thumb: Each generation covers roughly 5–10 seconds of content. A 60-second narrated video will require 6–12 separate generations. Plan your script in segments before opening Seedance 2.
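The segment arithmetic can be sketched as a small planner. It assumes equal-length segments and a per-generation ceiling that you choose (10 seconds here). In practice you would nudge each boundary to the nearest pause in the narration. `plan_segments` is a hypothetical helper.

```python
import math


def plan_segments(total_seconds, max_segment_seconds=10.0):
    """Split a narration into equal segments no longer than the ceiling."""
    if total_seconds <= 0 or max_segment_seconds <= 0:
        raise ValueError("durations must be positive")
    count = math.ceil(total_seconds / max_segment_seconds)
    length = total_seconds / count
    # Each tuple is (start, end) in seconds, rounded to milliseconds.
    return [
        (round(i * length, 3), round((i + 1) * length, 3))
        for i in range(count)
    ]
```

A 60-second script needs 6 generations at a 10-second ceiling and 12 at a 5-second ceiling, so the ceiling you can rely on determines how much assembly work follows.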

What Is Omni Reference? How Audio Works Inside a 12-File Multimodal Pipeline

Omni Reference is Seedance 2's unified multimodal generation system. Instead of treating text, images, video, and audio as separate modes, it processes them together in a single pipeline — up to 12 reference files per generation.

The @-tag system lets you assign explicit roles to each reference. For audio, this is critical because the model needs to know whether a given audio file is background music, dialogue to lip-sync, ambient sound, or a timing reference.

Common @-tag patterns for audio:

| Tag Usage | What It Does | Best For |
|---|---|---|
| @Audio1 as background music | Sets mood and rhythm; generates new music matching the reference's style | Music videos, montages, atmospheric scenes |
| Lip-sync dialogue to @Audio1 | Synchronizes character mouth movements to spoken audio | Dialogue scenes, character speaking shots |
| @Audio1 provides ambient sound, @Audio2 drives dialogue lip-sync | Separates roles across multiple audio files | Complex scenes with both dialogue and atmosphere |
| Use @Audio1 as timing reference only | Tells the model to follow pacing without generating matching audio style | Workflow 1 above: timing-guided generation with planned post-production audio swap |

What happens if you do not tag your audio: Untagged references may be ignored or misinterpreted. The model has to infer each file's role from context, which often produces unpredictable results. Always tag.
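A quick pre-flight check can catch an audio file that the prompt never mentions. The sketch below assumes tags follow the `@Audio1`, `@Audio2`, ... numbering used in this guide's examples. `untagged_audio` is a hypothetical helper, and it only checks that a tag appears, not that a sensible role is attached to it.

```python
import re


def untagged_audio(prompt, audio_count):
    """Return the @AudioN tags (numbered from 1) that the prompt never mentions."""
    mentioned = {int(n) for n in re.findall(r"@Audio(\d+)", prompt)}
    return [
        f"@Audio{i}"
        for i in range(1, audio_count + 1)
        if i not in mentioned
    ]
```

If you upload three audio files but the prompt only assigns roles to two, the function returns the tag that still needs a role.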

Prompt Structure for Audio-Driven Generation

The prompt formula for audio-driven generation differs from text-to-video or image-to-video prompts. The audio handles rhythm and mood; the text handles visual content. Mixing these responsibilities causes the model to produce either confusing visuals or audio that does not match the scene.

The Audio-Driven Prompt Formula

"[Mode signal] + synchronized to provided audio: [Visual content — subject, setting, action, camera] + [Audio-visual relationship] + [Style and quality]."

Tested Examples

Voiceover/narration scene:

"Text-to-video synchronized to the provided audio: A middle-aged man sits at a desk in a dimly lit study, bookshelves behind him, speaking directly to camera. Slow camera push-in throughout. Natural lighting, documentary style, shallow depth of field."

Music-driven visuals:

"Audio-driven generation: Abstract particles flowing through a dark digital space, pulsing and expanding with the beat. Colors shift from cool blue to warm amber as energy builds. Camera orbits slowly around the central formation. 4K, cinematic, smooth 60fps."

Dialogue scene:

"Omni Reference: @Image1 is the character speaking. Lip-sync dialogue to @Audio1. A young woman in a café, afternoon light streaming through the window, speaking to someone off-camera. Medium close-up, naturalistic performance, soft film grain."

The Most Common Mistake

Uploading audio without describing what appears on screen. The model receives audio and a prompt that says "sync to this audio" — but without a visual direction, it produces abstract, often unusable results. The audio influences pacing; the prompt must define what the viewer actually sees.

Difference from text-to-video prompts: Audio-driven prompts should not describe sound or music, because the audio reference already handles that. Instead, describe motion, timing, and energy in visual terms ("fast cuts," "slow pan," "explosive movement," "gentle fade") so the model can match visual pacing to the audio rhythm.
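The prompt formula and the common mistake above can be combined into a small helper that assembles the pieces and refuses to build a prompt with no visual description. This is a sketch of one way to keep prompts consistent. `build_audio_prompt` and its parameter names are hypothetical, and the output is plain text for you to paste into whichever platform you use.

```python
def build_audio_prompt(mode_signal, visual_content, av_relationship="", style=""):
    """Assemble a prompt that follows the guide's audio-driven formula."""
    if not visual_content.strip():
        # The most common mistake: audio with no visual direction.
        raise ValueError("describe what appears on screen")
    sentences = [
        part.strip().rstrip(".") + "."
        for part in (visual_content, av_relationship, style)
        if part.strip()
    ]
    prefix = f"{mode_signal.strip()} synchronized to the provided audio: "
    return prefix + " ".join(sentences)
```

Keeping the visual content as a required argument makes it hard to send a prompt that only says "sync to this audio."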

5 Current Limitations to Know Before Building Your Workflow

Understanding these limitations will save you hours of frustrated testing.

No Voice Cloning or Custom Voice Synthesis

Seedance 2 does not support voice cloning, voice capture, or custom voice synthesis. The model generates synthetic voices as part of the joint audio-video pipeline, but you cannot upload a voice sample and expect the same voice to appear in the output. This is the single most important limitation for voiceover workflows.

Audio Reference as Timing Map, Not Voice Template

Even in lip-sync mode, the uploaded audio serves primarily as a timing reference for mouth movement. The audio that appears in the generated output is newly created by the model — it shares the rhythm and pacing of your reference but not the voice characteristics, pronunciation, or performance.

Lip-Sync Quality Varies Significantly

Multi-character lip-sync is the most advanced audio feature available, but it is not production-ready for every use case. Users report:

  • Lip-sync jitters in about 30–40% of initial generations
  • Inconsistent results when multiple characters speak in the same scene
  • Better results with close-up shots than medium or wide shots
  • Noticeable improvement across 2–3 iterations from the same prompt

External Editing Is Still Required for Professional Voiceover

For any use case requiring specific voice content — brand narrations, character dialogue with a defined voice, multi-character scenes where voice differentiation matters — post-production audio replacement is currently unavoidable. Budget time for this step in your workflow.

Duration Limits for Audio References

Each audio reference is typically capped at 15 seconds (platform-dependent). Longer voiceover or dialogue segments must be split across multiple generations and assembled in post-production.
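For WAV files, the split can be sketched with Python's standard `wave` module. The 15-second default mirrors the cap quoted above and is platform-dependent. `split_wav` is a hypothetical helper, and it cuts at fixed times, so for speech you would normally adjust the cut points to fall in pauses.

```python
import wave


def split_wav(source_path, output_prefix, max_seconds=15.0):
    """Split a WAV file into consecutive chunks of at most max_seconds each."""
    chunk_paths = []
    with wave.open(source_path, "rb") as reader:
        frames_per_chunk = int(reader.getframerate() * max_seconds)
        index = 0
        while True:
            frames = reader.readframes(frames_per_chunk)
            if not frames:
                break  # reached the end of the source file
            chunk_path = f"{output_prefix}_{index:02d}.wav"
            with wave.open(chunk_path, "wb") as writer:
                # Copy the audio format so each chunk plays like the source.
                writer.setnchannels(reader.getnchannels())
                writer.setsampwidth(reader.getsampwidth())
                writer.setframerate(reader.getframerate())
                writer.writeframes(frames)
            chunk_paths.append(chunk_path)
            index += 1
    return chunk_paths
```

Each returned path can then be used as the audio reference for one generation, and the resulting clips reassembled in order in post-production.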

FAQ

Can I upload my own voice recording to Seedance 2.0?

Yes, as an audio reference file — but the model will not preserve your original voice in the output. The uploaded audio serves as a timing and rhythm guide, and Seedance generates new audio that matches the pacing. If you need your exact voice recording in the final video, generate the video first, then overlay your recording in a video editor.

Does Seedance 2.0 generate dialogue?

Yes, in Omni Reference mode with lip-sync. The model can generate video where characters appear to speak, with synchronized mouth movements. The dialogue audio is generated by the model — it shares the timing of your reference but not the specific voice.

Can I control which character speaks in a multi-character scene?

Currently, multi-character lip-sync is experimental. The model can assign dialogue timing to different characters, but consistent per-character voice differentiation is not yet reliable. For most multi-character scenes, external post-production audio replacement produces better results.

Is there text-to-speech in Seedance 2.0?

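Not as a standalone feature. Seedance 2 produces synthetic voice only as part of its joint audio-video output; it does not offer a text-to-speech engine where you supply a script and choose a voice. If you need text read aloud, generate the narration separately using a TTS tool and combine it with Seedance 2 video in post-production.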

How do I get clean voiceover in the final video?

The most reliable approach: (1) Generate your video using Audio-Driven mode with a timing reference. (2) Export the video. (3) Replace the generated audio with your actual voiceover in a video editor. This gives you the visual pacing benefits of audio reference without the limitation of model-generated audio.

What is the difference between "audio reference" and "voice reference" in Seedance 2?

"Audio reference" is any audio file you upload — music, voice, sound effects, or ambient. The model uses it for rhythm and mood analysis. "Voice reference" would imply the model preserves or reproduces a specific voice, which Seedance 2 does not currently support. The distinction matters because users searching for "voice reference" functionality may find the audio reference feature insufficient for their actual needs.

Can I use Seedance 2 for music video production?

Yes. Audio-Driven mode with a music track is one of Seedance 2's strongest applications. The beat-synced visual generation produces usable results on the first attempt for most music genres. For vocal sections in music videos, post-production audio replacement with the original track produces the best outcome.

The Bottom Line

Does Seedance 2.0 accept voice reference? Yes, as an audio input that drives timing, rhythm, mood, and lip-sync alignment. The model accepts one audio file per generation in Audio-Driven mode, and up to 3 audio tracks alongside images and video in Omni Reference mode.

Does it preserve your voice? No. Seedance 2 generates new audio from the reference's structural characteristics — it does not clone, reproduce, or preserve your original voice recording. The one exception is lip-sync dialogue, where the model synchronizes mouth movements to spoken timing, but even here the voice you hear in the output is generated, not preserved.

What should you do if you need specific voiceover in your video? Use the hybrid workflow: generate with an audio reference for timing and pacing, export the video, then replace the generated audio with your actual voiceover in a video editor. This gives you Seedance 2's visual quality and pacing advantages while keeping full control over the audio that plays with your video.

When should you use native audio-first? For music videos, atmospheric scenes, abstract visuals, and any content where rhythm and mood matter more than specific sound content. Audio-Driven mode and Omni Reference mode deliver strong results when you do not need precise voice or sound reproduction.


Start a 5-Second Audio-Driven Test on Seedance2Pro: Upload a music track as reference, write one prompt describing a clear visual scene, and export the result. Compare the generated audio to your original — you will see immediately how Seedance 2 interprets audio as a structural guide rather than a preservation channel. This single 5-minute test will save you hours of confusion about what audio can and cannot do in your workflow.
