Upload a clean WAV clip (5–15 seconds, no background noise) together with the exact transcript of what is spoken in it. F5-TTS uses this clip and transcript as its reference and clones the voice from them on every inference call.
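
Before uploading, it can help to check the clip locally. The snippet below is a minimal sketch using only Python's standard `wave` module. It assumes an uncompressed PCM WAV file. The 5–15 second bounds come from the guidance above, and the helper names (`clip_duration_seconds`, `check_reference_clip`) are hypothetical and not part of F5-TTS. It cannot detect background noise or verify that the transcript matches the audio. It only catches an unreadable file, an out-of-range duration, or an empty transcript.

```python
import wave

# Duration bounds taken from the upload guidance (assumed to be inclusive).
MIN_SECONDS = 5.0
MAX_SECONDS = 15.0


def clip_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_reference_clip(path, transcript):
    """Hypothetical pre-upload check; returns a list of problems (empty if OK)."""
    problems = []

    # The transcript must be present; whether it matches the audio
    # still has to be verified by ear.
    if not transcript.strip():
        problems.append("transcript is empty")

    try:
        duration = clip_duration_seconds(path)
    except (wave.Error, EOFError) as exc:
        problems.append(f"not a readable PCM WAV file: {exc}")
        return problems

    if not MIN_SECONDS <= duration <= MAX_SECONDS:
        problems.append(
            f"clip is {duration:.1f} s; expected "
            f"{MIN_SECONDS:.0f} to {MAX_SECONDS:.0f} s"
        )
    return problems
```

Calling `check_reference_clip("reference.wav", "The words spoken in the clip.")` returns an empty list when the file passes these basic checks. The file name here is only a placeholder.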