A question: if this is a two-voice build, why aren't there two of the entire signal chain needed to make up a voice? There are two complex VCOs -- check. Two VCAs (-ish; not how I'd do this) -- check. One VCF...ah, that might be a problem. Basically, this isn't how two-voice polyphony works. You've instead arrived at something called (not very aptly, I think) 'paraphony', where two independent sources get funneled into the same modifier chain. By default, you lose the separation you're referring to when that happens. Plus, once you mash it all into the single VCF, there's no point in having two of the Noise Engineering EG/VCAs anymore. You're just dynamically modifying the same sound in two different ways.
For reference, go have a close look at an Oberheim Two-Voice. These have been around since the mid-1970s and are still made today with some modern upgrades, for very good reasons. You'll notice that, since it uses the SEM-based Oberheim architecture, you actually have two discrete signal paths, each with its own controllers, modulation sources, etc. That's what you're trying to do here.
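If it helps to see the routing difference outside of a rack, here's a rough Python sketch. Everything in it is an assumption made for illustration: the function names are made up, the naive sawtooth stands in for a VCO, and a one-pole low-pass stands in for a VCF. It models none of the actual modules in this build, only the idea of one shared filter versus one filter per voice.

```python
import math

SAMPLE_RATE = 48000  # assumed sample rate for this sketch


def saw(freq_hz, n_samples):
    """Naive (non-band-limited) sawtooth in [-1, 1); stands in for a VCO."""
    return [2.0 * ((freq_hz * i / SAMPLE_RATE) % 1.0) - 1.0
            for i in range(n_samples)]


def one_pole_lowpass(signal, cutoff_hz):
    """Very simple one-pole low-pass; stands in for a VCF."""
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / SAMPLE_RATE)
    out, y = [], 0.0
    for x in signal:
        y += alpha * (x - y)
        out.append(y)
    return out


def paraphonic(osc_a, osc_b, cutoff_hz):
    """Two sources are mixed first, then pass through ONE shared filter."""
    mixed = [a + b for a, b in zip(osc_a, osc_b)]
    return one_pole_lowpass(mixed, cutoff_hz)


def two_voice(osc_a, osc_b, cutoff_a_hz, cutoff_b_hz):
    """Each source gets its OWN filter; the voices are mixed only at the end."""
    voice_a = one_pole_lowpass(osc_a, cutoff_a_hz)
    voice_b = one_pole_lowpass(osc_b, cutoff_b_hz)
    return [a + b for a, b in zip(voice_a, voice_b)]


def max_diff(x, y):
    """Largest sample-by-sample difference between two signals."""
    return max(abs(a - b) for a, b in zip(x, y))


n = 4800  # 0.1 s of audio
osc_a = saw(110.0, n)
osc_b = saw(164.81, n)

shared = paraphonic(osc_a, osc_b, cutoff_hz=800.0)
# Same cutoff on both voices: no different from the shared filter.
same_settings = two_voice(osc_a, osc_b, 800.0, 800.0)
# Independent cutoffs: something the single shared filter cannot do.
independent = two_voice(osc_a, osc_b, cutoff_a_hz=200.0, cutoff_b_hz=3000.0)

print("shared vs. two filters, same cutoff:", max_diff(shared, same_settings))
print("shared vs. two filters, own cutoffs:", max_diff(shared, independent))
```

With this idealized linear filter, the two routings only come apart once each voice gets its own cutoff (or its own envelope on that cutoff), which is exactly the separation a single shared VCF can't give you.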