Hires fix #28
Replies: 2 comments
-
Heck... I may even be able to hack this in the current version, with some effort: take a spectrogram of a preexisting song, reinterpret it through a different artist / style, break it up into 512-pixel-wide chunks with some overlap, and merge the seams. Of course, that'd be a hacky version...
-
Hey @enn-nafnlaus, I agree! There's a ton to experiment with here, especially keeping the height the same but sweeping the width to cover seams, and trying methods to get high-level structure. I will add a Streamlit app to the repo soon that may help with this kind of experimentation, but I'd love to see your results.
-
It occurred to me today that what AUTOMATIC1111 calls "Hires fix" would apply perfectly to music.
Basically, if you want to generate, say, a 1600x1600 image, but you're using a 512x512 model, it's going to come out looking really wonky. So what they do is first generate what is essentially a thumbnail of the image, and then run it through img2img to upscale it and fill in the details.
I'd think that approach would lend itself really well to songs - though perhaps ideally with two separate models rather than just one (one trained on the "broad composition" of music, and the other trained on fine details). With this, you should be able to do complete tracks, with no seed - with the "thumbnail" (broad composition) being generated by txt2img, and the fine details handled by img2img.
There would be no need to generate the "thumbnails" in realtime (though faster is better), so a higher horizontal resolution would be ideal. Still, even sticking with 512x512, it could represent a 2 minute track at 240bpm or a 4 minute track at 120bpm at roughly one datapoint per beat (480 beats across 512 columns). If the song were actually 60bpm, you'd get about 4 datapoints per beat in the former case and about 2 in the latter. I bet img2img would fill in the gaps extremely well.
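To make the numbers concrete, here's a quick sketch of the arithmetic. It assumes the whole track is squeezed into the image width, one spectrogram column per "datapoint"; the function name is just something I made up for illustration.

```python
def columns_per_beat(width_px: int, duration_s: float, bpm: float) -> float:
    """Spectrogram columns available per beat when a whole track
    is squeezed into an image that is width_px columns wide."""
    beats = duration_s / 60.0 * bpm
    return width_px / beats


if __name__ == "__main__":
    # (minutes, bpm) pairs from the post above, all at 512 columns wide
    for minutes, bpm in [(2, 240), (4, 120), (2, 60), (4, 60)]:
        cpb = columns_per_beat(512, minutes * 60, bpm)
        print(f"{minutes} min @ {bpm} bpm: {cpb:.2f} columns per beat")
```

A wider thumbnail just scales these numbers linearly, e.g. doubling the width doubles the columns per beat.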
img2img would be best run at 512 height with as large a width as the person's GPU can handle, in overlapping blocks. I don't know how well blending the seams would work, but I imagine it would probably work decently.
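Here's a rough sketch of what the overlapping blocks and seam blending could look like. This is only one possible approach (a weighted average with weights that taper toward each block's edges), not anything the repo does today. The img2img call itself is left out; `plan_blocks` and `blend_blocks` are hypothetical helpers, and a spectrogram is represented as a plain list of columns (each column a list of floats) to keep the example dependency-free.

```python
def plan_blocks(total_width, block_width, overlap):
    """Return (start, end) column ranges that cover total_width,
    with neighbouring blocks sharing at least `overlap` columns."""
    if not 0 <= overlap < block_width:
        raise ValueError("overlap must be >= 0 and smaller than block_width")
    if total_width <= block_width:
        return [(0, total_width)]
    step = block_width - overlap
    starts = list(range(0, total_width - block_width, step))
    starts.append(total_width - block_width)  # last block sits flush with the right edge
    return [(s, s + block_width) for s in starts]


def blend_blocks(blocks, ranges, total_width):
    """Merge processed blocks back into one spectrogram.
    blocks[b][col][row] holds the values of block b; overlapping columns
    are averaged with tent-shaped weights so seams fade into each other."""
    height = len(blocks[0][0])
    acc = [[0.0] * height for _ in range(total_width)]
    wsum = [0.0] * total_width
    for block, (start, end) in zip(blocks, ranges):
        n = end - start
        for i in range(n):
            w = min(i + 1, n - i)  # small near the block edges, large in the middle
            wsum[start + i] += w
            for r in range(height):
                acc[start + i][r] += w * block[i][r]
    return [[v / wsum[c] for v in acc[c]] for c in range(total_width)]


if __name__ == "__main__":
    ranges = plan_blocks(total_width=1200, block_width=512, overlap=64)
    print(ranges)
```

In practice each block would be cut from the upscaled "thumbnail", run through img2img, and then handed to `blend_blocks`. Whether a simple weighted average sounds good at the seams is exactly the open question.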
I'm really curious what it would come up with!