A preliminary step-by-step guide to my TI (Textual Inversion) creation process. I will probably revisit this post at some point in the future to add detail, improve readability, and so on.
1- Pick some good-quality pictures of the subject. Aim for around 30, where their face is clearly visible and the quality is as high as possible. Avoid pictures with too many things in the background or that would be hard to describe precisely. Also avoid oversaturated pictures or images with "artifacts" (for instance, spots or marks that may be misinterpreted by the AI). As my good friend and fellow TI creator hmonk pointed out, it's also recommended that the subject appears alone in the picture. Blurred people in the background should be no issue, but avoid other clearly distinguishable faces and even mirror reflections of the subject themself.
2- Crop the pics to 512x512 pixels. I use Photoshop for this, but many people prefer faster options, like Birme.
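If you'd rather script this step, here is a minimal sketch of the center-crop math, with an optional Pillow-based batch function. The "raw" and "cropped" folder names are just placeholders for illustration.

```python
from pathlib import Path


def square_crop_box(width: int, height: int) -> tuple:
    """Largest centered square inside a width x height image."""
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return (left, top, left + side, top + side)


def crop_folder(src: Path, dst: Path) -> None:
    """Crop every .jpg in src to 512x512 and save into dst (needs Pillow)."""
    from PIL import Image  # third-party: pip install Pillow
    dst.mkdir(exist_ok=True)
    for p in src.glob("*.jpg"):
        img = Image.open(p)
        img.crop(square_crop_box(*img.size)).resize((512, 512), Image.LANCZOS).save(dst / p.name)
```

For example, an 800x600 photo gets the centered box (100, 0, 700, 600) before the resize, so nothing is distorted.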
3- Make a final selection from the already cropped pictures. I always go for 15 because it's a multiple of 3. This is related to my usual batch size (3) and gradient accumulation steps (15)*, and the idea is that I run 3 full epochs every step: 3 x 15 = 45 images processed per step, which is 3 full passes over a 15-image dataset. Why? Because by now I know that the sweet zone for TIs trained with these settings lies within the range of 360 to 420 epochs (i.e., between steps 120 and 140, training with said settings).
4- Create the embedding. Choose a name and the number of vectors per token. I normally go for 8 vectors now. Leave the initialization text field blank so that you start with zeroed-out vectors (meaning the training will begin from absolute scratch, without being influenced by any other information the SD model may contain, so only your dataset will be used).
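Conceptually, a TI embedding is just a small matrix of learnable vectors, one row per token vector. A sketch of what a blank 8-vector embedding looks like (768 is the CLIP text-encoder width used by SD v1.x; the webui handles all of this internally):

```python
# Hypothetical illustration, not webui code.
NUM_VECTORS = 8   # "number of vectors per token" chosen at creation
EMBED_DIM = 768   # token embedding width in SD v1.x models

# A blank initialization: every value starts at zero, so training
# starts "from scratch" rather than from an existing word's vector.
embedding = [[0.0] * EMBED_DIM for _ in range(NUM_VECTORS)]
```

More vectors give the embedding more capacity to capture the subject, at the cost of eating more of the 75-token prompt budget.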
5- Preprocess the images. I always select Use BLIP for caption. This will create text files that describe every picture in your dataset. The captions are machine-generated, so they won't always be as precise as one would desire. In fact, they often contain laughable mistakes, so you'll probably want to edit them. You'll also want to edit your custom_subject_filewords.txt file located in your stable-diffusion-webui/textual_inversion_templates folder (if said file doesn't exist, create it) and paste the following text in it:
a photo of [name], [filewords]
The text this file contains will be used during training as the prompt for generating training images. [name] will be replaced by the name you chose for your embedding during step 4. [filewords] will be replaced by the text in the captions of your images.
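The substitution is straightforward string replacement. A hedged sketch of how the template line gets expanded into a training prompt (this mimics the behavior described above; it is not the webui's actual code, and "myTI" is a hypothetical embedding name):

```python
def expand_template(template: str, name: str, filewords: str) -> str:
    """Replace [name] with the embedding name and [filewords] with the caption."""
    return template.replace("[name]", name).replace("[filewords]", filewords)


prompt = expand_template(
    "a photo of [name], [filewords]",
    "myTI",  # hypothetical embedding name chosen in step 4
    "a woman in a white dress, posing for a picture",  # caption from the .txt file
)
# -> "a photo of myTI, a woman in a white dress, posing for a picture"
```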
6- Once the images are preprocessed, carefully edit their respective text files. I try to describe everything that DOES NOT belong to the subject (what they're wearing, where they are, what the background is, any items around the subject, etc.). I use commas to separate the segments. For instance: "a woman in a white dress, with earrings and a necklace, wearing a black hat, posing for a picture in her bedroom, with a white wall in the background". The commas are important if you choose the Shuffle tags by ',' when creating prompts option for training, as I always do.
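A rough sketch of what that shuffle option does with the commas: the comma-separated segments of the caption are randomly reordered each time a training prompt is built, so no segment's position is treated as special. This is an illustration, not the webui's actual implementation:

```python
import random


def shuffle_tags(caption: str, rng: random.Random) -> str:
    """Split a caption on commas, shuffle the segments, and rejoin them."""
    tags = [t.strip() for t in caption.split(",") if t.strip()]
    rng.shuffle(tags)
    return ", ".join(tags)


caption = "a woman in a white dress, with earrings and a necklace, wearing a black hat"
shuffled = shuffle_tags(caption, random.Random(0))
# The same three segments come back, just in a random order.
```

This is why sloppy comma placement matters: a segment that runs two ideas together will be shuffled as a single unbreakable unit.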
7- Once you have the text files edited, the training begins. Remember to always select the base SD model for training (usually v1-5-pruned-emaonly.ckpt). I've uploaded a screenshot of my usual settings here. What I usually do now is train until step 120 with intervals of 10 steps (both for generating preview images and for saving embedding checkpoints). Once that part is done, I lower the intervals to 1 step and train until step 145 or thereabouts, which allows me to test every step of my so-called sweet zone (roughly steps 130 to 140). I always end up finding at least a decent version among those steps.
*PS: Always make sure that your batch size multiplied by your gradient accumulation steps is either equal to the number of images in your dataset OR a multiple of it. For instance, if your dataset consists of 15 images, you can go for batch size 1 and gradient 15 (1x15=15), batch size 3 and gradient 5 (3x5=15), or any other batch size your GPU can handle, as long as batch size times gradient comes out to a multiple of 15. Training like this is not mandatory, of course, and perfectly good results can be achieved via different methods, but it is the easiest way to calculate up to what step you should train and around what step you can expect the best possible results.
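The footnote's rule can be formalized as a tiny sanity check: each optimizer step should cover a whole number of epochs, i.e., batch size times gradient accumulation should divide evenly by the dataset size. A small helper (illustration only):

```python
def covers_whole_epochs(batch_size: int, grad_accum: int, dataset_size: int) -> bool:
    """True if every optimizer step processes a whole number of epochs."""
    return (batch_size * grad_accum) % dataset_size == 0


# The combinations from the footnote, for a 15-image dataset:
assert covers_whole_epochs(1, 15, 15)      # 1x15 = 15: one epoch per step
assert covers_whole_epochs(3, 5, 15)       # 3x5  = 15: one epoch per step
assert covers_whole_epochs(3, 15, 15)      # 3x15 = 45: three epochs per step
assert not covers_whole_epochs(4, 5, 15)   # 20 is not a multiple of 15
```

When the check fails, steps and epochs drift out of sync, and predicting the sweet zone from a step count becomes guesswork.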