en fr

Factory-Boy : Optimize bulk creation

Posted on Thu 12 September 2024 in development

The problem

When creating large dataset using factory_boy, you may find yourself using MyFactory.create_batch() which is great for specifying a size, but falls short in terms of performance when using factories based on Django models.

Indeed, here's the related source code:

    @classmethod
    def create_batch(cls, size, **kwargs):
        """Create a batch of instances of the given class, with overridden attrs.

        Args:
            size (int): the number of instances to create

        Returns:
            object list: the created instances
        """
        return [cls.create(**kwargs) for _ in range(size)]

This means that an instance is generated and created for each iteration, resulting in numerous SQL queries, especially if your factory uses SubFactory (related model factories).

The solution

To prevent too much SQL queries, it would be better to use bulk_create from the Django manager.

A simple solution can be to generate the instances and then saving them, you can also pass parameters (notifications_enabled for example):

class ContactFactory(DjangoModelFactory):
    class Meta:
        model = Contact

# Create a thousand contacts
contact_list = ContactFactory.simple_generate_batch(
    create=False, size=1000, notifications_enabled=True
)
contact_list = Contact.objects.bulk_create(contact_list)

But what if our factory has a SubFactory? You would certainly hit a N+1 problem. To overcome it, you may bulk create sequentially, while retaining primary keys:

class ContactFactory(DjangoModelFactory):
    class Meta:
        model = Contact

class NotificationFactory(DjangoModelFactory)
    contact = factory.SubFactory(ContactFactory)

    class Meta:
        model = Notification

size = 1000
# Create contacts
contact_list = ContactFactory.simple_generate_batch(
    create=False, size=size, notifications_enabled=True
)
contact_list = Contact.objects.bulk_create(contact_list)

# Create a notification for each contact
obj_list = NotificationFactory.simple_generate_batch(
    create=False, size=size, contact=None
)
for pos, obj in enumerate(obj_list):
    obj.contact_id = contact_list[pos].pk
Notification.objects.bulk_create(obj_list)