
Colin Webb

Fake Data With Python

Real data is best, but sometimes it's not available - or it's not allowed.

Fake data is useful for testing and prototyping. It's also useful for demos and for training models. And it's quick and easy to generate.

In Python, there are a few libraries for generating fake data. The two we'll look at here are Faker and Mimesis.

We'll go from a slow, simple example to a fast, multi-core example.

Faker is slow

Faker has a simple API, and it's very extensible.
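
As a quick illustration of that extensibility, here's a minimal sketch of a custom provider. The ProductProvider name and SKU format are invented for this example:

from faker import Faker
from faker.providers import BaseProvider


class ProductProvider(BaseProvider):
    # Hypothetical provider returning a made-up product code
    def product_code(self) -> str:
        return f"SKU-{self.random_int(1000, 9999)}"


fake = Faker()
fake.add_provider(ProductProvider)
print(fake.product_code())  # e.g. "SKU-4821"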

Here's a quick Python program that takes a few arguments and produces a CSV file of fake data.

import argparse
import csv
from typing import Dict

from faker import Faker


def fake_record(fake: Faker) -> Dict:
    # One fake row: a name and a multi-line address
    return {
        "name": fake.name(),
        "address": fake.address(),
    }


if __name__ == "__main__":
    parser = argparse.ArgumentParser(prog="Generate Data")
    parser.add_argument("-f", "--file", default="data.csv")
    parser.add_argument("-n", "--num_rows", type=int, required=True)
    args = parser.parse_args()

    fake = Faker()
    # Generator expression: rows are produced lazily, one at a time
    records = (fake_record(fake) for _ in range(args.num_rows))

    with open(args.file, "w", newline="") as csvfile:
        writer = csv.DictWriter(
            csvfile, quoting=csv.QUOTE_ALL, fieldnames=["name", "address"]
        )
        writer.writeheader()
        for row in records:
            writer.writerow(row)
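
To generate a file, pass the number of rows. The script here is saved as simple_faker.py, matching the profile output below:

python simple_faker.py -n 100000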

Running this code as a benchmark yielded the following results.

Row count | Time taken
100K      | 12s
1M        | 122s

Almost two minutes! That's a long time to wait for not much data.

Using cProfile, it's easy to see where the time is spent. There's the overhead of cProfile itself to consider, but it's clear that the bulk of the time goes to Faker, much of it in the regex-driven token parsing visible in generator.py and re.Pattern.sub.

# time python -m cProfile -s cumulative simple_faker.py -n 100000
         40287037 function calls (37980098 primitive calls) in 21.911 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    112/1    0.001    0.000   21.911   21.911 {built-in method builtins.exec}
        1    0.038    0.038   21.911   21.911 simple_faker.py:1(<module>)
   100001    0.030    0.000   21.483    0.000 simple_faker.py:21(<genexpr>)
   100000    0.138    0.000   21.452    0.000 simple_faker.py:7(fake_record)
467966/200000    0.124    0.000   19.842    0.000 generator.py:161(parse)
1031791/200001    0.823    0.000   19.787    0.000 {method 'sub' of 're.Pattern' objects}
1200130/597340    0.800    0.000   19.470    0.000 generator.py:177(__format_token)
1200130/597340    0.573    0.000   18.880    0.000 generator.py:84(format)
  1292977    0.492    0.000   15.717    0.000 __init__.py:528(random_element)
  1292977    7.742    0.000   15.225    0.000 __init__.py:405(random_elements)
   100000    0.049    0.000   13.479    0.000 __init__.py:68(address)
   170645    0.047    0.000    7.501    0.000 __init__.py:211(last_name)
   100000    0.048    0.000    7.460    0.000 __init__.py:201(name)
  1292977    1.714    0.000    7.087    0.000 distribution.py:57(choices_distribution)
    89322    0.041    0.000    6.919    0.000 __init__.py:55(street_address)

Mimesis is faster

Let's change the fake_record function to use Mimesis. This library claims to be much faster, and has published benchmarks.

from mimesis import Address, Person
from mimesis.locales import Locale


def fake_record(person: Person, address: Address) -> Dict:
    return {
        "name": person.full_name(),
        # Mimesis has no single full-address call, so build one up
        "address": f"{address.address()}, {address.city()}, {address.postal_code()}",
    }


# In the main block, construct the providers once and pass them in:
person = Person(Locale.EN)
address = Address(Locale.EN)
records = (fake_record(person, address) for _ in range(args.num_rows))

The Address API is more granular, and we have to call three methods to get something similar to Faker's single address(). However, it's much faster. This time, let's generate 10 million rows.

Row count | Time taken
100K      | 0.8s
1M        | 8s
10M       | 80s

Mimesis is ~15x faster at generating data than Faker.

Multiprocessing to use all CPU cores

By default, Python will run on a single CPU core. We can use them all with the multiprocessing library.

The simplest approach here is a shared-nothing design. Each process writes to its own CSV file, so there are no race conditions or synchronization problems.

If it takes 80 seconds to generate 10 million rows on one core, how long could it take with eight cores? Let's find out:

import multiprocessing as mp


def worker(file_prefix: str, part: int, n: int):
    # Each process writes its own CSV file; no shared state
    ...


if __name__ == "__main__":
    # Arg parsing as before, with file_prefix and num_processes options
    ...

    # Split the rows evenly across the processes
    rows_per_process = args.num_rows // args.num_processes
    task_args = [
        (args.file_prefix, i, rows_per_process) for i in range(args.num_processes)
    ]

    with mp.Pool(processes=args.num_processes) as pool:
        pool.starmap(worker, task_args)
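
For completeness, here's one possible body for the elided worker. This is a sketch, not the post's exact code: the per-part file naming is an assumption, and it reuses the Mimesis fake_record defined earlier.

import csv

from mimesis import Address, Person
from mimesis.locales import Locale


def worker(file_prefix: str, part: int, n: int):
    # Hypothetical implementation: each process creates its own
    # providers and writes n rows to its own file, e.g. "data-0.csv"
    person = Person(Locale.EN)
    address = Address(Locale.EN)
    with open(f"{file_prefix}-{part}.csv", "w", newline="") as csvfile:
        writer = csv.DictWriter(
            csvfile, quoting=csv.QUOTE_ALL, fieldnames=["name", "address"]
        )
        writer.writeheader()
        for _ in range(n):
            # fake_record is the Mimesis version from earlier
            writer.writerow(fake_record(person, address))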

Row count | Time taken
100K      | 0.3s
1M        | 1.3s
10M       | 12s
100M      | 103s
1B        | 17m

Much better. Using eight cores makes it roughly seven times faster (80s down to 12s for 10M rows), close to the ideal 8x speedup.

Plus, this table shows that we can generate 100x more data in roughly the same time as single-core Faker: 100M rows in 103s, versus 1M rows in 122s.