
Colin Webb

Fake Data With Python

Real data is best, but sometimes it's not available - or it's not allowed.

Fake data is useful for testing and prototyping. It's also useful for generating demo data and for training models. Generating fake data is quick and easy.

In Python, there are a few libraries for generating fake data. The two we'll look at here are Faker and Mimesis.

We'll go from a slow, simple example to a fast, multi-core one.

Faker is slow

Faker has a simple API, and it is very extensible.
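For example, here's a minimal sketch of that API, including a custom provider (the `ProductProvider` class and its `product_code` method are made up for illustration; `bothify` replaces `?` with random letters and `#` with random digits):

```python
from faker import Faker
from faker.providers import BaseProvider

# Faker is extended by registering provider classes
class ProductProvider(BaseProvider):
    def product_code(self) -> str:
        # two letters, a dash, four digits, e.g. "XK-4821"
        return self.bothify("??-####").upper()

fake = Faker()
fake.seed_instance(0)  # seed for reproducible output
fake.add_provider(ProductProvider)

print(fake.name())
print(fake.address())
print(fake.product_code())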

Here's a quick Python program that takes a few arguments and produces a CSV file of fake data.

import argparse
import csv
from typing import Dict

from faker import Faker

def fake_record(fake: Faker) -> Dict:
    return {
        "name": fake.name(),
        "address": fake.address(),
    }

if __name__ == "__main__":
    parser = argparse.ArgumentParser(prog="Generate Data")
    parser.add_argument("-f", "--file", default="data.csv")
    parser.add_argument("-n", "--num_rows", type=int)
    args = parser.parse_args()

    fake = Faker()
    records = (fake_record(fake) for _ in range(args.num_rows))

    with open(args.file, "w", newline="") as csvfile:
        writer = csv.DictWriter(
            csvfile, quoting=csv.QUOTE_ALL, fieldnames=["name", "address"]
        )
        writer.writeheader()
        for row in records:
            writer.writerow(row)

Running this code as a benchmark yielded the following results.

Row count | Time taken

Almost two minutes! That's a long time to wait for not much data.

Using cProfile, it's easy to see where the time is spent. There's the overhead of cProfile to consider, but it's clear that the bulk of the time is spent in Faker.
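As an aside, the same stats can be collected programmatically with the standard library's `cProfile` and `pstats` modules, which is handy when you only want to profile one section of a script (the `slow_sum` workload below is a stand-in, not the article's generator):

```python
import cProfile
import io
import pstats

def slow_sum(n: int) -> int:
    # stand-in workload for the CSV generator
    return sum(i * i for i in range(n))

profiler = cProfile.Profile()
profiler.enable()
total = slow_sum(100_000)
profiler.disable()

# sort by cumulative time and print the top entries, as `-s cumulative` does
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(3)
print(stream.getvalue())
```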

# time python -m cProfile -s cumulative -n 100000
         40287037 function calls (37980098 primitive calls) in 21.911 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    112/1    0.001    0.000   21.911   21.911 {built-in method builtins.exec}
        1    0.038    0.038   21.911   21.911 <module>
   100001    0.030    0.000   21.483    0.000 <genexpr>
   100000    0.138    0.000   21.452    0.000
467966/200000    0.124    0.000   19.842    0.000
1031791/200001    0.823    0.000   19.787    0.000 {method 'sub' of 're.Pattern' objects}
1200130/597340    0.800    0.000   19.470    0.000
1200130/597340    0.573    0.000   18.880    0.000
  1292977    0.492    0.000   15.717    0.000
  1292977    7.742    0.000   15.225    0.000
   100000    0.049    0.000   13.479    0.000
   170645    0.047    0.000    7.501    0.000
   100000    0.048    0.000    7.460    0.000
  1292977    1.714    0.000    7.087    0.000
    89322    0.041    0.000    6.919    0.000

Mimesis is faster

Let's change the fake_record function to use Mimesis. This library claims to be much faster, and has published benchmarks.

from typing import Dict

from mimesis import Address, Person
from mimesis.locales import Locale

def fake_record(person: Person, address: Address) -> Dict:
    return {
        "name": person.full_name(),
        "address": f"{address.address()}, {address.city()}, {address.postal_code()}",
    }

The Address object is a little odd: we have to call three of its methods to get something similar to Faker's single address() call. However, it's much faster. This time, let's generate 10 million rows.

Row count | Time taken

Mimesis is ~15x faster at generating data than Faker.

Multiprocessing to use all CPU cores

By default, Python will run on a single CPU core. We can easily use all CPU cores by using the multiprocessing library.
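Before wiring it into the generator, here's a minimal sketch of the `multiprocessing.Pool` API on its own (the `square` and `parallel_squares` functions are illustrative only):

```python
import multiprocessing as mp

def square(n: int) -> int:
    return n * n

def parallel_squares(values):
    # distribute the calls across four worker processes
    with mp.Pool(processes=4) as pool:
        return pool.map(square, values)

if __name__ == "__main__":
    print(parallel_squares(range(5)))  # [0, 1, 4, 9, 16]
```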

The simplest approach here is a shared-nothing design. Each core will write to its own CSV file, and there are no race conditions or synchronization problems.

If it takes 80 seconds to generate 10 million rows on one core, how long could it take with eight cores? Let's find out:

import multiprocessing as mp

def worker(file_prefix: str, part: int, n: int):
    # Shared-nothing: each process writes its own CSV file.
    person, address = Person(), Address()
    with open(f"{file_prefix}-{part}.csv", "w", newline="") as csvfile:
        writer = csv.DictWriter(
            csvfile, quoting=csv.QUOTE_ALL, fieldnames=["name", "address"]
        )
        writer.writeheader()
        writer.writerows(fake_record(person, address) for _ in range(n))

if __name__ == "__main__":
    parser = argparse.ArgumentParser(prog="Generate Data")
    parser.add_argument("-f", "--file_prefix", default="data")
    parser.add_argument("-n", "--num_rows", type=int)
    parser.add_argument("-p", "--num_processes", type=int, default=mp.cpu_count())
    args = parser.parse_args()

    rows_per_process = args.num_rows // args.num_processes
    task_args = [
        (args.file_prefix, i, rows_per_process) for i in range(args.num_processes)
    ]

    with mp.Pool(processes=args.num_processes) as pool:
        pool.starmap(worker, task_args)

Row count | Time taken

Much better. As expected, using eight cores makes it around eight times faster!

Plus, this table shows that we can generate 100x more data in the same time as single-core Faker.