Real data is best, but sometimes it's not available - or it's not allowed. Fake data is useful for testing, prototyping, demos, and training models, and it's quick and easy to generate.
In Python, there are a few libraries for generating fake data. The two we'll look at here are Faker and Mimesis.
We'll go from a slow, simple example to a fast, multi-core one.
Faker is slow
Faker has a simple API, and it's very extensible.
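For instance, adding a domain-specific field only requires subclassing `BaseProvider`. A minimal sketch - the provider name and values here are invented for illustration, not part of Faker:

```python
from faker import Faker
from faker.providers import BaseProvider


class DepartmentProvider(BaseProvider):
    # Hypothetical provider: "department" is not a built-in Faker field.
    def department(self) -> str:
        return self.random_element(["Sales", "Engineering", "Support"])


fake = Faker()
fake.add_provider(DepartmentProvider)
print(fake.department())  # e.g. "Engineering"
```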
Here's a quick Python program that takes a few arguments and produces a CSV file of fake data.
```python
import argparse
import csv

from faker import Faker
from typing import Dict


def fake_record(fake: Faker) -> Dict:
    return {
        "name": fake.name(),
        "address": fake.address(),
    }


if __name__ == "__main__":
    parser = argparse.ArgumentParser(prog="Generate Data")
    parser.add_argument("-f", "--file", default="data.csv")
    parser.add_argument("-n", "--num_rows", type=int)
    args = parser.parse_args()

    fake = Faker()
    records = (fake_record(fake) for _ in range(args.num_rows))

    with open(args.file, "w", newline="") as csvfile:
        writer = csv.DictWriter(
            csvfile, quoting=csv.QUOTE_ALL, fieldnames=["name", "address"]
        )
        writer.writeheader()
        for row in records:
            writer.writerow(row)
```
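With the script saved as simple_faker.py (the name used in the profiling run below), generating 100K rows looks like this:

```
python simple_faker.py -n 100000 -f data.csv
```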
Running this code as a benchmark yielded the following results.
Row count | Time taken
---|---
100K | 12s
1M | 122s
Almost two minutes! That's a long time to wait for not much data.
Using cProfile, it's easy to see where the time is spent. There's the overhead of cProfile to consider, but it's clear that the bulk of the time is spent in Faker.
```
# time python -m cProfile -s cumulative simple_faker.py -n 100000
40287037 function calls (37980098 primitive calls) in 21.911 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
112/1 0.001 0.000 21.911 21.911 {built-in method builtins.exec}
1 0.038 0.038 21.911 21.911 simple_faker.py:1(<module>)
100001 0.030 0.000 21.483 0.000 simple_faker.py:21(<genexpr>)
100000 0.138 0.000 21.452 0.000 simple_faker.py:7(fake_record)
467966/200000 0.124 0.000 19.842 0.000 generator.py:161(parse)
1031791/200001 0.823 0.000 19.787 0.000 {method 'sub' of 're.Pattern' objects}
1200130/597340 0.800 0.000 19.470 0.000 generator.py:177(__format_token)
1200130/597340 0.573 0.000 18.880 0.000 generator.py:84(format)
1292977 0.492 0.000 15.717 0.000 __init__.py:528(random_element)
1292977 7.742 0.000 15.225 0.000 __init__.py:405(random_elements)
100000 0.049 0.000 13.479 0.000 __init__.py:68(address)
170645 0.047 0.000 7.501 0.000 __init__.py:211(last_name)
100000 0.048 0.000 7.460 0.000 __init__.py:201(name)
1292977 1.714 0.000 7.087 0.000 distribution.py:57(choices_distribution)
89322 0.041 0.000 6.919 0.000 __init__.py:55(street_address)
```
Mimesis is faster
Let's change the `fake_record` function to use Mimesis. This library claims to be much faster and has published benchmarks.
```python
from mimesis import Address, Person
from mimesis.locales import Locale


def fake_record(person: Person, address: Address) -> Dict:
    return {
        "name": person.full_name(),
        "address": f"{address.address()}, {address.city()}, {address.postal_code()}",
    }
```
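The snippet above only swaps the record function; the providers are constructed once and passed in. A minimal sketch of the wiring, where the `Locale.EN` choice is an assumption:

```python
if __name__ == "__main__":
    # Argument parsing is unchanged from the Faker version (elided here).
    # Locale.EN is an assumption; Mimesis ships many locales.
    person = Person(Locale.EN)
    address = Address(Locale.EN)
    records = (fake_record(person, address) for _ in range(args.num_rows))
    # ...the CSV-writing loop is also unchanged.
```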
The Address object is a little odd: we have to call it three times to build something similar to Faker's single address. However, it's much faster. This time, let's generate up to 10 million rows.
Row count | Time taken
---|---
100K | 0.8s
1M | 8s
10M | 80s
Mimesis is ~15x faster at generating data than Faker.
Multiprocessing to use all CPU cores
By default, a Python process runs on a single CPU core. We can easily use all of the cores with the multiprocessing library.
The simplest approach here is a shared-nothing design. Each core will write to its own CSV file, and there are no race conditions or synchronization problems.
If it takes 80 seconds to generate 10 million rows on one core, how long could it take with eight cores? Let's find out:
```python
import multiprocessing as mp


def worker(file_prefix: str, part: int, n: int):
    ...


if __name__ == "__main__":
    ...
    rows_per_process = args.num_rows // args.num_processes
    task_args = [
        (args.file_prefix, i, rows_per_process) for i in range(args.num_processes)
    ]
    pool = mp.Pool(processes=args.num_processes)
    pool.starmap(worker, task_args)
    pool.close()
```
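The worker body is elided above. Here's a minimal sketch of what it could contain, reusing the Mimesis `fake_record` from earlier; the part-file naming scheme is an assumption:

```python
def worker(file_prefix: str, part: int, n: int):
    # Each process builds its own providers and writes its own file,
    # so nothing is shared between cores.
    person = Person(Locale.EN)
    address = Address(Locale.EN)
    with open(f"{file_prefix}_{part}.csv", "w", newline="") as csvfile:
        writer = csv.DictWriter(
            csvfile, quoting=csv.QUOTE_ALL, fieldnames=["name", "address"]
        )
        writer.writeheader()
        for _ in range(n):
            writer.writerow(fake_record(person, address))
```

Because every process opens its own distinct file, no locks or queues are needed.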
Row count | Time taken
---|---
100K | 0.3s
1M | 1.3s
10M | 12s
100M | 103s
1B | 17m
Much better. As expected, using eight cores makes it roughly eight times faster! Plus, this table shows that we can generate 100x more data in about the same time single-core Faker took: 100 million rows in 103 seconds, versus 1 million rows in 122 seconds.