How to create fake data using Python

We often wish we had more data to train or test a machine learning model. What if I showed you how to create fake data using Python?

Faker is a Python package that generates fake data for you. Whether you need to fill a spreadsheet, create good-looking documents, or train and test a model, Faker is what you need.

How to install Faker

In a command prompt, run:

pip install Faker

How Faker works

from faker import Faker
fake = Faker()

From the faker package, import the Faker class. Calling Faker() creates a generator object whose methods produce the fake data.

fake.name()
#Tyler Newton
fake.address()
#'7345 Shawn Port Suite 576 \n West Issacbury, 28544'
fake.text()
#'Deep relationship into indicate. Both item south gas drug. Possible strategy manage success.'

We have just created a fake name, address, and text. Isn’t that dope? Each generator property, such as ‘name’, ‘address’, and ‘text’, is known as a “fake”. A Faker generator is made up of many of these, bundled into ‘providers’.

Providers in faker package

Each provider mentioned in the image can generate specific fake data as per your needs.

Localization

If an analyst wants to generate country-specific fake data, Faker can do that too.

Locales present in faker package

Let’s say you live in India and want to generate data in the local language (Hindi). You can tell Faker to use that locale by passing ‘hi_IN’ as an argument when you create the object.

If you don’t pass a locale, Faker defaults to ‘en_US’ (United States).

from faker import Faker
fake = Faker('hi_IN')
for _ in range(10):
    print(fake.name())

सरला गुप्ता
रामशर्मा, भरत
गावित, जया
सरस्वती मंडल
रिया गावित
नाम, जितेन्द्र
किरण रामलला
शर्मिला गर्ग
गांगुली, रोहन
महाराज, जयन्ती

Let’s create an entire fake dataset of a thousand observations from scratch in roughly fifteen lines of code.

import pandas as pd  # needed for the DataFrame below

# 'fake' is the Faker object created earlier
ls_name = []
ls_address = []
ls_fathername = []
ls_dob = []
ls_barcode = []
ls_phone = []
for _ in range(1000):
    ls_name.append(fake.name())
    ls_address.append(fake.address())
    ls_fathername.append(fake.name_male())
    # date of birth between 30 years ago (the default start) and 18 years ago
    ls_dob.append(fake.date_between(end_date="-18y"))
    ls_barcode.append(fake.ean(length=13))
    ls_phone.append(fake.phone_number())
df = pd.DataFrame({
    'name': ls_name,
    'address': ls_address,
    'father_name': ls_fathername,
    'DOB': ls_dob,
    'barcode': ls_barcode,
    'phone number': ls_phone,
})

df.head()

For more details, see the extended docs

About the author

Harsh Bhojwani

Hi, I'm Harsh Bhojwani, an aspiring blogger with an obsession for all things related to Data Science. This blog is dedicated to helping people learn Data Science.
