Getting Started · DataHub

The following guide gives a short introduction to DataHub through a practical example.

Create a Python virtual environment

python -m venv env

Activate the virtual environment

On macOS and Linux:

source env/bin/activate

On Windows:

.\env\Scripts\activate

Install datahub-core via pip

pip install datahub-core

Create your first sample project

Recommended folder structure

├── data
│   ├── data.csv
│   ├── data.xlsx
├── project_name
│   ├── __init__.py
│   ├── generate.py
├── tests
│   ├── __init__.py
├── run.py
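
The layout above can be created by hand. As a convenience, here is a minimal sketch of a helper that creates it with the standard library. The function name scaffold and its parameters are hypothetical, not part of datahub-core, and the sample files data.csv and data.xlsx are left for you to supply.

```python
from pathlib import Path


def scaffold(root, project_name="project_name"):
    """Create the recommended folders and empty Python files under `root`.

    Hypothetical helper for illustration only. Existing files are left
    untouched, and the data folder is created empty.
    """
    root = Path(root)
    (root / "data").mkdir(parents=True, exist_ok=True)

    files = [
        root / project_name / "__init__.py",
        root / project_name / "generate.py",
        root / "tests" / "__init__.py",
        root / "run.py",
    ]
    for path in files:
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch(exist_ok=True)  # does not overwrite existing content
    return files
```

Calling scaffold(".", "demo") from an empty directory would produce the tree shown above, with demo in place of project_name.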

Most of your project's source code lives in the project_name folder, which is where the core data generation logic goes. run.py is a thin wrapper that serves as the project's entry point.
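
The guide does not show what run.py contains, so the following is only one possible sketch. It assumes that project_name/generate.py exposes a run function that accepts a seed and returns a pandas DataFrame. The main function and the --seed and --output options are hypothetical names chosen for illustration.

```python
# run.py: a hypothetical entry point, not shipped with datahub-core
import argparse


def main(run_fn, argv=None):
    """Parse command-line options, call the project's run function, save the result."""
    parser = argparse.ArgumentParser(description="Generate synthetic data")
    parser.add_argument("--seed", type=int, default=130319810)
    parser.add_argument("--output", default="output.csv")
    args = parser.parse_args(argv)

    df = run_fn(seed=args.seed)          # assumed to return a pandas DataFrame
    df.to_csv(args.output, index=False)  # write the rows without the index column
    return args.output


# In a real project the wiring could look like this (assumed module path):
#
#     from project_name.generate import run
#
#     if __name__ == "__main__":
#         main(run)
```

Passing the run function in as an argument keeps the wrapper free of generation logic, so it can be exercised in tests with a stand-in function.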

Generate your synthetic data

Once your folder structure follows the example in the section above, you can write the code that generates your synthetic data. There is an example in the folder examples/demo; the key data generation logic sits in the file examples/demo/demo/generate.py.

import numpy as np
import datahub_core.generators as gen

def run(seed=130319810):

    df = gen.generate(
        props={
            'region': gen.choice(data=['NAM']),
            'country': gen.country_codes(region_field='region'),
            'person_name': gen.person('country'),
            'age': gen.random_range(low=1, high=100, round_dp=0),
        },
        count=50,
        randomstate=np.random.RandomState(seed)
    ).to_dataframe()

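    # Each 'country' value is an object: read its currency first,
    # then replace the object with its three-letter country code.
    df['ccy'] = df['country'].apply(lambda x: x.currency)
    df['country'] = df['country'].apply(lambda x: x.alpha3_code)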

    return df

This example generates fake person names in the NAM region, each with a country code, currency information, and a random age. To build your own synthetic data generation model, define a run function that uses the generators in the datahub_core.generators module to produce your data. gen.choice, gen.country_codes, and gen.random_range are all generators defined in datahub_core/generators/data_frame.
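
Since the recommended layout includes a tests folder, a simple sanity check on the generated rows may be useful. The sketch below is a hypothetical helper, not part of datahub-core. It works on plain dictionaries, and it assumes the output keeps the region and age columns from the example and that ages fall between 1 and 100 inclusive.

```python
def check_records(records, expected_count=50):
    """Return a list of problems found in the generated rows (empty if none).

    Hypothetical helper for illustration; the column names and bounds
    mirror the example's props and are assumptions about its output.
    """
    problems = []
    if len(records) != expected_count:
        problems.append(f"expected {expected_count} rows, got {len(records)}")
    for i, row in enumerate(records):
        region = row.get("region")
        if region != "NAM":
            problems.append(f"row {i}: unexpected region {region!r}")
        age = row.get("age")
        if age is None or not 1 <= age <= 100:
            problems.append(f"row {i}: age {age!r} outside 1..100")
    return problems
```

With pandas, the rows can be obtained from the result of run through df.to_dict("records"), so a test could assert that check_records(run().to_dict("records")) returns an empty list.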

Enjoy your journey into synthetic data generation!