DataHub

Getting Started

This guide gives a short introduction to DataHub through a practical example.

Create a Python virtual environment

python -m venv env

Activate the virtual environment

On macOS and Linux:

source env/bin/activate

On Windows:

.\env\Scripts\activate

Install datahub-core via pip

pip install datahub-core
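
Before moving on, it can help to confirm that the virtual environment is active and that the package is importable. The following is a minimal sketch using only the standard library. The helper names in_virtualenv and is_installed are hypothetical, and the import name datahub_core is taken from the example later in this guide.

```python
import importlib.util
import sys


def in_virtualenv():
    """Return True when the interpreter is running inside a venv."""
    # In a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


def is_installed(module_name):
    """Return True when a top-level module can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None


print("virtual env active:", in_virtualenv())
print("datahub_core importable:", is_installed("datahub_core"))
```

If the second line prints False, the most likely cause is that pip ran outside the activated environment.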

Create your first sample project

Recommended folder structure

├── data
│   ├── data.csv
│   ├── data.xlsx
├── project_name
│   ├── __init__.py
│   ├── generate.py
├── tests
│   ├── __init__.py
├── run.py
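
The layout above can be created by hand, or with a short script such as the sketch below. The function name scaffold is hypothetical. It creates only the folders and the empty Python files, and leaves the sample data files (data.csv, data.xlsx) for you to supply. The demonstration runs in a temporary directory so that nothing is written to the current folder.

```python
import tempfile
from pathlib import Path


def scaffold(root, project_name="project_name"):
    """Create the recommended data, package, and tests folders under root."""
    root = Path(root)
    # Directories from the recommended layout.
    for folder in ("data", project_name, "tests"):
        (root / folder).mkdir(parents=True, exist_ok=True)
    # Empty placeholder files; touch() does not change existing content.
    for rel in (
        f"{project_name}/__init__.py",
        f"{project_name}/generate.py",
        "tests/__init__.py",
        "run.py",
    ):
        (root / rel).touch()
    return root


# Demonstrate in a throwaway directory and list what was created.
with tempfile.TemporaryDirectory() as tmp:
    created = sorted(
        p.relative_to(tmp).as_posix() for p in scaffold(tmp).rglob("*")
    )
print(created)
```

For a real project, call scaffold with the target directory and your own package name instead of the temporary directory.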

Most of your project's source code lives in the project_name folder, which is where the core data generation logic goes. run.py is a thin wrapper that serves as the project's entry point.
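
As an illustration of what such a wrapper might contain, the sketch below calls a generation function and saves the result as CSV. It is not the run.py shipped with the demo project. It assumes that project_name/generate.py exposes a run function that accepts a seed and returns a pandas DataFrame, as in the example later in this guide. The function name main and the output path data/output.csv are hypothetical.

```python
from pathlib import Path


def main(generate, out_path="data/output.csv", seed=130319810):
    """Call the project's generation function and save the result as CSV."""
    # `generate` is assumed to accept a seed and return a pandas DataFrame.
    df = generate(seed=seed)
    out = Path(out_path)
    # Create the output folder if it does not exist yet.
    out.parent.mkdir(parents=True, exist_ok=True)
    # index=False keeps the DataFrame's row index out of the file.
    df.to_csv(out, index=False)
    return out
```

In run.py this could be wired up by importing run from project_name.generate and calling main(run). Passing the function in as an argument keeps the wrapper free of generation logic and easy to test.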

Generate your synthetic data

Once your folder structure follows the example in the section above, you can write the code that generates your synthetic data. An example is available in the folder examples/demo; the key data generation logic sits in the file examples/demo/demo/generate.py.

import numpy as np
import datahub_core.generators as gen

def run(seed=130319810):

    # Each key in props becomes a field in the generated data.
    df = gen.generate(
        props={
            'region': gen.choice(data=['NAM']),
            'country': gen.country_codes(region_field='region'),
            'person_name': gen.person('country'),
            'age': gen.random_range(low=1, high=100, round_dp=0),
        },
        count=50,
        # A fixed seed makes the generated data reproducible.
        randomstate=np.random.RandomState(seed)
    ).to_dataframe()

    # Read the currency before 'country' is overwritten with its alpha-3 code.
    df['ccy'] = df['country'].apply(lambda x: x.currency)
    df['country'] = df['country'].apply(lambda x: x.alpha3_code)

    return df

This example generates fake person names in the NAM region, each with a country code, currency information, and a random age. To build your own synthetic data generation model, define a run function that uses the generators in the datahub_core.generators module to produce your data. gen.choice, gen.country_codes, and gen.random_range are all generators defined in datahub_core/generators/data_frame.
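
The last two assignments in the example depend on their order: the country field initially holds objects, and the currency has to be read from them before the field is overwritten with the alpha-3 code. The sketch below mirrors that step with plain dictionaries. The Country namedtuple is a hypothetical stand-in that models only the two attributes used in the example, not the library's actual country type.

```python
from collections import namedtuple

# Hypothetical stand-in for the country objects in the generated data.
Country = namedtuple("Country", ["alpha3_code", "currency"])

rows = [
    {"country": Country("USA", "USD")},
    {"country": Country("CAN", "CAD")},
]

for row in rows:
    # Read the currency first, while 'country' still holds the object.
    row["ccy"] = row["country"].currency
    # Then replace the object with its plain alpha-3 code.
    row["country"] = row["country"].alpha3_code

print(rows)
```

Swapping the two assignments would fail, because the second one would then try to read .currency from a plain string.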

Enjoy your journey into synthetic data generation!

Proud member of the Fintech Open Source Foundation

Copyright © 2020 DataHub - Citigroup