Merge pull request #43 from fhtino/main
Added Contoso Data Generator V2
marcosqlbi authored Jul 22, 2024
2 parents 2c3f8f6 + 7f273e6 commit 4ea07b2
Showing 22 changed files with 377 additions and 147 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -5,3 +5,4 @@ _site
vendor
.DS_Store
Thumbs.db
.vs
3 changes: 3 additions & 0 deletions Gemfile.lock
@@ -215,6 +215,8 @@ GEM
racc (~> 1.4)
nokogiri (1.13.10-x86_64-darwin)
racc (~> 1.4)
nokogiri (1.13.10-x86_64-linux)
racc (~> 1.4)
octokit (4.25.1)
faraday (>= 1, < 3)
sawyer (~> 0.9)
@@ -256,6 +258,7 @@ GEM
PLATFORMS
arm64-darwin-23
universal-darwin-22
x86_64-linux

DEPENDENCIES
github-pages (~> 228)
37 changes: 37 additions & 0 deletions _mydocs/contoso-data-generator/config-data.md
@@ -0,0 +1,37 @@
---
layout: page
title: Configuration data (data.xlsx)
menu_title: Configuration data
published: true
order: /07
---

The Excel configuration file contains both fixed data and parameters to control the distribution of random data. For example, from here you can decide the relative percentage of orders for categories and subcategories.

The file contains several sheets, described in the sections that follow. Each sheet contains multiple columns. The software reads some of the columns, recognizing them by name. Columns whose names do not follow the naming requirements of the software are ignored. Columns used by the software are colored yellow. Columns in any other color are treated as comments and are useful only to human readers.

### Categories
From here you can configure sales of categories using two curves: W and PPC. "W" defines the relative weight of each category in the set of all categories for different periods in the entire timeframe. "PPC" (Price percent) defines the variation in the price of items of each category during the whole period. Normally the last column is 100%.

### Subcategories
From here you can configure sales of subcategories using a weight curve with columns marked with W. The values are used to define the weight of a subcategory inside its category. Therefore, the numbers are summed by category and then used to weight subcategories inside the category.
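
One way to read this rule, as a minimal sketch: the W values of all subcategories in a category are summed, and the share of each subcategory is its weight divided by that sum. The category names, subcategory names, and numbers are made up for illustration, and the row layout is hypothetical; the generator's own code may differ.

```python
from collections import defaultdict


def subcategory_shares(rows):
    """Compute the share of each subcategory inside its category.

    rows: list of (category, subcategory, weight) tuples for a single
    W column. This layout is hypothetical, not the actual sheet format.
    """
    totals = defaultdict(float)
    for category, _subcategory, weight in rows:
        totals[category] += weight  # sum the weights by category

    return {
        (category, subcategory): (weight / totals[category] if totals[category] else 0.0)
        for category, subcategory, weight in rows
    }


# Made-up sample data
rows = [
    ("Audio", "Headphones", 30.0),
    ("Audio", "Speakers", 10.0),
    ("Cameras", "Lenses", 5.0),
]
shares = subcategory_shares(rows)
```

Because the weights are relative, only the ratios between subcategories of the same category matter in this reading.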

### Subcatlinks
On this page, you can configure the likelihood that a product in one subcategory triggers the purchase of a product in another subcategory. Values are percentages: `<17, 18, 80%>` means that if a product of subcategory 17 is added to an order, there is an 80% chance that a product of subcategory 18 will be added to the same order.
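
The link rule can be sketched as an independent random draw per link. This is only an illustration of the behavior described above; the function name and the triple layout are hypothetical, and the generator may implement the draw differently.

```python
import random


def linked_subcategories(subcategory, links, rng):
    """Return the subcategories triggered by adding a product of `subcategory`.

    links: list of (source, target, percent) triples, as in <17, 18, 80%>.
    Each matching link is drawn independently (an assumption).
    """
    triggered = []
    for source, target, percent in links:
        if source == subcategory and rng.random() < percent / 100.0:
            triggered.append(target)
    return triggered


links = [(17, 18, 80.0)]
rng = random.Random(42)  # fixed seed so the run is repeatable
trials = 10_000
hits = sum(1 for _ in range(trials) if linked_subcategories(17, links, rng))
hit_rate = hits / trials  # roughly 0.8 over many trials
```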

### Products
On this page, you configure, for each product, the initial price and the distribution of sales of the product over different periods. The weights identified in the W columns are relative to the subcategory to which the product belongs.

### CustomerClusters
On this page, you define clusters of customers. Each cluster is defined by two columns: OW (OrderWeight) and CW (CustomerWeight). OrderWeight defines the percentage of orders assigned to customers belonging to the cluster, whereas CustomerWeight defines the percentage of all customers assigned to the cluster.

For example, you can define a large cluster of customers that generates only a small number of orders. You can define any number of clusters.
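
To illustrate how the two weights interact, here is a minimal sketch that splits customers by CW and orders by OW. The cluster values and totals are made up, and normalizing each weight by the column sum is an assumption about how the values are interpreted.

```python
def cluster_allocation(clusters, total_customers, total_orders):
    """Split customers by CW and orders by OW across clusters.

    clusters: list of dicts with "OW" and "CW" keys (hypothetical layout).
    Weights are normalized by their column sum (an assumption).
    """
    ow_sum = sum(c["OW"] for c in clusters)
    cw_sum = sum(c["CW"] for c in clusters)
    result = []
    for c in clusters:
        customers = round(total_customers * c["CW"] / cw_sum)
        orders = round(total_orders * c["OW"] / ow_sum)
        result.append({
            "customers": customers,
            "orders": orders,
            "orders_per_customer": orders / customers if customers else 0.0,
        })
    return result


# A large cluster with few orders and a small cluster with many orders
clusters = [{"OW": 10, "CW": 80}, {"OW": 90, "CW": 20}]
alloc = cluster_allocation(clusters, total_customers=1000, total_orders=5000)
```

With these made-up numbers, the first cluster holds most of the customers but each of them places less than one order on average, while the small second cluster accounts for most of the orders.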

### GeoAreas
This page defines geographical areas, each with a set of weights that change the activity of the area over time. Each area is independent of the others, and geographical areas must be defined at the leaf level: no grouping is provided.
For each geographic area, you define the W columns to provide the activity spline.

### Stores
On this page, you enumerate the stores. For each store, you provide its geographical area and its opening and closing dates. A store is active only between the two dates.
You do not provide weight activity for the stores, as the behavior is dictated by the customer clusters. A special store with StoreID -1 defines the online store.
Each order is assigned to either the online store or to a local store depending on the country of the customer.
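
The store rules above can be sketched as follows. Several details are assumptions made only for this example: the dictionary layout, treating both dates as inclusive, using `None` for a store that has not closed, and falling back to the online store when no local store is active.

```python
from datetime import date

ONLINE_STORE_ID = -1  # the special StoreID described above


def is_active(store, day):
    """True if `day` falls between the open and close dates.

    Both bounds are inclusive and close=None means "still open";
    both choices are assumptions.
    """
    closed = store["close"]
    return store["open"] <= day and (closed is None or day <= closed)


def pick_store_id(stores, country, day, online):
    """Pick the online store or the first active local store in `country`."""
    if online:
        return ONLINE_STORE_ID
    for store in stores:
        if (store["id"] != ONLINE_STORE_ID
                and store["country"] == country
                and is_active(store, day)):
            return store["id"]
    return ONLINE_STORE_ID  # fallback when no local store is active (assumption)


# Made-up sample data
stores = [
    {"id": -1, "country": None, "open": date(2000, 1, 1), "close": None},
    {"id": 10, "country": "Italy", "open": date(2018, 1, 1), "close": date(2020, 12, 31)},
]
open_day_store = pick_store_id(stores, "Italy", date(2019, 6, 1), online=False)
closed_day_store = pick_store_id(stores, "Italy", date(2021, 1, 1), online=False)
```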
58 changes: 58 additions & 0 deletions _mydocs/contoso-data-generator/config-json.md
@@ -0,0 +1,58 @@
---
layout: page
title: Configuration file (config.json)
menu_title: Configuration file
published: true
order: /06
---

This file contains the main configuration of the data generator.
- **OrdersCount**: (int) total number of orders to be generated.

- **StartDT**: (datetime) date of the first order.

- **YearsCount**: (int) total number of years generated. Orders are distributed over the years.

- **CutDateBefore**, **CutDateAfter**: (datetime, optional) these two parameters let you generate data that starts on a day other than January 1st and ends on a day other than December 31st. Data before CutDateBefore and after CutDateAfter is removed.

- **CustomerPercentage** : percentage of customers to be used. Range: 0.001 - 1.000

- **OutputFormat** : format of the data to be generated. Values: CSV, PARQUET, DELTATABLE

- **SalesOrders** : type of data to be generated. Values: SALES, ORDERS, BOTH. SALES = creates the "sales" table. ORDERS = creates the "orders" and the "orders details" tables. BOTH = creates all of these tables.

- **CustomerFakeGenerator**: (int) number of full random customers. Only used during tests for speeding up the process.

- **DaysWeight** (section)

- **DaysWeightConstant**: (bool) if set to true, the configuration about days is ignored.

- **DaysWeightPoints**, **DaysWeightValues**: (double[]) points for interpolating the curve of distribution of orders over time. The curve covers the entire YearsCount period.

- **DaysWeightAddSpikes**: (bool) if set to false, annual spikes are ignored.

- **WeekDaysFactor**: (double[] - length 7) weight multiplication factor for each day of the week. The first day is Sunday.

- **DayRandomness**: (double) percentage of randomness added to days, to avoid having a too-perfect curve over time.

- **OrderRowsWeights**: (double[]) distribution of the number of rows per order. Each element is a weight. The first element is the weight of orders with one row, the second is the weight of orders with two rows, and so on.

- **OrderQuantityWeights**: (double[]) distribution of the quantity applied to each order row. Each element is a weight. The first element is the weight of rows with quantity=1, the second element is the weight of rows with quantity=2, and so on.

- **DiscountWeights**: (double[]) distribution of the discounts applied to order rows. Each element is a weight. The first element is the weight of rows with a discount of 0%, the second element is the weight of rows with a discount of 1%, and so on.

- **OnlinePerCent**: (double[]) distribution of the percentage of orders sold online, over the orders total.

- **DeliveryDateLambdaWeights**: (double[]) distribution of the days for delivery. The delivery date is computed by adding one day plus a random number generated using the distribution built from this parameter.

- **CountryCurrency**: table mapping Country to Currency

- **AnnualSpikes** : set of periods where orders show a spike. For each spike, you define the start day, the end day, and the multiplication factor.

- **OneTimeSpikes**: set of spikes with a fixed start and end date. For each spike, you define the start date, the end date, and the multiplication factor.

- **CustomerActivity** : contains the configuration for customer start/end date

- **StartDateWeightPoints**, **StartDateWeightValues**: configuration for the spline of customer start date

- **EndDateWeightPoints**, **EndDateWeightValues**: configuration for the spline of customer end dates
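
Two ideas recur in the parameters above: a curve defined by points and values, and arrays of weights where the position of an element gives the outcome. The sketch below illustrates both under stated assumptions. It uses plain piecewise-linear interpolation, which may not match the curve the generator actually builds, and all sample values are made up.

```python
import bisect
import random


def interpolate(points, values, x):
    """Piecewise-linear interpolation over (points, values).

    Clamps to the first/last value outside the range of `points`.
    Linear interpolation is an assumption; the tool may use another curve.
    """
    if x <= points[0]:
        return values[0]
    if x >= points[-1]:
        return values[-1]
    i = bisect.bisect_right(points, x)  # first point strictly greater than x
    x0, x1 = points[i - 1], points[i]
    y0, y1 = values[i - 1], values[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def sample_index(weights, rng):
    """Return index i with probability weights[i] / sum(weights)."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


# Made-up curve: weight 1 at point 0, rising to 3 at point 10
mid_weight = interpolate([0, 10], [1, 3], 5)

# Made-up OrderRowsWeights: the first element is the weight of one-row orders
order_rows_weights = [1, 1, 2]
rng = random.Random(0)
rows_in_order = sample_index(order_rows_weights, rng) + 1
```

The same index-based sampling would apply to OrderQuantityWeights (index 0 means quantity 1) and DiscountWeights (index 0 means a 0% discount), given how those arrays are described.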
123 changes: 0 additions & 123 deletions _mydocs/contoso-data-generator/databasegenerator.md

This file was deleted.

41 changes: 41 additions & 0 deletions _mydocs/contoso-data-generator/details.md
@@ -0,0 +1,41 @@
---
layout: page
title: Details
menu_title: Details
published: true
order: /01
---

## Data structure

Output elements:
- Customers
- Stores
- Dates
- CurrencyExchanges
- Sales
- Orders & OrderRows (optional)

Data schema (Sales version):

![Schema Sales](images/schema-sales.svg)


Data schema (Orders & OrderRows version):

![Schema Orders](images/schema-orders.svg)

<br/>

The Customers data set is filled with fake but realistic customer data.


## Pre-data-preparation: static data from SQLBI repository

The tool needs some files containing static data: fake customers, exchange rates, postal codes, etc. The files are cached in the cache folder specified as a parameter on the command line. If a file is not found in the cache folder, it is downloaded from a specific SQLBI repository. In normal usage, if you reuse the same cache folder, the files are downloaded only on the first run.
After downloading, some files are processed to create a consistent set of fake customers. The output file, customersall.csv, is placed in the cache folder. If you delete it, it is recreated on the next run.

https://github.com/sql-bi/Contoso-Data-Generator-V2-Data/releases/tag/static-files
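
The caching behavior described above follows a common "download only on a cache miss" pattern, sketched here. The function names and the file name `example.csv` are hypothetical, and the downloader is a stand-in that performs no network access; this is not the tool's code.

```python
import tempfile
from pathlib import Path


def ensure_cached(cache_dir, file_name, download):
    """Return the cached path, calling `download` only when the file is missing.

    download: callable(file_name, target_path), a hypothetical stand-in
    for whatever fetches the static files.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / file_name
    if not target.exists():
        download(file_name, target)
    return target


calls = []


def fake_download(name, target):
    """Pretend download: records the call and writes a placeholder file."""
    calls.append(name)
    target.write_text("placeholder")


with tempfile.TemporaryDirectory() as tmp:
    ensure_cached(tmp, "example.csv", fake_download)  # first run: downloads
    ensure_cached(tmp, "example.csv", fake_download)  # second run: cache hit
```

After both calls, `calls` holds a single entry, which mirrors the statement that files are downloaded only on the first run when the cache folder is reused.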



68 changes: 68 additions & 0 deletions _mydocs/contoso-data-generator/formats.md
@@ -0,0 +1,68 @@
---
layout: page
title: Output formats and related parameters
menu_title: Output formats
published: true
order: /02
next_reading: true
---

Each output format has specific parameters to set in config.json.

## CSV

| Parameter | Values | Notes |
| -- | -- | -- |
| OutputFormat | "CSV" | |
| CsvMaxOrdersPerFile | -1 or a number >1 | Maximum number of Orders per file |
| CsvGzCompression | 0 or 1 | Apply GZ compression to output CSV files |

For creating a single big CSV file:
```
"OutputFormat": "CSV",
"CsvMaxOrdersPerFile": -1,
"CsvGzCompression": 0
```

For creating multiple CSV files:
```
"OutputFormat": "CSV",
"CsvMaxOrdersPerFile": 50000,
"CsvGzCompression": 0
```

For creating multiple CSV.GZ files:
```
"OutputFormat": "CSV",
"CsvMaxOrdersPerFile": 50000,
"CsvGzCompression": 1
```
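
As a rough sketch of what these settings imply for a consumer of the output: the number of files follows from CsvMaxOrdersPerFile, and a reader can handle plain and GZ-compressed files with the same function. The file name, UTF-8 encoding, and comma delimiter are assumptions for the example, not documented properties of the generator's output.

```python
import csv
import gzip
import math
import os
import tempfile


def csv_file_count(orders_count, max_orders_per_file):
    """Expected number of order files; -1 means a single file."""
    if max_orders_per_file == -1:
        return 1
    return math.ceil(orders_count / max_orders_per_file)


def read_rows(path):
    """Read a .csv or .csv.gz file into a list of rows (UTF-8 assumed)."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", newline="", encoding="utf-8") as handle:
        return list(csv.reader(handle))


file_count = csv_file_count(120_000, 50_000)

# Round trip on a made-up compressed file
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.csv.gz")  # hypothetical file name
    with gzip.open(sample_path, "wt", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerow(["OrderKey", "Quantity"])
    sample_rows = read_rows(sample_path)
```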

## Parquet

| Parameter | Values | Notes |
| -- | -- | -- |
| OutputFormat | "PARQUET" | |
| ParquetOrdersRowGroupSize | integer | Number of orders per parquet Row Group. Default value is 500000. Do not change if not strictly required. |

Example:

```
"OutputFormat": "PARQUET"
```


## Delta Table

| Parameter | Values | Notes |
| -- | -- | -- |
| OutputFormat | "DELTATABLE" | |
| DeltaTableOrdersPerFile | integer | Number of orders per parquet file. |
| ParquetOrdersRowGroupSize | integer | Number of orders per parquet Row Group. Default value is 500000. Do not change if not strictly required. |

Example:

```
"OutputFormat": "DELTATABLE",
"DeltaTableOrdersPerFile": 250000
```
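
The value constraints in the three tables can be checked before running the generator. The helper below is a hypothetical pre-flight check written for this page, not part of the tool. Treating the CSV parameters as optional and DeltaTableOrdersPerFile as required are assumptions, since the tables do not state defaults for them.

```python
def check_output_settings(config):
    """Return a list of problems in the output-format settings (empty if none).

    config: dict loaded from config.json. The fallback values used when a
    key is missing are assumptions, not documented defaults.
    """
    problems = []
    fmt = config.get("OutputFormat")
    if fmt not in ("CSV", "PARQUET", "DELTATABLE"):
        problems.append(f"OutputFormat must be CSV, PARQUET or DELTATABLE, got {fmt!r}")
        return problems

    if fmt == "CSV":
        max_orders = config.get("CsvMaxOrdersPerFile", -1)
        if max_orders != -1 and max_orders <= 1:
            problems.append("CsvMaxOrdersPerFile must be -1 or a number > 1")
        if config.get("CsvGzCompression", 0) not in (0, 1):
            problems.append("CsvGzCompression must be 0 or 1")

    if fmt == "DELTATABLE":
        per_file = config.get("DeltaTableOrdersPerFile")
        if not isinstance(per_file, int) or per_file <= 0:
            problems.append("DeltaTableOrdersPerFile must be a positive integer")

    return problems


ok = check_output_settings({"OutputFormat": "DELTATABLE", "DeltaTableOrdersPerFile": 250000})
bad = check_output_settings({"OutputFormat": "CSV", "CsvMaxOrdersPerFile": 1, "CsvGzCompression": 2})
```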
Binary file not shown.