Bradford Stephens edited this page Mar 14, 2016 · 3 revisions

Sinks

BigQuery

Sinking to BigQuery is rather simple:

  1. Define a schema in a .json file
  2. Declare the sink in the config file
  3. Make sure your pipeline code creates the correct JSON output

BigQuery Schema

Terraform needs to know the BigQuery schema so it can create and manage tables. This same schema is used inside angled-dream to map the JSON output to BigQuery fields.

This is an example orion.json file; notice the nesting made possible by the 'fields' key:

[
  {"name": "event", "type": "string", "mode": "nullable"},
  {"name": "owts", "type": "string", "mode": "nullable"},
  {"name": "data", "type": "record", "mode": "nullable",
   "fields": [
     {"name": "action", "type": "record", "mode": "nullable",
      "fields": [{"name": "name", "type": "string", "mode": "nullable"}]},
     {"name": "test_name", "type": "string", "mode": "nullable"}
   ]}
]

Check this file into your pipeline-controller GitHub repo.
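Before committing, it can help to sanity-check that the schema file is at least structurally valid. The sketch below is illustrative only (not part of Sossity or angled-dream): it does a minimal brace/bracket balance check on a schema string. It deliberately ignores brackets inside string literals for brevity; a real check should parse the file with a proper JSON library such as Gson.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class SchemaCheck {
    // Returns true if every '{' and '[' has a matching '}' or ']' in order.
    // Note: this ignores brackets inside string literals; use a real JSON
    // parser (e.g. Gson's JsonParser) for a full validity check.
    static boolean balanced(String json) {
        Deque<Character> stack = new ArrayDeque<>();
        for (char c : json.toCharArray()) {
            if (c == '{' || c == '[') {
                stack.push(c);
            } else if (c == '}' || c == ']') {
                if (stack.isEmpty()) return false;
                char open = stack.pop();
                if ((c == '}' && open != '{') || (c == ']' && open != '[')) return false;
            }
        }
        return stack.isEmpty();
    }

    public static void main(String[] args) {
        String schema = "[{\"name\": \"event\", \"type\": \"string\", \"mode\": \"nullable\"}]";
        System.out.println(balanced(schema)); // prints true
    }
}
```

A schema that fails this check (for example, one with a stray trailing bracket) will also fail Terraform's table creation, so it is cheap to catch early.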

Config File

Declare the BigQuery resource in the Sossity config file so Sossity can use Terraform to create the table and dataset. Note that the dataset and table will not be deleted even if you remove them from the config file.

An example entry under the :sinks key:

"hxtrialorionbq" {:type "bq"
                  :bigQueryDataset "hx_trial_orion_staging"
                  :bigQueryTable "hx_trial_orion"
                  :bigQuerySchema "/home/ubuntu/pipeline-controller/orion.json"}

The path prefix for :bigQuerySchema will always be /home/ubuntu/pipeline-controller.

Sossity will read this config file and schema to create the necessary BigQuery table and dataset. angled-dream will also use the :bigQuerySchema file to turn JSON strings into BigQuery objects.

Pipeline

To write data to BigQuery, simply make sure a pipeline emits a JSON string with the correct structure and field names. There is no special data structure for BigQuery -- just use the field names you desire, and the sink will figure out the rest.

For example, if the BigQuery table schema looks like:

{"event":<str>,
 "owts":<str>,
 "data":{"action":<str>}}

then have your pipeline emit:

{"event":"webapp",
  "owts":"v1",
  "data":{"action":"click"}}

In Java, this would look something like:

import com.google.gson.Gson;
import java.util.HashMap;
import java.util.Map;

Gson gson = new Gson();

Map<String, Object> topMap = new HashMap<>();
Map<String, String> dataMap = new HashMap<>();

topMap.put("event", "webapp");
topMap.put("owts", "v1");
dataMap.put("action", "click");
topMap.put("data", dataMap);

String output = gson.toJson(topMap);
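If you want to assert the exact string your pipeline emits, note that Gson serializes a Map in its iteration order, so a plain HashMap gives no ordering guarantee. The sketch below is illustrative only (the tiny serializer is a stand-in, not Gson or angled-dream code): it uses LinkedHashMap so key order is deterministic.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class EmitExample {
    // Minimal serializer for nested maps of strings, for deterministic
    // output in examples/tests. In a real pipeline, use Gson as above.
    @SuppressWarnings("unchecked")
    static String toJson(Map<String, Object> map) {
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, Object> e : map.entrySet()) {
            if (!first) sb.append(",");
            first = false;
            sb.append("\"").append(e.getKey()).append("\":");
            Object v = e.getValue();
            if (v instanceof Map) {
                sb.append(toJson((Map<String, Object>) v)); // recurse into nested records
            } else {
                sb.append("\"").append(v).append("\"");
            }
        }
        return sb.append("}").toString();
    }

    public static void main(String[] args) {
        Map<String, Object> data = new LinkedHashMap<>();
        data.put("action", "click");
        Map<String, Object> top = new LinkedHashMap<>();
        top.put("event", "webapp");
        top.put("owts", "v1");
        top.put("data", data);
        System.out.println(toJson(top));
        // → {"event":"webapp","owts":"v1","data":{"action":"click"}}
    }
}
```

The nested-map structure mirrors the schema's 'fields' nesting: the "data" key maps to another map, which serializes as a nested JSON object matching the record field.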