sidebar_position | sidebar_label |
---|---|
8 |
Shaping and Type Fusion |
Data that originates from heterogeneous sources typically has inconsistent structure and is thus difficult to reason about or query. To unify disparate data sources, data is often cleaned up to fit into a well-defined set of schemas, which combines the data into a unified store like a data warehouse.
In Zed, this cleansing process is called "shaping" the data, and Zed leverages its rich, super-structured type system to perform core aspects of data transformation. In a data model with nesting and multiple scalar types (such as Zed or JSON), shaping includes converting the type of leaf fields, adding or removing fields to "fit" a given shape, and reordering fields.
While shaping remains an active area of development, the core functions in Zed that currently perform shaping are:
cast
- coerce a value to a different typecrop
- remove fields from a value that are missing in a specified typefill
- add null values for missing fieldsorder
- reorder record fieldsshape
- applycast
,fill
, andorder
They all have the same signature, taking two parameters: the value to be transformed and a type value for the target type.
Another type of transformation that's needed for shaping is renaming fields, which is supported by the
rename
operator. Also, theyield
operator is handy for simply emitting new, arbitrary record literals based on input values and mixing in these shaping functions in an embedded record literal. Thefuse
aggregate function is also useful for fusing values into a common schema, though a type is returned rather than values.
In the examples below, we will use the following named type connection
that is stored in a file connection.zed
and is included in the example Zed queries with the -I
option of zq
:
type socket = { addr:ip, port:port=uint16 }
type connection = {
kind: string,
client: socket,
server: socket,
vlan: uint16
}
We also use this sample JSON input in a file called sample.json
:
{
"kind": "dns",
"server": {
"addr": "10.0.0.100",
"port": 53
},
"client": {
"addr": "10.47.1.100",
"port": 41772
},
"uid": "C2zK5f13SbCtKcyiW5"
}
The cast
function applies a cast operation to each leaf value that matches the
field path in the specified type, e.g.,
super -Z -I connection.zed -c 'cast(this, <connection>)' sample.json
casts the address fields to type ip
, the port fields to type port
(which is a named type for type uint16
) and the address port pairs to
type socket
without modifying the uid
field or changing the
order of the server
and client
fields:
{
kind: "dns",
server: {
addr: 10.0.0.100,
port: 53 (port=uint16)
} (=socket),
client: {
addr: 10.47.1.100,
port: 41772
} (socket),
uid: "C2zK5f13SbCtKcyiW5"
}
Cropping is useful when you want records to "fit" a schema tightly, e.g.,
super -Z -I connection.zed -c 'crop(this, <connection>)' sample.json
removes the uid
field since it is not in the connection
type:
{
kind: "dns",
server: {
addr: "10.0.0.100",
port: 53
},
client: {
addr: "10.47.1.100",
port: 41772
}
}
Use fill
when you want to fill out missing fields with nulls, e.g.,
super -Z -I connection.zed -c 'fill(this, <connection>)' sample.json
adds a null-valued vlan
field since the input value is missing it and
the connection
type has it:
{
kind: "dns",
server: {
addr: "10.0.0.100",
port: 53
},
client: {
addr: "10.47.1.100",
port: 41772
},
uid: "C2zK5f13SbCtKcyiW5",
vlan: null (uint16)
}
The order
function changes the order of fields in its input to match the
order in the specified type, as field order is significant in Zed records, e.g.,
super -Z -I connection.zed -c 'order(this, <connection>)' sample.json
reorders the client
and server
fields to match the input but does nothing
about the uid
field as it is not in the connection
type:
{
kind: "dns",
client: {
addr: "10.47.1.100",
port: 41772
},
server: {
addr: "10.0.0.100",
port: 53
},
uid: "C2zK5f13SbCtKcyiW5"
}
As an alternative to the order
function,
record expressions can be used to reorder
fields without specifying types. For example:
super -Z -c 'yield {kind,client,server,...this}' sample.json
also produces
{
kind: "dns",
client: {
addr: "10.47.1.100",
port: 41772
},
server: {
addr: "10.0.0.100",
port: 53
},
uid: "C2zK5f13SbCtKcyiW5"
}
The shape
function brings everything together by applying cast
,
fill
, and order
all in one step, e.g.,
super -Z -I connection.zed -c 'shape(this, <connection>)' sample.json
reorders the client
and server
fields to match the input but does nothing
about the uid
field as it is not in the connection
type:
{
kind: "dns",
client: {
addr: 10.47.1.100,
port: 41772 (port=uint16)
} (=socket),
server: {
addr: 10.0.0.100,
port: 53
} (socket),
vlan: null (uint16),
uid: "C2zK5f13SbCtKcyiW5"
}
To get a tight shape of the target type,
apply crop
to the output of shape
, e.g.,
super -Z -I connection.zed -c '
shape(this, <connection>)
|> crop(this, <connection>)
' sample.json
drops the uid
field after shaping:
{
kind: "dns",
client: {
addr: 10.47.1.100,
port: 41772 (port=uint16)
} (=socket),
server: {
addr: 10.0.0.100,
port: 53
} (socket),
vlan: null (uint16)
}
A failure during shaping produces an error value
in the problematic leaf field. For example, consider this alternate input data
file malformed.json
.
{
"kind": "dns",
"server": {
"addr": "10.0.0.100",
"port": 53
},
"client": {
"addr": "39 Elm Street",
"port": 41772
},
"vlan": "available"
}
When we apply our shaper via
super -Z -I connection.zed -c 'shape(this, <connection>)' malformed.json
we see two errors:
{
kind: "dns",
client: {
addr: error({
message: "cannot cast to ip",
on: "39 Elm Street"
}),
port: 41772 (port=uint16)
},
server: {
addr: 10.0.0.100,
port: 53 (port)
} (=socket),
vlan: error({
message: "cannot cast to uint16",
on: "available"
})
}
Since these error values are nested inside an otherwise healthy record, adding
has_error(this)
downstream in our Zed pipeline
could help find or exclude such records. If the failure to shape any single
field is considered severe enough to render the entire input record unhealthy,
a conditional expression
could be applied to wrap the input record as an error while including detail
to debug the problem, e.g.,
super -Z -I connection.zed -c '
yield {original: this, shaped: shape(this, <connection>)}
|> yield has_error(shaped)
? error({
msg: "shaper error (see inner errors for details)",
original,
shaped
})
: shaped
' malformed.json
which produces
error({
msg: "shaper error (see inner errors for details)",
original: {
kind: "dns",
server: {
addr: "10.0.0.100",
port: 53
},
client: {
addr: "39 Elm Street",
port: 41772
},
vlan: "available"
},
shaped: {
kind: "dns",
client: {
addr: error({
message: "cannot cast to ip",
on: "39 Elm Street"
}),
port: 41772 (port=uint16)
},
server: {
addr: 10.0.0.100,
port: 53 (port)
} (=socket),
vlan: error({
message: "cannot cast to uint16",
on: "available"
})
}
}) (error({msg:string,original:{kind:string,server:{addr:string,port:int64},client:{addr:string,port:int64},vlan:string},shaped:{kind:string,client:{addr:error({message:string,on:string}),port:port=uint16},server:socket={addr:ip,port:port},vlan:error({message:string,on:string})}}))
If you require awareness about changes made by the shaping functions that aren't surfaced as errors, a similar wrapping approach can be used with a general check for equality. For example, to treat cropped fields as an error, we can execute
super -Z -I connection.zed -c '
yield {original: this, cropped: crop(this, <connection>)}
|> yield original==cropped
? original
: error({msg: "data was cropped", original, cropped})
' sample.json
which produces
error({
msg: "data was cropped",
original: {
kind: "dns",
server: {
addr: "10.0.0.100",
port: 53
},
client: {
addr: "10.47.1.100",
port: 41772
},
uid: "C2zK5f13SbCtKcyiW5"
},
cropped: {
kind: "dns",
server: {
addr: "10.0.0.100",
port: 53
},
client: {
addr: "10.47.1.100",
port: 41772
}
}
})
Type fusion is another important building block of data shaping. Here, types are operated upon by fusing them together, where the result is a single fused type. Some systems call a related process "schema inference" where a set of values, typically JSON, is analyzed to determine a relational schema that all the data will fit into. However, this is just a special case of type fusion as fusion is fine-grained and based on Zed's type system rather than having the narrower goal of computing a schema for representations like relational tables, Parquet, Avro, etc.
Type fusion utilizes two key techniques.
The first technique is to simply combine types with a type union.
For example, an int64
and a string
can be merged into a common
type of union (int64,string)
, e.g., the value sequence 1 "foo"
can be fused into the single-type sequence:
1((int64,string))
"foo"((int64,string))
The second technique is to merge fields of records, analogous to a spread
expression. Here, the value sequence {a:1}{b:"foo"}
may be
fused into the single-type sequence:
{a:1,b:null(string)}
{a:null(int64),b:"foo"}
Of course, these two techniques can be powerfully combined,
e.g., where the value sequence {a:1}{a:"foo",b:2}
may be
fused into the single-type sequence:
{a:1((int64,string)),b:null(int64)}
{a:"foo"((int64,string)),b:2}
To perform fusion, Zed currently includes two key mechanisms (though this is an active area of development):
- the
fuse
operator, and - the
fuse
aggregate function.
The fuse
operator reads all of its input, computes a fused type using
the techniques above, and outputs the result, e.g.,
echo '{x:1} {y:"foo"} {x:2,y:"bar"}' | super -z -c fuse -
produces
{x:1,y:null(string)}
{x:null(int64),y:"foo"}
{x:2,y:"bar"}
whereas
echo '{x:1} {x:"foo",y:"foo"}{x:2,y:"bar"}' | super -z -c fuse -
requires a type union for field x
and produces:
{x:1((int64,string)),y:null(string)}
{x:"foo"((int64,string)),y:"foo"}
{x:2((int64,string)),y:"bar"}
The fuse
aggregate function is most often useful during data exploration and discovery
where you might interactively run queries to determine the shapes of some new
or unknown input data and how those various shapes relate to one another.
For example, in the example sequence above, we can use the fuse
aggregate function to determine
the fused type rather than transforming the values, e.g.,
echo '{x:1} {x:"foo",y:"foo"} {x:2,y:"bar"}' | super -z -c 'fuse(this)' -
results in
<{x:(int64,string),y:string}>
Since the fuse
here is an aggregate function, it can also be used with
group-by keys. Supposing we want to divide records into categories and fuse
the records in each category, we can use a group-by. In this simple example, we
will fuse records based on their number of fields using the
len
function:
echo '{x:1} {x:"foo",y:"foo"} {x:2,y:"bar"}' |
super -z -c 'fuse(this) by len(this) |> sort len' -
which produces
{len:1,fuse:<{x:int64}>}
{len:2,fuse:<{x:(int64,string),y:string}>}
Now, we can turn around and write a "shaper" for data that has the patterns
we "discovered" above, e.g., if this Zed source text is in shape.zed
switch len(this) (
case 1 => pass
case 2 => yield shape(this, <{x:(int64,string),y:string}>)
default => yield error({kind:"unrecognized shape",value:this})
)
when we run
echo '{x:1} {x:"foo",y:"foo"} {x:2,y:"bar"} {a:1,b:2,c:3}' |
super -z -I shape.zed -c '|> sort this desc' -
we get
error({kind:"unrecognized shape",value:{a:1,b:2,c:3}})
{x:"foo"((int64,string)),y:"foo"}
{x:2((int64,string)),y:"bar"}
{x:1}