I tested read.avro using a moderately complicated schema. Some fields contain sub-records; other fields contain arrays of records. One of the sub-records (named moments and containing mean, variance, skewness, and kurtosis fields) is defined in full the first time it appears and referenced by name as a type after that. This does not cause Avro any problems, but read.avro throws the following error:
The schema being read here reads (in part):
...
"fields" : [ {
  "name" : "routename",
  "type" : "string",
  "doc" : "path identifier indicates unique fix sequence"
}, {
  "name" : "aircrafttype",
  "type" : "string"
}, {
  "name" : "lowaltitudebin",
  "type" : "double",
  "doc" : "altitude [feet] at low end of route (rounded to nearest 1000ft)"
}, {
  "name" : "highaltitudebin",
  "type" : "double",
  "doc" : "altitude [feet] at high end of route (rounded to nearest 1000ft)"
}, {
  "name" : "route",
  "type" : [ "null", {
    "type" : "record",
    "name" : "routemetrics",
    "fields" : [ {
      "name" : "route",
      "type" : [ "string", "null" ]
    }, {
      "name" : "initialalttude",
      "type" : [ "null", {
        "type" : "record",
        "name" : "moments",
        "fields" : [ {
          "name" : "mean",
          "type" : "double"
        }, {
          "name" : "variance",
          "type" : "double"
        }, {
          "name" : "skewness",
          "type" : "double"
        }, {
          "name" : "kurtosis",
          "type" : "double"
        }, {
          "name" : "samplesize",
          "type" : "long"
        } ]
      } ],
      "doc" : "moments [feet] characterizing distribution of atltitudes at the beginning of the route (within given binning constraint)"
    }, {
      "name" : "terminalaltitude",
      "type" : [ "null", "moments" ],
      "doc" : "moments [feet] characterizing distribution of atltitudes at the end of the route (within given binning constraint)"
    }, {...
Note that moments is defined as a type (as part of a union) for the first time in the initialalttude field, which is a field of the routemetrics record nested inside of the top-level route field. After that, moments is referenced by name in the subsequent terminalaltitude field.
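To make the define-once, reference-by-name pattern concrete, here is a minimal Python sketch of how a reader could expand such references while walking a parsed schema. It is not how read.avro or the Avro libraries do it; the `resolve` function and the trimmed-down schema are illustrative assumptions, and namespaces are ignored for brevity.

```python
import json

PRIMITIVES = {"null", "boolean", "int", "long", "float", "double", "bytes", "string"}


def resolve(schema, named=None):
    """Return a copy of `schema` with name references replaced by their definitions.

    Hypothetical helper: walks the schema depth-first, in field order,
    remembering every named type it has seen so far.
    """
    if named is None:
        named = {}
    if isinstance(schema, str):
        # A bare string is either a primitive or a reference to a named type.
        if schema in PRIMITIVES:
            return schema
        if schema not in named:
            raise KeyError(f"undefined type name: {schema}")
        return named[schema]
    if isinstance(schema, list):
        # A union: resolve each branch in order.
        return [resolve(branch, named) for branch in schema]
    kind = schema["type"]
    if kind == "record":
        resolved = dict(schema)
        # Register the record before its fields so later references can find it.
        named[schema["name"]] = resolved
        resolved["fields"] = [
            {**field, "type": resolve(field["type"], named)}
            for field in schema["fields"]
        ]
        return resolved
    if kind in ("enum", "fixed"):
        named[schema["name"]] = schema
        return schema
    if kind == "array":
        return {**schema, "items": resolve(schema["items"], named)}
    if kind == "map":
        return {**schema, "values": resolve(schema["values"], named)}
    # Wrapper form such as {"type": "double"}.
    return {**schema, "type": resolve(kind, named)}


# Trimmed-down version of the schema above: moments is defined once, then referenced.
schema = json.loads("""
{
  "type": "record",
  "name": "routemetrics",
  "fields": [
    {"name": "initialalttude",
     "type": ["null", {"type": "record", "name": "moments",
                       "fields": [{"name": "mean", "type": "double"},
                                  {"name": "variance", "type": "double"}]}]},
    {"name": "terminalaltitude", "type": ["null", "moments"]}
  ]
}
""")

resolved = resolve(schema)
terminal = resolved["fields"][1]["type"][1]
print([f["name"] for f in terminal["fields"]])  # ['mean', 'variance']
```

After resolution, the terminalaltitude union carries the full moments record instead of the bare name, which is the shape a reader that expects duplicated definitions would need.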
Are there any plans to deal well with schemas like the one above?
If I recall correctly, we were primarily focused on Avro files with the schema embedded. In that case, at least for the data we tested, record schemas were duplicated everywhere they appear in the schema ("moments" would be defined in both places). It seems this is not the case for the schema metadata in your Avro files?
The data was generated using Pig. When I attempted to explicitly define moments throughout the schema, it threw errors (protecting against my giving the same name to different types of records, I think). I assumed that this was not unique to the Avro/Pig handshake, but a general Avro schema requirement. Perhaps that isn't the case.
Regardless, the way I defined the schema when saving the data and the way it pops out when using avro-tools getschema on a resulting data file (which is what I pasted above) are consistent: both define moments only once. This does not cause any hiccups for avro-tools tojson. Also, messing around with the Java API, calling fld.schema().getFields() (where fld is an object of type org.apache.avro.Schema.Field) on fields where moments is the type but is not explicitly defined (e.g., the terminalaltitude field above) returned the expected fields (mean, variance, etc.) without any problem.
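On the errors seen when defining moments in every place it is used: as I read the Avro specification, a schema may not contain more than one definition of the same full name, so a repeated definition is rejected rather than tolerated. A rough Python sketch of such a check follows; `find_duplicate_names` and the two small schemas are made up for illustration, and namespaces are ignored.

```python
def find_duplicate_names(schema, seen=None, dups=None):
    """Return names of record/enum/fixed types that are defined more than once.

    Hypothetical helper: references by name (plain strings) are not
    definitions, so they are never reported.
    """
    if seen is None:
        seen = set()
    if dups is None:
        dups = []
    if isinstance(schema, list):
        # A union: check every branch.
        for branch in schema:
            find_duplicate_names(branch, seen, dups)
    elif isinstance(schema, dict):
        kind = schema.get("type")
        if kind in ("record", "enum", "fixed"):
            name = schema["name"]
            if name in seen:
                dups.append(name)
            seen.add(name)
        if kind == "record":
            for field in schema["fields"]:
                find_duplicate_names(field["type"], seen, dups)
        elif kind == "array":
            find_duplicate_names(schema["items"], seen, dups)
        elif kind == "map":
            find_duplicate_names(schema["values"], seen, dups)
        elif isinstance(kind, (dict, list)):
            find_duplicate_names(kind, seen, dups)
    return dups


moments = {
    "type": "record",
    "name": "moments",
    "fields": [{"name": "mean", "type": "double"}],
}

# moments defined once, then referenced by name (the layout in this issue).
defined_once = {
    "type": "record",
    "name": "routemetrics",
    "fields": [
        {"name": "initialalttude", "type": ["null", moments]},
        {"name": "terminalaltitude", "type": ["null", "moments"]},
    ],
}

# moments spelled out in full in both places.
defined_twice = {
    "type": "record",
    "name": "routemetrics",
    "fields": [
        {"name": "initialalttude", "type": ["null", moments]},
        {"name": "terminalaltitude", "type": ["null", moments]},
    ],
}

print(find_duplicate_names(defined_once))   # []
print(find_duplicate_names(defined_twice))  # ['moments']
```

If that reading of the specification is right, the single-definition layout returned by avro-tools getschema is the expected form, and a reader has to resolve the later references itself.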