failing on complicated schemas #2

Open
mattpollock opened this issue Oct 29, 2014 · 2 comments

@mattpollock

Hello,

I tested read.avro using a moderately complicated schema. Some fields contain sub-records; other fields contain arrays of records. One of the sub-records (named moments and containing mean, variance, skewness, and kurtosis fields) is defined the first time it appears and is referenced by name as a type after that. This does not cause Avro any problems, but read.avro throws the following error:

> dat <- read.avro(file="/path/to/file/part-r-00000.avro")
Error in (function (x, schema, flatten = T, simplify = F, encoded_unions = T,  : 
  Unsupported Avro type: moments

The schema of the file being read is (in part):

...
"fields" : [ {
    "name" : "routename",
    "type" : "string",
    "doc" : "path identifier indicates unique fix sequence"
  }, {
    "name" : "aircrafttype",
    "type" : "string"
  }, {
    "name" : "lowaltitudebin",
    "type" : "double",
    "doc" : "altitude [feet] at low end of route (rounded to nearest 1000ft)"
  }, {
    "name" : "highaltitudebin",
    "type" : "double",
    "doc" : "altitude [feet] at high end of route (rounded to nearest 1000ft)"
  }, {
    "name" : "route",
    "type" : [ "null", {
      "type" : "record",
      "name" : "routemetrics",
      "fields" : [ {
        "name" : "route",
        "type" : [ "string", "null" ]
      }, {
        "name" : "initialalttude",
        "type" : [ "null", {
          "type" : "record",
          "name" : "moments",
          "fields" : [ {
            "name" : "mean",
            "type" : "double"
          }, {
            "name" : "variance",
            "type" : "double"
          }, {
            "name" : "skewness",
            "type" : "double"
          }, {
            "name" : "kurtosis",
            "type" : "double"
          }, {
            "name" : "samplesize",
            "type" : "long"
          } ]
        } ],
        "doc" : "moments [feet] characterizing distribution of atltitudes at the beginning of the route (within given binning constraint)"
      }, {
        "name" : "terminalaltitude",
        "type" : [ "null", "moments" ],
        "doc" : "moments [feet] characterizing distribution of atltitudes at the end of the route (within given binning constraint)"
      }, {...

Note that moments is defined as a type (as part of a union) for the first time in the initialalttude field, which is a field of the routemetrics record nested inside of the top-level route field. After that, moments is referenced by name in the subsequent terminalaltitude field.
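To illustrate what resolving such a by-name reference involves, here is a minimal sketch in Python using only the standard library. It is not how read.avro or the Avro libraries are implemented. The helper names `collect_named_types` and `resolve` are hypothetical, the schema is a cut-down version of the one above, and namespaces are ignored for simplicity.

```python
import json

# Avro primitive type names; any other bare string is treated as a named-type reference.
PRIMITIVES = {"null", "boolean", "int", "long", "float", "double", "bytes", "string"}
NAMED_KINDS = {"record", "enum", "fixed"}


def collect_named_types(schema, registry=None):
    """Walk a parsed schema and register every named type definition by its name."""
    if registry is None:
        registry = {}
    if isinstance(schema, list):  # a union: visit every branch
        for branch in schema:
            collect_named_types(branch, registry)
    elif isinstance(schema, dict):
        kind = schema.get("type")
        if kind in NAMED_KINDS:
            registry[schema["name"]] = schema
        if kind == "record":
            for field in schema.get("fields", []):
                collect_named_types(field["type"], registry)
        elif kind == "array":
            collect_named_types(schema["items"], registry)
        elif kind == "map":
            collect_named_types(schema["values"], registry)
        elif isinstance(kind, (dict, list)):  # nested type expression
            collect_named_types(kind, registry)
    return registry


def resolve(type_ref, registry):
    """Return the full definition for a by-name reference; pass anything else through."""
    if isinstance(type_ref, str) and type_ref not in PRIMITIVES:
        return registry[type_ref]  # raises KeyError if the name was never defined
    return type_ref


# Reduced version of the schema from this issue (assumption: only two moments fields kept).
schema = json.loads("""
{"type": "record", "name": "routemetrics", "fields": [
  {"name": "initialalttude",
   "type": ["null", {"type": "record", "name": "moments", "fields": [
       {"name": "mean", "type": "double"},
       {"name": "variance", "type": "double"}]}]},
  {"name": "terminalaltitude", "type": ["null", "moments"]}
]}
""")

registry = collect_named_types(schema)
terminal = schema["fields"][1]["type"]  # ["null", "moments"]
resolved = [resolve(branch, registry) for branch in terminal]
field_names = [f["name"] for f in resolved[1]["fields"]]
print(field_names)
```

A reader that builds such a registry before decoding could look up "moments" when it appears as a bare string instead of reporting it as an unsupported type.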

Are there any plans to deal well with schemas like the one above?

@jamiefolson
Contributor

If I recall correctly, we were primarily focused on Avro files with the schema embedded. In that case, at least for the data we tested, record schemas were duplicated everywhere they appear in the schema ("moments" would be defined in both places). It seems this is not the case for the schema metadata in your Avro files?

@mattpollock
Author

The data was generated using Pig. When I attempted to explicitly define moments throughout the schema, it threw errors (protecting against my giving the same name to different types of records, I think). I assumed that this was not unique to the Avro/Pig handshake, but a general Avro schema requirement. Perhaps that isn't the case.
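For what it's worth, the Avro specification does say that a schema may not contain multiple definitions of the same full name, which would explain the errors. The sketch below shows one way to check a parsed schema for repeated definitions. It is an illustration only, not Pig's or Avro's actual validation code. The names `count_definitions` and `redefined_names` are hypothetical, and namespaces are ignored.

```python
from collections import Counter

NAMED_KINDS = {"record", "enum", "fixed"}


def count_definitions(schema, counts=None):
    """Count how many times each named type is defined in a parsed schema."""
    if counts is None:
        counts = Counter()
    if isinstance(schema, list):  # a union: visit every branch
        for branch in schema:
            count_definitions(branch, counts)
    elif isinstance(schema, dict):
        kind = schema.get("type")
        if kind in NAMED_KINDS:
            counts[schema["name"]] += 1
        if kind == "record":
            for field in schema.get("fields", []):
                count_definitions(field["type"], counts)
        elif kind == "array":
            count_definitions(schema["items"], counts)
        elif kind == "map":
            count_definitions(schema["values"], counts)
        elif isinstance(kind, (dict, list)):  # nested type expression
            count_definitions(kind, counts)
    return counts


def redefined_names(schema):
    """Return the sorted names that are defined more than once."""
    return sorted(name for name, n in count_definitions(schema).items() if n > 1)


# Cut-down moments record (assumption: a single field is enough for the illustration).
moments = {"type": "record", "name": "moments",
           "fields": [{"name": "mean", "type": "double"}]}

# Defined once, then referenced by name (the form avro-tools getschema printed).
defined_once = {"type": "record", "name": "routemetrics", "fields": [
    {"name": "initialalttude", "type": ["null", moments]},
    {"name": "terminalaltitude", "type": ["null", "moments"]}]}

# Defined in full at both places (the form that was rejected).
defined_twice = {"type": "record", "name": "routemetrics", "fields": [
    {"name": "initialalttude", "type": ["null", moments]},
    {"name": "terminalaltitude", "type": ["null", moments]}]}

once_result = redefined_names(defined_once)
twice_result = redefined_names(defined_twice)
print(once_result, twice_result)
```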

Regardless, the way I defined the schema when saving the data and the way it comes out of avro-tools getschema on a resulting data file (which is what I pasted above) are consistent: both define moments only once. This does not cause any hiccups for avro-tools tojson. Also, experimenting with the Java API, calling fld.schema().getFields() (where fld is an object of type org.apache.avro.Schema.Field) on fields where moments is the type but is not explicitly defined (e.g., the terminalaltitude field above) returned the expected fields (mean, variance, etc.) without any problem.
