VLM Workspace Mockup #9

echeipesh · 2018-08-16T18:44:03Z

This is a record of a chat with @metasim.

The underlying principles are these:

The spatial index is always an external catalog, specific to the application (e.g. PostgreSQL, ElasticSearch)
Assets are going to be large files like COGs, and we're going to figure out how to sample them
Reading/reprojecting/resampling a window should be inlined in the IO stage
- i.e. avoid using Spark shuffles to do those things when they can be done at read time
Tile reads will be deferred until the last possible moment
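To make the last principle concrete, here is a minimal sketch of a deferred tile read. It is not GeoTrellis or RasterFrames code: `Window`, `CountingSource` and `TileRef` are hypothetical stand-ins invented for illustration, and the "pixels" are just a filled array. The point is only that filtering on metadata costs no IO, and a read happens when the tile is first touched.

```scala
object DeferredReadSketch {
  // Hypothetical stand-in for a pixel window; not a GeoTrellis type.
  final case class Window(col: Int, row: Int, cols: Int, rows: Int)

  /** Hypothetical source that counts how many windows were actually read. */
  final class CountingSource(val name: String) {
    private var reads = 0
    def readCount: Int = reads
    def read(w: Window): Array[Double] = {
      reads += 1
      Array.fill(w.cols * w.rows)(1.0) // pretend pixel data
    }
  }

  /** Delayed tile read: holds only metadata until `tile` is first accessed. */
  final class TileRef(val window: Window, source: CountingSource) {
    lazy val tile: Array[Double] = source.read(window)
  }

  /** Builds 100 references, touches 5 of them, and reports (refs, reads). */
  def demo(): (Int, Int) = {
    val source = new CountingSource("scene-1")
    val refs =
      for (r <- 0 until 10; c <- 0 until 10)
        yield new TileRef(Window(c * 256, r * 256, 256, 256), source)
    // Filtering looks only at window metadata, so nothing is read here.
    val kept = refs.filter(_.window.row == 0).take(5)
    // Forcing `tile` is the only step that triggers IO.
    kept.foreach(_.tile)
    (refs.size, source.readCount)
  }
}
```

In this sketch the number of reads tracks the number of tiles actually consumed, not the number of references created, which is the behaviour a sampled query over a layer would want.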

This is how these principles may play out in the RasterFrames API when we introduce the idea of a Workspace for output rasters.

object UseCaseRF {

  // we are in Zeppelin ... we got geotiffs

  val catalog = StacCatalog("http://landsat.pds/catalog.json")

  val items: Seq[StacItem] =
    catalog.query(
      provider = "NAIP",
      bbox = ???,
      timeRange = ???,
      Tag("eo:cloud_cover") < 0.5)

  // Let's separate the act of querying for assets from the act of reading them.
  // If scenes remain in their native sizes, e.g. 800MB per LC8 scene, it's nearly
  // certain that the scene selection and filtering can happen fully on the driver

  val assets: Seq[StacAsset] = items.flatMap( ??? ) // list out / filter the scenes

  /** Pixel layout and projection for our results */
  case class Workspace(crs: CRS, layout: LayoutDefinition)
  // ... Because we're pre-planning the reads to inline reprojection and resampling we will need to know the target pixel grid

  // Case 1: we know target, we're building pyramid
  val workspace = Workspace
    .fromPyramid(crs = WebMercator, level = 13)
    .build

  // Case 2: we're inferring layout from assets, they better match
  val workspace = Workspace
    .from(assets)
    .build
  // ... check there is one CRS, use it or throw
  // ... check they are grid aligned, use it or throw

  // Case 3: We know something, but not too much
  val workspace = Workspace
    .from(assets) // we checked that CRS is good, why did we do that?
    .align(assets.head, NearestNeighbor) // I guess the first one is good enough
    .build

  // evidence: T: Asset => RasterSource(extent, crs, cols/rows) { def read(bbox: RasterExtent): Tile }
  val rf = assets.toRF(workspace)
  // - we are going to look at each asset
  // - blow it out into tiles from workspace
  // => 1 asset -> n rows
  // => we're going to get multiple tiles per key
  // => DataFrame join/shuffle will be cheap to bring those keys together

  /** Something that represents delayed tile read */
  case class RasterRef(re: RasterExtent, crs: CRS, source: RasterSource)
  // .. these will not be visible to the user but will be generated under the covers

  // Actually, I KNOW that these assets are somehow part of the layer.
  // So when I sample, I better sample across scenes .. so what tells me to do that...

  val mergeFunction: Seq[Double] => Double = ???

  val rf = assets.grouped(mergeFunction).toRF(workspace)
  // - we're going to look at each tile we will produce
  // - we're going to make RasterRef that joins pixels from each overlapping asset
  // => We need RasterRef because we need to group multiple sources per target tile
  // So this is probably not true; we will want to rely on the join/shuffle in RDD space to bring the RasterRefs together
  // So this part of the API is suspect for the RasterFrame/RDD case

  // this is only going to do 5% of IO
  rf.sample(0.05).select(aggregateHistogram("col1"))

}
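The checks described under Case 2 (one CRS, grid aligned, otherwise fail) could look roughly like the sketch below. Everything here is an assumption for illustration: `AssetMeta` and this simplified `Workspace` are hypothetical types that carry a CRS name, a grid origin and a cell size, rather than the `CRS` and `LayoutDefinition` used in the mockup. The mockup says "use it or throw"; this sketch returns an `Either` instead so the failure reason is visible to the caller.

```scala
object WorkspaceSketch {
  // Hypothetical, simplified metadata; not the mockup's StacAsset or RasterSource.
  final case class AssetMeta(crs: String, originX: Double, originY: Double, cellSize: Double)
  // Hypothetical, simplified workspace: a CRS name plus a pixel grid definition.
  final case class Workspace(crs: String, originX: Double, originY: Double, cellSize: Double)

  /** Infers a workspace from assets, failing if they disagree on CRS or pixel grid. */
  def from(assets: Seq[AssetMeta]): Either[String, Workspace] =
    assets.headOption match {
      case None => Left("no assets given")
      case Some(first) =>
        val crss = assets.map(_.crs).distinct
        if (crss.size != 1)
          Left(s"expected one CRS, found: ${crss.mkString(", ")}")
        else if (!assets.forall(aligned(first, _)))
          Left("assets are not grid aligned")
        else
          Right(Workspace(first.crs, first.originX, first.originY, first.cellSize))
    }

  /** Two grids align when cell sizes match and origins differ by whole cells. */
  private def aligned(a: AssetMeta, b: AssetMeta): Boolean = {
    def wholeCells(delta: Double): Boolean = {
      val n = delta / a.cellSize
      math.abs(n - math.rint(n)) < 1e-9
    }
    a.cellSize == b.cellSize &&
      wholeCells(b.originX - a.originX) &&
      wholeCells(b.originY - a.originY)
  }
}
```

Because this check needs only per-asset metadata, it fits the earlier point that scene selection and planning can happen entirely on the driver before any pixels are read.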


metasim · 2018-08-21T13:07:10Z

@echeipesh Is the Tag("eo:cloud_cover") construct something you have implemented?

echeipesh · 2018-08-21T14:47:12Z

@metasim Nope, that's just flavor. I was just thinking on the spot that a hypothetical STAC query method would need some type to represent filters against optional or user-defined tags, whereas required fields can be vanilla parameters.

Edit:
I know you broached the idea that LayerQuery could be better if it assembled the filters into an expression tree instead of eagerly evaluating them. The Tag idea is related to that.
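A minimal sketch of what "filters as an expression tree" could mean for the `Tag("eo:cloud_cover") < 0.5` flavor above. None of this is an existing STAC, GeoTrellis or RasterFrames API: `Filter`, `LessThan`, `GreaterThan`, `And` and `eval` are hypothetical names, and tags are assumed to be numeric for simplicity.

```scala
object FilterSketch {
  /** Hypothetical filter expression; building one evaluates nothing. */
  sealed trait Filter {
    def &&(other: Filter): Filter = And(this, other)
  }
  final case class LessThan(tag: String, value: Double) extends Filter
  final case class GreaterThan(tag: String, value: Double) extends Filter
  final case class And(left: Filter, right: Filter) extends Filter

  /** Hypothetical handle for an optional or user-defined tag. */
  final case class Tag(name: String) {
    def <(value: Double): Filter = LessThan(name, value)
    def >(value: Double): Filter = GreaterThan(name, value)
  }

  /** One possible interpreter: evaluate against an item's numeric properties.
    * A missing tag fails the comparison. */
  def eval(filter: Filter, props: Map[String, Double]): Boolean = filter match {
    case LessThan(tag, v)    => props.get(tag).exists(_ < v)
    case GreaterThan(tag, v) => props.get(tag).exists(_ > v)
    case And(l, r)           => eval(l, props) && eval(r, props)
  }
}
```

Since the tree is plain data, the same expression could in principle be interpreted locally as above or translated into a query for whatever external catalog the application uses.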

echeipesh · 2019-06-27T10:10:50Z

The current question is whether the concept of an inspectable workspace, a collection of rasters independent of RasterSource instances, still looks like a useful feature.

pomadchin added the "needs prototype" (prototype needed) label on May 30, 2019

echeipesh added the "question" (further information is requested) label on Jun 27, 2019
