VLM Workspace Mockup #9

Open
echeipesh opened this issue Aug 16, 2018 · 3 comments
Labels
needs prototype (prototype needed), question (further information is requested)

Comments

@echeipesh
Collaborator

This is a record of a chat with @metasim.

The underlying principles are these:

  • The spatial index is always an external catalog, specific to the application (e.g. PostgreSQL, ElasticSearch)
  • Assets are going to be large files like COGs, and we're going to figure out how to sample them
  • Reading/reprojecting/resampling a window should be inlined in the IO stage
    • i.e. avoid using Spark shuffles to do those things when they can be done at read time
  • Tile reads will be deferred until the last possible moment

This is how these principles may play out in the RasterFrames API when we introduce the idea of a Workspace for output rasters.

object UseCaseRF {

  // we are in Zeppelin ... we got geotiffs

  val catalog = StacCatalog("http://landsat.pds/catalog.json")

  val items: Seq[StacItem] =
    catalog.query(
      provider = "NAIP",
      bbox = ???,
      timeRange = ???,
      Tag("eo:cloud_cover") < 0.5)

  // Let's separate the act of querying for assets from the act of reading them.
  // If scenes remain in their native sizes (e.g. 800MB per LC8 scene), it's nearly
  // certain that scene selection and filtering can happen fully on the driver.

  val assets: Seq[StacAsset] = items.flatMap( ??? ) // list out / filter the scenes

  /** Pixel layout and projection for our results */
  case class Workspace(crs: CRS, layout: LayoutDefinition)
  // ... Because we're pre-planning the reads to inline reprojection and resampling we will need to know the target pixel grid

  // Case 1: we know target, we're building pyramid
  val workspace = Workspace
    .fromPyramid(crs = WebMercator, level = 13)
    .build

  // Case 2: we're inferring layout from assets, they better match
  val workspace = Workspace
    .from(assets)
    .build
  // ... check there is one CRS, use it or throw
  // ... check they are grid aligned, use it or throw

  // Case 3: We know something, but not too much
  val workspace = Workspace
    .from(assets) // we checked that CRS is good, why did we do that?
    .align(assets.head, NearestNeighbor) // I guess the first one is good enough
    .build

  // evidence: T: Asset => RasterSource(extent, crs, cols/rows) { def read(bbox: RasterExtent): Tile }
  val rf = assets.toRF(workspace)
  // - we are going to look at each asset
  // - blow it out into tiles from workspace
  // => 1 asset -> n rows
  // => we're going to get multiple tiles per key
  // => DataFrame join/shuffle will be cheap to bring those keys together

  /** Something that represents delayed tile read */
  case class RasterRef(re: RasterExtent, crs: CRS, source: RasterSource)
  // .. these will not be visible to the user but will be generated under the covers

  // Actually, I KNOW that these assets are somehow part of the layer.
  // So when I sample, I better sample across scenes .. so what tells me to do that...

  val mergeFunction: Seq[Double] => Double = ???

  val rf = assets.grouped(mergeFunction).toRF(workspace)
  // - we're going to look at each tile we will produce
  // - we're going to make RasterRef that joins pixels from each overlapping asset
  // => We need RasterRef because we need to group multiple sources per target tile
  // This is probably not true: we will want to rely on the join/shuffle in RDD space to bring the RasterRefs together,
  // so this part of the API is suspect for the RasterFrame/RDD case

  // this is only going to do 5% of IO
  rf.sample(0.05).select(aggregateHistogram("col1"))

}
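To make the "1 asset -> n rows" and deferred-read ideas above concrete, here is a minimal, dependency-free sketch. Every type in it (`Extent`, `Workspace`, `Asset`, `RasterRef`) is a simplified, hypothetical stand-in written for illustration, not a GeoTrellis or RasterFrames class, and the `reads` counter exists only to show when IO would happen. The merge step runs after a plain `groupBy`, which stands in for the join/shuffle discussed in the mockup.

```scala
object WorkspaceSketch {
  // Hypothetical stand-ins only; none of these are real library types.
  final case class Extent(xmin: Double, ymin: Double, xmax: Double, ymax: Double) {
    def intersects(o: Extent): Boolean =
      xmin < o.xmax && o.xmin < xmax && ymin < o.ymax && o.ymin < ymax
  }

  /** Target grid: an extent cut into layoutCols x layoutRows tiles (row 0 at the top). */
  final case class Workspace(crs: String, extent: Extent, layoutCols: Int, layoutRows: Int) {
    private val tileWidth = (extent.xmax - extent.xmin) / layoutCols
    private val tileHeight = (extent.ymax - extent.ymin) / layoutRows

    def keyExtent(col: Int, row: Int): Extent = {
      val x0 = extent.xmin + col * tileWidth
      val y1 = extent.ymax - row * tileHeight
      Extent(x0, y1 - tileHeight, x0 + tileWidth, y1)
    }

    /** All layout keys whose tile extent overlaps the given extent. */
    def keysFor(e: Extent): Seq[(Int, Int)] =
      for {
        col <- 0 until layoutCols
        row <- 0 until layoutRows
        if keyExtent(col, row).intersects(e)
      } yield (col, row)
  }

  /** An asset whose pixels are touched only when read is called. */
  final class Asset(val name: String, val extent: Extent, value: Double) {
    var reads: Int = 0 // counts simulated IO calls, for illustration only
    def read(window: Extent): Array[Double] = { reads += 1; Array.fill(4)(value) }
  }

  /** Delayed tile read: everything needed to fetch one target tile later. */
  final case class RasterRef(key: (Int, Int), window: Extent, source: Asset) {
    lazy val tile: Array[Double] = source.read(window) // IO happens on first access
  }

  /** Blow each asset out into one RasterRef per overlapping workspace tile; no IO yet. */
  def toRefs(assets: Seq[Asset], ws: Workspace): Seq[RasterRef] =
    for {
      asset <- assets
      key <- ws.keysFor(asset.extent)
    } yield RasterRef(key, ws.keyExtent(key._1, key._2), asset)

  /** Bring refs with the same key together and merge them pixel by pixel. */
  def mergeTiles(
      refs: Seq[RasterRef],
      mergeFunction: Seq[Double] => Double): Map[(Int, Int), Array[Double]] =
    refs.groupBy(_.key).map { case (key, group) =>
      key -> group.map(_.tile.toSeq).transpose.map(mergeFunction).toArray
    }
}
```

Because `toRefs` only builds references, filtering or sampling the resulting sequence before anything touches `tile` limits the reads to the surviving rows, which is the behavior the `rf.sample(0.05)` line is after.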
@metasim
Contributor

metasim commented Aug 21, 2018

@echeipesh Is the Tag("eo:cloud_cover") construct something you have implemented?

@echeipesh
Collaborator Author

echeipesh commented Aug 21, 2018

@metasim Nope, that's just flavor. I was just thinking on the spot that a hypothetical STAC query method would need some type to represent filters against optional or user-defined tags, whereas required fields can be vanilla parameters.

Edit:
I know you broached the idea that LayerQuery could be better if it assembled the filters into an expression tree instead of eagerly evaluating them. This is related to that idea.
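A rough sketch of what "filters as an expression tree" could look like, assuming nothing about the real LayerQuery or any STAC client. `Tag`, `Filter`, `Item` and `matches` are all hypothetical names made up for this example; the point is only that `Tag("eo:cloud_cover") < 0.5` builds a data structure that an interpreter can evaluate (or translate into a catalog query) later.

```scala
object FilterSketch {
  /** A filter is data describing a predicate, not the predicate's result. */
  sealed trait Filter
  final case class LessThan(tag: String, value: Double) extends Filter
  final case class GreaterThan(tag: String, value: Double) extends Filter
  final case class And(left: Filter, right: Filter) extends Filter

  /** Entry point mirroring the Tag("eo:cloud_cover") < 0.5 syntax from the mockup. */
  final case class Tag(name: String) {
    def <(value: Double): Filter = LessThan(name, value)
    def >(value: Double): Filter = GreaterThan(name, value)
  }

  /** Hypothetical item: optional or user-defined tags live in a map. */
  final case class Item(id: String, tags: Map[String, Double])

  /** One possible interpreter: evaluate the tree against an item; a missing tag never matches. */
  def matches(filter: Filter, item: Item): Boolean = filter match {
    case LessThan(tag, v)    => item.tags.get(tag).exists(_ < v)
    case GreaterThan(tag, v) => item.tags.get(tag).exists(_ > v)
    case And(l, r)           => matches(l, item) && matches(r, item)
  }
}
```

Since the tree is plain data, the same `Filter` value could in principle be handed to a different interpreter, for example one that renders it as a query against the external catalog instead of filtering in memory.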

@pomadchin added the needs prototype label on May 30, 2019
@echeipesh added the question label on Jun 27, 2019
@echeipesh
Collaborator Author

The current question is whether the concept of an inspectable workspace (a collection of rasters, independent of RasterSource instances) still looks like a useful feature.


3 participants