Add slash at the end of the load path (#1366)
Shredder groups entities that share the same schema model-revision-addition into the same folder within a batch. Suppose a batch contains events with versions `1-0-0`, `1-0-1` and `1-0-2` of the `com.acme.test` schema. The resulting run folder will then contain the following subfolders:
```
output=good/vendor=com.acme/name=test/format=tsv/model=1/revision=0/addition=0
output=good/vendor=com.acme/name=test/format=tsv/model=1/revision=0/addition=1
output=good/vendor=com.acme/name=test/format=tsv/model=1/revision=0/addition=2
```
Before this fix, Loader used S3 paths without a trailing slash (/) in the COPY statements it generated. This works in most cases. However, a problem arises when the same batch contains events with versions `1-0-1` and `1-0-11`. In that case, the run folder will contain the following subfolders:
```
output=good/vendor=com.acme/name=test/format=tsv/model=1/revision=0/addition=1
output=good/vendor=com.acme/name=test/format=tsv/model=1/revision=0/addition=11
```
When Loader issues a COPY statement to load the entities under `/model=1/revision=0/addition=1` into their table, Redshift also tries to load the entities under `/model=1/revision=0/addition=11`, because both paths share the same prefix. The COPY then fails, since the data under `/model=1/revision=0/addition=11` doesn’t have the same structure as `1-0-1`. Adding a slash at the end of the path solves the problem: only entities under `model=1/revision=0/addition=1` are copied, as expected.
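The prefix collision can be reproduced without Redshift. The sketch below is only an illustration and is not part of the Loader codebase: it assumes the load path is treated as a plain key prefix, and the object name `PrefixDemo`, the helper `matching` and the `part-0000.tsv` file names are hypothetical.

```scala
// Hypothetical illustration of key-prefix matching; not Loader code.
object PrefixDemo {
  // A key is selected when it starts with the given prefix.
  def matching(keys: List[String], prefix: String): List[String] =
    keys.filter(_.startsWith(prefix))

  val base = "output=good/vendor=com.acme/name=test/format=tsv/model=1/revision=0"

  // Assumed object keys for schema versions 1-0-1 and 1-0-11.
  val keys = List(
    s"$base/addition=1/part-0000.tsv",
    s"$base/addition=11/part-0000.tsv"
  )

  // Without the trailing slash, the 1-0-11 key is selected as well.
  val withoutSlash = matching(keys, s"$base/addition=1")

  // With the trailing slash, only the 1-0-1 key is selected.
  val withSlash = matching(keys, s"$base/addition=1/")

  def main(args: Array[String]): Unit = {
    println(s"without slash: ${withoutSlash.size} keys")
    println(s"with slash: ${withSlash.size} keys")
  }
}
```

The same reasoning applies to any pair of additions where one number is a string prefix of the other, such as `2` and `20`.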
spenes committed Nov 13, 2024
1 parent 0b447df commit e6ac798
Showing 2 changed files with 4 additions and 4 deletions.
@@ -64,7 +64,7 @@ object ShreddedType {
*/
final case class Json(info: Info, jsonPaths: BlobStorage.Key) extends ShreddedType {
def getLoadPath: String =
-s"${info.base}${Common.GoodPrefix}/vendor=${info.vendor}/name=${info.name}/format=json/model=${info.version.model}/revision=${info.version.revision}/addition=${info.version.addition}"
+s"${info.base}${Common.GoodPrefix}/vendor=${info.vendor}/name=${info.name}/format=json/model=${info.version.model}/revision=${info.version.revision}/addition=${info.version.addition}/"

def show: String = s"${info.toCriterion.asString} ($jsonPaths)"
}
@@ -78,7 +78,7 @@ object ShreddedType {
*/
final case class Tabular(info: Info) extends ShreddedType {
def getLoadPath: String =
-s"${info.base}${Common.GoodPrefix}/vendor=${info.vendor}/name=${info.name}/format=tsv/model=${info.version.model}/revision=${info.version.revision}/addition=${info.version.addition}"
+s"${info.base}${Common.GoodPrefix}/vendor=${info.vendor}/name=${info.name}/format=tsv/model=${info.version.model}/revision=${info.version.revision}/addition=${info.version.addition}/"

def show: String = s"${info.toCriterion.asString} TSV"
}
@@ -144,8 +144,8 @@ class RedshiftSpec extends Specification {
result.toList must containTheSameElementsAs(
List(
"COPY events FROM s3://my-bucket/my-path/", // atomic
-"COPY com_acme_event_2 FROM s3://my-bucket/my-path/output=good/vendor=com.acme/name=event/format=tsv/model=2/revision=0/addition=0",
-"COPY com_acme_event_3 FROM s3://my-bucket/my-path/output=good/vendor=com.acme/name=event/format=tsv/model=3/revision=0/addition=0"
+"COPY com_acme_event_2 FROM s3://my-bucket/my-path/output=good/vendor=com.acme/name=event/format=tsv/model=2/revision=0/addition=0/",
+"COPY com_acme_event_3 FROM s3://my-bucket/my-path/output=good/vendor=com.acme/name=event/format=tsv/model=3/revision=0/addition=0/"
)
)
}