Unpredictable number of files created by substation #219
Hey team, we've noticed that there isn't a 1-to-1 correspondence between the number of objects coming in from the source and the number of objects written to the destination bucket. To me it seems like there are two reasons this could be the case.
So I have isolated these variables, and for some reason the source file is still being written as multiple files in the sink. I have a zipped 18 MB file (~100 KB zipped), the batch size is set to 10 MB, and the concurrency is set to one. This file is uploaded, Substation is triggered, and 10 files are written to the sink (each 1.8 MB). Do you have any idea why this would be?
@viraj-lunani Substation doesn't support reading Zip files (it dynamically decompresses Gzip, Snappy, zstd, and Bzip2), and any files it does read should decompress to text (not a binary format). This problem is most likely caused by the Zip file being read as text, and probably has nothing to do with the size of the file. For example, you can read and write a ~70 MB base64-encoded string with this config:

```jsonnet
{
  transforms: [
    sub.tf.send.file({
      batch: { size: 1000 * 1000 * 75 }, // 75MB
      file_path: { uuid: true },
    })
  ],
}
```

Generate a file and convert it to base64 if you want to test it:

```shell
dd if=/dev/urandom of=file bs=1M count=50
cat file | base64 > file.b64
```

You can try that with this branch (it fixes a bug when reading very large lines; AWS Lambda isn't affected by this). Generic support for non-text (binary) files would need to be added in locations like this, and support for archive formats would then be new transform functions.
Hey, sorry for the confusion, but the files are gzip and they decompress to text. After testing this locally with the sub.tf.send.file() transform, I believe the issue is that the default batch count is 1000, which is why the file size never reached the 10 MB limit: the count hit its limit of 1000 first. Thank you for the help!
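In case it helps anyone who finds this later, here is a sketch of the fix, assuming the batch settings accept a `count` field alongside `size` (as in Substation's batch configuration); the `100000` value is just an illustrative number chosen so the byte-size limit is reached before the record-count limit:

```jsonnet
{
  transforms: [
    sub.tf.send.file({
      // Raise the record-count limit (default 1000) so the 10MB
      // size limit is what actually triggers a flush.
      batch: { size: 1000 * 1000 * 10, count: 100000 }, // 10MB, 100k records
      file_path: { uuid: true },
    })
  ],
}
```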