The recent work on #73 revealed another scalability issue when importing large filesets into OMERO.server.
Synthetic datasets with varying numbers of files can easily be created using a combination of Bio-Formats test images and pattern files:
for i in $(seq 1 $N); do touch "t$i.fake"; done && echo "t<1-$N>.fake" > multifile.pattern
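For reference, a small sketch of how the one-liner above could be repeated for each of the tested fileset sizes (the per-size directory layout and the `fileset_$N` naming are illustrative assumptions, not taken from the original setup):

```sh
# Create one synthetic fileset per tested size, each in its own directory
# together with the pattern file that groups the .fake files into a single fileset.
for N in 10 100 1000 10000 50000; do
  dir="fileset_$N"
  mkdir -p "$dir"
  (
    cd "$dir"
    for i in $(seq 1 "$N"); do touch "t$i.fake"; done
    echo "t<1-$N>.fake" > multifile.pattern
  )
done
```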
Each of these datasets can then be imported using the OMERO command-line interface. In this case, the import was done using an in-place transfer, skipping the min/max calculation:
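The exact command line is not reproduced here; a plausible invocation, assuming the symlink flavour of in-place transfer and the standard `--skip=minmax` option of the OMERO CLI, would be:

```sh
# In-place import (symlink transfer), skipping the min/max pixel calculation
omero import --transfer=ln_s --skip=minmax multifile.pattern
```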
The import time for a given fileset can then be queried using `omero fs importtime Fileset:<id>`. The output of this command provides a breakdown by phase (upload, metadata, ...). A more detailed analysis can be obtained from the import logs stored under the managed repository.
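For example (the fileset ID `123` is a placeholder and the managed repository path shown is the default one; both are assumptions that depend on the actual server configuration):

```sh
# Per-phase timing breakdown for a hypothetical fileset with ID 123
omero fs importtime Fileset:123

# List the import logs stored under the managed repository
# (default location shown; it is controlled by the omero.managed.dir setting)
find /OMERO/ManagedRepository -name "*.log"
```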
The import command above was executed for synthetic filesets of growing sizes (10, 100, 1000, 10000 and 50000 files) using OMERO.server 5.6.11 and an initially empty database. The following table reports the import metrics in the upload phase as well as a breakdown by a few sub-steps in this phase:
| Number of files | Upload time (s) | omero.repo.create_original_file.save (s) | omero.repo.save_fileset (s) | omero.import.process.checksum (s) |
| --- | --- | --- | --- | --- |
| 10 | 1.3 | 0.01 | 0.1 | 0.1 |
| 100 | 5.8 | 0.06 | 0.28 | 0.5 |
| 1000 | 43.9 | 0.5 | 2.1 | 5.2 |
| 10000 | 382.5 | 4.4 | 57.2 | 53 |
| 50000 | 4778.2 | 20.7 | 2195.0 | 910 |
In principle, the number of transfer operations and of objects to create should scale linearly with the number of files in the fileset. Thus we would reasonably expect the execution time to also scale linearly with the number of files.
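As a quick sanity check, the per-file cost can be derived from the table above (a small sketch; the numbers are copied from the table and only three of the columns are shown):

```sh
# Per-file cost in seconds: files, upload, save_fileset, checksum.
# A step that scales linearly should show a roughly constant per-file cost.
printf '%s\n' \
  '10    1.3    0.1    0.1' \
  '100   5.8    0.28   0.5' \
  '1000  43.9   2.1    5.2' \
  '10000 382.5  57.2   53' \
  '50000 4778.2 2195.0 910' |
awk '{ printf "%-6s upload: %.4f  save_fileset: %.5f  checksum: %.5f\n", $1, $2/$1, $3/$1, $4/$1 }'
```

The per-file cost of the last two steps grows markedly between 1000 and 50000 files, whereas a linear step would keep it roughly constant.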
The last column of the table shows some non-linear behavior and corresponds to the issue described in #73. This should hopefully be addressed in OMERO.server 5.6.12 and help with one of the bottlenecks associated with importing large filesets.
Unlike the creation of `OriginalFile` objects in `RepositoryDaoImpl.createOriginalFile`, whose execution scales linearly with the number of files in the fileset, the creation of the `Fileset` in `RepositoryDaoImpl.saveFileset` increases non-linearly. For a typical fileset of 50K files, ~50% of the total upload time is spent in this operation, and more precisely 2068s (95% of the saveFileset time) are spent in the single call saving the objects to the database in omero-blitz/src/main/java/ome/services/blitz/repo/RepositoryDaoImpl.java (lines 555 to 560 at 4c46e15).

/cc @chris-allan @joshmoore @jburel @pwalczysko
Nice write up, @sbesson. I don't have any immediate ideas. I'd be interested in how much faster a dirty change to saveAndReturnIds is (though it's likely a breaking change), as well as how close in size the graph is to your RAM. After that, it'd be on to profiling AFAIK.