Filesystem - Large File Handling (Caching) #6411

Status: Open · 2 of 7 tasks · Tracked by #1327
Labels: PO issue (created by Product Owners [PLEASE use osparc-issue repo])
mrnicegyu11 opened this issue Sep 20, 2024 · 4 comments

mrnicegyu11 (Member) commented Sep 20, 2024

Event Horizon

  1. a:dynamic-sidecar (1 of 2): GitHK, matusdrobuliak66, sanderegg
  2. a:autoscaling: sanderegg

MartinKippenberger

  1. a:infra+ops efs-guardian: matusdrobuliak66
  2. matusdrobuliak66

mguidon (Member) commented Nov 5, 2024

  • Measure during the next sprint using AWS dashboards/graphs
  • Alternative: Lustre

matusdrobuliak66 (Contributor) commented Nov 24, 2024

General notes

| Type | Overview | Examples |
| --- | --- | --- |
| Client-Server | Central server stores files, accessed by clients over the network. | NFS, SMB |
| Peer-to-Peer | Nodes share files directly without a centralized server. | IPFS, BitTorrent |
| Object-Based | Files broken into objects with separate metadata for scalability. | Amazon S3, Ceph, MinIO |
| Clustered | Multiple nodes form a cluster to share resources and distribute workloads. | GFS, HDFS, Lustre |
| Parallel | Multiple servers handle data access simultaneously to improve performance. | IBM GPFS, BeeGFS |
| Cloud-Based | File systems designed to leverage cloud infrastructure. | Azure Blob Storage, Google Cloud Storage |
| Block-Based | Files divided into fixed-size blocks and distributed across nodes. | GlusterFS, MooseFS |
| Metadata-Based | Separates metadata from data storage for faster access and management. | XtreemFS |
| Hybrid | Combines features of multiple DFS types for flexibility. | Red Hat Ceph Storage |
| Specialized | Designed for specific workloads or industries. | QFS, OrangeFS |

  • XFS is not a distributed file system, but it is an excellent choice as the underlying local file system for distributed solutions; if you are setting up a distributed file system like GlusterFS, XFS is one of the best local file systems to use.
  • GlusterFS can provide NFS-like shares and scale by adding more nodes.

What do we want to achieve?

  • Speed up loading times for large projects by caching data from S3.
  • Mount a data folder across multiple EC2 instances.
  • Enable user billing for this feature.

Options looked at

  • Amazon EFS
  • Amazon FSx for Lustre
  • Regatta storage (pointed out by Dustin)
  • Gluster (open source)
  • Ceph
  • EBS (keeping EBS up)

EFS

  • 2 options:
    • Throughput mode - Elastic (the one we use now)
    • Throughput mode - Provisioned

Elastic

  • 💸 Storage 0.30$ (GB-Month)
  • 💸 Reads 0.03$ (per GB transferred)
  • 💸 Writes 0.06$ (per GB transferred)
  • Automatic backups are currently enabled -> probably not needed
  • Currently we use around 1% of the available IOPS
    (screenshot attached)

Provisioned

  • 20 TB per Month

    • 🏃‍♀ Default Throughput: 1024 MB/s
    • 💸 Storage: 20,480.00 GB (Standard Storage) × 0.30 USD/GB = 6,144.00 USD
  • 10 TB per Month

    • 🏃‍♀ Default Throughput: 512 MB/s
    • 💸 Storage: 10,240.00 GB (Standard Storage) × 0.30 USD/GB = 3,072.00 USD
    • Total Monthly Cost: 3,072.00 USD
  • 5 TB per Month

    • 🏃‍♀ Default Throughput: 256 MB/s
    • 💸 Storage: 5,120.00 GB (Standard Storage) × 0.30 USD/GB = 1,536.00 USD + Provisioned Throughput cost (if applicable)
    • Example of Provisioned Throughput cost: an additional 256 MB/s of throughput = 256.00 MB/s-month × 6.00 USD = 💸 1,536.00 USD (a small cost sketch follows this list)
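
As a back-of-the-envelope check of the EFS figures above, here is a minimal sketch of the cost arithmetic (assuming only the list prices quoted in this comment, i.e. 0.30 USD per GB-month of standard storage and 6.00 USD per MB/s-month of additionally provisioned throughput; not an official AWS calculator):

```python
# Minimal EFS Provisioned cost sketch based on the prices quoted above
# (assumptions, not an official AWS calculator).
STORAGE_USD_PER_GB_MONTH = 0.30
PROVISIONED_USD_PER_MBPS_MONTH = 6.00

def efs_provisioned_monthly_cost(storage_gb: float,
                                 extra_throughput_mbps: float = 0.0) -> float:
    """Monthly cost: standard storage plus optional extra provisioned throughput."""
    return (storage_gb * STORAGE_USD_PER_GB_MONTH
            + extra_throughput_mbps * PROVISIONED_USD_PER_MBPS_MONTH)

# Reproduces the figures above:
assert round(efs_provisioned_monthly_cost(20_480)) == 6_144
assert round(efs_provisioned_monthly_cost(10_240)) == 3_072
assert round(efs_provisioned_monthly_cost(5_120, extra_throughput_mbps=256)) == 3_072
```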

Notes

  • We currently do not use the Lifecycle Management feature, which would move data to the Infrequent Access class at 0.025$ (GB-Month). Files in the Standard class can be accessed with single-digit-millisecond latency, while files in the Infrequent Access class have double-digit-millisecond latency, so this is probably not useful for our use case.

PROS:

  • easy to set up
  • POSIX-compliant
  • Good for metadata-heavy operations (NFS type)

CONS:

  • does not support user quotas (but we already have a custom solution implemented)
  • to enable user billing based on throughput, a custom solution needs to be implemented (see the section at the end of this review)
  • cost? 💸
  • What happens if we hit the limits of provisioned speed/storage?

Amazon FSx for Lustre

  • https://aws.amazon.com/fsx/lustre/pricing/
  • Costs include storage and throughput together, with default IOPS. Additionally, data transferred "in" to and "out" from Amazon FSx across AZs or VPC peering connections in the same Region is charged at $0.01/GB in each direction.
    (pricing screenshot attached)
  • 20 TB per Month (21,600 GB, because capacity must be a multiple of 2,400 GB)
    • 🏃‍♀ Throughput: 500 MB/s
    • 💸 Storage: 21,600 GB (Standard Storage) × 0.340 USD/GB = 7,344 USD
  • 20 TB per Month (Scratch SSD) (21,600 GB, because capacity must be a multiple of 2,400 GB)
    • 🏃‍♀ Throughput: 200 MB/s
    • 💸 Storage: 21,600 GB (Scratch SSD) × 0.140 USD/GB = 3,024 USD
    • 🔴 I am not sure whether a scratch file system is a good idea
      (screenshot attached)
  • Additional costs 💸: provisioned metadata IOPS at 0.055$ per IOPS-month
    • e.g. buying 1,500 additional IOPS: 1,500 × 0.055 = 82.5$ (see the cost sketch after this list)
      (screenshot attached)
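
The same kind of back-of-the-envelope arithmetic for FSx for Lustre, as a minimal sketch (assuming the prices quoted above, 0.340 USD/GB-month for the persistent 500 MB/s tier, 0.140 USD/GB-month for Scratch SSD and 0.055 USD per metadata IOPS-month, plus the 2,400 GB capacity increments; treat the numbers as assumptions, not an official quote):

```python
import math

# Minimal FSx-for-Lustre cost sketch based on the figures above.
CAPACITY_STEP_GB = 2_400            # capacity must be a multiple of 2,400 GB
METADATA_IOPS_USD_PER_MONTH = 0.055

def fsx_monthly_cost(requested_gb: float,
                     usd_per_gb_month: float,
                     extra_metadata_iops: int = 0) -> float:
    """Round capacity up to the next 2,400 GB step, add optional metadata IOPS."""
    capacity_gb = math.ceil(requested_gb / CAPACITY_STEP_GB) * CAPACITY_STEP_GB
    return (capacity_gb * usd_per_gb_month
            + extra_metadata_iops * METADATA_IOPS_USD_PER_MONTH)

# Reproduces the figures above:
assert round(fsx_monthly_cost(20_000, 0.340)) == 7_344   # persistent, 500 MB/s
assert round(fsx_monthly_cost(20_000, 0.140)) == 3_024   # Scratch SSD
assert round(fsx_monthly_cost(0, 0.0, extra_metadata_iops=1_500), 2) == 82.5
```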

PROS:

  • AWS managed
  • does support user quotas

CONS:

  • to enable user billing based on throughput, a custom solution needs to be implemented (see the section at the end of this review)
  • cost? 💸
  • What happens if we hit the limits of provisioned speed/storage?

Regatta storage

  • https://regattastorage.com/ (pointed out by Dustin)
  • Fixed costs
    • 25$ (per Month)
  • Storage
    • 0.20$ (GB-Month)
  • Throughput
    • 0.05$ (per GB transferred)

CONS:

  • not many reviews
  • 3rd-party vendor
  • not much customization (monitoring? user quotas?)
  • caching logic is handled by them (data is evicted after 1 hour), not useful for our use case

Gluster (open source)

  • https://www.gluster.org/install/

  • AWS Installation: https://docs.gluster.org/en/v3/Install-Guide/Setup_aws/

  • gp3 EBS costs: 💸 0.08$/GB-month + 0.04$ per MB/s-month above the 125 MB/s baseline + 0.005$ per IOPS-month above the 3,000 IOPS baseline. We currently boost EBS by at most an additional 875 MB/s and 13,000 IOPS. (Q: we should analyse how much is really needed.)

  • EC2:

    • 4x m5.large (2 vCPUs, 8 GB RAM, 0.096$ hourly × 24 × 30 ≈ 70$ monthly)
      • 💸 70 * 4 = 280$ monthly
    • 8x m5.large (2 vCPUs, 8GB RAM)
      • 💸 70 * 8 = 560$ monthly
  • EBS:

    • 1 TB disk space:
      • 🏃‍♀ Throughput: 1000 MB/s & 16000 IOPS
      • 💸Cost Breakdown:
        • 0.08 * 1000 = 80 USD
        • 875 * 0.04 = 35 USD
        • 13,000 * 0.005 = 65 USD
      • Total Monthly Cost: 180 USD
    • 💸 20 TB of usable disk space (we need at least 30 TB raw for fault tolerance)
      • 💸 180 × 30 = 5,400$

TOTAL ESTIMATE ~ 6,000$ (see the rough breakdown sketched below)
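
As a small sketch of how the ~6,000$ figure is composed (assuming the per-node and per-TB monthly costs derived in the bullets above, roughly 70 USD per m5.large node and 180 USD per TB of boosted gp3 EBS; inter-AZ traffic and snapshots are ignored):

```python
# Rough total for a GlusterFS-on-EBS setup, using the monthly costs derived
# in the bullets above (assumptions, not an official quote).
def gluster_monthly_estimate(n_nodes: int = 8, raw_capacity_tb: int = 30,
                             usd_per_node: float = 70.0,
                             usd_per_tb: float = 180.0) -> float:
    """EC2 nodes plus EBS bricks; ignores inter-AZ traffic and snapshots."""
    return n_nodes * usd_per_node + raw_capacity_tb * usd_per_tb

# 8 × 70 + 30 × 180 = 5,960 USD, i.e. roughly the ~6,000 USD estimate above
assert gluster_monthly_estimate() == 5_960.0
```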

CONS:

  • 🔴 maintenance!
  • setup effort
  • What happens if we hit the limits of provisioned speed/storage?

Ceph (using inhouse Ceph cluster)

  • Q: probably an issue that we would need to move data between the cloud and the in-house cluster?
  • hard to maintain

EBS (keeping EBS up)

  • gp3 EBS costs: 💸 0.08$/GB-month + 0.04$ per MB/s-month above the 125 MB/s baseline + 0.005$ per IOPS-month above the 3,000 IOPS baseline. We currently boost EBS by at most an additional 875 MB/s and 13,000 IOPS. (Q: we should analyse how much is really needed.) A small cost sketch follows after this list.
  • 1 TB disk space example:
    • 🏃‍♀ Throughput: 1000 MB/s & 16000 IOPS
    • 💸Cost Breakdown:
      • 0.08 * 1000 = 80 USD
      • 875 * 0.04 = 35 USD
      • 13,000 * 0.005 = 65 USD
    • Total Monthly Cost: 180 USD
      • Keeping the volume up for only 10 days: 60 USD
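
A minimal sketch of the gp3 arithmetic above, including prorating by how long the volume is kept up (assuming the 0.08/0.04/0.005 USD rates and the 125 MB/s and 3,000 IOPS baselines quoted above; not an official AWS calculator):

```python
# Minimal gp3 EBS cost sketch based on the rates above (assumptions only).
def gp3_monthly_cost(size_gb: float, throughput_mbps: float, iops: int,
                     days_kept: float = 30.0) -> float:
    """Monthly gp3 cost, prorated if the volume is only kept up part of the month."""
    storage = size_gb * 0.08
    extra_throughput = max(0.0, throughput_mbps - 125) * 0.04
    extra_iops = max(0, iops - 3_000) * 0.005
    return (storage + extra_throughput + extra_iops) * days_kept / 30.0

# Reproduces the figures above:
assert round(gp3_monthly_cost(1_000, 1_000, 16_000), 2) == 180.00
assert round(gp3_monthly_cost(1_000, 1_000, 16_000, days_kept=10), 2) == 60.00
```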

PROS:

  • 💸 the predictable pricing model eliminates the need for upfront provisioning, and there is no need to deal with hitting the limits of a provisioned distributed file system
  • probably no need to monitor throughput (the pricing model can be based on the runtime of the EBS volume)

CONS:

  • Useful for caching the user workspace, but not if we want a single general solution that also covers mounting a folder on multiple EC2 instances.

Additional notes:

  • To enable user billing based on throughput:
    • We probably need to create a custom CloudWatch metric and push data (e.g., Lustre I/O statistics) from the EC2 instance to CloudWatch. Lustre exposes statistics via lctl; for EFS, nfsstat, df, or iotop might work (needs to be investigated!). A rough sketch follows below.
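
A minimal sketch of what pushing such a custom metric could look like, assuming boto3 with credentials/region already configured on the instance and that per-user read/write byte counters have already been parsed from lctl (Lustre) or nfsstat (EFS); the namespace and metric names below are made up for illustration:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

def push_throughput_metric(user_id: str, read_bytes: int, write_bytes: int) -> None:
    """Publish per-user filesystem throughput as a custom CloudWatch metric."""
    cloudwatch.put_metric_data(
        Namespace="osparc/filesystem",  # hypothetical namespace
        MetricData=[
            {
                "MetricName": "ReadBytes",
                "Dimensions": [{"Name": "UserID", "Value": user_id}],
                "Value": read_bytes,
                "Unit": "Bytes",
            },
            {
                "MetricName": "WriteBytes",
                "Dimensions": [{"Name": "UserID", "Value": user_id}],
                "Value": write_bytes,
                "Unit": "Bytes",
            },
        ],
    )

# Example: counters parsed periodically from lctl/nfsstat would be pushed like
# push_throughput_metric("user-123", read_bytes=1_048_576, write_bytes=524_288)
```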

matusdrobuliak66 (Contributor) commented Nov 25, 2024

Conclusion (discussed by Matus and Manuel on 25.11.2024)

  • Moving away from the idea of using a distributed file system for caching and instead relying solely on pure EBS volumes at the user level seems to be a more reasonable approach.
    • From the analysis above, you can see that it is much cheaper than the other solutions.
    • It involves fewer issues to address:
      • For instance, how to handle distributed performance/limits, such as what happens if we reach the storage limit?
      • There is no need to measure throughput.
    • We can build a better and more manageable pricing model around it (e.g., credits per hour of running the cached EBS).
    • As a long-term strategy, we aim to provide users with the option to create their own computers with multiple services (Enhancement: Lock a EC2 machine for a specific project #5669). With this approach, mounting an EBS volume will automatically allow it to be shared across all services.

NEXT Steps:

NOTE: We might use the current EFS infrastructure to store VIP models.

matusdrobuliak66 (Contributor) commented Nov 26, 2024

Update 26.11.2024

  • We identified an issue that will also affect the keep-the-EBS-volume-up logic: the dynamic sidecar always changes permissions on all files at start-up. Until now this was typically done on an empty volume, but if the volume already contains many files and the permission change does not finish within 1 minute, the Docker health check kills the sidecar and prevents it from starting.
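
For illustration only (this is not the sidecar's actual code): one possible mitigation is to change ownership only for entries whose uid/gid actually differ from the target, so a volume whose files are already correct is traversed but not rewritten; the target ids and path below are hypothetical:

```python
import os

TARGET_UID = 8004  # hypothetical uid the sidecar expects, not the real value
TARGET_GID = 8004  # hypothetical gid

def fix_ownership(root: str) -> int:
    """Chown only entries whose owner or group differs from the target.

    Skipping already-correct entries avoids rewriting every inode on a
    pre-populated volume, which is what pushes start-up past the 1-minute
    Docker health-check window.
    """
    changed = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for entry in [dirpath] + [os.path.join(dirpath, f) for f in filenames]:
            st = os.lstat(entry)
            if st.st_uid != TARGET_UID or st.st_gid != TARGET_GID:
                os.lchown(entry, TARGET_UID, TARGET_GID)
                changed += 1
    return changed

if __name__ == "__main__":
    print(f"changed ownership of {fix_ownership('/path/to/workspace')} entries")
```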
