Pre-load some spatial datasets #894

Open
adkinsrs opened this issue Sep 18, 2024 · 4 comments

@adkinsrs
Member

Sure, we could do this after the uploader step is created (#892), but I feel it would be better to just pre-load some spatial datasets our own way. One reason is that we can test the other tools in development (#890) without the uploader being a blocker. Another reason is that we can establish up front the ready-to-go format and stored file structure that the uploader should write into.

@adkinsrs adkinsrs self-assigned this Sep 18, 2024
@adkinsrs
Copy link
Member Author

Spoke with @jorvis about using Google Filestore as a test space, since we previously discussed having to move our datasets off the VM for performance reasons. Google Cloud Storage buckets would also work, but Filestore would be easier to integrate with our current codebases, which use filepaths.

Google Filestore overview -> https://cloud.google.com/filestore/docs/overview
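
Since Filestore mounts as an NFS share at a plain path on the VM, the existing filepath-based loading code should work unchanged. A minimal sketch, assuming a hypothetical mount point (nothing here is provisioned yet):

```python
import anndata

# Hypothetical mount point for the Filestore NFS share on the gEAR VM;
# the actual path would be whatever we choose when provisioning.
FILESTORE_ROOT = "/mnt/filestore/datasets"

def load_dataset(dataset_id: str) -> anndata.AnnData:
    """Read an h5ad exactly as we do from local disk today."""
    return anndata.read_h5ad(f"{FILESTORE_ROOT}/{dataset_id}.h5ad")
```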

@adkinsrs
Copy link
Member Author

adkinsrs commented Sep 18, 2024

Haven't provisioned anything yet, but here's what I'm thinking for the Filestore instance. Note that prices for us-east1 are actually about $50/month higher than listed below, because the examples cite us-central1:

  • Zonal (us-east1-b) with an initial 1 TB capacity - restricted to that single zone, but our gEAR VM is in that zone. Zonal pricing is much cheaper than regional ($256/month minimum per TB vs. $461).
  • There is a "basic" tier ($164/month per TB for HDD, $768 for SSD), but I don't think either is a good option. Basic HDD capacity is static (no room for growth), and basic SSD allows capacity growth but at a much higher cost than the zonal/regional tiers. Therefore I would not recommend "basic".
  • Both zonal and regional auto-grow and auto-shrink in 0.25 TB increments with a 1 TB minimum. However, you have to choose between a 1 TB-9.75 TB capacity band and a 10 TB-100 TB band (same rate, but 2.5 TB scaling increments), and the choice is permanent. So if we chose the lower band and then grew past 10 TB of data, we would have to create a new Filestore instance. Again, zonal is cheaper and would align well with our zonal gEAR VMs.
  • The other options, like the mount point name and network connections, are things we can control that should not have much bearing on costs.

So I don't think this would be terribly cost-efficient until we actually have data on the Filestore in active use. It would be a pretty inefficient use of resources to pay ~$256/month for me to test one spatial dataset until things are working and we can add more. (Quick side-by-side of the tiers below.)

More info -> https://cloud.google.com/filestore/docs/service-tiers
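
For a quick side-by-side of the per-TB figures above (these are the us-central1 example prices quoted in this comment, not a live quote):

```python
# Monthly cost per TB for the Filestore tiers discussed above.
# Figures are the us-central1 examples; us-east1 runs roughly $50/month higher.
FILESTORE_TIERS = {
    "basic HDD": 164,   # static capacity, no growth
    "basic SSD": 768,   # can grow, but expensive
    "zonal":     256,   # auto-grow/shrink, 1 TB minimum; matches our VM zone
    "regional":  461,   # auto-grow/shrink, 1 TB minimum
}

for tier, usd_per_tb in sorted(FILESTORE_TIERS.items(), key=lambda kv: kv[1]):
    print(f"{tier:10s} ~${usd_per_tb}/month per TB")
```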

@adkinsrs
Member Author

adkinsrs commented Sep 18, 2024

The other option would be to use one of Google's block-storage products instead of the file-storage service I described above. I read up on the differences, and the biggest one is that with file storage the filesystem management happens on Google's side, whereas with block storage you receive a raw block device and then create and manage the filesystem yourself (on the server).

Block storage (like Hyperdisk) also seems much cheaper than the file-storage options I quoted above: 1 TB of Hyperdisk Balanced provisioned space is $90/month. I believe you mount the disk to the VM just like in the other cases, but I need to read up more to get a feel for the flow of things. They also charge extra each month if we exceed the included baseline of 3,000 IOPS and 140 MBps throughput. I can see us probably going over 3,000 IOPS in a month (~$31 extra), but maybe not the 140 MBps; a rough estimate is sketched after the pricing link below.

https://cloud.google.com/compute/disks-image-pricing#disk
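
Back-of-the-envelope version of that estimate (the $90 base and ~$31 IOPS overage are the rough figures above, not official pricing; throughput overage assumed zero):

```python
# Rough monthly Hyperdisk Balanced estimate from the figures in this comment.
# All numbers are assumptions, not official GCP pricing.
capacity_tb = 1
base_per_tb = 90          # provisioned space, $/TB/month
iops_overage = 31         # rough extra for exceeding the 3,000 IOPS baseline
throughput_overage = 0    # assuming we stay under the 140 MBps baseline

total = base_per_tb * capacity_tb + iops_overage + throughput_overage
print(f"~${total}/month for {capacity_tb} TB Hyperdisk Balanced")  # ~$121
```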

Found this cool flowchart that may answer some questions as well -> https://cloud.google.com/static/architecture/images/storage-advisor.svg

Based on the flowchart, Filestore zonal looks like the best candidate, but I wouldn't rule out zonal Persistent Disk or Hyperdisk Balanced given their potentially better costs, if the integration and workflow pan out.

There is also this "which to choose" graphic -> https://cloud.google.com/blog/topics/developers-practitioners/map-storage-options-google-cloud

@adkinsrs
Member Author

Using traditional Google Cloud Storage bucket (object) storage is also an option and even cheaper (~$20/TB/month). We would have to use Cloud Storage FUSE to mount the bucket to our VMs, though -> https://cloud.google.com/storage/docs/gcsfuse-mount

I think from a strict requirements perspective, we do not NEED file-based access to the datasets. Generally, with the exception of saved analyses, all h5ads are stored in one flat location. But performance would take a hit, and we would need to enable caching to keep reads reasonably fast. A sketch of what direct object access with a local cache might look like is below.
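
As an alternative to the FUSE mount, here's a minimal sketch of direct object access with a crude download-once local cache (the bucket name and cache directory are hypothetical placeholders; same signature as the Filestore sketch above, for contrast):

```python
from pathlib import Path

import anndata
from google.cloud import storage

# Placeholder names -- nothing here is provisioned.
BUCKET_NAME = "gear-spatial-datasets"
CACHE_DIR = Path("/tmp/h5ad_cache")

def load_dataset(dataset_id: str) -> anndata.AnnData:
    """Download the h5ad from the bucket on first access, then read locally."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    local_path = CACHE_DIR / f"{dataset_id}.h5ad"
    if not local_path.exists():
        bucket = storage.Client().bucket(BUCKET_NAME)
        bucket.blob(f"{dataset_id}.h5ad").download_to_filename(str(local_path))
    return anndata.read_h5ad(local_path)
```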
