delta-only repositories #729

Open
cgwalters opened this issue Mar 10, 2017 · 15 comments

Comments

@cgwalters
Member

In the Fedora/CentOS case where by default we rely on e.g. university-owned mirrors that might be some random ext4 server and not a proper object store, we can hit performance issues with the archive format.

It should be quite possible to make it easier for server operators to manage a "delta-only" repository. See also: #701

So it's delta-only + single "from empty" delta for the latest.

I think it'd be possible to cobble this together today via `ostree static-delta generate --min-fallback-size 100000` for each delta you want, then `ostree summary -u`, then sync the summary and `deltas/` content to the "delta repo".
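A rough sketch of that workflow, written as a Python helper that only builds the command lines (it does not run them). The repo path, destination, revision names, and the use of rsync for the sync step are all assumptions for illustration; the `--from` and `--empty` options and the positional target revision are how I understand `ostree static-delta generate` to work, so check them against your ostree version.

```python
import shlex


def delta_only_commands(src_repo, dest, pairs, latest, min_fallback_size=100000):
    """Return argv lists for building and publishing a delta-only repo.

    src_repo -- path of the full (archive) repo; hypothetical
    dest     -- rsync destination for the delta-only copy; hypothetical
    pairs    -- (from_rev, to_rev) tuples, one per upgrade delta to publish
    latest   -- revision that also gets the single "from empty" delta
    """
    repo = f"--repo={src_repo}"
    fallback = f"--min-fallback-size={min_fallback_size}"
    cmds = []
    for from_rev, to_rev in pairs:
        # One delta per upgrade path we want to offer.
        cmds.append(["ostree", repo, "static-delta", "generate",
                     fallback, f"--from={from_rev}", to_rev])
    # The single "from empty" (scratch) delta for the latest commit.
    cmds.append(["ostree", repo, "static-delta", "generate",
                 fallback, "--empty", latest])
    # Regenerate the summary after the deltas exist.
    cmds.append(["ostree", repo, "summary", "-u"])
    # Copy only the summary file and the deltas/ directory (sync tool is an assumption).
    cmds.append(["rsync", "-a", f"{src_repo}/summary", f"{src_repo}/deltas", dest])
    return cmds


if __name__ == "__main__":
    for argv in delta_only_commands("/srv/repo", "mirror:/srv/delta-repo/",
                                    [("v1", "v2")], "v2"):
        print(shlex.join(argv))
```

Printing the commands first makes it easy to review them before running anything against a real repo.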

@alexlarsson
Member

I think this sounds good, as long as it properly falls back to the "from empty" delta if we're pulling from "not the next-to-latest" local version.

@cgwalters
Member Author

(But we need some unit test coverage, and there's various enhancements one could make on top of this like being able to fall back to a separate archive repo for e.g. downgrades)

@cgwalters
Member Author

Also, one thing occurs to me - we'd at least need to maintain the commit objects in the repo, otherwise prune would prune the deltas.

@dustymabe
Contributor

> (But we need some unit test coverage, and there's various enhancements one could make on top of this like being able to fall back to a separate archive repo for e.g. downgrades)

Does this issue cover the creation of unit tests for static-delta-only repos, or do we need another ticket for that?

> Also, one thing occurs to me - we'd at least need to maintain the commit objects in the repo, otherwise prune would prune the deltas.

Are we talking about the static-delta-only repo? Wouldn't that defeat the point of not having a bunch of small files in the repo? If we have a master repo where the small files and the static deltas live, and then just create static-delta-only repos by copying content out of that repo, then we don't need to worry about this, correct?

@ramcq
Contributor

ramcq commented Jan 9, 2018

> be some random ext4 server and not a proper object store

@cgwalters I'm kind of confused by this: what about a filesystem makes it unsuitable for storing/hosting an ostree repo? Is there a more effective backend in which you can store an ostree repo and serve it over HTTP? Or do mirror operators simply dislike having lots of files around?

@alexlarsson
Member

So, I recently chatted with someone who was running an "app store" about how they implement authorized downloads. Basically, they serve the app files from a CDN like CloudFront and use a feature like CloudFront secure URLs: the final URL is generated on their server, which knows whether the logged-in user is allowed to download a particular app. The secure URL has a lifetime of 30 seconds and is signed on the server, so the client doesn't have to care and can just download the thing. The feature is documented here: https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/PrivateContent.html

In the context of ostree we could do the same thing if we had a delta-only repo on a cdn:

  • Use ostree pull --http-header=NAME=VALUE to set the secure cookie with a custom policy
  • In the policy, give access to "repo/deltas//${END OF COMMIT BASE64}" and only for 30 sec

CloudFront allows you to use cookies for this, but it seems some other CDNs only support HTTP query parameters, so maybe ostree should have a feature similar to --http-header that adds an HTTP query parameter to all URLs.
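To make the "custom policy" part concrete, here is a minimal sketch of building a short-lived CloudFront custom policy document. The function name and the example resource URL are hypothetical, and the sketch deliberately stops before signing: a real deployment would still have to sign the policy with the CloudFront key pair and hand the result to the client, for example via `ostree pull --http-header=NAME=VALUE` as suggested above.

```python
import json
import time


def cloudfront_custom_policy(resource_url, lifetime_seconds=30, now=None):
    """Build an unsigned CloudFront custom policy that expires shortly.

    resource_url     -- URL (or URL pattern) the policy grants access to;
                        supplied by the caller, nothing is assumed about
                        the delta path layout here
    lifetime_seconds -- how long the grant is valid (30 s as in the thread)
    now              -- current epoch time, injectable for testing
    """
    now = int(time.time()) if now is None else int(now)
    policy = {
        "Statement": [
            {
                "Resource": resource_url,
                "Condition": {
                    "DateLessThan": {"AWS:EpochTime": now + lifetime_seconds}
                },
            }
        ]
    }
    # Compact separators: the policy is serialized without whitespace before signing.
    return json.dumps(policy, separators=(",", ":"))


if __name__ == "__main__":
    # Hypothetical host and path pattern, for illustration only.
    print(cloudfront_custom_policy("https://cdn.example.com/repo/deltas/*"))
```

The server would generate one such policy per authorized request, so an expired or leaked policy is only useful for the few seconds it stays valid.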

@dustymabe
Contributor

@jlebon, @cgwalters, @sinnykumari and I were discussing 'delta-only' repos today. One thing @jlebon brought up was:

  jlebon | @walters @ksinny @dustymabe, just remembered re. static deltas -- those can actually list
         | fallback objects the client should just fetch directly from `objects/`. so we'll have to be
         | careful of that, either also mirroring just those ones (i think they're usually big files), or
         | teach ostree to fetch fallback objects from a separate repo?
 walters | yeah, i think we need a repo config flag saying it's a delta-only repo

@ramcq
Contributor

ramcq commented Dec 18, 2018

I've been discussing this stuff with @alexlarsson a lot in the context of Flathub. At one point, the Flathub stats were showing that each download (whether an upgrade or a new pull) was averaging 1GB of data transferred, but this was during a period when ostree, if it didn't see a matching delta, would pull the scratch delta instead of doing an object pull (madness, later resolved).

A delta-only repo is basically reinstating this. Mirrors are great and everything, but are a far less relevant way of distributing files than modern caching/proxying CDNs. BunnyCDN (for Endless) and Fastly (for Flathub) work a-OK for ostree repos: you can easily tune the caching to keep the immutable objects around for ~ever and have short timeouts / explicit purges. It's pretty easy to cache ostree repos in CDNs, and the hit rate is superb (>97% in both cases I have access to, likely the two largest production ostree repos at present).

So: what problem is really being solved here? When you look at your CDN bill, or the time and data it costs at the client to have a very limited version of things on the server, I'm really not convinced that a delta-only repo is a benefit for clients unless we make deltas heaps smarter. It makes mirroring easier, yes, because you have maybe one or a couple of delta folders per ref, but most people don't have a mirror network. So I think it represents a net loss for the bandwidth efficiency of the client, unless we:

  • Figure out some better practices/heuristics for generating/retaining deltas, such as having a chain of them until the cumulative size approaches a % threshold of the scratch delta
  • Have clients do some path-solving to actually pull a few deltas in a row, rather than falling back to an object pull
  • Take the list of deltas out of the summary file, otherwise it will massively bloat if you start to have any non-useless deltas available for people. I guess it could go in the commit metadata, provided you could square the circle of the delta itself containing the commit ID as a variable (but this is something that has been worked around in e.g. flatpak build-commit-from, because the delta is really about the content trees, not the commits, so they can be monkey-patched to a different commit)
  • More, smarter things...
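The first two bullets can be sketched together: treat the available deltas as a weighted graph, find the cheapest chain from the commit the client has to the one it wants, and give up once the cumulative size passes a fraction of the scratch delta. This is only an illustration of the idea, not how ostree behaves today; the function name, the input shape, and the 0.8 threshold are all made up for the example.

```python
import heapq


def cheapest_delta_chain(deltas, have, want, scratch_size, threshold=0.8):
    """Pick the cheapest chain of deltas from `have` to `want`.

    deltas       -- dict mapping (from_commit, to_commit) -> download size in bytes
    scratch_size -- size of the "from empty" delta for `want`
    threshold    -- chains costing more than this fraction of scratch_size are rejected

    Returns (path, total_size), or None if the client should fall back
    to the scratch delta or an object pull.
    """
    edges = {}
    for (src, dst), size in deltas.items():
        edges.setdefault(src, []).append((dst, size))

    budget = threshold * scratch_size
    best = {have: 0}
    queue = [(0, have, [have])]
    while queue:
        cost, commit, path = heapq.heappop(queue)
        if commit == want:
            return path, cost
        if cost > best.get(commit, float("inf")):
            continue  # stale queue entry, a cheaper route was already found
        for nxt, size in edges.get(commit, []):
            new_cost = cost + size
            # Only follow edges that stay within the size budget.
            if new_cost <= budget and new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                heapq.heappush(queue, (new_cost, nxt, path + [nxt]))
    return None


if __name__ == "__main__":
    deltas = {("a", "b"): 10, ("b", "c"): 20, ("a", "c"): 50}
    print(cheapest_delta_chain(deltas, "a", "c", scratch_size=100))
```

The same budget check could drive the server side: stop retaining older deltas in a chain once the chain's cumulative size approaches the chosen percentage of the scratch delta.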

@cgwalters
Member Author

> but this was during a period that when ostree didn't see a matching delta it would pull the scratch delta instead of doing an object pull (madness, later resolved).

Right: #1709

@ramcq
Contributor

ramcq commented Dec 18, 2018

> > but this was during a period that when ostree didn't see a matching delta it would pull the scratch delta instead of doing an object pull (madness, later resolved).
>
> Right: #1709

Oh yeah! What I said back then. tl;dr - deltas are an amazing technical advantage of ostree, and (modulo bringing any repo server to its knees when generating them on large files) incredibly smart and bandwidth efficient, but they totally fail to deliver on that promise due to how they are currently deployed and managed. Let's make the repo management tools (ostree/flatpak/repo-manager) smarter before we force that ineffectual deployment cost onto our downstream mirrors and every end user by flipping a delta-only bit and not solving the real problem. :)

@cgwalters
Member Author

We (FCOS) are discussing this in the context of this issue which links to this MirrorManager one. A concern some people have is tying ourselves solely to a CDN.

@ramcq
Contributor

ramcq commented Dec 18, 2018 via email

@dustymabe
Contributor

> concern some people have is tying ourselves solely to a CDN.

For me, I'm not as concerned with tying ourselves to a CDN. We've been using a CDN for our ostree repo for a little while now and people still complain about slow download speeds and timeouts all the time. So we either have things configured badly or things are getting cycled out of the cache too fast. See also #1541 where we were discussing one optimization (i.e. the many redirects might be what is slowing down the downloads).

If we can get a good CDN "answer" then I'd be fine with that too.

@ramcq
Contributor

ramcq commented Dec 19, 2018

Oh! Yeah redirects absolutely rinse the performance of whatever pipelining ostree is doing - at least I've definitely seen that at some point early in Flathub's life - that's why we set up dl.flathub.org as a separate hostname for repo access only. You have to point the origin in ostree to the hostname and path served by the CDN - you could probably finesse that with a mirrorlist of one in ostree.

I am almost certain that any Flathub issues are all due to load on the origin server rather than any problem with the CDN. Debian for instance has two CDNs (CloudFront and Fastly) and pays for neither - for Flathub we got Fastly basically by me tweeting, and it wasn't the only offer we received, just one of the best CDNs so I didn't spend much time with the others.

@ramcq
Contributor

ramcq commented Dec 19, 2018

https://gist.github.com/ramcq/a3991b5834767c6da73eec1af08b52ab is how the origin is configured on Flathub, fwiw.
