Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] Disk and memory-backed cache with Foyer 0.10. #2890

Merged
merged 11 commits into from
Oct 4, 2024
Merged

Conversation

rescrv
Copy link
Contributor

@rescrv rescrv commented Oct 1, 2024

Description of changes

The lfu/lru/weighted_lfu caches didn't initialize a hybrid cache. This PR rewrites the chroma_cache crate to hide implementation details of the cache as much as possible. Noteworthy callouts from this PR:

  • We don't currently support cleaning up of tempfiles evicted from cache. This is status-quo.
  • cops-disk-cache-config-writer and cops-memory-cache-config-writer provide command-line arguments to generate the YAML for the Foyer cache.
  • Cache<K, V> is a trait. PersistentCache<K, V> allows for requiring persistence. All PersistentCache<K, V> can be cast to Cache<K, V>.

Test plan

How are these changes tested?

  • [] Tests pass locally with pytest for python, yarn test for js, cargo test for rust

Documentation Changes

Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs repository?

Copy link

github-actions bot commented Oct 1, 2024

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of a unexpectedly high quality (Readability, Modularity, Intuitiveness)

The lfu/lru/weighted_lfu caches didn't initialize a hybrid cache.  This
PR rewrites the chroma_cache crate to hide implementation details of the
cache as much as possible.  Noteworthy callouts from this PR:

- We don't currently support cleaning up of tempfiles evicted from
  cache.  This is status-quo.
- cops-disk-cache-config-writer and cops-memory-cache-config-writer
  provide command-line arguments to generate the YAML for the Foyer
  cache.
- Cache<K, V> is a trait.  PersistentCache<K, V> allows for requiring
  persistence.  All PersistentCache<K, V> can be cast to Cache<K, V>.
@rescrv rescrv requested a review from HammadB October 2, 2024 16:29
fn try_from(b: &Block) -> Result<Self, Self::Error> {
Ok(CachedBlock {
id: b.id,
data: b.to_bytes()?,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think from_bytes/to_bytes will result in copies of the data, do we care?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do. I'm debating on a two-level cache for persistent and non-persistent cache to avoid this problem, but I don't know how we'll do that.

@@ -34,8 +34,8 @@ impl ArrowBlockfileProvider {
pub fn new(
storage: Storage,
max_block_size_bytes: usize,
block_cache: Cache<Uuid, Block>,
sparse_index_cache: Cache<Uuid, SparseIndex>,
block_cache: Box<dyn Cache<Uuid, CachedBlock>>,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean that all cache accesses will use to_bytes/from_bytes? That seems not super great since its in the hot path?

}

pub(super) async fn get(&self, id: &Uuid) -> Result<Option<Block>, GetError> {
let block = self.block_cache.get(id);
let block = self.block_cache.get(id).await.ok().flatten();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This means every access to the cache results in a copying of the data in the block, which means basically all queries copy potentially 100's of MB of data around.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, I agree that it seems like every cache access converts bytes to blocks which copies data (a block is 8MB so 8MB per access) and vice versa (unless I am missing something which is also likely)

@@ -196,8 +197,9 @@ impl BlockManager {
pub(super) fn new(
storage: Storage,
max_block_size_bytes: usize,
block_cache: Cache<Uuid, Block>,
block_cache: Box<dyn Cache<Uuid, CachedBlock>>,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q: Why take a Box and turn it into an Arc instead of just taking an Arc?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The interface for creating creates a box. Consistency?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad for not noticing


use chroma_cache::{CacheConfig, FoyerCacheConfig};

fn main() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: comment that this takes the default config and generates the yaml would be helpful


use chroma_cache::{CacheConfig, FoyerCacheConfig};

fn main() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit / thought: might be nicer to have one binary that takes a flag for which config it generates

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I chose to make it this way as it's clear from the tool which output you'll get. This allows us to have one tool per enum variant with exactly the options said variant needs. I'm not wedded to this idea, though.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good with me

use super::{CacheError, Weighted};

/// A zero-configuration cache that doesn't evict.
pub struct NopCache;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is unused? If so, should we remove it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's wired up. I think being able to stub out the cache is useful, so I'd prefer to leave it. You have to set cache config to nop to use it, so the risk of inadvertently using it is low.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good with me

Copy link
Collaborator

@HammadB HammadB left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this looks good to me, my primary concern is to do with to_bytes/from_bytes on the CachedBlock impl. @sanketkedia wdyt?

pub(super) fn cached(&self, id: &Uuid) -> bool {
self.block_cache.get(id).is_some()
pub(super) async fn cached(&self, id: &Uuid) -> bool {
self.block_cache.get(id).await.ok().is_some()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will this panic in case of errors? Should we instead catch the error and surface it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.ok().is_some() should be written .is_ok(). It will not panic.

/// Deserialize.
#[derive(Clone, Debug)]
#[repr(transparent)]
pub struct RecordBatchWrapper(pub RecordBatch);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

love this

Copy link
Collaborator

@HammadB HammadB left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments on nits / potentially unused code that would be good to look at. Seems great otherwise! Thank you for chasing the solution

@rescrv rescrv merged commit 8e9dbc9 into main Oct 4, 2024
70 checks passed
@rescrv rescrv deleted the rescrv/foyer-0.10 branch October 4, 2024 23:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants