Migrate from BGZFStreams to CodecBGZF #30

Open
jakobnissen opened this issue Aug 23, 2020 · 2 comments
Comments

@jakobnissen
Member

Dear @CiaranOMara

I haven't lost track of the existing issues, and will get around to resolving them!

However, I got sidetracked with a more fundamental issue: the BGZFStreams package is slow and not well integrated with the rest of the Julia IO ecosystem. Kenta Sato, the author of BGZFStreams.jl, realized this and created CodecBGZF.jl three years ago, but it looks like that project has been abandoned, and Kenta is a busy guy who is hard to reach these days. See also this comment from Kenta Sato.

In fact, BGZFStreams is the major bottleneck of XAM.jl, to the extent that optimizing XAM does not make any practical difference because it is so bogged down by BGZFStreams. For that reason, I created CodecBGZF myself. It's faster, safer, and more generic than BGZFStreams. I'd expect XAM's BAM module to be around 2x faster with CodecBGZF, perhaps more.

It'll be a bit of a project, since XAM depends on Indexes, which also depends on BGZFStreams.

I'll make this issue here, and then hopefully get around to it in the following few months. If you support this change and feel like giving it a crack, please be my guest! It'll also be useful to have other eyes on CodecBGZF to make sure I didn't bork the interface.

@CiaranOMara
Member

CiaranOMara commented Aug 25, 2020

I had noticed what you're up to, and I think it's exciting.

I like the idea of using TranscodingStreams. I have started to update the lifted packages in the BioJulia ecosystem to use TranscodingStreams. It would be great to have API consistency and be able to switch and chain codecs easily.

Indexes.jl should be refactored such that its dependency on the codec is inverted. A package that uses Indexes should be able to declare which codec or codec chain to use. In hindsight, the Indexes package should have been called Tabix.

I'd suggest setting up a pu/* (proposed updates) branch in Indexes and tracking the branch in a development repository (I do this locally with GenomicFeatures v3). With Julia 1.5, mixing registries now works well. We could start a BioJuliaDevRegistry that tracks the pu branches.

On a side note, @bicycle1885 did get BGZFStreams working in parallel at BioJulia/BioAlignments.jl#4.

@Marlin-Na

Thanks for putting this together! I see the feature/CodecBGZF branch and am excited about it as well. I wonder if there is a planned timeline for it to be merged to master and released?
