SIMD code generator #60
Conversation
Note for future: On a FASTA file, the SIMD generator executes 8.7x fewer instructions and 10x fewer branches, while only being 2.5x faster. Part of the reason may be increased cache misses (2.5k -> 4k / 192k), but the major factor is probably the degree of instruction-level parallelism. Looking at the assembly, the SIMD loop has serial memory dependencies, preventing superscalar execution. On my computer, manually unrolling the SIMD loop does not make it use different registers, so the serial dependency remains, and the unrolled version is in fact slower.
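The instruction-level-parallelism point can be illustrated with a generic, non-SIMD sketch (this is not the generated code, only the general pattern): a single accumulator forms one long dependency chain, while two independent accumulators give the CPU work it can overlap. In the SIMD loop discussed here the dependency is a serial memory dependency, which is why the same unrolling trick did not help there.

```julia
# Generic illustration of a loop-carried dependency chain vs. two independent
# chains. Not the generated SIMD code, only the general ILP pattern.
function sum_serial(xs::Vector{Int})
    acc = 0
    @inbounds for x in xs
        acc += x                # every add depends on the previous add
    end
    return acc
end

function sum_two_chains(xs::Vector{Int})
    acc1 = acc2 = 0
    i = 1
    @inbounds while i + 1 <= length(xs)
        acc1 += xs[i]           # chain 1
        acc2 += xs[i + 1]       # chain 2, independent of chain 1
        i += 2
    end
    if i == length(xs)          # odd-length tail
        @inbounds acc1 += xs[i]
    end
    return acc1 + acc2
end
```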
Might be worth checking how much this increases load time for some consumer package. Automa is pretty light in terms of dependencies, and it sounds like it would be worth it, but it might be nice to have that information at hand.
That's a good idea. The new SIMD.jl dependency has historically been devastating for load times, although this should have improved quite significantly with 1.6, and I don't plan on merging this before then.
Turns out compile times are affected, but not by much. On version 1.5, compiling and running a FASTA reader takes about 2.3 seconds without this PR, and 2.55 seconds with this PR. On 1.6, it's 1.85 seconds without and 1.95 with.
I think this is done now and can get merged. Code coverage is a little misleading, it's actually almost 100%. I have no coverage on the branches that activate on big-endian CPUs and on CPUs without SIMD capabilities, because I don't have those kinds of CPUs. It's just simple if-statements though, so not much can go wrong.
This is not the kind of code I'm fit to review technically. One thing I'll note is that I don't see much associated documentation. There are some useful (I gather) comments throughout, and I recognize that this may not affect the user-facing API, but it seems complicated enough (and helpful enough) that it might be worth spending a bit more time fleshing out the human-readable bits.
Unless you think it would all be self-evident to someone familiar with SIMD operations generally. Basically, what would happen to that code if you were hit by the metaphorical bus: would it be interpretable?
Okay, now I've added a whole bunch of comments and also mitigated some of the 1.6 regressions. You're right, it's not very clear what it does. Hopefully the comments help. Although I do think that this whole SIMD code could probably be split out into its own package. I can see how having SIMD byte matching could be generally useful. It is essentially like https://docs.rs/memchr/2.3.4/memchr/, except this generalizes to any collection of bytes. But that's a PR for another time :)
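For context, the primitive being described, a memchr generalized to any set of bytes, looks roughly like this in scalar form (illustrative names only, not ScanByte's actual API); a SIMD version performs the same membership test on 16 or 32 bytes at a time:

```julia
# Scalar reference version of "find the first byte belonging to a byte set".
function first_byte_in_set(data::AbstractVector{UInt8}, byteset::Set{UInt8})
    @inbounds for i in eachindex(data)
        data[i] in byteset && return i
    end
    return nothing
end

first_byte_in_set(Vector{UInt8}("abc\ndef"), Set([UInt8('\n'), UInt8('\r')]))  # -> 4
```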
Okay, I've been running this branch in my daily work for a few months now, refactored it out to use ScanByte, and added non-SIMD fallbacks. So merging now.
Introduction
This PR introduces the :simd code generator. At this time, it is very similar to the :goto code generator. In fact, the only current change from the :goto generator is that it uses SIMD loop unrolling instead of the manual Automa-generated loop unrolling.
In future PRs, I would like to change the large-scale structure of the code generated by the :simd generator (see issue #53), if it turns out to be possible and performant. But for now, this change only pertains to SIMD loop unrolling.
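Presumably the new generator is selected the same way as the existing ones, via the generator keyword of CodeGenContext; the snippet below is an assumed usage sketch, not code from this PR:

```julia
import Automa

# Assumed usage: opt into the SIMD-unrolled code generator instead of :goto.
context = Automa.CodeGenContext(generator=:simd)
```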
Benefits
Speed. That's it, really. We can use an "empty" format validator (see #55) on a byte array to most accurately time Automa's parsing, since neither the Machine's attached actions nor file I/O will influence the result. The times are given as median runtimes determined using BenchmarkTools.jl:
These are, of course, optimal conditions. Even on the extremely simple biofast FASTQ benchmark, processing a file is approximately 2.5x slower than processing a byte vector in RAM, so the parsing itself accounts for only about 40% of the total time. A doubling of the speed of the actual Automa parsing is therefore realistically only about a 20% overall performance increase. Of course, this does mean that we are moving the performance bottleneck somewhere else, so that e.g. TranscodingStreams now has more room for optimization.
Or maybe it turns out it's more than that. We'll have to see for certain when all the tests are in place and the behaviour is verified to be correct.
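For reference, such a byte-vector measurement could be taken roughly as follows with BenchmarkTools.jl, where validate_fasta stands in for a hypothetical generated empty validator and the file name is likewise hypothetical:

```julia
using BenchmarkTools
using Statistics: median

data = read("some_file.fasta")            # hypothetical input, loaded fully into RAM
trial = @benchmark validate_fasta($data)  # validate_fasta: hypothetical generated validator
median(trial)                             # median runtime, as quoted above
```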
Drawbacks of using the :simd generator over the :goto generator
Unfortunately, SIMD code generation comes with a few limitations:
- getbyte can no longer be a custom function in a :simd generator; it has to be Base.getindex. I don't think anyone ever used that functionality, though.
- Unlike the :goto generator, the generated code is not pure Base code. That means the "consumer" of the code must have imported or be using Automa, else they will get an UndefVarError.
- If the :simd generator is used to generate Julia code on a modern CPU (an x86 CPU less than 10 years old), but the Julia code is executed on another CPU (one more than 10 years old, an AMD CPU more than 5 years old, or a RISC CPU), it will likely error catastrophically. This may happen if cached Julia code is used on another computer.
Drawbacks of merging this code
The simd.jl file checks the available instructions, but there is, as far as I know, no truly stable and officially supported way of doing this, so it may error sometime in the future. Since this happens at import time, even users not using the :simd generator will be affected if this fails on some systems or in the future.
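As an illustration of the kind of check in question (a simplified, assumed example, not the actual logic in simd.jl), one ad-hoc approach is to branch on Sys.CPU_NAME at load time, which is exactly the sort of unofficial interface that may break:

```julia
# Hypothetical, simplified feature check at import time; not the real simd.jl code.
# Sys.CPU_NAME is not a documented, stable way to detect SIMD capabilities,
# which is the fragility being pointed out above.
const VEC_WIDTH = if occursin(r"skylake|icelake|cascadelake|znver", Sys.CPU_NAME)
    32      # assume 256-bit (AVX2) vectors are available
else
    16      # fall back to 128-bit vectors
end
```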
To-do list
- Still fails FASTQ parsing, so there are probably a few bugs to be ironed out
- Needs way more tests, ideally 100% coverage
- The code shared between the :simd and :goto generators needs to be factored out (or does it? It depends on how much progress I'll make on reorganizing the :simd codegen down the road)
Closes issue #54