Lazy parsing of sequence attributes #133

Open · GreyCat opened this issue Mar 17, 2017 · 3 comments
@GreyCat (Member) commented Mar 17, 2017

Sometimes we have fairly large / complex subtypes in a seq with a known size, which are not really used on every file read, so it's beneficial to read them lazily, i.e. on demand. A simple case:

seq:
  - id: foo
    type: u4
  - id: bar
    size: 100500
    type: something_else
    lazy: true # <= new proposed key
  - id: baz
    type: u4

See #65 (comment) for proposed generation results.
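For illustration, here is a minimal sketch of what the compiler might emit for bar in Python, assuming the usual runtime conventions (pos(), seek(), read_bytes(), and the _m_* caching pattern). This is only my reading, not the actual #65 proposal, and SomethingElse stands in for the hypothetical subtype class:

from io import BytesIO
from kaitaistruct import KaitaiStruct, KaitaiStream

class Example(KaitaiStruct):  # hypothetical class for the ksy above
    def __init__(self, _io, _parent=None, _root=None):
        self._io = _io
        self._parent = _parent
        self._root = _root if _root else self
        self._read()

    def _read(self):
        self.foo = self._io.read_u4le()
        # record where bar starts, then skip over its known size
        self._bar_ofs = self._io.pos()
        self._io.seek(self._bar_ofs + 100500)
        self.baz = self._io.read_u4le()

    @property
    def bar(self):
        # parse bar on first access, then cache the result
        if hasattr(self, "_m_bar"):
            return self._m_bar
        _pos = self._io.pos()
        self._io.seek(self._bar_ofs)
        raw = self._io.read_bytes(100500)
        self._m_bar = SomethingElse(KaitaiStream(BytesIO(raw)), self, self._root)
        self._io.seek(_pos)
        return self._m_bar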

Another, more complex option is to have lazy seq arrays with fixed-size elements and repeat-expr or repeat-eos, so that the number of elements is known a priori. That's probably worth discussing after we've handled this simple case.

@KOLANICH commented Oct 6, 2017

For Python there is a nice module, https://github.com/ionelmc/python-lazy-object-proxy, which may be useful, but KSC would have to be altered in order to use it.

But there are obvious drawbacks to lazy parsing: for now we can parse and close the file, whereas lazy parsing means that we cannot close the file until all the needed parts of a struct have been accessed. So there should be some parameter controlling it. I guess we may need a hook function in the runtime to which we pass a lambda that accesses the property and returns the KS object. The non-lazy hook would execute the lambda immediately; the lazy one would call lazy_object_proxy.Proxy.
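A minimal sketch of such a hook pair (hypothetical names; lazy_object_proxy.Proxy does accept a zero-argument factory like this):

import lazy_object_proxy

def resolve_eager(factory):
    # non-lazy mode: run the parsing lambda immediately
    return factory()

def resolve_lazy(factory):
    # lazy mode: defer parsing until the proxy is first dereferenced
    return lazy_object_proxy.Proxy(factory)

# Generated code would then call whichever hook is configured, e.g.:
#   self._m_bar = resolve(lambda: SomethingElse(self._io, self, self._root))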

@milahu commented Apr 2, 2023

for now we can parse and close the file

What if my file is 100 GB, but I have only 1 GB of RAM?
(Example: an sqlite3 database.)

Currently, kaitai-struct-compiler generates:

    @property
    def pages(self):
        if hasattr(self, "_m_pages"):
            return self._m_pages

        self._m_pages = []
        for i in range(self.header.num_pages):
            self._m_pages.append(
                Sqlite3.Page(
                    (i + 1), (self.header.page_size * i), self._io, self, self._root
                )
            )
        return self._m_pages

when it should generate

class Sqlite3(KaitaiStruct):

    class PagesList:
        def __init__(self, root):
            self.root = root

        def __len__(self):
            return self.root.header.num_pages

        def __getitem__(self, i):  # i is 0-based
            if i < 0:  # -1 means last page, etc
                i = self.root.header.num_pages + i

            assert 0 <= i < self.root.header.num_pages, \
                f"page index is out of range: {i} is not in [0, {self.root.header.num_pages - 1}]"

            # TODO LRU cache with sparse array?
            # note: LRU cache does not give pointer equality
            # but equality check is trivial: page_a.page_number == page_b.page_number

            _pos = self.root._io.pos()
            self.root._io.seek(i * self.root.header.page_size)
            page = Sqlite3.Page(
                (i + 1), (self.root.header.page_size * i), self.root._io, self.root, self.root._root
            )
            self.root._io.seek(_pos)
            return page

    def __init__(self, _io, _parent=None, _root=None):
        self._io = _io
        self._parent = _parent
        self._root = _root if _root else self
        # add this line:
        self.pages = Sqlite3.PagesList(self)
        self._read()
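A quick usage sketch, assuming the rest of the generated parser (header, _read) stays as-is and using the runtime's from_file helper (the file name is made up):

db = Sqlite3.from_file("big.db")
print(len(db.pages))       # cheap: derived from the already-parsed header
last_page = db.pages[-1]   # seeks to and parses just that one page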

For now I'm patching the generated sqlite3.py parser.

See also: Can Kaitai Struct be used to describe TLV data without creating new types for each field? (via kaitai-io/kaitai_struct_formats#661 (comment))

keywords: random access, array, list, nested parsing, deferred parsing

@KOLANICH commented Apr 2, 2023

What if my file is 100 GB, but I have only 1 GB of RAM?

#65 was created with exactly that use case in mind: not to store all the data in memory, but to map the file, store offsets, and let the OS read the data only when needed, re-parsing it again on access (and, in the case of fixed structs, at least for C++ and Rust, without actually parsing at all when no serialization is needed). This is not very suitable for systems without an MMU, such as Arduino boards. For them it may be possible to generate instances, getting the ranges.

But again, it is not very suitable to read the whole array into memory, so we'd need to generate an index first and then reuse it. And we'd need the runtime to forget the stuff it doesn't need. All of these are not straightforward decisions, and I doubt they can be made without consulting a general intelligence. The knowledge of the right decisions can be incorporated into ksy files in the form of hints (#225). Probably there should be a mode for the compiler that adds hint stubs (all suitable hints for the case) into the ksy file in the places they are needed; a programmer can then use a diff tool to review the hints and choose the right ones.
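For the memory-mapping part, a minimal Python sketch (mmap objects are file-like enough, exposing read/seek/tell, for KaitaiStream to wrap directly; the file name is made up):

import mmap
from kaitaistruct import KaitaiStream

# Map the file read-only: the OS pages data in only when it is actually read.
with open("big.db", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    db = Sqlite3(KaitaiStream(mm))
    page = db.pages[12345]  # touches only the OS pages backing this read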
