sqlite3: fix parser #661

Draft
wants to merge 28 commits into
base: master

@milahu milahu commented Apr 2, 2023

continue #640

lazy pages based on #640 (comment)

rename fields from *_index to idx_*, loosely following the style guide, via #640 (comment)
these are not physical offsets, so I'm not using ofs_*

migration script:

#! /bin/sh
# \b = word boundary
sed -i -E '
  s/\block_byte_page_index\b/idx_lock_byte_page/g;
  s/\bfirst_ptrmap_page_index\b/idx_first_ptrmap_page/g;
  s/\blast_ptrmap_page_index\b/idx_last_ptrmap_page/g;
  s/\bptrmap_max_num_entries\b/num_ptrmap_entries_max/g;
  s/\bvariable_size\b/len_blob_string/g;
' "$@"

will squash commits later

pepijnve and others added 8 commits April 2, 2023 14:34
- Make all database page types parseable
- Add cell content overflow handling
- Add UTF-16 text encoding support
- Make free page list and overflow page lists accessible
Comment on lines 38 to 64
instances:
  len_page:
    value: 'len_page_mod == 1 ? 0x10000 : len_page_mod'
  pages:
    type: page(_index + 1, header.page_size * _index)
    repeat: expr
    repeat-expr: header.num_pages
types:
  page:
    params:
      - id: page_number
        type: s4
      - id: ofs_body
        type: s4
    instances:
      page_index:
        value: 'page_number - 1'
      body:
        pos: ofs_body
        size: _root.header.page_size
        type:
          switch-on: '(page_index == _root.header.idx_lock_byte_page ? 0 : page_index >= _root.header.idx_first_ptrmap_page and page_index <= _root.header.idx_last_ptrmap_page ? 1 : 2)'
          cases:
            0: lock_byte_page(page_number)
            1: ptrmap_page(page_number)
            # TODO: Free pages and cell overflow pages are incorrectly interpreted as btree pages
            # This is unfortunate, but unavoidable since there's no way to recognize these types at
            # this point in the parser.
            2: btree_page(page_number)
Author

@milahu milahu Apr 2, 2023

@generalmimon no luck with lazy db.pages

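this still loops over all pages when I read db.pages[0]

    @property
    def pages(self):
        # generated getter: the result is cached after the first access
        if hasattr(self, "_m_pages"):
            return self._m_pages

        # the first access creates a Page object for every page in the file
        self._m_pages = []
        for i in range(self.header.num_pages):
            self._m_pages.append(
                Sqlite3.Page(
                    (i + 1), (self.header.page_size * i), self._io, self, self._root
                )
            )

        return self._m_pages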

see also kaitai-io/kaitai_struct#133

Member

@milahu:

this still loops over all pages when I read db.pages[0]

Yes, it creates the objects, but it does not parse them.

If you don't even want to create the empty objects (because the total memory usage of too many empty objects would be too high), you can provide an "unused" page type and let the users of the parser instantiate it themselves for the page number they want (and even dispose of the object afterwards to keep memory usage low). This approach is essentially explained in https://stackoverflow.com/a/73332294/12940655.

KS unfortunately doesn't support truly unused types very well at the moment (i.e. when you define a type under types but don't use it anywhere at all), but this can be easily worked around with the if: false trick, as explained in the linked SO post.
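
For illustration only, here is a minimal Python sketch of the "let the user instantiate pages on demand" idea. LazyPages and its make_page callback are hypothetical names, not part of the generated parser; the only assumption taken from this thread is that a page is built from a 1-based page number and a byte offset of page_size * index, as in the generated getter quoted above.

```python
from collections.abc import Sequence


class LazyPages(Sequence):
    """Read-only sequence that builds a page object only when it is indexed."""

    def __init__(self, num_pages, page_size, make_page):
        self._num_pages = num_pages
        self._page_size = page_size
        # make_page(page_number, ofs_body) -> page object (hypothetical callback)
        self._make_page = make_page

    def __len__(self):
        return self._num_pages

    def __getitem__(self, index):
        if isinstance(index, slice):
            return [self[i] for i in range(*index.indices(self._num_pages))]
        if index < 0:
            index += self._num_pages
        if not 0 <= index < self._num_pages:
            raise IndexError("page index out of range")
        # Page numbers are 1-based; the body starts at page_size * index.
        return self._make_page(index + 1, self._page_size * index)
```

With the generated Python module, the callback could presumably be something like lambda n, ofs: Sqlite3.Page(n, ofs, db._io, db, db._root), mirroring the constructor call in the generated getter. Nothing is cached here, so a page object can be dropped as soon as the caller is done with it.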


edzillion commented Dec 1, 2023

I tried to get Kaitai Struct working in Lua and ended up here. It seems PR #640 started work on fixing some of the problems but was never merged. In any case, I was able to parse my example DB to some extent. It should be mentioned more clearly that, since the pages are lazy-loaded, you will never see any actual pages, just the database_header. It took the kind people over in the Kaitai Struct Gitter lobby to clue me in on that one. The Lua documentation is very deficient. I note that this is mentioned elsewhere, but there is so much information to consume that a basic language-specific tutorial on how to get the table data from your DB would be a much more direct way to get to grips with Kaitai.

In any case, I am having issues with overflow pages. My example DB has a page size of 4096, and page 2 is of type table-leaf. There is only one cell in the page, and the payload of this cell says that the payload size is 7756 and that the overflow page number is 6. When my sqlite3.lua file processes this page, it does not add a pointer from page 2 to page 6 as I expected, and if I look at page 6, it has 99 cells and all they have is one property each: ofs_content:27769

How do I access this offset content? I stepped through the parser, and my crude understanding is that there should be a Sqlite3.BtreePagePointer() created to link this page with its overflow page on page 6.

When I look at page 6, it has 99 cells and each contains one property, ofs_content:27769, which I guess is a memory offset?

What's the state of this PR, can I do anything to help it along?

edit: in fact, I would be happy to contribute a Lua tutorial, if I can get it working.

Author

milahu commented Dec 1, 2023

What's the state of this PR

abandoned

can I do anything to help it along?

add example code (python, lua, ...)

I needed this sqlite parser for my pysqlite3, to parse a partially downloaded sqlite database
the full database is 130 GB, so for my case, the pages must be lazy

kaitai does not support such lazy parsers
so using this parser requires some manual parsing

kaitai is just a code generator
and sometimes the generated code needs patching

I am having issues with overflow pages.

overflow means the data is stored across multiple pages

see connection._table_values

the payload is an overflow_record
it must be parsed manually because it spans multiple pages
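
As a rough sketch of what that manual parsing can involve (this is not the code from connection._table_values, and the function names are made up for illustration): the SQLite file format stores the first part of a large table-leaf payload in the cell and the rest in a chain of overflow pages, where each overflow page starts with a 4-byte big-endian number of the next page (0 for the last one). The sketch assumes the whole file is available as bytes and that there are no reserved bytes per page, so the usable size equals the page size.

```python
import struct


def local_payload_size(payload_size, usable_size):
    """Bytes of a table-leaf cell payload stored on the b-tree page itself."""
    max_local = usable_size - 35
    if payload_size <= max_local:
        return payload_size
    min_local = (usable_size - 12) * 32 // 255 - 23
    k = min_local + (payload_size - min_local) % (usable_size - 4)
    return k if k <= max_local else min_local


def read_overflow(db, page_size, first_page, remaining):
    """Follow the overflow chain from first_page and collect `remaining` bytes."""
    chunks = []
    page_number = first_page
    while remaining > 0 and page_number != 0:
        ofs = (page_number - 1) * page_size  # page numbers are 1-based
        (next_page,) = struct.unpack_from(">I", db, ofs)
        take = min(remaining, page_size - 4)
        chunks.append(db[ofs + 4 : ofs + 4 + take])
        remaining -= take
        page_number = next_page
    return b"".join(chunks)
```

For the case asked about above (page size 4096, payload size 7756), local_payload_size gives 3664 bytes in the cell, which leaves 4092 bytes, exactly one overflow page. Also note the TODO in the .ksy under review: overflow pages are currently interpreted as btree pages, so the 99 "cells" seen on page 6 are probably payload bytes misread as a cell pointer array.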

cell.ofs_content is also used in connection._row_locations
to get the raw byte offsets of values
I used this to build an external index that maps row ids to byte offsets
the database is shared over bittorrent
and with the index I can map row ids to torrent pieces for partial download
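
The last step, going from a byte range to torrent pieces, is simple arithmetic. A hedged sketch, assuming a single-file torrent (so file offsets equal torrent offsets) and an index that already stores absolute file offsets; pieces_for_range is a hypothetical helper, not part of pysqlite3:

```python
def pieces_for_range(offset, length, piece_length):
    """Return the torrent piece indices covering file bytes [offset, offset + length)."""
    if length <= 0:
        return range(0)
    first = offset // piece_length
    last = (offset + length - 1) // piece_length
    return range(first, last + 1)
```

A value that straddles a piece boundary yields two indices, so both pieces have to be downloaded before the row can be read.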

@edzillion

Yeah, I am trying to parse 'app size' sqlite dbs on limited hardware so lazy parsing isn't an issue. That's an interesting application you have, thanks for the pointers, I will have a look.
