Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NaN is not handled correctly #213

Open
ValentinKaisermayer opened this issue Jun 24, 2020 · 8 comments
Open

NaN is not handled correctly #213

ValentinKaisermayer opened this issue Jun 24, 2020 · 8 comments

Comments

@ValentinKaisermayer
Copy link

import SQLite, DataFrames

vals = rand(10)
vals[2:3] .= NaN
df = DataFrames.DataFrame(:A => vals)

db = SQLite.DB("test.db")
SQLite.DBInterface.execute(db, "DROP TABLE IF EXISTS data;")
SQLite.load!(df, db, "data")

db = SQLite.DB("test.db")
df = SQLite.DBInterface.execute(db, "SELECT * FROM data") |> DataFrames.DataFrame

However, the two NaN are missing from the table.

8×1 DataFrame
│ Row │ A        │
│     │ Float64  │
├─────┼──────────┤
│ 10.128066 │
│ 20.936917 │
│ 30.500213 │
│ 40.939719 │
│ 50.176642 │
│ 60.41554  │
│ 70.630911 │
│ 80.887876

I'm not sure if this is how it should be.

Missing is handled correctly:

vals = convert(Vector{Union{Float64,Missing}}, rand(10))
vals[2:3] .= missing
df = DataFrames.DataFrame(:A => vals)

db = SQLite.DB("test2.db")
SQLite.DBInterface.execute(db, "DROP TABLE IF EXISTS data;")
SQLite.load!(df, db, "data")

db = SQLite.DB("test2.db")
df = SQLite.DBInterface.execute(db, "SELECT * FROM data") |> DataFrames.DataFrame
10×1 DataFrame
│ Row │ A        │
│     │ Float64? │
├─────┼──────────┤
│ 10.438104 │
│ 2missing  │
│ 3missing  │
│ 40.101527 │
│ 50.83622  │
│ 60.360448 │
│ 70.310123 │
│ 80.458411 │
│ 90.569096 │
│ 100.141622
@quinnj
Copy link
Member

quinnj commented Jun 29, 2020

Hmmm, from some brief research, it sounds like sqlite doesn't really follow IEEE semantics at all, and hence doesn't really have a concept of NaN. A bigger issue here is that we aren't checking the return code on the sqlite_step when loading each row into the database, which could have at least issued a warning in this case.

For a Vector{Float64} column like this, we're currently "typing" it in sqlite as REAL NOT NULL, and sqlite returns a "constraint violation" when trying to insert NaN. If I take off the NOT NULL constraint, it successfully inserts as NULL, so the values are returned as missing.

Given all that, I don't see a clear solution for the original issue here; but I do plan on doing the return code check and warning when a row fails to insert. I'd love any ideas/research other people have here.

@ValentinKaisermayer
Copy link
Author

The problem is that the Tables.Schema interprets a vector with NaN and Float64 as a Vector{Float64}, which it actually is. However, this than results in a column of REAL NOT NULL. If I create the schema myself as Union{Float64,Missing} I can get it to store the NaNs as NULL.

@stephancb
Copy link

A possible solution is to insert/update IEEE-754 special values, +/- Infty and NaNs, as 8-byte blobs. In Sqlite every value can have a different SQL type than the one declared for the table column. So a column declared as REAL can be Blobs in some rows. When selecting back from REAL columns, then of course the type needs to be checked for each row, sqlite3_column_type(). If it is 8 octets long blob, then convert to Float64 (and if it is NULL convert to missing).

@felipenoris
Copy link
Contributor

felipenoris commented Oct 27, 2020

If SQLite doesn't support it, most likely the solution will be tied to this Julia driver. This is like deciding how to write a Char to a JSON, that only supports Strings.

In this situation, I usually opt to not support the data type at all, and throw an error if the user tries to insert the unsupported data type to the database. In this case, a NaN value. The user will then be forced to solve the issue in the application level, deciding on how to encode the data.

@stephancb
Copy link

stephancb commented Oct 27, 2020

Well, it is bit different than unsupported data type, unless you are suggesting to not support Float64 at all. In order to throw an error, every single Float64 value getting inserted would need to be checked for Infty or NaN values (which is of course also needed for my suggested solution).

On insert Sqlite just replaces NaNs and Infty silently with NULLs, there is no error code from sqlite3_step or so. The unpleasant consequence is that for Float64 you don't always get out what has been put into the database. The suggested solutions fixes this, albeit only within the particular driver.

Addendum: If the column is declared as REAL NOT NULL, an error is of course indicated. But the problem still exists if Union{Float64, Missing} is to be inserted.

@felipenoris
Copy link
Contributor

felipenoris commented Oct 29, 2020

Yes, every value should be checked for NaN or Inf. Or, equivalently, one could define a SQLiteNumber datatype that exactly matches the driver's definition and create encode/decode values to/from Float. Like the Oracle database does with OCINumber.

If the datatype does not match exactly, there's already some conversion happening before sending the value to the database. Usually the driver's C code is full of these per-value checks, and thank God this is Julia because we can do the same before hitting the C code and get the same performance. I guess this check won't hurt performance a lot, given this is most likely an IO bound operation.

@quinnj
Copy link
Member

quinnj commented Oct 30, 2020

Yeah, I agree that the best solution here is like what @felipenoris suggests: SQLite.jl could define a SQLFloat{T} type that properly respected IEEE-754 semantics. If someone feels like making a PR + docs, I'd appreciate it.

For reference, I found this old sqlite user mailing list thread where the core devs basically said they didn't really care about supporting IEEE 😞 http://sqlite.1065341.n5.nabble.com/Handling-of-IEEE-754-nan-and-inf-in-SQLite-td27874.html

@stephancb
Copy link

stephancb commented Oct 30, 2020

Pse see also my question at http://sqlite.1065341.n5.nabble.com/NaN-in-0-0-out-td19086.html and Dr. Hipp's reply.

The issue cannot be completely solved without changes in Sqlite which will not happen. Regarding a driver like Sqlite.jl, the options are:

  1. Accept how Sqlite is doing things. The columns corresponding to Julia Float64 should simply be declared REAL (not REAL NOT NULL). If a Float64 array has NaNs and/or Inftys, and it is inserted and selected back, it will have Sqlite NULLs in place of these. This simply maps to Union{Float64,Missing}. Alternatively Hipp's suggested sqlite3_column_double_v2 could be implemented in Julia, e.g. by constructing a type (SQLfloat?) that has value NaN for anything that is not REAL or INTEGER in Sqlite.

  2. Disallow insertion of NaN and Inftys into SQlite, raising an exception if attempted. It could be handled via a type SQLfloat.

  3. Workaround via Sqlite BLOBs. It can ensure that what goes into Sqlite also comes out even for NaNs and Inftys, but only within conforming drivers. A Julia type SQLfloat can be a vehicle to implement this for Julia.

So which option, what exactly should a type SQLfloat do?

Edit: I just saw that Inftys are handled normally in Sqlite, only NaNs get converted to NULL and would need special treatment in SQLite.jl

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants