Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance Bottleneck in nodejs-polars When Creating Multiple Expr Objects #265

Open
chenstarx opened this issue Sep 6, 2024 · 6 comments

Comments

@chenstarx
Copy link

Have you tried latest version of polars?

  • [yes]

What version of polars are you using?

0.0.15

What operating system are you using polars on?

MacOS 14.4.1 (M3 Pro)

What node version are you using

v20.12.1

Describe your bug.

I have encountered a significant performance issue when using the nodejs-polars library. Specifically, the time required to create multiple Expr objects is considerably higher compared to the Python version of polars.

What are the steps to reproduce the behavior?

To illustrate the issue, I conducted a performance test by generating one million Expr objects in both nodejs-polars and Python polars. The following code snippets demonstrate the test setup:

Python Code

import polars as pl
import time

start_time = time.time()

for _ in range(1000000):
    _ = pl.col("A") > pl.lit(1)

end_time = time.time()
print(f"Time taken in Python: {end_time - start_time} seconds")

Node.js Code

const pl = require('nodejs-polars');
console.time('Expr Creation');

for (let i = 0; i < 1000000; i++) {
  const expr = pl.col("A").gt(pl.lit(1));
}

console.timeEnd('Expr Creation');

What is the actual behavior?

Python polars: Approximately 7 seconds to create 1,000,000 Expr objects.
Node.js polars: Approximately 1,000 seconds to create the same number of Expr objects.

  • Each iteration of the for loop took approximately 1ms
Impact

This performance discrepancy presents a significant bottleneck when performing operations that require frequent creation of Expr objects in nodejs-polars. It substantially limits the library's usability for large-scale data processing tasks in a Node.js environment.

What is the expected behavior?

The performance of creating Expr objects in nodejs-polars should be closer to, or ideally match, the performance in the Python version of polars.

Possible Reason

The issue might be caused by _Expr that will create an new Expr object when executed. Each execution will take about 0.5ms in my laptop, consuming considerable time if executed million times. Moreover, In my test dataset, the actual computing time for millions rows of data is very short, most of the time was wasted in creating the Expr objects.

There might be two ways to solve this problem:

  1. Rewriting _Expr with class, and return this in each expression method, avoiding time-consuming operations of creating a complex new object in javascript.
  2. Turning the Expr into mutable object, updating its attributes rather than creating an new object in each expression.

Thanks for reading the issue, I hope my suggestion would be helpful for the nodejs-polars library.

@chenstarx chenstarx added the bug Something isn't working label Sep 6, 2024
@maizhichao
Copy link

I encountered the same issue.

@Bidek56
Copy link
Collaborator

Bidek56 commented Sep 6, 2024

It's partly a node issue, have you tried bun or deno? I get much better performance using those 2 vs. node

@maizhichao
Copy link

It's partly a node issue, have you tried bun or deno? I get much better performance using those 2 vs. node

My project highly depends on Node.js, it's hard to migrate to Bun or Deno...

@Bidek56
Copy link
Collaborator

Bidek56 commented Sep 6, 2024

Anything other operation using node takes much longer than using Bun
For example I just tried: [...Array(1_000_000)].forEach((_, i) => pl.Series("abs", [1, 2, 3]));
It takes many times longer in node than in Bun
Why do you think re-writing just _Expr will solve your performance issue?

@Bidek56
Copy link
Collaborator

Bidek56 commented Sep 6, 2024

For me Bun is 20% faster than Python when running: [...Array(1_000_000)].forEach((_, i) => pl.Series("abs", [1, 2, 3]));

@universalmind303
Copy link
Collaborator

@maizhichao unfortunately, this is a known issue with nodejs. Their FFI implementation is incredibly slow compared to python or other js engines (deno, bun). The main bottleneck is sending values over n-api, which refactoring Expr will not fix.

@universalmind303 universalmind303 added performance and removed bug Something isn't working labels Sep 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants