Make polars frames lazy and stream into csv #294
base: master
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

```diff
@@            Coverage Diff             @@
##           master     #294      +/-   ##
==========================================
- Coverage   89.69%   89.65%   -0.05%
==========================================
  Files          16       16
  Lines        4019     4021       +2
  Branches      939      941       +2
==========================================
  Hits         3605     3605
- Misses        281      284       +3
+ Partials      133      132       -1
```

☔ View full report in Codecov by Sentry.
Hey @coroa, thanks for your PR. According to the profiler the lazy operation is taking very long.

[Memory-usage plots: Original Pandas Based / Polars Based (Non-lazy) / Polars Based (lazy)]

Code for running the benchmark:

```python
import pypsa
import psutil
import time
import threading
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")
# Flag to control the monitoring loop
stop_monitoring = False
# List to store memory usage values
memory_values = []
# Function to monitor memory usage
def monitor_memory_usage(interval=0.1):
    global stop_monitoring
    global memory_values
    process = psutil.Process()
    while not stop_monitoring:
        mem_info = process.memory_info()
        memory_values.append(mem_info.rss / 1024 ** 2)  # Store memory in MB
        time.sleep(interval)
# Start monitoring memory usage in a separate thread
monitor_thread = threading.Thread(target=monitor_memory_usage)
monitor_thread.daemon = True # Daemonize thread
monitor_thread.start()
# Your original code
n = pypsa.Network(".../pypsa-eur/results/solver-io/prenetworks/elec_s_128_lv1.5__Co2L0-25H-T-H-B-I-A-solar+p3-dist1_2050.nc")
m = n.optimize.create_model()
m.to_file("test.lp", io_api="lp-polars")
# Stop monitoring
stop_monitoring = True
monitor_thread.join()
# Plotting the memory usage
plt.plot(memory_values)
plt.xlabel('Time (in 0.1s intervals)')
plt.ylabel('Memory Usage (MB)')
plt.title('Memory Usage Over Time')
plt.savefig("mem-polars-non-lazy.png")
print(max(memory_values))
```
Interesting that there are no memory savings in either case compared to the other two.
Thanks for the profiling. Very disappointing.
It's possible that …
The lazy version has to do everything at least twice, since the …
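A minimal sketch of that double execution, with a hypothetical toy frame and output path: the lazy plan runs once to materialise the null flags, and a second time when the rows are streamed to disk.

```python
import polars as pl

# Hypothetical stand-in for a constraint frame.
lf = pl.LazyFrame({"labels": [1, 2, 3], "coeffs": [1.0, 2.0, None]})

# First execution of the plan: materialise the per-column null flags.
has_nulls = lf.select(pl.col("*").is_null().any()).collect()

# Second execution of the same plan: stream the rows to a CSV file.
lf.sink_csv("constraints.csv")
```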
I'll try to debug around a bit to find out where we are scooping up this memory use. Any particular xarray version to focus on? @FabianHofmann
Cool, but no rush; it seems to be stable for the moment. I think it should be independent of the xarray version.
```diff
@@ -316,7 +318,7 @@ def check_has_nulls_polars(df: pl.DataFrame, name: str = "") -> None:
         ValueError: If the DataFrame contains null values,
             a ValueError is raised with a message indicating the name of the constraint and the fields containing null values.
     """
-    has_nulls = df.select(pl.col("*").is_null().any())
+    has_nulls = df.select(pl.col("*").is_null().any()).collect()
```
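For context: on a `LazyFrame`, `select` returns another lazy frame, so the per-column flags have to be materialised before they can be inspected. A small sketch with hypothetical column names:

```python
import polars as pl

lf = pl.LazyFrame({"labels": [1, 2], "coeffs": [0.5, None]})

# Collecting yields a one-row DataFrame with one boolean per column.
flags = lf.select(pl.col("*").is_null().any()).collect()
null_cols = [c for c in flags.columns if flags[c][0]]
print(null_cols)  # ['coeffs']
```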
Perhaps we can also avoid this `.collect`?
Yes, we should be able to, but I think we then need to change the formulation a bit further.
Tests run fine. The extra pyarrow dependency should not hurt, since Arrow is already a requirement for polars (and soon also for pandas), while pyarrow only adds the Python frontend on top.
We should check for each of the invocations of `write_lazyframe` that `explain(streamable=True)` shows it can actually run the streaming pipeline.

If you decide to merge, please squash (the history is ugly :))
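A sketch of such a check, with a hypothetical file and filter; note that in current polars the keyword on `explain` is `streaming`, so the exact flag name is version-dependent:

```python
import polars as pl

lf = pl.scan_csv("constraints.csv").filter(pl.col("coeffs").is_not_null())

# If the optimiser can execute the plan in the streaming engine,
# the printed plan contains a STREAMING section.
print(lf.explain(streaming=True))
```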