
Make polars frames lazy and stream into csv #294

Open · wants to merge 8 commits into master

Conversation

@coroa (Member) commented May 18, 2024

Tests run fine. The extra pyarrow dependency should not hurt: arrow is already a requirement for polars (and soon also for pandas), so pyarrow only adds the Python frontend on top.

We should check, for each invocation of write_lazyframe, that explain(streamable=True) shows the query can actually run as a streaming pipeline.
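
A minimal sketch of such a check (assuming the polars API of the time, where the keyword is streaming rather than streamable; newer releases use engine="streaming" instead):

import polars as pl

lf = pl.LazyFrame({"coeffs": [1.0, 2.0, None], "vars": [1, 2, 3]})
lf = lf.with_columns((pl.col("coeffs") * pl.col("vars")).alias("terms"))

# The printed plan wraps the parts that can run through the streaming
# engine in a STREAMING section; anything outside it falls back to the
# in-memory engine.
print(lf.explain(streaming=True))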

If you decide to merge, please squash (the history is ugly :))

@coroa requested a review from @FabianHofmann on May 18, 2024, 20:53
codecov bot commented May 19, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 89.65%. Comparing base (f0c8457) to head (d82e97e).

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #294      +/-   ##
==========================================
- Coverage   89.69%   89.65%   -0.05%     
==========================================
  Files          16       16              
  Lines        4019     4021       +2     
  Branches      939      941       +2     
==========================================
  Hits         3605     3605              
- Misses        281      284       +3     
+ Partials      133      132       -1     


@FabianHofmann (Collaborator) commented

Hey @coroa, thanks for your PR. According to the profiler, the lazy operation takes very long.

Memory profiles (plot images; only the captions are recoverable here):
Original pandas-based: mem-pandas
Polars-based (non-lazy): mem-polars-non-lazy
Polars-based (lazy): mem-lazy

Code for running the benchmark

import pypsa
import psutil
import time
import threading
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("whitegrid")

# Flag to control the monitoring loop
stop_monitoring = False

# List to store memory usage values
memory_values = []

# Function to monitor memory usage
def monitor_memory_usage(interval=0.1):
    global stop_monitoring
    global memory_values
    process = psutil.Process()
    while not stop_monitoring:
        mem_info = process.memory_info()
        memory_values.append(mem_info.rss / 1024 ** 2)  # Store memory in MB
        time.sleep(interval)

# Start monitoring memory usage in a separate thread
monitor_thread = threading.Thread(target=monitor_memory_usage)
monitor_thread.daemon = True  # Daemonize thread
monitor_thread.start()

# Build the optimization model and write the LP file (the step being profiled)
n = pypsa.Network(".../pypsa-eur/results/solver-io/prenetworks/elec_s_128_lv1.5__Co2L0-25H-T-H-B-I-A-solar+p3-dist1_2050.nc")
m = n.optimize.create_model()

m.to_file("test.lp", io_api="lp-polars")

# Stop monitoring
stop_monitoring = True
monitor_thread.join()

# Plotting the memory usage
plt.plot(memory_values)
plt.xlabel('Time (in 0.1s intervals)')
plt.ylabel('Memory Usage (MB)')
plt.title('Memory Usage Over Time')
plt.savefig("mem-polars-non-lazy.png")
print(f"Peak memory usage: {max(memory_values):.1f} MB")

@fneum (Member) commented May 21, 2024

Interesting that neither of the polars variants shows memory savings compared to the other runs.

@coroa (Member, Author) commented May 21, 2024

Thanks for the profiling. Very disappointing.

@coroa (Member, Author) commented May 21, 2024

It's possible that .values.reshape(-1) is not zero-copy.
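
A quick way to test that hypothesis (a sketch; np.shares_memory is numpy's standard check, and reshape(-1) only returns a view when the input is contiguous):

import numpy as np

a = np.arange(12).reshape(3, 4)
flat = a.reshape(-1)                 # contiguous input: this is a view
print(np.shares_memory(a, flat))     # True -> zero-copy

b = a[:, ::2]                        # non-contiguous slice
flat_b = b.reshape(-1)               # numpy must copy here
print(np.shares_memory(b, flat_b))   # False -> a full copy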

@coroa (Member, Author) commented May 21, 2024

The lazy version has to evaluate everything at least twice, since check_nulls already needs to evaluate the full pipeline (that could be improved). I don't know where the factor of 4 comes from, though.
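
To illustrate the double evaluation (a sketch, not the PR's actual code; lf stands in for a constraint frame and sink_csv for the streaming write):

import polars as pl

lf = pl.LazyFrame({"labels": [1, 2, 3], "coeffs": [1.0, None, 3.0]})

# First evaluation: the whole plan runs just to answer the null check.
has_nulls = lf.select(pl.col("*").is_null().any()).collect()

# Second evaluation: the same plan runs again to stream rows to disk.
lf.sink_csv("constraints.csv")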

@coroa (Member, Author) commented May 21, 2024

I'll try to debug around a bit to find out where we are scooping up this memory. Any particular xarray version to focus on? @FabianHofmann

@FabianHofmann (Collaborator) commented

> I'll try to debug around a bit to find out where we are scooping up this memory. Any particular xarray version to focus on? @FabianHofmann

Cool, but no rush, it seems to be stable for the moment. I think it should be independent of the xarray version.

@@ -316,7 +318,7 @@ def check_has_nulls_polars(df: pl.DataFrame, name: str = "") -> None:
     ValueError: If the DataFrame contains null values,
         a ValueError is raised with a message indicating the name of the constraint and the fields containing null values.
     """
-    has_nulls = df.select(pl.col("*").is_null().any())
+    has_nulls = df.select(pl.col("*").is_null().any()).collect()

@FabianHofmann (Collaborator) commented on the diff:

Perhaps we can also avoid this .collect?


@coroa (Member, Author) replied:

Yes, we should be able to, but I think we would then need to change the formulation a bit further.
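
One conceivable direction (a sketch under the assumption that the frame is materialized exactly once; write_checked is a hypothetical helper name, not what this PR implements): collect once with the streaming engine, run the null check on the materialized result, and only then write.

import polars as pl

def write_checked(lf: pl.LazyFrame, path: str, name: str = "") -> None:
    # Single (streaming) evaluation of the plan.
    df = lf.collect(streaming=True)
    # The null check now runs on the result, not on the plan,
    # so nothing is evaluated a second time.
    has_nulls = df.select(pl.col("*").is_null().any())
    null_cols = [c for c in has_nulls.columns if has_nulls[c][0]]
    if null_cols:
        raise ValueError(f"Constraint {name!r} has null values in {null_cols}")
    df.write_csv(path)

This would only help if the double evaluation dominates the cost, since it gives up writing the CSV in a streaming fashion.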
