Prepare release #10

Merged 9 commits on Jan 17, 2025
78 changes: 68 additions & 10 deletions .github/workflows/pipeline.yaml
Expand Up @@ -15,25 +15,83 @@ concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
cancel-in-progress: true


jobs:
Pipeline:
BuildDocs:
name: Build Docs
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4

- name: Set up Python 3.12
uses: actions/setup-python@v5
- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: "3.12"
python-version: '3.11'

- name: Set up UV
run: python -m pip install uv
- name: Setup UV
run: pip install uv

- name: Generate documentation
- name: Execute Tests
run: make docs-build

- name: Execute lint checks
run: make lint
UnitTests:
name: Unit Tests
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4

- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: '3.11'

- name: Execute tests
- name: Setup UV
run: pip install uv

- name: Execute Tests
run: make test-all

StaticChecks:
name: Static Checks
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4

- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: '3.11'

- name: Setup UV
run: pip install uv

- name: Execute Static Checks
run: make lint

Publish:
name: Publish
runs-on: ubuntu-latest
needs: [UnitTests, StaticChecks, BuildDocs]
environment:
name: pypi
url: https://pypi.org/p/gyjd
permissions:
id-token: write
if: startsWith(github.ref, 'refs/tags/v')
steps:
- uses: actions/checkout@v4

- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: '3.11'

- name: Setup UV
run: pip install uv

- name: Build package
run: make build

- name: Publish package distributions to PyPI
uses: pypa/gh-action-pypi-publish@release/v1
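
The `if:` guard on the Publish job amounts to a prefix check on the Git ref. Below is a minimal Python sketch of that condition, assuming `github.ref` holds the full ref string. The helper name `should_publish` is hypothetical and not part of the workflow. GitHub documents `startsWith` as case-insensitive, so the sketch lowercases the ref first.

```python
def should_publish(ref: str) -> bool:
    """Approximate startsWith(github.ref, 'refs/tags/v') from the workflow.

    Hypothetical helper for illustration only; lowercasing mimics the
    case-insensitive comparison of the workflow expression.
    """
    return ref.lower().startswith("refs/tags/v")


# Only tag refs whose name starts with "v" would reach the Publish job.
for ref in ("refs/tags/v0.1.0", "refs/heads/main", "refs/tags/release-1"):
    print(ref, should_publish(ref))
```

Even with a matching tag, the job still waits on `needs: [UnitTests, StaticChecks, BuildDocs]`, so a tagged commit is published only after all three jobs succeed.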

38 changes: 37 additions & 1 deletion README.md
@@ -1 +1,37 @@
# lazy_pandas
# Lazy Pandas
Lazy Pandas is a Python library that simplifies working with DuckDB by wrapping it in the pandas API. It is not a pandas replacement, but a way to use the pandas API on top of DuckDB. Pandas is excellent and widely adopted, but it is not the best tool for datasets that do not fit in memory. So why not give the power of DuckDB to pandas users?

## Installation

To install Lazy Pandas, you can use pip:

```sh
pip install lazy-pandas
```

## Usage

Here is a basic example of how to use Lazy Pandas:
```python
import lazy_pandas as lp

location = "taxi.csv"  # placeholder path; point this at your own CSV file
df = lp.read_csv(location, parse_dates=["pickup_datetime"])
df = df[["pickup_datetime", "passenger_count"]]
df["pickup_date"] = df["pickup_datetime"].dt.date
df = df.sort_values("pickup_date")
df = df.collect() # Materialize the lazy DataFrame to a pandas DataFrame
```

## Features

- Lazy evaluation
- SQL support
- Support for DuckDB extensions (e.g., Delta, Iceberg, etc.)
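
To make the "lazy evaluation" feature concrete, here is a toy sketch in plain Python. It is not how Lazy Pandas or DuckDB is implemented; the class `LazyPipeline` and its methods are hypothetical names chosen for illustration. The idea it shows is that each call only records a step, and nothing runs until `collect()` is called.

```python
from typing import Callable, Iterable


class LazyPipeline:
    """Toy model of lazy evaluation: steps are recorded, then run on collect()."""

    def __init__(self, source: Iterable[dict], steps: tuple = ()):
        self._source = source
        self._steps = steps

    def map(self, fn: Callable[[dict], dict]) -> "LazyPipeline":
        # Record the transformation; do not touch the data yet.
        return LazyPipeline(self._source, self._steps + (("map", fn),))

    def filter(self, pred: Callable[[dict], bool]) -> "LazyPipeline":
        # Record the predicate; do not touch the data yet.
        return LazyPipeline(self._source, self._steps + (("filter", pred),))

    def collect(self) -> list:
        # Chain the recorded steps as iterators, then materialize once.
        rows = iter(self._source)
        for kind, fn in self._steps:
            rows = map(fn, rows) if kind == "map" else filter(fn, rows)
        return list(rows)


calls = []  # records which rows were actually processed


def add_total(row: dict) -> dict:
    calls.append(row["id"])
    return {**row, "total": row["price"] * row["qty"]}


rows = [{"id": 1, "price": 2.0, "qty": 3}, {"id": 2, "price": 5.0, "qty": 1}]
pipeline = LazyPipeline(rows).map(add_total).filter(lambda r: r["total"] > 5.5)
calls_before_collect = list(calls)  # still empty: no work has been done
result = pipeline.collect()
print(calls_before_collect, result)
```

A real engine can go further than this sketch by inspecting the recorded steps and optimizing them before execution, which a plain iterator chain does not do.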

## Contribution

Contributions are welcome! Feel free to open issues and pull requests.

## License

This project is licensed under the MIT License - see the LICENSE file for details.
2 changes: 1 addition & 1 deletion docs/docs/assets/profiler/lazy_pandas.json
Expand Up @@ -14,7 +14,7 @@
},
"annotations": [
{
"text": "Uso de memória ao longo do tempo (segundos)",
"text": "Memory usage over time (seconds)",
"xref": "paper",
"yref": "paper",
"x": 0.5,
Expand Down
2 changes: 1 addition & 1 deletion docs/docs/assets/profiler/pandas.json
Expand Up @@ -15,7 +15,7 @@
},
"annotations": [
{
"text": "Uso de memória ao longo do tempo (segundos)",
"text": "Memory usage over time (seconds)",
"xref": "paper",
"yref": "paper",
"x": 0.5,
Expand Down
31 changes: 14 additions & 17 deletions docs/docs/index.md
@@ -1,37 +1,34 @@
---
title: Lazy Pandas
hide:
- navigation
- toc
- navigation
- toc
---
Welcome to the official Lazy Pandas documentation! Lazy Pandas is a library that lets you use the pandas API with DuckDB, and getting started is as simple as a pip install.

# Lazy Pandas
To start using Lazy Pandas, you can install it using pip:

Welcome to the **Lazy Pandas** official documentation!
A library inspired by [pandas](https://pandas.pydata.org/) that focuses on *lazy* processing, enabling high performance and lower memory usage for large datasets.
```sh
pip install lazy-pandas
```

## What is Lazy Pandas?

Lazy Pandas is built on the concept of delaying DataFrame operations until they are strictly necessary (lazy evaluation). This allows:
- Operations to be optimized in batches.
- Memory usage to be minimized during processing.
- Total runtime to be reduced for complex pipelines.
Lazy Pandas is a wrapper around DuckDB that lets you use the pandas API to interact with DuckDB. It is not a pandas replacement, but a way to use the pandas API on top of DuckDB. Pandas is excellent and widely adopted, but it is not the best tool for datasets that do not fit in memory. So why not give the power of DuckDB to pandas users?

## Code Comparison

Below is a side-by-side comparison showing how the same operation would look in **Pandas** versus **Lazy Pandas**:


=== "Lazy Pandas"

```python linenums="1" hl_lines="2 5 13"
import pandas as pd
import lazy_pandas as lpd
import lazy_pandas as lp

def read_taxi_dataset(location: str) -> pd.DataFrame:
df = lpd.read_csv(location, parse_dates=["pickup_datetime"])
df = lp.read_csv(location, parse_dates=["pickup_datetime"])
df = df[["pickup_datetime", "passenger_count"]]
df["passenger_count"] = df["passenger_count"]
df["pickup_date"] = df["pickup_datetime"].dt.date
del df["pickup_datetime"]
df = df.groupby("pickup_date").sum().reset_index()
Expand All @@ -41,7 +38,6 @@ Below is a side-by-side comparison showing how the same operation would look in
return df
```


=== "Pandas"

```python linenums="1"
Expand All @@ -51,7 +47,6 @@ Below is a side-by-side comparison showing how the same operation would look in
def read_taxi_dataset(location: str) -> pd.DataFrame:
df = pd.read_csv(location, parse_dates=["pickup_datetime"])
df = df[["pickup_datetime", "passenger_count"]]
df["passenger_count"] = df["passenger_count"]
df["pickup_date"] = df["pickup_datetime"].dt.date
del df["pickup_datetime"]
df = df.groupby("pickup_date").sum().reset_index()
Expand All @@ -65,8 +60,7 @@ Notice that in traditional **pandas**, operations are executed immediately, whil

## Memory Usage

Below is a fictitious performance comparison between **pandas** and **Lazy Pandas**, showing a scenario where a large dataset is processed in three stages (reading, aggregation, and complex filtering).

Running the previous code on a 5.7 GB CSV file with 55 million rows shows the difference in memory usage between **Pandas** and **Lazy Pandas**:

<div class="grid cards" markdown>
```plotly
Expand All @@ -78,4 +72,7 @@ Below is a fictitious performance comparison between **pandas** and **Lazy Panda
```
</div>

In the **Pandas** example, memory usage spikes to 25.8 GB and the run takes 8 minutes to complete, while in the **Lazy Pandas** example, memory usage remains constant at 500 MB and the run takes 6 seconds to complete.
The test ran on a MacBook Pro M1 with 16 GB of memory. The dataset was the [NYC Taxi Dataset](https://www.kaggle.com/code/debjeetdas/nyc-taxi-fare-eda-prediction-using-linear-reg/input?select=train.csv) available on Kaggle.
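
The reported figures can be turned into ratios with a few lines of arithmetic. This sketch uses only the numbers quoted above and assumes 1 GB = 1000 MB; the variable names are illustrative.

```python
# Figures quoted in the comparison above.
pandas_peak_mb = 25.8 * 1000  # 25.8 GB, assuming 1 GB = 1000 MB
lazy_peak_mb = 500            # 500 MB
pandas_seconds = 8 * 60       # 8 minutes
lazy_seconds = 6              # 6 seconds

memory_ratio = pandas_peak_mb / lazy_peak_mb
speedup = pandas_seconds / lazy_seconds

print(f"peak memory ratio: {memory_ratio:.1f}x, runtime ratio: {speedup:.0f}x")
```

Under these assumptions, peak memory is roughly 52 times lower and the run is 80 times faster. Both ratios describe this single run on one machine and dataset.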


1 change: 0 additions & 1 deletion docs/mkdocs.yml
Expand Up @@ -27,7 +27,6 @@ theme:
- navigation.sections
- toc.integrate
- toc.follow
- content.action.edit
plugins:
#- include-markdown
- plotly
Expand Down
16 changes: 8 additions & 8 deletions docs/scripts/generate_references.py
Expand Up @@ -6,31 +6,31 @@
package_dir = Path(__file__).parent.parent.parent / "src"
sys.path.insert(0, str(package_dir))

import lazy_pandas as lpd # noqa: E402
import lazy_pandas as lp # noqa: E402

vls = []

vls += [
(1000 + idx, "lazy_pandas.LazyFrame", f"LazyFrame.{attr}", attr)
for idx, attr in enumerate(sorted(dir(lpd.LazyFrame)))
for idx, attr in enumerate(sorted(dir(lp.LazyFrame)))
if not attr.startswith("_")
]

vls += [
(1000 + idx, "lazy_pandas.LazyColumn", f"LazyColumn.{attr}", attr)
for idx, attr in enumerate(sorted(dir(lpd.LazyColumn)))
for idx, attr in enumerate(sorted(dir(lp.LazyColumn)))
if not attr.startswith("_") and attr not in ["str", "dt", "create_from_function"]
]

vls += [
(2000 + idx, "lazy_pandas.LazyStringColumn", f"LazyColumn.str.{attr}", attr)
for idx, attr in enumerate(sorted(dir(lpd.LazyStringColumn)))
for idx, attr in enumerate(sorted(dir(lp.LazyStringColumn)))
if not attr.startswith("_")
]

vls += [
(3000 + idx, "lazy_pandas.LazyDateTimeColumn", f"LazyColumn.dt.{attr}", attr)
for idx, attr in enumerate(sorted(dir(lpd.LazyDateTimeColumn)))
for idx, attr in enumerate(sorted(dir(lp.LazyDateTimeColumn)))
if not attr.startswith("_")
]

Expand All @@ -50,15 +50,15 @@

fn_names = [
attr
for idx, attr in enumerate(sorted(dir(lpd)))
for idx, attr in enumerate(sorted(dir(lp)))
if not attr.startswith("_")
and callable(getattr(lpd, attr))
and callable(getattr(lp, attr))
and attr not in ["LazyFrame", "LazyColumn", "LazyStringColumn", "LazyDateTimeColumn"]
]


template = """
# lpd.{function_name}
# lp.{function_name}
::: lazy_pandas.{function_name}
options:
members:
Expand Down
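
The script above collects names by introspection: `dir()` lists attributes, names starting with an underscore are skipped, and `callable()` keeps functions and classes. A minimal sketch of that filter, run against the standard-library `json` module so it works without `lazy_pandas` installed; the helper name `public_callables` is hypothetical.

```python
import json
from types import ModuleType


def public_callables(module: ModuleType, exclude: tuple = ()) -> list:
    """Return sorted public names in `module` that are callable and not excluded."""
    return [
        name
        for name in sorted(dir(module))
        if not name.startswith("_")
        and callable(getattr(module, name))
        and name not in exclude
    ]


# Exclude classes that would be documented separately, as the script does
# for LazyFrame, LazyColumn, and the other lazy_pandas classes.
names = public_callables(json, exclude=("JSONDecoder", "JSONEncoder", "JSONDecodeError"))
print(names)
```

Submodules such as `json.decoder` are dropped automatically because module objects are not callable.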
3 changes: 3 additions & 0 deletions makefile
@@ -1,6 +1,9 @@
UVX = uvx
MKDOCS_OPTS = --with-requirements requirements.txt

build:
uv build

test:
$(UVX) hatch test

Expand Down
4 changes: 2 additions & 2 deletions pyproject.toml
Expand Up @@ -9,7 +9,7 @@ requires = [
dependencies = [
"duckdb",
]
description = "Add your description here"
description = "The power of duckdb with the ease of pandas"
dynamic = [
"version",
]
Expand Down Expand Up @@ -40,7 +40,7 @@ extra-args = [
run = "pytest{env:HATCH_TEST_ARGS:} {args}"

[tool.hatch.version]
source = "vcs"
path = "src/lazy_pandas/__init__.py"

[tool.pytest.ini_options]
pythonpath = "src"
Expand Down
Empty file removed src/__init__.py
Empty file.
2 changes: 2 additions & 0 deletions src/lazy_pandas/__init__.py
Expand Up @@ -15,3 +15,5 @@
"LazyDateTimeColumn",
"LazyStringColumn",
]

__version__ = "0.1.0dev1"
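
With `[tool.hatch.version]` pointing at `src/lazy_pandas/__init__.py`, the version is read from the `__version__` assignment in that file rather than from VCS tags. The sketch below shows the general idea with a simplified regular expression. The pattern and the `read_version` helper are assumptions for illustration and are not Hatch's actual implementation.

```python
import re

# Simplified pattern: a top-level `__version__ = "..."` assignment.
VERSION_PATTERN = re.compile(
    r"""^__version__\s*=\s*["'](?P<version>[^"']+)["']""",
    re.MULTILINE,
)


def read_version(source: str) -> str:
    """Extract the version string from Python source text (hypothetical helper)."""
    match = VERSION_PATTERN.search(source)
    if match is None:
        raise ValueError("no __version__ assignment found")
    return match.group("version")


# Stand-in for the contents of src/lazy_pandas/__init__.py.
sample = '__all__ = ["LazyFrame"]\n\n__version__ = "0.1.0dev1"\n'
version = read_version(sample)
print(version)
```

Reading the version from a file means each release requires editing `__version__` before tagging, whereas the previous `source = "vcs"` setting derived it from the tag itself.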