Fixes for Pandas 2 support #742

Open
wants to merge 45 commits into
base: main
Changes from 44 commits
Commits
a37e797
Fix test setup to match pandas 2.0 demands
bartbroere Dec 15, 2024
e3dd93a
Use the now deprecated _append method
bartbroere Dec 15, 2024
2ae1296
Deal with numeric_only being removed in metrics test
bartbroere Dec 15, 2024
c847c6d
Skip mad metric for other pandas versions
bartbroere Dec 16, 2024
141682a
Account for differences between pandas versions in describe methods
bartbroere Dec 16, 2024
718b249
Run black
bartbroere Dec 16, 2024
bd0b10d
Check Pandas version first
bartbroere Dec 16, 2024
b6a31bd
Mirror behaviour of installed Pandas version when running value_counts
bartbroere Dec 18, 2024
399200d
Allow passing arguments to the individual asserters
bartbroere Dec 18, 2024
c198cc9
Fix for method _construct_axes_from_arguments no longer existing
bartbroere Dec 18, 2024
b18036d
Skip mad metric if it does not exist
bartbroere Dec 18, 2024
470848c
Account for pandas 2.0 timestamp default behaviour
bartbroere Dec 18, 2024
19ad96d
Deal with empty vs other inferred data types
bartbroere Dec 18, 2024
0ad66a0
Account for default datetime precision change
bartbroere Dec 18, 2024
388722c
Run Black
bartbroere Dec 18, 2024
416ec1e
Solution for differences in inferred_type only
bartbroere Dec 18, 2024
73078d3
Fix csv and json issues
bartbroere Dec 19, 2024
83df6dd
Skip two doctests
bartbroere Dec 19, 2024
81b381c
Passing a set as indexer is no longer allowed
bartbroere Dec 19, 2024
1146b2e
Don't validate output where it differs between Pandas versions in the…
bartbroere Dec 19, 2024
02f1573
Update test matrix and packaging metadata
bartbroere Dec 19, 2024
d08a29a
Update version of Python in the docs
bartbroere Dec 19, 2024
d70e2b9
Update Python version in demo notebook
bartbroere Dec 19, 2024
b8b1526
Match noxfile
bartbroere Dec 19, 2024
cfcd7a8
Symmetry
bartbroere Jan 6, 2025
96a5069
Fix trailing comma in JSON
bartbroere Jan 6, 2025
e269fa0
Merge branch 'main' into prepare-for-pandas-2
bartbroere Jan 8, 2025
ea98797
Revert some changes in setup.py to fix building the documentation
bartbroere Jan 10, 2025
7be5a9e
Merge branch 'main' into prepare-for-pandas-2
bartbroere Jan 18, 2025
d91b5c3
Revert "Revert some changes in setup.py to fix building the documenta…
bartbroere Jan 18, 2025
59e06f6
Merge branch 'main' into prepare-for-pandas-2
pquentin Jan 22, 2025
f9c589b
Use PANDAS_VERSION from eland.common
bartbroere Jan 28, 2025
d6f7af2
Still skip the doctest, but make the output pandas 2 instead of 1
bartbroere Jan 29, 2025
b188219
Still skip doctest, but switch to pandas 2 output
bartbroere Jan 29, 2025
12753d5
Prepare for pandas 3
bartbroere Jan 29, 2025
00c3c7c
Reference the right column
bartbroere Jan 29, 2025
f3c7e2e
Ignore output in tests but switch to pandas 2 output
bartbroere Jan 29, 2025
f13f83a
Add line comment about NBVAL_IGNORE_OUTPUT
bartbroere Jan 29, 2025
ddc1c26
Restore missing line and add stderr cell
bartbroere Jan 30, 2025
f335e6a
Use non-private method instead
bartbroere Jan 30, 2025
948b731
Fix indentation and parameter issues
bartbroere Jan 30, 2025
d7ad7ed
If index is not specified, and pandas 1 is present, set it to True
bartbroere Jan 30, 2025
2a1a6d4
Run black
bartbroere Jan 30, 2025
a37d266
Newer version of black might have different opinions?
bartbroere Jan 30, 2025
2942e5e
Add line comment
bartbroere Jan 30, 2025
7 changes: 6 additions & 1 deletion .buildkite/pipeline.yml
@@ -29,11 +29,16 @@ steps:
machineType: "n2-standard-4"
env:
PYTHON_VERSION: "{{ matrix.python }}"
PANDAS_VERSION: '1.5.0'
PANDAS_VERSION: "{{ matrix.pandas }}"
TEST_SUITE: "xpack"
ELASTICSEARCH_VERSION: "{{ matrix.stack }}"
matrix:
setup:
# Python and pandas versions need to be added to the nox configuration too
# (in the decorators of the test method in noxfile.py)
pandas:
- '1.5.0'
- '2.2.3'
python:
- '3.12'
- '3.11'
2 changes: 1 addition & 1 deletion docs/sphinx/examples/demo_notebook.ipynb
@@ -24,7 +24,7 @@
"\n",
"For this example, you will need:\n",
"\n",
"- Python 3.8 or later\n",
"- Python 3.9 or later\n",
"- An Elastic deployment\n",
" - We'll be using [Elastic Cloud](https://www.elastic.co/guide/en/cloud/current/ec-getting-started.html) for this example (available with a [free trial](https://cloud.elastic.co/registration))\n",
"\n",
2 changes: 1 addition & 1 deletion eland/common.py
@@ -309,7 +309,7 @@ def elasticsearch_date_to_pandas_date(


def ensure_es_client(
es_client: Union[str, List[str], Tuple[str, ...], Elasticsearch]
es_client: Union[str, List[str], Tuple[str, ...], Elasticsearch],
) -> Elasticsearch:
if isinstance(es_client, tuple):
es_client = list(es_client)
10 changes: 5 additions & 5 deletions eland/dataframe.py
@@ -34,7 +34,7 @@
from pandas.util._validators import validate_bool_kwarg # type: ignore

import eland.plotting as gfx
from eland.common import DEFAULT_NUM_ROWS_DISPLAYED, docstring_parameter
from eland.common import DEFAULT_NUM_ROWS_DISPLAYED, PANDAS_VERSION, docstring_parameter
from eland.filter import BooleanFilter
from eland.groupby import DataFrameGroupBy
from eland.ndframe import NDFrame
@@ -411,9 +411,7 @@ def drop(
axis = pd.DataFrame._get_axis_name(axis)
axes = {axis: labels}
elif index is not None or columns is not None:
axes, _ = pd.DataFrame()._construct_axes_from_arguments(
(index, columns), {}
)
axes = {"columns": columns, "index": index}
else:
raise ValueError(
"Need to specify at least one of 'labels', 'index' or 'columns'"
@@ -1361,7 +1359,7 @@ def to_json(
default_handler=None,
lines=False,
compression="infer",
index=True,
index=None,
indent=None,
storage_options=None,
):
@@ -1376,6 +1374,8 @@ def to_json(
--------
:pandas_api_docs:`pandas.DataFrame.to_json`
"""
if index is None and PANDAS_VERSION[0] == 1:
index = True
kwargs = {
"path_or_buf": path_or_buf,
"orient": orient,
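The `index=None` sentinel added to `to_json` can be looked at in isolation. This is a minimal sketch, assuming `PANDAS_VERSION` is a tuple of integers whose first element is the major version (as the `PANDAS_VERSION[0] == 1` check implies). `resolve_to_json_index` is a hypothetical helper name, not part of eland.

```python
from typing import Optional, Tuple


def resolve_to_json_index(
    index: Optional[bool], pandas_version: Tuple[int, ...]
) -> Optional[bool]:
    """Return the value to forward as ``DataFrame.to_json(index=...)``."""
    # With pandas 1 installed, fall back to the old explicit default of True.
    if index is None and pandas_version[0] == 1:
        return True
    # Otherwise forward the caller's value unchanged (None lets pandas decide).
    return index
```

An explicit `index=False` from the caller is preserved in both cases; only the "not specified" case is rewritten.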
39 changes: 26 additions & 13 deletions eland/etl.py
@@ -16,6 +16,7 @@
# under the License.

import csv
import warnings
from collections import deque
from typing import Any, Dict, Generator, List, Mapping, Optional, Tuple, Union

@@ -110,15 +111,15 @@ def pandas_to_eland(
2 3.141 1 ... 3 Long text - to be indexed as es type text
<BLANKLINE>
[3 rows x 8 columns]
>>> pd_df.dtypes
A float64
B int64
C object
D datetime64[ns]
E float64
F bool
G int64
H object
>>> pd_df.dtypes # doctest: +SKIP
A float64
B int64
C object
D datetime64[s]
E float64
F bool
G int64
H object
dtype: object

Convert `pandas.DataFrame` to `eland.DataFrame` - this creates an Elasticsearch index called `pandas_to_eland`.
@@ -307,9 +308,9 @@ def csv_to_eland( # type: ignore
names=None,
index_col=None,
usecols=None,
squeeze=False,
squeeze=None,
prefix=None,
mangle_dupe_cols=True,
mangle_dupe_cols=None,
# General Parsing Configuration
dtype=None,
engine=None,
@@ -357,6 +358,7 @@ def csv_to_eland( # type: ignore
low_memory: bool = _DEFAULT_LOW_MEMORY,
memory_map=False,
float_precision=None,
**extra_kwargs,
) -> "DataFrame":
"""
Read a comma-separated values (csv) file into eland.DataFrame (i.e. an Elasticsearch index).
@@ -485,7 +487,6 @@ def csv_to_eland( # type: ignore
"usecols": usecols,
"verbose": verbose,
"encoding": encoding,
"squeeze": squeeze,
"memory_map": memory_map,
"float_precision": float_precision,
"na_filter": na_filter,
@@ -494,9 +495,9 @@ def csv_to_eland( # type: ignore
"error_bad_lines": error_bad_lines,
"on_bad_lines": on_bad_lines,
"low_memory": low_memory,
"mangle_dupe_cols": mangle_dupe_cols,
"infer_datetime_format": infer_datetime_format,
"skip_blank_lines": skip_blank_lines,
**extra_kwargs,
}

if chunksize is None:
@@ -525,6 +526,18 @@ def csv_to_eland( # type: ignore

kwargs.pop("on_bad_lines")

if "squeeze" in kwargs:
kwargs.pop("squeeze")
warnings.warn(
"This argument no longer works, use .squeeze('columns') on your DataFrame instead"
)

if "mangle_dupe_cols" in kwargs:
kwargs.pop("mangle_dupe_cols")
warnings.warn(
"The mangle_dupe_cols argument no longer works. Furthermore, "
"duplicate columns will automatically get a number suffix."
)
# read csv in chunks to pandas DataFrame and dump to eland DataFrame (and Elasticsearch)
reader = pd.read_csv(filepath_or_buffer, **kwargs)
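The handling of the removed `squeeze` and `mangle_dupe_cols` arguments can be sketched as a standalone helper. This is only a sketch, assuming the intent is to warn callers who still pass a non-`None` value and to keep the argument away from `pd.read_csv`. `REMOVED_READ_CSV_ARGS` and `strip_removed_args` are hypothetical names, and the sketch does not check the installed pandas version.

```python
import warnings
from typing import Any, Dict

# Arguments named in the diff above, with a hint for each (wording is illustrative).
REMOVED_READ_CSV_ARGS = {
    "squeeze": "use .squeeze('columns') on the resulting DataFrame instead",
    "mangle_dupe_cols": "duplicate columns automatically get a number suffix",
}


def strip_removed_args(kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of kwargs without arguments that read_csv no longer accepts."""
    cleaned = dict(kwargs)
    for name, hint in REMOVED_READ_CSV_ARGS.items():
        if name not in cleaned:
            continue
        value = cleaned.pop(name)
        # Warn only when the caller actually supplied a value.
        if value is not None:
            warnings.warn(
                f"The {name} argument no longer works; {hint}.", stacklevel=2
            )
    return cleaned
```

Checking the value rather than only the key means the default `None` stays silent, while an explicit `squeeze=True` produces exactly one warning.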

7 changes: 5 additions & 2 deletions eland/field_mappings.py
@@ -712,8 +712,11 @@ def add_scripted_field(
capabilities, orient="index", columns=FieldMappings.column_labels
)

self._mappings_capabilities = self._mappings_capabilities.append(
capability_matrix_row
self._mappings_capabilities = pd.concat(
[
self._mappings_capabilities,
capability_matrix_row,
]
)

def numeric_source_fields(self) -> List[str]:
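`DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, which is why the hunk above switches to `pd.concat`. A small sketch of the replacement pattern follows; the column names and row labels are made up for illustration and are not eland's real capability columns.

```python
import pandas as pd

# Existing capability matrix (illustrative columns only).
capabilities = pd.DataFrame(
    {"es_dtype": ["keyword"], "is_scripted": [False]}, index=["Carrier"]
)

# New single-row frame, analogous to capability_matrix_row in the diff.
new_row = pd.DataFrame(
    {"es_dtype": ["double"], "is_scripted": [True]}, index=["scripted_field_1"]
)

# pd.concat works on pandas 1.5 and 2.x alike and keeps both indexes.
combined = pd.concat([capabilities, new_row])
```

Because `pd.concat` takes a list, several rows can be added in one call instead of appending repeatedly.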
1 change: 1 addition & 0 deletions eland/operations.py
@@ -42,6 +42,7 @@
DEFAULT_PIT_KEEP_ALIVE,
DEFAULT_PROGRESS_REPORTING_NUM_ROWS,
DEFAULT_SEARCH_SIZE,
PANDAS_VERSION,
SortOrder,
build_pd_series,
elasticsearch_date_to_pandas_date,
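The definition of `PANDAS_VERSION` in `eland.common` is not part of this diff. A sketch of how such a tuple could be derived from a version string is shown here, assuming it holds the major and minor version as integers. `parse_major_minor` is a hypothetical helper name.

```python
from typing import Tuple


def parse_major_minor(version: str) -> Tuple[int, ...]:
    """Turn a version string such as '2.2.3' into a tuple like (2, 2)."""
    parts = []
    for part in version.split(".")[:2]:
        # Stop at the first non-numeric component (for example 'rc1').
        if not part.isdigit():
            break
        parts.append(int(part))
    return tuple(parts)
```

With a tuple like this, a check written as `version[0] >= 2` also covers a future pandas 3, whereas `version[0] == 2` does not.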
17 changes: 13 additions & 4 deletions eland/series.py
@@ -40,11 +40,12 @@

import numpy as np
import pandas as pd # type: ignore
from pandas.core.indexes.frozen import FrozenList
from pandas.io.common import _expand_user, stringify_path # type: ignore

import eland.plotting
from eland.arithmetics import ArithmeticNumber, ArithmeticSeries, ArithmeticString
from eland.common import DEFAULT_NUM_ROWS_DISPLAYED, docstring_parameter
from eland.common import DEFAULT_NUM_ROWS_DISPLAYED, PANDAS_VERSION, docstring_parameter
from eland.filter import (
BooleanFilter,
Equal,
@@ -292,18 +293,26 @@ def value_counts(self, es_size: int = 10) -> pd.Series:
Examples
--------
>>> df = ed.DataFrame('http://localhost:9200', 'flights')
>>> df['Carrier'].value_counts()
>>> df['Carrier'].value_counts() # doctest: +SKIP
Carrier
Logstash Airways 3331
JetBeats 3274
Kibana Airlines 3234
ES-Air 3220
Name: Carrier, dtype: int64
Name: count, dtype: int64
"""
if not isinstance(es_size, int):
raise TypeError("es_size must be a positive integer.")
elif es_size <= 0:
raise ValueError("es_size must be a positive integer.")
return self._query_compiler.value_counts(es_size)
value_counts = self._query_compiler.value_counts(es_size)
# https://pandas.pydata.org/docs/whatsnew/v2.0.0.html#value-counts-sets-the-resulting-name-to-count
if PANDAS_VERSION[0] == 2:
value_counts.name = "count"
value_counts.index.names = FrozenList([self.es_field_name])
value_counts.index.name = self.es_field_name

return value_counts

# dtype not implemented for Series as causes query to fail
# in pandas.core.computation.ops.Term.type
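The naming difference that `value_counts` mirrors can be sketched on a plain pandas Series. In pandas 2 the result is named `count` and the index carries the original column name; in pandas 1 the result carries the column name and the index is unnamed. `apply_value_counts_naming` is a hypothetical helper, and the major version is passed in explicitly so the behaviour does not depend on the installed pandas.

```python
import pandas as pd


def apply_value_counts_naming(
    counts: pd.Series, field_name: str, pandas_major: int
) -> pd.Series:
    """Name a counts Series the way the given pandas major version would."""
    if pandas_major >= 2:
        # pandas >= 2: Series is called "count", index is called after the field.
        return counts.rename("count").rename_axis(field_name)
    # pandas 1: Series is called after the field, index has no name.
    return counts.rename(field_name).rename_axis(None)
```

Using `rename` and `rename_axis` returns new objects, so the input Series is left untouched.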
2 changes: 1 addition & 1 deletion noxfile.py
@@ -101,7 +101,7 @@ def lint(session):


@nox.session(python=["3.9", "3.10", "3.11", "3.12"])
@nox.parametrize("pandas_version", ["1.5.0"])
@nox.parametrize("pandas_version", ["1.5.0", "2.2.3"])
def test(session, pandas_version: str):
session.install("-r", "requirements-dev.txt")
session.install(".")
2 changes: 1 addition & 1 deletion setup.py
@@ -87,7 +87,7 @@
packages=find_packages(include=["eland", "eland.*"]),
install_requires=[
"elasticsearch>=8.3,<9",
"pandas>=1.5,<2",
"pandas>=1.5,<3",
"matplotlib>=3.6",
"numpy>=1.2.0,<2",
"packaging",
8 changes: 6 additions & 2 deletions tests/common.py
@@ -24,6 +24,7 @@
from pandas.testing import assert_frame_equal, assert_series_equal

import eland as ed
from eland.common import PANDAS_VERSION

ROOT_DIR = os.path.dirname(os.path.abspath(__file__))

@@ -45,7 +46,10 @@
_pd_flights = pd.DataFrame.from_records(flight_records).reindex(
_ed_flights.columns, axis=1
)
_pd_flights["timestamp"] = pd.to_datetime(_pd_flights["timestamp"])
if PANDAS_VERSION[0] >= 2:
_pd_flights["timestamp"] = pd.to_datetime(_pd_flights["timestamp"], format="mixed")
else:
_pd_flights["timestamp"] = pd.to_datetime(_pd_flights["timestamp"])
# Mimic what copy_to in an Elasticsearch mapping would do, combining the two fields in a list
_pd_flights["Cities"] = _pd_flights.apply(
lambda x: list(sorted([x["OriginCityName"], x["DestCityName"]])), axis=1
@@ -62,7 +66,7 @@
)
_pd_ecommerce.insert(2, "customer_birth_date", None)
_pd_ecommerce.index = _pd_ecommerce.index.map(str) # make index 'object' not int
_pd_ecommerce["customer_birth_date"].astype("datetime64")
_pd_ecommerce["customer_birth_date"].astype("datetime64[ns]")
_ed_ecommerce = ed.DataFrame(ES_TEST_CLIENT, ECOMMERCE_INDEX_NAME)
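The version-gated timestamp parsing in this file can be sketched on its own. `format="mixed"` exists only in pandas 2.0 and later, where it parses each value individually instead of applying one inferred format to the whole column. The sample values are made up, and the major version is read from `pd.__version__` here rather than from `eland.common`.

```python
import pandas as pd

PANDAS_MAJOR = int(pd.__version__.split(".")[0])


def parse_timestamps(values: pd.Series) -> pd.Series:
    """Parse timestamp strings that may or may not carry fractional seconds."""
    if PANDAS_MAJOR >= 2:
        # pandas >= 2: let each element be parsed with its own format.
        return pd.to_datetime(values, format="mixed")
    # pandas 1 does not know format="mixed" and parses per element already.
    return pd.to_datetime(values)


stamps = parse_timestamps(
    pd.Series(["2018-01-01T00:00:00", "2018-01-01T18:27:00.5"])
)
```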


21 changes: 15 additions & 6 deletions tests/conftest.py
@@ -77,24 +77,33 @@ def f(*args, **kwargs):
pd_exc = e

self.check_exception(ed_exc, pd_exc)
self.check_values(ed_obj, pd_obj)
try:
self.check_values(ed_obj, pd_obj)
except AssertionError as e:
# This is an attribute we allow to differ when comparing zero-length objects
if (
'Attribute "inferred_type" are different' in repr(e)
and len(ed_obj) == 0
and len(pd_obj) == 0
):
self.check_values(ed_obj, pd_obj, check_index_type=False)

if isinstance(ed_obj, (ed.DataFrame, ed.Series)):
return SymmetricAPIChecker(ed_obj, pd_obj)
return pd_obj

return f

def check_values(self, ed_obj, pd_obj):
def check_values(self, ed_obj, pd_obj, **kwargs):
"""Checks that any two values coming from eland and pandas are equal"""
if isinstance(ed_obj, ed.DataFrame):
assert_pandas_eland_frame_equal(pd_obj, ed_obj)
assert_pandas_eland_frame_equal(pd_obj, ed_obj, **kwargs)
elif isinstance(ed_obj, ed.Series):
assert_pandas_eland_series_equal(pd_obj, ed_obj)
assert_pandas_eland_series_equal(pd_obj, ed_obj, **kwargs)
elif isinstance(ed_obj, pd.DataFrame):
assert_frame_equal(ed_obj, pd_obj)
assert_frame_equal(ed_obj, pd_obj, **kwargs)
elif isinstance(ed_obj, pd.Series):
assert_series_equal(ed_obj, pd_obj)
assert_series_equal(ed_obj, pd_obj, **kwargs)
elif isinstance(ed_obj, pd.Index):
assert ed_obj.equals(pd_obj)
else:
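The retry-on-empty pattern in `f` can be sketched without eland or pandas. This is a minimal sketch, assuming `check` is any callable that raises `AssertionError` on mismatch and accepts a `check_index_type` keyword. `check_with_empty_fallback` is a hypothetical name. The sketch also re-raises assertion failures that do not match the allowed case, which the hunk as rendered here does not show.

```python
from typing import Any, Callable

ALLOWED_MESSAGE = 'Attribute "inferred_type" are different'


def check_with_empty_fallback(check: Callable[..., None], left: Any, right: Any) -> None:
    """Run check(left, right); relax the index type check for zero-length objects."""
    try:
        check(left, right)
    except AssertionError as e:
        both_empty = len(left) == 0 and len(right) == 0
        if ALLOWED_MESSAGE in str(e) and both_empty:
            # Only the inferred index type differs on empty objects: retry relaxed.
            check(left, right, check_index_type=False)
        else:
            # Any other mismatch is a real failure and must not be swallowed.
            raise
```

Keeping the `else: raise` branch matters: without it, every assertion failure from the first comparison would be silently ignored.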
2 changes: 2 additions & 0 deletions tests/dataframe/test_datetime_pytest.py
@@ -87,6 +87,8 @@ def test_datetime_to_ms(self):
},
index=["0", "1", "2"],
)
# https://pandas.pydata.org/docs/whatsnew/v2.0.0.html#construction-with-datetime64-or-timedelta64-dtype-with-unsupported-resolution
df["D"] = df["D"].astype("datetime64[ns]")

expected_mappings = {
"mappings": {
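The cast to `datetime64[ns]` added in this test addresses the pandas 2 change linked above: datetime data with a supported non-nanosecond resolution is no longer converted to nanoseconds on construction. A small sketch of the normalisation, with made-up values:

```python
import numpy as np
import pandas as pd

# A NumPy array with second resolution; pandas 2 keeps this resolution,
# while pandas 1 converts it to nanoseconds on construction.
seconds = np.array(
    ["2020-01-01T12:00:00", "2020-01-02T12:00:00"], dtype="datetime64[s]"
)
column = pd.Series(seconds)

# An explicit cast gives the same dtype under both major versions.
normalized = column.astype("datetime64[ns]")
```

After the cast, code that depends on the dtype (such as the expected mappings in this test) sees `datetime64[ns]` regardless of the installed pandas.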
12 changes: 10 additions & 2 deletions tests/dataframe/test_describe_pytest.py
@@ -33,9 +33,17 @@ def test_flights_describe(self):
["Cancelled", "FlightDelay"], axis="columns"
)

# Pandas >= 2 calculates aggregations such as min and max for timestamps too
# This could be implemented in eland, but as of yet this is not the case
# We therefore remove it before the comparison
if "timestamp" in pd_describe.columns:
pd_describe = pd_describe.drop(["timestamp"], axis="columns")

# Pandas >= 2 orders the aggregations differently than Pandas < 2
# A sort_index is applied so tests will succeed in both environments
assert_frame_equal(
pd_describe.drop(["25%", "50%", "75%"], axis="index"),
ed_describe.drop(["25%", "50%", "75%"], axis="index"),
pd_describe.drop(["25%", "50%", "75%"], axis="index").sort_index(),
ed_describe.drop(["25%", "50%", "75%"], axis="index").sort_index(),
check_exact=False,
rtol=True,
)
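The normalisation applied before comparing the two `describe` results can be sketched as one helper. It drops the `timestamp` column that pandas 2 adds, drops the percentile rows, and sorts the remaining rows so that the row order no longer depends on the pandas version. `normalize_describe` is a hypothetical name and the sample frame is made up.

```python
import pandas as pd


def normalize_describe(described: pd.DataFrame) -> pd.DataFrame:
    """Make describe() output comparable across pandas 1 and pandas 2."""
    # pandas >= 2 also aggregates datetime columns; remove them if present.
    if "timestamp" in described.columns:
        described = described.drop(["timestamp"], axis="columns")
    # Percentiles are excluded from the comparison; sort rows for a stable order.
    return described.drop(["25%", "50%", "75%"], axis="index").sort_index()


frame = pd.DataFrame(
    {
        "AvgTicketPrice": [1.0, 2.0, 3.0],
        "timestamp": pd.to_datetime(["2018-01-01", "2018-01-02", "2018-01-03"]),
    }
)
result = normalize_describe(frame.describe())
```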
2 changes: 1 addition & 1 deletion tests/dataframe/test_head_tail_pytest.py
@@ -99,7 +99,7 @@ def test_head_0(self):

ed_head_0 = ed_flights.head(0)
pd_head_0 = pd_flights.head(0)
assert_pandas_eland_frame_equal(pd_head_0, ed_head_0)
assert_pandas_eland_frame_equal(pd_head_0, ed_head_0, check_index_type=False)

def test_doc_test_tail(self):
df = self.ed_flights()