Releases: bmeares/Meerschaum
v1.6.11 (v1.5.19 for Python 3.7)
- Fix an issue with in-place syncing.
  When syncing a SQL pipe in-place with a backtrack interval, the interval is applied to the existing data stage to avoid inserting duplicate rows.
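  For illustration, a minimal sketch of an in-place SQL sync (the connector, pipe keys, and query are hypothetical, and the `fetch` parameter keys shown are assumptions based on standard SQL pipe configuration):

  ```python
  import meerschaum as mrsm

  # When a pipe's source connector and instance are the same SQL connector,
  # syncs are performed in-place within the database.
  pipe = mrsm.Pipe(
      'sql:main', 'daily_totals',
      instance='sql:main',
      parameters={
          'fetch': {
              'definition': 'SELECT dt, total FROM sales',
              # Assumed key: re-fetch the trailing day on each sync.
              'backtrack_minutes': 1440,
          },
          'columns': {'datetime': 'dt'},
      },
  )
  pipe.sync()
  ```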
v1.6.9 – v1.6.10
- Improve thread safety checks.
  Added checks for `IS_THREAD_SAFE` to connectors to determine whether to use multithreading.
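  For context, a minimal sketch of how a connector might declare this attribute (the connector class here is hypothetical):

  ```python
  from meerschaum.connectors import Connector, make_connector

  @make_connector
  class SerialConnector(Connector):
      # Declare this connector non-thread-safe so its chunks
      # are synced serially rather than in parallel threads.
      IS_THREAD_SAFE = False
  ```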
- Fix an issue with custom flags while syncing.
  This patch includes better handling of custom flags added from plugins during the syncing process.
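  As background, plugins can register custom flags with `add_plugin_argument`; a minimal sketch (the flag name is hypothetical):

  ```python
  from meerschaum.plugins import add_plugin_argument

  # Arguments mirror argparse's add_argument().
  add_plugin_argument(
      '--my-flag',
      action='store_true',
      help="Hypothetical custom flag consumed during syncs.",
  )
  ```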
v1.6.8: Ultimate Generators Release!
- Added `as_iterator` to `Pipe.get_data()`.
  Passing `as_iterator=True` (or `as_chunks`) to `Pipe.get_data()` returns a generator which yields chunks of Pandas DataFrames. Each DataFrame is the result of a `Pipe.get_data()` call with intermediate datetime bounds between `begin` and `end` of size `chunk_interval` (default `datetime.timedelta(days=1)` for time-series / 100,000 IDs for integers).

  ```python
  import meerschaum as mrsm

  pipe = mrsm.Pipe(
      'a', 'b',
      columns={'datetime': 'id'},
      dtypes={'id': 'Int64'},
  )
  pipe.sync([
      {'id': 0, 'color': 'red'},
      {'id': 1, 'color': 'blue'},
      {'id': 2, 'color': 'green'},
      {'id': 3, 'color': 'yellow'},
  ])

  ### NOTE: due to non-inclusive end bounding,
  ###       chunks sometimes contain
  ###       (chunk_interval - 1) rows.
  chunks = pipe.get_data(
      chunk_interval = 2,
      as_iterator = True,
  )
  for chunk in chunks:
      print(chunk)

  #    id color
  # 0   0   red
  # 1   1  blue
  #
  #    id   color
  # 0   2   green
  # 1   3  yellow
  ```
- Add server-side cursor support to `SQLConnector.read()`.
  If `chunk_hook` is provided, keep an open cursor and stream the chunks one-at-a-time. This allows for processing very large out-of-memory data sets. To return the results of the `chunk_hook` callable rather than a DataFrame, pass `as_hook_results=True` to receive a list of values. If `as_iterator` is provided or `chunksize` is `None`, then `SQLConnector.read()` reverts to the default client-side cursor implementation (which loads the entire result set into memory).

  ```python
  import meerschaum as mrsm

  conn = mrsm.get_connector()

  def process_chunk(df: 'pd.DataFrame', **kw) -> int:
      return len(df)

  results = conn.read(
      "very_large_table",
      chunk_hook = process_chunk,
      as_hook_results = True,
      chunksize = 100,
  )

  results[:2]
  # [100, 100]
  ```
- Remove `--sync-chunks` and set its behavior as default.
  Due to the above changes to `SQLConnector.read()`, `sync_chunks` now defaults to `True` in `Pipe.sync()`. You may disable this behavior with `--chunksize 0`.
v1.6.7
v1.6.6 (v1.5.14 backported to Python 3.7)
- Issue one `ALTER TABLE` query per column for SQLite, MSSQL, DuckDB, and Oracle SQL.
  SQLite and other flavors do not support multiple columns in an `ALTER TABLE` query. This patch addresses this behavior and adds a specific test for this scenario.
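  To illustrate the scenario (a sketch on an in-memory SQLite instance; the pipe keys are arbitrary), syncing documents with new columns now issues one `ALTER TABLE` query per added column:

  ```python
  import meerschaum as mrsm

  conn = mrsm.get_connector('sql:temp', flavor='sqlite', database=':memory:')
  pipe = mrsm.Pipe('a', 'b', instance=conn, columns={'datetime': 'dt'})
  pipe.sync([{'dt': '2023-01-01', 'x': 1}])

  # Adding 'y' and 'z' issues two separate ALTER TABLE queries on SQLite.
  pipe.sync([{'dt': '2023-01-02', 'x': 2, 'y': 'hello', 'z': 3.0}])
  ```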
v1.6.5 (v1.5.13 backported to Python 3.7)
- Allow pipes to sync DataFrame generators.
  If `pipe.sync()` receives a generator (of DataFrames, dictionaries, or lists), it will attempt to consume it and sync its chunks in parallel threads (this can be made single-threaded with `--workers 1`). For SQL pipes, the number of threads is capped at your configured pool size (default 5) minus the number of running threads. This means you may now return generators for large transactions, such as reading a large CSV:

  ```python
  from typing import Iterable
  import pandas as pd

  def fetch(pipe, **kw) -> Iterable['pd.DataFrame']:
      return pd.read_csv('data.csv', chunksize=1000)
  ```

  Any iterator of DataFrame-like chunks will work:

  ```python
  from typing import Any, Dict, Generator, List

  def fetch(pipe, **kw) -> Generator[List[Dict[str, Any]], None, None]:
      return (
          [
              {'id': 1, 'val': 10.0 * i},
              {'id': 2, 'val': 20.0 * i},
          ]
          for i in range(10)
      )
  ```
  This new behavior has been added to `SQLConnector.fetch()`, so you may now confidently sync very large tables between your databases.

  NOTE: The default `chunksize` for SQL queries has been lowered from 1,000,000 to 100,000. You may alter this value with `--chunksize` or by setting `MRSM{system:connectors:sql:chunksize}` (you can also edit the default pool size here).
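  For example, you can inspect these settings programmatically (a small sketch using the top-level config helpers):

  ```python
  import meerschaum as mrsm

  # Print the SQL connector settings, which include the default
  # chunksize and the pool configuration.
  mrsm.pprint(mrsm.get_config('system', 'connectors', 'sql'))
  ```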
- Fix edge case with SQL in-place syncs.
  Occasionally, a few docs would be duplicated when running in-place SQL syncs. This patch increases the fetch window size to mitigate the issue.
- Remove `begin` and `end` from `filter_existing()`.
  The keyword arguments were interfering with the determined datetime bounds, so this patch removes these flags (albeit `begin` was already ignored) to avoid confusion. Date bounds are solely determined from the contents of the DataFrame.
v1.6.4 (backported as v1.5.12 to Python 3.7)
- Allow for mixed UTC offsets in datetimes.
  UTC offsets are now applied to datetime values before timezone information is stripped, which should now reflect accurate values. This patch also fixes edge cases when different offsets are synced within the same transaction.

  ```python
  import meerschaum as mrsm

  pipe = mrsm.Pipe('a', 'b', columns={'datetime': 'dt'})
  pipe.sync([
      {'dt': '2023-01-01 00:00:00+00:00'},
      {'dt': '2023-01-02 00:00:00+01:00'},
  ])
  pipe.get_data().to_dict(orient='records')
  # [
  #     {'dt': Timestamp('2023-01-01 00:00:00')},
  #     {'dt': Timestamp('2023-01-01 23:00:00')}
  # ]
  ```
- Allow skipping datetime detection.
  The automatic datetime detection feature now respects a pipe's `dtypes`; columns that aren't of type `datetime64[ns]` will be ignored.

  ```python
  import meerschaum as mrsm

  pipe = mrsm.Pipe('a', 'b', dtypes={'not-date': 'str'})
  pipe.sync([
      {'not-date': '2023-01-01 00:00:00'}
  ])
  pipe.get_data().to_dict(orient='records')
  # [
  #     {'not-date': '2023-01-01 00:00:00'}
  # ]
  ```
- Added utility method `enforce_dtypes()`.
  The DataFrame data type enforcement logic of `pipe.enforce_dtypes()` has been exposed as `meerschaum.utils.misc.enforce_dtypes()`:

  ```python
  from meerschaum.utils.misc import enforce_dtypes
  import pandas as pd

  df = pd.DataFrame([{'a': '1'}])
  enforce_dtypes(df, {'a': 'Int64'}).dtypes
  # a    Int64
  # dtype: object
  ```
- Performance improvements.
  Some of the unnecessarily immutable transformations have been replaced with more memory- and compute-efficient in-place operations. Other small improvements like better caching should also speed things up.
- Removed noise from debug output.
  The virtual environment debug messages have been removed to make `--debug` easier to read.
- Better handle the inferred datetime index.
  The inferred datetime index feature may now be disabled by setting `datetime` to `None`. Improvements were made to better handle incorrectly identified indices.
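  For example, to opt out of datetime index inference (the pipe keys here are arbitrary):

  ```python
  import meerschaum as mrsm

  # Explicitly set the datetime index to None to skip inference.
  pipe = mrsm.Pipe('a', 'b', columns={'datetime': None})
  pipe.sync([{'id': 1, 'created': '2023-01-01 00:00:00'}])
  ```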
- Improve dynamic dtypes for SQLite.
  SQLite doesn't allow for modifying column types but is usually dynamic with data types. A few edge cases have been solved with a workaround for altering the table's definition.
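  A hypothetical sketch of the kind of edge case this addresses (an in-memory SQLite instance; the int-to-float promotion shown is an assumption, not a documented example):

  ```python
  import meerschaum as mrsm

  conn = mrsm.get_connector('sql:temp', flavor='sqlite', database=':memory:')
  pipe = mrsm.Pipe('a', 'b', instance=conn)
  pipe.sync([{'x': 1}])

  # Syncing floats into the integer column 'x' requires altering
  # the table's definition, which SQLite doesn't support directly.
  pipe.sync([{'x': 2.5}])
  ```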
v1.6.3
v1.6.2
- Virtual environment and `pip` tweaks.
  With upcoming changes to `pip` due to PEP 668, this patch sets the environment variable `PIP_BREAK_SYSTEM_PACKAGES` when executing `pip` internally. Note that all packages are installed within virtual environments except `uvicorn`, `gunicorn`, and those explicitly installed with a venv of `None`.
- Change how pipes are pretty-printed.
  Printing the attributes of a single pipe now highlights the keys in blue.
- Fix an issue with `bootstrap pipes` and plugins.
  When bootstrapping a pipe with a plugin connector, the plugin's virtual environment will now be activated while executing its `register()` function.
- Update dependencies.
  The minimum version of `duckdb` was bumped to `0.7.1`, `duckdb-engine` was bumped to `0.7.0`, and `pip` was lowered to `22.0.4` to accept older versions. Additionally, `pandas==2.0.0rc1` was tested and confirmed to work, so version 1.7.x of Meerschaum will likely require `pandas` 2.0+ to make use of its PyArrow backend.
v1.6.1: SQLAlchemy 2.0, drop Python 3.7, and more!
v1.6.0 – v1.6.1
Breaking Changes
- Dropped Python 3.7 support.
  The latest `pandas` requires 3.8+, so to use Pandas 1.5.x, we have to finally drop Python 3.7.
- Upgrade SQLAlchemy to 2.0.5+.
  This includes better transaction handling with connections. Other packages which use SQLAlchemy may not yet support 2.0+.
- Removed `MQTTConnector`.
  This was one of the original connectors but was never tested or used in production. It may be reintroduced via a future `mqtt` plugin.
Bugfixes and Improvements
- Stop execution when improper command-line arguments are passed in.
  Incorrect command-line arguments will now return an error. The previous behavior was to strip the flags and execute the action anyway, which was undesirable.

  ```
  $ mrsm show pipes -c

  💢 Invalid arguments:
  show pipes -c

  🛑 argument -c/-C/--connector-keys: expected at least one argument
  ```
- Allow `bootstrap connector` to create custom connectors.
  The `bootstrap connector` wizard can now handle registering custom connectors. It uses the `REQUIRED_ATTRIBUTES` list set in the custom connector class when determining what to ask for.
- Allow custom connectors to omit `__init__()`.
  If a connector is created via `@make_connector` and doesn't have an `__init__()` function, the base one is used to create the connector with the correct type (derived from the class name) and to verify the `REQUIRED_ATTRIBUTES` values if present.
- Infer a connector's `type` from its class name.
  The `type` of a connector is now determined from its class name (e.g. `FooConnector` would have a type of `foo`). When inheriting from `Connector`, it is no longer required to explicitly pass the type before the label. For backwards compatibility, the legacy method still behaves as expected.

  ```python
  from meerschaum.connectors import (
      Connector,
      make_connector,
      get_connector,
  )

  @make_connector
  class FooConnector(Connector):
      REQUIRED_ATTRIBUTES = ['username', 'password']

  conn = get_connector(
      'foo',
      username = 'abc',
      password = 'def',
  )
  ```
- Allow connectors to omit a `label`.
  The default label `main` will be used if `label` is omitted.
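  For instance, continuing the hypothetical `FooConnector` defined above:

  ```python
  from meerschaum.connectors import get_connector

  # With no label given, the connector registers under 'foo:main'.
  conn = get_connector('foo', username='abc', password='def')
  print(conn.label)
  # 'main'
  ```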
- Add `meta` keys to connectors.
  Like pipes, the `meta` property of a connector returns a dictionary with the kwargs needed to reconstruct the connector.

  ```python
  import meerschaum as mrsm

  conn = mrsm.get_connector('sql:temp', flavor='sqlite', database=':memory:')
  print(conn.meta)
  # {'type': 'sql', 'label': 'temp', 'database': ':memory:', 'flavor': 'sqlite'}
  ```
- Remove `NUL` bytes when inserting into PostgreSQL.
  PostgreSQL doesn't support `NUL` bytes in text (`'\0'`), so these characters are removed from strings when copying into a table.
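  A hypothetical illustration (assumes a PostgreSQL instance configured as `sql:main`):

  ```python
  import meerschaum as mrsm

  pipe = mrsm.Pipe('a', 'b', instance='sql:main')
  # The NUL byte is stripped before the COPY, so 'hello\0world'
  # is stored as 'helloworld'.
  pipe.sync([{'msg': 'hello\0world'}])
  ```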
- Cache `pipe.exists()` for 5 seconds.
  Repeated calls to `pipe.exists()` will be sped up due to short-term caching. This cache is invalidated when syncing or dropping a pipe.
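  For example (the pipe keys here are arbitrary):

  ```python
  import meerschaum as mrsm

  pipe = mrsm.Pipe('a', 'b')
  pipe.exists()   # Queries the instance.
  pipe.exists()   # Served from the 5-second cache.
  pipe.drop()     # Invalidates the cache.
  ```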
- Fix an edge case with subprocesses in headless environments.
  Checks were added to subprocesses to prevent using interactive features when no such features may be available (e.g. `termios`).
- Added `pprint()`, `get_config()`, and `attempt_import()` to the top-level namespace.
  Frequently used functions `pprint()`, `get_config()`, and `attempt_import()` have been promoted to the root level of the `meerschaum` namespace, i.e.:

  ```python
  import meerschaum as mrsm

  mrsm.pprint(mrsm.get_config('meerschaum'))
  sqlalchemy = mrsm.attempt_import('sqlalchemy')
  ```
- Fix CLI for MSSQL.
  The interactive CLI has been fixed for Microsoft SQL Server.