
Releases: bmeares/Meerschaum

v1.6.11 (v1.5.19 for Python 3.7)

22 May 14:19 · 4ee23a1

  • Fix an issue with in-place syncing.
    When syncing a SQL pipe in-place with a backtrack interval, the interval is applied to the existing data stage to avoid inserting duplicate rows.
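
    A minimal sketch of the idea (the variable names below are illustrative, not Meerschaum internals): the backtrack interval widens the window of existing rows consulted before inserting, so recently synced rows are recognized and skipped.

    import datetime
    
    # Hypothetical values for illustration only.
    sync_begin = datetime.datetime(2023, 5, 1, 12, 0)
    backtrack = datetime.timedelta(minutes=10)
    
    # The existing-data stage now begins `backtrack` earlier,
    # so rows already present are seen and not re-inserted.
    existing_begin = sync_begin - backtrack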

v1.6.9 — v1.6.10

14 May 03:36 · 45f5c2c

  • Improve thread safety checks.
    Added checks for IS_THREAD_SAFE to connectors to determine whether to use multithreading.
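
    A hedged sketch of the pattern (only the IS_THREAD_SAFE attribute name comes from these notes; the gating logic is illustrative):

    from multiprocessing.pool import ThreadPool
    
    class ExampleConnector:
        IS_THREAD_SAFE = False  # e.g. connections that cannot be shared across threads
    
    def process_chunks(connector, chunks, handle_chunk):
        # Fan out across threads only when the connector declares it safe.
        if getattr(connector, 'IS_THREAD_SAFE', False):
            with ThreadPool() as pool:
                return pool.map(handle_chunk, chunks)
        return [handle_chunk(chunk) for chunk in chunks]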

  • Fix an issue with custom flags while syncing.
    This patch includes better handling of custom flags added from plugins during the syncing process.

v1.6.8 Ultimate Generators Release!

12 May 06:43 · 432b511

  • Added as_iterator to Pipe.get_data().
    Passing as_iterator=True (or as_chunks=True) to Pipe.get_data() returns a generator which yields chunks of Pandas DataFrames.

    Each DataFrame is the result of a Pipe.get_data() call with intermediate datetime bounds between begin and end, each span of size chunk_interval (defaulting to datetime.timedelta(days=1) for time-series indices and 100,000 IDs for integer indices).

    import meerschaum as mrsm
    
    pipe = mrsm.Pipe(
        'a', 'b',
        columns={'datetime': 'id'},
        dtypes={'id': 'Int64'},
    )
    pipe.sync([
        {'id': 0, 'color': 'red'},
        {'id': 1, 'color': 'blue'},
        {'id': 2, 'color': 'green'},
        {'id': 3, 'color': 'yellow'},
    ])
    
    ### NOTE: due to non-inclusive end bounding,
    ###       chunks sometimes contain
    ###       (chunk_interval - 1) rows.
    chunks = pipe.get_data(
        chunk_interval = 2,
        as_iterator = True,
    )
    for chunk in chunks:
        print(chunk)
    
    #    id color
    # 0   0   red
    # 1   1  blue
    #    id   color
    # 0   2   green
    # 1   3  yellow
  • Add server-side cursor support to SQLConnector.read().
    If chunk_hook is provided, keep an open cursor and stream the chunks one-at-a-time. This allows for processing very large out-of-memory data sets.

    To return the results of the chunk_hook callable rather than a DataFrame, pass as_hook_results=True to receive a list of values.

    If as_iterator is provided or chunksize is None, then SQLConnector.read() reverts to the default client-side cursor implementation (which loads the entire result set into memory).

    import meerschaum as mrsm
    conn = mrsm.get_connector()
    
    def process_chunk(df: 'pd.DataFrame', **kw) -> int:
        return len(df)
    
    results = conn.read(
        "very_large_table",
        chunk_hook = process_chunk,
        as_hook_results = True,
        chunksize = 100,
    )
    
    results[:2]
    # [100, 100]
  • Remove --sync-chunks and set its behavior as default.
    Due to the above changes to SQLConnector.read(), sync_chunks now defaults to True in Pipe.sync(). You may disable this behavior with --chunksize 0.
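
    From Python, the corresponding keyword is sync_chunks (a sketch; it assumes a pipe whose fetch parameters are already defined):

    import meerschaum as mrsm
    
    pipe = mrsm.Pipe('sql:main', 'large_table', columns={'datetime': 'dt'})
    pipe.sync(sync_chunks=True)  # explicit here, but now the default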

v1.6.7

11 May 18:26 · c11e6ae

  • Improve memory usage when syncing generators.
    To more lazily sync chunks from generators, pool.map() has been replaced with pool.imap().
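
    The difference in a nutshell: map() exhausts the generator before any work begins, while imap() pulls chunks lazily. A standard-library illustration:

    from multiprocessing.pool import ThreadPool
    
    def chunks():
        for i in range(1000):
            yield list(range(i, i + 10))  # stand-in for a DataFrame chunk
    
    with ThreadPool(4) as pool:
        # imap() yields results as chunks are consumed, so only the
        # in-flight chunks are held in memory at any one time.
        for length in pool.imap(len, chunks()):
            pass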

v1.6.6 (v1.5.14 backported to Python 3.7)

09 May 03:02 · 35ff219

  • Issue one ALTER TABLE query per column for SQLite, MSSQL, DuckDB, and Oracle SQL.
    SQLite and other flavors do not support multiple columns in a single ALTER TABLE query, so one query is now issued per column. This patch also adds a specific test for this scenario.
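
    For illustration, one statement is built per column rather than a single combined query (the table and column names here are hypothetical):

    new_columns = {'color': 'TEXT', 'score': 'REAL'}
    queries = [
        f'ALTER TABLE "my_table" ADD COLUMN "{col}" {typ}'
        for col, typ in new_columns.items()
    ]
    # ['ALTER TABLE "my_table" ADD COLUMN "color" TEXT',
    #  'ALTER TABLE "my_table" ADD COLUMN "score" REAL']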

v1.6.5 (v1.5.13 backported to Python 3.7)

05 May 04:51 · 8285898

  • Allow pipes to sync DataFrame generators.
    If pipe.sync() receives a generator (of DataFrames, dictionaries, or lists), it will attempt to consume it and sync its chunks in parallel threads (this can be made single-threaded with --workers 1). For SQL pipes, the number of threads is capped at your configured pool size (default 5) minus the number of running threads.

    This means you may now return generators for large transactions, such as reading a large CSV:

    import pandas as pd

    def fetch(pipe, **kw) -> 'Iterable[pd.DataFrame]':
        return pd.read_csv('data.csv', chunksize=1000)

    Any iterator of DataFrame-like chunks will work:

    from typing import Any, Dict, Iterator, List

    def fetch(pipe, **kw) -> Iterator[List[Dict[str, Any]]]:
        return (
            [
                {'id': 1, 'val': 10.0 * i},
                {'id': 2, 'val': 20.0 * i},
            ] for i in range(10)
        )

    This new behavior has been added to SQLConnector.fetch() so you may now confidently sync very large tables between your databases.

    NOTE: The default chunksize for SQL queries has been lowered from 1,000,000 to 100,000. You may alter this value with --chunksize or by setting the value under MRSM{system:connectors:sql:chunksize} (you can also edit the default pool size here).
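
    You can inspect the configured default from Python:

    import meerschaum as mrsm
    
    print(mrsm.get_config('system', 'connectors', 'sql', 'chunksize'))
    # 100000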

  • Fix edge case with SQL in-place syncs.
    Occasionally, a few docs would be duplicated when running in-place SQL syncs. This patch increases the fetch window size to mitigate the issue.

  • Remove begin and end from filter_existing().
    The keyword arguments were interfering with the determined datetime bounds, so this patch removes these flags (though begin was already ignored) to avoid confusion. Datetime bounds are determined solely from the contents of the DataFrame.
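
    In effect, the bounds are now derived from the chunk itself, along these lines (a simplified sketch):

    import pandas as pd
    
    df = pd.DataFrame({'dt': pd.to_datetime(['2023-01-01', '2023-01-05'])})
    
    # The window of existing data to compare against comes from
    # the DataFrame's own minimum and maximum datetimes.
    begin, end = df['dt'].min(), df['dt'].max()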

v1.6.4 (backported as v1.5.12 to Python 3.7)

26 Apr 08:15 · 18254b0

  • Allow for mixed UTC offsets in datetimes.
    UTC offsets are now applied to datetime values before timezone information is stripped, so the stored values are accurate. This patch also fixes edge cases when different offsets are synced within the same transaction.

    import meerschaum as mrsm
    pipe = mrsm.Pipe('a', 'b', columns={'datetime': 'dt'})
    pipe.sync([
        {'dt': '2023-01-01 00:00:00+00:00'},
        {'dt': '2023-01-02 00:00:00+01:00'},
    ])
    pipe.get_data().to_dict(orient='records')
    # [
    #     {'dt': Timestamp('2023-01-01 00:00:00')},
    #     {'dt': Timestamp('2023-01-01 23:00:00')}
    # ]
  • Allow skipping datetime detection.
    The automatic datetime detection feature now respects a pipe's dtypes; columns that aren't of type datetime64[ns] will be ignored.

    import meerschaum as mrsm
    pipe = mrsm.Pipe('a', 'b', dtypes={'not-date': 'str'})
    pipe.sync([
        {'not-date': '2023-01-01 00:00:00'}
    ])
    pipe.get_data().to_dict(orient='records')
    # [
    #     {'not-date': '2023-01-01 00:00:00'}
    # ]
  • Added utility method enforce_dtypes().
    The DataFrame data type enforcement logic of pipe.enforce_dtypes() has been exposed as meerschaum.utils.misc.enforce_dtypes():

    from meerschaum.utils.misc import enforce_dtypes
    import pandas as pd
    df = pd.DataFrame([{'a': '1'}])
    enforce_dtypes(df, {'a': 'Int64'}).dtypes
    # a    Int64
    # dtype: object
  • Performance improvements.
    Some unnecessarily immutable transformations have been replaced with more memory- and compute-efficient in-place operations. Other small improvements, like better caching, should also speed things up.

  • Removed noise from debug output.
    The virtual environment debug messages have been removed to make --debug easier to read.

  • Better handle inferred datetime index.
    The inferred datetime index feature may now be disabled by setting datetime to None. Improvements were also made to better handle incorrectly identified indices.
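
    To opt out of the inferred index, set the datetime column to None explicitly:

    import meerschaum as mrsm
    
    # No datetime axis will be inferred for this pipe.
    pipe = mrsm.Pipe('a', 'b', columns={'datetime': None})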

  • Improve dynamic dtypes for SQLite.
    SQLite doesn't allow modifying column types but is usually dynamic with data types. A few edge cases have been solved with a workaround that alters the table's definition.

v1.6.3

28 Mar 22:15 · 5e1ea5f

  • Fixed an issue with background jobs.
    A change that had broken daemon functionality has been reverted.

v1.6.2

20 Mar 03:24 · 7e33297

  • Virtual environment and pip tweaks.
    With upcoming changes to pip due to PEP 668, this patch sets the environment variable PIP_BREAK_SYSTEM_PACKAGES when executing pip internally. Note that all packages are installed within virtual environments except uvicorn, gunicorn, and those explicitly installed with a venv of None.
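
    In essence (a simplified sketch, not Meerschaum's actual call site):

    import os
    import subprocess
    import sys
    
    # Set the PEP 668 escape hatch before invoking pip in a subprocess.
    env = dict(os.environ, PIP_BREAK_SYSTEM_PACKAGES='1')
    subprocess.run([sys.executable, '-m', 'pip', '--version'], env=env)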

  • Change how pipes are pretty-printed.
    Printing the attributes of a single pipe now highlights the keys in blue.

  • Fix an issue with bootstrap pipes and plugins.
    When bootstrapping a pipe with a plugin connector, the plugin's virtual environment will now be activated while executing its register() function.

  • Update dependencies.
    The minimum version of duckdb was bumped to 0.7.1, duckdb-engine was bumped to 0.7.0, and pip was lowered to 22.0.4 to accept older versions. Additionally, pandas==2.0.0rc1 was tested and confirmed to work, so version 1.7.x of Meerschaum will likely require 2.0+ of pandas to make use of its PyArrow backend.

v1.6.1: SQLAlchemy 2.0, drop Python 3.7, and more!

13 Mar 06:00 · eb73a50

v1.6.0 – v1.6.1

Breaking Changes

  • Dropped Python 3.7 support.
    The latest pandas requires Python 3.8+, so to use pandas 1.5.x, we must finally drop Python 3.7.

  • Upgrade SQLAlchemy to 2.0.5+.
    This includes better transaction handling with connections. Other packages which use SQLAlchemy may not yet support 2.0+.

  • Removed MQTTConnector.
    This was one of the original connectors but was never tested or used in production. It may be reintroduced via a future mqtt plugin.

Bugfixes and Improvements

  • Stop execution when improper command-line arguments are passed in.
    Incorrect command-line arguments will now return an error. The previous behavior was to strip the flags and execute the action anyway, which was undesirable.

    $ mrsm show pipes -c
    
     💢 Invalid arguments:
      show pipes -c
    
       🛑 argument -c/-C/--connector-keys: expected at least one argument
  • Allow bootstrap connector to create custom connectors.
    The bootstrap connector wizard can now handle registering custom connectors. It uses the REQUIRED_ATTRIBUTES list set in the custom connector class when determining what to ask for.

  • Allow custom connectors to omit __init__().
    If a connector is created via @make_connector and doesn't have an __init__() function, the base one is used to create the connector with the correct type (derived from the class name) and verify the REQUIRED_ATTRIBUTES values if present.

  • Infer a connector's type from its class name.
    The type of a connector is now determined from its class name (e.g. FooConnector would have a type foo). When inheriting from Connector, it is no longer required to explicitly pass the type before the label. For backwards compatibility, the legacy method still behaves as expected.

    from meerschaum.connectors import (
        Connector,
        make_connector,
        get_connector,
    )
    
    @make_connector
    class FooConnector(Connector):
        REQUIRED_ATTRIBUTES = ['username', 'password']
    
    conn = get_connector(
        'foo',
        username = 'abc',
        password = 'def',
    )
  • Allow connectors to omit a label.
    The default label main will be used if label is omitted.
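
    For example (assuming a configured sql:main connector):

    import meerschaum as mrsm
    
    conn = mrsm.get_connector('sql')  # no label given
    conn.label
    # 'main'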

  • Add meta keys to connectors.
    Like pipes, the meta property of a connector returns a dictionary with the kwargs needed to reconstruct the connector.

    import meerschaum as mrsm

    conn = mrsm.get_connector('sql:temp', flavor='sqlite', database=':memory:')
    print(conn.meta)
    # {'type': 'sql', 'label': 'temp', 'database': ':memory:', 'flavor': 'sqlite'}
  • Remove NUL bytes when inserting into PostgreSQL.
    PostgreSQL doesn't support NUL bytes in text ('\0'), so these characters are removed from strings when copying into a table.
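
    Conceptually, this is equivalent to stripping the bytes before the copy:

    value = 'abc\0def'
    cleaned = value.replace('\0', '')  # PostgreSQL TEXT cannot store NUL bytes
    # 'abcdef'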

  • Cache pipe.exists() for 5 seconds.
    Repeated calls to pipe.exists() will be sped up due to short-term caching. This cache is invalidated when syncing or dropping a pipe.
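
    For example:

    import meerschaum as mrsm
    
    pipe = mrsm.Pipe('a', 'b')
    pipe.exists()  # queries the instance
    pipe.exists()  # within 5 seconds: served from the cache
    pipe.drop()    # invalidates the cache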

  • Fix an edge case with subprocesses in headless environments.
    Checks were added to subprocesses to prevent using interactive features when no such features may be available (e.g. termios).

  • Added pprint(), get_config(), and attempt_import() to the top-level namespace.
    Frequently used functions pprint(), get_config(), and attempt_import() have been promoted to the root level of the meerschaum namespace, i.e.:

    import meerschaum as mrsm
    mrsm.pprint(mrsm.get_config('meerschaum'))
    
    sqlalchemy = mrsm.attempt_import('sqlalchemy')
  • Fix CLI for MSSQL.
    The interactive CLI has been fixed for Microsoft SQL Server.