Releases: bmeares/Meerschaum

🔐 v2.8.4 Improve API allowed instance keys settings.

04 Feb 01:53
229b9ab

v2.8.4

  • Allow for pattern matching in allowed_instance_keys.
    You may now generalize the instances exposed by the API by using Unix-style patterns in the list system:api:permissions:instances:allowed_instance_keys:

    {
      "api": {
        "permissions": {
          "instances": {
            "allowed_instance_keys": [
              "valkey:*",
              "*_dev"
            ]
          }
        }
      }
    }
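
    A minimal sketch of how such patterns match instance keys, assuming Unix-style matching as provided by Python's fnmatch (the keys below are hypothetical):

    from fnmatch import fnmatch
    
    allowed_patterns = ['valkey:*', '*_dev']
    instance_keys = ['valkey:main', 'sql:main', 'sql:etl_dev']
    
    # Keep only the instance keys which match at least one allowed pattern.
    print([
        keys
        for keys in instance_keys
        if any(fnmatch(keys, pattern) for pattern in allowed_patterns)
    ])
    # ['valkey:main', 'sql:etl_dev']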
  • Return pipe attributes for the route /pipes/{connector}/{metric}/{location}.
The API routes /pipes/{connector}/{metric}/{location} and /pipes/{connector}/{metric}/{location}/attributes now both return pipe attributes.

  • Check entire batches for verify rowcounts.
    The command verify rowcounts will now check batch boundaries before checking row-counts for individual chunks. This should moderately increase performance.

  • Kill orphaned child processes when the parent job is killed.
    Jobs created with pipeline arguments should now kill associated child processes.

  • Add --skip-hooks.
    The flag --skip-hooks prevents any sync hooks from firing when syncing pipes.
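
    For example, a one-off sync which bypasses any registered hooks:

    mrsm sync pipes --skip-hooks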

  • Remove datetime rounding from parse_schedule().
Scheduled actions now behave as expected: the current timestamp is no longer rounded to the nearest minute, which was causing issues with the starting in delay feature.

  • Fix allowed_instance_keys enforcement.

🧹 v2.8.3 Clean up API endpoints.

23 Jan 17:48
695e643

v2.8.3

  • Increase username limit to 60 characters.
  • Add chunk retries to Pipe.verify().
  • Add instance keys to remaining pipes endpoints.
  • Misc bugfixes.

⚡️ v2.8.2 Add batches to `verify pipes`, allow for multiple instances from WebAPI, and memory improvements.

17 Jan 04:23
0b0a9ab

v2.8.0 – v2.8.2

  • Add batches to Pipe.verify().
    Verification syncs now run in sequential batches so that they may be interrupted and resumed. See Pipe.get_chunk_bounds_batches() for more information:

    from datetime import timedelta
    import meerschaum as mrsm
    
    pipe = mrsm.Pipe('demo', 'get_chunk_bounds', instance='sql:local')
    bounds = pipe.get_chunk_bounds(
        chunk_interval=timedelta(hours=10),
        begin='2025-01-10',
        end='2025-01-15',
        bounded=True,
    )
    batches = pipe.get_chunk_bounds_batches(bounds, workers=4)
    mrsm.pprint(
        [
            tuple(
                (str(bounds[0]), str(bounds[1]))
                for bounds in batch
            )
            for batch in batches
        ]
    ) 
    # [
    #     (
    #         ('2025-01-10 00:00:00+00:00', '2025-01-10 10:00:00+00:00'),
    #         ('2025-01-10 10:00:00+00:00', '2025-01-10 20:00:00+00:00'),
    #         ('2025-01-10 20:00:00+00:00', '2025-01-11 06:00:00+00:00'),
    #         ('2025-01-11 06:00:00+00:00', '2025-01-11 16:00:00+00:00')
    #     ),
    #     (
    #         ('2025-01-11 16:00:00+00:00', '2025-01-12 02:00:00+00:00'),
    #         ('2025-01-12 02:00:00+00:00', '2025-01-12 12:00:00+00:00'),
    #         ('2025-01-12 12:00:00+00:00', '2025-01-12 22:00:00+00:00'),
    #         ('2025-01-12 22:00:00+00:00', '2025-01-13 08:00:00+00:00')
    #     ),
    #     (
    #         ('2025-01-13 08:00:00+00:00', '2025-01-13 18:00:00+00:00'),
    #         ('2025-01-13 18:00:00+00:00', '2025-01-14 04:00:00+00:00'),
    #         ('2025-01-14 04:00:00+00:00', '2025-01-14 14:00:00+00:00'),
    #         ('2025-01-14 14:00:00+00:00', '2025-01-15 00:00:00+00:00')
    #     )
    # ]
  • Add --skip-chunks-with-greater-rowcounts to verify pipes.
The flag --skip-chunks-with-greater-rowcounts compares a chunk's row-count with the row-count of the remote table and skips the chunk if its count is greater than or equal to the remote count. This is only applicable to connectors which implement remote=True support for get_sync_time().
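
    For example:

    mrsm verify pipes --skip-chunks-with-greater-rowcounts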

  • Add verify rowcounts.
The action verify rowcounts (the same as passing --check-rowcounts-only to verify pipes) compares row-counts for a pipe's chunks against remote row-counts. This is only applicable to connectors which implement get_pipe_rowcount() with support for remote=True.
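
    The following two invocations are equivalent:

    mrsm verify rowcounts
    mrsm verify pipes --check-rowcounts-only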

  • Add remote to pipe.get_sync_time().
For pipes which support it (i.e. the SQLConnector), the option remote returns the sync time of a pipe's fetch definition, analogous to the remote option of Pipe.get_rowcount().
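
    A short sketch with a hypothetical pipe on a SQL instance:

    import meerschaum as mrsm
    
    pipe = mrsm.Pipe('sql:main', 'demo', instance='sql:local')
    print(pipe.get_sync_time())             # newest timestamp in the pipe's table
    print(pipe.get_sync_time(remote=True))  # newest timestamp in the fetch definition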

  • Allow for the Web API to serve pipes from multiple instances.
You can disable this behavior by setting system:api:permissions:instances:allow_multiple_instances to false. You may also explicitly control which instances the Web API may access by setting the list system:api:permissions:instances:allowed_instance_keys (defaults to ["*"]).
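
    For example, to expose a single instance (hypothetical keys) and disallow all others:

    {
      "api": {
        "permissions": {
          "instances": {
            "allow_multiple_instances": false,
            "allowed_instance_keys": ["sql:main"]
          }
        }
      }
    }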

  • Fix memory leak for retrying failed chunks.
Failed chunks were kept in memory and retried later. In resource-intensive syncs with large chunks and high failure rates, these large objects were never freed and hogged memory; this leak has been fixed.

  • Add negation to job actions.
Prefix a job name with an underscore to select all other jobs. This is useful for filtering out noise from show logs.
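
    For example, to show logs for every job except a job named backfill (hypothetical name):

    mrsm show logs _backfill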

  • Add Pipe.parent.
    As a quality-of-life improvement, the attribute Pipe.parent will return the first member of Pipe.parents (if available).
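
    A minimal sketch (using a hypothetical pipe):

    import meerschaum as mrsm
    
    pipe = mrsm.Pipe('demo', 'child', instance='sql:local')
    if pipe.parents:
        # `Pipe.parent` is shorthand for `Pipe.parents[0]`.
        print(pipe.parent)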

  • Use the current instance for new tabs in the Webterm.
    Clicking "New Tab" will open a new tmux window using the currently selected instance on the Web Console.

  • Other webterm quality-of-life improvements.
Added a size toggle button to allow the Webterm to fill the entire page.

  • Additional refactoring work.
    The API endpoints code has been cleaned up.

  • Added system configurations.
New options have been added to the system configuration, such as max_response_row_limit, allow_multiple_instances, and allowed_instance_keys.

🚸 v2.7.10 Add persistent webterms, limit concurrency for verify pipes.

12 Jan 20:58
39b41eb

v2.7.9 – v2.7.10

  • Add persistent Webterm sessions.
    On the Web Console, the Webterm will attach to a persistent terminal for the current session's user.

  • Reconnect Webterms after client disconnect.
    If a Webterm socket connection is broken, the client logic will attempt to reconnect and attach to the tmux session.

  • Add tmux sessions to Webterms.
    Webterm sessions now connect to tmux sessions (tied to the user accounts).
    Set system:webterm:tmux:enabled to false to disable tmux sessions.

  • Limit concurrent connections during verify pipes.
To keep from exhausting the SQL connection pool, the number of concurrent intra-chunk connections is now limited.

  • Return the precision and scale from a table's columns and types.
    Reading a table's columns and types with meerschaum.utils.sql.get_table_columns_types() now returns the precision and scale for NUMERIC (DECIMAL) columns.

⚡️ v2.7.8 Memory improvements, add precision and scale support to numerics.

11 Jan 03:55
f6bb6f1

v2.7.8

  • Add support for user-supplied precision and scale for numeric columns.
    You may now manually specify a numeric column's precision and scale:

    import meerschaum as mrsm
    
    pipe = mrsm.Pipe(
        'demo', 'numeric', 'precision_scale',
        instance='sql:local',
        dtypes={'val': 'numeric[5,2]'},
    )
    pipe.sync([{'val': '123.456'}])
    print(pipe.get_data())
    #       val
    # 0  123.46
  • Serialize numeric columns to exact values during bulk inserts.
    Decimal values are serialized when inserting into NUMERIC columns during bulk inserts.

  • Return a generator when fetching with SQLConnector.
To alleviate memory pressure, the entire dataframe is no longer loaded into memory when fetching.

  • Add json_serialize_value() to handle custom dtypes.
    When serializing documents, pass json_serialize_value as the default handler:

    import json
    from decimal import Decimal
    from datetime import datetime, timezone
    from meerschaum.utils.dtypes import json_serialize_value
    
    print(json.dumps(
        {
            'bytes': b'hello, world!',
            'decimal': Decimal('1.000000001'),
            'datetime': datetime(2025, 1, 1, tzinfo=timezone.utc),
        },
        default=json_serialize_value,
        indent=4,
    ))
    # {
    #     "bytes": "aGVsbG8sIHdvcmxkIQ==",
    #     "decimal": "1.000000001",
    #     "datetime": "2025-01-01T00:00:00+00:00"
    # }
  • Fix an issue with the WITH keyword in pipe definitions for MSSQL.
Previously, pipe definitions which used the keyword WITH but not as a CTE (e.g. to specify an index) were incorrectly parsed.

⚡️ v2.7.7 Index performance improvements, add drop indices and index pipes, and more.

09 Jan 00:46
79c48a0

v2.7.7

  • Add actions drop indices and index pipes.
    You may now drop and create indices on pipes with the actions drop indices and index pipes or the pipe methods drop_indices() and create_indices():

    import meerschaum as mrsm
    
    pipe = mrsm.Pipe('demo', 'drop_indices', columns=['id'], instance='sql:local')
    pipe.sync([{'id': 1}])
    print(pipe.get_columns_indices())
    # {'id': [{'name': 'IX_demo_drop_indices_id', 'type': 'INDEX'}]}
    
    pipe.drop_indices()
    print(pipe.get_columns_indices())
    # {}
    
    pipe.create_indices()
    print(pipe.get_columns_indices())
    # {'id': [{'name': 'IX_demo_drop_indices_id', 'type': 'INDEX'}]}
Remove CAST() to datetime when selecting from a pipe's definition.
    For some databases, casting to the same dtype causes the query optimizer to ignore the datetime index.

  • Add INCLUDE clause to datetime index for MSSQL.
    This is to coax the query optimizer into using the datetime axis.

  • Remove redundant unique index.
    The two competing unique indices have been combined into a single index (for the key unique). The unique constraint (when upsert is true) shares the name but has the prefix UQ_ in place of IX_.

  • Add pipe parameter null_indices.
    Set the pipe parameter null_indices to False for a performance improvement in situations where null index values are not expected.
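
    For example (hypothetical pipe and column names):

    import meerschaum as mrsm
    
    pipe = mrsm.Pipe(
        'demo', 'no_nulls',
        instance='sql:local',
        columns={'datetime': 'ts', 'id': 'station'},
        null_indices=False,  # index values are never expected to be null
    )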

  • Apply backtrack minutes when fetching integer datetimes.
Backtrack minutes are now applied to pipes with integer datetime axes.

🔧 v2.7.6 Make temporary table names configurable.

07 Jan 22:32
ff6bbbe

v2.7.6

  • Make temporary table names configurable.
The names of temporary SQL tables may be configured in MRSM{system:connectors:sql:instance:temporary_target}. The new default prefix is '_', and the new default transaction ID length is 4. The name components have been re-ordered to target, transaction ID, then label.
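
    You may inspect or change these values with the config editor, e.g.:

    mrsm edit config system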

  • Add connector completions to copy pipes.
    When copying pipes, the connector keys prompt will offer auto-complete suggestions.

  • Fix stale job results.
    When polling for job results, the job result is dropped from in-memory cache to avoid overwriting the on-disk result.

  • Format row counts and seconds into human-friendly text.
    Row counts and sync durations are now formatted into human-friendly representations.

  • Add digits to generate_password().
    Random strings from meerschaum.utils.misc.generate_password() may now contain digits.
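
    A quick sketch (the length argument below is an assumption):

    from meerschaum.utils.misc import generate_password
    
    # Random string which may now include digits.
    print(generate_password(12))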

✅ v2.7.5 Enforce TZ-aware columns as UTC, add dynamic queries.

30 Dec 03:44
e350002

v2.7.3 – v2.7.5

  • Allow for dynamic targets in SQL queries.
    Include a pipe definition in double curly braces (à la Jinja) to substitute a pipe's target into a templated query.

    import meerschaum as mrsm
    
    pipe = mrsm.Pipe('demo', 'template', target='foo', instance='sql:local')
    _ = pipe.register()
    
    downstream_pipe = mrsm.Pipe(
        'sql:local', 'template',
        instance='sql:local',
        parameters={
            'sql': "SELECT *\nFROM {{Pipe('demo', 'template', instance='sql:local')}}"
        },
    )
    
    conn = mrsm.get_connector('sql:local')
    print(conn.get_pipe_metadef(downstream_pipe))
    # WITH "definition" AS (
    #     SELECT *
    #     FROM "foo"
    # )
    # SELECT *
    # FROM "definition"
  • Add --skip-enforce-dtypes.
    To override a pipe's enforce parameter, pass --skip-enforce-dtypes to a sync.
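
    For example:

    mrsm sync pipes --skip-enforce-dtypes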

  • Add bulk inserts for MSSQL.
    To disable this behavior, set system:connectors:sql:bulk_insert:mssql to false. Bulk inserts for PostgreSQL-like flavors may now be disabled as well.

  • Fix altering multiple column types for MSSQL.
    When a table has multiple columns to be altered, each column will have its own ALTER TABLE query.

  • Skip enforcing custom dtypes when enforce=False.
    To avoid confusion, special Meerschaum data types (numeric, json, etc.) are not coerced into objects when enforce=False.

  • Fix timezone-aware casts.
A bug has been fixed where timezone-aware and timezone-naive casts could be mixed in a single query.

  • Explicitly cast timezone-aware datetimes as UTC in SQL syncs.
    By default, timezone-aware columns are now cast as time zone UTC in SQL. This may be skipped by setting enforce to False.

  • Added virtual environment inter-process locks.
    Competing processes now cooperate for virtual environment verification, which protects installed packages.

✨ v2.7.2 Add bytes, enforce, allow autoincrementing datetime index, improve MSSQL indices.

27 Dec 21:07
60e985a

v2.7.0 – v2.7.2

  • Introduce the bytes data type.
    Instance connectors which support binary data (e.g. SQLConnector) may now take advantage of the bytes dtype. Other connectors (e.g. ValkeyConnector) may use meerschaum.utils.dtypes.serialize_bytes() to store binary data as a base64-encoded string.

    import meerschaum as mrsm
    
    pipe = mrsm.Pipe(
        'demo', 'bytes',
        instance='sql:memory',
        dtypes={'blob': 'bytes'},
    )
    pipe.sync([
        {'blob': b'hello, world!'},
    ])
    
    df = pipe.get_data()
    binary_data = df['blob'][0]
    print(binary_data.decode('utf-8'))
    # hello, world!
    
    from meerschaum.utils.dtypes import serialize_bytes, attempt_cast_to_bytes
    df['encoded'] = df['blob'].apply(serialize_bytes)
    df['decoded'] = df['encoded'].apply(attempt_cast_to_bytes)
    print(df)
    #                blob               encoded           decoded
    # 0  b'hello, world!'  aGVsbG8sIHdvcmxkIQ==  b'hello, world!'
  • Allow for pipes to use the same column for datetime, primary, and autoincrement=True.
    Pipes may now use the same column as the datetime axis and primary with autoincrement set to True.

    pipe = mrsm.Pipe(
        'demo', 'datetime_primary_key', 'autoincrement',
        instance='sql:local',
        columns={
            'datetime': 'Id',
            'primary': 'Id',
        },
        autoincrement=True,
    )
  • Only join on primary when present.
When the index primary is set, it is used as the primary joining index. This improves performance when syncing tables with a primary key.

  • Add the parameter enforce.
The parameter enforce (default True) toggles data type enforcement behavior. When enforce is False, incoming data will not be cast to the desired data types. For static datasets where the incoming data is always expected to be of the correct dtypes, it is recommended to set enforce to False and static to True.

    from decimal import Decimal
    import meerschaum as mrsm
    
    pipe = mrsm.Pipe(
        'demo', 'enforce',
        instance='sql:memory',
        enforce=False,
        static=True,
        autoincrement=True,
        columns={
            'primary': 'Id',
            'datetime': 'Id',
        },
        dtypes={
            'Id': 'int',
            'Amount': 'numeric',
        },
    )
    pipe.sync([
        {'Amount': Decimal('1.11')},
        {'Amount': Decimal('2.22')},
    ]) 
    
    df = pipe.get_data()
    print(df)
Create the datetime axis as a clustered index for MSSQL, even when a primary index is specified.
    Specifying a datetime and primary index will create a nonclustered PRIMARY KEY. Specifying the same column as both datetime and primary will create a clustered primary key (tip: this is useful when autoincrement=True).

  • Increase the default chunk interval to 43200 minutes.
    New hypertables will use a default chunksize of 30 days (43200 minutes).

  • Virtual environment bugfixes.
    Existing virtual environment packages are backed up before re-initializing a virtual environment. This fixes the issue of disappearing dependencies.

  • Store numeric as TEXT for SQLite and DuckDB.
    Due to limited precision, numeric columns are now stored as TEXT, then parsed into Decimal objects upon retrieval.

  • Show the Webterm by default when changing instances.
On the Web Console, changing the instance selection will make the Webterm visible.

  • Improve dtype inference.

🎨 v2.6.17 Enhance pipeline editing, fix dropping pipes with custom schema.

11 Dec 01:05
a289921

v2.6.17

  • Add relative deltas to starting in scheduler syntax.
    You may specify a delta in the job scheduler starting syntax:

    mrsm sync pipes -s 'daily starting in 30 seconds'
    
  • Fix drop pipes for pipes on custom schemas.
    Pipes created under a specific schema are now correctly dropped.

  • Enhance editing pipeline jobs.
    Pipeline jobs now provide the job label as the default text to be edited. Pipeline arguments are now placed on a separate line to improve legibility.

  • Disable the progress timer for jobs.
    The sync pipes progress timer will now be hidden when running through a job.

  • Unset MRSM_NOASK for daemons.
    Now that jobs may accept user input, the environment variable MRSM_NOASK is no longer needed for jobs run as daemons (executor local).

  • Replace Cx_Oracle with oracledb.
    The Oracle SQL driver is no longer required now that the default Python binding for Oracle is oracledb.

  • Fix Oracle auto-incrementing for good.
    At long last, the mystery of Oracle auto-incrementing identity columns has been laid to rest.