Skip to content

Releases: bmeares/Meerschaum

v2.0.1

26 Sep 04:31
922bca3
Compare
Choose a tag to compare

v2.0.1

  • Fix syncing bools within in-place SQL pipes.
    SQL pipes may now sync bools in-place. For database flavors which lack native BOOLEAN support (e.g. sqlite, oracle, mysql), then the boolean columns must be stated in pipe.dtypes.

  • Fix an issue with multiple users managing jobs.
    Extra validation was added to the web UI to allow for multiple users to interact with jobs.

  • Fix a minor formatting bug with highlight_pipes().
    Improved validation logic was added to prevent incorrectly prepending the Pipe( prefix.

  • Hold back pydantic to <2.0.0
    Pydantic 2 is supported in all features except --schedule. Until rocketry supports Pydantic 2, it will be held back.

v2.0.0

24 Sep 22:59
1d21364
Compare
Choose a tag to compare

v2.0.0

Breaking Changes

  • Removed redundant Pipe.sync_time property.
    Use pipe.get_sync_time() instead.

  • Removed SQLConnector.get_pipe_backtrack_minutes().
    Use pipe.get_backtrack_interval() instead.

  • Replaced pipe.parameters['chunk_time_interval'] with pipe.parameters['verify']['chunk_minutes']
    For better security and cohesiveness, the TimescaleDB chunk_time_interval value is now derived from the standard chunk_minutes value. This also means pipes with integer date axes will be created with a new default chunk interval of 1440 (was previously 100,000).

  • Moved choose_subaction() into meerschaum.actions.
    This function is for internal use and as such should not affect any users.

Features

  • Added verify pipes and --verify.
    The command mrsm verify pipes or mrsm sync pipes --verify will resync pipes' chunks with different rowcounts to catch any backfilled data.

    import meerschaum as mrsm
    foo = mrsm.Pipe(
        'foo', 'src',
        target = 'foo',
        columns = {'datetime': 'dt'},
        instance = 'sql:local'
    )
    docs = [
        {'dt': '2023-01-01'},
        {'dt': '2023-01-02'},
    ]
    foo.sync(docs)
    
    pipe = mrsm.Pipe(
        'sql:local', 'verify', 
        columns = {'datetime': 'dt'},
        parameters = {
            'query': f'SELECT * FROM "{foo.target}"'
        },
        instance = 'sql:local',
    )
    pipe.sync(docs) 
    
    backfilled_docs = [
        {'dt': '2022-12-30'},
        {'dt': '2022-12-31'},
    ]
    foo.sync(backfilled_docs)
    mrsm.pprint(pipe.verify())
    assert foo.get_rowcount() == pipe.get_rowcount()
  • Added deduplicate pipes and --deduplicate.
    Running mrsm deduplicates pipes or mrsm sync pipes --deduplicate will iterate over pipes' entire intervals, chunking at the configured chunk interval (see pipe.get_chunk_interval() below) and clearing + resyncing chunks with duplicate rows.

    If your instance connector implements deduplicate_pipe() (e.g. SQLConnector), then this method will override the default pipe.deduplicate().

    pipe = mrsm.Pipe(
        'demo', 'deduplicate',
        columns = {'datetime': 'dt'},
        instance = 'sql:local',
    )
    docs = [
        {'dt': '2023-01-01'},
        {'dt': '2023-01-01'},
    ]
    pipe.sync(docs)
    print(pipe.get_rowcount())
    # 2
    pipe.deduplicate()
    print(pipe.get_rowcount())
    # 1
  • Added pyarrow support.
    The dtypes enforcement system was overhauled to add support for pyarrow data types.

    import meerschaum as mrsm
    import pandas as pd
    
    df = pd.DataFrame(
        [{'a': 1, 'b': 2.3}]
    ).convert_dtypes(dtype_backend='pyarrow')
    
    pipe = mrsm.Pipe(
        'demo', 'pyarrow',
        instance = 'sql:local',
        columns = {'a': 'a'},
    )
    pipe.sync(df)
  • Added bool support.
    Pipes may now sync DataFrames with booleans (even on Oracle and MySQL):

    import meerschaum as mrsm
    pipe = mrsm.Pipe(
        'demo', 'bools',
        instance = 'sql:local',
        columns = {'id': 'id'},
    )
    pipe.sync([{'id': 1, 'is_blue': True}])
    assert 'bool' in pipe.dtypes['is_blue']
    
    pipe.sync([{'id': 1, 'is_blue': False}])
    assert pipe.get_data()['is_blue'][0] == False
  • Added preliminary dask support.
    For example, you may now return Dask DataFrames in your plugins, pass into pipe.sync(), and pipe.get_data() now has the flag as_dask.

    import meerschaum as mrsm
    pipe = mrsm.Pipe(
        'dask', 'demo',
        columns = {'datetime': 'dt'},
        instance = 'sql:local',
    )
    pipe.sync([
        {'dt': '2023-01-01', 'val': 1},
        {'dt': '2023-01-02', 'val': 2},
        {'dt': '2023-01-03', 'val': 3},
    ])
    ddf = pipe.get_data(as_dask=True)
    print(ddf)
    # Dask DataFrame Structure:
    #                            dt             val
    # npartitions=4
    #                datetime64[ns]  int64[pyarrow]
    #                           ...             ...
    #                           ...             ...
    #                           ...             ...
    #                           ...             ...
    # Dask Name: from-delayed, 5 graph layers
    
    print(ddf.compute())
    #           dt  val
    # 0 2023-01-01    1
    # 0 2023-01-02    2
    # 0 2023-01-03    3
    
    pipe2 = mrsm.Pipe(
        'dask', 'insert',
        columns = pipe.columns,
        instance = 'sql:local',
    )
    pipe2.sync(ddf)
    assert pipe.get_data().to_dict() == pipe2.get_data().to_dict()
    
    pipe.sync([{'dt': '2023-01-01', 'val': 10}])
    pipe2.sync(pipe.get_data(as_dask=True))
    assert pipe.get_data().to_dict() == pipe2.get_data().to_dict()
  • Added chunk_minutes to pipe.parameters['verify'].
    Like pipe.parameters['fetch']['backtrack_minutes'], you may now specify the default chunk interval to use for verification syncs and iterating over the datetime axis.

    import meerschaum as mrsm
    pipe = mrsm.Pipe(
        'a', 'b',
        instance = 'sql:local',
        columns = {'datetime': 'dt'},
        parameters = {
            'verify': {
                'chunk_minutes': 2880,
            }
        },
    )
    pipe.sync([
        {'dt': '2023-01-01'},
        {'dt': '2023-01-02'},
        {'dt': '2023-01-03'},
        {'dt': '2023-01-04'},
        {'dt': '2023-01-05'},
    ])
    chunk_bounds = pipe.get_chunk_bounds(bounded=True)
    for chunk_begin, chunk_end in chunk_bounds:
        print(chunk_begin, '-', chunk_end)
    
    # 2023-01-01 00:00:00 - 2023-01-03 00:00:00
    # 2023-01-03 00:00:00 - 2023-01-05 00:00:00
  • Added --chunk-minutes, --chunk-hours, and --chunk-days.
    You may override a pipe's chunk interval during a verification sync with --chunk-minutes (or --chunk-hours or --chunk-days).

    mrsm verify pipes --chunk-days 3
  • Added pipe.get_chunk_interval() and pipe.get_backtrack_interval().
    Return the timedelta (or int for integer datetimes) from verify:chunk_minutes and fetch:backtrack_minutes, respectively.

    import meerschaum as mrsm
    dt_pipe = mrsm.Pipe(
        'demo', 'intervals', 'datetime',
        instance = 'sql:local',
        columns = {'datetime': 'dt'},
    )
    print(dt_pipe.get_chunk_interval())
    # 1 day, 0:00:00
    
    int_pipe = mrsm.Pipe(
        'demo', 'intervals', 'int',
        instance = 'sql:local',
        columns = {'datetime': 'dt'},
        dtypes = {'dt': 'int'},
    )
    print(int_pipe.get_chunk_interval())
    # 1440
  • Added pipe.get_chunk_bounds().
    Return a list of begin and end values to use when iterating over a pipe's datetime axis.

    from datetime import datetime
    import meerschaum as mrsm
    pipe = mrsm.Pipe(
        'demo', 'chunk_bounds',
        instance = 'sql:local',
        columns = {'datetime': 'dt'},
        parameters = {
            'verify': {
                'chunk_minutes': 1440,
            }
        },
    )
    pipe.sync([
        {'dt': '2023-01-01'},
        {'dt': '2023-01-02'},
        {'dt': '2023-01-03'},
        {'dt': '2023-01-04'},
    ])
    
    open_bounds = pipe.get_chunk_bounds()
    for i, (begin, end) in enumerate(open_bounds):
        print(f"Chunk {i}: ({begin}, {end})")
    
    # Chunk 0: (None, 2023-01-01 00:00:00)
    # Chunk 1: (2023-01-01 00:00:00, 2023-01-02 00:00:00)
    # Chunk 2: (2023-01-02 00:00:00, 2023-01-03 00:00:00)
    # Chunk 3: (2023-01-03 00:00:00, 2023-01-04 00:00:00)
    # Chunk 4: (2023-01-04 00:00:00, None)
    
    closed_bounds = pipe.get_chunk_bounds(bounded=True)
    for i, (begin, end) in enumerate(closed_bounds):
        print(f"Chunk {i}: ({begin}, {end})")
    
    # Chunk 0: (2023-01-01 00:00:00, 2023-01-02 00:00:00)
    # Chunk 1: (2023-01-02 00:00:00, 2023-01-03 00:00:00)
    # Chunk 2: (2023-01-03 00:00:00, 2023-01-04 00:00:00)
    
    sub_bounds = pipe.get_chunk_bounds(
        begin = datetime(2023, 1, 1),
        end = datetime(2023, 1, 3),
    )
    for i, (begin, end) in enumerate(sub_bounds):
        print(f"Chunk {i}: ({begin}, {end})")
    
    # Chunk 0: (2023-01-01 00:00:00, 2023-01-02 00:00:00)
    # Chunk 1: (2023-01-02 00:00:00, 2023-01-03 00:00:00)
  • Added --bounded to verification syncs.
    By default, verify pipes is unbounded, meaning it will sync values beyond the existing minimum and maximum datetime values. Running a verification sync with --bounded will bound the search to the existing datetime axis.

    mrsm sync pipes --verify --bounded
  • Added pipe.get_num_workers().
    Return the number of concurrent threads to be used with this pipe (with respect to its instance connector's thread safety).

  • Added select_columns and omit_columns to pipe.get_data().
    In situations where not all columns are required, you can now either specify which columns you want to include (select_columns) and which columns to filter out (omit_columns). You may pass a list of columns or a single column, and the value '*' for select_columns will be treated as None (i.e. SELECT *).

    pipe = mrsm.Pipe('a', 'b', 'c', instance='sql:local')
    pipe.sync([{'a': 1, 'b': 2, 'c': 3}])
    
    pipe.get_data(['a', 'b'])
    #    a  b
    # 0  1  2
    
    pipe.get_data('*', 'b')
    #    a  c
    # 0  1  3
    
    pipe.get_data(None, ['a', 'c'])
    #    b
    # 0  2
    
    pipe.get_data(omit_columns=['b', 'c'])
    #    a
    # 0  1
    
    pipe.get_data(select_columns=['c', 'a'])
    #    c  a
    # 0  3  1
  • Replace daemoniker with python-daemon.
    python-daemon is a well-maintained and well-behaved daemon process library. However, this migration removes Windows support for background jobs (which was never really fully supported already, so no harm there).

  • Added pause jobs.
    In addition to start jobs and stop jobs, the command pause jobs will s...

Read more

v1.7.4

29 Aug 13:20
0f97fbc
Compare
Choose a tag to compare

v1.7.3 – v1.7.4

  • Fix an issue with the local stack healthcheck.
    Due to some edge cases, the local stack docker-compose.yaml file would not be correctly formatted until edit config had been executed. This patch ensures the files are synced with each invocation of stack.

  • Fix an issue when running the local stack with non-default ports.
    Initializing a local stack with a different database port (e.g. 5433) now routes correctly within the Docker compose network (now patching to internal port to 5432).

  • Fix upgrade mrsm behavior.
    Recent changes to stack broke the automatic stack pull within mrsm upgrade mrsm.

v1.7.2

12 Aug 14:21
f8b810f
Compare
Choose a tag to compare

v1.7.2

  • Fix role "root" does not exist from stack logs.
    Although the healthcheck was working as expected, the log output was filled with Error FATAL: role "root" does not exist. These errors have been fixed.

  • Fix MRSM_CONFIG behavior when running start api --production.
    Starting the Web API through gunicorn (i.e. --production) now respects MRSM_CONFIG. This is useful for running stack up with non-default credentials.

  • Added --insecure as an alias for --no-auth.
    To compliment the newly added --secure flag, starting the Web API with --insecure will bypass authentication.

  • Bump default TimescaleDB version to PG15.
    The default TimescaleDB version for the Meerschaum stack is now latest-pg15-oss.

  • Pass sysargs to docker compose via stack
    This patch allows for jumping into the api container:

    mrsm stack exec api bash
  • Added the API endpoint /healthcheck.
    This is used to determine reachability and the health of the local stack.

v1.7.1

11 Aug 04:29
e95ae71
Compare
Choose a tag to compare

v1.7.0 – v1.7.1

  • Remove get_backtrack_data() for instance connectors.
    If provided, this method will still override the new generic implementation.

  • Add --keyfile and --certfile support.
    When starting the Web API, you may now run via HTTPS with --keyfile and --certfile. Older releases required the keys to be set in MRSM_CONFIG. This also brings SSL support for --production (Gunicorn).

  • Add the Webterm to the Web Console.
    At long last, the webterm is embedded within the web console and is accessible from the Web API at the endpoint /webterm. You must provide your active, authorized session ID to access to the Webterm.

  • Add --secure to start api.
    Starting the Web API with --secure will now disallow actions from non-administrators. This is recommend for shared deployments.

  • Fixed the registration page on the Web API.
    Users should now be able to create accounts from Dockerized deployments.

  • Held back dash-extensions
    The recent 1.0.2+ releases have shipped some broken changes, so dash-extensions is held back to 1.0.1 until newer releases have been tested.

  • Allow for digits in environment connectors.
    Connectors defined as environment variables may now have digits in the type.

    export MRSM_A2B_TEST='{"foo": "bar"}'
  • Fixed stack on Windows.

  • Fixed a false error with background jobs.

  • Increased the minimum password length to 5.

v1.6.19

14 Jul 15:38
71d596f
Compare
Choose a tag to compare
  • Add Pydantic v2 support
    The only feature which requires Pydantic v1 is the --schedule flag, which will throw a warning with a hint to install an older version. The underlying libraries for this feature should have Pydantic v2 support merged soon.

  • Bump dependencies.
    This patch bumps the minimum required versions for typing-extensions, rich, prompt-toolkit, rocketry, uvicorn, websockets, and fastapi and loosens the minimum version of pydantic.

  • Fix shell formatting on Windows 10.
    Some edge case issues have been patched for older versions of Windows.

v1.6.15

16 Jun 21:32
7c07b2b
Compare
Choose a tag to compare
  • Sync chunks in the copy pipes action.
    This will help with large out-of-memory pipes.

v1.6.14 (v1.5.23 for Python 3.7)

05 Jun 03:35
38cf2c0
Compare
Choose a tag to compare

v1.6.14

  • Added healthchecks to mrsm stack up.
    The internal Docker Compose file for mrsm stack was bumped to version 3.9, and secrets were replaced with environment variable references.

  • Fixed --no-auth when starting the API.
    The command mrsm start api --no-auth now correctly handles sessions.

v1.6.13 (v1.5.22 for Python 3.7)

02 Jun 18:28
50c6f66
Compare
Choose a tag to compare

v1.6.13

  • Remove \\u0000 from strings when inserting into PostgreSQL.
    Replace both \0 and \\u0000 with empty strings when streaming rows into PostgreSQL.

Thank you @SheepIsGoat for your contribution!

v1.6.12 (v1.5.21 for Python 3.7)

30 May 23:44
13aae02
Compare
Choose a tag to compare

v1.6.12

  • Allow nested chunk generators.
    This patch more gracefully handles labels for situations with nested chunk generators and adds and explicit test for this scenario.

    import meerschaum as mrsm
    pipe = mrsm.Pipe('foo', 'bar', instance='sql:memory')
    
    docs = [{'color': 'red'}, {'color': 'green'}]
    num_chunks = 3
    num_batches = 2
    
    def build_chunks():
        return (
            [
                {'chunk_ix': chunk_ix, **doc}
                for doc in docs
            ]
            for chunk_ix in range(num_chunks)
        )
    
    batches = (
        (
            [
                {'batch_ix': batch_ix, **doc}
                for doc in chunk
            ]
            for chunk in build_chunks()
        )
        for batch_ix in range(num_batches)
    )
    
    pipe.sync(batches)
    print(pipe.get_data())
    #     batch_ix  chunk_ix  color
    # 0          0         0    red
    # 1          0         0  green
    # 2          0         1    red
    # 3          0         1  green
    # 4          0         2    red
    # 5          0         2  green
    # 6          1         0    red
    # 7          1         0  green
    # 8          1         1    red
    # 9          1         1  green
    # 10         1         2    red
    # 11         1         2  green