Releases: bmeares/Meerschaum
v2.0.1
v2.0.1
-
Fix syncing bools within in-place SQL pipes.
SQL pipes may now sync bools in-place. For database flavors which lack nativeBOOLEAN
support (e.g.sqlite
,oracle
,mysql
), then the boolean columns must be stated inpipe.dtypes
. -
Fix an issue with multiple users managing jobs.
Extra validation was added to the web UI to allow for multiple users to interact with jobs. -
Fix a minor formatting bug with
highlight_pipes()
.
Improved validation logic was added to prevent incorrectly prepending thePipe(
prefix. -
Hold back
pydantic
to<2.0.0
Pydantic 2 is supported in all features except--schedule
. Untilrocketry
supports Pydantic 2, it will be held back.
v2.0.0
v2.0.0
Breaking Changes
-
Removed redundant
Pipe.sync_time
property.
Usepipe.get_sync_time()
instead. -
Removed
SQLConnector.get_pipe_backtrack_minutes()
.
Usepipe.get_backtrack_interval()
instead. -
Replaced
pipe.parameters['chunk_time_interval']
withpipe.parameters['verify']['chunk_minutes']
For better security and cohesiveness, the TimescaleDBchunk_time_interval
value is now derived from the standardchunk_minutes
value. This also means pipes with integer date axes will be created with a new default chunk interval of 1440 (was previously 100,000). -
Moved
choose_subaction()
intomeerschaum.actions
.
This function is for internal use and as such should not affect any users.
Features
-
Added
verify pipes
and--verify
.
The commandmrsm verify pipes
ormrsm sync pipes --verify
will resync pipes' chunks with different rowcounts to catch any backfilled data.import meerschaum as mrsm foo = mrsm.Pipe( 'foo', 'src', target = 'foo', columns = {'datetime': 'dt'}, instance = 'sql:local' ) docs = [ {'dt': '2023-01-01'}, {'dt': '2023-01-02'}, ] foo.sync(docs) pipe = mrsm.Pipe( 'sql:local', 'verify', columns = {'datetime': 'dt'}, parameters = { 'query': f'SELECT * FROM "{foo.target}"' }, instance = 'sql:local', ) pipe.sync(docs) backfilled_docs = [ {'dt': '2022-12-30'}, {'dt': '2022-12-31'}, ] foo.sync(backfilled_docs) mrsm.pprint(pipe.verify()) assert foo.get_rowcount() == pipe.get_rowcount()
-
Added
deduplicate pipes
and--deduplicate
.
Runningmrsm deduplicates pipes
ormrsm sync pipes --deduplicate
will iterate over pipes' entire intervals, chunking at the configured chunk interval (seepipe.get_chunk_interval()
below) and clearing + resyncing chunks with duplicate rows.If your instance connector implements
deduplicate_pipe()
(e.g.SQLConnector
), then this method will override the defaultpipe.deduplicate()
.pipe = mrsm.Pipe( 'demo', 'deduplicate', columns = {'datetime': 'dt'}, instance = 'sql:local', ) docs = [ {'dt': '2023-01-01'}, {'dt': '2023-01-01'}, ] pipe.sync(docs) print(pipe.get_rowcount()) # 2 pipe.deduplicate() print(pipe.get_rowcount()) # 1
-
Added
pyarrow
support.
The dtypes enforcement system was overhauled to add support forpyarrow
data types.import meerschaum as mrsm import pandas as pd df = pd.DataFrame( [{'a': 1, 'b': 2.3}] ).convert_dtypes(dtype_backend='pyarrow') pipe = mrsm.Pipe( 'demo', 'pyarrow', instance = 'sql:local', columns = {'a': 'a'}, ) pipe.sync(df)
-
Added
bool
support.
Pipes may now sync DataFrames with booleans (even on Oracle and MySQL):import meerschaum as mrsm pipe = mrsm.Pipe( 'demo', 'bools', instance = 'sql:local', columns = {'id': 'id'}, ) pipe.sync([{'id': 1, 'is_blue': True}]) assert 'bool' in pipe.dtypes['is_blue'] pipe.sync([{'id': 1, 'is_blue': False}]) assert pipe.get_data()['is_blue'][0] == False
-
Added preliminary
dask
support.
For example, you may now return Dask DataFrames in your plugins, pass intopipe.sync()
, andpipe.get_data()
now has the flagas_dask
.import meerschaum as mrsm pipe = mrsm.Pipe( 'dask', 'demo', columns = {'datetime': 'dt'}, instance = 'sql:local', ) pipe.sync([ {'dt': '2023-01-01', 'val': 1}, {'dt': '2023-01-02', 'val': 2}, {'dt': '2023-01-03', 'val': 3}, ]) ddf = pipe.get_data(as_dask=True) print(ddf) # Dask DataFrame Structure: # dt val # npartitions=4 # datetime64[ns] int64[pyarrow] # ... ... # ... ... # ... ... # ... ... # Dask Name: from-delayed, 5 graph layers print(ddf.compute()) # dt val # 0 2023-01-01 1 # 0 2023-01-02 2 # 0 2023-01-03 3 pipe2 = mrsm.Pipe( 'dask', 'insert', columns = pipe.columns, instance = 'sql:local', ) pipe2.sync(ddf) assert pipe.get_data().to_dict() == pipe2.get_data().to_dict() pipe.sync([{'dt': '2023-01-01', 'val': 10}]) pipe2.sync(pipe.get_data(as_dask=True)) assert pipe.get_data().to_dict() == pipe2.get_data().to_dict()
-
Added
chunk_minutes
topipe.parameters['verify']
.
Likepipe.parameters['fetch']['backtrack_minutes']
, you may now specify the default chunk interval to use for verification syncs and iterating over the datetime axis.import meerschaum as mrsm pipe = mrsm.Pipe( 'a', 'b', instance = 'sql:local', columns = {'datetime': 'dt'}, parameters = { 'verify': { 'chunk_minutes': 2880, } }, ) pipe.sync([ {'dt': '2023-01-01'}, {'dt': '2023-01-02'}, {'dt': '2023-01-03'}, {'dt': '2023-01-04'}, {'dt': '2023-01-05'}, ]) chunk_bounds = pipe.get_chunk_bounds(bounded=True) for chunk_begin, chunk_end in chunk_bounds: print(chunk_begin, '-', chunk_end) # 2023-01-01 00:00:00 - 2023-01-03 00:00:00 # 2023-01-03 00:00:00 - 2023-01-05 00:00:00
-
Added
--chunk-minutes
,--chunk-hours
, and--chunk-days
.
You may override a pipe's chunk interval during a verification sync with--chunk-minutes
(or--chunk-hours
or--chunk-days
).mrsm verify pipes --chunk-days 3
-
Added
pipe.get_chunk_interval()
andpipe.get_backtrack_interval()
.
Return thetimedelta
(orint
for integer datetimes) fromverify:chunk_minutes
andfetch:backtrack_minutes
, respectively.import meerschaum as mrsm dt_pipe = mrsm.Pipe( 'demo', 'intervals', 'datetime', instance = 'sql:local', columns = {'datetime': 'dt'}, ) print(dt_pipe.get_chunk_interval()) # 1 day, 0:00:00 int_pipe = mrsm.Pipe( 'demo', 'intervals', 'int', instance = 'sql:local', columns = {'datetime': 'dt'}, dtypes = {'dt': 'int'}, ) print(int_pipe.get_chunk_interval()) # 1440
-
Added
pipe.get_chunk_bounds()
.
Return a list ofbegin
andend
values to use when iterating over a pipe's datetime axis.from datetime import datetime import meerschaum as mrsm pipe = mrsm.Pipe( 'demo', 'chunk_bounds', instance = 'sql:local', columns = {'datetime': 'dt'}, parameters = { 'verify': { 'chunk_minutes': 1440, } }, ) pipe.sync([ {'dt': '2023-01-01'}, {'dt': '2023-01-02'}, {'dt': '2023-01-03'}, {'dt': '2023-01-04'}, ]) open_bounds = pipe.get_chunk_bounds() for i, (begin, end) in enumerate(open_bounds): print(f"Chunk {i}: ({begin}, {end})") # Chunk 0: (None, 2023-01-01 00:00:00) # Chunk 1: (2023-01-01 00:00:00, 2023-01-02 00:00:00) # Chunk 2: (2023-01-02 00:00:00, 2023-01-03 00:00:00) # Chunk 3: (2023-01-03 00:00:00, 2023-01-04 00:00:00) # Chunk 4: (2023-01-04 00:00:00, None) closed_bounds = pipe.get_chunk_bounds(bounded=True) for i, (begin, end) in enumerate(closed_bounds): print(f"Chunk {i}: ({begin}, {end})") # Chunk 0: (2023-01-01 00:00:00, 2023-01-02 00:00:00) # Chunk 1: (2023-01-02 00:00:00, 2023-01-03 00:00:00) # Chunk 2: (2023-01-03 00:00:00, 2023-01-04 00:00:00) sub_bounds = pipe.get_chunk_bounds( begin = datetime(2023, 1, 1), end = datetime(2023, 1, 3), ) for i, (begin, end) in enumerate(sub_bounds): print(f"Chunk {i}: ({begin}, {end})") # Chunk 0: (2023-01-01 00:00:00, 2023-01-02 00:00:00) # Chunk 1: (2023-01-02 00:00:00, 2023-01-03 00:00:00)
-
Added
--bounded
to verification syncs.
By default,verify pipes
is unbounded, meaning it will sync values beyond the existing minimum and maximum datetime values. Running a verification sync with--bounded
will bound the search to the existing datetime axis.mrsm sync pipes --verify --bounded
-
Added
pipe.get_num_workers()
.
Return the number of concurrent threads to be used with this pipe (with respect to its instance connector's thread safety). -
Added
select_columns
andomit_columns
topipe.get_data()
.
In situations where not all columns are required, you can now either specify which columns you want to include (select_columns
) and which columns to filter out (omit_columns
). You may pass a list of columns or a single column, and the value'*'
forselect_columns
will be treated asNone
(i.e.SELECT *
).pipe = mrsm.Pipe('a', 'b', 'c', instance='sql:local') pipe.sync([{'a': 1, 'b': 2, 'c': 3}]) pipe.get_data(['a', 'b']) # a b # 0 1 2 pipe.get_data('*', 'b') # a c # 0 1 3 pipe.get_data(None, ['a', 'c']) # b # 0 2 pipe.get_data(omit_columns=['b', 'c']) # a # 0 1 pipe.get_data(select_columns=['c', 'a']) # c a # 0 3 1
-
Replace
daemoniker
withpython-daemon
.
python-daemon
is a well-maintained and well-behaved daemon process library. However, this migration removes Windows support for background jobs (which was never really fully supported already, so no harm there). -
Added
pause jobs
.
In addition tostart jobs
andstop jobs
, the commandpause jobs
will s...
v1.7.4
v1.7.3 – v1.7.4
-
Fix an issue with the local stack healthcheck.
Due to some edge cases, the local stackdocker-compose.yaml
file would not be correctly formatted untiledit config
had been executed. This patch ensures the files are synced with each invocation ofstack
. -
Fix an issue when running the local stack with non-default ports.
Initializing a local stack with a different database port (e.g. 5433) now routes correctly within the Docker compose network (now patching to internal port to 5432). -
Fix
upgrade mrsm
behavior.
Recent changes tostack
broke the automaticstack pull
withinmrsm upgrade mrsm
.
v1.7.2
v1.7.2
-
Fix
role "root" does not exist
from stack logs.
Although the healthcheck was working as expected, the log output was filled withError FATAL: role "root" does not exist
. These errors have been fixed. -
Fix
MRSM_CONFIG
behavior when runningstart api --production
.
Starting the Web API throughgunicorn
(i.e.--production
) now respectsMRSM_CONFIG
. This is useful for runningstack up
with non-default credentials. -
Added
--insecure
as an alias for--no-auth
.
To compliment the newly added--secure
flag, starting the Web API with--insecure
will bypass authentication. -
Bump default TimescaleDB version to PG15.
The default TimescaleDB version for the Meerschaum stack is nowlatest-pg15-oss
. -
Pass sysargs to
docker compose
viastack
This patch allows for jumping into theapi
container:mrsm stack exec api bash
-
Added the API endpoint
/healthcheck
.
This is used to determine reachability and the health of the local stack.
v1.7.1
v1.7.0 – v1.7.1
-
Remove
get_backtrack_data()
for instance connectors.
If provided, this method will still override the new generic implementation. -
Add
--keyfile
and--certfile
support.
When starting the Web API, you may now run via HTTPS with--keyfile
and--certfile
. Older releases required the keys to be set inMRSM_CONFIG
. This also brings SSL support for--production
(Gunicorn). -
Add the Webterm to the Web Console.
At long last, the webterm is embedded within the web console and is accessible from the Web API at the endpoint/webterm
. You must provide your active, authorized session ID to access to the Webterm. -
Add
--secure
tostart api
.
Starting the Web API with--secure
will now disallow actions from non-administrators. This is recommend for shared deployments. -
Fixed the registration page on the Web API.
Users should now be able to create accounts from Dockerized deployments. -
Held back
dash-extensions
The recent 1.0.2+ releases have shipped some broken changes, sodash-extensions
is held back to1.0.1
until newer releases have been tested. -
Allow for digits in environment connectors.
Connectors defined as environment variables may now have digits in the type.export MRSM_A2B_TEST='{"foo": "bar"}'
-
Fixed
stack
on Windows. -
Fixed a false error with background jobs.
-
Increased the minimum password length to 5.
v1.6.19
-
Add Pydantic v2 support
The only feature which requires Pydantic v1 is the--schedule
flag, which will throw a warning with a hint to install an older version. The underlying libraries for this feature should have Pydantic v2 support merged soon. -
Bump dependencies.
This patch bumps the minimum required versions fortyping-extensions
,rich
,prompt-toolkit
,rocketry
,uvicorn
,websockets
, andfastapi
and loosens the minimum version ofpydantic
. -
Fix shell formatting on Windows 10.
Some edge case issues have been patched for older versions of Windows.
v1.6.15
v1.6.14 (v1.5.23 for Python 3.7)
v1.6.14
-
Added healthchecks to
mrsm stack up
.
The internal Docker Compose file formrsm stack
was bumped to version 3.9, and secrets were replaced with environment variable references. -
Fixed
--no-auth
when starting the API.
The commandmrsm start api --no-auth
now correctly handles sessions.
v1.6.13 (v1.5.22 for Python 3.7)
v1.6.13
- Remove
\\u0000
from strings when inserting into PostgreSQL.
Replace both\0
and\\u0000
with empty strings when streaming rows into PostgreSQL.
Thank you @SheepIsGoat for your contribution!
v1.6.12 (v1.5.21 for Python 3.7)
v1.6.12
-
Allow nested chunk generators.
This patch more gracefully handles labels for situations with nested chunk generators and adds and explicit test for this scenario.import meerschaum as mrsm pipe = mrsm.Pipe('foo', 'bar', instance='sql:memory') docs = [{'color': 'red'}, {'color': 'green'}] num_chunks = 3 num_batches = 2 def build_chunks(): return ( [ {'chunk_ix': chunk_ix, **doc} for doc in docs ] for chunk_ix in range(num_chunks) ) batches = ( ( [ {'batch_ix': batch_ix, **doc} for doc in chunk ] for chunk in build_chunks() ) for batch_ix in range(num_batches) ) pipe.sync(batches) print(pipe.get_data()) # batch_ix chunk_ix color # 0 0 0 red # 1 0 0 green # 2 0 1 red # 3 0 1 green # 4 0 2 red # 5 0 2 green # 6 1 0 red # 7 1 0 green # 8 1 1 red # 9 1 1 green # 10 1 2 red # 11 1 2 green