I understand that this is very early and not yet ready for prime-time, but I tried running through this locally and had some issues:
Data volume was too high for my Mac. The processing side could barely keep up with the data generation side. This wasn't a big deal; I could just increase the interval on that flow.
I ran into parquet/arrow/json issues like the following:
File "/Users/mrocklin/workspace/etl-tpch/pipeline/resize.py", line 28, in repartition_table
df.to_parquet(outdir, compression="snappy", name_function=name)
File "/Users/mrocklin/mambaforge/envs/etl-tpch/lib/python3.11/site-packages/dask_expr/_collection.py", line 2154, in to_parquet
return to_parquet(self, path, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/mrocklin/mambaforge/envs/etl-tpch/lib/python3.11/site-packages/dask_expr/io/parquet.py", line 383, in to_parquet
out = out.compute(**compute_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/mrocklin/mambaforge/envs/etl-tpch/lib/python3.11/site-packages/dask_expr/_collection.py", line 366, in compute
return DaskMethodsMixin.compute(out, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/mrocklin/mambaforge/envs/etl-tpch/lib/python3.11/site-packages/dask/base.py", line 377, in compute
(result,) = compute(self, traverse=False, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/mrocklin/mambaforge/envs/etl-tpch/lib/python3.11/site-packages/dask/base.py", line 663, in compute
results = schedule(dsk, keys, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/mrocklin/mambaforge/envs/etl-tpch/lib/python3.11/site-packages/dask/dataframe/io/parquet/core.py", line 97, in __call__
return read_parquet_part(
^^^^^^^^^^^^^^^^^^
File "/Users/mrocklin/mambaforge/envs/etl-tpch/lib/python3.11/site-packages/dask/dataframe/io/parquet/core.py", line 645, in read_parquet_part
dfs = [
^
File "/Users/mrocklin/mambaforge/envs/etl-tpch/lib/python3.11/site-packages/dask/dataframe/io/parquet/core.py", line 646, in <listcomp>
func(
File "/Users/mrocklin/mambaforge/envs/etl-tpch/lib/python3.11/site-packages/dask/dataframe/io/parquet/arrow.py", line 640, in read_partition
arrow_table = cls._read_table(
^^^^^^^^^^^^^^^^
File "/Users/mrocklin/mambaforge/envs/etl-tpch/lib/python3.11/site-packages/dask/dataframe/io/parquet/arrow.py", line 1773, in _read_table
arrow_table = _read_table_from_path(
^^^^^^^^^^^^^^^^^^^^^^
File "/Users/mrocklin/mambaforge/envs/etl-tpch/lib/python3.11/site-packages/dask/dataframe/io/parquet/arrow.py", line 263, in _read_table_from_path
return pq.ParquetFile(fil, **pre_buffer).read(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/mrocklin/mambaforge/envs/etl-tpch/lib/python3.11/site-packages/pyarrow/parquet/core.py", line 318, in __init__
self.reader.open(
File "pyarrow/_parquet.pyx", line 1470, in pyarrow._parquet.ParquetReader.open
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
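To find which files trigger this error, it may help to check the magic bytes directly. An unencrypted Parquet file starts and ends with the four bytes `PAR1`. The sketch below is only a diagnostic aid and uses nothing but the standard library. The helper name `has_parquet_magic` is hypothetical and not part of the pipeline.

```python
import os

# Unencrypted Parquet files begin and end with this 4-byte marker.
PARQUET_MAGIC = b"PAR1"


def has_parquet_magic(path):
    """Return True if the file starts and ends with the Parquet magic bytes.

    Hypothetical helper for illustration; it does not validate the footer
    metadata, only the two markers.
    """
    size = os.path.getsize(path)
    # Smallest conceivable file: header magic + 4-byte footer length + footer magic.
    if size < 12:
        return False
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, os.SEEK_END)
        tail = f.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC
```

A file that fails this check while it is still growing on disk would suggest it was read before the writer finished, though that is a guess and not something the traceback confirms.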
convert_to_parquet-96daff616b2cee173abf09be2e6d04d4 ValueError('Unmatched \'\'"\' when when decoding \'string\'') File "/Users/mrocklin/workspace/etl-tpch/pipeline/preprocess.py", line 29, in convert_to_parquet\n df = pd.read_json(file, compression="zstd")\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/mrocklin/mambaforge/envs/etl-tpch/lib/python3.11/site-packages/pandas/io/json/_json.py", line 815, in read_json\n return json_reader.read()\n ^^^^^^^^^^^^^^^^^^\n File "/Users/mrocklin/mambaforge/envs/etl-tpch/lib/python3.11/site-packages/pandas/io/json/_json.py", line 1025, in read\n obj = self._get_object_parser(self.data)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/mrocklin/mambaforge/envs/etl-tpch/lib/python3.11/site-packages/pandas/io/json/_json.py", line 1051, in _get_object_parser\n obj = FrameParser(json, **kwargs).parse()\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/mrocklin/mambaforge/envs/etl-tpch/lib/python3.11/site-packages/pandas/io/json/_json.py", line 1187, in parse\n self._parse()\n File "/Users/mrocklin/mambaforge/envs/etl-tpch/lib/python3.11/site-packages/pandas/io/json/_json.py", line 1400, in _parse\n ujson_loads(json, precise_float=self.precise_float), dtype=None\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n tcp://127.0.0.1:52359 1
convert_to_parquet-b3c6b4c0da0395cda7932836ed79b155 ValueError("No ':' found when decoding object value") File "/Users/mrocklin/workspace/etl-tpch/pipeline/preprocess.py", line 29, in convert_to_parquet\n df = pd.read_json(file, compression="zstd")\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/mrocklin/mambaforge/envs/etl-tpch/lib/python3.11/site-packages/pandas/io/json/_json.py", line 815, in read_json\n return json_reader.read()\n ^^^^^^^^^^^^^^^^^^\n File "/Users/mrocklin/mambaforge/envs/etl-tpch/lib/python3.11/site-packages/pandas/io/json/_json.py", line 1025, in read\n obj = self._get_object_parser(self.data)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/mrocklin/mambaforge/envs/etl-tpch/lib/python3.11/site-packages/pandas/io/json/_json.py", line 1051, in _get_object_parser\n obj = FrameParser(json, **kwargs).parse()\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/mrocklin/mambaforge/envs/etl-tpch/lib/python3.11/site-packages/pandas/io/json/_json.py", line 1187, in parse\n self._parse()\n File "/Users/mrocklin/mambaforge/envs/etl-tpch/lib/python3.11/site-packages/pandas/io/json/_json.py", line 1400, in _parse\n ujson_loads(json, precise_float=self.precise_float), dtype=None\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n tcp://127.0.0.1:52212 1
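Both the Parquet and JSON errors look like what happens when a reader picks up a file that is only partially written. That is an unconfirmed guess. If it turns out to be the cause, one common mitigation is to write to a temporary name and rename into place once the write completes. A minimal sketch, assuming the writer and reader share a local filesystem; `write_atomically` is a hypothetical name, not a function in the pipeline:

```python
import os
import tempfile


def write_atomically(path, data):
    """Write bytes to a temp file in the target directory, then rename it.

    Hypothetical helper for illustration. os.replace is atomic when source
    and destination are on the same filesystem, so a reader sees either no
    file or the complete file.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the partial temp file if anything went wrong.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Readers that match on the final extension would then skip the in-progress `.tmp` files.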
As expected, of course. If I have some time this weekend I may poke around a little.