First local runthrough #5

Open
mrocklin opened this issue Feb 10, 2024 · 2 comments

@mrocklin

I understand that this is very early and not yet ready for prime-time, but I tried running through this locally and had some issues:

  • Data volume was too high for my mac. The processing side could barely keep up with the data generation side. This wasn't a big deal; I could just increase the interval on that flow.
  • I ran into parquet/arrow/json issues like the following:
  File "/Users/mrocklin/workspace/etl-tpch/pipeline/resize.py", line 28, in repartition_table
    df.to_parquet(outdir, compression="snappy", name_function=name)
  File "/Users/mrocklin/mambaforge/envs/etl-tpch/lib/python3.11/site-packages/dask_expr/_collection.py", line 2154, in to_parquet
    return to_parquet(self, path, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mrocklin/mambaforge/envs/etl-tpch/lib/python3.11/site-packages/dask_expr/io/parquet.py", line 383, in to_parquet
    out = out.compute(**compute_kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mrocklin/mambaforge/envs/etl-tpch/lib/python3.11/site-packages/dask_expr/_collection.py", line 366, in compute
    return DaskMethodsMixin.compute(out, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mrocklin/mambaforge/envs/etl-tpch/lib/python3.11/site-packages/dask/base.py", line 377, in compute
    (result,) = compute(self, traverse=False, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mrocklin/mambaforge/envs/etl-tpch/lib/python3.11/site-packages/dask/base.py", line 663, in compute
    results = schedule(dsk, keys, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mrocklin/mambaforge/envs/etl-tpch/lib/python3.11/site-packages/dask/dataframe/io/parquet/core.py", line 97, in __call__
    return read_parquet_part(
           ^^^^^^^^^^^^^^^^^^
  File "/Users/mrocklin/mambaforge/envs/etl-tpch/lib/python3.11/site-packages/dask/dataframe/io/parquet/core.py", line 645, in read_parquet_part
    dfs = [
          ^
  File "/Users/mrocklin/mambaforge/envs/etl-tpch/lib/python3.11/site-packages/dask/dataframe/io/parquet/core.py", line 646, in <listcomp>
    func(
  File "/Users/mrocklin/mambaforge/envs/etl-tpch/lib/python3.11/site-packages/dask/dataframe/io/parquet/arrow.py", line 640, in read_partition
    arrow_table = cls._read_table(
                  ^^^^^^^^^^^^^^^^
  File "/Users/mrocklin/mambaforge/envs/etl-tpch/lib/python3.11/site-packages/dask/dataframe/io/parquet/arrow.py", line 1773, in _read_table
    arrow_table = _read_table_from_path(
                  ^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mrocklin/mambaforge/envs/etl-tpch/lib/python3.11/site-packages/dask/dataframe/io/parquet/arrow.py", line 263, in _read_table_from_path
    return pq.ParquetFile(fil, **pre_buffer).read(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mrocklin/mambaforge/envs/etl-tpch/lib/python3.11/site-packages/pyarrow/parquet/core.py", line 318, in __init__
    self.reader.open(
  File "pyarrow/_parquet.pyx", line 1470, in pyarrow._parquet.ParquetReader.open
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
convert_to_parquet-96daff616b2cee173abf09be2e6d04d4	ValueError('Unmatched \'\'"\' when when decoding \'string\'')	  File "/Users/mrocklin/workspace/etl-tpch/pipeline/preprocess.py", line 29, in convert_to_parquet\n    df = pd.read_json(file, compression="zstd")\n         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/Users/mrocklin/mambaforge/envs/etl-tpch/lib/python3.11/site-packages/pandas/io/json/_json.py", line 815, in read_json\n    return json_reader.read()\n           ^^^^^^^^^^^^^^^^^^\n  File "/Users/mrocklin/mambaforge/envs/etl-tpch/lib/python3.11/site-packages/pandas/io/json/_json.py", line 1025, in read\n    obj = self._get_object_parser(self.data)\n          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/Users/mrocklin/mambaforge/envs/etl-tpch/lib/python3.11/site-packages/pandas/io/json/_json.py", line 1051, in _get_object_parser\n    obj = FrameParser(json, **kwargs).parse()\n          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/Users/mrocklin/mambaforge/envs/etl-tpch/lib/python3.11/site-packages/pandas/io/json/_json.py", line 1187, in parse\n    self._parse()\n  File "/Users/mrocklin/mambaforge/envs/etl-tpch/lib/python3.11/site-packages/pandas/io/json/_json.py", line 1400, in _parse\n    ujson_loads(json, precise_float=self.precise_float), dtype=None\n    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n	tcp://127.0.0.1:52359	1
convert_to_parquet-b3c6b4c0da0395cda7932836ed79b155	ValueError("No ':' found when decoding object value")	  File "/Users/mrocklin/workspace/etl-tpch/pipeline/preprocess.py", line 29, in convert_to_parquet\n    df = pd.read_json(file, compression="zstd")\n         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/Users/mrocklin/mambaforge/envs/etl-tpch/lib/python3.11/site-packages/pandas/io/json/_json.py", line 815, in read_json\n    return json_reader.read()\n           ^^^^^^^^^^^^^^^^^^\n  File "/Users/mrocklin/mambaforge/envs/etl-tpch/lib/python3.11/site-packages/pandas/io/json/_json.py", line 1025, in read\n    obj = self._get_object_parser(self.data)\n          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/Users/mrocklin/mambaforge/envs/etl-tpch/lib/python3.11/site-packages/pandas/io/json/_json.py", line 1051, in _get_object_parser\n    obj = FrameParser(json, **kwargs).parse()\n          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "/Users/mrocklin/mambaforge/envs/etl-tpch/lib/python3.11/site-packages/pandas/io/json/_json.py", line 1187, in parse\n    self._parse()\n  File "/Users/mrocklin/mambaforge/envs/etl-tpch/lib/python3.11/site-packages/pandas/io/json/_json.py", line 1400, in _parse\n    ujson_loads(json, precise_float=self.precise_float), dtype=None\n    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n	tcp://127.0.0.1:52212	1

As expected of course. If I have some time this weekend I may poke around a little.
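
For anyone trying to narrow down the `ArrowInvalid` error above, here is a minimal diagnostic sketch. It relies only on the fact that an unencrypted Parquet file starts and ends with the 4-byte marker `PAR1`; the helper names `has_parquet_magic` and `find_suspect_files` are hypothetical and not part of this repo, and the check says nothing about why a file is incomplete.

```python
from pathlib import Path

PARQUET_MAGIC = b"PAR1"  # 4-byte marker at the start and end of an unencrypted Parquet file


def has_parquet_magic(path):
    """Return True if the file starts and ends with the Parquet magic bytes."""
    path = Path(path)
    # Lower bound on size: header magic + 4-byte footer length + footer magic.
    if path.stat().st_size < 12:
        return False
    with path.open("rb") as f:
        head = f.read(4)
        f.seek(-4, 2)  # whence=2 seeks relative to the end of the file
        tail = f.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC


def find_suspect_files(directory):
    """List *.parquet files under `directory` that fail the magic-byte check."""
    return sorted(
        p for p in Path(directory).rglob("*.parquet") if not has_parquet_magic(p)
    )
```

Running `find_suspect_files` on the directory being read just before the `to_parquet` call should show which files pyarrow would reject. A file that passes this check can still be invalid in other ways, so treat it as a first filter only.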

@jrbourbeau

Thanks for trying this out @mrocklin. I appreciate the feedback.

Totally agree the intervals need to be adjusted (they're still in "run quickly so I can debug fast" mode).

The parquet error definitely looks strange. I've not seen it locally (at least not yet). I'll take a look on Monday.

@jrbourbeau

> Data volume was too high for my mac

This was handled in #7
