fix(script_mode): Use UTC timestamp for tar file attributes #3154

Merged on Feb 25, 2025 (1 commit)

Conversation

pingsutw (Member) commented Feb 24, 2025

Tracking issue

https://flyte-org.slack.com/archives/CP2HDHKE1/p1739795314554249

Why are the changes needed?

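The Spark task fails while archiving the working directory: shutil.make_archive raises a ValueError because ZIP does not support timestamps before 1980.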

{"asctime": "2025-02-14 17:05:58,888", "name": "flytekit", "levelname": "ERROR", "message": "Trace:\n\n    Traceback (most recent call last):\n      File \"/databricks/python/lib/python3.10/site-packages/flytekit/bin/entrypoint.py\", line 179, in _dispatch_execute\n        outputs = task_def.dispatch_execute(ctx, idl_input_literals)\n      File \"/databricks/python/lib/python3.10/site-packages/flytekit/core/base_task.py\", line 728, in dispatch_execute\n        new_user_params = self.pre_execute(ctx.user_space_params)\n      File \"/databricks/python/lib/python3.10/site-packages/flytekitplugins/spark/task.py\", line 209, in pre_execute\n        shutil.make_archive(file_name, file_format, os.getcwd())\n      File \"/usr/lib/python3.10/shutil.py\", line 1124, in make_archive\n        filename = func(base_name, base_dir, **kwargs)\n      File \"/usr/lib/python3.10/shutil.py\", line 1009, in _make_zipfile\n        zf.write(path, arcname)\n      File \"/usr/lib/python3.10/zipfile.py\", line 1754, in write\n        zinfo = ZipInfo.from_file(filename, arcname,\n      File \"/usr/lib/python3.10/zipfile.py\", line 523, in from_file\n        zinfo = cls(arcname, date_time)\n      File \"/usr/lib/python3.10/zipfile.py\", line 366, in __init__\n        raise ValueError('ZIP does not support timestamps before 1980')\n    ValueError: ZIP does not support timestamps before 1980\n\nMessage:\n\n    ValueError: ZIP does not support timestamps before 1980"}
{"asctime": "2025-02-14 17:05:58,891", "name": "flytekit", "levelname": "ERROR", "message": "!! End Error Captured by Flyte !!"}
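
The traceback above ends in zipfile rejecting a file whose modification time is earlier than 1980. A minimal sketch of that behavior, using only the standard library, is shown here. The helper name zip_old_file and the file names are made up for illustration, and the epoch mtime is an assumption chosen to trigger the error; it is not necessarily the value the Spark task saw.

```python
import os
import tempfile
import zipfile


def zip_old_file(strict: bool) -> tuple:
    """Zip a file whose mtime is the Unix epoch and return the stored date_time."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "old.txt")
        with open(src, "w") as fh:
            fh.write("hello")
        # Force the mtime to 1970-01-01 UTC, which is before ZIP's 1980 lower bound.
        os.utime(src, (0, 0))

        archive = os.path.join(tmp, "out.zip")
        # strict_timestamps=True is the zipfile default and raises ValueError here;
        # strict_timestamps=False clamps the stored date to 1980-01-01 instead.
        with zipfile.ZipFile(archive, "w", strict_timestamps=strict) as zf:
            zf.write(src, "old.txt")
            return zf.getinfo("old.txt").date_time


if __name__ == "__main__":
    try:
        zip_old_file(strict=True)
    except ValueError as exc:
        print("strict:", exc)
    print("clamped:", zip_old_file(strict=False))
```

shutil.make_archive does not expose strict_timestamps, which is one reason to fix the timestamp where the files are produced rather than where they are zipped.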

What changes were proposed in this pull request?

Use an explicit UTC timestamp when setting tar file attributes, instead of relying on the machine's local timezone.
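
The diff itself is not reproduced on this page, so the following is only a sketch of the idea, assuming the fix amounts to deriving the tar mtime from a timezone-aware datetime. The names pin_mtime and ZIP_EPOCH_UTC are hypothetical, and the choice of 1980-01-01 as the pinned instant is an assumption based on the ZIP lower bound in the error message. The sketch also models why a naive datetime is fragile: the same wall-clock date built on a UTC+8 machine lands in 1979 when read on a UTC host.

```python
import tarfile
from datetime import datetime, timedelta, timezone

# Assumed reference instant for this sketch: the earliest date ZIP can store.
ZIP_EPOCH_UTC = datetime(1980, 1, 1, tzinfo=timezone.utc)


def pin_mtime(tar_info: tarfile.TarInfo) -> tarfile.TarInfo:
    """Hypothetical tarfile filter that overwrites mtime with a fixed UTC instant."""
    tar_info.mtime = int(ZIP_EPOCH_UTC.timestamp())
    return tar_info


# What a naive datetime(1980, 1, 1) would mean on a machine set to UTC+8,
# modelled with an explicit offset so the result does not depend on this host.
built_in_utc_plus_8 = datetime(1980, 1, 1, tzinfo=timezone(timedelta(hours=8)))

# Read back on a UTC host, that instant is still in 1979 and too old for ZIP.
seen_from_utc = built_in_utc_plus_8.astimezone(timezone.utc)

if __name__ == "__main__":
    info = pin_mtime(tarfile.TarInfo("example.py"))
    print("pinned mtime:", info.mtime)
    print("naive build seen from UTC:", seen_from_utc.isoformat())
```

A function shaped like pin_mtime can be passed as the filter argument of TarFile.add, so every archived member gets the same host-independent timestamp.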

How was this patch tested?

import datetime
import random
from operator import add
from flytekit import ImageSpec, Resources, task, workflow
import flytekit

from flytekitplugins.spark import Spark

new_flytekit = "git+https://github.com/flyteorg/flytekit.git@ddfae878eae76914c2199213c972b09d721ad6ce"
custom_image = ImageSpec(base_image="spark:3.5.3-python3", registry="ghcr.io/flyteorg", packages=[new_flytekit, "flytekitplugins-spark", "pyspark==3.5.2"], builder="default")


@task(
    task_config=Spark(
        spark_conf={
            "spark.driver.memory": "1000M",
            "spark.executor.memory": "1000M",
            "spark.executor.cores": "1",
            "spark.executor.instances": "2",
            "spark.driver.cores": "1",
            "spark.kubernetes.file.upload.path": "/opt/spark/work-dir",
            # "spark.jars": "https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop3-latest.jar",
            "spark.jars": "https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.2.2/hadoop-aws-3.2.2.jar,https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.262/aws-java-sdk-bundle-1.12.262.jar"
        },
        # executor_path="/opt/micromamba/envs/runtime/bin/python",
        # applications_path="local:///opt/micromamba/envs/runtime/bin/entrypoint.py",
        executor_path="/usr/bin/python3",
        applications_path="local:///usr/bin/entrypoint.py"
    ),
    limits=Resources(mem="2000M"),
    container_image=ImageSpec(base_image="spark:3.5.2-python3", python_version="3.10", registry="ghcr.io/flyteorg", packages=["flytekitplugins-spark", "pyspark==3.5.2"], builder="envd"),
)
def hello_spark(partitions: int) -> float:
    session = flytekit.current_context().spark_session
    print("spark version", session.version)  # spark version 3.5.3
    print("Starting Spark with Partitions: {}".format(partitions))

    # Monte Carlo estimate of pi: count random points that fall inside the unit circle.
    n = 1 * partitions
    count = session.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add)

    pi_val = 4.0 * count / n
    return pi_val


def f(_):
    x = random.random() * 2 - 1
    y = random.random() * 2 - 1
    return 1 if x**2 + y**2 <= 1 else 0


@task(
    cache_version="2",
    container_image=custom_image,
)
def print_every_time(value_to_print: float, date_triggered: datetime.datetime) -> int:
    print("My printed value: {} @ {}".format(value_to_print, date_triggered))
    return 1


@workflow
def my_spark(triggered_date: datetime.datetime = datetime.datetime.now()) -> float:
    """
    Using the workflow is still as any other workflow. As image is a property of the task, the workflow does not care
    about how the image is configured.
    """
    pi = hello_spark(partitions=1)
    print_every_time(value_to_print=pi, date_triggered=triggered_date)
    return pi

Setup process

Screenshots

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Related PRs

NA

Docs link

NA

Summary by Bito

Fixed timestamp handling in script mode by implementing explicit UTC timezone setting for tar file attributes. This resolves ZIP file creation failures caused by timestamp inconsistencies. The solution ensures consistent handling of mtime attributes across different environments.

Unit tests added: False

Estimated effort to review (1-5, lower is better): 1

flyte-bot (Contributor) commented Feb 24, 2025

Code Review Agent Run #de9a9c

Actionable Suggestions - 0
Review Details
  • Files reviewed - 1 · Commit Range: ddfae87..ddfae87
    • flytekit/tools/script_mode.py
  • Files skipped - 0
  • Tools
    • Whispers (Secret Scanner) - ✔︎ Successful
    • Detect-secrets (Secret Scanner) - ✔︎ Successful
    • MyPy (Static Code Analysis) - ✔︎ Successful
    • Astral Ruff (Static Code Analysis) - ✔︎ Successful


flyte-bot (Contributor)

Changelist by Bito

This pull request implements the following key changes.

Key Change: Bug Fix - Fix Timestamp Handling in Script Mode
Files Impacted: script_mode.py - Added UTC timezone specification for the tar file timestamp to prevent issues with ZIP file creation

eapolinario (Collaborator) left a comment

Good catch. I hate timezone issues.

@pingsutw pingsutw merged commit f03cec8 into master Feb 25, 2025
112 checks passed