MLCOMPUTE-967 | changes for Spark drivers on k8s #3679

Merged (93 commits) on Mar 27, 2024
Changes shown from 8 commits

Commits
55682bd
MLCOMPUTE-967 | add default iam role for Spark drivers on k8s
Aug 17, 2023
e9e89b0
MLCOMPUTE-967 | fixed naming, added TODO
Aug 22, 2023
003f0ae
Merge branch 'master' of https://github.com/Yelp/paasta into MLCOMPUT…
Aug 22, 2023
73c76fa
MLCOMPUTE-967 | move default to SystemPaastaConfig
Aug 23, 2023
34e3db1
Merge branch 'master' of https://github.com/Yelp/paasta into MLCOMPUT…
Sep 11, 2023
1d61a56
Merge branch 'master' of https://github.com/Yelp/paasta into MLCOMPUT…
Sep 13, 2023
de6926a
MLCOMPUTE-967 | add volume mounts for kubeconfig
Sep 14, 2023
7c98acd
Merge branch 'master' of https://github.com/Yelp/paasta into MLCOMPUT…
Sep 14, 2023
95bdac3
Merge branch 'master' of https://github.com/Yelp/paasta into MLCOMPUT…
Sep 19, 2023
119f912
Merge branch 'master' of https://github.com/Yelp/paasta into MLCOMPUT…
Sep 19, 2023
c895889
MLCOMPUTE-967 | test fixes
Sep 19, 2023
65ce515
Merge branch 'master' of https://github.com/Yelp/paasta into MLCOMPUT…
Sep 20, 2023
6d7e16f
MLCOMPUTE-967 | add spark to the list of executor types
Sep 22, 2023
c40a93d
MLCOMPUTE-967 | use correct k8s url as spark master argument
Sep 22, 2023
ecde517
MLCOMPUTE-967 | fix tests
Sep 25, 2023
759a013
Merge branch 'master' of https://github.com/Yelp/paasta into MLCOMPUT…
Sep 25, 2023
c28b60f
Merge branch 'master' of https://github.com/Yelp/paasta into MLCOMPUT…
Sep 26, 2023
86947b5
Merge branch 'master' of https://github.com/Yelp/paasta into MLCOMPUT…
Oct 2, 2023
c0f50e5
Merge branch 'master' of https://github.com/Yelp/paasta into MLCOMPUT…
Oct 3, 2023
42c518b
Merge branch 'master' of https://github.com/Yelp/paasta into MLCOMPUT…
Oct 11, 2023
759ca32
Merge branch 'master' of https://github.com/Yelp/paasta into MLCOMPUT…
Oct 16, 2023
921dfbb
MLCOMPUTE-1057 | add flow to create Spark executor service account dy…
Oct 17, 2023
32d52d0
Merge branch 'master' of https://github.com/Yelp/paasta into MLCOMPUT…
Oct 17, 2023
e9ec5f2
Merge branch 'MLCOMPUTE-967_spark_drivers_on_k8s' of https://github.c…
Oct 17, 2023
40e93a1
Merge branch 'master' of https://github.com/Yelp/paasta into MLCOMPUT…
Oct 23, 2023
d541894
MLCOMPUTE-967 | use same IAM role for Spark drivers and executors
Oct 25, 2023
5f29058
Merge branch 'master' of https://github.com/Yelp/paasta into MLCOMPUT…
Oct 25, 2023
80448ea
Merge branch 'master' of https://github.com/Yelp/paasta into MLCOMPUT…
Oct 31, 2023
526e6f2
Merge branch 'master' of https://github.com/Yelp/paasta into MLCOMPUT…
Nov 3, 2023
a945d47
MLCOMPUTE-1068 | logic for using duplicate iam role with eks suffix
Nov 21, 2023
ae492d0
Merge branch 'master' of https://github.com/Yelp/paasta into MLCOMPUT…
Nov 21, 2023
9416e7d
Merge branch 'master' of https://github.com/Yelp/paasta into MLCOMPUT…
Nov 22, 2023
a7540e2
Merge branch 'master' of https://github.com/Yelp/paasta into MLCOMPUT…
Nov 27, 2023
6226a1f
Merge branch 'master' of https://github.com/Yelp/paasta into MLCOMPUT…
Dec 4, 2023
b89f40f
remove use_k8s deprecated flag (cherry-picked from 59beb52)
bobokun Dec 14, 2023
d7d47fc
Use stable pool for Spark driver
chi-yelp Jan 29, 2024
2e90915
Use stable pool for Spark driver
chi-yelp Jan 29, 2024
913eb74
Merge branch 'MLCOMPUTE-967_spark_drivers_on_k8s' of https://github.c…
Jan 29, 2024
7f2cbab
Merge branch 'master' of https://github.com/Yelp/paasta into MLCOMPUT…
Jan 29, 2024
b52b510
Merge branch 'master' of https://github.com/Yelp/paasta into MLCOMPUT…
Jan 31, 2024
cbe6e09
Merge branch 'master' of https://github.com/Yelp/paasta into MLCOMPUT…
Feb 1, 2024
f45eed8
Merge branch 'master' into MLCOMPUTE-967_spark_drivers_on_k8s
chi-yelp Feb 6, 2024
825e564
lint
chi-yelp Feb 6, 2024
d7319f2
MLCOMPUTE-1160 | consolidate tron and paasta logic to spark_tools, ut…
Feb 8, 2024
22865d2
Merge branch 'MLCOMPUTE-967_spark_drivers_on_k8s' of https://github.c…
Feb 8, 2024
cb01872
MLCOMPUTE-1163 | handle space separated Spark args
Feb 16, 2024
da5972a
Merge branch 'master' of https://github.com/Yelp/paasta into MLCOMPUT…
Feb 16, 2024
8586368
MLCOMPUTE-967 | formatting
Feb 16, 2024
5790282
MLCOMPUTE-967 | bump up service configuration lib version
Feb 16, 2024
c7c6195
MLCOMPUTE-1119 | Support extra flags from paasta spark-run in tron_tools
Feb 17, 2024
067b703
MLCOMPUTE-967 | mount pod template file for Spark executor
Feb 17, 2024
bae0e17
MLCOMPUTE-967 | formatting
Feb 19, 2024
97f8077
Comment out driver pod ports definition for testing
chi-yelp Feb 20, 2024
8b84c3b
Comment out executor podTemplate for testing
chi-yelp Feb 21, 2024
bb86b32
MLCOMPUTE-967 | fix tests
Feb 22, 2024
18ca6fd
Merge branch 'MLCOMPUTE-967_spark_drivers_on_k8s' of https://github.c…
Feb 22, 2024
33a81c7
Add pod labels for monitoring metrics
chi-yelp Feb 22, 2024
e2b7166
Fix mypy
chi-yelp Feb 22, 2024
84f97b0
Merge branch 'master' of https://github.com/Yelp/paasta into MLCOMPUT…
Feb 22, 2024
3174e12
MLCOMPUTE-967 | fix tests
Feb 22, 2024
5669453
Parse additional tronfig args
chi-yelp Feb 23, 2024
4919cde
MLCOMPUTE-967 | fix tests
Feb 26, 2024
10a5e77
MLCOMPUTE-967 | fix tests
Feb 28, 2024
d0c9917
MLCOMPUTE-967 | use default pod template without compact bin packing
Feb 28, 2024
5af2a25
MLCOMPUTE-967 | bump service-configuration-lib version
Feb 28, 2024
481be1f
Merge branch 'master' of https://github.com/Yelp/paasta into MLCOMPUT…
Feb 28, 2024
4f31f0e
MLCOMPUTE-967 | fix tests
Feb 28, 2024
6e34085
MLCOMPUTE-967 | fix tests
Feb 28, 2024
899006d
MLCOMPUTE-967 | updated timeout_spark regex acc to bash timeout, remo…
Mar 4, 2024
4ef1f13
MLCOMPUTE-967 | fix tests
Mar 4, 2024
19d2161
MLCOMPUTE-967 | fix tests
Mar 4, 2024
b4f67ed
MLCOMPUTE-967 | fix tests
Mar 4, 2024
0126f03
MLCOMPUTE-967 | remove commented logic
Mar 4, 2024
99ce5c9
Merge 0126f03f5023644763781692706ee99951cf640d into e5156f2e4cb0d5732…
CaptainSame Mar 4, 2024
ccb88d9
Merge branch 'master' of https://github.com/Yelp/paasta into MLCOMPUT…
Mar 6, 2024
35938e4
Specify prometheus scrape path
chi-yelp Mar 7, 2024
1960bff
Merge branch 'master' into MLCOMPUTE-967_spark_drivers_on_k8s
chi-yelp Mar 7, 2024
33f89cd
Merge branch 'master' into MLCOMPUTE-967_spark_drivers_on_k8s
chi-yelp Mar 8, 2024
88c5320
Update driver pod annotations for metrics scraping
chi-yelp Mar 10, 2024
dca9d43
MLCOMPUTE-967 | deprecate obsolete args, add mrjob capability to tron…
Mar 14, 2024
02b4ffc
Merge branch 'MLCOMPUTE-967_spark_drivers_on_k8s' of https://github.c…
Mar 14, 2024
a17c3fa
Merge branch 'master' of https://github.com/Yelp/paasta into MLCOMPUT…
Mar 14, 2024
275dbf3
MLCOMPUTE-967 | fix tests
Mar 14, 2024
59622de
MLCOMPUTE-967 | formatting
Mar 14, 2024
97ea384
MLCOMPUTE-967 | fix tests
Mar 14, 2024
b4b5fb9
MLCOMPUTE-967 | remove obsolete args, sanitise code
Mar 25, 2024
232307b
Merge branch 'master' of https://github.com/Yelp/paasta into MLCOMPUT…
Mar 25, 2024
e60d5bc
MLCOMPUTE-967 | formatting
Mar 25, 2024
0da84e2
MLCOMPUTE-967 | sanitise code
Mar 25, 2024
39728a0
Merge branch 'master' of https://github.com/Yelp/paasta into MLCOMPUT…
Mar 26, 2024
b409bb1
MLCOMPUTE-967 | fix tests
Mar 26, 2024
f430254
MLCOMPUTE-967 | fix tests
Mar 26, 2024
c49d25d
MLCOMPUTE-967 | use already defined time_delta
Mar 27, 2024
49 changes: 27 additions & 22 deletions paasta_tools/tron_tools.py
@@ -98,7 +98,7 @@
DEFAULT_AWS_REGION = "us-west-2"
EXECUTOR_TYPE_TO_NAMESPACE = {
"paasta": "tron",
"spark": "paasta-spark",
"spark": "tron",
}
DEFAULT_TZ = "US/Pacific"
clusterman_metrics, _ = get_clusterman_metrics()
@@ -291,16 +291,6 @@ def __init__(
self.for_validation = for_validation

def _build_spark_config(self) -> Dict[str, str]:
# this service account will either be used solely for k8s access or for
# k8s + AWS access - the executor will have its own account though: and
# only if users opt into using pod identity
driver_service_account = create_or_find_service_account_name(
iam_role=self.get_iam_role(),
namespace=EXECUTOR_TYPE_TO_NAMESPACE[self.get_executor()],
k8s_role=_spark_k8s_role(),
dry_run=self.for_validation,
)

# this is pretty similar to what's done in service_configuration_lib, but it doesn't seem
# worth refactoring things there when it's unlikely that other users of that library will
# have the k8s creds to deal with service account creation (or the role bindings for the
@@ -322,9 +312,6 @@ def _build_spark_config(self) -> Dict[str, str]:
"spark.executorEnv.PAASTA_CLUSTER": self.get_cluster(),
"spark.executorEnv.PAASTA_INSTANCE_TYPE": "spark",
"spark.executorEnv.SPARK_EXECUTOR_DIRS": "/tmp",
# XXX: do we need to set this for the driver if the pod running the driver was created with
# this SA already?
"spark.kubernetes.authenticate.driver.serviceAccountName": driver_service_account,
"spark.kubernetes.pyspark.pythonVersion": "3",
"spark.kubernetes.container.image": self.get_docker_url(
system_paasta_config
@@ -415,7 +402,8 @@ def get_job_name(self):
def get_action_name(self):
return self.action

# mypy does not like the SecretVolume -> TronSecretVolume conversion, because TypedDict inheritence is broken. Until this is fixed, let's ignore this issue.
# mypy does not like the SecretVolume -> TronSecretVolume conversion, because TypedDict inheritence is broken.
# Until this is fixed, let's ignore this issue.
def get_secret_volumes(self) -> List[TronSecretVolume]: # type: ignore
"""Adds the secret_volume_name to the objet so tron/task_processing can load it downstream without replicating code."""
secret_volumes = super().get_secret_volumes()
@@ -488,9 +476,18 @@ def get_env(
# if this changes and we do need it - please add a comment about *why* we need it!
# XXX: update PAASTA_RESOURCE_* env vars to use the correct value from spark_args and set
# these to the correct values for the executors as part of the driver commandline
env["KUBECONFIG"] = system_paasta_config.get_spark_kubeconfig()

return env

def get_iam_role(self) -> str:
iam_role = super().get_iam_role()

if not iam_role and self.get_executor() == "spark":
iam_role = load_system_paasta_config().get_spark_driver_iam_role()

return iam_role

def get_secret_env(self) -> Mapping[str, dict]:
base_env = self.config_dict.get("env", {})
secret_env = {}
@@ -896,7 +893,7 @@ def format_master_config(master_config, default_volumes, dockercfg_location):
def format_tron_action_dict(action_config: TronActionConfig, use_k8s: bool = False):
"""Generate a dict of tronfig for an action, from the TronActionConfig.

:param job_config: TronActionConfig
:param action_config: TronActionConfig
"""
executor = action_config.get_executor()
result = {
@@ -912,7 +909,7 @@ def format_tron_action_dict(action_config: TronActionConfig, use_k8s: bool = Fal
"triggered_by": action_config.get_triggered_by(),
"on_upstream_rerun": action_config.get_on_upstream_rerun(),
"trigger_timeout": action_config.get_trigger_timeout(),
# outside of Spark usescases, we also allow users to specify an expected-to-exist Service Account name
# outside of Spark use-cases, we also allow users to specify an expected-to-exist Service Account name
# in the Tron namespace in case an action needs specific k8s permissions (e.g., a Jolt batch may need
# k8s permissions to list Jolt pods in the jolt namespace to do science™ to them).
# if the provided Service Account does not exist, Tron should simply fail to create the Podspec and report
@@ -921,7 +918,7 @@ def format_tron_action_dict(action_config: TronActionConfig, use_k8s: bool = Fal
"service_account_name": action_config.get_service_account_name(),
}

# while we're tranisitioning, we want to be able to cleanly fallback to Mesos
# while we're transitioning, we want to be able to cleanly fall back to Mesos,
# so we'll default to Mesos unless k8s usage is enabled for both the cluster
# and job.
# there are slight differences between k8s and Mesos configs, so we'll translate
@@ -997,20 +994,25 @@ def format_tron_action_dict(action_config: TronActionConfig, use_k8s: bool = Fal
)

if executor == "spark":
system_paasta_config = load_system_paasta_config()
# this service account will only be used by Spark drivers since executors don't
# need Kubernetes access permissions
result["service_account_name"] = create_or_find_service_account_name(
iam_role=action_config.get_iam_role(),
namespace=EXECUTOR_TYPE_TO_NAMESPACE[executor],
k8s_role=_spark_k8s_role(),
dry_run=action_config.for_validation,
)
# spark, unlike normal batches, needs to expose several ports for things like the spark
result["extra_volumes"] = [{
"container_path": system_paasta_config.get_spark_kubeconfig(),
"host_path": system_paasta_config.get_spark_kubeconfig(),
"mode": "RO",
}]
# spark, unlike normal batches, needs to expose several ports for things like the spark
# ui and for executor->driver communication
result["ports"] = list(
set(
_get_spark_ports(
system_paasta_config=load_system_paasta_config()
system_paasta_config=system_paasta_config
).values()
)
)
@@ -1034,7 +1036,10 @@ def format_tron_action_dict(action_config: TronActionConfig, use_k8s: bool = Fal
result["cpus"] = action_config.get_cpus()
result["mem"] = action_config.get_mem()
result["disk"] = action_config.get_disk()
result["extra_volumes"] = format_volumes(action_config.get_extra_volumes())
if "extra_volumes" in result:
result["extra_volumes"] = result["extra_volumes"] + format_volumes(action_config.get_extra_volumes())
else:
result["extra_volumes"] = format_volumes(action_config.get_extra_volumes())
Review comment (Member):

what are your thoughts on having TronActionConfig provide an override for get_extra_volumes() and doing the same if self.get_executor() == "spark": trick we do elsewhere to transparently add this mount? that way we can avoid doing the merge here, which (without additional comments) could potentially confuse future yelpers reading this code.
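
A minimal sketch of that suggestion (hypothetical, not code from this PR): an override on TronActionConfig that appends the kubeconfig mount only for Spark actions, so format_tron_action_dict() would not need to merge two volume lists. The DockerVolume keyword names are an assumption based on paasta's existing volume type; load_system_paasta_config() and get_spark_kubeconfig() are reused from the diff above.

    # hypothetical override on TronActionConfig in paasta_tools/tron_tools.py
    def get_extra_volumes(self) -> List[DockerVolume]:
        extra_volumes = super().get_extra_volumes()
        if self.get_executor() == "spark":
            # transparently add the kubeconfig mount for Spark drivers only
            kubeconfig = load_system_paasta_config().get_spark_kubeconfig()
            extra_volumes.append(
                DockerVolume(hostPath=kubeconfig, containerPath=kubeconfig, mode="RO")
            )
        return extra_volumes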

result["docker_image"] = action_config.get_docker_url()

# Only pass non-None values, so Tron will use defaults for others
4 changes: 4 additions & 0 deletions paasta_tools/utils.py
@@ -1887,6 +1887,7 @@ class RemoteRunConfig(TypedDict, total=False):
class SparkRunConfig(TypedDict, total=False):
default_cluster: str
default_pool: str
default_spark_driver_iam_role: str


class PaastaNativeConfig(TypedDict, total=False):
@@ -2777,6 +2778,9 @@ def get_eks_cluster_aliases(self) -> Dict[str, str]:
def get_cluster_pools(self) -> Dict[str, List[str]]:
return self.config_dict.get("allowed_pools", {})

def get_spark_driver_iam_role(self) -> str:
return self.get_spark_run_config().get("default_spark_driver_iam_role", "")

def get_hacheck_match_initial_delay(self) -> bool:
return self.config_dict.get("hacheck_match_initial_delay", False)

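For context, a minimal usage sketch (not part of this PR) of how the new default_spark_driver_iam_role setting would be consumed. The "spark_run_config" top-level key and the SystemPaastaConfig constructor arguments are assumptions based on the surrounding utils.py code; the role ARN is a placeholder.

    from paasta_tools.utils import SystemPaastaConfig

    # assumed shape of the system-level config block read by get_spark_run_config()
    system_config = SystemPaastaConfig(
        {
            "spark_run_config": {
                "default_spark_driver_iam_role": "arn:aws:iam::000000000000:role/example-spark-driver",
            }
        },
        "/etc/paasta",
    )

    # TronActionConfig.get_iam_role() falls back to this value when an action's own
    # iam_role is unset and its executor is "spark"
    print(system_config.get_spark_driver_iam_role())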