Spline Agent on AWS Glue #414

dirkdejager · 2022-03-09T07:33:47Z

Hi,

I am trying to run the Spline Agent on AWS Glue (which runs Spark). It seems to partially work, but I am not getting any lineage in the logs, on the console, or through the HTTP dispatcher. I know that the agent communicates with the REST server, because I can see a HEAD request reach the server on job startup, but no lineage comes through. My Glue setup is as follows:

1. Get the appropriate spline-agent JAR from mvnrepository.com, as per Table 1.
2. Attach it to the job with the "Dependent JARs path" parameter.
3. Add a spline.properties file to the job via "Referenced files path".
4. Add an extra "--conf" job parameter set to "spark.sql.queryExecutionListeners=za.co.absa.spline.harvester.listener.SplineQueryExecutionListener" to enable "Codeless Initialization".
5. Configure a composite dispatcher with the settings in Listing 1.
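
For anyone scripting this setup rather than using the console, the steps above can be sketched as a map of Glue job arguments. This is only a sketch: it assumes the console fields "Dependent JARs path" and "Referenced files path" correspond to the `--extra-jars` and `--extra-files` job parameters, and the S3 locations and the helper name `spline_job_arguments` are hypothetical.

```python
# Sketch only: builds Glue job arguments for the setup steps above.
# The S3 locations used in the demo are hypothetical placeholders.
LISTENER = "za.co.absa.spline.harvester.listener.SplineQueryExecutionListener"


def spline_job_arguments(agent_jar_s3_path, properties_s3_path):
    """Return job arguments that attach the Spline agent to a Glue job."""
    return {
        # "Dependent JARs path" in the console (assumed mapping)
        "--extra-jars": agent_jar_s3_path,
        # "Referenced files path" in the console (assumed mapping)
        "--extra-files": properties_s3_path,
        # Codeless initialization: register the Spline listener with Spark SQL
        "--conf": f"spark.sql.queryExecutionListeners={LISTENER}",
    }


if __name__ == "__main__":
    args = spline_job_arguments(
        "s3://example-bucket/jars/spark-3.1-spline-agent-bundle_2.12-0.7.3.jar",
        "s3://example-bucket/conf/spline.properties",
    )
    for key, value in args.items():
        print(key, value)
```

The resulting dictionary could be supplied as the job's default arguments when the job is created or updated programmatically.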

When running a job, I can see that the HTTP dispatcher connects to the Spline REST server, but no lineage ever appears in the CloudWatch logs, and the agent never contacts the Spline REST server again (the server logs show no further calls). Am I doing something wrong, or does the Spline agent not support Spark on AWS Glue? If it does not support Glue, how would one go about extending the agent to support it? Would the agent need to be extended to support the GlueContext? My sample Python code is in Listing 3.

Any ideas, suggestions or similar efforts?

Table 1. Spline Agent JARs which seem to work with Different Glue Versions

Glue Version	Spline Agent
Glue 1.0	-Not investigated-
Glue 2.0	spark-2.4-spline-agent-bundle_2.11-0.7.3.jar
Glue 3.0	spark-3.1-spline-agent-bundle_2.12-0.7.3.jar

spline.mode=REQUIRED
spline.lineageDispatcher=composite

spline.lineageDispatcher.http.className=za.co.absa.spline.harvester.dispatcher.HttpLineageDispatcher
spline.lineageDispatcher.http.producer.url=http://<server>:8080/producer

spline.lineageDispatcher.console.className=za.co.absa.spline.harvester.dispatcher.ConsoleLineageDispatcher
spline.lineageDispatcher.console.stream=ERR

spline.lineageDispatcher.logging.className=za.co.absa.spline.harvester.dispatcher.LoggingLineageDispatcher
spline.lineageDispatcher.logging.level=DEBUG

spline.lineageDispatcher.composite.className=za.co.absa.spline.harvester.dispatcher.CompositeLineageDispatcher
spline.lineageDispatcher.composite.dispatchers=logging,console,http
spline.lineageDispatcher.composite.failOnErrors=true

Listing 1. Contents of the spline.properties file
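
One thing that can be ruled out mechanically is a mismatch between the names in `spline.lineageDispatcher.composite.dispatchers` and the per-dispatcher `className` keys. The following is a rough sketch of such a check; the parser is a simplified key=value reader, not a full Java properties parser, and both function names are hypothetical. Listing 1 as posted passes this check.

```python
# Sketch only: a simplified key=value parser, not a full Java .properties reader.
def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and comment lines."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def missing_dispatcher_classes(props):
    """Names listed in the composite dispatcher that have no className entry."""
    prefix = "spline.lineageDispatcher"
    listed = props.get(f"{prefix}.composite.dispatchers", "")
    names = [name.strip() for name in listed.split(",") if name.strip()]
    return [name for name in names if f"{prefix}.{name}.className" not in props]


# Deliberately incomplete sample: only the http dispatcher has a className.
SAMPLE = """\
spline.lineageDispatcher=composite
spline.lineageDispatcher.http.className=za.co.absa.spline.harvester.dispatcher.HttpLineageDispatcher
spline.lineageDispatcher.composite.dispatchers=logging,console,http
"""

if __name__ == "__main__":
    print(missing_dispatcher_classes(parse_properties(SAMPLE)))
```

An empty result means every listed dispatcher has a class configured; it says nothing about whether lineage events are actually being produced.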


2022-03-07 07:30:53,989 INFO [Thread-5] internal.SharedState (Logging.scala:logInfo(54)): Warehouse path is 'file:/tmp/spark-warehouse'.
2022-03-07 07:30:54,489 INFO [Thread-5] state.StateStoreCoordinatorRef (Logging.scala:logInfo(54)): Registered StateStoreCoordinator endpoint
2022-03-07 07:30:54,536 INFO [Thread-5] harvester.QueryExecutionEventHandlerFactory (Logging.scala:logInfo(54)): Initializing Spline agent...
2022-03-07 07:30:54,536 INFO [Thread-5] harvester.QueryExecutionEventHandlerFactory (Logging.scala:logInfo(54)): Spline init type: AUTO (codeless)
2022-03-07 07:30:54,541 INFO [Thread-5] harvester.QueryExecutionEventHandlerFactory (Logging.scala:logInfo(54)): Spline version: 0.7.3 (rev. e2d00f3)
2022-03-07 07:30:54,542 INFO [Thread-5] harvester.QueryExecutionEventHandlerFactory (Logging.scala:logInfo(54)): Spline mode: REQUIRED
2022-03-07 07:30:54,542 INFO [Thread-5] conf.DefaultSplineConfigurer (Logging.scala:logInfo(54)): Lineage Dispatcher: composite
2022-03-07 07:30:54,543 INFO [Thread-5] conf.DefaultSplineConfigurer (Logging.scala:logInfo(54)): Post-Processing Filter: composite
2022-03-07 07:30:54,543 INFO [Thread-5] conf.DefaultSplineConfigurer (Logging.scala:logInfo(54)): Ignore-Write Detection Strategy: default
2022-03-07 07:31:16,699 INFO [dispatcher-event-loop-0] scheduler.JESSchedulerBackend$JESAsSchedulerBackendEndpoint (Logging.scala:logInfo(54)): Registered executor NettyRpcEndpointRef(spark-client://Executor) (172.35.118.231:60862) with ID 1
2022-03-07 07:31:16,704 INFO [spark-listener-group-shared] scheduler.ExecutorEventListener (Logging.scala:logInfo(54)): Got executor added event for 1 @ 1646638276702
2022-03-07 07:31:16,705 INFO [spark-listener-group-shared] glue.ExecutorTaskManagement (Logging.scala:logInfo(54)): connected executor 1
2022-03-07 07:31:16,926 INFO [dispatcher-event-loop-2] storage.BlockManagerMasterEndpoint (Logging.scala:logInfo(54)): Registering block manager 172.35.118.231:42413 with 5.8 GB RAM, BlockManagerId(1, 172.35.118.231, 42413, None)
2022-03-07 07:31:29,863 INFO [Thread-5] registry.AutoDiscoveryPluginRegistry (Logging.scala:logInfo(54)): Loading plugin: class za.co.absa.spline.harvester.plugin.embedded.AvroPlugin
2022-03-07 07:31:29,865 INFO [Thread-5] registry.AutoDiscoveryPluginRegistry (Logging.scala:logInfo(54)): Loading plugin: class za.co.absa.spline.harvester.plugin.embedded.BigQueryPlugin
2022-03-07 07:31:29,866 INFO [Thread-5] registry.AutoDiscoveryPluginRegistry (Logging.scala:logInfo(54)): Loading plugin: class za.co.absa.spline.harvester.plugin.embedded.CassandraPlugin
2022-03-07 07:31:29,866 INFO [Thread-5] registry.AutoDiscoveryPluginRegistry (Logging.scala:logInfo(54)): Loading plugin: class za.co.absa.spline.harvester.plugin.embedded.CobrixPlugin
2022-03-07 07:31:29,867 INFO [Thread-5] registry.AutoDiscoveryPluginRegistry (Logging.scala:logInfo(54)): Loading plugin: class za.co.absa.spline.harvester.plugin.embedded.DataSourceV2Plugin
2022-03-07 07:31:29,868 INFO [Thread-5] registry.AutoDiscoveryPluginRegistry (Logging.scala:logInfo(54)): Loading plugin: class za.co.absa.spline.harvester.plugin.embedded.DatabricksPlugin
2022-03-07 07:31:29,873 INFO [Thread-5] registry.AutoDiscoveryPluginRegistry (Logging.scala:logInfo(54)): Loading plugin: class za.co.absa.spline.harvester.plugin.embedded.ElasticSearchPlugin
2022-03-07 07:31:29,874 INFO [Thread-5] registry.AutoDiscoveryPluginRegistry (Logging.scala:logInfo(54)): Loading plugin: class za.co.absa.spline.harvester.plugin.embedded.ExcelPlugin
2022-03-07 07:31:29,874 INFO [Thread-5] registry.AutoDiscoveryPluginRegistry (Logging.scala:logInfo(54)): Loading plugin: class za.co.absa.spline.harvester.plugin.embedded.JDBCPlugin
2022-03-07 07:31:29,874 INFO [Thread-5] registry.AutoDiscoveryPluginRegistry (Logging.scala:logInfo(54)): Loading plugin: class za.co.absa.spline.harvester.plugin.embedded.KafkaPlugin
2022-03-07 07:31:29,875 INFO [Thread-5] registry.AutoDiscoveryPluginRegistry (Logging.scala:logInfo(54)): Loading plugin: class za.co.absa.spline.harvester.plugin.embedded.MongoPlugin
2022-03-07 07:31:29,875 INFO [Thread-5] registry.AutoDiscoveryPluginRegistry (Logging.scala:logInfo(54)): Loading plugin: class za.co.absa.spline.harvester.plugin.embedded.SQLPlugin
2022-03-07 07:31:29,876 INFO [Thread-5] registry.AutoDiscoveryPluginRegistry (Logging.scala:logInfo(54)): Loading plugin: class za.co.absa.spline.harvester.plugin.embedded.XMLPlugin
2022-03-07 07:31:29,877 INFO [Thread-5] registry.AutoDiscoveryPluginRegistry (Logging.scala:logInfo(54)): Loading plugin: class za.co.absa.spline.harvester.plugin.composite.LogicalRelationPlugin
2022-03-07 07:31:29,877 INFO [Thread-5] registry.AutoDiscoveryPluginRegistry (Logging.scala:logInfo(54)): Loading plugin: class za.co.absa.spline.harvester.plugin.composite.SaveIntoDataSourceCommandPlugin
2022-03-07 07:31:29,893 INFO [Thread-5] dispatcher.HttpLineageDispatcher (Logging.scala:logInfo(54)): Producer URL: http://13.246.13.21:8080/producer
2022-03-07 07:31:29,931 INFO [Thread-5] dispatcher.HttpLineageDispatcher (Logging.scala:logInfo(54)): Using Producer API version: 1.1
2022-03-07 07:31:29,933 INFO [Thread-5] harvester.QueryExecutionEventHandlerFactory (Logging.scala:logInfo(54)): Spline successfully initialized. Spark Lineage tracking is ENABLED.
2022-03-07 07:31:30,241 INFO [Thread-5] spark.SparkContext (Logging.scala:logInfo(54)): Starting job: resolveRelation at DataSource.scala:745
2022-03-07 07:31:30,257 INFO [dag-scheduler-event-loop] scheduler.DAGScheduler (Logging.scala:logInfo(54)): Got job 0 (resolveRelation at DataSource.scala:745) with 1 output partitions
2022-03-07 07:31:30,258 INFO [dag-scheduler-event-loop] scheduler.DAGScheduler (Logging.scala:logInfo(54)): Final stage: ResultStage 0 (resolveRelation at DataSource.scala:745)
2022-03-07 07:31:30,259 INFO [dag-scheduler-event-loop] scheduler.DAGScheduler (Logging.scala:logInfo(54)): Parents of final stage: List()
2022-03-07 07:31:30,260 INFO [dag-scheduler-event-loop] scheduler.DAGScheduler (Logging.scala:logInfo(54)): Missing parents: List()

Listing 2. Extract of the AWS Glue log showing Spline Startup
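
To check a longer log export for any agent activity after startup, a small filter can help. This is a rough sketch: the marker strings are heuristics taken from the logger names visible in Listing 2, not an official list of Spline log categories, and the function name is hypothetical.

```python
# Sketch only: the markers below are heuristics based on Listing 2.
INIT_MARKER = "Spline successfully initialized"
SPLINE_MARKERS = ("harvester.", "LineageDispatcher", "za.co.absa.spline")


def spline_lines_after_init(log_lines):
    """Return Spline-related log lines that appear after the init message."""
    seen_init = False
    hits = []
    for line in log_lines:
        if not seen_init:
            # Skip everything up to and including the init message itself.
            seen_init = INIT_MARKER in line
            continue
        if any(marker in line for marker in SPLINE_MARKERS):
            hits.append(line)
    return hits


# Three lines copied from Listing 2.
SAMPLE = [
    "2022-03-07 07:31:29,931 INFO [Thread-5] dispatcher.HttpLineageDispatcher (Logging.scala:logInfo(54)): Using Producer API version: 1.1",
    "2022-03-07 07:31:29,933 INFO [Thread-5] harvester.QueryExecutionEventHandlerFactory (Logging.scala:logInfo(54)): Spline successfully initialized. Spark Lineage tracking is ENABLED.",
    "2022-03-07 07:31:30,241 INFO [Thread-5] spark.SparkContext (Logging.scala:logInfo(54)): Starting job: resolveRelation at DataSource.scala:745",
]

if __name__ == "__main__":
    hits = spline_lines_after_init(SAMPLE)
    print(len(hits), "Spline-related lines after initialization")
```

For the extract above the filter finds nothing after the initialization message, which matches the observation that the agent starts up but never reports a lineage event.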

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Script generated for node xneelo_aws_test
xneelo_aws_test_node1646313615566 = glueContext.create_dynamic_frame.from_options(
    connection_type="custom.jdbc",
    connection_options={
        "tableName": "film",
        "dbTable": "film",
        "connectionName": "xneelo2",
    },
    transformation_ctx="xneelo_aws_test_node1646313615566",
)

# Script generated for node ApplyMapping
ApplyMapping_node1631618598220 = ApplyMapping.apply(
    frame=xneelo_aws_test_node1646313615566,
    mappings=[],
    transformation_ctx="ApplyMapping_node1631618598220",
)

# Script generated for node Amazon S3
AmazonS3_node1631618601508 = glueContext.getSink(
    path="s3://dirk-ct-bucket/adv_works/",
    connection_type="s3",
    updateBehavior="UPDATE_IN_DATABASE",
    partitionKeys=[],
    enableUpdateCatalog=True,
    transformation_ctx="AmazonS3_node1631618601508",
)
AmazonS3_node1631618601508.setCatalogInfo(
    catalogDatabase="xneelo", catalogTableName="skalia_test"
)
AmazonS3_node1631618601508.setFormat("json")
AmazonS3_node1631618601508.writeFrame(ApplyMapping_node1631618598220)
job.commit()

Listing 3. AWS Glue Python Code

Answered by wajda

wajda (Maintainer) · 2022-03-09T17:00:37Z

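It could be caused by the internal implementation of the writeFrame method. If it writes through the RDD API rather than the DataFrame API, lineage won't be captured, because the agent relies on Spark SQL query execution events. See #33.
Also see #394 for a troubleshooting scenario for similar issues.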
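
One way to test this hypothesis is to bypass writeFrame and write through the Spark SQL DataFrame writer instead. The helper below is only a sketch: the function name is hypothetical, and nothing in this thread confirms that it produces lineage on Glue. It assumes that the DynamicFrame's `toDF()` conversion followed by a regular DataFrame write triggers the query execution listener, and it does not replicate the Glue catalog update that the original sink performs. If the Glue JDBC read is also RDD-based, the source side of the lineage may still be incomplete.

```python
# Sketch only: routes the write through the DataFrame API so that Spark SQL
# query execution listeners get a chance to observe it.
def write_via_dataframe(dynamic_frame, path, mode="append"):
    """Convert a Glue DynamicFrame to a DataFrame and write it as JSON."""
    df = dynamic_frame.toDF()  # DynamicFrame -> pyspark.sql.DataFrame
    df.write.mode(mode).json(path)
    return df


# In Listing 3 this would stand in for the getSink/writeFrame block, e.g.:
# write_via_dataframe(ApplyMapping_node1631618598220, "s3://dirk-ct-bucket/adv_works/")
```

If lineage shows up with this variant but not with writeFrame, that would point to the RDD-based write as the cause.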
