Merge pull request #174 from marklogic/release/2.2.0
Release/2.2.0
rjrudin authored Feb 22, 2024
2 parents 885d300 + 8f639d7 commit f1bbf9c
Showing 186 changed files with 6,981 additions and 1,291 deletions.
3 changes: 3 additions & 0 deletions .env
@@ -0,0 +1,3 @@
# Defines environment variables for docker-compose.
# Can be overridden via e.g. `MARKLOGIC_TAG=latest-10.0 docker-compose up -d --build`.
MARKLOGIC_TAG=11.1.0-centos-1.1.0
139 changes: 123 additions & 16 deletions CONTRIBUTING.md
@@ -1,12 +1,14 @@
This is an evolving guide for developers interested in developing and testing this project. This guide assumes that you
have cloned this repository to your local workstation.
This guide covers how to develop and test this project. It assumes that you have cloned this repository to your local
workstation.

# Do this first!
Due to the use of the Sonar plugin for Gradle, you must use Java 11 or higher for developing and testing the project.
The `build.gradle` file for this project ensures that the connector is built to run on Java 8 or higher.

In order to develop and/or test the connector, or to try out the PySpark instructions below, you first
need to deploy the test application in this project to MarkLogic. You can do so either on your own installation of
MarkLogic, or you can use `docker-compose` to install MarkLogic, optionally as a 3-node cluster with a load balancer
in front of it.
# Setup

To begin, you need to deploy the test application in this project to MarkLogic. You can do so either on your own
installation of MarkLogic, or you can use `docker-compose` to install MarkLogic, optionally as a 3-node cluster with
a load balancer in front of it.

## Installing MarkLogic with docker-compose

@@ -22,9 +24,9 @@ The above will result in a new MarkLogic instance with a single node.
Alternatively, if you would like to test against a 3-node MarkLogic cluster with a load balancer in front of it,
run `docker-compose -f docker-compose-3nodes.yaml up -d --build`.

### Accessing MarkLogic logs in Grafana
## Accessing MarkLogic logs in Grafana

This project's `docker-compose.yaml` file includes
This project's `docker-compose-3nodes.yaml` file includes
[Grafana, Loki, and promtail services](https://grafana.com/docs/loki/latest/clients/promtail/) for the primary reason of
collecting MarkLogic log files and allowing them to be viewed and searched via Grafana.

@@ -75,6 +77,46 @@ You can then run the tests from within the Docker environment via the following
./gradlew dockerTest


## Generating code quality reports with SonarQube

In order to use SonarQube, you must first run this project's `docker-compose.yaml` file via Docker and have
the services in that file running.

To configure the SonarQube service, perform the following steps:

1. Go to http://localhost:9000.
2. Log in as admin/admin. SonarQube will ask you to change this password; you can choose whatever you want ("password" works).
3. Click on "Create project manually".
4. Enter "marklogic-spark" for the Project Name; use that as the Project Key too.
5. Enter "develop" as the main branch name.
6. Click on "Next".
7. Click on "Use the global setting" and then "Create project".
8. On the "Analysis Method" page, click on "Locally".
9. In the "Provide a token" panel, click on "Generate". Copy the token.
10. Add `systemProp.sonar.token=your token pasted here` to `gradle-local.properties` in the root of your project, creating
that file if it does not exist yet.

To run a SonarQube analysis, run the following Gradle tasks, which run all the tests with code coverage and then
generate a quality report:

./gradlew test sonar

If you do not add `systemProp.sonar.token` to your `gradle-local.properties` file, you can specify the token via the
following:

./gradlew test sonar -Dsonar.token=paste your token here

When that completes, you will see a line like this near the end of the log output:

ANALYSIS SUCCESSFUL, you can find the results at: http://localhost:9000/dashboard?id=marklogic-spark

Click on that link. If it's the first time you've run the report, you'll see all issues. If you've run the report
before, then SonarQube will show "New Code" by default. That's handy, as you can use that to quickly see any issues
you've introduced on the feature branch you're working on. You can then click on "Overall Code" to see all issues.

Note that if you only need results on code smells and vulnerabilities, you can repeatedly run `./gradlew sonar`
without having to re-run the tests.

# Testing with PySpark

The documentation for this project
@@ -89,19 +131,16 @@ This will produce a single jar file for the connector in the `./build/libs` directory.

You can then launch PySpark with the connector available via:

pyspark --jars build/libs/marklogic-spark-connector-2.1.0.jar
pyspark --jars build/libs/marklogic-spark-connector-2.2-SNAPSHOT.jar

The command below is an example of loading data from the test application deployed via the instructions at the top of
this page.

```
df = spark.read.format("com.marklogic.spark")\
.option("spark.marklogic.client.host", "localhost")\
.option("spark.marklogic.client.port", "8016")\
.option("spark.marklogic.client.username", "admin")\
.option("spark.marklogic.client.password", "admin")\
.option("spark.marklogic.client.authType", "digest")\
df = spark.read.format("marklogic")\
.option("spark.marklogic.client.uri", "spark-test-user:spark@localhost:8016")\
.option("spark.marklogic.read.opticQuery", "op.fromView('Medical', 'Authors')")\
.option("spark.marklogic.read.numPartitions", 8)\
.load()
```

@@ -114,6 +153,74 @@ You now have a Spark dataframe - try some commands out on it:
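For example, here are a few standard DataFrame operations to start with (the counts and rows you see depend on the
test data you deployed):

```
df.count()
df.show(2)
df.printSchema()
```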
Check out the [PySpark docs](https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_df.html) for
more commands you can try out.

You can query for documents as well - the following shows a simple example along with a technique for converting the
binary content of each document into a string of JSON.

```
import json
from pyspark.sql import functions as F
df = spark.read.format("marklogic")\
.option("spark.marklogic.client.uri", "spark-test-user:spark@localhost:8016")\
.option("spark.marklogic.read.documents.collections", "author")\
.load()
df.show()
df2 = df.select(F.col("content").cast("string"))
df2.head()
json.loads(df2.head()['content'])
```
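
If you want Python dictionaries for every matching document rather than just the first row, a minimal sketch
(assuming the same `df2` as above and a result set small enough to collect on the driver):

```
docs = [json.loads(row['content']) for row in df2.collect()]
print(len(docs))
print(docs[0])
```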


# Testing against a local Spark cluster

When you run PySpark, it will create its own Spark cluster. If you'd like to try against a separate Spark cluster
that still runs on your local machine, perform the following steps:

1. Use [sdkman to install Spark](https://sdkman.io/sdks#spark). Run `sdk install spark 3.4.1` since we are currently
building against Spark 3.4.1.
2. `cd ~/.sdkman/candidates/spark/current/sbin`, which is where sdkman will install Spark.
3. Run `./start-master.sh` to start a master Spark node.
4. `cd ../logs` and open the master log file that was created to find the address for the master node. It will be in a
log message similar to `Starting Spark master at spark://NYWHYC3G0W:7077` - copy that address at the end of the message.
5. `cd ../sbin`.
6. Run `./start-worker.sh spark://NYWHYC3G0W:7077`, changing that address as necessary.

You can of course simplify the above steps by adding `SPARK_HOME` to your environment and adding `$SPARK_HOME/sbin`
to your path, which avoids having to change directories. The log files in `./logs` are useful to tail as well.
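For example, a minimal setup assuming the sdkman install location used above:

    export SPARK_HOME=~/.sdkman/candidates/spark/current
    export PATH=$SPARK_HOME/sbin:$PATH

You can then run `start-master.sh` and `start-worker.sh` from any directory.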

The Spark master GUI is at <http://localhost:8080>. You can use this to view details about jobs running in the cluster.

Now that you have a Spark cluster running, you just need to tell PySpark to connect to it:

pyspark --master spark://NYWHYC3G0W:7077 --jars build/libs/marklogic-spark-connector-2.2-SNAPSHOT.jar

You can then run the same commands as shown in the PySpark section above. The Spark master GUI will allow you to
examine details of each of the commands that you run.

The above approach is ultimately a sanity check to ensure that the connector works properly with a separate cluster
process.

## Testing spark-submit

Once you have the above Spark cluster running, you can test out
[spark-submit](https://spark.apache.org/docs/latest/submitting-applications.html) which enables submitting a program
and an optional set of jars to a Spark cluster for execution.

You will need the connector jar available, so run `./gradlew clean shadowJar` if you have not already.

You can then run a test Python program in this repository via the following (again, change the master address as
needed). Note that you run this outside of PySpark; `spark-submit` is available after installing PySpark:

spark-submit --master spark://NYWHYC3G0W:7077 --jars build/libs/marklogic-spark-connector-2.2-SNAPSHOT.jar src/test/python/test_program.py

You can also test a Java program. To do so, first move the `com.marklogic.spark.TestProgram` class from `src/test/java`
to `src/main/java`. Then run `./gradlew clean shadowJar` to rebuild the connector jar. Then run the following:

spark-submit --master spark://NYWHYC3G0W:7077 --class com.marklogic.spark.TestProgram build/libs/marklogic-spark-connector-2.2-SNAPSHOT.jar

Be sure to move `TestProgram` back to `src/test/java` when you are done.

# Testing the documentation locally

See the section with the same name in the
107 changes: 61 additions & 46 deletions Jenkinsfile
@@ -1,20 +1,32 @@
@Library('shared-libraries') _

def runtests(String mlVersionType, String mlVersion, String javaVersion){
copyRPM mlVersionType,mlVersion
setUpML '$WORKSPACE/xdmp/src/Mark*.rpm'
def runtests(String javaVersion){
sh label:'test', script: '''#!/bin/bash
export JAVA_HOME=$'''+javaVersion+'''
export GRADLE_USER_HOME=$WORKSPACE/$GRADLE_DIR
export PATH=$GRADLE_USER_HOME:$JAVA_HOME/bin:$PATH
cd marklogic-spark-connector
echo "mlPassword=admin" > gradle-local.properties
echo "Waiting for MarkLogic server to initialize."
sleep 30s
./gradlew -i mlDeploy
echo "Loading data a second time to try to avoid Optic bug with duplicate rows being returned."
./gradlew -i mlLoadData
./gradlew test || true
'''
junit '**/build/**/*.xml'
}

def runSonarScan(String javaVersion){
sh label:'test', script: '''#!/bin/bash
export JAVA_HOME=$'''+javaVersion+'''
export GRADLE_USER_HOME=$WORKSPACE/$GRADLE_DIR
export PATH=$GRADLE_USER_HOME:$JAVA_HOME/bin:$PATH
cd marklogic-spark-connector
./gradlew sonar -Dsonar.projectKey='marklogic_marklogic-spark-connector_AY1bXn6J_50_odbCDKMX' -Dsonar.projectName='ML-DevExp-marklogic-spark-connector' || true
'''
}

pipeline{
agent none
triggers{
@@ -30,16 +42,39 @@ pipeline{
environment{
JAVA8_HOME_DIR="/home/builder/java/openjdk-1.8.0-262"
JAVA11_HOME_DIR="/home/builder/java/jdk-11.0.2"
JAVA17_HOME_DIR="/home/builder/java/jdk-17.0.2"
GRADLE_DIR =".gradle"
DMC_USER = credentials('MLBUILD_USER')
DMC_PASSWORD = credentials('MLBUILD_PASSWORD')
}
stages{
stage('tests'){
environment{
scannerHome = tool 'SONAR_Progress'
}
agent {label 'devExpLinuxPool'}
steps{
runtests('Latest','11','JAVA8_HOME_DIR')
sh label:'mlsetup', script: '''#!/bin/bash
echo "Removing any running MarkLogic server and clean up MarkLogic data directory"
sudo /usr/local/sbin/mladmin remove
sudo /usr/local/sbin/mladmin cleandata
cd marklogic-spark-connector
mkdir -p docker/marklogic/logs
docker-compose down -v || true
docker-compose up -d --build
'''
runtests('JAVA11_HOME_DIR')
withSonarQubeEnv('SONAR_Progress') {
runSonarScan('JAVA11_HOME_DIR')
}
}
post{
always{
sh label:'mlcleanup', script: '''#!/bin/bash
cd marklogic-spark-connector
docker-compose down -v || true
sudo /usr/local/sbin/mladmin delete $WORKSPACE/marklogic-spark-connector/docker/marklogic/logs/
'''
}
}
}
stage('publish'){
@@ -49,7 +84,7 @@
}
steps{
sh label:'publish', script: '''#!/bin/bash
export JAVA_HOME=$JAVA_HOME_DIR
export JAVA_HOME=$JAVA11_HOME_DIR
export GRADLE_USER_HOME=$WORKSPACE/$GRADLE_DIR
export PATH=$GRADLE_USER_HOME:$JAVA_HOME/bin:$PATH
cp ~/.gradle/gradle.properties $GRADLE_USER_HOME;
@@ -59,55 +94,35 @@
}
}
stage('regressions'){
agent {label 'devExpLinuxPool'}
when{
allOf{
branch 'develop'
expression {return params.regressions}
}
}
parallel{
stage('11-nightly-java11'){
agent {label 'devExpLinuxPool'}
steps{
runtests('Latest','11','JAVA11_HOME_DIR')
}
}
stage('11-nightly-java17'){
agent {label 'devExpLinuxPool'}
steps{
runtests('Latest','11','JAVA17_HOME_DIR')
}
}
stage('10.0-9.5-java11'){
agent {label 'devExpLinuxPool'}
steps{
runtests('Release','10.0-9.5','JAVA11_HOME_DIR')
}
}
stage('10.0-9.5-nightly-java17'){
agent {label 'devExpLinuxPool'}
steps{
runtests('Release','10.0-9.5','JAVA17_HOME_DIR')
}
}
stage('11.0.2-java8-spark3.4'){
agent {label 'devExpLinuxPool'}
steps{
copyRPM 'Release','11.0.2'
setUpML '$WORKSPACE/xdmp/src/Mark*.rpm'
sh label:'test', script: '''#!/bin/bash
export JAVA_HOME=$JAVA8_HOME_DIR
export GRADLE_USER_HOME=$WORKSPACE/$GRADLE_DIR
export PATH=$GRADLE_USER_HOME:$JAVA_HOME/bin:$PATH
cd marklogic-spark-connector
echo "mlPassword=admin" > gradle-local.properties
./gradlew -i mlDeploy
./gradlew test -PsparkVersion="3.4.0" || true
steps{
sh label:'mlsetup', script: '''#!/bin/bash
echo "Removing any running MarkLogic server and clean up MarkLogic data directory"
sudo /usr/local/sbin/mladmin remove
sudo /usr/local/sbin/mladmin cleandata
cd marklogic-spark-connector
mkdir -p docker/marklogic/logs
docker-compose down -v || true
MARKLOGIC_TAG=latest-10.0 docker-compose up -d --build
'''
junit '**/build/**/*.xml'
}
runtests('JAVA11_HOME_DIR')
}
post{
always{
sh label:'mlcleanup', script: '''#!/bin/bash
cd marklogic-spark-connector
docker-compose down -v || true
sudo /usr/local/sbin/mladmin delete $WORKSPACE/marklogic-spark-connector/docker/marklogic/logs/
'''
}
}

}
}
}