Merge pull request #174 from marklogic/release/2.2.0
Release/2.2.0
rjrudin authored Feb 22, 2024
2 parents 885d300 + 8f639d7 commit f1bbf9c
Showing 186 changed files with 6,981 additions and 1,291 deletions.
3 changes: 3 additions & 0 deletions .env
@@ -0,0 +1,3 @@
# Defines environment variables for docker-compose.
# Can be overridden via e.g. `MARKLOGIC_TAG=latest-10.0 docker-compose up -d --build`.
MARKLOGIC_TAG=11.1.0-centos-1.1.0
139 changes: 123 additions & 16 deletions CONTRIBUTING.md
@@ -1,12 +1,14 @@
This is an evolving guide for developers interested in developing and testing this project. This guide assumes that you
have cloned this repository to your local workstation.
This guide covers how to develop and test this project. It assumes that you have cloned this repository to your local
workstation.

# Do this first!
Due to the use of the Sonar plugin for Gradle, you must use Java 11 or higher for developing and testing the project.
The `build.gradle` file for this project ensures that the connector is built to run on Java 8 or higher.

In order to develop and/or test the connector, or to try out the PySpark instructions below, you first
need to deploy the test application in this project to MarkLogic. You can do so either on your own installation of
MarkLogic, or you can use `docker-compose` to install MarkLogic, optionally as a 3-node cluster with a load balancer
in front of it.
# Setup

To begin, you need to deploy the test application in this project to MarkLogic. You can do so either on your own
installation of MarkLogic, or you can use `docker-compose` to install MarkLogic, optionally as a 3-node cluster with
a load balancer in front of it.

## Installing MarkLogic with docker-compose

@@ -22,9 +24,9 @@ The above will result in a new MarkLogic instance with a single node.
Alternatively, if you would like to test against a 3-node MarkLogic cluster with a load balancer in front of it,
run `docker-compose -f docker-compose-3nodes.yaml up -d --build`.

### Accessing MarkLogic logs in Grafana
## Accessing MarkLogic logs in Grafana

This project's `docker-compose.yaml` file includes
This project's `docker-compose-3nodes.yaml` file includes
[Grafana, Loki, and promtail services](https://grafana.com/docs/loki/latest/clients/promtail/) for the primary reason of
collecting MarkLogic log files and allowing them to be viewed and searched via Grafana.

@@ -75,6 +77,46 @@ You can then run the tests from within the Docker environment via the following
./gradlew dockerTest


## Generating code quality reports with SonarQube

In order to use SonarQube, you must first run this project's `docker-compose.yaml` file via Docker and have
the services in that file running.

To configure the SonarQube service, perform the following steps:

1. Go to http://localhost:9000.
2. Log in as admin/admin. SonarQube will ask you to change this password; you can choose whatever you want ("password" works).
3. Click on "Create project manually".
4. Enter "marklogic-spark" for the Project Name; use that as the Project Key too.
5. Enter "develop" as the main branch name.
6. Click on "Next".
7. Click on "Use the global setting" and then "Create project".
8. On the "Analysis Method" page, click on "Locally".
9. In the "Provide a token" panel, click on "Generate". Copy the token.
10. Add `systemProp.sonar.token=your token pasted here` to `gradle-local.properties` in the root of your project, creating
that file if it does not exist yet.

To run a SonarQube analysis, run the following Gradle tasks, which run all the tests with code coverage and then
generate a quality report:

./gradlew test sonar

If you do not add `systemProp.sonar.token` to your `gradle-local.properties` file, you can specify the token via the
following:

./gradlew test sonar -Dsonar.token=paste your token here

When that completes, you will see a line like this near the end of the log output:

ANALYSIS SUCCESSFUL, you can find the results at: http://localhost:9000/dashboard?id=marklogic-spark

Click on that link. If it's the first time you've run the report, you'll see all issues. If you've run the report
before, then SonarQube will show "New Code" by default. That's handy, as you can use that to quickly see any issues
you've introduced on the feature branch you're working on. You can then click on "Overall Code" to see all issues.

Note that if you only need results on code smells and vulnerabilities, you can repeatedly run `./gradlew sonar`
without having to re-run the tests.

# Testing with PySpark

The documentation for this project
@@ -89,19 +131,16 @@ This will produce a single jar file for the connector in the `./build/libs` directory.

You can then launch PySpark with the connector available via:

pyspark --jars build/libs/marklogic-spark-connector-2.1.0.jar
pyspark --jars build/libs/marklogic-spark-connector-2.2-SNAPSHOT.jar

The command below is an example of loading data from the test application deployed via the instructions at the top of
this page.

```
df = spark.read.format("com.marklogic.spark")\
.option("spark.marklogic.client.host", "localhost")\
.option("spark.marklogic.client.port", "8016")\
.option("spark.marklogic.client.username", "admin")\
.option("spark.marklogic.client.password", "admin")\
.option("spark.marklogic.client.authType", "digest")\
df = spark.read.format("marklogic")\
.option("spark.marklogic.client.uri", "spark-test-user:spark@localhost:8016")\
.option("spark.marklogic.read.opticQuery", "op.fromView('Medical', 'Authors')")\
.option("spark.marklogic.read.numPartitions", 8)\
.load()
```

@@ -114,6 +153,74 @@ You now have a Spark dataframe - try some commands out on it:
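For example, here are a few standard DataFrame operations to start with (the counts and rows you see depend on the
test data you deployed):

```
df.count()
df.show(2)
df.printSchema()
```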
Check out the [PySpark docs](https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_df.html) for
more commands you can try out.

You can query for documents as well - the following shows a simple example along with a technique for converting the
binary content of each document into a string of JSON.

```
import json
from pyspark.sql import functions as F
df = spark.read.format("marklogic")\
.option("spark.marklogic.client.uri", "spark-test-user:spark@localhost:8016")\
.option("spark.marklogic.read.documents.collections", "author")\
.load()
df.show()
df2 = df.select(F.col("content").cast("string"))
df2.head()
json.loads(df2.head()['content'])
```
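
If you want Python dictionaries for every matching document rather than just the first row, a minimal sketch
(assuming the same `df2` as above and a result set small enough to collect on the driver):

```
docs = [json.loads(row['content']) for row in df2.collect()]
print(len(docs))
print(docs[0])
```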


# Testing against a local Spark cluster

When you run PySpark, it will create its own Spark cluster. If you'd like to try against a separate Spark cluster
that still runs on your local machine, perform the following steps:

1. Use [sdkman to install Spark](https://sdkman.io/sdks#spark). Run `sdk install spark 3.4.1` since we are currently
building against Spark 3.4.1.
2. `cd ~/.sdkman/candidates/spark/current/sbin`, which is where sdkman will install Spark.
3. Run `./start-master.sh` to start a master Spark node.
4. `cd ../logs` and open the master log file that was created to find the address for the master node. It will be in a
log message similar to `Starting Spark master at spark://NYWHYC3G0W:7077` - copy that address at the end of the message.
5. `cd ../sbin`.
6. Run `./start-worker.sh spark://NYWHYC3G0W:7077`, changing that address as necessary.

You can of course simplify the above steps by adding `SPARK_HOME` to your environment and adding `$SPARK_HOME/sbin`
to your path, which avoids having to change directories. The log files in `./logs` are useful to tail as well.
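For example, a minimal setup assuming the sdkman install location used above:

    export SPARK_HOME=~/.sdkman/candidates/spark/current
    export PATH=$SPARK_HOME/sbin:$PATH

You can then run `start-master.sh` and `start-worker.sh` from any directory.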

The Spark master GUI is at <http://localhost:8080>. You can use this to view details about jobs running in the cluster.

Now that you have a Spark cluster running, you just need to tell PySpark to connect to it:

pyspark --master spark://NYWHYC3G0W:7077 --jars build/libs/marklogic-spark-connector-2.2-SNAPSHOT.jar

You can then run the same commands as shown in the PySpark section above. The Spark master GUI will allow you to
examine details of each of the commands that you run.

The above approach is ultimately a sanity check to ensure that the connector works properly with a separate cluster
process.

## Testing spark-submit

Once you have the above Spark cluster running, you can test out
[spark-submit](https://spark.apache.org/docs/latest/submitting-applications.html) which enables submitting a program
and an optional set of jars to a Spark cluster for execution.

You will need the connector jar available, so run `./gradlew clean shadowJar` if you have not already.

You can then run a test Python program in this repository via the following (again, change the master address as
needed). Note that you run this outside of PySpark; `spark-submit` is available after installing PySpark:

spark-submit --master spark://NYWHYC3G0W:7077 --jars build/libs/marklogic-spark-connector-2.2-SNAPSHOT.jar src/test/python/test_program.py

You can also test a Java program. To do so, first move the `com.marklogic.spark.TestProgram` class from `src/test/java`
to `src/main/java`. Then run `./gradlew clean shadowJar` to rebuild the connector jar. Then run the following:

spark-submit --master spark://NYWHYC3G0W:7077 --class com.marklogic.spark.TestProgram build/libs/marklogic-spark-connector-2.2-SNAPSHOT.jar

Be sure to move `TestProgram` back to `src/test/java` when you are done.

# Testing the documentation locally

See the section with the same name in the
107 changes: 61 additions & 46 deletions Jenkinsfile
@@ -1,20 +1,32 @@
@Library('shared-libraries') _

def runtests(String mlVersionType, String mlVersion, String javaVersion){
copyRPM mlVersionType,mlVersion
setUpML '$WORKSPACE/xdmp/src/Mark*.rpm'
def runtests(String javaVersion){
sh label:'test', script: '''#!/bin/bash
export JAVA_HOME=$'''+javaVersion+'''
export GRADLE_USER_HOME=$WORKSPACE/$GRADLE_DIR
export PATH=$GRADLE_USER_HOME:$JAVA_HOME/bin:$PATH
cd marklogic-spark-connector
echo "mlPassword=admin" > gradle-local.properties
echo "Waiting for MarkLogic server to initialize."
sleep 30s
./gradlew -i mlDeploy
echo "Loading data a second time to try to avoid Optic bug with duplicate rows being returned."
./gradlew -i mlLoadData
./gradlew test || true
'''
junit '**/build/**/*.xml'
}

def runSonarScan(String javaVersion){
sh label:'test', script: '''#!/bin/bash
export JAVA_HOME=$'''+javaVersion+'''
export GRADLE_USER_HOME=$WORKSPACE/$GRADLE_DIR
export PATH=$GRADLE_USER_HOME:$JAVA_HOME/bin:$PATH
cd marklogic-spark-connector
./gradlew sonar -Dsonar.projectKey='marklogic_marklogic-spark-connector_AY1bXn6J_50_odbCDKMX' -Dsonar.projectName='ML-DevExp-marklogic-spark-connector' || true
'''
}

pipeline{
agent none
triggers{
@@ -30,16 +42,39 @@ pipeline{
environment{
JAVA8_HOME_DIR="/home/builder/java/openjdk-1.8.0-262"
JAVA11_HOME_DIR="/home/builder/java/jdk-11.0.2"
JAVA17_HOME_DIR="/home/builder/java/jdk-17.0.2"
GRADLE_DIR =".gradle"
DMC_USER = credentials('MLBUILD_USER')
DMC_PASSWORD = credentials('MLBUILD_PASSWORD')
}
stages{
stage('tests'){
environment{
scannerHome = tool 'SONAR_Progress'
}
agent {label 'devExpLinuxPool'}
steps{
runtests('Latest','11','JAVA8_HOME_DIR')
sh label:'mlsetup', script: '''#!/bin/bash
echo "Removing any running MarkLogic server and clean up MarkLogic data directory"
sudo /usr/local/sbin/mladmin remove
sudo /usr/local/sbin/mladmin cleandata
cd marklogic-spark-connector
mkdir -p docker/marklogic/logs
docker-compose down -v || true
docker-compose up -d --build
'''
runtests('JAVA11_HOME_DIR')
withSonarQubeEnv('SONAR_Progress') {
runSonarScan('JAVA11_HOME_DIR')
}
}
post{
always{
sh label:'mlcleanup', script: '''#!/bin/bash
cd marklogic-spark-connector
docker-compose down -v || true
sudo /usr/local/sbin/mladmin delete $WORKSPACE/marklogic-spark-connector/docker/marklogic/logs/
'''
}
}
}
stage('publish'){
@@ -49,7 +84,7 @@
}
steps{
sh label:'publish', script: '''#!/bin/bash
export JAVA_HOME=$JAVA_HOME_DIR
export JAVA_HOME=$JAVA11_HOME_DIR
export GRADLE_USER_HOME=$WORKSPACE/$GRADLE_DIR
export PATH=$GRADLE_USER_HOME:$JAVA_HOME/bin:$PATH
cp ~/.gradle/gradle.properties $GRADLE_USER_HOME;
@@ -59,55 +94,35 @@
}
}
stage('regressions'){
agent {label 'devExpLinuxPool'}
when{
allOf{
branch 'develop'
expression {return params.regressions}
}
}
parallel{
stage('11-nightly-java11'){
agent {label 'devExpLinuxPool'}
steps{
runtests('Latest','11','JAVA11_HOME_DIR')
}
}
stage('11-nightly-java17'){
agent {label 'devExpLinuxPool'}
steps{
runtests('Latest','11','JAVA17_HOME_DIR')
}
}
stage('10.0-9.5-java11'){
agent {label 'devExpLinuxPool'}
steps{
runtests('Release','10.0-9.5','JAVA11_HOME_DIR')
}
}
stage('10.0-9.5-nightly-java17'){
agent {label 'devExpLinuxPool'}
steps{
runtests('Release','10.0-9.5','JAVA17_HOME_DIR')
}
}
stage('11.0.2-java8-spark3.4'){
agent {label 'devExpLinuxPool'}
steps{
copyRPM 'Release','11.0.2'
setUpML '$WORKSPACE/xdmp/src/Mark*.rpm'
sh label:'test', script: '''#!/bin/bash
export JAVA_HOME=$JAVA8_HOME_DIR
export GRADLE_USER_HOME=$WORKSPACE/$GRADLE_DIR
export PATH=$GRADLE_USER_HOME:$JAVA_HOME/bin:$PATH
cd marklogic-spark-connector
echo "mlPassword=admin" > gradle-local.properties
./gradlew -i mlDeploy
./gradlew test -PsparkVersion="3.4.0" || true
steps{
sh label:'mlsetup', script: '''#!/bin/bash
echo "Removing any running MarkLogic server and clean up MarkLogic data directory"
sudo /usr/local/sbin/mladmin remove
sudo /usr/local/sbin/mladmin cleandata
cd marklogic-spark-connector
mkdir -p docker/marklogic/logs
docker-compose down -v || true
MARKLOGIC_TAG=latest-10.0 docker-compose up -d --build
'''
junit '**/build/**/*.xml'
}
runtests('JAVA11_HOME_DIR')
}
post{
always{
sh label:'mlcleanup', script: '''#!/bin/bash
cd marklogic-spark-connector
docker-compose down -v || true
sudo /usr/local/sbin/mladmin delete $WORKSPACE/marklogic-spark-connector/docker/marklogic/logs/
'''
}
}

}
}
}