Apache Spark - V 4.2

Apache Spark Analytics Engine

The Spark datastack is used to configure and deploy an Apache Spark analytics engine into dataspaces and to provide Spark analytics functionality in the CloudOne environment.

Apache Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for incremental computation and stream processing.

Kubernetes Operator for Apache Spark

Kubernetes Operator for Apache Spark aims to make specifying and running Spark applications as easy and idiomatic as running other workloads on Kubernetes. It uses Kubernetes custom resources for specifying, running, and surfacing status of Spark applications.

The Kubernetes Operator for Apache Spark currently supports the following list of features:

  • Supports Spark 2.3 and up.
  • Enables declarative application specification and management of applications through custom resources.
  • Automatically runs spark-submit on behalf of users for each SparkApplication eligible for submission.
  • Provides native cron support for running scheduled applications.
  • Supports customization of Spark pods beyond what Spark natively is able to do through the mutating admission webhook, e.g., mounting ConfigMaps and volumes, and setting pod affinity/anti-affinity.
  • Supports automatic application re-submission for updated SparkApplication objects with updated specification.
  • Supports automatic application restart with a configurable restart policy.
  • Supports automatic retries of failed submissions with optional linear back-off.

More information about the Kubernetes Operator for Apache Spark can be found here.
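
For orientation, the sketch below shows the general shape of a SparkApplication custom resource as accepted by the operator. The name, namespace, image, and application file are placeholders for illustration only; in this environment the actual resource is generated from the Helm chart described in the Provisioning section.

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi                      # placeholder application name
  namespace: <dataspace namespace>    # placeholder namespace
spec:
  type: Scala
  mode: cluster
  image: apache/spark:v3.1.3
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.12-3.1.3.jar"
  sparkVersion: "3.1.3"
  driver:
    cores: 1
    memory: 512m
    serviceAccount: spark
  executor:
    cores: 1
    instances: 3
    memory: 512m
  restartPolicy:
    type: Never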

Provisioning

When provisioning a Spark datastack, the CI/CD pipeline generates a Helm chart to provision the Spark cluster and its other components. The Helm chart includes "SparkApplication" and "Mapping" custom resources, YAML structures passed to the Spark Operator that define how the new Spark cluster is provisioned.
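
The "Mapping" custom resource is the Ambassador routing resource suggested by templates/ambassador.yaml in the generated chart. A minimal sketch, assuming it routes external traffic to the Spark UI, might look like the following; the name, route, service, port, and API version are illustrative placeholders, and the actual definition comes from the generated chart.

apiVersion: getambassador.io/v2
kind: Mapping
metadata:
  name: <appcode>-spark-ui            # placeholder name
spec:
  prefix: /<appcode>/spark-ui/        # placeholder route
  service: <spark UI service>:4040    # 4040 is the default Spark UI port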

Getting Started in the Azure DevOps environment

Refer to the following link to learn about getting started in the Azure DevOps environment: Getting Started in Azure DevOps Environment

Repository Structure

All of the information that instructs the CI/CD pipeline how to provision the Spark cluster, and to what specifications, is contained within a Git repository.

.
├── azure-pipelines.yml
├── <appcode>-spark
│   ├── Chart.yaml
│   ├── templates
│   │   ├── ambassador.yaml
│   │   ├── _annotations.tpl
│   │   ├── _helpers.tpl
│   │   ├── NOTES.txt
│   │   └── spark-application.yaml
│   ├── values.dataspace.yaml
│   └── values.yaml
└── README.md

The Git repository for the Spark cluster contains the aforementioned Helm chart directory. The files under this directory are automatically generated with some default configuration for the Spark cluster. These files can be modified based on the Spark cluster's requirements.

The Helm chart directory contains values.dataspace.yaml, which should be used to override the default values stored in values.yaml. If more advanced configuration is required, other files within the templates directory can be modified.
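
For example, to change only the executor count and memory from the defaults shown later in this document, a hypothetical override in values.dataspace.yaml needs to repeat just the keys being changed; Helm merges the override on top of the defaults in values.yaml:

# values.dataspace.yaml (hypothetical override)
sparkApplication:
  executor:
    instances: 5        # overrides the default of 3 in values.yaml
    memory: 1g          # overrides the default of 512m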

Also at the top of the repository is a file called azure-pipelines.yml. This file contains a reference to the appropriate version of the CI/CD pipeline logic.

Continuous Integration and Continuous Delivery Pipelines

Please note that the document “CloudOne Concepts and Procedures” contains more details about the specific flow of the application through the CI/CD pipelines.

The CI/CD pipeline that provisions the datastack is triggered manually. Details for navigating to the CI/CD pipeline in Azure DevOps, triggering a run, and examining the results can be found here: Continuous Integration and Continuous Delivery Pipelines

Getting Started with Spark

Spark version

By default, the pipeline is configured to use Apache Spark version 3.1.3, which can be changed in azure-pipelines.yml. The currently supported Spark versions are 3.1.x.

extends:
  template: dbaas/spark-<dataspace name>.yml@spaces
  parameters:
    spark:
      version: 3.1.3

Spark Configuration

The Kubernetes Operator for Apache Spark provides the SparkApplication custom resource to deploy and manage a Spark cluster in Kubernetes. The SparkApplication schema reference can be found here.

When the Spark cluster is provisioned, it is given a set of default Spark configurations.

These can be adjusted in the sparkApplication section of values.dataspace.yaml in the Helm chart directory.

Spark configuration:

sparkApplication:
  mainClass: org.apache.spark.examples.SparkPi
  arguments:
    - "100000"
  sparkConf:
  driver:
    cores: 1
    memory: 512m
    coreRequest: 120m
    coreLimit: 1200m
    memoryOverhead: 2g
    serviceAccount: spark
    javaOptions:
  executor:
    cores: 1
    memory: 512m
    coreRequest: 120m
    coreLimit: 1200m
    memoryOverhead: 2g
    serviceAccount: spark
    instances: 3
    javaOptions:
  restartPolicy:
    type: Never
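
The sparkConf block above is empty by default. If additional Spark properties are required, they can be supplied as string key/value pairs; the properties below are standard Spark settings chosen purely as an illustration:

sparkApplication:
  sparkConf:
    "spark.sql.shuffle.partitions": "200"
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer"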

SparkApplication Docker image

Like Apache Spark, the Kubernetes Operator for Apache Spark provides support for Java, Python, R, and Scala.

In order to run a Spark application on Kubernetes, a Docker image containing the Spark application and its dependencies is required.

During provisioning, a default Docker image is used, which runs the 'SparkPi' example.

Using Custom Spark Docker image

This procedure shows how to create a Spark image containing a Spark application.

  1. Create a new Dockerfile using apache/spark:v3.1.3 as the base image

    FROM apache/spark:v3.1.3
    COPY SparkApplication-v1.jar $SPARK_HOME/work-dir/
    COPY ./dependencies/* $SPARK_HOME/jars/
    ENTRYPOINT ["/opt/entrypoint.sh"]
  2. Build the container image.

    Here <appcode> is the three-letter appcode and 1.0.0 is the image tag.

    $ docker build -t docker-<appcode>.repo-hio.cloudone.netapp.com/spark/<servicename>:1.0.0 .
  3. Push the custom image to JFrog Artifactory.

    $ docker push docker-<appcode>.repo-hio.cloudone.netapp.com/spark/<servicename>:1.0.0
  4. Provide the image tag and application details in azure-pipelines.yml (a quick check of the image contents is sketched after these steps).

    Here imageTag refers to the image tag used in step 2.

    extends:
      template: dbaas/spark-devexp-stg-4.yml@spaces
      parameters:
        spark:
          version: 3.1.3
          imageTag: 1.0.0
          applicationType: Scala
          applicationFile: "local:///opt/spark/work-dir/SparkApplication-v1.jar"
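
Before relying on the pipeline run, it can be worth confirming that the path referenced by applicationFile actually exists inside the custom image. Assuming the base image's entrypoint passes plain commands through in pass-through mode, a quick check could look like this:

$ docker run --rm docker-<appcode>.repo-hio.cloudone.netapp.com/spark/<servicename>:1.0.0 \
    ls /opt/spark/work-dir/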

Generating base images for Python and R

Spark (starting with version 2.3) ships with a Dockerfile that can be used for generating base Docker images, or customized to match an individual application’s needs. It can be found in the kubernetes/dockerfiles/ directory.

Spark also ships with a bin/docker-image-tool.sh script that can be used to build Docker images for use with Kubernetes.

Example usage for a JVM-based image is:

$ bin/docker-image-tool.sh -r spark -t 3.1.3 build

Example usage for a Python-based image is:

$ bin/docker-image-tool.sh -r python -t 3.1.3 -p ./kubernetes/dockerfiles/spark/bindings/python/Dockerfile build

Example usage for an R-based image is:

$ bin/docker-image-tool.sh -r r -t 3.1.3 -R ./kubernetes/dockerfiles/spark/bindings/R/Dockerfile build
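
Once built, the images can be pushed to a registry with the script's push subcommand, for example (the -r value would normally point at a registry you can push to; in this environment the custom image would still be tagged and pushed to JFrog Artifactory as shown in the previous section):

$ bin/docker-image-tool.sh -r spark -t 3.1.3 push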

Troubleshooting

If something fails to deploy, the information about why the deployment failed (or was not initiated at all) will be found in the logs of the CI/CD pipeline and can be tracked down using the instructions at the link listed in the Continuous Integration and Continuous Delivery Pipelines section.

However, additional information may be required either to better troubleshoot a failed deployment or to investigate the runtime behavior of a Spark cluster that has been successfully deployed. In those cases, much of the information can be found in the Rancher web console.
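
Where direct kubectl access to the dataspace namespace is available (this may not be the case; access could be limited to the Rancher console), the status and events of the SparkApplication resource and the driver pod logs are usually the most useful starting points. The names below are placeholders, and the -driver suffix follows the operator's default pod naming convention:

# Inspect the SparkApplication status and recent events
$ kubectl describe sparkapplication <application name> -n <dataspace namespace>

# Follow the logs of the driver pod created for the application
$ kubectl logs -f <application name>-driver -n <dataspace namespace>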