How to install SynapseML on EMR from the Maven repository into /usr/lib/spark/jars #2355

@hueiyuan

SynapseML version

1.0.10

System information

  • Language version (e.g. python 3.8, scala 2.12): python 3.9
  • Spark Version (e.g. 3.2.3): 3.5.1
  • Spark Platform (e.g. Synapse, Databricks): AWS EMR Release 7.3.1

Describe the problem

I would like to install SynapseML on EMR for use with PySpark. Applying the configuration below in a Jupyter notebook works:

%%configure -f
{
  "name": "synapseml",
  "conf": {
      "spark.jars.packages": "com.microsoft.azure:synapseml_2.12:1.0.9-spark3.5",
      "spark.jars.repositories": "https://mmlspark.azureedge.net/maven"
  }
}
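
For jobs launched outside a notebook, the same two settings could presumably be passed on the spark-submit command line as `--conf key=value` pairs. The sketch below only builds that command line. It assumes spark-submit is the launcher; `build_spark_submit_args` and the script name `job.py` are hypothetical names used for illustration, and only the two configuration keys and their values come from the notebook configuration above.

```python
import shlex

# Same keys and values as the notebook configuration above.
SYNAPSEML_CONF = {
    "spark.jars.packages": "com.microsoft.azure:synapseml_2.12:1.0.9-spark3.5",
    "spark.jars.repositories": "https://mmlspark.azureedge.net/maven",
}


def build_spark_submit_args(conf, script):
    """Return a spark-submit argument list with one --conf flag per setting.

    Hypothetical helper: it only assembles arguments and does not run Spark.
    """
    args = ["spark-submit"]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    args.append(script)  # "job.py" below is a placeholder script name
    return args


if __name__ == "__main__":
    # Print a shell-quoted command line that could be pasted into an EMR step.
    print(shlex.join(build_spark_submit_args(SYNAPSEML_CONF, "job.py")))
```

With this approach Spark resolves the package and its transitive dependencies at submit time, so nothing needs to be copied by hand. Whether the cluster can reach the repository from a production network is an assumption that needs to be checked.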

But in production we don't use Jupyter notebooks. Instead, we downloaded the corresponding jars from the Maven repository and copied them to /usr/lib/spark/jars on EMR. This does not work and fails with: com.microsoft.azure.synapse.ml.isolationforest.IsolationForest does not exist in the JVM

Does anyone know the root cause of this? Thank you.

Code to reproduce issue

from synapse.ml.isolationforest import IsolationForest

# print(type(IsolationForest))
hyper_params = {
    'n_estimators': 100,
    'max_samples': 32,
    'max_features': 1,
    'bootstrap': False,
    'contamination': 0.1,
}

isolation_forest_model = (
    IsolationForest()
    .setNumEstimators(hyper_params["n_estimators"])
    .setBootstrap(hyper_params["bootstrap"])
    .setMaxSamples(hyper_params["max_samples"])
    .setMaxFeatures(hyper_params["max_features"])
    .setFeaturesCol("features")
    .setPredictionCol("predictedLabel")
    .setScoreCol("outlierScore")
    .setContamination(hyper_params["contamination"])
    .setContaminationError(0.01 * hyper_params["contamination"])
)

Other info / logs

An error was encountered:
com.microsoft.azure.synapse.ml.isolationforest.IsolationForest does not exist in the JVM
Traceback (most recent call last):
  File "/mnt1/yarn/usercache/livy/appcache/application_1742368398137_0002/container_1742368398137_0002_01_000001/pyspark.zip/pyspark/__init__.py", line 139, in wrapper
    return func(self, **kwargs)
  File "/mnt1/yarn/usercache/livy/appcache/application_1742368398137_0002/container_1742368398137_0002_01_000001/com.microsoft.azure_synapseml-core_2.12-1.0.9-spark3.5.jar/synapse/ml/isolationforest/IsolationForest.py", line 78, in __init__
    self._java_obj = self._new_java_obj("com.microsoft.azure.synapse.ml.isolationforest.IsolationForest", self.uid)
  File "/mnt1/yarn/usercache/livy/appcache/application_1742368398137_0002/container_1742368398137_0002_01_000001/pyspark.zip/pyspark/ml/wrapper.py", line 84, in _new_java_obj
    java_obj = getattr(java_obj, name)
  File "/mnt1/yarn/usercache/livy/appcache/application_1742368398137_0002/container_1742368398137_0002_01_000001/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1664, in __getattr__
    raise Py4JError("{0} does not exist in the JVM".format(new_fqn))
py4j.protocol.Py4JError: com.microsoft.azure.synapse.ml.isolationforest.IsolationForest does not exist in the JVM
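
One way to narrow down the error above is to check whether any jar in /usr/lib/spark/jars actually contains the missing class. The traceback shows the Python wrapper being loaded from the synapseml-core jar, so the top-level synapseml jar alone may not carry the class. That is an assumption to verify, not a confirmed diagnosis. The sketch below uses only the standard library, and `jars_containing_class` is a hypothetical helper name.

```python
import zipfile
from pathlib import Path


def jars_containing_class(jar_dir, class_name):
    """Return the names of jars in jar_dir that contain the given class.

    class_name is a fully qualified JVM class name, e.g. "com.example.Foo".
    """
    # A class a.b.C is stored inside a jar as the entry a/b/C.class.
    entry = class_name.replace(".", "/") + ".class"
    matches = []
    for jar in sorted(Path(jar_dir).glob("*.jar")):
        try:
            with zipfile.ZipFile(jar) as zf:
                if entry in zf.namelist():
                    matches.append(jar.name)
        except zipfile.BadZipFile:
            # Skip truncated or corrupt downloads instead of aborting the scan.
            continue
    return matches


if __name__ == "__main__":
    found = jars_containing_class(
        "/usr/lib/spark/jars",
        "com.microsoft.azure.synapse.ml.isolationforest.IsolationForest",
    )
    print(found or "class not found in any jar")
```

An empty result would suggest that the jar providing the class, or one of its transitive dependencies, was not copied. A non-empty result would point toward a classpath or version conflict instead.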

What component(s) does this bug affect?

  • area/cognitive: Cognitive project
  • area/core: Core project
  • area/deep-learning: DeepLearning project
  • area/lightgbm: Lightgbm project
  • area/opencv: Opencv project
  • area/vw: VW project
  • area/website: Website
  • area/build: Project build system
  • area/notebooks: Samples under notebooks folder
  • area/docker: Docker usage
  • area/models: models related issue

What language(s) does this bug affect?

  • language/scala: Scala source code
  • language/python: Pyspark APIs
  • language/r: R APIs
  • language/csharp: .NET APIs
  • language/new: Proposals for new client languages

What integration(s) does this bug affect?

  • integrations/synapse: Azure Synapse integrations
  • integrations/azureml: Azure ML integrations
  • integrations/databricks: Databricks integrations
