
Recipe and notebook migration guides

Notebooks and recipes using Python/R remain unaffected. The migration only applies to PySpark/Spark (2.3) recipes and notebooks.
The following guides outline the steps and information required for migrating existing recipes and notebooks.

Spark migration guide

The recipe artifact that is generated by the build steps is now a Docker image which contains your .jar binary file. Additionally, the syntax used to read and write datasets using the Platform SDK has changed and requires you to modify your recipe code.
The following video is designed to further assist in understanding the changes that are required for Spark recipes:

Read and write datasets (Spark)

Before you build the Docker image, review the examples for reading and writing datasets in the Platform SDK, provided in the sections below. If you are converting existing recipes, your Platform SDK code needs to be updated.

Read a dataset

This section outlines the changes that are needed for reading a dataset and uses the helper.scala example, provided by Adobe.
Old way of reading a dataset
 var df = sparkSession.read.format("com.adobe.platform.dataset")
    .option(DataSetOptions.orgId, orgId)
    .option(DataSetOptions.serviceToken, serviceToken)
    .option(DataSetOptions.userToken, userToken)
    .option(DataSetOptions.serviceApiKey, apiKey)
    .load(dataSetId)

New way of reading a dataset
With the updates to Spark recipes, a number of values need to be added and changed. First, DataSetOptions is no longer used; replace DataSetOptions with QSOption. Additionally, new option parameters are required: both QSOption.mode and QSOption.datasetId are needed. Lastly, orgId and serviceApiKey need to be changed to imsOrg and apiKey. Review the following example for a comparison of reading datasets:
import com.adobe.platform.query.QSOption
var df = sparkSession.read.format("com.adobe.platform.query")
  .option(QSOption.userToken", {userToken})
  .option(QSOption.serviceToken, {serviceToken})
  .option(QSOption.imsOrg, {orgId})
  .option(QSOption.apiKey, {apiKey})
  .option(QSOption.mode, "interactive")
  .option(QSOption.datasetId, {dataSetId})
  .load()

Interactive mode times out if queries run for longer than 10 minutes. If you are ingesting more than a few gigabytes of data, it is recommended that you switch to "batch" mode. Batch mode takes longer to start up but can handle larger sets of data.

Write to a dataset

This section outlines the changes needed for writing a dataset by using the ScoringDataSaver.scala example, provided by Adobe.
Old way of writing a dataset
df.write.format("com.adobe.platform.dataset")
    .option(DataSetOptions.orgId, orgId)
    .option(DataSetOptions.serviceToken, serviceToken)
    .option(DataSetOptions.userToken, userToken)
    .option(DataSetOptions.serviceApiKey, apiKey)
    .save(scoringResultsDataSetId)

New way of writing a dataset
With the updates to Spark recipes, a number of values need to be added and changed. First, DataSetOptions is no longer used; replace DataSetOptions with QSOption. Additionally, new option parameters are required: QSOption.datasetId is needed and replaces the need to pass the {dataSetId} to .save(). Lastly, orgId and serviceApiKey need to be changed to imsOrg and apiKey. Review the following example for a comparison of writing datasets:
import com.adobe.platform.query.QSOption
df.write.format("com.adobe.platform.query")
  .option(QSOption.userToken", {userToken})
  .option(QSOption.serviceToken, {serviceToken})
  .option(QSOption.imsOrg, {orgId})
  .option(QSOption.apiKey, {apiKey})
  .option(QSOption.datasetId, {dataSetId})
  .save()

Package Docker-based source files (Spark)

Start by navigating to the directory where your recipe is located.
The following sections use the new Scala Retail Sales recipe, which can be found in the Data Science Workspace public GitHub repository.

Download the sample recipe (Spark)

The sample recipe contains files that need to be copied over to your existing recipe. To clone the public GitHub repository that contains all the sample recipes, enter the following in a terminal:
git clone https://github.com/adobe/experience-platform-dsw-reference.git

The Scala recipe is located in the following directory: experience-platform-dsw-reference/recipes/scala/retail.

Add the Dockerfile (Spark)

A new file is needed in your recipe folder in order to use the Docker-based workflow. Copy and paste the Dockerfile from the recipes folder located at experience-platform-dsw-reference/recipes/scala/Dockerfile. Optionally, you can also copy and paste the code below into a new file called Dockerfile.
The example jar file shown below, ml-retail-sample-spark-*-jar-with-dependencies.jar, should be replaced with the name of your recipe's jar file.
FROM adobe/acp-dsw-ml-runtime-spark:0.0.1

COPY target/ml-retail-sample-spark-*-jar-with-dependencies.jar /application.jar

Change dependencies (Spark)

If you are using an existing recipe, changes are required in the pom.xml file for dependencies. Change the model-authoring-sdk dependency version to 2.0.0. Next, update the Spark version in the pom file to 2.4.3 and the Scala version to 2.11.12.
<groupId>com.adobe.platform.ml</groupId>
<artifactId>authoring-sdk_2.11</artifactId>
<version>2.0.0</version>
<classifier>jar-with-dependencies</classifier>

Prepare your Docker scripts (Spark)

Spark recipes no longer use Binary Artifacts and instead require building a Docker image. If you have not done so, download and install Docker.
In the provided Scala sample recipe, you can find the scripts login.sh and build.sh located at experience-platform-dsw-reference/recipes/scala/. Copy and paste these files into your existing recipe.
Your folder structure should now look similar to the following example (newly added files are highlighted):
The next step is to follow the package source files into a recipe tutorial. This tutorial has a section that outlines building a Docker image for a Scala (Spark) recipe. Once complete, you are provided with the Docker image in an Azure Container Registry along with the corresponding image URL.

Create a recipe (Spark)

In order to create a recipe, you must first complete the package source files tutorial and have your Docker image URL ready. You can create a recipe with the UI or API.
To build your recipe using the UI, follow the import a packaged recipe (UI) tutorial for Scala.
To build your recipe using the API, follow the import a packaged recipe (API) tutorial for Scala.

PySpark migration guide

The recipe artifact that is generated by the build steps is now a Docker image which contains your .egg binary file. Additionally, the syntax used to read and write datasets using the Platform SDK has changed and requires you to modify your recipe code.
The following video is designed to further assist in understanding the changes that are required for PySpark recipes:

Read and write datasets (PySpark)

Before you build the Docker image, review the examples for reading and writing datasets in the Platform SDK, provided in the sections below. If you are converting existing recipes, your Platform SDK code needs to be updated.

Read a dataset

This section outlines the changes needed for reading a dataset by using the helper.py example, provided by Adobe.
Old way of reading a dataset
dataset_options = get_dataset_options(spark.sparkContext)
pd = spark.read.format("com.adobe.platform.dataset") 
  .option(dataset_options.serviceToken(), service_token) 
  .option(dataset_options.userToken(), user_token) 
  .option(dataset_options.orgId(), org_id) 
  .option(dataset_options.serviceApiKey(), api_key)
  .load(dataset_id)

New way of reading a dataset
With the updates to PySpark recipes, a number of values need to be added and changed. First, DataSetOptions is no longer used; replace DataSetOptions with qs_option. Additionally, new option parameters are required: both qs_option.mode and qs_option.datasetId are needed. Lastly, orgId and serviceApiKey need to be changed to imsOrg and apiKey. Review the following example for a comparison of reading datasets:
qs_option = spark_context._jvm.com.adobe.platform.query.QSOption
pd = sparkSession.read.format("com.adobe.platform.query") \
  .option(qs_option.userToken, {userToken}) \
  .option(qs_option.serviceToken, {serviceToken}) \
  .option(qs_option.imsOrg, {orgId}) \
  .option(qs_option.apiKey, {apiKey}) \
  .option(qs_option.mode, "interactive") \
  .option(qs_option.datasetId, {dataSetId}) \
  .load()

Interactive mode times out if queries run for longer than 10 minutes. If you are ingesting more than a few gigabytes of data, it is recommended that you switch to "batch" mode. Batch mode takes longer to start up but can handle larger sets of data.
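As a minimal sketch, assuming the same qs_option setup as in the read example above, switching to batch mode only requires changing the value passed to qs_option.mode:

qs_option = spark_context._jvm.com.adobe.platform.query.QSOption
pd = sparkSession.read.format("com.adobe.platform.query") \
  .option(qs_option.userToken, {userToken}) \
  .option(qs_option.serviceToken, {serviceToken}) \
  .option(qs_option.imsOrg, {orgId}) \
  .option(qs_option.apiKey, {apiKey}) \
  .option(qs_option.mode, "batch") \
  .option(qs_option.datasetId, {dataSetId}) \
  .load()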

Write to a dataset

This section outlines the changes needed for writing a dataset by using the data_saver.py example, provided by Adobe.
Old way of writing a dataset
df.write.format("com.adobe.platform.dataset")
  .option(DataSetOptions.orgId, orgId)
  .option(DataSetOptions.serviceToken, serviceToken)
  .option(DataSetOptions.userToken, userToken)
  .option(DataSetOptions.serviceApiKey, apiKey)
  .save(scoringResultsDataSetId)

New way of writing a dataset
With the updates to PySpark recipes, a number of values need to be added and changed. First, DataSetOptions is no longer used; replace DataSetOptions with qs_option. Additionally, new option parameters are required: qs_option.datasetId is needed and replaces the need to pass the {dataSetId} to .save(). Lastly, orgId and serviceApiKey need to be changed to imsOrg and apiKey. Review the following example for a comparison of writing datasets:
qs_option = spark_context._jvm.com.adobe.platform.query.QSOption
scored_df.write.format("com.adobe.platform.query") \
  .option(qs_option.userToken, {userToken}) \
  .option(qs_option.serviceToken, {serviceToken}) \
  .option(qs_option.imsOrg, {orgId}) \
  .option(qs_option.apiKey, {apiKey}) \
  .option(qs_option.datasetId, {dataSetId}) \
  .save()

Package Docker-based source files (PySpark)

Start by navigating to the directory where your recipe is located.
For this example, the new PySpark Retail Sales recipe is used, which can be found in the Data Science Workspace public GitHub repository.

Download the sample recipe (PySpark)

The sample recipe contains files that need to be copied over to your existing recipe. To clone the public GitHub repository that contains all the sample recipes, enter the following in a terminal:
git clone https://github.com/adobe/experience-platform-dsw-reference.git

The PySpark recipe is located in the following directory: experience-platform-dsw-reference/recipes/pyspark.

Add the Dockerfile (PySpark)

A new file is needed in your recipe folder in order to use the Docker-based workflow. Copy and paste the Dockerfile from the recipes folder located at experience-platform-dsw-reference/recipes/pyspark/Dockerfile. Optionally, you can also copy and paste the code below into a new file called Dockerfile.
The example egg file shown below, pysparkretailapp-*.egg, should be replaced with the name of your recipe's egg file.
FROM adobe/acp-dsw-ml-runtime-pyspark:0.0.1
RUN mkdir /recipe

COPY . /recipe

RUN cd /recipe && \
    ${PYTHON} setup.py clean install && \
    rm -rf /recipe

RUN cp /databricks/conda/envs/${DEFAULT_DATABRICKS_ROOT_CONDA_ENV}/lib/python3.6/site-packages/pysparkretailapp-*.egg /application.egg

Prepare your Docker scripts (PySpark)

PySpark recipes no longer use Binary Artifacts and instead require building a Docker image. If you have not done so, download and install Docker.
In the provided PySpark sample recipe, you can find the scripts login.sh and build.sh located at experience-platform-dsw-reference/recipes/pyspark. Copy and paste these files into your existing recipe.
Your folder structure should now look similar to the following example (newly added files are highlighted):
Your recipe is now ready to be built using a Docker image. The next step is to follow the package source files into a recipe tutorial. This tutorial has a section that outlines building a Docker image for a PySpark (Spark 2.4) recipe. Once complete, you are provided with the Docker image in an Azure Container Registry along with the corresponding image URL.

Create a recipe (PySpark)

In order to create a recipe, you must first complete the package source files tutorial and have your Docker image URL ready. You can create a recipe with the UI or API.
To build your recipe using the UI, follow the import a packaged recipe (UI) tutorial for PySpark.
To build your recipe using the API, follow the import a packaged recipe (API) tutorial for PySpark.

Notebook migration guides

Recent changes to JupyterLab notebooks require that you update your existing PySpark and Spark 2.3 notebooks to 2.4. With this change, JupyterLab Launcher has been updated with new starter notebooks. For a step-by-step guide on how to convert your notebooks, select one of the following guides:
The following video is designed to further assist in understanding the changes that are required for JupyterLab Notebooks:

PySpark 2.3 to 2.4 notebook migration guide

With the introduction of PySpark 2.4 to JupyterLab Notebooks, new Python notebooks with PySpark 2.4 are now using the Python 3 kernel instead of the PySpark 3 kernel. This means existing code running on PySpark 2.3 is not supported in PySpark 2.4.
PySpark 2.3 is deprecated and set to be removed in a subsequent release. All existing examples are set to be replaced with PySpark 2.4 examples.
To convert your existing PySpark 3 (Spark 2.3) notebooks to Spark 2.4, follow the examples outlined below:

Kernel

PySpark 3 (Spark 2.4) notebooks use the Python 3 Kernel instead of the deprecated PySpark kernel used in PySpark 3 (Spark 2.3 - deprecated) notebooks.
To confirm or change the kernel in the JupyterLab UI, select the kernel button located in the top right navigation bar of your notebook. If you are using one of the predefined launcher notebooks, the kernel is pre-selected. The example below uses the PySpark 3 (Spark 2.4) Aggregation notebook starter.
Selecting the drop down menu opens up a list of available kernels.
For PySpark 3 (Spark 2.4) notebooks, select the Python 3 kernel and confirm by clicking the Select button.

Initializing sparkSession

All Spark 2.4 notebooks require that you initialize the session with the new boilerplate code.
PySpark 3 (Spark 2.3 - deprecated) - PySpark 3 Kernel:

spark

PySpark 3 (Spark 2.4) - Python 3 Kernel:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()


The following images highlight the differences in configuration for PySpark 2.3 and PySpark 2.4. This example uses the Aggregation starter notebooks provided in JupyterLab Launcher.
Configuration example for 2.3 (deprecated)
Configuration example for 2.4

Using %dataset magic

With the introduction of Spark 2.4, %dataset custom magic is supplied for use in new PySpark 3 (Spark 2.4) notebooks (Python 3 kernel).
Usage
%dataset {action} --datasetId {id} --dataFrame {df}
Description
A custom Data Science Workspace magic command for reading or writing a dataset from a Python notebook (Python 3 kernel).
  • {action}: The type of action to perform on the dataset. Two actions are available: "read" or "write".
  • --datasetId: Used to supply the ID of the dataset to read or write. This is a required argument.
  • --dataFrame: The pandas dataframe. This is a required argument.
    • When the action is "read", {df} is the variable where results of the dataset read operation are made available.
    • When the action is "write", this dataframe is written to the dataset.
  • --mode (optional): Allowed parameters are "interactive" and "batch". By default, the mode is set to "interactive". It is recommended to use "batch" mode when reading large amounts of data (see the batch read example below).
Examples
  • Read example: %dataset read --datasetId 5e68141134492718af974841 --dataFrame pd0
  • Write example: %dataset write --datasetId 5e68141134492718af974842 --dataFrame pd0
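The optional --mode flag follows the same pattern. As a minimal sketch, reusing the illustrative dataset ID from the read example above, a large dataset can be read in batch mode as follows:
  • Batch read example: %dataset read --datasetId 5e68141134492718af974841 --dataFrame pd0 --mode batch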

Load into a dataframe in LocalContext

With the introduction of Spark 2.4, the %dataset custom magic is supplied. The following example highlights the key differences for loading a dataframe in PySpark (Spark 2.3) and PySpark (Spark 2.4) notebooks:
Using PySpark 3 (Spark 2.3 - deprecated) - PySpark 3 Kernel
dataset_options = sc._jvm.com.adobe.platform.dataset.DataSetOptions
pd0 = spark.read.format("com.adobe.platform.dataset")
  .option(dataset_options.orgId(), "310C6D375BA5248F0A494212@AdobeOrg")
  .load("5e68141134492718af974844")

Using PySpark 3 (Spark 2.4) - Python 3 Kernel
%dataset read --datasetId 5e68141134492718af974844 --dataFrame pd0

  • pd0: Name of the pandas dataframe object to use or create.
  • %dataset: Custom magic for data access in the Python 3 kernel.
The following images highlight the key differences in loading data for PySpark 2.3 and PySpark 2.4. This example uses the Aggregation starter notebooks provided in JupyterLab Launcher.
Loading data in PySpark 2.3 (Luma dataset) - deprecated
Loading data in PySpark 2.4 (Luma dataset)
With PySpark 3 (Spark 2.4), sc = spark.sparkContext is defined as part of loading the data.
Loading Experience Cloud Platform data in PySpark 2.3 - deprecated
Loading Experience Cloud Platform data in PySpark 2.4
With PySpark 3 (Spark 2.4), the org_id and dataset_id no longer need to be defined. Additionally, df = spark.read.format has been replaced with the custom magic %dataset to make reading and writing datasets easier.
  • %dataset: Custom magic for data access in the Python 3 kernel.
  • --mode: Can be set to "interactive" or "batch". The default for --mode is "interactive". It is recommended to use "batch" mode when reading large amounts of data.

Creating a local dataframe

With PySpark 3 (Spark 2.4), %%sparkmagic is no longer supported. The following operations can no longer be used:
  • %%help
  • %%info
  • %%cleanup
  • %%delete
  • %%configure
  • %%local
The following examples outline the changes needed to convert %%sql sparkmagic queries to their Spark 2.4 equivalents:

PySpark 3 (Spark 2.3 - deprecated) - PySpark 3 Kernel:

%%sql -o df
select * from sparkdf

%%sql -o df -n limit
select * from sparkdf

%%sql -o df -q
select * from sparkdf

%%sql -o df -r fraction
select * from sparkdf

PySpark 3 (Spark 2.4) - Python 3 Kernel:

df = spark.sql('''
  SELECT *
  FROM sparkdf
''')

df = spark.sql('''
  SELECT *
  FROM sparkdf
  LIMIT limit
''')

df = spark.sql('''
  SELECT *
  FROM sparkdf
  LIMIT limit
''')

sample_df = df.sample(fraction)


You can also specify optional arguments to sample(), such as a boolean withReplacement, a double fraction, or a long seed.
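For example, a minimal sketch of a seeded sample call using illustrative argument values (not taken from the starter notebook):

sample_df = df.sample(withReplacement=False, fraction=0.5, seed=3)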
The following images highlight the key differences for creating a local dataframe in PySpark 2.3 and PySpark 2.4. This example uses the Aggregation starter notebooks provided in JupyterLab Launcher.
Create local dataframe PySpark 2.3 - deprecated
Create local dataframe PySpark 2.4
With PySpark 3 (Spark 2.4), %%sql sparkmagic is no longer supported and has been replaced with the following:

Write to a dataset

With the introduction of Spark 2.4, the %dataset custom magic is supplied, which makes writing to datasets cleaner. To write to a dataset, use the following Spark 2.4 example:
Using PySpark 3 (Spark 2.3 - deprecated) - PySpark 3 Kernel
userToken = spark.sparkContext.getConf().get("spark.yarn.appMasterEnv.USER_TOKEN")
serviceToken = spark.sparkContext.getConf().get("spark.yarn.appMasterEnv.SERVICE_TOKEN")
serviceApiKey = spark.sparkContext.getConf().get("spark.yarn.appMasterEnv.SERVICE_API_KEY")

dataset_options = sc._jvm.com.adobe.platform.dataset.DataSetOptions

pd0.write.format("com.adobe.platform.dataset")
  .option(dataset_options.orgId(), "310C6D375BA5248F0A494212@AdobeOrg")
  .option(dataset_options.userToken(), userToken)
  .option(dataset_options.serviceToken(), serviceToken)
  .option(dataset_options.serviceApiKey(), serviceApiKey)
  .save("5e68141134492718af974844")

Using PySpark 3 (Spark 2.4) - Python 3 Kernel
%dataset write --datasetId 5e68141134492718af974844 --dataFrame pd0
pd0.describe()
pd0.show(10, False)

  • pd0: Name of the pandas dataframe object to use or create.
  • %dataset: Custom magic for data access in the Python 3 kernel.
  • --mode: Can be set to "interactive" or "batch". The default for --mode is "interactive". It is recommended to use "batch" mode when reading large amounts of data.
The following images highlight the key differences for writing data back to Platform in PySpark 2.3 and PySpark 2.4. This example uses the Aggregation starter notebooks provided in JupyterLab Launcher.
Writing data back to Platform PySpark 2.3 - deprecated
Writing data back to Platform PySpark 2.4
With PySpark 3 (Spark 2.4), the %dataset custom magic removes the need to define values such as userToken, serviceToken, and serviceApiKey, or to pass them with .option. Additionally, orgId no longer needs to be defined.

Spark 2.3 to Spark 2.4 (Scala) notebook migration guide

With the introduction of Spark 2.4 to JupyterLab Notebooks, new Scala notebooks with Spark 2.4 use the Scala kernel instead of the Spark kernel. This means that existing code running on Spark (Spark 2.3) is not supported in Scala (Spark 2.4). Additionally, all new Spark notebooks should use Scala (Spark 2.4) in the JupyterLab Launcher.
Spark (Spark 2.3) is deprecated and set to be removed in a subsequent release. All existing examples are set to be replaced with Scala (Spark 2.4) examples.
To convert your existing Spark (Spark 2.3) notebooks to Scala (Spark 2.4), follow the examples outlined below:

Kernel

Scala (Spark 2.4) notebooks use the Scala Kernel instead of the deprecated Spark kernel used in Spark (Spark 2.3 - deprecated) notebooks.
To confirm or change the kernel in the JupyterLab UI, select the kernel button located in the top right navigation bar of your notebook. The Select Kernel popover appears. If you are using one of the predefined launcher notebooks, the kernel is pre-selected. The example below uses the Scala Clustering notebook in JupyterLab Launcher.
Selecting the drop down menu opens up a list of available kernels.
For Scala (Spark 2.4) notebooks, select the Scala kernel and confirm by clicking the Select button.

Initializing SparkSession

All Scala (Spark 2.4) notebooks require that you initialize the session with the following boilerplate code:
Spark (Spark 2.3 - deprecated) - Spark Kernel: no code required

Scala (Spark 2.4) - Scala Kernel:

import org.apache.spark.sql.{ SparkSession }
val spark = SparkSession
  .builder()
  .master("local")
  .getOrCreate()


The images below highlight the key difference in initializing the SparkSession between the Spark 2.3 Spark kernel and the Spark 2.4 Scala kernel. This example uses the Clustering starter notebooks provided in JupyterLab Launcher.
Spark (Spark 2.3 - deprecated)
Spark (Spark 2.3 - deprecated) uses the Spark kernel, and therefore you were not required to define the Spark session.
Scala (Spark 2.4)
Using Spark 2.4 with the Scala kernel requires that you define val spark and import SparkSession in order to read or write:

Query data

With Scala (Spark 2.4), %%sparkmagic is no longer supported. The following operations can no longer be used:
  • %%help
  • %%info
  • %%cleanup
  • %%delete
  • %%configure
  • %%local
The following examples outline the changes needed to convert %%sql sparkmagic queries to their Spark 2.4 equivalents:

Spark (Spark 2.3 - deprecated) - Spark Kernel:

%%sql -o df
select * from sparkdf

%%sql -o df -n limit
select * from sparkdf

%%sql -o df -q
select * from sparkdf

%%sql -o df -r fraction
select * from sparkdf

Scala (Spark 2.4) - Scala Kernel:

val df = spark.sql("""
  SELECT *
  FROM sparkdf
""")

val df = spark.sql("""
  SELECT *
  FROM sparkdf
  LIMIT limit
""")

val df = spark.sql("""
  SELECT *
  FROM sparkdf
  LIMIT limit
""")

val sample_df = df.sample(fraction)

The images below highlight the key differences in making queries with the Spark 2.3 Spark kernel and the Spark 2.4 Scala kernel. This example uses the Clustering starter notebooks provided in JupyterLab Launcher.
Spark (Spark 2.3 - deprecated)
The Spark (Spark 2.3 - deprecated) notebook uses the Spark kernel. The Spark kernel supports and uses %%sql sparkmagic.
Scala (Spark 2.4)
The Scala kernel no longer supports %%sql sparkmagic. Existing sparkmagic code needs to be converted.

Read a dataset

In Spark 2.3 you needed to define variables for the option values used to read data, or use the raw values in the code cell. In Scala, you can use sys.env("PYDASDK_IMS_USER_TOKEN") to declare and return a value; this eliminates the need to define variables such as var userToken. In the Scala (Spark 2.4) example below, sys.env is used to define and return all the required values needed for reading a dataset.
Using Spark (Spark 2.3 - deprecated) - Spark Kernel
import com.adobe.platform.dataset.DataSetOptions
var df1 = spark.read.format("com.adobe.platform.dataset")
  .option(DataSetOptions.orgId, "310C6D375BA5248F0A494212@AdobeOrg")
  .option(DataSetOptions.batchId, "dbe154d3-197a-4e6c-80f8-9b7025eea2b9")
  .load("5e68141134492718af974844")

Using Scala (Spark 2.4) - Scala Kernel
import org.apache.spark.sql.{Dataset, SparkSession}
val spark = SparkSession.builder().master("local").getOrCreate()
val df1 = spark.read.format("com.adobe.platform.query")
  .option("user-token", sys.env("PYDASDK_IMS_USER_TOKEN"))
  .option("ims-org", sys.env("IMS_ORG_ID"))
  .option("api-key", sys.env("PYDASDK_IMS_CLIENT_ID"))
  .option("service-token", sys.env("PYDASDK_IMS_SERVICE_TOKEN"))
  .option("mode", "interactive")
  .option("dataset-id", "5e68141134492718af974844")
  .load()

  • df1: A variable that represents the dataframe used to read and write data.
  • user-token: Your user token, which is automatically fetched using sys.env("PYDASDK_IMS_USER_TOKEN").
  • service-token: Your service token, which is automatically fetched using sys.env("PYDASDK_IMS_SERVICE_TOKEN").
  • ims-org: Your IMS org ID, which is automatically fetched using sys.env("IMS_ORG_ID").
  • api-key: Your API key, which is automatically fetched using sys.env("PYDASDK_IMS_CLIENT_ID").
The images below highlight the key differences in loading data with Spark 2.3 and Spark 2.4. This example uses the Clustering starter notebooks provided in JupyterLab Launcher.
Spark (Spark 2.3 - deprecated)
The Spark (Spark 2.3 - deprecated) notebook uses the Spark kernel. The following two cells show an example of loading the dataset with a specified dataset ID in the date range of (2019-3-21, 2019-3-29).
Scala (Spark 2.4)
The Scala (Spark 2.4) notebook uses the Scala kernel, which requires more values upon setup, as highlighted in the first code cell. Additionally, var mdata requires more option values to be filled in. In this notebook, the previously mentioned code for initializing SparkSession is included within the var mdata code cell.
In Scala, you can use sys.env() to declare and return a value from within option. This eliminates the need to define variables if you know they are only going to be used a single time. The following example takes val userToken from the example above and declares it in-line within option:
.option("user-token", sys.env("PYDASDK_IMS_USER_TOKEN"))

Write to a dataset

Similar to reading a dataset, writing to a dataset requires the additional option values outlined in the example below. In Scala, you can use sys.env("PYDASDK_IMS_USER_TOKEN") to declare and return a value; this eliminates the need to define variables such as var userToken. In the Scala example below, sys.env is used to define and return all the required values needed for writing to a dataset.
Using Spark (Spark 2.3 - deprecated) - Spark Kernel
import com.adobe.platform.dataset.DataSetOptions

var userToken = spark.sparkContext.getConf.getOption("spark.yarn.appMasterEnv.USER_TOKEN").get
var serviceToken = spark.sparkContext.getConf.getOption("spark.yarn.appMasterEnv.SERVICE_TOKEN").get
var serviceApiKey = spark.sparkContext.getConf.getOption("spark.yarn.appMasterEnv.SERVICE_API_KEY").get

df1.write.format("com.adobe.platform.dataset")
  .option(DataSetOptions.orgId, "310C6D375BA5248F0A494212@AdobeOrg")
  .option(DataSetOptions.userToken, userToken)
  .option(DataSetOptions.serviceToken, serviceToken)
  .option(DataSetOptions.serviceApiKey, serviceApiKey)
  .save("5e68141134492718af974844")

Using Scala (Spark 2.4) - Scala Kernel
import org.apache.spark.sql.{Dataset, SparkSession}

val spark = SparkSession.builder().master("local").getOrCreate()

df1.write.format("com.adobe.platform.query")
  .option("user-token", sys.env("PYDASDK_IMS_USER_TOKEN"))
  .option("service-token", sys.env("PYDASDK_IMS_SERVICE_TOKEN"))
  .option("ims-org", sys.env("IMS_ORG_ID"))
  .option("api-key", sys.env("PYDASDK_IMS_CLIENT_ID"))
  .option("mode", "interactive")
  .option("dataset-id", "5e68141134492718af974844")
  .save()

  • df1: A variable that represents the dataframe used to read and write data.
  • user-token: Your user token, which is automatically fetched using sys.env("PYDASDK_IMS_USER_TOKEN").
  • service-token: Your service token, which is automatically fetched using sys.env("PYDASDK_IMS_SERVICE_TOKEN").
  • ims-org: Your IMS org ID, which is automatically fetched using sys.env("IMS_ORG_ID").
  • api-key: Your API key, which is automatically fetched using sys.env("PYDASDK_IMS_CLIENT_ID").