Spark 2.x External Packages
The bane of using bleeding-edge technology is that documentation for new features in the latest version is often scarce or hard to find. We at Unnati use bleeding-edge releases of many data science tools for various research and production systems. In this post we explain how to add external jars to an Apache Spark 2.x application.
Starting with Spark 2.x, we can use the --packages option to pass additional jars to spark-submit. Spark first looks for the jar in the local ivy2 repository; if it is missing, it pulls the dependency from the central Maven repository.
$SPARK_HOME/bin/spark-submit --packages org.mongodb.spark:mongo-spark-connector_2.10:2.0.0 <py-file>
In the above example, we are adding the MongoDB Spark connector, and this works perfectly fine. However, there are scenarios where Spark is used as part of a Python application that is launched with the plain Python interpreter rather than spark-submit. In this case, we would normally use SparkContext to specify the configuration.
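A minimal sketch of such an application is shown below; the app name and master URL are placeholders chosen for illustration, not values from the post.

from pyspark import SparkConf, SparkContext

# Build the configuration programmatically inside a plain Python script.
# The app name and master URL are placeholders.
conf = SparkConf().setAppName("my-app").setMaster("local[*]")

# Per the limitation described below, spark.jars.packages cannot be
# supplied through this SparkConf, so the MongoDB connector would still
# be missing at this point.
sc = SparkContext(conf=conf)

# ... use sc as usual ...
sc.stop()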
There is no way to set the packages option using SparkConf.
Instead, we need to use spark-defaults.conf (typically found at $SPARK_HOME/conf/spark-defaults.conf) to specify the external jar. Add the following line to the file:
spark.jars.packages org.mongodb.spark:mongo-spark-connector_2.10:2.0.0
Now run your PySpark application as usual:
python <py-file>
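To make the end-to-end flow concrete, here is a rough sketch of such a py-file that reads a collection through the connector. The MongoDB URI, database, and collection names are placeholder assumptions, and it relies on spark.jars.packages being set in spark-defaults.conf as described above.

from pyspark.sql import SparkSession

# Minimal sketch of a py-file that reads from MongoDB via the connector.
# The URI, database, and collection are placeholders; the connector jar is
# resolved through the spark.jars.packages entry in spark-defaults.conf.
spark = (SparkSession.builder
         .appName("mongo-read-example")
         .config("spark.mongodb.input.uri",
                 "mongodb://127.0.0.1/test.myCollection")
         .getOrCreate())

df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
df.printSchema()
df.show(5)

spark.stop()

Because spark-defaults.conf is read when the driver JVM is launched, the connector is pulled from the local ivy2 cache or the central Maven repository, just as it would be with spark-submit.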