SBT stands for Simple Build Tool, a build tool for Scala and Java projects that covers build structure, configuration management, library/dependency management, documentation, and so on. Please check the official sbt documentation for more details.

The goal of this post is to provide a basic build.sbt config and a simple script to test your Spark installation.

I have already set up Spark (2.1.0, built for Hadoop 2.7.3) on Mac OS X El Capitan 10.11.6.
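For reference, here is a minimal sketch of the shell environment such an install typically needs so that commands like sparkR and spark-submit resolve on the PATH; the install path below is a placeholder, so adjust it to wherever you unpacked Spark:

[code lang=text]
# Hypothetical install location -- point SPARK_HOME at your own Spark directory
export SPARK_HOME=$HOME/spark-2.1.0-bin-hadoop2.7
export PATH=$SPARK_HOME/bin:$PATH
[/code]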

Brew Installations

If you are a Mac user like me, you can use brew to install Scala and sbt.
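Assuming Homebrew is already installed, the installs boil down to:

[code lang=text]
SDS-bash3.2$ brew install scala
SDS-bash3.2$ brew install sbt
[/code]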

Java Details

[code lang=text]
SDS-bash3.2$ javac -version
javac 1.8.0_111
[/code]

Scala Details

[code lang=text]
SDS-bash3.2$ scala -version
Scala code runner version 2.12.1 -- Copyright 2002-2016, LAMP/EPFL and Lightbend, Inc.
SDS-bash3.2$
[/code]

Spark Details

[code lang=text]
SDS-bash3.2$ sparkR --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/

Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_111
Branch
Compiled by user jenkins on 2016-12-16T02:04:48Z
[/code]

SBT Config for Spark Project

File: build.sbt

[code lang=text]
name := "hello"

version := "1.0"

scalaVersion := "2.11.7"

organization := "sampath.sbt.hello"

logLevel := Level.Error

libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-core_2.11" % "2.1.0",
  "org.apache.spark" % "spark-sql_2.11" % "2.1.0",
  "org.apache.spark" % "spark-mllib_2.11" % "2.1.0" withSources()
)
[/code]
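As a side note, the same dependencies can also be declared with sbt's %% operator, which appends the Scala binary version (here 2.11) to the artifact name automatically, so you don't have to spell out _2.11 yourself:

[code lang=text]
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.1.0",
  "org.apache.spark" %% "spark-sql" % "2.1.0",
  "org.apache.spark" %% "spark-mllib" % "2.1.0" withSources()
)
[/code]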

Test run with Spark

File: SimpleApp.scala

[code lang=text]
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object TextFileOp extends Serializable {
  def filterAB(sc: SparkContext) = {
    val logFile = "build.sbt" // Should be some file on your system
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("===================\n\n")
    println(s"Lines with a: $numAs, Lines with b: $numBs")
    println("===================\n\n")
  }
}

object SimpleApp {
  def main(args: Array[String]) {
    val conf = new SparkConf()
      .setAppName("Simple Application")
      .setMaster("local[2]")
    val sc = new SparkContext(conf)
    TextFileOp.filterAB(sc)
    sc.stop()
  }
}
[/code]
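Assuming the file is saved under the standard sbt layout (src/main/scala/SimpleApp.scala), you should be able to compile and run it locally from the project root; sbt picks SimpleApp as the entry point since it is the only object with a main method:

[code lang=text]
SDS-bash3.2$ sbt compile
SDS-bash3.2$ sbt run
[/code]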

SBT Amendments for SBT Assembly

File: build.sbt

[code lang=text]
name := "hello"

version := "1.0"

scalaVersion := "2.11.7"

organization := "sampath.sbt.hello"

logLevel := Level.Error

libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-core_2.11" % "2.1.0" % "provided", // changed
  "org.apache.spark" % "spark-sql_2.11" % "2.1.0",
  "org.apache.spark" % "spark-mllib_2.11" % "2.1.0" withSources()
)

// To handle deduplication conflicts between the bundled jars
assemblyMergeStrategy in assembly := {
  case PathList("jj2000", "j2k", xs @ _*) => MergeStrategy.first
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case x => MergeStrategy.first
}
[/code]

File: project/plugins.sbt

[code lang=text]
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.4")

addSbtPlugin("net.virtual-void" % "sbt-dependency-graph" % "0.8.2")
[/code]
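With both plugins in place, building the fat jar and handing it to spark-submit should look roughly like the following; the jar name follows sbt-assembly's default <name>-assembly-<version> pattern, so adjust it if your project name or version differs (sbt dependencyTree from sbt-dependency-graph is also handy for inspecting what actually gets bundled):

[code lang=text]
SDS-bash3.2$ sbt assembly
SDS-bash3.2$ spark-submit --class SimpleApp target/scala-2.11/hello-assembly-1.0.jar
[/code]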

Best Advice:

  • Major and minor versions do make a difference; be mindful during installation. For example, Spark 2.1.0 is built against Scala 2.11, which is why build.sbt pins scalaVersion to 2.11.x even though brew installs Scala 2.12.

  • Stick to the official websites when adding configuration such as the plugins above.