How to use SBT with the Eclipse Scala IDE

Scala is one of the most widely used languages in big data programming. Programmers coming from a Java background are used to the Eclipse IDE, and Eclipse has its own Scala IDE (http://scala-ide.org/). Since Eclipse mostly supports Maven as the build engine, here are some tricks for using SBT with it instead.

For those who are not aware of it, SBT (Simple Build Tool) is the default build tool for Scala.

I will show you an example with Apache Spark: importing the code into Eclipse using the sbteclipse plugin.

Let's get started with how to use SBT with the Eclipse Scala IDE.

I have created a shell script that builds the default Scala project structure and also creates the build.sbt file required by the SBT build engine:

#!/bin/sh
mkdir -p src/{main,test}/{java,resources,scala}
mkdir lib project target

# create an initial build.sbt file
echo 'name := "SubscriptionExpiry"
version := "1.0"
scalaVersion := "2.11.8"' > build.sbt

You can also run the shell script on a Windows machine using Cygwin (https://www.cygwin.com/).

The script above will create the project structure for you.

Now install the sbteclipse plugin for SBT (https://github.com/typesafehub/sbteclipse).

Go to C:\Users\[username]\.sbt\0.13\plugins\

Create a plugins.sbt file there and open it with Notepad (or any text editor). Go to https://github.com/typesafehub/sbteclipse and copy the line

addSbtPlugin("com.typesafe.sbteclipse" % "sbteclipse-plugin" % "5.0.1")
Paste it into plugins.sbt and save the file.

Open your command prompt and go to the project folder:

cd c:\Users\[username]\workspace\[projectname]
sbt reload

Note that you need an internet connection at this point; SBT will download the plugin.

Now open build.sbt and add your required dependencies:

name := "SubscriptionExpiry"
version := "1.0"
scalaVersion := "2.11.8"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.0"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.6.0"
libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "1.5.0-RC1"

Open the command prompt again:

cd c:\Users\[username]\workspace\[projectname]
sbt eclipse

It will download all your dependencies and generate the Eclipse project files.

Finally, open your Scala IDE (Eclipse) and import the project as an existing project into the workspace.

That's it, you are done.

Apache Spark ERROR: socket.gaierror: [Errno -2] Name or service not known

Getting the error “socket.gaierror: [Errno -2] Name or service not known” while executing ./bin/pyspark:


[root@ip-10-0-0-28 spark]# ./bin/pyspark
Python 2.6.6 (r266:84292, Aug 18 2016, 15:13:37)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-17)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
/root/spark/spark/python/pyspark/sql/context.py:487: DeprecationWarning: HiveContext is deprecated in Spark 2.0.0. Please use SparkSession.builder.enableHiveSupport().getOrCreate() instead.
DeprecationWarning)
16/12/06 13:58:36 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Traceback (most recent call last):
  File "/root/spark/spark/python/pyspark/shell.py", line 43, in <module>
    spark = SparkSession.builder\
  File "/root/spark/spark/python/pyspark/sql/session.py", line 169, in getOrCreate
    sc = SparkContext.getOrCreate(sparkConf)
  File "/root/spark/spark/python/pyspark/context.py", line 294, in getOrCreate
    SparkContext(conf=conf or SparkConf())
  File "/root/spark/spark/python/pyspark/context.py", line 115, in __init__
    conf, jsc, profiler_cls)
  File "/root/spark/spark/python/pyspark/context.py", line 174, in _do_init
    self._accumulatorServer = accumulators._start_update_server()
  File "/root/spark/spark/python/pyspark/accumulators.py", line 259, in _start_update_server
    server = AccumulatorServer(("localhost", 0), _UpdateRequestHandler)
  File "/usr/lib64/python2.6/SocketServer.py", line 412, in __init__
    self.server_bind()
  File "/usr/lib64/python2.6/SocketServer.py", line 423, in server_bind
    self.socket.bind(self.server_address)
  File "<string>", line 1, in bind
socket.gaierror: [Errno -2] Name or service not known

To fix it, just go to /etc/hosts on the server and add the line
127.0.0.1 localhost

From the logs it looks like PySpark is unable to resolve the host localhost. Check your /etc/hosts file; if there is no entry for localhost, adding one should resolve the issue, e.g.:

[Ip] [Hostname] localhost

In case you are not able to change the host entries of the server, edit python/pyspark/accumulators.py under your Spark directory at the line shown in the traceback (line 259 above), as below:

server = AccumulatorServer(("[server host name from hosts file]", 0), _UpdateRequestHandler)

Apache Spark Components

Spark Core
Spark Core contains the basic functionality of Spark, including components for task scheduling, memory management, fault recovery, interacting with storage systems,
and more. Spark Core is also home to the API that defines resilient distributed datasets (RDDs), which are Spark’s main programming abstraction. RDDs represent a
collection of items distributed across many compute nodes that can be manipulated in parallel. Spark Core provides many APIs for building and manipulating these
collections.
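
As a rough sketch in Scala (the object name, master URL and numbers are just placeholders, and the spark-core dependency from the build.sbt above is assumed), creating and manipulating an RDD looks like this:

import org.apache.spark.{SparkConf, SparkContext}

object RddExample {
  def main(args: Array[String]): Unit = {
    // local[*] runs Spark in-process using all cores; on a cluster the same
    // code is simply submitted with a different master URL
    val conf = new SparkConf().setAppName("RddExample").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // distribute a local collection across the cluster as an RDD
    val numbers = sc.parallelize(1 to 100)

    // transformations are lazy; the sum() action triggers the parallel computation
    val sumOfSquares = numbers.map(n => n * n).sum()
    println(s"Sum of squares: $sumOfSquares")

    sc.stop()
  }
}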

Spark SQL
Spark SQL is Spark’s package for working with structured data. It allows querying data via SQL as well as the Apache Hive variant of SQL—called the Hive Query Language
(HQL)—and it supports many sources of data, including Hive tables, Parquet, and JSON. Beyond providing a SQL interface to Spark, Spark SQL allows developers
to intermix SQL queries with the programmatic data manipulations supported by RDDs in Python, Java, and Scala, all within a single application, thus combining SQL
with complex analytics. This tight integration with the rich computing environment provided by Spark makes Spark SQL unlike any other open source data warehouse
tool. Spark SQL was added to Spark in version 1.0. Shark was an older SQL-on-Spark project out of the University of California, Berkeley, that modified Apache Hive to run on Spark. It has now been replaced by Spark SQL to provide better integration with the Spark engine and language APIs.
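
As a small, hedged sketch in Scala (using the Spark 1.6-era SQLContext that matches the spark-sql dependency above; the file path and table name are hypothetical), mixing SQL with programmatic DataFrame code looks roughly like this:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object SparkSqlExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SparkSqlExample").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)

    // load a JSON file as a DataFrame (the path is a placeholder)
    val people = sqlContext.read.json("people.json")
    people.registerTempTable("people")

    // query it with SQL and keep working with the result programmatically
    val adults = sqlContext.sql("SELECT name, age FROM people WHERE age >= 18")
    adults.show()

    sc.stop()
  }
}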

Spark Streaming
Spark Streaming is a Spark component that enables processing of live streams of data. Examples of data streams include logfiles generated by production web servers, or
queues of messages containing status updates posted by users of a web service. Spark Streaming provides an API for manipulating data streams that closely matches the
Spark Core’s RDD API, making it easy for programmers to learn the project and move between applications that manipulate data stored in memory, on disk, or arriving
in real time. Underneath its API, Spark Streaming was designed to provide the same degree of fault tolerance, throughput, and scalability as Spark Core.
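
A minimal sketch of the DStream API in Scala, assuming the spark-streaming artifact ("org.apache.spark" %% "spark-streaming" % "1.6.0") is added to build.sbt and that something is writing text to the placeholder socket localhost:9999:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    // at least 2 local threads: one receives the stream, one processes it
    val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5)) // 5-second micro-batches

    // the DStream API mirrors the RDD API: flatMap, map, reduceByKey, ...
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}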

MLlib
Spark comes with a library containing common machine learning (ML) functionality, called MLlib. MLlib provides multiple types of machine learning algorithms, including
classification, regression, clustering, and collaborative filtering, as well as supporting functionality such as model evaluation and data import. It also provides
some lower-level ML primitives, including a generic gradient descent optimization algorithm. All of these methods are designed to scale out across a cluster.
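
For illustration only (assuming the spark-mllib artifact, "org.apache.spark" %% "spark-mllib" % "1.6.0", is added to build.sbt; the data points are made up), clustering with MLlib's k-means looks like this:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object KMeansExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("KMeansExample").setMaster("local[*]"))

    // a tiny in-memory dataset of 2-dimensional points, distributed as an RDD
    val points = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
      Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.2)
    ))

    // cluster into 2 groups with at most 20 iterations; the work is spread across the cluster
    val model = KMeans.train(points, 2, 20)
    model.clusterCenters.foreach(println)

    sc.stop()
  }
}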

GraphX
GraphX is a library for manipulating graphs (e.g., a social network’s friend graph) and performing graph-parallel computations. Like Spark Streaming and Spark SQL,
GraphX extends the Spark RDD API, allowing us to create a directed graph with arbitrary properties attached to each vertex and edge. GraphX also provides various operators
for manipulating graphs (e.g., subgraph and mapVertices) and a library of common graph algorithms (e.g., PageRank and triangle counting).
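
A hedged sketch of the GraphX API in Scala (assuming the spark-graphx artifact, "org.apache.spark" %% "spark-graphx" % "1.6.0", in build.sbt; the users and edges are invented):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

object GraphXExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("GraphXExample").setMaster("local[*]"))

    // vertices carry a user name, edges carry the relationship type
    val users = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
    val follows = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows"), Edge(3L, 1L, "follows")))
    val graph = Graph(users, follows)

    // run PageRank until the ranks converge within the given tolerance
    val ranks = graph.pageRank(0.001).vertices
    ranks.collect().foreach(println)

    sc.stop()
  }
}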

Cluster Managers
Under the hood, Spark is designed to efficiently scale up from one to many thousands of compute nodes. To achieve this while maximizing flexibility, Spark can run over a
variety of cluster managers, including Hadoop YARN, Apache Mesos, and a simple cluster manager included in Spark itself called the Standalone Scheduler. If you are
just installing Spark on an empty set of machines, the Standalone Scheduler provides an easy way to get started; if you already have a Hadoop YARN or Mesos cluster,
however, Spark’s support for these cluster managers allows your applications to also run on them.
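
In practice the cluster manager is usually selected with spark-submit's --master option, but the same idea can be sketched in application code; the master URLs below are placeholders for your own cluster:

import org.apache.spark.{SparkConf, SparkContext}

object ClusterManagerExample {
  def main(args: Array[String]): Unit = {
    // pick the master from the command line, defaulting to a local run:
    //   "local[*]"                  - run in-process for testing
    //   "spark://master-host:7077"  - the Standalone Scheduler
    //   "yarn-client"               - Hadoop YARN (Spark 1.x style)
    //   "mesos://master-host:5050"  - Apache Mesos
    val master = args.headOption.getOrElse("local[*]")
    val conf = new SparkConf().setAppName("ClusterManagerExample").setMaster(master)
    val sc = new SparkContext(conf)
    println("Running against master: " + sc.master)
    sc.stop()
  }
}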