Posts

Showing posts from December, 2017

How to read a tab-separated text file and display the details in a table using Spark and Scala

How to read a tab-separated text file and display the details in a table using Spark and Scala.

1. Read the text (tab-separated) file that needs to be processed.

scala> val election2014 = sc.textFile("file:///home/hadoop/workspace/scala/rajudata/data-master/electionresults/ls2014.tsv")
election2014: org.apache.spark.rdd.RDD[String] = file:///home/hadoop/workspace/scala/rajudata/data-master/electionresults/ls2014.tsv MapPartitionsRDD[59] at textFile at <console>:24

Note: You can get the data file at the location below.
https://github.com/dgadiraju/data/tree/master/electionresults

2. The input text file has a header, which needs to be filtered out before the actual processing.

scala> election2014.first
res72: String = state        constituency        candidate_name        sex        age        category        partyname        partysymbol        general        postal        total        pct_of_total_votes        pct_of_polled_votes        totalvo
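The excerpt above is cut off mid-output. A minimal sketch of how the processing might continue, assuming the header row is dropped and a few leading columns (state, constituency, candidate_name) are picked out of each tab-separated line; the column positions and names are assumptions based on the header shown above, and sqlContext is the one predefined in spark-shell:

scala> val header = election2014.first
scala> val records = election2014.filter(line => line != header)   // drop the header row
scala> val candidates = records.map(_.split("\t")).map(f => (f(0), f(1), f(2)))   // state, constituency, candidate_name (assumed positions)
scala> candidates.take(5).foreach(println)
scala> // Optionally turn the tuples into a DataFrame to display them as a table
scala> import sqlContext.implicits._
scala> candidates.toDF("state", "constituency", "candidate_name").show()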

Different ways of creating a DataFrame in Spark using Scala

Create a DataFrame using an existing RDD.

Method 1: Create a DataFrame using sqlContext [SQLContext.createDataFrame(rdd)].

1. Create an RDD from a List.

scala> val mapRDD = sc.parallelize(List((1, "value1"),(2, "value2"),(3, "value3")))
mapRDD: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[17] at parallelize at <console>:24

Check the data type of the RDD.

scala> mapRDD
res16: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[17] at parallelize at <console>:24

2. Create sqlContext using the Spark context.

scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
warning: there was one deprecation warning; re-run with -deprecation for details
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@35a792

3. Create a DataFrame using sqlContext.

scala> val rddToDF = sqlContext.createDataFrame(mapRDD)
rddToDF: org.apache.spark.sql.DataF
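The excerpt is truncated; a minimal sketch of how the example might continue, assuming column names of your choosing (here "id" and "value") and the implicits provided by sqlContext:

scala> rddToDF.show()   // display the DataFrame as a table with the default column names (_1, _2)
scala> import sqlContext.implicits._
scala> val namedDF = mapRDD.toDF("id", "value")   // same RDD, but with explicit (assumed) column names
scala> namedDF.show()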

Read data from MySQL using sqlContext and filter records.

Exercise 3: Read data from MySQL using sqlContext and filter records.

Problem statement: Find all persons having age greater than 30 years from the MySQL database table PERSON using Spark.

1. Create a table called PERSON.

mysql> CREATE TABLE PERSON(
    ID INT NOT NULL AUTO_INCREMENT,
    FNAME VARCHAR(100),
    AGE INT,
    PRIMARY KEY(ID)
);

2. Insert records into the PERSON table.

INSERT INTO PERSON (FNAME, AGE) VALUES('Vinayak', 35);
INSERT INTO PERSON (FNAME, AGE) VALUES('Nilesh', 37);
INSERT INTO PERSON (FNAME, AGE) VALUES('Raju', 30);
INSERT INTO PERSON (FNAME, AGE) VALUES('Karthik', 28);
INSERT INTO PERSON (FNAME, AGE) VALUES('Shreshta', 1);
INSERT INTO PERSON (FNAME, AGE) VALUES('Siddhish', 2);

3. Check the data in the PERSON table.

mysql> SELECT * FROM PERSON;
+----+----------+------+
| ID | FNAME    | AGE  |
+----+----------+------+
|  1 | Vinayak  |   35 |
|  2 | Nilesh   |   37 |
|  3
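The excerpt stops at the table listing; a minimal sketch of how the MySQL table might be read and filtered from spark-shell, assuming a database named test, placeholder root credentials, and that the MySQL JDBC connector JAR is on the spark-shell classpath (e.g. passed via --jars):

scala> val prop = new java.util.Properties()
scala> prop.setProperty("user", "root")
scala> prop.setProperty("password", "password")   // placeholder credentials
scala> val personDF = sqlContext.read.jdbc("jdbc:mysql://localhost:3306/test", "PERSON", prop)
scala> personDF.filter("AGE > 30").show()   // persons older than 30 years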

Spark: Read an HDFS JSON file and filter records based on certain criteria.

Spark - Exercise 2: Read an HDFS JSON file and filter records based on certain criteria.

Problem statement: Find all the Person records (from a JSON input file) having age greater than 30 years.

1. Create a file on the local file system with the name Person.json.

> vi Person.json

2. Add the records below.

{"Name" : "Vinayak", "Age":35}
{"Name" : "Nilesh", "Age":37}
{"Name" : "Raju", "Age":30}
{"Name" : "Karthik", "Age":28}
{"Name" : "Shreshta","Age":1}
{"Name" : "Siddhish", "Age":2}

3. Create a directory on the HDFS file system.

hadoop fs -mkdir /user/spark/PersonJSONExample/

4. Put the Person.json file onto HDFS.

hadoop fs -put Person.json /user/spark/PersonJSONExample/

5. Check whether the file has been uploaded.

[root@localhost PersonExample]# hadoop fs -ls /user/spark/Pe
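The listing ends at the hadoop fs -ls check; a minimal sketch of how the JSON file could then be loaded and filtered in spark-shell, assuming the sqlContext predefined there:

scala> val personDF = sqlContext.read.json("/user/spark/PersonJSONExample/Person.json")
scala> personDF.printSchema()   // Age and Name columns inferred from the JSON records
scala> personDF.filter("Age > 30").show()   // persons older than 30 years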

Spark: Read an HDFS text file and filter records based on certain criteria.

Spark - Exercise 1: Read an HDFS text file and filter records based on certain criteria.

Problem statement: Find all the Person records having age greater than 30 years.

1. Create a file on the local file system with the name Person.txt.

> vi Person.txt

2. Add the records below.

Name, Age
Vinayak, 35
Nilesh, 37
Raju, 30
Karthik, 28
Shreshta,1
Siddhish, 2

3. Create a directory on the HDFS file system.

hadoop fs -mkdir /user/spark/PersonExample/

4. Put the Person.txt file onto HDFS.

hadoop fs -put Person.txt /user/spark/PersonExample/

5. Check whether the file has been uploaded.

[root@localhost PersonExample]# hadoop fs -ls /user/spark/PersonExample/
Found 1 items
-rw-r--r--   1 root supergroup         77 2017-12-17 14:34 /user/spark/PersonExample/Person.txt

6. Start the Spark shell using the spark-shell command.

$> spark-shell

7. Load the file using the Spark context.

scala> var persons = sc.textFile("/user/spark/PersonExample/P
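The excerpt ends mid-command; a minimal sketch of how the filtering might be completed once the file is loaded, assuming the comma-separated layout shown in step 2 (name in the first field, age in the second):

scala> val persons = sc.textFile("/user/spark/PersonExample/Person.txt")
scala> val header = persons.first
scala> val olderThan30 = persons.filter(_ != header).map(_.split(",")).filter(fields => fields(1).trim.toInt > 30)   // drop header, split on comma, keep age > 30
scala> olderThan30.map(_.mkString(", ")).collect.foreach(println)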