
Showing posts from 2018

Conversion from one file format to another in Apache Spark

Read --> Write
 |
 V

Text file

sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username retail_dba --password cloudera \
  --table orders \
  --target-dir /user/cloudera/ReadDiffFileFormat/text \
  --as-textfile

Read:

scala> val textFile = sc.textFile("/user/cloudera/ReadDiffFileFormat/text")
textFile: org.apache.spark.rdd.RDD[String] = /user/cloudera/ReadDiffFileFormat/text MapPartitionsRDD[279] at textFile at <console>:30

Text file

textFile.saveAsTextFile("/user/cloudera/ReadDiffFileFormat/textout")

Using compression

textFile.saveAsTextFile("/user/cloudera/ReadDiffFileFormat/text/textoutput/compressed", classOf[])

Sequence file

A sequence file stores key-value pairs, so each record needs a key. Here the first comma-separated field of each line is used as the key.

val textMap = textFile.map(e => (e.split(",")(0).toInt, e))
textMap.saveAsSequenceFile("/user/cloudera/ReadDiffF...
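The key extraction for the sequence file can be tried out without a cluster. The following is a minimal sketch in plain Scala, not the post's own code: the object name `SequenceFileKeys`, the function name `toKeyValue`, and the sample lines are hypothetical, and it assumes the first comma-separated field of every line is an integer.

```scala
// Minimal sketch: build the (key, value) pair that a sequence file needs.
// Assumption: the first comma-separated field of each line is an integer id.
object SequenceFileKeys {
  // Returns the first field as the key and the whole line as the value.
  def toKeyValue(line: String): (Int, String) = {
    val firstField = line.split(",")(0)
    (firstField.toInt, line)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical sample lines, made up for illustration only.
    val sample = Seq(
      "101,first example record",
      "102,second example record"
    )
    sample.map(toKeyValue).foreach(println)
    // In spark-shell the same function could be applied to the RDD, e.g.
    //   textFile.map(SequenceFileKeys.toKeyValue).saveAsSequenceFile(outputPath)
    // where outputPath is a hypothetical output directory.
  }
}
```

A line whose first field is not an integer makes `toInt` throw a `NumberFormatException`, so input with a header row or blank lines would need filtering first.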

Problem: Find the top 50 voted movies using Spark RDD, DataFrame and SQL

Problem:

1. Download data from the site below.
2. Download the movies data title.ratings.tsv.gz and title.akas.tsv.gz
3. Find the top 50 voted movies
4. Storage details
   Columns: titleId,title,region,language,averageRating,numVotes
   Store the result at the location below: /home/cloudera/workspace/movies/<Method>/<formatname>
   Store the result in the following formats.
   a. Text file. Columns to be separated with tab "\t". Compression: BZip2Codec
   b. Sequence file. Compression: BZip2Codec
   c. JSON file. Compression: BZip2Codec
   d. Parquet. Compression: uncompressed
   e. ORC file.
   f. Avro file. Compression: uncompressed

Use the following methods:
Method 1: Use RDD
Method 2: Use DF
Method 3: Use SQL query.

Pre work:

hadoop fs -mkdir /home/cloudera/workspace/movies

[root@quickstart movies]# hadoop fs -mkdir /home/cloudera/workspace/movies
[root@quickstart movies]# hadoop fs -put /home/cloudera/Downloads/movies/title.ratings.tsv.gz /h...
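The core of the "top 50 voted" step is a join on titleId followed by a descending sort on numVotes. The sketch below shows that logic on plain Scala collections so it runs without Spark. It is an illustration under assumptions, not the post's solution: the case classes `Rating` and `Aka`, the object `TopVoted`, and the sample values are hypothetical, and every akas row is treated as its own result row.

```scala
// Minimal sketch of "join on titleId, sort by votes, keep the top n".
object TopVoted {
  // Hypothetical record shapes; only the output columns named in the problem are modelled.
  case class Rating(titleId: String, averageRating: Double, numVotes: Int)
  case class Aka(titleId: String, title: String, region: String, language: String)

  // Joins akas to ratings on titleId, orders by votes (highest first),
  // keeps n rows, and renders each row as a tab-separated line.
  def topVoted(ratings: Seq[Rating], akas: Seq[Aka], n: Int): Seq[String] = {
    val ratingById = ratings.map(r => r.titleId -> r).toMap
    val joined = akas.flatMap(a => ratingById.get(a.titleId).map(r => (a, r)).toList)
    joined
      .sortBy { case (_, r) => -r.numVotes }
      .take(n)
      .map { case (a, r) =>
        Seq(a.titleId, a.title, a.region, a.language, r.averageRating, r.numVotes).mkString("\t")
      }
  }

  def main(args: Array[String]): Unit = {
    // Made-up sample data for illustration only.
    val ratings = Seq(Rating("tt1", 9.3, 300), Rating("tt2", 7.5, 900))
    val akas = Seq(Aka("tt1", "A", "US", "en"), Aka("tt2", "B", "GB", "en"))
    topVoted(ratings, akas, 50).foreach(println)
  }
}
```

On RDDs the same roles are played by a pair-RDD `join` on titleId and `sortBy(..., ascending = false)` followed by `take(50)`.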

Problem: Find the top rated movies using Hive and store the result to HDFS

1. Download data from the site below.
2. Download the movies data title.ratings.tsv.gz and title.akas.tsv.gz
3. Find the top 50 rated movies with more than 100000 votes
4. Find the top 50 voted movies
5. Storage details
   Columns: titleId,title,region,language,averageRating,numVotes
   Store the result at the location below: /home/cloudera/workspace/movies/hive/<formatname>
   Store the result in the following formats.
   a. Text file. Columns to be separated with tab "\t"
   b. Sequence file.
   c. RC file.
   d. Parquet.
   e. ORC file. Compression: SNAPPY
   f. Avro file.

Use Hive to load data and output data to the required location.

Pre work:

hadoop fs -mkdir /home/cloudera/workspace/movies

[root@quickstart movies]# hadoop fs -mkdir /home/cloudera/workspace/movies
[root@quickstart movies]# hadoop fs -put /home/cloudera/Downloads/movies/title.ratings.tsv.gz /home/cloudera/workspace/movies
[root@quickstart movies]# hadoop fs -ls /home/cloudera/works...

Problems on Apache Spark RDD, DataFrame, and SQL queries using SQLContext, with solutions

Problem:

1. Download data from the site below.
2. Download the movies data title.ratings.tsv.gz and title.akas.tsv.gz
3. Find the top 50 rated movies with more than 100000 votes
4. Storage details
   Columns: titleId,title,region,language,averageRating,numVotes
   Store the result at the location below: /home/cloudera/workspace/movies/<Method>/<formatname>
   Store the result in the following formats.
   a. Text file. Columns to be separated with tab "\t". Compression: BZip2Codec
   b. Sequence file. Compression: BZip2Codec
   c. JSON file. Compression: BZip2Codec
   d. Parquet. Compression: uncompressed
   e. ORC file.
   f. Avro file. Compression: uncompressed

Use the following methods:
Method 1: Use RDD
Method 2: Use DF
Method 3: Use SQL query.

Pre work:

hadoop fs -mkdir /home/cloudera/workspace/movies

[root@quickstart movies]# hadoop fs -mkdir /home/cloudera/workspace/movies
[root@quickstart movies]# hadoop fs -put /home/cloudera/Downlo...
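The "top 50 rated with more than 100000 votes" step comes down to parsing the ratings lines, filtering on the vote count, and sorting by rating. Here is a minimal plain-Scala sketch of that sequence. The names `TopRated`, `Rating`, `parse`, and `topRated` are hypothetical, and the layout of three tab-separated columns (id, rating, votes) with a header row is an assumption about the ratings file, not something stated in the post.

```scala
import scala.util.Try

// Minimal sketch of "parse, filter by votes, sort by rating, keep the top n".
object TopRated {
  // Hypothetical record shape for one ratings row.
  case class Rating(titleId: String, averageRating: Double, numVotes: Int)

  // Parses one tab-separated line. Returns None for a header row or any
  // line that does not have exactly three fields with numeric rating and votes.
  def parse(line: String): Option[Rating] = line.split("\t") match {
    case Array(id, rating, votes) =>
      Try(Rating(id, rating.toDouble, votes.toInt)).toOption
    case _ => None
  }

  // Keeps rows with strictly more than minVotes votes, highest rating first.
  def topRated(lines: Seq[String], minVotes: Int, n: Int): Seq[Rating] =
    lines
      .flatMap(line => parse(line).toList)
      .filter(_.numVotes > minVotes)
      .sortWith((a, b) => a.averageRating > b.averageRating)
      .take(n)

  def main(args: Array[String]): Unit = {
    // Made-up sample lines for illustration only.
    val lines = Seq(
      "id\trating\tvotes",
      "tt1\t9.0\t150000",
      "tt2\t9.5\t50",
      "tt3\t8.0\t200000"
    )
    topRated(lines, 100000, 50).foreach(println)
  }
}
```

Returning `Option` from `parse` lets the header and malformed rows drop out silently, which is convenient for a sketch but hides bad input. A real job might count or log the rejected lines instead.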