Posts

Showing posts from 2018

Conversion from one file format to another in Apache Spark

Read --> Write

Text file
Import the orders table with Sqoop as a text file:
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username retail_dba --password cloudera \
--table orders \
--target-dir /user/cloudera/ReadDiffFileFormat/text \
--as-textfile

Read:
scala> val textFile = sc.textFile("/user/cloudera/ReadDiffFileFormat/text")
textFile: org.apache.spark.rdd.RDD[String] = /user/cloudera/ReadDiffFileFormat/text MapPartitionsRDD[279] at textFile at <console>:30

Text file
textFile.saveAsTextFile("/user/cloudera/ReadDiffFileFormat/textout")

Using compression:
textFile.saveAsTextFile("/user/cloudera/ReadDiffFileFormat/text/textoutput/compressed", classOf[org.apache.hadoop.io.compress.BZip2Codec])

Sequence file
For a sequence file we need a key.
val textMap = textFile.map(e => (e.split(",")(0).toInt, e))
textMap.saveAsSequenceFile("/user/cloudera/ReadDiffF
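The excerpt above cuts off in the middle of the sequence-file write. As a sketch of how that step might be completed and verified in the same spark-shell session, assuming an illustrative output directory (the post's exact path is truncated above):

import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.io.compress.BZip2Codec

// Write the keyed RDD as a BZip2-compressed sequence file (illustrative path).
textMap.saveAsSequenceFile("/user/cloudera/ReadDiffFileFormat/seqout", Some(classOf[BZip2Codec]))

// Read it back, converting the Hadoop writables into plain (Int, String) pairs.
val seqRead = sc.sequenceFile("/user/cloudera/ReadDiffFileFormat/seqout", classOf[IntWritable], classOf[Text])
  .map { case (k, v) => (k.get, v.toString) }
seqRead.take(5).foreach(println)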

Problem: Find the top 50 voted movies using Spark RDD, DataFrame and SQL

Problem:
1. Download data from the site below: https://datasets.imdbws.com/
2. Download the movies data title.ratings.tsv.gz and title.akas.tsv.gz
3. Find the top 50 voted movies
4. Storage details
Columns: titleId,title,region,language,averageRating,numVotes
Store the result at the location below: /home/cloudera/workspace/movies/<Method>/<formatname>
Store the result in the following formats:
a. Text file. Columns to be separated with tab "\t". Compression: BZip2Codec
b. Sequence file. Compression: BZip2Codec
c. JSON file. Compression: BZip2Codec
d. Parquet. Compression: uncompressed
e. ORC file.
f. Avro file. Compression: uncompressed
Use the following methods:
Method 1: Use RDD
Method 2: Use DF
Method 3: Use SQL query
Pre work:
hadoop fs -mkdir /home/cloudera/workspace/movies
[root@quickstart movies]# hadoop fs -mkdir /home/cloudera/workspace/movies
[root@quickstart movies]# hadoop fs -put /home/cloudera/Downloads/movies/title.ratings.tsv.gz /h
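The excerpt is truncated before the solutions, so here is a minimal sketch of Method 2 (DataFrame) for the top 50 voted movies, written for the Spark 1.x shell on the quickstart VM. The case classes, column positions (standard IMDb TSV layout), and the parquet output path are illustrative assumptions, not taken from the post.

import sqlContext.implicits._

// Illustrative case classes covering the columns needed from each IMDb TSV file.
case class Rating(tconst: String, averageRating: Double, numVotes: Long)
case class Aka(titleId: String, title: String, region: String, language: String)

// Drop the header row by name, split on tabs, and convert to DataFrames.
val ratingsDF = sc.textFile("/home/cloudera/workspace/movies/title.ratings.tsv.gz")
  .filter(!_.startsWith("tconst")).map(_.split("\t"))
  .map(a => Rating(a(0), a(1).toDouble, a(2).toLong)).toDF()

val akasDF = sc.textFile("/home/cloudera/workspace/movies/title.akas.tsv.gz")
  .filter(!_.startsWith("titleId")).map(_.split("\t"))
  .map(a => Aka(a(0), a(2), a(3), a(4))).toDF()

// Method 2 (DF): join on the title id and keep the 50 most-voted titles.
val top50Voted = akasDF.join(ratingsDF, akasDF("titleId") === ratingsDF("tconst"))
  .select("titleId", "title", "region", "language", "averageRating", "numVotes")
  .orderBy($"numVotes".desc)
  .limit(50)

// Example sink: uncompressed parquet under the DF method directory.
sqlContext.setConf("spark.sql.parquet.compression.codec", "uncompressed")
top50Voted.write.format("parquet").save("/home/cloudera/workspace/movies/DF/parquet")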

Problem: Find the top rated movies using Hive and store the result in HDFS

1. Download data from the site below: https://datasets.imdbws.com/
2. Download the movies data title.ratings.tsv.gz and title.akas.tsv.gz
3. Find the top 50 rated movies with more than 100000 votes
4. Find the top 50 voted movies
5. Storage details
Columns: titleId,title,region,language,averageRating,numVotes
Store the result at the location below: /home/cloudera/workspace/movies/hive/<formatname>
Store the result in the following formats:
a. Text file. Columns to be separated with tab "\t"
b. Sequence file.
c. RC file.
d. Parquet.
e. ORC file. Compression: SNAPPY
f. Avro file.
Use Hive to load the data and write the output to the required location.
Pre work:
hadoop fs -mkdir /home/cloudera/workspace/movies
[root@quickstart movies]# hadoop fs -mkdir /home/cloudera/workspace/movies
[root@quickstart movies]# hadoop fs -put /home/cloudera/Downloads/movies/title.ratings.tsv.gz /home/cloudera/workspace/movies
[root@quickstart movies]# hadoop fs -ls /home/cloudera/works
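The post itself drives this through Hive; the excerpt is cut off before those statements, so below is a rough sketch of the same flow run through Spark's HiveContext instead of the Hive CLI. Table names, the per-file directory layout, and the header-row filter are illustrative assumptions.

import org.apache.spark.sql.hive.HiveContext
val hc = new HiveContext(sc)

// External tables over the raw TSVs; each LOCATION is assumed to be a directory
// containing only that file (hypothetical layout, not spelled out in the excerpt).
hc.sql("""CREATE EXTERNAL TABLE IF NOT EXISTS title_ratings
          (tconst STRING, averageRating DOUBLE, numVotes BIGINT)
          ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
          LOCATION '/home/cloudera/workspace/movies/ratings'""")
hc.sql("""CREATE EXTERNAL TABLE IF NOT EXISTS title_akas
          (titleId STRING, ordering INT, title STRING, region STRING, language STRING,
           types STRING, attributes STRING, isOriginalTitle STRING)
          ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
          LOCATION '/home/cloudera/workspace/movies/akas'""")

// ORC output with SNAPPY compression under the hive method directory.
hc.sql("""CREATE TABLE IF NOT EXISTS movies_orc
          (titleId STRING, title STRING, region STRING, language STRING,
           averageRating DOUBLE, numVotes BIGINT)
          STORED AS ORC
          LOCATION '/home/cloudera/workspace/movies/hive/orc'
          TBLPROPERTIES ('orc.compress'='SNAPPY')""")

// Top 50 rated movies with more than 100000 votes; the header row is dropped by name.
hc.sql("""INSERT OVERWRITE TABLE movies_orc
          SELECT a.titleId, a.title, a.region, a.language, r.averageRating, r.numVotes
          FROM title_ratings r JOIN title_akas a ON r.tconst = a.titleId
          WHERE r.numVotes > 100000 AND r.tconst <> 'tconst'
          ORDER BY r.averageRating DESC
          LIMIT 50""")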

Problems on Apache Spark RDD, DataFrame, and SQL queries using SQLContext, with solutions

Problem:
1. Download data from the site below: https://datasets.imdbws.com/
2. Download the movies data title.ratings.tsv.gz and title.akas.tsv.gz
3. Find the top 50 rated movies with more than 100000 votes
4. Storage details
Columns: titleId,title,region,language,averageRating,numVotes
Store the result at the location below: /home/cloudera/workspace/movies/<Method>/<formatname>
Store the result in the following formats:
a. Text file. Columns to be separated with tab "\t". Compression: BZip2Codec
b. Sequence file. Compression: BZip2Codec
c. JSON file. Compression: BZip2Codec
d. Parquet. Compression: uncompressed
e. ORC file.
f. Avro file. Compression: uncompressed
Use the following methods:
Method 1: Use RDD
Method 2: Use DF
Method 3: Use SQL query
Pre work:
hadoop fs -mkdir /home/cloudera/workspace/movies
[root@quickstart movies]# hadoop fs -mkdir /home/cloudera/workspace/movies
[root@quickstart movies]# hadoop fs -put /home/cloudera/Downloads/
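The excerpt stops before the solutions; below is a minimal sketch of Method 1 (plain RDDs) for the top 50 rated movies with more than 100000 votes, with a tab-separated, BZip2-compressed text output. Column positions assume the standard IMDb TSV layout, and the paths are illustrative.

import org.apache.hadoop.io.compress.BZip2Codec

// (tconst, (averageRating, numVotes)) pairs, dropping the header row by name.
val ratings = sc.textFile("/home/cloudera/workspace/movies/title.ratings.tsv.gz")
  .filter(!_.startsWith("tconst")).map(_.split("\t"))
  .map(a => (a(0), (a(1).toDouble, a(2).toLong)))

// (titleId, (title, region, language)) pairs from the akas file.
val akas = sc.textFile("/home/cloudera/workspace/movies/title.akas.tsv.gz")
  .filter(!_.startsWith("titleId")).map(_.split("\t"))
  .map(a => (a(0), (a(2), a(3), a(4))))

// Join, keep titles with more than 100000 votes, sort by rating, take the top 50.
val top50 = akas.join(ratings)
  .filter { case (_, (_, (_, votes))) => votes > 100000L }
  .sortBy({ case (_, (_, (rating, _))) => rating }, ascending = false)
  .take(50)

// Method 1 text sink: tab-separated columns, BZip2 compression.
sc.parallelize(top50.map { case (id, ((title, region, language), (rating, votes))) =>
    Seq(id, title, region, language, rating, votes).mkString("\t")
  })
  .saveAsTextFile("/home/cloudera/workspace/movies/RDD/text", classOf[BZip2Codec])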