Posts

Featured Post

Spark: Read an HDFS text file and filter records based on certain criteria.

Spark - Exercise 1: Read an HDFS text file and filter records based on certain criteria.

Problem statement: find all the Person records having age greater than 30 years. (A hedged sketch of the filter itself follows this excerpt.)

1. Create a file on the local file system with the name Person.txt:
>vi Person.txt
2. Add the records below:
Name, Age
Vinayak, 35
Nilesh, 37
Raju, 30
Karthik, 28
Shreshta,1
Siddhish, 2
3. Create a directory on the HDFS file system:
hadoop fs -mkdir /user/spark/PersonExample/
4. Put the Person.txt file onto HDFS:
hadoop fs -put Person.txt /user/spark/PersonExample/
5. Check whether the file has been uploaded:
[root@localhost PersonExample]# hadoop fs -ls /user/spark/PersonExample/
Found 1 items
-rw-r--r--   1 root supergroup         77 2017-12-17 14:34 /user/spark/PersonExample/Person.txt
6. Start the Spark shell:
$>spark-shell
7. Load the file using the Spark context:
scala> var persons = sc.textFile("/user/spark/PersonExample/P…
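The excerpt cuts off at step 7, so here is a minimal sketch of the remaining steps, assuming the file layout above (header line "Name, Age", comma-separated records) and the full path /user/spark/PersonExample/Person.txt shown in the step 5 listing:

val persons = sc.textFile("/user/spark/PersonExample/Person.txt")
val header = persons.first()                      // "Name, Age"
val olderThan30 = persons
  .filter(_ != header)                            // drop the header row
  .map(_.split(",").map(_.trim))                  // "Vinayak, 35" -> Array("Vinayak", "35")
  .filter(fields => fields(1).toInt > 30)
olderThan30.collect().foreach(f => println(f.mkString(", ")))

With the sample data this prints Vinayak, 35 and Nilesh, 37; Raju is exactly 30, so the strict "greater than" comparison excludes him.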

Aggregation vs Composition

Aggregation (collection) differs from ordinary composition in that it does not imply ownership. In composition, when the owning object is destroyed, so are the contained objects; in aggregation, this is not necessarily true. Composition (mixture) is a way to combine simple objects or data types into more complex ones, and compositions are a critical building block of many basic data structures. Both denote a relationship between objects and differ only in their strength, and UML has distinct notations for each kind of dependency between two classes.

Composition: since an Engine is part-of a Car, the relationship between them is composition. Here is how it is implemented between Java classes:

public class Car {
    // final will make sure the engine is initialized
    private final Engine engine;

    public Car() {
        engine = new Engine();
    }
}

class Engine {
    private String type;
}

Aggregation: since an Organization has Person objects as employees, the relationship between them is aggregation. Here is how they…
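The excerpt truncates before the aggregation code, so here is a hedged sketch of both relationships in Scala, mirroring the Java composition example above (the class names come from the post; the constructor details are illustrative):

class Engine(val engineType: String)

class Car {
  // Composition: the Engine is created inside Car and cannot outlive it.
  private val engine: Engine = new Engine("V8")
}

class Person(val name: String)

// Aggregation: the Person objects are created elsewhere and merely referenced,
// so discarding the Organization does not destroy its employees.
class Organization(val employees: List[Person])

val people = List(new Person("Alice"), new Person("Bob"))
val org = new Organization(people) // people stay alive if org is discarded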

Generate Public, Private key and Certificates using openssl.

Generate public and private keys and certificates using openssl. Here are some openssl commands from our discussion earlier about private/public keys.

1. Generate a private key:
openssl genrsa -out private.pem 2048
2. Create a CSR (certificate signing request):
openssl req -new -key private.pem -out csr.pem
3. Create a self-signed certificate (signed with the private key instead of a CA) from the CSR, with a 1-year expiry:
openssl x509 -req -days 365 -in csr.pem -signkey private.pem -sha256 -out cert.pem -outform PEM
4. Look at the details of the certificate:
openssl x509 -in cert.pem -noout -text
5. Extract the public key from the certificate (to stdout):
openssl x509 -in cert.pem -noout -pubkey
6. Extract the public key from the private key (to a file):
openssl rsa -in private.pem -outform PEM -pubout -out public.pem
7. Check whether a certificate and CSR match your private key by comparing moduli:
openssl rsa -noout -modulus -in private.pem | openssl md5
openssl x509 -noout -modulus -in cert.pem | openssl md5
openssl req -noout -modulus -i…
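As a small cross-check from code, here is a hedged Scala sketch (not from the original post) that loads the cert.pem produced in step 3 using only the standard JDK; Java's X.509 CertificateFactory accepts PEM input directly:

import java.io.FileInputStream
import java.security.cert.{CertificateFactory, X509Certificate}

object InspectCert extends App {
  val in = new FileInputStream("cert.pem") // file name from step 3 above
  try {
    val cert = CertificateFactory.getInstance("X.509")
      .generateCertificate(in)
      .asInstanceOf[X509Certificate]
    println(s"Subject:    ${cert.getSubjectX500Principal}")
    println(s"Expires:    ${cert.getNotAfter}")              // ~1 year out, per -days 365
    println(s"Public key: ${cert.getPublicKey.getAlgorithm}") // RSA, per genrsa
  } finally in.close()
}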

Git Aliases

# ----------------------
# Git Aliases
# ----------------------
alias gclone='git clone'
alias ga='git add'
alias gaa='git add .'
alias gaaa='git add --all'
alias gau='git add --update'
alias gf='git fetch'
alias gb='git branch'
alias gbd='git branch --delete '
alias gc='git commit'
alias gca='git commit --amend'
alias gcm='git commit --message'
alias pgacm='git add . && git commit -m'
alias gacm='./gradlew goJF && git add . && git commit -m'
alias gcf='git commit --fixup'
alias gp='git push'
alias gpf='git push --force'
alias gps='git push --set-upstream origin'
alias gco='git checkout'
alias gcob='git checkout -b'
alias gcom='git checkout master'
alias gcos='git checkout staging'
alias gcod='git checkout develop'
alias gd='git diff'
alias gda='git diff HEAD'
alias gi='…

Conversion from one file format to other in Apache Spark

Conversion flow: read in one format --> write in another.

Text file

Import the source data as a text file with Sqoop:
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username retail_dba --password cloudera \
--table orders \
--target-dir /user/cloudera/ReadDiffFileFormat/text \
--as-textfile

Read:
scala> val textFile = sc.textFile("/user/cloudera/ReadDiffFileFormat/text")
textFile: org.apache.spark.rdd.RDD[String] = /user/cloudera/ReadDiffFileFormat/text MapPartitionsRDD[279] at textFile at <console>:30

Write as a text file:
textFile.saveAsTextFile("/user/cloudera/ReadDiffFileFormat/textout")

Using compression:
textFile.saveAsTextFile("/user/cloudera/ReadDiffFileFormat/text/textoutput/compressed", classOf[org.apache.hadoop.io.compress.BZip2Codec])

Sequence file

For a sequence file we need a key:
val textMap = textFile.map(e => (e.split(",")(0).toInt, e))
textMap.saveAsSequenceFile("/user/cloudera/ReadDiffF…
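The excerpt stops at the sequence-file write, so here is a hedged sketch of reading that sequence file back and converting it onward, assuming the Spark 1.x shell (sc, sqlContext) used above; the output paths are illustrative, not from the post:

// Read the sequence file back: keys are the Int order ids, values the CSV lines.
val seq = sc.sequenceFile[Int, String]("/user/cloudera/ReadDiffFileFormat/seqout")

// Wrap it in a DataFrame so the columnar formats become one-liners.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
val schema = StructType(Seq(
  StructField("order_id", IntegerType),
  StructField("line", StringType)))
val df = sqlContext.createDataFrame(seq.map { case (k, v) => Row(k, v) }, schema)

df.write.parquet("/user/cloudera/ReadDiffFileFormat/parquetout") // Parquet
df.write.json("/user/cloudera/ReadDiffFileFormat/jsonout")       // JSON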

Problem: Find the top 50 voted movies using Spark RDD, DataFrame and SQL

Problem:
1. Download data from the site below: https://datasets.imdbws.com/
2. Download the movies data title.ratings.tsv.gz and title.akas.tsv.gz
3. Find the top 50 voted movies (a hedged RDD sketch follows this excerpt)
4. Storage details
Columns: titleId, title, region, language, averageRating, numVotes
Store the result at: /home/cloudera/workspace/movies/<Method>/<formatname>
Store the result in the following formats:
a. Text file. Columns to be separated with tab "\t". Compression: BZip2Codec
b. Sequence file. Compression: BZip2Codec
c. JSON file. Compression: BZip2Codec
d. Parquet. Compression: uncompressed
e. ORC file.
f. Avro file. Compression: uncompressed
Use the following methods:
Method 1: Use RDD
Method 2: Use DF
Method 3: Use SQL query
Pre-work:
[root@quickstart movies]# hadoop fs -mkdir /home/cloudera/workspace/movies
[root@quickstart movies]# hadoop fs -put /home/cloudera/Downloads/movies/title.ratings.tsv.gz /h…
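As referenced in step 3, here is a hedged RDD sketch for Method 1, assuming the standard IMDb layouts (title.ratings.tsv: tconst, averageRating, numVotes; title.akas.tsv: titleId, ordering, title, region, language, ...) and illustrative output paths:

val ratings = sc.textFile("/home/cloudera/workspace/movies/title.ratings.tsv.gz")
val rHeader = ratings.first()
val ratingsKV = ratings.filter(_ != rHeader).map { line =>
  val f = line.split("\t")
  (f(0), (f(1).toDouble, f(2).toLong))   // titleId -> (averageRating, numVotes)
}

val akas = sc.textFile("/home/cloudera/workspace/movies/title.akas.tsv.gz")
val aHeader = akas.first()
val akasKV = akas.filter(_ != aHeader).map { line =>
  val f = line.split("\t")
  (f(0), (f(2), f(3), f(4)))             // titleId -> (title, region, language)
}

val top50 = ratingsKV.join(akasKV)
  .map { case (id, ((rating, votes), (title, region, lang))) =>
    (votes, Seq(id, title, region, lang, rating, votes).mkString("\t")) }
  .sortByKey(ascending = false)
  .map(_._2)
  .take(50)

sc.parallelize(top50, 1).saveAsTextFile(
  "/home/cloudera/workspace/movies/rdd/text",
  classOf[org.apache.hadoop.io.compress.BZip2Codec])

Note that title.akas.tsv has one row per region, so a widely distributed title can appear several times after the join; that matches the required region/language columns, but it is worth knowing.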

Problem: Find the top rated movies using Hive and store the result to HDFS

1. Download data from the site below: https://datasets.imdbws.com/
2. Download the movies data title.ratings.tsv.gz and title.akas.tsv.gz
3. Find the top 50 rated movies with more than 100000 votes
4. Find the top 50 voted movies
5. Storage details
Columns: titleId, title, region, language, averageRating, numVotes
Store the result at: /home/cloudera/workspace/movies/hive/<formatname>
Store the result in the following formats:
a. Text file. Columns to be separated with tab "\t"
b. Sequence file.
c. RC file.
d. Parquet.
e. ORC file. Compression: SNAPPY
f. Avro file.
Use Hive to load the data and write the output to the required location (a hedged sketch follows this excerpt).
Pre-work:
[root@quickstart movies]# hadoop fs -mkdir /home/cloudera/workspace/movies
[root@quickstart movies]# hadoop fs -put /home/cloudera/Downloads/movies/title.ratings.tsv.gz /home/cloudera/workspace/movies
[root@quickstart movies]# hadoop fs -ls /home/cloudera/works…
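As referenced above, here is a hedged sketch of the Hive flow for the ratings side, driven from the spark-shell via sqlContext.sql (on the CDH quickstart VM sqlContext is a HiveContext). The table names and locations are illustrative, and the title.akas.tsv join follows the same pattern; the same statements can be pasted into the hive CLI, which also reliably honors skip.header.line.count:

sqlContext.sql("""
  -- assumes title.ratings.tsv.gz was moved into its own directory first
  CREATE EXTERNAL TABLE IF NOT EXISTS ratings (
    titleId STRING, averageRating DOUBLE, numVotes INT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  LOCATION '/home/cloudera/workspace/movies/ratings'
  TBLPROPERTIES ('skip.header.line.count' = '1')""")

sqlContext.sql("""
  -- ORC output with SNAPPY compression, per storage requirement (e)
  CREATE TABLE top_rated_orc
  STORED AS ORC
  LOCATION '/home/cloudera/workspace/movies/hive/orc'
  TBLPROPERTIES ('orc.compress' = 'SNAPPY')
  AS SELECT titleId, averageRating, numVotes
     FROM ratings
     WHERE numVotes > 100000
     ORDER BY averageRating DESC
     LIMIT 50""")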

Problems on Apache Spark RDD, DataFrame, and SQL queries using SQLContext, with solutions

Problem:
1. Download data from the site below: https://datasets.imdbws.com/
2. Download the movies data title.ratings.tsv.gz and title.akas.tsv.gz
3. Find the top 50 rated movies with more than 100000 votes (a hedged DataFrame/SQL sketch follows this excerpt)
4. Storage details
Columns: titleId, title, region, language, averageRating, numVotes
Store the result at: /home/cloudera/workspace/movies/<Method>/<formatname>
Store the result in the following formats:
a. Text file. Columns to be separated with tab "\t". Compression: BZip2Codec
b. Sequence file. Compression: BZip2Codec
c. JSON file. Compression: BZip2Codec
d. Parquet. Compression: uncompressed
e. ORC file.
f. Avro file. Compression: uncompressed
Use the following methods:
Method 1: Use RDD
Method 2: Use DF
Method 3: Use SQL query
Pre-work:
[root@quickstart movies]# hadoop fs -mkdir /home/cloudera/workspace/movies
[root@quickstart movies]# hadoop fs -put /home/cloudera/Downloads/…
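As referenced in step 3, here is a hedged sketch of Methods 2 and 3 over the ratings file alone, assuming Spark 1.x with the spark-csv package available (e.g. spark-shell --packages com.databricks:spark-csv_2.10:1.5.0); joining title.akas.tsv for the title/region/language columns follows the same pattern, and the output paths are illustrative:

val ratingsDF = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("delimiter", "\t")
  .option("inferSchema", "true")
  .load("/home/cloudera/workspace/movies/title.ratings.tsv.gz")

// Method 2: DataFrame API.
import org.apache.spark.sql.functions.desc
val top50DF = ratingsDF
  .filter(ratingsDF("numVotes") > 100000)
  .orderBy(desc("averageRating"))
  .limit(50)
top50DF.write.parquet("/home/cloudera/workspace/movies/df/parquet") // uncompressed by requirement (d)

// Method 3: SQL query via SQLContext.
ratingsDF.registerTempTable("ratings")
val top50SQL = sqlContext.sql(
  "SELECT tconst, averageRating, numVotes FROM ratings " +
  "WHERE numVotes > 100000 ORDER BY averageRating DESC LIMIT 50")
top50SQL.toJSON.saveAsTextFile(
  "/home/cloudera/workspace/movies/sql/json",
  classOf[org.apache.hadoop.io.compress.BZip2Codec]) // JSON + BZip2, per requirement (c)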