Posts

Showing posts from January, 2018

How to use map, filter, groupByKey.

How to use map, filter, groupByKey Problem statement: Get daily revenue by product with CLOSED and COMPLETED orders. Save the output to a text, CSV, TSV and parguet file. The example is using spark-shell for executing all the steps. 1. Read all the files from local system, The same can be read from HDFS as well.  For simplicity of the example I am reading it from local file system. val orders = sc.textFile("file:///home/cloudera/workspace/scala/rajudata/data-master/retail_db/orders") val order_items = sc.textFile("file:///home/cloudera/workspace/scala/rajudata/data-master/retail_db/order_items") val products = sc.textFile("file:///home/cloudera/workspace/scala/rajudata/data-master/retail_db/products") All the data files can be found at GitHub. https://github.com/dgadiraju/data/tree/master/retail_db 2. Check the content of each RDD. This will help to understand the data structure and position of each element. orders.take(3).f...

How to use DataFrame registerTempTable in spark SQL.

How to use DataFrame registerTempTable in spark SQL. Problem statement: Get daily revenue by product with CLOSED and COMPLETED orders. Save the output to a text, CSV, TSV and parguet file. 1. Load all the json file into spark DataSet using sqlContext. All the data files can be found at GitHub. https://github.com/dgadiraju/data/tree/master/retail_db_json scala> val products = sqlContext.read.json("file:///home/cloudera/workspace/scala/rajudata/data-master/retail_db_json/products/*") scala> val orders = sqlContext.read.json("file:///home/cloudera/workspace/scala/rajudata/data-master/retail_db_json/orders/*") scala> val order_items = sqlContext.read.json("file:///home/cloudera/workspace/scala/rajudata/data-master/retail_db_json/order_items/*") scala> products.limit(2).show +-------------------+-------------------+----------+--------------------+--------------------+-------------+ |product_category_id|product_d...

How to use (inner) JOIN and group by in apache spark SQL.

How to use (inner) JOIN and group by in spark SQL. Problem statement: Get daily revenue by product with CLOSED and COMPLETED orders. Save the output to a text, CSV, TSV and parguet file. 1. Load all the json file into spark DataSet using sqlContext. All the data files can be found at GitHub. https://github.com/dgadiraju/data/tree/master/retail_db_json scala> val products = sqlContext.read.json("file:///home/cloudera/workspace/scala/rajudata/data-master/retail_db_json/products/*") scala> val orders = sqlContext.read.json("file:///home/cloudera/workspace/scala/rajudata/data-master/retail_db_json/orders/*") scala> val order_items = sqlContext.read.json("file:///home/cloudera/workspace/scala/rajudata/data-master/retail_db_json/order_items/*") scala> products.limit(2).show +-------------------+-------------------+----------+--------------------+--------------------+-------------+ |product_category_id|product_desc...