Posts

Sqoop file formats

Problem 4:
- Import the orders table from MySQL as a text file to the destination /user/cloudera/problem5/text. Fields should be terminated by a tab character ("\t") and lines should be terminated by a newline character ("\n").
- Import the orders table from MySQL into HDFS at the destination /user/cloudera/problem5/avro. The file should be stored as an Avro file.
- Import the orders table from MySQL into HDFS at the folder /user/cloudera/problem5/parquet. The file should be stored as a Parquet file.
- Transform/convert the data files at /user/cloudera/problem5/avro and store the converted files at the following locations and in the following file formats:
  - save the data to HDFS using Snappy compression as a Parquet file at /user/cloudera/problem5/parquet-snappy-compress
  - save the data to HDFS using gzip compression as a text file at /user/cloudera/problem5/text-gzip-compress
  - save the data to HDFS using no compression as a sequence file at /user/cloudera/problem5/sequence
  - save the data to HDFS using Snappy compres...
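A minimal PySpark sketch of the conversion step only, assuming a pyspark session where the spark-avro package is available on the classpath; the sqoop imports themselves are run from the sqoop command line, and the paths come from the problem statement. Only the first three conversions are shown.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("problem5-convert").getOrCreate()

    # Read the Avro files written by the sqoop import
    orders = spark.read.format("avro").load("/user/cloudera/problem5/avro")

    # Parquet file with Snappy compression
    orders.write.option("compression", "snappy") \
        .parquet("/user/cloudera/problem5/parquet-snappy-compress")

    # Text (CSV-style) file with gzip compression
    orders.write.option("compression", "gzip") \
        .csv("/user/cloudera/problem5/text-gzip-compress")

    # Sequence file with no compression: sequence files store key/value pairs,
    # so each row is mapped to a (key, value) pair of strings first
    orders.rdd \
        .map(lambda row: (str(row[0]), ",".join(str(c) for c in row))) \
        .saveAsSequenceFile("/user/cloudera/problem5/sequence")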

Apache Spark

******************************************Spark APIs***********************************************
In order to process data in Hadoop we need to convert the data into an RDD (Resilient Distributed Dataset), and then we can apply the necessary transformations and actions on the RDD. Transformations are used to process the data, while actions return results from the processed data. When a transformation is applied to an RDD, another RDD is created; once an RDD is created we cannot modify it, but we can create another RDD from the previous RDD.
Actions:
first() : used to get the first record of the RDD (e.g. orderItems.first())
take(n) : gets the first n records from the dataset
collect() : used to convert the RDD into a Python collection; this is used when we have to apply APIs that are not present in the Spark APIs but are available in the Python APIs.
parallelize() : used to convert a Python collection into an RDD. When we read data from a local file and we want to process...
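A short sketch of these APIs in pyspark, assuming an interactive pyspark shell where sc (the SparkContext) already exists; the input path used here is hypothetical.

    # assumes a pyspark shell where `sc` is already defined; the path is hypothetical
    order_items = sc.textFile("/user/cloudera/retail_db/order_items")

    print(order_items.first())   # first record of the RDD
    print(order_items.take(5))   # first 5 records, returned as a Python list

    # collect(): bring the whole RDD back to the driver as a Python list so
    # plain Python APIs can be applied (only safe for small datasets)
    rows = order_items.collect()
    print(len(rows))

    # parallelize(): turn a local Python collection back into an RDD
    prices = sc.parallelize([199.99, 250.0, 129.99])
    print(prices.count())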

Apache Sqoop

INTRODUCTION: In today's world tons of data are generated from different sources, for example social networking sites, e-commerce websites, sensors, etc. Many companies store these data and analyze them to get insights that help improve their business. To process such data sets we need an environment that can handle huge amounts of data. Hadoop is such an environment: a framework that provides distributed storage and helps compute the data across clusters of computers. Now that we know we can get faster computation of this huge data in the Hadoop environment, the next challenge is getting the data from the outside world into the Hadoop environment. This can be done by many tools such as Apache Flume, Apache Kafka, and Apache Sqoop. In this blog I will go through some of the concepts and operations that we use in Sqoop. ...