Sqoop file formats
Problem 4:
Import the orders table from MySQL as a text file to the destination /user/cloudera/problem5/text. Fields should be terminated by a tab character ("\t") and lines should be terminated by a newline character ("\n").
Import orders table from mysql into hdfs to the destination /user/cloudera/problem5/avro. File should be stored as avro file.
Import orders table from mysql into hdfs to folders /user/cloudera/problem5/parquet. File should be stored as parquet file.
Transform/Convert data-files at /user/cloudera/problem5/avro and store the converted file at the following locations and file formats
save the data to hdfs using snappy compression as parquet file at /user/cloudera/problem5/parquet-snappy-compress
save the data to hdfs using gzip compression as text file at /user/cloudera/problem5/text-gzip-compress
save the data to hdfs using no compression as sequence file at /user/cloudera/problem5/sequence
save the data to hdfs using snappy compression as text file at /user/cloudera/problem5/text-snappy-compress
Transform/Convert data-files at /user/cloudera/problem5/parquet-snappy-compress and store the converted file at the following locations and file formats
save the data to hdfs using no compression as parquet file at /user/cloudera/problem5/parquet-no-compress
save the data to hdfs using snappy compression as avro file at /user/cloudera/problem5/avro-snappy
Transform/Convert data-files at /user/cloudera/problem5/avro-snappy and store the converted file at the following locations and file formats
save the data to hdfs using no compression as json file at /user/cloudera/problem5/json-no-compress
save the data to hdfs using gzip compression as json file at /user/cloudera/problem5/json-gzip
Transform/Convert data-files at /user/cloudera/problem5/json-gzip and store the converted file at the following locations and file formats
save the data as comma separated text using gzip compression at /user/cloudera/problem5/csv-gzip
Using Spark, access the data at /user/cloudera/problem5/sequence and store it back to HDFS with no compression as an ORC file at the destination /user/cloudera/problem5/orc
sqoop import \
--connect jdbc:mysql://ms.itversity.com:3306/retail_db \
--username retail_user \
--password itversity \
--table orders \
--warehouse-dir /user/pramod882/problem5/text \
--as-textfile \
--fields-terminated-by '\t' \
--lines-terminated-by '\n'
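A quick sanity check of the tab-delimited import from pyspark (a minimal sketch; the orders sub-directory under the warehouse dir is assumed):
# read back the text import and split on the tab delimiter
orders_text = sc.textFile("/user/pramod882/problem5/text/orders")
print(orders_text.first())
print(orders_text.map(lambda line: line.split("\t")).first())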
sqoop import \
--connect jdbc:mysql://ms.itversity.com:3306/retail_db \
--username retail_user \
--password itversity \
--table orders \
--warehouse-dir /user/pramod882/problem5/avro \
--as-avrodatafile
sqoop import \
--connect jdbc:mysql://ms.itversity.com:3306/retail_db \
--username retail_user \
--password itversity \
--table orders \
--warehouse-dir /user/pramod882/problem5/parquet \
--as-parquetfile
data = sqlContext.read.format("com.databricks.spark.avro").load("/user/pramod882/problem5/avro/orders")
sqlContext.setConf("spark.sql.parquet.compression.codec","snappy")
data.write.parquet("/user/pramod882/problem5/parquet-snappy-compress") // store parquet with compression
data_map = data.rdd
data_map.saveAsTextFile("/user/pramod882/problem5/text-gzip-compress1",compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")// save textfile with gzipcodec
//import while saving the data in squence you need to have key value pairs
data_mapped = data.map(lambda i : (str(i[0]),(str(i[0])+"\t"+str(i[1])+"\t"+str(i[2])+"\t"+str(i[3]))))
data_mapped.saveAsSequenceFile("/user/pramod882/problem5/sequence") // save the file in sequence
//save the file in textfile with snappy compression
data_mapped.saveAsTextFile("/user/pramod882/problem5/text-snappy-compress",compressionCodecClass="org.apache.hadoop.io.compress.SnappyCodec")
Or, do the snappy-compressed text import directly with Sqoop:
sqoop import \
--connect jdbc:mysql://ms.itversity.com:3306/retail_db \
--username retail_user \
--password itversity \
--table orders \
--warehouse-dir /user/pramod882/problem5/text-snappy-compress \
--as-textfile \
--compress \
--compression-codec org.apache.hadoop.io.compress.SnappyCodec
data = sqlContext.read.parquet("/user/pramod882/problem5/parquet-snappy-compress")
sqlContext.setConf("spark.sql.parquet.compression.codec","uncompressed");
data.write.parquet("/user/pramod882/problem5/parquet-no-compress")
sqlContext.setConf("spark.sql.avro.compression.codec","Snappy")
data.write.format("com.databricks.spark.avro").save("/user/pramod882/problem5/avro-snappy")
data = sqlContext.read.format("com.databricks.spark.avro").load("/user/pramod882/problem5/avro-snappy")
data.toJSON().saveAsTextFile("/user/pramod882/problem5/json-no-compress")
data.toJSON().saveAsTextFile("/user/pramod882/problem5/json-gzipcompress",compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")
data = sqlContext.read.json("/user/pramod882/problem5/json-gzipcompress")
data.map(lambda i: (str(i[0])+","+str(i[1])+","+str(i[2])+","+str(i[3]))).saveAsTextFile("/user/pramod882/problem5/csv-gzip",compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")  # gzip compression, per the problem statement
Pull a part file to the local file system to inspect the sequence output:
hadoop fs -get /user/pramod882/problem5/sequence
cut -c-300 sequence/part-00000
data = sc.sequenceFile("/user/pramod882/problem5/sequence")
data_map = data.map(lambda i: (i[1].split("\t")[0],i[1].split("\t")[1],i[1].split("\t")[2],i[1].split("\t")[3]))
data_map.toDF().write.orc("/user/pramod882/problem5/orc")
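A read-back check of the ORC output (a minimal sketch; assumes sqlContext is a HiveContext, as in the default pyspark shell):
# verify the ORC files written above
orc_check = sqlContext.read.orc("/user/pramod882/problem5/orc")
orc_check.show(5)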
My Notepad++ notes:
#Notes :
# 1. Reading avro file format:
# Before reading avro files, launch pyspark with the spark-avro package using the below command:
pyspark --master yarn --conf spark.ui.port=12789 --num-executors 10 --executor-cores 2 --executor-memory 3g --packages com.databricks:spark-avro_2.10:2.0.1
data = sqlContext.read.format("com.databricks.spark.avro").load("/user/pramodpn/problem2/avro")
# 2. saving file and reading the parquet file:
data.saveAsParquetFile("/user/pramodpn/problem2/parquet-snappy")
data = sqlContext.read.parquet("/user/pramodpn/problem3/customer/parquet")
data.write.parquet("/user/pramodpn/problem2/parquet-nocompress")
#the below will read the parquet files into a dataframe
data = sqlContext.load("/user/pramodpn/problem2/customer/parquet","parquet")
# 3. saving file in text format and reading text file:
orders = sc.textFile("/user/pramodpn/practice4/question3/orders/")
joined_filter.saveAsTextFile("/user/pramodpn/p1/q7/output")
# save the textfile and also compress the file
data_map.saveAsTextFile("/user/pramodpn/problem2/customer_text1_gzip",compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")
#for getting the compressionCodecClass value follow the below steps:
#step 1. cd /etc/hadoop/conf
#step 2. vi core-site.xml
#step 3. search with /codec to find the compression codecs supported on the cluster
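The configured codec list can also be read from inside pyspark (a sketch; the property may be unset on some clusters, in which case this returns None):
# read the configured compression codecs from the Hadoop configuration
codecs = sc._jsc.hadoopConfiguration().get("io.compression.codecs")
print(codecs)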
#save the dataframe (note: DataFrame.save() without a format writes the default source, parquet, not text):
data.save("/user/pramodpn/pratice4/output")
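To get plain text out of a DataFrame in Spark 1.x, go through the RDD; a minimal sketch (the output path is hypothetical and the column layout is assumed):
# write a DataFrame as comma-separated text via its RDD
data.rdd.map(lambda row: ",".join([str(c) for c in row])).saveAsTextFile("/user/pramodpn/pratice4/output_text")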
#4. Converting an RDD to a dataframe. Make sure the RDD is mapped properly before converting; here the rows are flattened to pipe-delimited strings (see the tuple-based sketch below for what toDF() needs):
data_map = data_group.rdd.map(lambda i :(str(i[0])+'|'+str(i[1])+'|'+str(i[2])))
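For toDF() to produce separate columns, the RDD has to hold tuples (or Rows) rather than one concatenated string; a minimal sketch, with hypothetical column names:
# map to tuples first, then convert to a dataframe with named columns
data_df = data_group.rdd.map(lambda i: (str(i[0]), str(i[1]), str(i[2]))).toDF(schema=["col1", "col2", "col3"])
data_df.show(5)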
#5. save the file in CSV file format:
# save the DataFrame in csv file format with the built-in writer (Spark 2.x):
groupedData.write.csv("/user/cloudera/problem2/customer_csv_new")
#save the DataFrame in csv file format using the spark-csv package:
data_map.write.format("com.databricks.spark.csv").save("/user/pramodpn/problem2/customer_csv_new")
#6. save a dataframe in JSON file format:
data_filter.save("/user/pramodpn/problem3/orders_pending","json")
#reading the json file format
sqlContext.read.json("/public/retail_db_json/order_items")
data = sqlContext.jsonFile("/user/pramod882/data1/data1")
test.toJSON().saveAsTextFile("employeeJson1")
test.write.json("/user/pramod882/result")
#7. save in ORC file format:
df.write.orc("/user/pramodpn/problem4_ques7/output")
#8. convert the RDD to DF:
item_df=Items.map(lambda i:(int(i.split(",")[0]),float(i.split(",")[4]))).toDF(schema=["item_product_id","subtotal"])
#convert df to rdd:
result_rdd=result.rdd
#Convert the above dataframe to a temp table (it can then be queried with SQL, see below):
item_df.registerTempTable("item")
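Once registered, the temp table can be queried through sqlContext, for example:
# query the registered temp table (columns come from item_df above)
item_totals = sqlContext.sql("select item_product_id, sum(subtotal) as total from item group by item_product_id")
item_totals.show(5)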
# save the file as a sequence file:
data_map.saveAsSequenceFile("/user/pramodpn/problem4_ques6/output")
# important: when saving data as a sequence file you need key/value pairs
data_mapped = data.map(lambda i : (str(i[0]),(str(i[0])+"\t"+str(i[1])+"\t"+str(i[2])+"\t"+str(i[3]))))
data_mapped.saveAsSequenceFile("/user/pramod882/problem5/sequence")