Apache Sqoop
INTRODUCTION:
In today’s world, tons of data are generated from different sources, for example social networking sites, e-commerce websites, sensors, etc. Many companies store this data and analyze it to get insights that improve their business.
To process these data sets we need an environment that can handle huge amounts of data. Hadoop is such an environment: a framework that provides distributed storage and distributed computation across a cluster of computers.
Now that we know Hadoop gives us faster computation over huge data sets, the next challenge is getting the data from the outside world into the Hadoop environment. This can be done with tools such as Apache Flume, Apache Kafka, and Apache Sqoop. In this blog I will go through some of the concepts and operations that we use in Sqoop.
SQOOP:
Apache Sqoop is a part of the Hadoop ecosystem which is used to transfer data from an RDBMS to Hadoop and from Hadoop back to the RDBMS. The RDBMS can be MySQL, Oracle, Teradata, etc. Sqoop can also import data into Hadoop ecosystem tools like Hive and HBase. Now let's dive into some of the operations that Sqoop can perform.
- list-databases : Below is a simple Sqoop command which connects to the RDBMS and lists its databases.
sqoop list-databases \
--connect jdbc:mysql://<Example.com:PORT_NUMBER>/<database_name> \
--username <username> \
--password <password>
The above command is equivalent to the query we run in MySQL: show databases;
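A quick aside on credentials: passing --password on the command line leaves the password in your shell history. Sqoop also supports -P, which prompts for the password on the console, and --password-file, which reads it from a file. A minimal sketch of the prompt variant:
sqoop list-databases \
--connect jdbc:mysql://<Example.com:PORT_NUMBER>/<database_name> \
--username <username> \
-P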
- list-tables : Below is the command used to list the tables in the selected database.
sqoop list-tables \
--connect jdbc:mysql://<Example.com:PORT_NUMBER>/<database_name> \
--username <username> \
--password <password>
The above command is equivalent to the query we run in MySQL: show tables;
- eval : The eval tool is used to run queries against the database; the results are printed on the console. Below are some eval commands that can be used in Sqoop.
sqoop eval \
--connect jdbc:mysql://<Example.com:PORT_NUMBER>/<database_name> \
--username <username> \
--password <password> \
--query "SELECT * FROM orders LIMIT 10"
sqoop eval \
--connect jdbc:mysql://<Example.com:PORT_NUMBER>/<database_name> \
--username <username> \
--password <password> \
--query "create table hello123 (i int)"
sqoop eval \
--connect jdbc:mysql://<Example.com:PORT_NUMBER>/<database_name> \
--username <username> \
--password <password> \
--query "insert into hello123 values (33)"
sqoop eval \
--connect jdbc:mysql://<Example.com:PORT_NUMBER>/<database_name> \
--username <username> \
--password <password> \
--query "select * from hello123"
'--query' is the clause in which we write the query that we want to run against the database.
- sqoop import : This tool is used to import data from the RDBMS into HDFS. Each row of an RDBMS table is stored as a single record in HDFS. The data can be stored in text file format or in a binary representation such as Avro or sequence files.
sqoop import \
--connect jdbc:mysql://<Example.com:PORT_NUMBER>/<database_name> \
--username <username> \
--password <password> \
--table <table-name> \
--warehouse-dir <path>
'--table' : in this clause we mention the source table in the RDBMS from which the data has to be imported into HDFS.
There are two ways to store the data in HDFS:
1. --warehouse-dir : a new directory, named after the imported table, is created under the specified path.
2. --target-dir : no new directory is created; the files are stored directly at the specified path (see the sketch after this list).
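All the examples in this post use --warehouse-dir, so here is a minimal sketch of the --target-dir variant; the connection details and path are placeholders just like in the commands above:
sqoop import \
--connect jdbc:mysql://<Example.com:PORT_NUMBER>/<database_name> \
--username <username> \
--password <password> \
--table <table-name> \
--target-dir <path>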
'--num-mappers' : this argument sets the number of mappers used to run the import. The mappers run in parallel; the data to be imported is divided among them, so each mapper imports roughly an equal share. For example, if we are importing 4000 MB of data and we set --num-mappers to 4, then each mapper gets about 1000 MB and there will be 4 files stored in the HDFS directory.
sqoop import \
--connect jdbc:mysql://<Example.com:PORT_NUMBER>/<database_name> \
--username <username> \
--password <password> \
--table <table-name> \
--warehouse-dir <path> \
--num-mappers 4
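One caveat worth knowing: to split the rows among the mappers, Sqoop needs a splitting column, and by default it uses the table's primary key. If the table has no primary key, you can name a numeric column yourself with --split-by (order_id below is a hypothetical column) or fall back to a single mapper with --num-mappers 1. A sketch of the --split-by variant:
sqoop import \
--connect jdbc:mysql://<Example.com:PORT_NUMBER>/<database_name> \
--username <username> \
--password <password> \
--table <table-name> \
--warehouse-dir <path> \
--split-by order_id \
--num-mappers 4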
'--append' : This option appends the newly imported files to the existing target directory. By default, a sqoop import expects the output directory not to exist yet, so we have to make sure the given path is empty.
sqoop import \
--connect jdbc:mysql://<Example.com:PORT_NUMBER>/<database_name> \
--username <username> \
--password <password> \
--table <table-name> \
--target-dir <path> \
--num-mappers 4 \
--append
'--delete-target-dir' : we can use this option to clear the path before we run the import. If the directory to which the data will be imported already exists, the import will fail; this option deletes it first.
sqoop import \
--connect jdbc:mysql://<Example.com:PORT_NUMBER>/<database_name> \
--username <username> \
--password <password> \
--table <table-name> \
--target-dir <path> \
--num-mappers 4 \
--delete-target-dir
Note : we cannot use '--append' and '--delete-target-dir' together; we use one or the other based on the requirement.
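After any of these imports, a quick way to verify the result is to list the part files the mappers wrote; the directory to inspect is <path>/<table-name> when using --warehouse-dir, or <path> itself when using --target-dir:
hdfs dfs -ls <path>
hdfs dfs -cat <path>/part-m-00000
Each mapper writes one part-m-NNNNN file, which is why a 4-mapper import leaves 4 files in the directory.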