Apache Sqoop
INTRODUCTION:
In today’s world, tons of data are generated from different sources, for example social networking sites, e-commerce websites, sensors, etc. Many companies store this data and analyze it to get insights that improve their business.
To process these data sets we need an environment that can handle huge amounts of data. Hadoop is such an environment: a framework that provides distributed storage and distributed computation across a cluster of computers.
Now that we know Hadoop gives us faster computation over huge data sets, the next challenge is getting the data from the outside world into the Hadoop environment. This can be done with tools such as Apache Flume, Apache Kafka, and Apache Sqoop. In this blog I will go through some of the concepts and operations that we use in Sqoop.
SQOOP:
Apache Sqoop is a part of the Hadoop ecosystem which is used to transfer data from an RDBMS to Hadoop and from Hadoop back to the RDBMS. The RDBMS can be MySQL, Oracle, Teradata, etc. Sqoop can also import data into Hadoop ecosystem tools like Hive and HBase. Now let's dive into some of the operations that Sqoop can perform.
- list-databases : Below is a simple Sqoop command which connects to the RDBMS and lists its databases.
sqoop list-databases \
--connect jdbc:mysql://<Example.com:PORT_NUMBER>/<database_name> \
--username <username> \
--password <password>
The above command is equivalent to the query we run in MySQL: show databases;
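A quick aside on credentials: passing --password on the command line leaves the password in your shell history. Sqoop also supports -P, which prompts for the password on the console, and --password-file, which reads it from a file. A minimal sketch of the prompt variant:
sqoop list-databases \
--connect jdbc:mysql://<Example.com:PORT_NUMBER>/<database_name> \
--username <username> \
-P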
- list-tables : Below is the command used to list the tables in the selected database.
sqoop list-tables \
--connect jdbc:mysql://<Example.com:PORT_NUMBER>/<database_name> \
--username <username> \
--password <password>
The above command is equivalent to the query we run in MySQL: show tables;
- eval : The eval tool is used to run queries against the database; the results are printed on the console. Below are some eval commands that can be used in Sqoop.
sqoop eval \
--connect jdbc:mysql://<Example.com:PORT_NUMBER>/<database_name> \
--username <username> \
--password <password> \
--query "SELECT * FROM orders LIMIT 10"
sqoop eval \
--connect jdbc:mysql://<Example.com:PORT_NUMBER>/<database_name> \
--username <username> \
--password <password> \
--query "create table hello123 (i int)"
sqoop eval \
--connect jdbc:mysql://<Example.com:PORT_NUMBER>/<database_name> \
--username <username> \
--password <password> \
--query "insert into hello123 values (33)"
sqoop eval \
--connect jdbc:mysql://<Example.com:PORT_NUMBER>/<database_name> \
--username <username> \
--password <password> \
--query "select * from hello123"
'--query' is the clause in which we write the query that we want to run against the database.
- sqoop import : This tool is used to import data from the RDBMS into HDFS. Each row of an RDBMS table is stored as a single record in HDFS. The data can be stored in text file format or in a binary representation such as Avro or sequence files.
sqoop import \
--connect jdbc:mysql://<Example.com:PORT_NUMBER>/<database_name> \
--username <username> \
--password <password> \
--table <table-name> \
--warehouse-dir <path>
'--table' : in this clause we mention the source table in the RDBMS from which the data has to be imported into HDFS.
There are two ways to store the data in HDFS:
1. --warehouse-dir : a new directory, named after the imported table, is created under the specified path.
2. --target-dir : no new directory is created; the files are stored directly at the specified path (see the sketch after this list).
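All the examples in this post use --warehouse-dir, so here is a minimal sketch of the --target-dir variant; the connection details and path are placeholders just like in the commands above:
sqoop import \
--connect jdbc:mysql://<Example.com:PORT_NUMBER>/<database_name> \
--username <username> \
--password <password> \
--table <table-name> \
--target-dir <path>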
'--num-mappers' : this argument sets the number of mappers used to run the import. The mappers run in parallel; the data to be imported is divided among them, so each mapper imports roughly an equal share. For example, if we are importing 4000 MB of data and we set --num-mappers to 4, then each mapper gets about 1000 MB and there will be 4 files stored in the HDFS directory.
sqoop import \
--connect jdbc:mysql://<Example.com:PORT_NUMBER>/<database_name> \
--username <username> \
--password <password> \
--table <table-name> \
--warehouse-dir <path> \
--num-mappers 4
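One caveat worth knowing: to split the rows among the mappers, Sqoop needs a splitting column, and by default it uses the table's primary key. If the table has no primary key, you can name a numeric column yourself with --split-by (order_id below is a hypothetical column) or fall back to a single mapper with --num-mappers 1. A sketch of the --split-by variant:
sqoop import \
--connect jdbc:mysql://<Example.com:PORT_NUMBER>/<database_name> \
--username <username> \
--password <password> \
--table <table-name> \
--warehouse-dir <path> \
--split-by order_id \
--num-mappers 4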
'--append' : This option appends the newly imported files to the existing target directory. By default, a sqoop import expects the output directory not to exist yet, so we have to make sure the given path is empty.
sqoop import \
--connect jdbc:mysql://<Example.com:PORT_NUMBER>/<database_name> \
--username <username> \
--password <password> \
--table <table-name> \
--target-dir <path> \
--num-mappers 4 \
--append
'--delete-target-dir' : we can use this option to clear the path before we run the import. If the directory to which the data will be imported already exists, the import will fail; this option deletes it first.
sqoop import \
--connect jdbc:mysql://<Example.com:PORT_NUMBER>/<database_name> \
--username <username> \
--password <password> \
--table <table-name> \
--target-dir <path> \
--num-mappers 4 \
--delete-target-dir
Note : we cannot use '--append' and '--delete-target-dir' together; we use one or the other based on the requirement.
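After any of these imports, a quick way to verify the result is to list the part files the mappers wrote; the directory to inspect is <path>/<table-name> when using --warehouse-dir, or <path> itself when using --target-dir:
hdfs dfs -ls <path>
hdfs dfs -cat <path>/part-m-00000
Each mapper writes one part-m-NNNNN file, which is why a 4-mapper import leaves 4 files in the directory.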