What is and Why Hive?
What is Hive?
- Provides data warehousing solution built on top of Hadoop
- ช่วยอำนวยความสะดวกในการค้นหาและจัดการชุดข้อมูลขนาดใหญ่ที่อยู่ใน Hadoop Storage (HDFS หรือ HBase)
- มาพร้อมกับภาษา SQL ทีเรียกว่า HiveQL
- SQL has huge developer base
- นำความสามารถของ Hadoop มาใข้ (querying, analyzing, and summarizing large amounts of data)
- Allows you to project structure to any data formats
- Can handle non-structured data
- Open source Apache project
- Started at Facebook
Hive is NOT designed for
- Hive is not for real-time queries
- Hive is still a batch operation: Hadoop jobs incur substantial overheads in job submission and scheduling – not suitable for real-time queries.
- ผลที่ตามมา – ความล่าช้าในการค้นหาของ Hive ซึ่งโดยทั่วไปมีสูงมากแม้ว่าข้อมูลนั้นจะมีขนาดเพียงหลักร้อยเมกะไบต์
- Hive ได้รับการออกแบบมาเพื่อความสามารถในการปรับขนาดและใช้งานได้ง่าย
- Hive is not for Online Transaction Processing (which requires real time and frequent write operations)
- Hive is more for Online Transaction Analysis (in Batch mode), which requires mostly read operations
- Hive does not support UPDATE/DELETE/INSERT a single row
Example Hive Applications
- Log processing
- Text mining
- Document indexing
- Customer-facing business intelligence
- Example: Google Analytics
- Predictive modeling
- Hypothesis testing
Hive | VS | RDMS |
---|---|---|
Similarities | Similarities | |
– HiveQL (SQL-like language) – Similar data model-database, table, view |
– SQL – Similar data model-database, table, view |
|
Differences | Differences | |
– Designed for OLAP – Batch oriented and high latency operations – Schema on Read |
– Design for OLTP – Real-time and low latency operations – Schema on Write |
Hive | VS | PIG |
---|---|---|
Similarities | Similarities | |
– Batch oriented – HiveQL gets translated into MapReduce jobs – Targeted for non-Java developers (SQL developers) |
– Batch oriented – PIG statement gets translated into MapReduce jobs – Targeted for non-Java developers |
|
Differences | Differences | |
– HiveQL (like SQL) is a declarative language with a single result set | – PIG Latin is procedural data flow language with input and out for each step |
Hive Architecture
Hive Internals
- Hive แปลคำสั่ง HiveQL เป็นชุดงานของ MapReduce ซึ่งจะดำเนินการนี้อยู่ใน Hadoop Cluster
- Abstracted notions of traditional RDBMS are provided over Hadoop data sets through “metadata” store
- Database – namespace containing a set of tables
- Tables
- Schemas
- Indexing
- View
- Hive specific data models are also supported
- Partitions
- Buckets
- Hive Interface options
- Command Line Interface (CLI)
- Web interface
- Thrift Server – Any JDBC an ODBC client apps can access Hive
- The “metadata” store is maintained internally via embedded Derby database (MySQL and other RDBMS are also supported)
- Table definitions
- Mapping to HDFS
HiveQL
- Types of HiveQL operations
- DDL operations
- DML operations
- SQL operations
- How to execute HiveQL
- Use Hive shell – just type “hive”
- Run Hive script file – type “hive –f <script-file>”
HiveQL – DDL operations
- Create a table
- Show tables
- Describe a table
- Drop a table
HiveQL – DML operations
- Loading flat files into Hive
- No verification of incoming data
- If the loaded data is not compliant with the schema specified forthe table, it will be set to NULL
HiveQL – SQL operations
HiveQL – Built-in Functions
- HiveQL comes with a bunch of built-in functions
- Hive -> SHOW FUNCTIONS ;
Physical Layout
- Hive warehouse directory in HDFS
- /user/hive/warehouse (default) – the default can be changed via hive.metastore.warehouse.dir
- Hive Table is represented as a directory under Hive warehouse directory
- /user/hive/warehouse/students directory – represents “students” Hive table
- /user/hive/warehouse/book directory – represents “book” Hive table
- Actual data is stored in flat files under the directory
Example: Physical Layout
- In the example below, there are 5 directories representing 5 tables
Loading Data into Hive Tables
Multiple Schemes of Loading Data
#1: Use Hive LOAD DATA command (Most common: our focus here)
- To load data from flat files
- Data files can be loaded from Local or HDFS (default)
- Data files can be loaded Internal (default) vs EXTERNAL
#2: Manually copying data files into the “/user/hive/warehouse/<tabledirectory>” Hive directory
- Hive automatically recognize them
#3: Use Hive INSERT command
- To load data from another HIVE table using SELECT (It is not for inserting a new record)
#4: Use Apache Sqoop to move RDMBS data to Hadoop
#5: Use Apache Flume to move large data sets to Hadoop
Loading data from Local vs HDFS
- Load data from local file system
- Load data from HDFS (default)
Loading data Internal vs EXTERNAL
- Load data Internally
- Data is moved to Hive storage (/user/hive/warehouse)
- การลบตาราง Hive ข้อมูลก็จะถูกลบด้วย
- Load data with EXTERNAL … LOCATION
- ใช้ข้อมูลที่อยู่ใน HDFS
- Hive just has pointers to the existing data in HDFS when a table gets created
- ลบตาราง Hive ข้อมูลใน HDFS จะไม่ถูกลบ
- Load data Internal
- Load data EXTERNAL (done at CREATE time, no LOAD required)
Load Data from Existing Table using INSERT … SELECT
- Use it when you want to load data from existing table or tables
Hive Partitions
What is and Why Partitions?
- ข้อมูลภายในตารางถูกแบ่งออกเป็นหลาย partitions
- เพื่อเพิ่มประสิทธิภาพการทำงานของการค้นหาข้อมูล
- Each partition corresponds to a particular value(s) of partition column(s) and is stored as a sub-directory within the table’s directory on HDFS
- When the table is queried, where applicable, only the required partitions of the table are queried, thereby reducing the I/O required by the query
- For example, you might want to partition weather data base on year – when query operation is done per year, only that partition need to be accessed
Partitioning Data Mechanism
- One or more partition columns may be specified
- Creates a sub-directory for each value of the partition column
- Queries with the partition columns in WHERE clause will scan through only a subset of the data
Creating Partitions
- Specify the partition column
Load data into Partitions
- Each partition is loaded with data specific for the partition column
Querying Partitions
- By specifying the where condition with the partition, only that specific partition will be queried
Hive Joins
JOIN
- Hive supports
- Inner Join (default)
- Outer Join – Left Outer, Right Outer, Full Outer
- การ JOIN สามารถทำได้ในสองรูปแบบ
- Using SELECT … INNER JOIN.., SELECT… LEFT OUTER JOIN, etc
- Create a joined table from
What is and Why Buckets?
- เป็นกลไกในการสืบค้นและตรวจสอบจากการสุ่มข้อมูลตัวอย่าง
- Breaks data into a set of buckets based on a hash function of a “bucket column” and allows execution of queries on a subset of random data
- Bucketing must be configured
- hive> hive.enforce.bucketing = true;