Hive in practice

Why choose Hive?

Hive is a Hadoop-based data warehouse tool open sourced by Facebook and maintained by Apache. It is also one of the most widely used big data solutions currently.

You can think of it as a translator: you only need to write familiar SQL, and it can "translate" these statements into MapReduce, Spark or Tez tasks and run them on a distributed cluster. Even if you don’t know a word of Java or Scala, you can still process petabytes of data.

The essential difference between Hive and traditional databases

Many novices will regard Hive as a replacement for MySQL. In fact, they are completely two sets of things:

DimensionsHiveMySQL/Oracle
PositioningData warehouse (OLAP)relational-database (OLTP)
Data volumePB levelGB/TB level
LatencyHigh (minutes/hours)Low (milliseconds/seconds)
UPDATERow-level update/deletion is not supportedYes
IndexWeak (generally not recommended)Strong (core performance dependency)

A simple note: **Hive is for analyzing historical data, not for processing online business. **

Core Features

  1. SQL-like syntax: HiveQL, zero threshold to get started
  2. Scalability: Easily support petabytes of data
  3. Flexible storage: Text, ORC, Parquet, Avro and other formats are accepted
  4. Multi-engine support: The bottom layer can be switched to MapReduce, Tez or Spark

Get started quickly in ten minutes (quick overview of core syntax)

This article assumes that you already have a working Hive environment, and the installation steps are not discussed.

1. Library operations

These statements are basically the same as traditional databases. Note thatCASCADEAll tables under the library will be forcibly deleted:

-- 创建库
CREATE DATABASE IF NOT EXISTS eshop;

-- 切换库
USE eshop;

-- 删除库(慎用 CASCADE,会强制删除库下所有表)
DROP DATABASE IF EXISTS eshop CASCADE;

2. Table creation: external table vs partitioned table

These are the two most important table types in Hive and must be understood.

External table: Data files are usually stored in a directory in HDFS. Deleting the table only deletes the metadata, leaving the data files intact. Suitable for multiple applications to share the same data.

CREATE EXTERNAL TABLE IF NOT EXISTS dim_user_info 
(
  user_id           STRING,
  user_name         STRING, 
  sex               STRING,
  age               INT,
  city              STRING,
  first_active_time STRING,
  extra2            MAP<STRING, STRING>
)
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY '\t'
MAP KEYS TERMINATED BY ':'
STORED AS TEXTFILE;

Partitioned table: by a certain field (most often datedt) stores the data in the molecular directory. If partition conditions are included in the query, Hive will only scan relevant folders, and the performance can be improved by several orders of magnitude.

CREATE TABLE IF NOT EXISTS fact_user_trade 
(
  user_name      STRING,
  piece          INT,
  pay_amount     DOUBLE,
  goods_category STRING
)  
PARTITIONED BY (dt STRING)   -- 分区字段
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

It is not recommended to extract the partition field from business data. It is essentially just the directory name. Generally use date string, such as'2019-04-15'

3. Data loading and extraction

-- 加载本地数据
LOAD DATA LOCAL INPATH '/home/hadoop/data/user.txt' OVERWRITE INTO TABLE dim_user_info;

-- 开启动态分区(必备配置)
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- 查询:北京女用户前10名
SELECT user_name 
FROM dim_user_info
WHERE city='beijing' AND sex='female'
LIMIT 10;

-- 统计:2019年4月支付金额Top5用户
SELECT user_name, SUM(pay_amount) AS total
FROM fact_user_trade
WHERE dt BETWEEN '2019-04-01' AND '2019-04-30'
GROUP BY user_name
ORDER BY total DESC
LIMIT 5;

Is it almost the same as traditional SQL? That’s the beauty of Hive.


High-frequency weapon: complex types and LATERAL VIEW

Hive supportARRAYMAPSTRUCTThese complex data types, combined withLATERAL VIEWIt can realize the operation of "converting one row to multiple rows", which is very suitable for flattening aggregated data.

Scenario: Count the categories purchased by each user

We can use it firstCOLLECT_SETAggregate the categories of each user into an array, and then useEXPLODEExplode:

-- 第一步:聚合得到每个用户的品类数组
CREATE OR REPLACE VIEW v_user_categories AS
SELECT user_name, COLLECT_SET(goods_category) AS categories
FROM fact_user_trade
GROUP BY user_name;

-- 第二步:LATERAL VIEW + EXPLODE 横向展开数组
SELECT user_name, category
FROM v_user_categories
LATERAL VIEW EXPLODE(categories) t AS category;

in the case ofMAP<STRING, INT>This kind of key-value pair can also expand Key and Value at the same time:

SELECT user_name, category, cnt
FROM v_user_category_cnt_map
LATERAL VIEW EXPLODE(category_cnt_map) t AS category, cnt;

LATERAL VIEWIt can be understood as "temporarily splitting a row into multiple rows during query", which is very practical.


Performance Optimization: Five Core Skills

Hive optimization involves all aspects, but mastering the following five points is enough to cover 80% of actual scenarios.

1. Column storage: ORC or Parquet is preferred

defaultTextFileAlthough intuitive, it has low reading efficiency, poor compression, and takes up a lot of space. Strongly recommended for production environmentsORCorParquet

CREATE TABLE ... STORED AS ORC;

-- 或者对已有表进行转换
INSERT OVERWRITE TABLE orc_table SELECT * FROM text_table;

2. Partition pruning and column pruning

  • Column clipping: onlySELECTDon’t be lazy to write the required columns.SELECT *
  • Partition Cropping:WHEREThe partition field must be included in the condition (such asdt

These two are the most easily overlooked but the most profitable optimization methods.

3. MapJoin: A powerful tool for joining small tables to large tables

When a table is small enough to fit completely into memory, Hive can broadcast it to all Map nodes, eliminating the costly Shuffle process.

SELECT /*+ MAPJOIN(dim) */ * 
FROM fact_user_trade fact
JOIN dim_user_info dim ON fact.user_id = dim.user_id;

The new version of Hive usually automatically recognizes small tables and enables MapJoin. Generally, there is no need to manually add Hint, but it is always good to understand the principle.

4. Controlling parallelism

-- 设置 Reduce 任务个数(一般让系统自动调整,但可以设上限)
SET mapreduce.job.reduces=10;

It is not recommended to write it to death as it can easily lead to resource waste or uneven load.

5. Enable dynamic partitioning (no less important than the first four items)

Automatically create partitions based on data content, which is almost certainly used in the ETL process:

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

Quick check of commonly used functions (only the most useful ones are listed)

String function

FunctionEffectExample
substr(str, start, len)Intercept substringsubstr('2025-09-15', 1, 7) = '2025-09'
concat_ws(sep, str1, ...)Splicing with delimitersconcat_ws('-', 'a', 'b') = 'a-b'
split(str, regex)Split into arrayssplit('a,b,c', ',')

Date function

FunctionEffectExample
from_unixtime(ts, fmt)Convert timestamp to datefrom_unixtime(1694784000, 'yyyy-MM-dd')
datediff(dt1, dt2)Date differencedatediff('2025-09-15', '2025-09-10') = 5
date_add(dt, n)Date plus n daysdate_add('2025-09-15', 10)

Aggregation and window functions

FunctionEffect
collect_set(x)Remove duplicates and aggregate into arrays
row_number() over(...)Ranking within groups (no ties)
rank() over(...)Ranking within groups (tied jumps)
lag(col, n) over(...)Get the first n row values ​​

A complete list of functions can be found at Hive 官方文档.


Summarize

Hive is a classic introductory tool for big data analysis. Its core value lies in using SQL to shield the complexity of distributed computing.

This article takes you through the core process quickly:

  1. Build database and tables (focus on external tables and partition tables)
  2. Data loading and query
  3. Complex types and LATERAL VIEW
  4. Five performance optimization tips
  5. Quick check of high-frequency functions

I hope this article can help you get started with Hive quickly! If you find it useful, please feel free to collect and forward it.