Hive in practice
Why choose Hive?
Hive is a Hadoop-based data warehouse tool open sourced by Facebook and maintained by Apache. It is also one of the most widely used big data solutions currently.
You can think of it as a translator: you only need to write familiar SQL, and it can "translate" these statements into MapReduce, Spark or Tez tasks and run them on a distributed cluster. Even if you don’t know a word of Java or Scala, you can still process petabytes of data.
The essential difference between Hive and traditional databases
Many novices will regard Hive as a replacement for MySQL. In fact, they are completely two sets of things:
A simple note: **Hive is for analyzing historical data, not for processing online business. **
Core Features
- SQL-like syntax: HiveQL, zero threshold to get started
- Scalability: Easily support petabytes of data
- Flexible storage: Text, ORC, Parquet, Avro and other formats are accepted
- Multi-engine support: The bottom layer can be switched to MapReduce, Tez or Spark
Get started quickly in ten minutes (quick overview of core syntax)
This article assumes that you already have a working Hive environment, and the installation steps are not discussed.
1. Library operations
These statements are basically the same as traditional databases. Note thatCASCADEAll tables under the library will be forcibly deleted:
2. Table creation: external table vs partitioned table
These are the two most important table types in Hive and must be understood.
External table: Data files are usually stored in a directory in HDFS. Deleting the table only deletes the metadata, leaving the data files intact. Suitable for multiple applications to share the same data.
Partitioned table: by a certain field (most often datedt) stores the data in the molecular directory. If partition conditions are included in the query, Hive will only scan relevant folders, and the performance can be improved by several orders of magnitude.
It is not recommended to extract the partition field from business data. It is essentially just the directory name. Generally use date string, such as
'2019-04-15'。
3. Data loading and extraction
Is it almost the same as traditional SQL? That’s the beauty of Hive.
High-frequency weapon: complex types and LATERAL VIEW
Hive supportARRAY、MAP、STRUCTThese complex data types, combined withLATERAL VIEWIt can realize the operation of "converting one row to multiple rows", which is very suitable for flattening aggregated data.
Scenario: Count the categories purchased by each user
We can use it firstCOLLECT_SETAggregate the categories of each user into an array, and then useEXPLODEExplode:
in the case ofMAP<STRING, INT>This kind of key-value pair can also expand Key and Value at the same time:
LATERAL VIEWIt can be understood as "temporarily splitting a row into multiple rows during query", which is very practical.
Performance Optimization: Five Core Skills
Hive optimization involves all aspects, but mastering the following five points is enough to cover 80% of actual scenarios.
1. Column storage: ORC or Parquet is preferred
defaultTextFileAlthough intuitive, it has low reading efficiency, poor compression, and takes up a lot of space. Strongly recommended for production environmentsORCorParquet。
2. Partition pruning and column pruning
- Column clipping: only
SELECTDon’t be lazy to write the required columns.SELECT * - Partition Cropping:
WHEREThe partition field must be included in the condition (such asdt)
These two are the most easily overlooked but the most profitable optimization methods.
3. MapJoin: A powerful tool for joining small tables to large tables
When a table is small enough to fit completely into memory, Hive can broadcast it to all Map nodes, eliminating the costly Shuffle process.
The new version of Hive usually automatically recognizes small tables and enables MapJoin. Generally, there is no need to manually add Hint, but it is always good to understand the principle.
4. Controlling parallelism
It is not recommended to write it to death as it can easily lead to resource waste or uneven load.
5. Enable dynamic partitioning (no less important than the first four items)
Automatically create partitions based on data content, which is almost certainly used in the ETL process:
Quick check of commonly used functions (only the most useful ones are listed)
String function
Date function
Aggregation and window functions
A complete list of functions can be found at Hive 官方文档.
Summarize
Hive is a classic introductory tool for big data analysis. Its core value lies in using SQL to shield the complexity of distributed computing.
This article takes you through the core process quickly:
- Build database and tables (focus on external tables and partition tables)
- Data loading and query
- Complex types and LATERAL VIEW
- Five performance optimization tips
- Quick check of high-frequency functions
I hope this article can help you get started with Hive quickly! If you find it useful, please feel free to collect and forward it.

