在本文中,我们将介绍如何使用Impala创建各种类型的表。Impala是一个用于大规模数据分析的分布式SQL查询引擎,它支持多种数据类型、文件格式和表属性。通过掌握这些知识,您可以根据需要创建和管理Impala表,以满足您的数据分析需求。
从CDH 5.5 / Impala 2.3开始,Impala支持复杂类型(STRUCT、ARRAY或MAP)。由于您可以将这些类型嵌套(例如,创建一个map数组或具有数组字段的struct),因此这些类型有时也被称为嵌套类型。
Impala可以创建包含复杂类型列的表,支持任何支持的文件格式。由于目前Impala只能查询Parquet表中的复杂类型列,因此创建其他文件格式(如文本)的表意义有限。
语法:
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
(col_name data_type
[COMMENT 'col_comment']
[, ...]
)
[PARTITIONED BY (col_name data_type [COMMENT 'col_comment'], ...)]
[SORT BY ([column [, column ...]])]
[COMMENT 'table_comment']
[ROW FORMAT row_format]
[WITH SERDEPROPERTIES ('key1'='value1', 'key2'='value2', ...)]
[STORED AS file_format]
[LOCATION 'hdfs_path']
[CACHED IN 'pool_name' [WITH REPLICATION = integer] | UNCACHED]
[TBLPROPERTIES ('key1'='value1', 'key2'='value2', ...)]
此语句用于创建一个[外部]表,其中包含指定的列和数据类型。表可以根据需要进行分区、排序和存储。以下是各部分的解释:
CREATE [EXTERNAL] TABLE:创建一个[外部]表。外部表与内部表的区别在于,外部表的数据存储在HDFS或其他数据存储系统上,而不是在Impala内部。
[IF NOT EXISTS]:如果表已存在,则不执行任何操作。这可以避免因尝试创建已存在的表而导致的错误。
[db_name.]table_name:指定表的名称,其中db_name是数据库名称,table_name是表名称。
(col_name data_type):定义表中的列及其数据类型。可以指定多个列,用逗号分隔。
[COMMENT ‘col_comment’]:为列添加注释。
[, …]:表示列的续行。
PARTITIONED BY (col_name data_type [COMMENT ‘col_comment’], …):根据指定的列对表进行分区。分区字段可以是任意数据类型,并为分区添加注释。
[SORT BY ([column [, column …]])]:根据指定的列对表进行排序。可以指定多个列,用逗号分隔。
[COMMENT ‘table_comment’]:为表添加注释。
[ROW FORMAT row_format]:指定表的行格式。行格式影响数据在文件中的存储方式。
[WITH SERDEPROPERTIES (‘key1’=‘value1’, ‘key2’=‘value2’, …)]:为表指定SERDE(数据压缩和编码插件)的属性。
[STORED AS file_format]:指定表存储为哪种文件格式,如Parquet、ORC等。
[LOCATION ‘hdfs_path’]:指定表数据的存储位置。
[CACHED IN ‘pool_name’ [WITH REPLICATION = integer] | UNCACHED]:指定表是否在内存中缓存,以及缓存池的名称。如果指定REPLICATION,则表示缓存池中的副本数量。
[TBLPROPERTIES (‘key1’=‘value1’, ‘key2’=‘value2’, …)]:为表添加元数据属性。
示例:
创建排序表
CREATE TABLE census_data (
last_name STRING,
first_name STRING,
state STRING,
address STRING)
SORT BY (last_name, state)
STORED AS PARQUET;
CREATE TABLE sorted_census_data
SORT BY (last_name, state)
STORED AS PARQUET
AS SELECT last_name, first_name, state, address
FROM unsorted_census_data;
语法:
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
LIKE PARQUET 'hdfs_path_of_parquet_file'
[PARTITIONED BY (col_name data_type [COMMENT 'col_comment'], ...)]
[SORT BY ([column [, column ...]])]
[COMMENT 'table_comment']
[ROW FORMAT row_format]
[WITH SERDEPROPERTIES ('key1'='value1', 'key2'='value2', ...)]
[STORED AS file_format]
[LOCATION 'hdfs_path']
[CACHED IN 'pool_name' [WITH REPLICATION = integer] | UNCACHED]
[TBLPROPERTIES ('key1'='value1', 'key2'='value2', ...)]
语法:
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] db_name.]table_name
[PARTITIONED BY (col_name[, ...])]
[SORT BY ([column [, column ...]])]
[COMMENT 'table_comment']
+ [ROW FORMAT row_format]
[WITH SERDEPROPERTIES ('key1'='value1', 'key2'='value2', ...)]
+ [STORED AS ctas_file_format]
[LOCATION 'hdfs_path']
+ [CACHED IN 'pool_name' [WITH REPLICATION = integer] | UNCACHED]
[TBLPROPERTIES ('key1'='value1', 'key2'='value2', ...)]
AS
select_statement
注意:如果创建分区表,则分区键列必须在
SELECT
列列表中列出,且顺序与PARTITIONED BY
子句中的顺序相同。否则,您将收到关于列名称不匹配的错误。
示例:
CREATE TABLE clone_of_t1 AS SELECT * FROM t1;
CREATE TABLE parquet_version_of_t1 STORED AS PARQUET AS SELECT * FROM t1;
语法:
CREATE TABLE [IF NOT EXISTS] [db_name.]table_name
(
col_name1 data_type1 [kudu_column_attribute ...] [COMMENT 'col_comment'] [, ...]
col_name2 data_type2 [kudu_column_attribute ...] [COMMENT 'col_comment'] [, ...]
[PRIMARY KEY (col_name[, ...])]
...
)
[PARTITION BY kudu_partition_clause]
[COMMENT 'table_comment']
STORED AS KUDU
[TBLPROPERTIES ('key1'='value1', 'key2'='value2', ...)]
Kudu列属性(kudu_column_attribute):
PRIMARY KEY:指定表中的主键列。主键是唯一标识表中每行数据的列,它确保了数据的唯一性和完整性。
[NOT] NULL:用于定义列是否允许为空值。如果使用 “NULL” 关键字,则表示该列可以包含空值;如果使用 “NOT NULL” 关键字,则表示该列不允许为空值。
ENCODING codec:指定在存储或传输过程中对数据进行编码时所使用的编解码器(codec)。编码器将原始数据转换为特定格式,以便更高效地存储和处理。
COMPRESSION algorithm:指定在存储过程中对数据进行压缩时所采用的算法。压缩算法可减小存储空间占用,并提高读写性能。
DEFAULT constant:为某个字段设置默认值。当插入新记录时,如果没有显式提供该字段的值,则会自动填充默认值。
BLOCK_SIZE number:定义文件块大小(单位为字节),即将文件划分为固定大小块以便管理和处理。较大块大小可能有助于提高读取效率,但也会增加内部碎片化风险。
语法:
HASH [ (pk_col [, ...]) ]
PARTITIONS n
示例:
CREATE TABLE kudu_t1 (
id BIGINT PRIMARY key,
s STRING,
b BOOLEAN)
PARTITION BY HASH (id) PARTITIONS 20 STORED AS KUDU;
语法:
RANGE [ (pk_col [, ...]) ]
(
{
PARTITION constant_expression range_comparison_operator VALUES range_comparison_operator constant_expression
| PARTITION VALUE = constant_expression_or_tuple
}
[, ...]
)
range_comparison_operator ::= { < | <= }
示例:
CREATE TABLE kudu_t3 (
id BIGINT,
year INT,
s STRING,
b BOOLEAN,
PRIMARY KEY (id,year))
PARTITION BY HASH (id) PARTITIONS 20,
RANGE (year) (PARTITION 1980 <= VALUES < 1990,
PARTITION 1990 <= VALUES < 2000,
PARTITION VALUE = 2001,
PARTITION 2001 < VALUES)
STORED AS KUDU;
create table range_t4 (
x bigint,
s string,
s2 string,
primary key (x, s))
partition by range (x,s)(partition value = (0,'zero'),
partition value = (1,'one'),
partition value = (2,'two'))
stored as kudu;
语法:
CREATE EXTERNAL TABLE [IF NOT EXISTS] [db_name.]table_name
[COMMENT 'table_comment']
STORED AS KUDU
[TBLPROPERTIES ('kudu.table_name'='internal_kudu_name')]
示例:
CREATE EXTERNAL TABLE external_t1 STORED AS KUDU
TBLPROPERTIES ('kudu.table_name'='kudu_tbl_created_via_api');
语法:
CREATE TABLE [IF NOT EXISTS] db_name.]table_name
[PRIMARY KEY (col_name[, ...])]
[PARTITION BY kudu_partition_clause]
[COMMENT 'table_comment']
STORED AS KUDU
[TBLPROPERTIES ('key1'='value1', 'key2'='value2', ...)]
AS
select_statement
示例:
CREATE TABLE ctas_t1
PRIMARY KEY (id) PARTITION BY HASH (id) PARTITIONS 10
STORED AS KUDU
AS SELECT id, s FROM kudu_t1;
CREATE TABLE ... LIKE
形式允许一组受限的子句,目前仅支持LOCATION、COMMENT和STORED AS子句。
语法:
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
LIKE { [db_name.]table_name | PARQUET 'hdfs_path_of_parquet_file' }
[COMMENT 'table_comment']
[STORED AS file_format]
[LOCATION 'hdfs_path']
在本教程中,我们学习了如何使用Impala创建不同类型的表,包括基本数据类型表、复杂数据类型表、行格式、数据文件格式、CTAS数据文件格式、Kudu表、Kudu分区表、Kudu外部表以及通过CTAS创建Kudu表。此外,我们还介绍了如何克隆表的结构。
希望本教程对您有所帮助!如有任何疑问或问题,请随时在评论区留言。感谢阅读!