在纯 hive SQL 中创建时间维度表

Hive 专栏收录该内容
12 篇文章 1 订阅

Without further ado, here is the full SQL to create a table giving you a table with one row per day, with date, year, mont, day, day and name of the week, day of the year. If you want the hours as well, look at the bottom of this post.

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

set hivevar:start_day=2010-01-01;

set hivevar:end_day=2050-12-31;

set hivevar:timeDimTable=default.timeDim;

 

create table if not exists ${timeDimTable} as

with dates as (

select date_add("${start_day}", a.pos) as d

from (select posexplode(split(repeat("o", datediff("${end_day}", "${start_day}")), "o"))) a

)

select

    d as d

  , year(d) as year

  , month(d) as month

  , day(d) as day

  , date_format(d, 'u') as daynumber_of_week

  , date_format(d, 'EEEE') as dayname_of_week

  , date_format(d, 'D') as daynumber_of_year

 

from dates

sort by d

;

Note that I use d as date column because date is a reserved keyword.

The biggest issue is to generate one row per day. The trick here is to use a clever combination of posexplode, split and reapeat. This is what the first CTE does:

1

2

3

4

5

-- just 10 days for the example

set hivevar:start_day=2010-01-01;

set hivevar:end_day=2010-01-10;

select date_add("${start_day}", a.pos) as d

from (select posexplode(split(repeat("o", datediff("${end_day}", "${start_day}")), "o"))) a

We can break it down in a few parts:

1

2

select datediff("${end_day}", "${start_day}");

-- output: 9

Just computes the difference between start and end day in days.

1

2

select repeat("o", 9);

-- output: ooooooooo

Will output a string with 9 ‘o’. The actual character does not matter at all.

1

2

select split("ooooooooo", "o");

-- output:  ["","","","","","","","","",""]

Creates a hive array of 9 (empty) strings.

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

select posexplode(split("ooooooooo", "o"));

-- output:

-- +------+------+--+

-- | pos | val |

-- +------+------+--+

-- | 0 | |

-- | 1 | |

-- | 2 | |

-- | 3 | |

-- | 4 | |

-- | 5 | |

-- | 6 | |

-- | 7 | |

-- | 8 | |

-- | 9 | |

-- +------+------+--+

Actually create a row per array element, with the index (0 to 9) and the value (nothing) of each element.

That was the tricky part, the rest is easy. The first CTE creates a row with each date, adding the array index (in day) to the start_day:

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

with dates as (

select date_add("${start_day}", a.pos) as d

from (select posexplode(split(repeat("o", datediff("${end_day}", "${start_day}")), "o"))) a)

select * from dates;

-- +-------------+--+

-- | dates.d |

-- +-------------+--+

-- | 2010-01-01 |

-- | 2010-01-02 |

-- | 2010-01-03 |

-- | 2010-01-04 |

-- | 2010-01-05 |

-- | 2010-01-06 |

-- | 2010-01-07 |

-- | 2010-01-08 |

-- | 2010-01-09 |

-- | 2010-01-10 |

-- +-------------+--+

From there on, you can just create whatever column you feel like. Quarter column? floor(1+ month(d)/4) as quarter. Long name of the week? date_format(d, 'EEEE') as dayname_of_week_long.

As a bonus, I give you the same table but with hours added. The principles are exactly the same, with a cartesian join beween dates and hour:

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

set hivevar:start_day=2010-01-01;

set hivevar:end_day=2010-01-02;

set hivevar:timeDimTable=default.timeDim;

 

create table if not exists ${timeDimTable} as<span id="mce_SELREST_start" style="overflow:hidden;line-height:0;">&#65279;</span>

with dates as (

  select date_add("${start_day}", a.pos) as d

  from (select posexplode(split(repeat("o", datediff("${end_day}", "${start_day}")), "o"))) a

),

hours as (

  select a.pos as h

  from (select posexplode(split(repeat("o", 23), "o"))) a

)

select

    from_unixtime(unix_timestamp(cast(d as timestamp)) + (h * 3600)) as dt

  , d as d

  , year(d) as year

  , month(d) as month

  , day(d) as day

  , h as hour

  , date_format(d, 'u') as daynumber_of_week

  , date_format(d, 'EEEE') as dayname_of_week

  , date_format(d, 'D') as daynumber_of_year

 

from dates

join hours

sort by dt

;

转载:https://thisdataguy.com/2018/04/24/create-a-time-dimension-table-in-pure-hive-sql/

展开阅读全文
  • 3
    点赞
  • 1
    评论
  • 6
    收藏
  • 一键三连
    一键三连
  • 扫一扫,分享海报

相关推荐
©️2020 CSDN 皮肤主题: 编程工作室 设计师:CSDN官方博客 返回首页
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、C币套餐、付费专栏及课程。

余额充值