Today let's look at a SQL problem:

Shop  Sale date   Amount
A 2017-10-11 300
A 2017-10-12 200
B 2017-10-11 400
B 2017-10-12 200
A 2017-10-13 100
A 2017-10-15 100
C 2017-10-11 350
C 2017-10-15 400
C 2017-10-16 200
D 2017-10-13 500
E 2017-10-14 600
E 2017-10-15 500
D 2017-10-14 600
B 2017-10-13 300
C 2017-10-17 100

Requirement: find the shops that have sales records on three consecutive days.

Approach one:

The natural idea is to check, for each shop, whether it has sales on three consecutive days. Starting from the existing table, we build an intermediate table in which each row carries the sales amount for the current date together with the amounts for the two following days, and then keep the shops where all three amounts are greater than 0. This idea can be implemented in (at least) two ways.

The first is a self-join, joining the table to itself twice. The join condition is that the shop names match and the dates differ by exactly one day. This works in both MySQL and Hive; taking MySQL as an example, the code is as follows:


mysql> create table sales (
name char(1),
day char(10),
amount int
);
mysql> insert into sales values('A', '2017-10-11', 300);
mysql> insert into sales values('A', '2017-10-12', 200);
mysql> insert into sales values('B', '2017-10-11', 400);
mysql> insert into sales values('B', '2017-10-12', 200);
mysql> insert into sales values('A', '2017-10-13', 100);
mysql> insert into sales values('A', '2017-10-15', 100);
mysql> insert into sales values('C', '2017-10-11', 350);
mysql> insert into sales values('C', '2017-10-15', 400);
mysql> insert into sales values('C', '2017-10-16', 200);
mysql> insert into sales values('D', '2017-10-13', 500);
mysql> insert into sales values('E', '2017-10-14', 600);
mysql> insert into sales values('E', '2017-10-15', 500);
mysql> insert into sales values('D', '2017-10-14', 600);
mysql> insert into sales values('B', '2017-10-13', 300);
mysql> insert into sales values('C', '2017-10-17', 100);

Here is roughly what the intermediate table we want to build looks like:

mysql> select *
-> from sales a
-> left join sales b
-> on a.name = b.name and
-> datediff(str_to_date(b.day, '%Y-%m-%d'), str_to_date(a.day, '%Y-%m-%d')) = 1
-> left join sales c
-> on b.name = c.name and
-> datediff(str_to_date(c.day, '%Y-%m-%d'), str_to_date(b.day, '%Y-%m-%d')) = 1;
+------+------------+--------+------+------------+--------+------+------------+--------+
| name | day | amount | name | day | amount | name | day | amount |
+------+------------+--------+------+------------+--------+------+------------+--------+
| A | 2017-10-11 | 300 | A | 2017-10-12 | 200 | A | 2017-10-13 | 100 |
| B | 2017-10-11 | 400 | B | 2017-10-12 | 200 | B | 2017-10-13 | 300 |
| C | 2017-10-15 | 400 | C | 2017-10-16 | 200 | C | 2017-10-17 | 100 |
| A | 2017-10-12 | 200 | A | 2017-10-13 | 100 | NULL | NULL | NULL |
| E | 2017-10-14 | 600 | E | 2017-10-15 | 500 | NULL | NULL | NULL |
| D | 2017-10-13 | 500 | D | 2017-10-14 | 600 | NULL | NULL | NULL |
| B | 2017-10-12 | 200 | B | 2017-10-13 | 300 | NULL | NULL | NULL |
| C | 2017-10-16 | 200 | C | 2017-10-17 | 100 | NULL | NULL | NULL |
| A | 2017-10-13 | 100 | NULL | NULL | NULL | NULL | NULL | NULL |
| A | 2017-10-15 | 100 | NULL | NULL | NULL | NULL | NULL | NULL |
| C | 2017-10-11 | 350 | NULL | NULL | NULL | NULL | NULL | NULL |
| E | 2017-10-15 | 500 | NULL | NULL | NULL | NULL | NULL | NULL |
| D | 2017-10-14 | 600 | NULL | NULL | NULL | NULL | NULL | NULL |
| B | 2017-10-13 | 300 | NULL | NULL | NULL | NULL | NULL | NULL |
| C | 2017-10-17 | 100 | NULL | NULL | NULL | NULL | NULL | NULL |
+------+------------+--------+------+------------+--------+------+------------+--------+
15 rows in set (0.00 sec)

As you can see, the str_to_date and datediff functions are needed to compute the difference between the dates: within each row, adjacent dates increase by exactly one day, and rows that cannot be extended this way are padded with NULLs. On top of this we add a where clause filtering amount > 0 on all three columns and select the shop name, as follows:

mysql> select a.name
-> from sales a
-> left join sales b
-> on a.name = b.name and
-> datediff(str_to_date(b.day, '%Y-%m-%d'), str_to_date(a.day, '%Y-%m-%d')) = 1
-> left join sales c
-> on b.name = c.name and
-> datediff(str_to_date(c.day, '%Y-%m-%d'), str_to_date(b.day, '%Y-%m-%d')) = 1
-> where a.amount > 0 and b.amount > 0 and c.amount > 0;
+------+
| name |
+------+
| A |
| B |
| C |
+------+
3 rows in set (0.01 sec)
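The same double self-join can be sketched in Python with the built-in sqlite3 module, which makes the logic easy to verify without a MySQL server. This is a hedged port, not code from the original article: SQLite has no datediff or str_to_date, so julianday() stands in for the date arithmetic.

```python
# Recreate the sales table in an in-memory SQLite database and run the
# double self-join; julianday() replaces MySQL's datediff()/str_to_date().
import sqlite3

rows = [
    ("A", "2017-10-11", 300), ("A", "2017-10-12", 200),
    ("B", "2017-10-11", 400), ("B", "2017-10-12", 200),
    ("A", "2017-10-13", 100), ("A", "2017-10-15", 100),
    ("C", "2017-10-11", 350), ("C", "2017-10-15", 400),
    ("C", "2017-10-16", 200), ("D", "2017-10-13", 500),
    ("E", "2017-10-14", 600), ("E", "2017-10-15", 500),
    ("D", "2017-10-14", 600), ("B", "2017-10-13", 300),
    ("C", "2017-10-17", 100),
]
conn = sqlite3.connect(":memory:")
conn.execute("create table sales (name text, day text, amount int)")
conn.executemany("insert into sales values (?, ?, ?)", rows)

# Join on "same shop, next calendar day", twice, then filter the amounts.
query = """
select distinct a.name
from sales a
join sales b
  on a.name = b.name and julianday(b.day) - julianday(a.day) = 1
join sales c
  on b.name = c.name and julianday(c.day) - julianday(b.day) = 1
where a.amount > 0 and b.amount > 0 and c.amount > 0
order by a.name
"""
print([r[0] for r in conn.execute(query)])  # ['A', 'B', 'C']
```

Using inner joins plus distinct instead of the article's left joins gives the same shops, because the final where clause already discards the rows that the left joins pad with NULLs.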

Next, can the same idea be implemented in Hive, and is there any difference? The join approach certainly works there too. But Hive also provides window functions, among them lead, which fetches values from the rows following the current one. If we sort the records by date in ascending order, can we use lead to build an intermediate table with the same structure? Let's look at the code:

hive> create external table sales(name string, day string, amount int) row format delimited fields terminated by '\t';
OK
Time taken: 0.043 seconds
hive> load data local inpath 'sales.txt' into table sales;
Loading data to table learn.sales
Table learn.sales stats: [numFiles=1, totalSize=255]
OK
Time taken: 0.152 seconds
hive> select * from sales;
OK
A 2017-10-11 300
A 2017-10-12 200
B 2017-10-11 400
B 2017-10-12 200
A 2017-10-13 100
A 2017-10-15 100
C 2017-10-11 350
C 2017-10-15 400
C 2017-10-16 200
D 2017-10-13 500
E 2017-10-14 600
E 2017-10-15 500
D 2017-10-14 600
B 2017-10-13 300
C 2017-10-17 100
Time taken: 0.042 seconds, Fetched: 15 row(s)

The code that uses the lead function to build the intermediate table is as follows:

hive> select *
> from
> (
> select a.name, a.day day, a.amount,
> lead(a.name, 1, null) over (partition by a.name order by day) as name2,
> lead(a.day, 1, null) over (partition by a.name order by day) as day2,
> lead(a.amount, 1, null) over (partition by a.name order by day) as amount2,
> lead(a.name, 2, null) over (partition by a.name order by day) as name3,
> lead(a.day, 2, null) over (partition by a.name order by day) as day3,
> lead(a.amount, 2, null) over (partition by a.name order by day) as amount3
> from
> (
> select * from sales order by name, day
> ) a
> ) b ;
OK
A 2017-10-11 300 A 2017-10-12 200 A 2017-10-13 100
A 2017-10-12 200 A 2017-10-13 100 A 2017-10-15 100
A 2017-10-13 100 A 2017-10-15 100 NULL NULL NULL
A 2017-10-15 100 NULL NULL NULL NULL NULL NULL
B 2017-10-11 400 B 2017-10-12 200 B 2017-10-13 300
B 2017-10-12 200 B 2017-10-13 300 NULL NULL NULL
B 2017-10-13 300 NULL NULL NULL NULL NULL NULL
C 2017-10-11 350 C 2017-10-15 400 C 2017-10-16 200
C 2017-10-15 400 C 2017-10-16 200 C 2017-10-17 100
C 2017-10-16 200 C 2017-10-17 100 NULL NULL NULL
C 2017-10-17 100 NULL NULL NULL NULL NULL NULL
D 2017-10-13 500 D 2017-10-14 600 NULL NULL NULL
D 2017-10-14 600 NULL NULL NULL NULL NULL NULL
E 2017-10-14 600 E 2017-10-15 500 NULL NULL NULL
E 2017-10-15 500 NULL NULL NULL NULL NULL NULL
Time taken: 18.652 seconds, Fetched: 15 row(s)

This result is similar to the one before, but sorted by shop name. Again we need to check the date differences and then use a where clause to keep the records with amount > 0 in all three columns, selecting the shop name; here we use the datediff and to_date functions.

hive> select b.name
> from
> (
> select a.name, a.day day, a.amount,
> lead(a.name, 1, null) over (partition by a.name order by day) as name2,
> lead(a.day, 1, null) over (partition by a.name order by day) as day2,
> lead(a.amount, 1, null) over (partition by a.name order by day) as amount2,
> lead(a.name, 2, null) over (partition by a.name order by day) as name3,
> lead(a.day, 2, null) over (partition by a.name order by day) as day3,
> lead(a.amount, 2, null) over (partition by a.name order by day) as amount3
> from
> (
> select * from sales order by name, day
> ) a
> ) b
> where b.amount > 0 and b.amount2 > 0 and b.amount3 > 0
> and datediff(to_date(b.day3), to_date(b.day2)) = 1 and datediff(to_date(b.day2), to_date(b.day)) = 1;
OK
A
B
C
Time taken: 17.858 seconds, Fetched: 3 row(s)
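What lead(col, 1) and lead(col, 2) do over each partition can be mimicked in plain Python by sorting each shop's rows by date and peeking one and two rows ahead. A minimal sketch over the same sample data:

```python
# Simulate lead() over (partition by name order by day): group rows by
# shop, order each group by date, and inspect sliding windows of three.
from datetime import date
from itertools import groupby

rows = [
    ("A", "2017-10-11", 300), ("A", "2017-10-12", 200),
    ("B", "2017-10-11", 400), ("B", "2017-10-12", 200),
    ("A", "2017-10-13", 100), ("A", "2017-10-15", 100),
    ("C", "2017-10-11", 350), ("C", "2017-10-15", 400),
    ("C", "2017-10-16", 200), ("D", "2017-10-13", 500),
    ("E", "2017-10-14", 600), ("E", "2017-10-15", 500),
    ("D", "2017-10-14", 600), ("B", "2017-10-13", 300),
    ("C", "2017-10-17", 100),
]

def parse(day):
    # "2017-10-11" -> datetime.date(2017, 10, 11)
    y, m, d = map(int, day.split("-"))
    return date(y, m, d)

hits = set()
for name, grp in groupby(sorted(rows), key=lambda r: r[0]):
    grp = list(grp)  # this shop's rows, already ordered by day
    for i in range(len(grp) - 2):
        window = grp[i:i + 3]  # current row plus its two "lead" rows
        d1, d2, d3 = (parse(r[1]) for r in window)
        if ((d2 - d1).days == 1 and (d3 - d2).days == 1
                and all(r[2] > 0 for r in window)):
            hits.add(name)
print(sorted(hits))  # ['A', 'B', 'C']
```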

Approach two:

The approach above is fairly natural, but think a little further: what if the requirement were 7 consecutive days, or 15, instead of 3? We would have to write many more joins, or many more leads; the efficiency of all those joins is a concern, and the code gets tedious. Is there a better way? There is, and it is a bit of a pattern-spotting trick: rank each shop's sales records by date within the shop, then add the rank to the day of month. The resulting sum follows a pattern, and no join is needed. Here is the code, run in Hive (MapReduce logs omitted):

hive> select *, substr(day, 9, 2), row_number() over (partition by name order by day desc),
> cast(substr(day, 9, 2) as int) + row_number() over (partition by name order by day desc) as plus
> from sales;
OK
A 2017-10-15 100 15 1 16
A 2017-10-13 100 13 2 15
A 2017-10-12 200 12 3 15
A 2017-10-11 300 11 4 15
B 2017-10-13 300 13 1 14
B 2017-10-12 200 12 2 14
B 2017-10-11 400 11 3 14
C 2017-10-17 100 17 1 18
C 2017-10-16 200 16 2 18
C 2017-10-15 400 15 3 18
C 2017-10-11 350 11 4 15
D 2017-10-14 600 14 1 15
D 2017-10-13 500 13 2 15
E 2017-10-15 500 15 1 16
E 2017-10-14 600 14 2 16

In the result above, the third column from the end is the day of month, the second from the end is the row number when each shop's records are sorted by date in descending order, and the last column is the sum of the two. Notice that for a shop's records on consecutive sales dates, this sum is constant, and it appears once for each day of the run; when the dates are not consecutive, the sums differ. So for a run of 3 consecutive days we simply select the shops whose sum appears 3 times; for 7 days, 7 times, and so on. The code is as follows (MapReduce logs omitted):

hive> select b.name
> from
> (
> select a.name, plus, count(*)
> from
> (
> select *, substr(day, 9, 2), row_number() over (partition by name order by day desc),
> cast(substr(day, 9, 2) as int) + row_number() over (partition by name order by day desc) as plus
> from sales
> ) a
> group by
> a.name, plus
> having count(*) >=3
> ) b;
OK
A
B
C

As you can see, this also produces the result we want.
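Approach two generalizes neatly: instead of adding the rank to the day of month, subtract the rank (as days) from the full date. This gives a constant "anchor" date for every consecutive run, and it keeps working across month boundaries, which the day-of-month sum above relies on avoiding since all the sample dates fall in October. A hedged Python sketch with a configurable run length n:

```python
# For each shop, sort its dates and subtract each date's row number from
# it; every run of consecutive dates collapses to the same anchor date,
# so a run of length n shows up as an anchor counted at least n times.
from collections import Counter, defaultdict
from datetime import date, timedelta

def shops_with_streak(rows, n):
    by_shop = defaultdict(list)
    for name, day, amount in rows:
        y, m, d = map(int, day.split("-"))
        by_shop[name].append(date(y, m, d))
    result = []
    for name, days in by_shop.items():
        anchors = Counter(d - timedelta(days=i)
                          for i, d in enumerate(sorted(days), start=1))
        if any(count >= n for count in anchors.values()):
            result.append(name)
    return sorted(result)

rows = [
    ("A", "2017-10-11", 300), ("A", "2017-10-12", 200),
    ("B", "2017-10-11", 400), ("B", "2017-10-12", 200),
    ("A", "2017-10-13", 100), ("A", "2017-10-15", 100),
    ("C", "2017-10-11", 350), ("C", "2017-10-15", 400),
    ("C", "2017-10-16", 200), ("D", "2017-10-13", 500),
    ("E", "2017-10-14", 600), ("E", "2017-10-15", 500),
    ("D", "2017-10-14", 600), ("B", "2017-10-13", 300),
    ("C", "2017-10-17", 100),
]
print(shops_with_streak(rows, 3))  # ['A', 'B', 'C']
print(shops_with_streak(rows, 2))  # ['A', 'B', 'C', 'D', 'E']
```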

Summary

We have implemented "find the shops with sales records on three consecutive days" with two approaches and three methods. The first two methods are easy to understand; the third may be less obvious, but it generalizes easily. I hope this gives you some inspiration. Note that we only tested on hand-crafted data rather than on a proper online judge, so these may not be the optimal solutions. If you have a better approach, feel free to follow my WeChat official account and discuss~
