Hive HSQL Funnel Model: Common SQL Patterns for Data Analysis

Latest recommended article published on 2025-03-03 14:14:02

很圆的方块

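License: CC 4.0 BY-SA

Tags: hive, hsql, funnel model

Article link: https://blog.csdn.net/weixin_36256978/article/details/112409249

This article walks through common SQL operations: case when, the difference between inner and left joins, how to use distinct, caveats for order by, group by and having, and SQL optimization strategies. It stresses that the key to building a funnel model in Hive is the left join, and shows how to use case when for multi-condition statistics. It also discusses the role of aggregate functions in data analysis.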

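The following covers commonly used SQL patterns:

case when: obscure or not, you may never have seen it written this way

Inner join vs. left join: roughly 80% of business code involves them

distinct: you may have been blaming distinct unfairly

order by caveats: order by generally belongs in the main query; inside a subquery it has no effect!

group by: beginners keep getting errors with group by!

having: having can be surprisingly powerful

Top-N problems: the maximum per group, the top few per group

Standard SQL vs. Hive SQL: the most common differences

Aggregate functions you need to know: aggregate functions in a database really are much faster than Excel!

SQL optimization: a long road ahead...

Things to watch for when writing SQL for a requirement: one person's opinion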

case when

1. Give different job roles different salary raises:

select last_name, job_id, salary,
       case job_id
           when 'IT_PROG'  then 1.1  * salary
           when 'ST_CLERK' then 1.15 * salary
           when 'SA_REP'   then 1.2  * salary
           else salary
       end as "zhangngongzi"  -- column alias for the adjusted salary
from employees;

2. Row-to-column pivoting with case when

select deptno,
       sum(clerk) as clerk,
       sum(salesman) as salesman,
       sum(manager) as manager,
       sum(analyst) as analyst,
       sum(president) as president
from (
    select deptno,
           case job when 'CLERK' then sal end as clerk,
           case job when 'SALESMAN' then sal end as salesman,
           case job when 'MANAGER' then sal end as manager,
           case job when 'ANALYST' then sal end as analyst,
           case job when 'PRESIDENT' then sal end as president
    from emp
) t  -- a derived table needs an alias in MySQL and Hive
group by deptno;
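To make the pivot concrete, here is a minimal sketch that runs the same idea against an in-memory SQLite database through Python's sqlite3 module. SQLite is only a stand-in for the author's database, and the emp rows are made up for illustration.

```python
import sqlite3

# Hypothetical miniature emp table; the rows are invented for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("create table emp (deptno integer, job text, sal integer)")
conn.executemany(
    "insert into emp values (?, ?, ?)",
    [(10, "CLERK", 1300), (10, "MANAGER", 2450),
     (20, "CLERK", 800), (20, "CLERK", 1100), (20, "ANALYST", 3000)],
)

# One sum(case ...) per job value turns job rows into columns.
pivot = conn.execute("""
    select deptno,
           sum(case job when 'CLERK'   then sal end) as clerk,
           sum(case job when 'MANAGER' then sal end) as manager,
           sum(case job when 'ANALYST' then sal end) as analyst
    from emp
    group by deptno
    order by deptno
""").fetchall()

print(pivot)
```

A department with no employee in a given job gets null (None in Python) in that column, because the case expression has no else branch.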

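As an aside, the logical execution order of a SQL statement is: from -> where -> group by -> having -> select -> distinct -> order by -> limit

With multi-table joins, subqueries run before the main query, and the order becomes: Cartesian product of the two tables -> join -> where -> group by -> having -> select -> distinct -> order by -> limit

[Figure: execution order of a SQL statement with a left join]

[Figure: the final result]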

3. When you want several metrics at once without writing several statements, case when can help:

create table qiuzhiliao as 
    select week_begin_date_id,u_user,
    (case 
        when channel = 'oppokeke' then  'oppokeke'
        when channel = 'huawei' then  'huawei'
        when channel = 'xiaomi' then 'xiaomi'
        when channel = 'yingyongbao' then  'yingyongbao'
        when channel = 'yingyongbaozx' then  'yingyongbaozx'
        when channel = 'AppStore' then  'AppStore'
        when channel = 'baidu' then  'baidu'
        --when channel = '360zhushou' then channel ='360zhushou'
        when channel = 'wandoujia'  then 'wandoujia'
        when channel like 'bdsem%' then 'bdsem'
        when channel like 'sgsem%' then 'sgsem'
        when channel like 'smsem%' then 'smsem'
        else '360zhushou'
        end 
    ) AS channel
    from tmp.qzl_1
    where (channel in ('oppokeke','huawei','xiaomi','yingyongbao','yingyongbaozx','AppStore','baidu','360zhushou','wandoujia')
        or channel like 'bdsem%'
        or channel like 'sgsem%'
        or channel like 'smsem%');

select week_begin_date_id, channel, count(distinct u_user)
from qiuzhiliao
group by week_begin_date_id, channel
order by week_begin_date_id, channel;

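This groups rows that share a characteristic into one category, so that you can then compute statistics per category.
case <column> when <value> then ... is equivalent to case when <column> = <value> then ...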
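The equivalence of the two case forms can be checked with a small sketch, again using SQLite via Python's sqlite3 as a stand-in engine. The table t and its channel values are hypothetical. The last query shows why the channel example above uses the searched form: a like pattern cannot be written in the simple form.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table t (channel text)")
conn.executemany("insert into t values (?)",
                 [("huawei",), ("bdsem_01",), ("xiaomi",), ("other",)])

# Simple form: case <column> when <value> then ...
simple = conn.execute("""
    select case channel when 'huawei' then 'huawei'
                        when 'xiaomi' then 'xiaomi'
                        else 'rest' end
    from t order by rowid
""").fetchall()

# Searched form: case when <column> = <value> then ...
searched = conn.execute("""
    select case when channel = 'huawei' then 'huawei'
                when channel = 'xiaomi' then 'xiaomi'
                else 'rest' end
    from t order by rowid
""").fetchall()

# Only the searched form can hold an arbitrary condition such as like.
grouped = conn.execute("""
    select case when channel like 'bdsem%' then 'bdsem' else channel end
    from t order by rowid
""").fetchall()

print(simple == searched, grouped)
```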

4. Multi-condition statistics with case when

select zc.stab,
    sum(case when toppop > 10000 then 1 else 0 end) as num_10000,
    --sum(case when toppop > 10000 then 1 end) as num_10000
    sum(case when toppop > 1000 then 1 else 0 end) as num_1000
from ZipCensus zc
group by zc.stab

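Rows with toppop > 10000 are counted as num_10000 and rows with toppop > 1000 as num_1000; this is how case when handles multi-condition statistics. The commented-out line returns null when the condition is not met, whereas the first case when returns 0. When counting, it is better to return a number than null: if no row in a group matches, sum() over nothing but nulls yields null instead of 0. Inside aggregate functions, case when is typically used with sum(), max(), or avg(), and only occasionally with count(distinct ...).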
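The difference between the two variants shows up only when a group has no matching row. A minimal sketch, assuming a hypothetical zipcensus table with made-up rows and SQLite (via Python's sqlite3) as the engine:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table zipcensus (stab text, toppop integer)")
conn.executemany(
    "insert into zipcensus values (?, ?)",
    [("NY", 20000), ("NY", 5000), ("NY", 500), ("WY", 800), ("WY", 300)],
)

counts = conn.execute("""
    select stab,
           sum(case when toppop > 10000 then 1 else 0 end) as num_10000,
           sum(case when toppop > 1000  then 1 else 0 end) as num_1000,
           sum(case when toppop > 10000 then 1 end)        as num_10000_no_else
    from zipcensus
    group by stab
    order by stab
""").fetchall()

print(counts)
```

For WY no row exceeds 10000, so the variant with else 0 reports 0 while the variant without else reports null (None in Python).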

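Inner join vs. left join

Inner join: rows whose join keys match are joined; rows without a match are dropped!

Left join: the left table is the reference (every row of the left table appears); the right table is matched against it, matched values are filled in, and unmatched rows get null. Because of those nulls, we get the funnel model that business analysis ends up with!

[Table: comparison of left join and inner join]

For example: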

create table tmp.qiuzhiliao as
select t1.u_user, t1.date_sk
from (
    select u_user, date_sk
    from dw.`order`  -- order is a reserved word, so it is quoted with backticks
    where date_sk between 20190211 and 20190318
      and status in ('支付成功')  -- payment succeeded
) t1
left join (
    select u_user
    from dw.`order`
    where date_sk between 20190211 and 20190318
      and status = '退款成功'  -- refund succeeded
) t2
on t1.u_user = t2.u_user
where t2.u_user is null

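The most common model in business: the main table t1 holds users whose payment succeeded (including users who paid and later got a refund), and t2 holds users whose refund succeeded. A left join plus the where filter yields the users who actually paid (payment succeeded minus refund succeeded).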
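The effect of the left join and the is null filter can be traced on a few rows. This is a sketch under assumptions: the orders table, its English status values 'paid' and 'refunded', and the user IDs are all hypothetical, and SQLite (via Python's sqlite3) stands in for Hive.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table orders (u_user integer, status text)")
conn.executemany(
    "insert into orders values (?, ?)",
    [(1, "paid"), (2, "paid"), (2, "refunded"), (3, "paid")],
)

base = """
    from (select u_user from orders where status = 'paid') t1
    {join} (select u_user from orders where status = 'refunded') t2
      on t1.u_user = t2.u_user
"""

# Inner join keeps only users present on both sides.
inner_rows = conn.execute(
    "select t1.u_user" + base.format(join="inner join") + "order by t1.u_user"
).fetchall()

# Left join keeps every paying user; unmatched ones get null on the right.
left_rows = conn.execute(
    "select t1.u_user, t2.u_user" + base.format(join="left join")
    + "order by t1.u_user"
).fetchall()

# Filtering on the null keeps paying users who never got a refund.
net_paid = conn.execute(
    "select t1.u_user" + base.format(join="left join")
    + "where t2.u_user is null order by t1.u_user"
).fetchall()

print(inner_rows, left_rows, net_paid)
```

Counting the rows at each stage (3 paying users, 2 of them without a refund) gives the step-by-step drop-off that a funnel report shows.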

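So when do we use an inner join? Because of how tables are designed, a single table cannot hold every data item; from an I/O performance standpoint, the wider the table, the longer queries take. So when you need to reference columns across tables, you can use an inner join!

For example: table A has user, video, and payment type; table B has video and video duration.

Now you want to know how long users spent watching videos; you need the user and video name from table A and the video duration from table B.

So you have to use an inner join.

In table design, not every table has a primary key, but tables that are joined are usually related through one. A table without a foreign-key constraint is called the parent table, and a table with a foreign-key constraint is called the child table; the child table's foreign key references the parent table's primary key.

Primary key: non-null and unique; a table has only one primary key! But one primary key can span several columns (a composite primary key).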

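Back to the example: the video column in table B (one row per video) is the primary key, and the video column in table A is the foreign key that references it!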

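Using distinct

1. distinct removes duplicate values, but when two or more columns are selected it deduplicates on the combination of those columns (a row counts as a duplicate only when all the columns are equal).

2. distinct must come immediately after select, before the first column; otherwise an error is raised!

3. For deduplication, group by is the better choice; distinct shows up more often inside count(distinct column), to count the number of distinct values.

Reference: Usage of distinct in MySQL - 苔苔以苔苔以苔 - 博客园 (cnblogs)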
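The three points about distinct can be tried on a tiny table. This is a minimal sketch with a hypothetical table t and made-up rows, run on SQLite through Python's sqlite3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table t (a text, b integer)")
conn.executemany("insert into t values (?, ?)",
                 [("x", 1), ("x", 1), ("x", 2), ("y", 1)])

# distinct applies to the (a, b) pair, not to each column separately.
distinct_rows = conn.execute(
    "select distinct a, b from t order by a, b").fetchall()

# group by on the same columns gives the same deduplicated rows.
grouped_rows = conn.execute(
    "select a, b from t group by a, b order by a, b").fetchall()

# count(distinct ...) counts the distinct values of one column.
distinct_a = conn.execute("select count(distinct a) from t").fetchone()[0]

print(distinct_rows, grouped_rows, distinct_a)
```

Only the exact duplicate ('x', 1) is collapsed; ('x', 2) survives because the pair differs.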

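order by caveats

1. order by goes at the end of the statement and is also executed near the end (only limit comes after it).

2. An order by inside a subquery does not determine the order of the final result; it generally goes in the main query, where it runs last!

Further cross-checks confirmed that only with order by in the main query are the resulting u_user values the top 30 users we want!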

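group by

1. group by is followed by the grouping key(s).

2. After select, only the grouping keys or aggregate functions may appear; any other column raises an error!

Reference: SQL GROUP BY usage examples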

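having can be surprisingly powerful

1. having can only follow group by; it cannot be used on its own.

2. having filters the grouped data produced by group by.

Reference: SQL GROUP BY usage examples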
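A short sketch of group by with having, assuming a hypothetical score table with English column names and made-up rows (SQLite via Python's sqlite3 as the engine). The where clause filters rows before grouping, while having filters the groups afterwards.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table score (course text, student text, score integer)")
conn.executemany(
    "insert into score values (?, ?, ?)",
    [("0001", "s1", 80), ("0001", "s2", 90),
     ("0002", "s1", 50), ("0002", "s2", 60),
     ("0003", "s1", 70)],
)

# having: keep only courses whose average score is at least 70.
good_courses = conn.execute("""
    select course, avg(score) as avg_score
    from score
    group by course
    having avg(score) >= 70
    order by course
""").fetchall()

# where + having: drop failing rows first, then keep courses
# that still have at least two remaining rows.
busy_courses = conn.execute("""
    select course, count(*) as n
    from score
    where score >= 60
    group by course
    having count(*) >= 2
    order by course
""").fetchall()

print(good_courses, busy_courses)
```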

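Top-N problems

1. Maximum, minimum, or average per group: group by + aggregate function. This cannot return columns other than the grouping key, though; for that, use a correlated subquery.

2. Top two rows per group: limit + union all

In the queries below, 课程号 is the course ID, 成绩 is the score, and 最大成绩 is the highest score.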

select 课程号,max(成绩) as 最大成绩
from score 
group by 课程号;

This gets the maximum per group, but no columns other than the grouping key;

select * 
from score as a 
where 成绩 = (
select max(成绩) 
from score as b 
where b.课程号 = a.课程号);

Implemented with a correlated subquery;

The top N rows per group

1. First find the groups that the maximum records belong to

2. Combine the per-group queries with union

select 课程号,max(成绩) as 最大成绩
from score 
group by 课程号;

(select * from score where 课程号 = '0001' order by 成绩  desc limit 2)
union all
(select * from score where 课程号 = '0002' order by 成绩  desc limit 2)
union all
(select * from score where 课程号 = '0003' order by 成绩  desc limit 2);
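The union all approach needs one branch per course. As an alternative that the article does not use, a window function can rank rows inside every group in a single query. The sketch below assumes a hypothetical score table with English column names and made-up rows, and runs on SQLite through Python's sqlite3 (window functions need SQLite 3.25 or newer). Hive also provides row_number().

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table score (course text, student text, score integer)")
conn.executemany(
    "insert into score values (?, ?, ?)",
    [("0001", "s1", 80), ("0001", "s2", 90), ("0001", "s3", 70),
     ("0002", "s1", 50), ("0002", "s2", 60), ("0002", "s3", 95)],
)

# Number the rows of each course from the highest score down,
# then keep the first two of every course.
top2 = conn.execute("""
    select course, student, score
    from (
        select course, student, score,
               row_number() over (partition by course
                                  order by score desc) as rn
        from score
    ) ranked
    where rn <= 2
    order by course, score desc
""").fetchall()

print(top2)
```

row_number() breaks ties arbitrarily; rank() or dense_rank() would be the choice if tied scores should share a position.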

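Standard SQL vs. Hive SQL

In standard SQL we can write the following statement: find the student ID (学号) and name (姓名) of every student with a course score (成绩) below 60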

select 学号,姓名
from student
where  学号 in (
select 学号 
from student
where 成绩 < 60
);

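In older Hive versions (before 0.13), in can only be followed by a list of literals, for example in ('huawei','oppo'); the subquery syntax above is not allowed!

Now rewrite the statement above using an inner join: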

select t1.学号,t1.姓名
from (
select 学号,姓名
from student 
)t1
inner join (
      select 学号 
from student
where 成绩 < 60
)t2
on t1.学号=t2.学号
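One thing to watch with this rewrite: if the subquery can return the same key more than once, the inner join repeats the outer row, while in does not. The sketch below illustrates this under assumptions that differ from the article's single student table: it uses a hypothetical student table plus a separate score table, English column names, and made-up rows, with SQLite (via Python's sqlite3) as the engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table student (id integer, name text)")
conn.execute("create table score (student_id integer, course text, score integer)")
conn.executemany("insert into student values (?, ?)",
                 [(1, "Ann"), (2, "Bob"), (3, "Cid")])
conn.executemany("insert into score values (?, ?, ?)",
                 [(1, "0001", 55), (1, "0002", 40),
                  (2, "0001", 90), (3, "0001", 58)])

# in + subquery: each student appears at most once.
in_rows = conn.execute("""
    select id, name from student
    where id in (select student_id from score where score < 60)
    order by id
""").fetchall()

# Plain inner join: Ann failed two courses, so she appears twice.
join_rows = conn.execute("""
    select t1.id, t1.name
    from student t1
    inner join (select student_id from score where score < 60) t2
      on t1.id = t2.student_id
    order by t1.id
""").fetchall()

# Deduplicating the subquery restores the semantics of in.
dedup_rows = conn.execute("""
    select t1.id, t1.name
    from student t1
    inner join (select distinct student_id from score where score < 60) t2
      on t1.id = t2.student_id
    order by t1.id
""").fetchall()

print(in_rows, join_rows, dedup_rows)
```

Hive also offers left semi join, which expresses the same "exists in the other table" check without duplicating rows.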

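Aggregate functions you need to know

1. sum combined with case when

2. count combined with case when

Use the bands [100-85], [85-70], [70-60], and [<60] to summarize the scores for each course, reporting the number of students in each band together with the course ID (课程号) and course name (课程名称);

This is only meant to explain the logic; for the table structure and the detailed reasoning, see the references at the end of the article.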

select a.课程号,b.课程名称,
sum(case when 成绩 between 85 and 100 
	 then 1 else 0 end) as '[100-85]',
sum(case when 成绩 >=70 and 成绩<85 
	 then 1 else 0 end) as '[85-70]',
sum(case when 成绩>=60 and 成绩<70  
	 then 1 else 0 end) as '[70-60]',
sum(case when 成绩<60 then 1 else 0 end) as '[<60]'
from score as a right join course as b 
on a.课程号=b.课程号
group by a.课程号,b.课程名称;

Next, an enterprise-level application (sensitive information removed), based on Hive:

-- Product path statistics can be written like this
-- Spring Festival campaign entry statistics
SELECT day,
       count(CASE when x2_1.event_key = 'initApp' then x2_1.u_user ELSE NULL end) as ad_pv,          -- splash-screen impression PV
       count(DISTINCT CASE when x2_1.event_key = 'initApp' then x2_1.u_user ELSE NULL end) as ad_uv  -- splash-screen impression UV
FROM
    (
        SELECT from_unixtime(unix_timestamp(cast(day as string),'yyyyMMdd'),'yyyy-MM-dd') as day,
               event_key,u_user,status,button,device
        from table  -- placeholder: the real table name was removed
        WHERE from_unixtime(unix_timestamp(cast(day as string),'yyyyMMdd'),'yyyy-MM-dd') BETWEEN '2019-03-28' and date_sub(from_unixtime(unix_timestamp(),'yyyy-MM-dd'),1) --int
            AND
                    -- splash-screen impression
                    event_key = 'initApp'
    ) x2_1
group by day
order by day
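The PV/UV pattern in the query above relies on count ignoring nulls: count(case ...) counts every matching event, and count(distinct case ...) counts each matching user once. A minimal sketch, assuming a hypothetical events table with made-up rows and SQLite (via Python's sqlite3) in place of Hive:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table events (day text, event_key text, u_user text)")
conn.executemany(
    "insert into events values (?, ?, ?)",
    [("2019-03-28", "initApp", "u1"), ("2019-03-28", "initApp", "u1"),
     ("2019-03-28", "initApp", "u2"), ("2019-03-28", "click", "u3"),
     ("2019-03-29", "initApp", "u1")],
)

# Non-matching rows become null and are skipped by count.
pv_uv = conn.execute("""
    select day,
           count(case when event_key = 'initApp' then u_user end) as ad_pv,
           count(distinct case when event_key = 'initApp' then u_user end) as ad_uv
    from events
    group by day
    order by day
""").fetchall()

print(pv_uv)
```

Because the event filter lives inside the case expression rather than in where, more event keys can be added as extra columns of the same query, one funnel step per column.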