"How Mysql Works" Reading Notes 3

Welcome Big Brother to Little Brother Blog 2022-09-23 07:46:26

I. Joining two tables: how joins work

In essence, a join takes the records from each joined table, matches them against one another in turn, adds the matching combinations to the result set, and returns it to the user.

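1.1 The join process

If several tables are joined without any constraints, the Cartesian product they produce can be enormous. It is therefore necessary to filter out specific record combinations while joining. The filter conditions in a join query fall into two kinds:

  • Conditions involving a single table, such as t1.m1 > 1 and t2.n2 < 'd' in the query below
  • Conditions involving multiple tables, such as t1.m1 = t2.m2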
SELECT * FROM t1, t2 WHERE t1.m1 > 1 AND t1.m1 = t2.m2 AND t2.n2 < 'd';

This join query is executed roughly as follows:

  1. Determine the first table to query, called the driving table, and run the single-table query on it with the lowest-cost access method.
  2. For each record in the result set obtained from the driving table t1 in the previous step, look for matching records in t2 (the driven table). A matching record is one that satisfies the filter conditions on the driven table t2.

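1.2 Inner joins and outer joins

In an inner join, a driving-table record that has no matching record in the driven table is not added to the result set. In an outer join, such a record must still be added to the result set. Which table is the driving table and which is the driven table therefore matters.

For an outer join, however, we do not always want every record of the driving table in the final result set: sometimes a record that fails to match should be added, and sometimes it should not. The solution is to split the filter conditions into two kinds, WHERE and ON.

  • Filter conditions in the WHERE clause

Whether the join is inner or outer, any record that does not satisfy the filter conditions in the WHERE clause is excluded from the final result set.

  • Filter conditions in the ON clause

For a record in the driving table of an outer join, if no record in the driven table satisfies the filter conditions in the ON clause, the record is still added to the result set, and the fields of the corresponding driven-table record are filled with NULL.

In an inner join, WHERE and ON are equivalent, so an inner join does not require an ON clause. The driving table and the driven table of an inner join are interchangeable without affecting the final query result. Those of an outer join cannot be swapped freely.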
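
The different roles of ON and WHERE in an outer join can be illustrated with a small simulation. This is a minimal Python sketch, not MySQL's implementation; the sample rows for t1(m1, n1) and t2(m2, n2) and the helper `left_join` are assumptions made for illustration.

```python
# Hypothetical sample rows (not from the notes): t1(m1, n1) and t2(m2, n2).
T1 = [(1, "a"), (2, "b"), (3, "c")]
T2 = [(2, "b"), (3, "c"), (4, "d")]


def left_join(driving, driven, on, where=lambda row: True):
    """Left outer join: ON decides matching, WHERE filters the final rows."""
    result = []
    for left in driving:
        matches = [right for right in driven if on(left, right)]
        if not matches:
            # No driven-table record satisfies ON: keep the driving record
            # and fill the driven table's columns with NULL (None).
            matches = [(None, None)]
        for right in matches:
            row = left + right
            if where(row):  # WHERE applies to every candidate row
                result.append(row)
    return result


# Condition only in ON: the unmatched record (1, 'a') survives, NULL-padded.
on_only = left_join(T1, T2, on=lambda l, r: l[0] == r[0])

# Extra WHERE condition on a driven-table column (n2 < 'c'):
# the NULL-padded row is removed along with (3, 'c', 3, 'c').
with_where = left_join(
    T1,
    T2,
    on=lambda l, r: l[0] == r[0],
    where=lambda row: row[3] is not None and row[3] < "c",
)

print(on_only)
print(with_where)
```

In this sketch the ON predicate only decides whether a driven-table record matches, while the WHERE predicate can remove rows, including the NULL-padded ones.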

1.3 How joins are executed

1.3.1 Nested-loop join

SELECT * FROM t1, t2 WHERE t1.m1 > 1 AND t1.m1 = t2.m2 AND t2.n2 < 'd';

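The steps for this two-table inner join query are:

  1. Select the driving table and, using the filter conditions that involve it, run the single-table query on it with the lowest-cost access method.
  2. For each record in the result set obtained from the driving table in the previous step, look up the matching records in the driven table.

If three tables are joined, the result set from step 2 becomes the new driving table and the third table becomes the driven table.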

What is a nested-loop join: the driving table is accessed only once, but the driven table may be accessed many times, and the number of accesses is determined by the number of records in the result set of the single-table query on the driving table. It is also the simplest and most naive join algorithm.
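
The two steps can be sketched in Python for the example query (t1.m1 > 1 AND t1.m1 = t2.m2 AND t2.n2 < 'd'). This is an illustrative sketch under assumed sample data, not MySQL's code; `nested_loop_join` and the rows in `T1` and `T2` are hypothetical.

```python
# Hypothetical sample rows: t1(m1, n1) and t2(m2, n2).
T1 = [(1, "a"), (2, "b"), (3, "c")]
T2 = [(2, "b"), (3, "c"), (4, "d")]


def nested_loop_join(driving, driven):
    """Return the joined rows and how many times the driven table was accessed."""
    result = []
    driven_accesses = 0
    # Step 1: single-table query on the driving table (t1.m1 > 1).
    driving_result = [row for row in driving if row[0] > 1]
    # Step 2: one pass over the driven table per driving-table record.
    for m1, n1 in driving_result:
        driven_accesses += 1
        for m2, n2 in driven:
            if m1 == m2 and n2 < "d":
                result.append((m1, n1, m2, n2))
    return result, driven_accesses


rows, accesses = nested_loop_join(T1, T2)
print(rows, accesses)
```

The access counter equals the number of driving-table records that survive step 1, which is exactly the quantity the paragraph above says determines the cost.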

1.3.2 Using indexes to speed up joins

Step 2 of the nested-loop join may access the driven table many times. If every access is a full table scan, the driven table is scanned in full that many times.

SELECT * FROM t1, t2 WHERE t1.m1 > 1 AND t1.m1 = t2.m2 AND t2.n2 < 'd';

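Solution: add indexes on the driven table's columns that appear in the search conditions.

  • With an index on column m2, the ref access method can be used (for each driving-table record, t1.m1 is a constant, so the condition becomes t2.m2 = constant). After going back to the table, only t2.n2 < 'd' remains to be checked
  • With an index on column n2, the range access method can be used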

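1.3.3 Block-based nested-loop join

Even with the two methods above, the driven table is still accessed many times. If the table is very large and no index is applicable, the driven table has to be read from disk repeatedly, and that I/O cost is very high. So we need a way to minimize the number of times the driven table is accessed.

Join Buffer:

When the records of the driven table are loaded into memory, each of them can be matched against many driving-table records at once, which greatly reduces the cost of repeatedly loading the driven table from disk. This is where the join buffer comes in: a fixed-size block of memory allocated before the join query is executed, into which a number of records from the driving table's result set are placed first. The driven table is then scanned, and each of its records is matched against all the driving-table records in the join buffer in one pass.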

Note: not every column of a driving-table record is placed in the join buffer; only the columns in the query list and the columns in the filter conditions are.
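
The effect of the join buffer on the number of driven-table scans can be sketched as follows. This is a simplified Python illustration, not MySQL's implementation: `buffer_size` counts records here purely as an assumption (in MySQL the buffer is sized in bytes through the `join_buffer_size` system variable), and the sample rows are hypothetical.

```python
# Hypothetical sample rows: t1(m1, n1) is the driving table, t2(m2, n2) the driven table.
T1 = [(1, "a"), (2, "b"), (3, "c")]
T2 = [(2, "b"), (3, "c"), (4, "d")]


def block_nested_loop_join(driving, driven, buffer_size):
    """Join on m1 = m2, scanning the driven table once per buffer load."""
    result = []
    driven_scans = 0
    for start in range(0, len(driving), buffer_size):
        # Fill the join buffer with the next batch of driving-table records.
        join_buffer = driving[start:start + buffer_size]
        driven_scans += 1
        for m2, n2 in driven:           # one scan of the driven table ...
            for m1, n1 in join_buffer:  # ... matched against every buffered record
                if m1 == m2:
                    result.append((m1, n1, m2, n2))
    return result, driven_scans


rows_small, scans_small = block_nested_loop_join(T1, T2, buffer_size=1)
rows_large, scans_large = block_nested_loop_join(T1, T2, buffer_size=3)
print(scans_small, scans_large)
```

With a buffer that holds one record the sketch degenerates to a plain nested-loop join, while a buffer that holds the whole driving result set needs only a single scan of the driven table.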

II. Cost-based optimization in MySQL

The cost of executing a query in MySQL consists mainly of the following two parts:

  • I/O cost

The MyISAM and InnoDB storage engines that our tables commonly use both store data and indexes on disk. To query the records in a table, the data or index must first be loaded into memory before it can be processed. The time spent loading from disk into memory is called the I/O cost.

  • CPU cost

The time spent reading records, checking whether they satisfy the search conditions, sorting the result set, and so on is called the CPU cost.

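2.1 Steps of cost-based optimization

The earlier notes on secondary indexes and going back to the table already touched on this. It is the job of the query optimizer, so the steps are repeated here:

  • Find all indexes that could possibly be used, based on the search conditions

  • Calculate the cost of a full table scan

  • Calculate the cost of executing the query with each of the different indexes

  • Compare the costs of the execution plans and pick the cheapest one

The cost falls mainly into two categories: the cost of executing a query with different indexes, and the cost of join queries.

For join queries, the driving table and the driven table of an outer join are fixed, so only inner joins are discussed here. The query optimizer has to consider the optimal query cost with each table as the driving table, and then choose the cheaper join order, together with the optimal access method for each table under that order, as the final query plan.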

III. Rule-based optimization in MySQL (including subquery optimization)

We cannot always avoid writing inefficient statements that perform badly. MySQL therefore defines a set of rules and goes to great lengths to transform such statements into a form that can be executed more efficiently. This process is called query rewriting. Some of the more important rewrite rules are discussed below.

3.1 Condition simplification

  • Removing unnecessary parentheses
((a = 5 AND b = c) OR ((a > c) AND (c < 5)))
#The query optimizer will remove these parentheses
(a = 5 and b = c) OR (a > c AND c < 5)
  • Constant propagation (for conditions joined with AND)
a = 5 AND b > a
is transformed into
a = 5 AND b > 5

Why can't constants be propagated across OR?

a = 5 OR b > a #a is not necessarily 5 when the whole condition is true
  • Removing useless conditions

The optimizer removes expressions that are obviously always TRUE or always FALSE.

(a < 1 and b = b) OR (a = 6 OR 5 != 5)
#is simplified to
(a < 1 and TRUE) OR (a = 6 OR FALSE)
and finally to
a < 1 OR a = 6
  • Expression evaluation

Before the query is executed, an expression that contains only constants is evaluated first.

a = 5 + 1 is simplified to a = 6

Recall from the first set of notes: if a column does not appear on its own as an operand, for example when it is inside a function call or a more complex expression, the query optimizer will not simplify the expression:

ABS(a) > 5
-a < -8 //will not be simplified

So it is best to let an indexed column appear on its own in an expression.

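  • Merging the HAVING clause into the WHERE clause

If the query contains no aggregate functions such as SUM or MAX and no GROUP BY clause, the optimizer merges the HAVING clause into the WHERE clause.

  • Constant table detection

Two kinds of queries in MySQL are especially fast:

  1. The queried table contains no records or only one record
  2. The search condition is an equality match on the primary key or on a unique secondary index

A table queried in either of these two ways is called a constant table (const table).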

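  • Outer join elimination

First, a question: why can an inner join be more efficient than an outer join?

As noted earlier, the driving table and the driven table of an inner join are interchangeable, so for an inner join the overall query cost may be reduced by optimizing the join order of the tables.

Also as noted earlier, for a record in the driving table of an outer join, if no record in the driven table satisfies the ON clause conditions, the record is still added to the result set and each field of the corresponding driven-table record is filled with NULL. For a record in the driving table of an inner join, if no record in the driven table satisfies the ON clause conditions, the record is discarded.

So how does the optimizer eliminate an outer join?

It relies on the WHERE filter: any record that does not satisfy the conditions in the WHERE clause is excluded from the result of the join.

As long as the search conditions require a column of the driven table to be non-NULL, the driving-table records for which no driven-table record satisfies the ON clause are excluded from the final result set. In that case there is no difference between the outer join and the inner join.

SELECT * FROM t1 LEFT JOIN t2 ON t1.m1 = t2.m2 WHERE t2.n2 IS NOT NULL;

At this point, this outer join can be converted into an inner join

SELECT * FROM t1 INNER JOIN t2 ON t1.m1 = t2.m2 WHERE t2.n2 IS NOT NULL;

Null rejection

A condition in the WHERE clause that requires a column of the driven table to be non-NULL is called null-rejecting (reject-NULL). Once the WHERE clause contains a null-rejecting condition on the driven table, the outer join and the inner join are interchangeable. The benefit of this conversion is that the query optimizer can evaluate the cost of the different join orders and choose the cheapest one to execute the query.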
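
The equivalence under a null-rejecting condition can be checked with a small simulation. This is a Python sketch with assumed sample rows for t1(m1, n1) and t2(m2, n2); `inner_join`, `left_join` and `n2_is_not_null` are hypothetical helpers, not MySQL functions.

```python
# Hypothetical sample rows: t1(m1, n1) and t2(m2, n2).
T1 = [(1, "a"), (2, "b"), (3, "c")]
T2 = [(2, "b"), (3, "c"), (4, "d")]


def inner_join(t1, t2):
    """t1 INNER JOIN t2 ON t1.m1 = t2.m2."""
    return [left + right for left in t1 for right in t2 if left[0] == right[0]]


def left_join(t1, t2):
    """t1 LEFT JOIN t2 ON t1.m1 = t2.m2, padding unmatched rows with NULL (None)."""
    rows = []
    for left in t1:
        matched = [left + right for right in t2 if left[0] == right[0]]
        rows.extend(matched or [left + (None, None)])
    return rows


def n2_is_not_null(row):
    """The null-rejecting WHERE condition: t2.n2 IS NOT NULL."""
    return row[3] is not None


outer_filtered = [row for row in left_join(T1, T2) if n2_is_not_null(row)]
inner_filtered = [row for row in inner_join(T1, T2) if n2_is_not_null(row)]
print(outer_filtered == inner_filtered)
```

The NULL-padded row that the left join produces for (1, 'a') is exactly the row the WHERE condition rejects, so both joins return the same rows.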

3.2 Subquery optimization

3.2.1 Classifying subqueries by the result set they return

  • Scalar subquery: a subquery that returns a single value
 SELECT * FROM t1 WHERE m1 = (SELECT MIN(m2) FROM t2);
  • Row subquery: a subquery that returns a single record containing multiple columns
 SELECT * FROM t1 WHERE (m1, n1) = (SELECT m2, n2 FROM t2 LIMIT 1);
  • Column subquery: a subquery that returns a single column, which may contain many records
 SELECT * FROM t1 WHERE m1 IN (SELECT m2 FROM t2);
  • Table subquery: a subquery whose result contains many records and many columns
 SELECT * FROM t1 WHERE (m1, n1) IN (SELECT m2, n2 FROM t2);

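3.2.2 Classifying subqueries by their relationship to the outer query

  • Uncorrelated subquery: the subquery can be run on its own to produce a result and does not depend on values from the outer query
  • Correlated subquery: the execution of the subquery depends on values from the outer query
#Correlated subquery
SELECT * FROM t1 WHERE m1 IN (SELECT m2 FROM t2 WHERE n1 = n2);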

3.2.3 Using subqueries in Boolean expressions

  • Subqueries are most often used as part of a Boolean expression that serves as a search condition in a WHERE or ON clause

    • Using =, >, <, >=, <=, <>, != or <=> as the operator of the Boolean expression
    operand comparison_operator (subquery); the subquery here can only be a scalar subquery or a row subquery
    
    • [NOT] IN/ANY/SOME/ALL subqueries

    The result set of a column subquery or table subquery contains many records, which together act as a set

    • IN or NOT IN
    SELECT * FROM t1 WHERE (m1, n1) IN (SELECT m2, n2 FROM t2);
    
    • ANY/SOME (ANY and SOME have the same meaning)
    SELECT * FROM t1 WHERE m1 > ANY(SELECT m2 FROM t2);
    
    • ALL
    SELECT * FROM t1 WHERE m1 > ALL(SELECT m2 FROM t2);
    
    • EXISTS subqueries

    Sometimes we only need to know whether the subquery's result set contains any records, regardless of what those records are. In that case, put EXISTS or NOT EXISTS in front of the subquery

    SELECT * FROM t1 WHERE EXISTS (SELECT 1 FROM t2);
    

3.2.4 Notes on subqueries

  1. A subquery must be enclosed in parentheses

     SELECT SELECT m1 FROM t1; #error
    
  2. A subquery in the SELECT clause must be a scalar subquery

     SELECT (SELECT m1, n1 FROM t1); #error
    
  3. When a scalar or row subquery is wanted but the subquery's result set cannot be guaranteed to contain only one record, use LIMIT 1 to restrict the number of records.

  4. For [NOT] IN/ANY/SOME/ALL subqueries, LIMIT is not allowed in the subquery, and the following clauses in the subquery are redundant:

    1. ORDER BY clause
    2. DISTINCT
    3. GROUP BY clause with no aggregate functions and no HAVING clause
  5. A single statement may not delete or modify records of a table while also running a subquery on that same table

 DELETE FROM t1 WHERE m1 < (SELECT MAX(m1) FROM t1); #error

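3.3 How MySQL executes subqueries

3.3.1 Execution of scalar and row subqueries

  • For a query containing an uncorrelated scalar or row subquery, MySQL executes the outer query and the subquery independently, which amounts to two single-table queries
  • For a correlated subquery (scalar or row), execution proceeds as follows:
    • Fetch one record from the outer query
    • Take from that record the values referenced by the subquery and execute the subquery
    • Check whether the outer query's WHERE condition holds given the subquery's result; if so, add the record to the result set, otherwise discard it
    • Repeat from the first step until the outer query has no more records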

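3.3.2 IN subquery optimization

  • For uncorrelated subqueries

    • If the subquery's result set contains few records, the subquery and the outer query are simply treated as two separate single-table queries

    • If the subquery's result set is too large, it may not fit in memory. For the outer query, an oversized subquery result set also means the IN clause has a very large number of arguments, with two consequences:

      • Indexes cannot be used effectively, and the outer query can only be executed as a full table scan

      • Because the IN clause has so many arguments, checking whether a record matches the arguments in the IN clause takes too long.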

Introducing the temporary table:

For an uncorrelated IN subquery whose result set is too large, MySQL uses a temporary table. Instead of using the result of the uncorrelated subquery directly as the arguments of the outer query, it writes the result set into a temporary table.

Writing to the temporary table:

  1. The columns of the temporary table are the columns of the subquery's result set
  2. Records written to the temporary table are deduplicated
  3. The subquery's result set is usually not very large, so an in-memory temporary table using the MEMORY storage engine is created for it, with a hash index on the table (a hash index, not a B+ tree index). If the result set is too large and exceeds the configured threshold, the temporary table switches to a disk-based storage engine, and a B+ tree index is used instead

The concept of materialization

The process by which MySQL saves the records of a subquery's result set into a temporary table is called materialization, and the temporary table holding the subquery's result set is called a materialized table.

Converting the materialized table into a join

Once the subquery's result set has been materialized, the query is in effect an inner join between the outer query's table and the materialized table. The query optimizer can then optimize it like any inner join: it evaluates the cost of the different join orders and picks the cheapest one.
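
Materialization can be pictured with a Python set, which deduplicates its elements and offers hash-based lookups, loosely mirroring the in-memory temporary table with a hash index. This is only an analogy under assumed data; `materialize`, `in_subquery` and the sample rows are hypothetical.

```python
# Hypothetical data: s1 rows are (id, key1); the subquery returns common_field values.
S1 = [(1, "x"), (2, "y"), (3, "z"), (4, "y")]
S2_COMMON_FIELD = ["y", "y", "w", "x", "y"]


def materialize(subquery_values):
    """Write the subquery result into a deduplicated, hash-indexed structure."""
    return set(subquery_values)


def in_subquery(outer_rows, materialized_table):
    """Evaluate key1 IN (materialized subquery result) with one hash lookup per row."""
    return [row for row in outer_rows if row[1] in materialized_table]


materialized = materialize(S2_COMMON_FIELD)
matched = in_subquery(S1, materialized)
print(sorted(materialized), matched)
```

Because the materialized values are unique, joining s1 against them cannot produce duplicate s1 rows, which is what makes treating the query as an inner join safe.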

Converting a subquery into a semi-join

Converting a subquery into a join as described above requires materializing its result set, so each query pays the cost of creating a temporary table. Can MySQL convert a subquery into a join without materialization?

SELECT * FROM s1
WHERE key1 IN (SELECT common_field FROM s2 WHERE key3 = 'a');

The statement above executes much like the following join query:

SELECT s1.* FROM s1 INNER JOIN s2
ON s1.key1 = s2.common_field
WHERE s2.key3 = 'a';

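But there is a difference: we cannot know in advance how many records in table s2 satisfy s1.key1 = s2.common_field for a given record of table s1:

Case 1: for a given record in s1, no record in s2 satisfies s1.key1 = s2.common_field, so the record is not added to the final result set.

Case 2: for a given record in s1, exactly one record in s2 satisfies s1.key1 = s2.common_field, so the record is added to the final result set once.

Case 3: for a given record in s1, at least two records in s2 satisfy s1.key1 = s2.common_field, so the record is added to the final result set multiple times.

For the subquery above, we only care whether s2 contains any record satisfying s1.key1 = s2.common_field for a given key1 value in s1, not how many records match. Because of case 3, the two SQL statements are not fully equivalent.

This is where MySQL introduces the semi-join: a semi-join of table s1 with table s2 means that, for a given record of s1, we only care whether a matching record exists in s2, not how many records match.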

SELECT s1.* FROM s1 SEMI JOIN s2
ON s1.key1 = s2.common_field
WHERE key3 = 'a';
is really equivalent to
SELECT * FROM s1
WHERE key1 IN (SELECT common_field FROM s2 WHERE key3 = 'a');

Note: the semi-join is only a way of executing subqueries inside MySQL; there is no SEMI JOIN syntax available to users.
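
The difference between an inner join and a semi-join in case 3 can be shown with a few lines of Python. The sample data and variable names are assumptions for illustration only.

```python
# Hypothetical data: s1 rows are (key1, id); s2 contributes common_field values
# that already satisfy key3 = 'a'. The value "a1" appears twice (case 3).
S1 = [("a1", 1), ("a2", 2), ("a3", 3)]
S2_COMMON_FIELD = ["a1", "a1", "a3"]

# Inner join: one output row per matching pair, so ("a1", 1) appears twice.
inner = [row for row in S1 for cf in S2_COMMON_FIELD if row[0] == cf]

# Semi-join: only the existence of a match matters, so each s1 row appears at most once.
semi = [row for row in S1 if any(row[0] == cf for cf in S2_COMMON_FIELD)]

print(inner)
print(semi)
```

The semi-join result is what the IN subquery returns; the inner join result would need its duplicates removed to match it.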

How semi-joins are implemented

  1. Table pullout (pulling the subquery's table up)
  • When the subquery's query list contains only a primary key or unique index column, the table in the subquery can be pulled up into the outer query, and the subquery's search conditions are merged into the outer query's search conditions

     SELECT * FROM s1
    WHERE key2 IN (SELECT key2 FROM s2 WHERE key3 = 'a'); //key2 is a unique index column of s2
    is equivalent to
    SELECT s1.* FROM s1 INNER JOIN s2
    ON s1.key2 = s2.key2
    WHERE s2.key3 = 'a';
    
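  2. DuplicateWeedout execution strategy (duplicate elimination)
For the query
SELECT * FROM s1
WHERE key1 IN (SELECT common_field FROM s2 WHERE key3 = 'a');
#After conversion to a semi-join, a record in s1 may have several matching records in s2, so it could be added to the final result set more than once. To eliminate duplicates, a temporary table can be created:
CREATE TABLE tmp (
id INT PRIMARY KEY
);

When a record of s1 is about to be added to the result set, its id value is first inserted into this temporary table. If the insert succeeds, the record has not been added to the final result set before and is added now. If the insert fails, the record is already in the final result set and is simply discarded. Using a temporary table to eliminate duplicates from a semi-join result set in this way is called DuplicateWeedout.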
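
The temporary-table check can be sketched in Python, with a set standing in for the tmp table and its primary key. This is an illustration under assumed data, not MySQL's implementation; `duplicate_weedout` and the sample rows are hypothetical.

```python
# Hypothetical data: s1 rows are (id, key1); s2 contributes common_field values
# that already satisfy key3 = 'a'.
S1 = [(1, "x"), (2, "y"), (3, "z")]
S2_COMMON_FIELD = ["y", "y", "x"]


def duplicate_weedout(s1_rows, s2_values):
    """Join s1 with s2, using a set of ids to weed out duplicate s1 rows."""
    tmp_ids = set()  # stands in for the temporary table with PRIMARY KEY id
    result = []
    discarded = 0
    for row_id, key1 in s1_rows:
        for common_field in s2_values:
            if key1 != common_field:
                continue
            if row_id in tmp_ids:
                # "Insert fails": this s1 row is already in the result set.
                discarded += 1
            else:
                # "Insert succeeds": first time this s1 row is seen.
                tmp_ids.add(row_id)
                result.append((row_id, key1))
    return result, discarded


rows, discarded = duplicate_weedout(S1, S2_COMMON_FIELD)
print(rows, discarded)
```

The s1 row with id 2 matches two s2 values; the second match is rejected by the id check, so the row appears only once.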

  3. LooseScan execution strategy (loose index scan)
 SELECT * FROM s1
WHERE key3 IN (SELECT key1 FROM s2 WHERE key1 > 'a' AND key1 < 'b');

After conversion to a semi-join, the subquery's table s2 is used as the driving table, and execution proceeds as follows:

  • Suppose the secondary index idx_key1 of the driving table contains three records with the value 'aa'. Only the first of them is used to look up the records in s1 with s1.key3 = 'aa'. The same goes for every other run of identical index values: only the first record of each run is matched against s1. This approach is called a loose index scan
  4. Semi-join Materialization execution strategy

    This is the approach described above: materialize the subquery's result set, then join the materialized table with the outer table

  5. FirstMatch execution strategy (first match)

FirstMatch is the most primitive way to execute a semi-join: fetch a record from the outer query, then look in the subquery's table for a record that satisfies the matching condition. As soon as one is found, add the outer record to the final result set and stop looking for further matches; if none is found, discard the outer record.
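
The LooseScan and FirstMatch strategies can be compared with a small Python sketch for the query `s1.key3 IN (SELECT key1 FROM s2 WHERE key1 > 'a' AND key1 < 'b')`. Everything here is an assumption made for illustration: the sorted list `S2_KEY1_INDEX` stands in for the idx_key1 secondary index, and `loose_scan` and `first_match` are hypothetical helpers.

```python
from itertools import groupby

# Hypothetical data: s1 rows are (id, key3); the sorted list mimics idx_key1 on s2.
S1 = [(1, "aa"), (2, "ab"), (3, "zz"), (4, "aa")]
S2_KEY1_INDEX = ["aa", "aa", "aa", "ab", "ab", "b1"]


def loose_scan(s2_key1_index, s1_rows):
    """s2 drives: use only the first entry of each run of equal index values."""
    result = []
    for key1, _run in groupby(s2_key1_index):
        if "a" < key1 < "b":
            result.extend(row for row in s1_rows if row[1] == key1)
    return result


def first_match(s1_rows, s2_key1_index):
    """s1 drives: stop probing s2 as soon as one match is found."""
    result = []
    for row in s1_rows:
        for key1 in s2_key1_index:
            if "a" < key1 < "b" and row[1] == key1:
                result.append(row)
                break  # first match found, no need to look further
    return result


loose = loose_scan(S2_KEY1_INDEX, S1)
first = first_match(S1, S2_KEY1_INDEX)
print(loose)
print(first)
```

Both strategies return each qualifying s1 row once, although 'aa' occurs three times in s2. They differ in which table drives the loop, which is why the rows come back in a different order.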

For correlated IN subqueries

  • These can also be converted to semi-joins. The strategies described above, table pullout (provided the subquery's query list contains only a primary key or unique secondary index column), DuplicateWeedout, LooseScan and FirstMatch, can all be used
  • Only semi-join materialization cannot, because a correlated subquery is not an independent query and so cannot be materialized on its own

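Conditions under which a semi-join applies

  1. The subquery must form a Boolean expression with IN, and must appear in the WHERE or ON clause of the outer query
  2. The outer query may have other search conditions, but they must be connected to the IN subquery's condition with AND.
  3. The subquery must be a single query, not several queries combined with UNION.
  4. The subquery must not contain GROUP BY, HAVING or aggregate functions.

Cases in which a semi-join does not apply

  1. Other search conditions in the outer query's WHERE clause are connected to the IN subquery's Boolean expression with OR
  2. NOT IN is used instead of IN
  3. The IN subquery appears in the SELECT clause
  4. The subquery contains GROUP BY, HAVING or aggregate functions
  5. The subquery contains UNION

If an IN subquery does not meet the conditions for conversion to a semi-join, the query optimizer chooses between two strategies: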

  1. Materialize the subquery first, then execute the query

     SELECT * FROM s1
    WHERE key1 NOT IN (SELECT common_field FROM s2 WHERE key3 = 'a')
    

    A semi-join cannot be used here, but materialization still improves execution efficiency considerably

  2. Perform the IN to EXISTS conversion

outer_expr IN (SELECT inner_expr FROM ... WHERE subquery_where)
can be converted to
EXISTS (SELECT inner_expr FROM ... WHERE subquery_where AND outer_expr=inner_expr)
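
The IN to EXISTS conversion can be mirrored in Python to show that both forms give the same answer. This sketch assumes that neither outer_expr nor inner_expr is NULL, since NULL handling is where the two SQL forms can differ; the sample rows and the functions `in_form` and `exists_form` are hypothetical.

```python
# Hypothetical s2 rows; common_field plays the role of inner_expr and
# key3 = 'a' the role of subquery_where.
S2 = [
    {"common_field": "x", "key3": "a"},
    {"common_field": "y", "key3": "b"},
    {"common_field": "x", "key3": "a"},
]


def in_form(outer_expr, rows):
    """outer_expr IN (SELECT inner_expr FROM ... WHERE subquery_where)."""
    subquery_result = [r["common_field"] for r in rows if r["key3"] == "a"]
    return outer_expr in subquery_result


def exists_form(outer_expr, rows):
    """EXISTS (SELECT ... WHERE subquery_where AND outer_expr = inner_expr)."""
    return any(r["key3"] == "a" and r["common_field"] == outer_expr for r in rows)


checks = {value: (in_form(value, S2), exists_form(value, S2)) for value in ["x", "y", "z"]}
print(checks)
```

In the EXISTS form the equality test sits inside the subquery's own condition, so the subquery is evaluated per outer value and can stop at the first matching row instead of building the whole result list first.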

3.3.3 ANY/ALL subquery optimization

[Figure: ANY/ALL subquery conversions (image missing)]

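3.3.4 Execution of [NOT] EXISTS subqueries

If a [NOT] EXISTS subquery is uncorrelated, the subquery can be executed first to determine whether its result is TRUE or FALSE, and the original query is then rewritten.

SELECT * FROM s1
WHERE EXISTS (SELECT 1 FROM s2 WHERE key1 = 'a')
OR key2 > 100;

Because the subquery in this statement is uncorrelated, the optimizer executes it first. Assuming the EXISTS subquery evaluates to TRUE, the optimizer rewrites the query as:

SELECT * FROM s1
WHERE TRUE OR key2 > 100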

3.3.5 Optimizing derived tables

Derived table: when a subquery is placed after the FROM clause of the outer query, the result of the subquery acts as a derived table.

SELECT * FROM (
SELECT id AS d_id, key3 AS d_key3 FROM s2 WHERE key1 = 'a'
) AS derived_s1 WHERE d_key3 = 'a';

MySQL has two strategies for optimizing derived tables:

  1. Materialize the derived table

    This uses delayed materialization: the derived table is materialized only when the query actually needs it, not before the query starts executing.

  2. Merge the derived table with the outer table, i.e. rewrite the query so that it contains no derived table. In the example above, the outer condition d_key3 = 'a' would then be applied directly to s2 as key3 = 'a'

Copyright: author [Welcome Big Brother to Little Brother Blog]. Please include the original link when reprinting, thank you. https://en.javamana.com/2022/266/202209230624511144.html