Cassandra 文档

版本

您正在查看预发布版本的文档。

使用 SAI 查询

SAI 快速入门 仅侧重于定义基于非主键列的多个索引(一个非常有用的功能)。让我们通过一些示例来探索其他选项,这些示例演示了如何在具有不同定义的 SAI 索引的表上运行查询。

SAI 仅支持 SELECT 查询,不支持 UPDATEDELETE 查询。

此示例使用以下表和索引

CREATE TABLE IF NOT EXISTS cycling.comments_vs (
  record_id timeuuid,
  id uuid,
  commenter text,
  comment text,
  comment_vector VECTOR <FLOAT, 5>,
  created_at timestamp,
  PRIMARY KEY (id, created_at)
)
WITH CLUSTERING ORDER BY (created_at DESC);
CREATE INDEX IF NOT EXISTS ann_index
  ON cycling.comments_vs(comment_vector) USING 'sai';

使用 CQL 查询向量数据

要使用向量搜索查询数据,请使用 SELECT 查询

  • CQL

  • 结果

SELECT * FROM cycling.comments_vs
    ORDER BY comment_vector ANN OF [0.15, 0.1, 0.1, 0.35, 0.55]
    LIMIT 3;
 id                                   | created_at                      | comment                                | comment_vector               | commenter | record_id
--------------------------------------+---------------------------------+----------------------------------------+------------------------------+-----------+--------------------------------------
 e7ae5cf3-d358-4d99-b900-85902fda9bb0 | 2017-04-01 14:33:02.160000+0000 | LATE RIDERS SHOULD NOT DELAY THE START | [0.9, 0.54, 0.12, 0.1, 0.95] |      Alex | 616e77e0-22a2-11ee-b99d-1f350647414a
 c7fceba0-c141-4207-9494-a29f9809de6f | 2017-02-17 08:43:20.234000+0000 |      Glad you ran the race in the rain | [0.3, 0.34, 0.2, 0.78, 0.25] |       Amy | 6170c1d0-22a2-11ee-b99d-1f350647414a
 c7fceba0-c141-4207-9494-a29f9809de6f | 2017-04-01 13:43:08.030000+0000 |                Last climb was a killer |   [0.3, 0.75, 0.2, 0.2, 0.5] |       Amy | 62105d30-22a2-11ee-b99d-1f350647414a

限制必须为 1,000 或更少。

向右滚动结果显示与用于查询的嵌入最匹配的表中的评论。

单个索引匹配列

此示例使用以下表和索引

CREATE TABLE IF NOT EXISTS cycling.comments_vs (
  record_id timeuuid,
  id uuid,
  commenter text,
  comment text,
  comment_vector VECTOR <FLOAT, 5>,
  created_at timestamp,
  PRIMARY KEY (id, created_at)
)
WITH CLUSTERING ORDER BY (created_at DESC);
CREATE INDEX commenter_idx
    ON cycling.comments_vs (commenter)
    USING 'sai';
CREATE INDEX created_at_idx
    ON cycling.comments_vs (created_at)
    USING 'sai';
CREATE INDEX ann_index
    ON cycling.comments_vs (comment_vector)
    USING 'sai';

commenter 不是此表中的分区键,因此需要索引才能对其进行查询。

查询该列上的匹配项

  • 查询

  • 结果

    SELECT * FROM cycling.comments_vs
        WHERE commenter = 'Alex';
 id                                   | created_at                      | comment                                | comment_vector               | commenter | record_id
--------------------------------------+---------------------------------+----------------------------------------+------------------------------+-----------+--------------------------------------
 e7ae5cf3-d358-4d99-b900-85902fda9bb0 | 2017-04-01 14:33:02.160000+0000 | LATE RIDERS SHOULD NOT DELAY THE START | [0.9, 0.54, 0.12, 0.1, 0.95] |      Alex | 6d0cdaa0-272b-11ee-859f-b9098002fcac
 e7ae5cf3-d358-4d99-b900-85902fda9bb0 | 2017-03-21 21:11:09.999000+0000 |      Second rest stop was out of water | [0.99, 0.5, 0.99, 0.1, 0.34] |      Alex | 6d0b7b10-272b-11ee-859f-b9098002fcac
索引失败

请注意,如果您尝试在创建索引之前运行此查询,将会发生错误

  • 查询

  • 结果

    SELECT * FROM cycling.comments_vs
        WHERE commenter = 'Alex';
InvalidRequest: Error from server: code=2200
[Invalid query] message="Cannot execute this query as it might involve data filtering and thus may have unpredictable performance.
If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING"

单个索引匹配列(带选项)

此示例使用以下表和索引

CREATE TABLE IF NOT EXISTS cycling.comments_vs (
  record_id timeuuid,
  id uuid,
  commenter text,
  comment text,
  comment_vector VECTOR <FLOAT, 5>,
  created_at timestamp,
  PRIMARY KEY (id, created_at)
)
WITH CLUSTERING ORDER BY (created_at DESC);
CREATE INDEX commenter_cs_idx ON cycling.comments_vs (commenter)
USING 'sai'
WITH OPTIONS = {'case_sensitive': 'true', 'normalize': 'true', 'ascii': 'true'};

区分大小写

commenter 不是此表中的分区键,因此需要索引才能对其进行查询。如果我们想将 commenter 作为区分大小写的值进行检查,我们可以将 case_sensitive 选项设置为 true

请注意,如果您在查询中使用不合适的区分大小写的值,则不会返回任何结果

  • 查询

  • 结果

SELECT * FROM comments_vs WHERE commenter ='alex';
 id | created_at | comment | comment_vector | commenter | record_id
----+------------+---------+----------------+-----------+-----------

(0 rows)

当我们将骑车人姓名的案例更改为与索引中的案例匹配时,查询成功

  • 查询

  • 结果

SELECT comment,commenter FROM comments_vs WHERE commenter ='Alex';
 comment                                | commenter
----------------------------------------+-----------
 LATE RIDERS SHOULD NOT DELAY THE START |      Alex
      Second rest stop was out of water |      Alex
(2 rows)

索引匹配复合分区键列

此示例使用以下表和索引

CREATE TABLE IF NOT EXISTS cycling.rank_by_year_and_name (
  race_year int,
  race_name text,
  cyclist_name text,
  rank int,
  PRIMARY KEY ((race_year, race_name), rank)
);
CREATE INDEX race_name_idx
    ON cycling.rank_by_year_and_name (race_name)
    USING 'sai';
CREATE INDEX race_year_idx
    ON cycling.rank_by_year_and_name (race_year)
    USING 'sai';

复合分区键具有由表中多个列定义的分区。通常,您需要在 WHERE 子句中指定分区键中的所有列才能查询表。但是,SAI 索引使您可以使用表中复合分区键中的单个列定义索引。如果需要根据单个列进行查询,可以在复合分区键中的每个列上创建 SAI 索引。

SAI 索引还允许您在不使用效率低下的 ALLOW FILTERING 指令的情况下查询表。ALLOW FILTERING 指令需要扫描表中的所有分区,导致性能低下。

race_yearrace_name 列构成了 cycling.rank_by_year_and_name 表的复合分区键。

查询 race_name 列上的匹配项

  • 查询

  • 结果

    SELECT * FROM cycling.rank_by_year_and_name
        WHERE race_name = 'Tour of Japan - Stage 4 - Minami > Shinshu';
 race_year | race_name                                  | rank | cyclist_name
-----------+--------------------------------------------+------+----------------------
      2014 | Tour of Japan - Stage 4 - Minami > Shinshu |    1 |        Daniel MARTIN
      2014 | Tour of Japan - Stage 4 - Minami > Shinshu |    2 | Johan Esteban CHAVES
      2014 | Tour of Japan - Stage 4 - Minami > Shinshu |    3 |      Benjamin PRADES
      2015 | Tour of Japan - Stage 4 - Minami > Shinshu |    1 |      Benjamin PRADES
      2015 | Tour of Japan - Stage 4 - Minami > Shinshu |    2 |          Adam PHELAN
      2015 | Tour of Japan - Stage 4 - Minami > Shinshu |    3 |         Thomas LEBAS

查询 race_year 列上的匹配项

  • 查询

  • 结果

    SELECT * FROM cycling.rank_by_year_and_name
        WHERE race_year = 2014;
 race_year | race_name                                  | rank | cyclist_name
-----------+--------------------------------------------+------+----------------------
      2014 |                        4th Tour of Beijing |    1 |    Phillippe GILBERT
      2014 |                        4th Tour of Beijing |    2 |        Daniel MARTIN
      2014 |                        4th Tour of Beijing |    3 | Johan Esteban CHAVES
      2014 | Tour of Japan - Stage 4 - Minami > Shinshu |    1 |        Daniel MARTIN
      2014 | Tour of Japan - Stage 4 - Minami > Shinshu |    2 | Johan Esteban CHAVES
      2014 | Tour of Japan - Stage 4 - Minami > Shinshu |    3 |      Benjamin PRADES

使用 AND 匹配多个索引

此示例使用以下表和索引

CREATE TABLE IF NOT EXISTS cycling.comments_vs (
  record_id timeuuid,
  id uuid,
  commenter text,
  comment text,
  comment_vector VECTOR <FLOAT, 5>,
  created_at timestamp,
  PRIMARY KEY (id, created_at)
)
WITH CLUSTERING ORDER BY (created_at DESC);
CREATE INDEX commenter_idx
    ON cycling.comments_vs (commenter)
    USING 'sai';
CREATE INDEX created_at_idx
    ON cycling.comments_vs (created_at)
    USING 'sai';
CREATE INDEX ann_index
    ON cycling.comments_vs (comment_vector)
    USING 'sai';

为表创建了多个索引,以演示如何查询多个列上的匹配项。

查询多个列上的匹配项,并且所有列都必须匹配

  • 查询

  • 结果

SELECT * FROM cycling.comments_vs
    WHERE
    created_at='2017-03-21 21:11:09.999000+0000'
    AND commenter = 'Alex';
 id                                   | created_at                      | comment                           | comment_vector               | commenter | record_id
--------------------------------------+---------------------------------+-----------------------------------+------------------------------+-----------+--------------------------------------
 e7ae5cf3-d358-4d99-b900-85902fda9bb0 | 2017-03-21 21:11:09.999000+0000 | Second rest stop was out of water | [0.99, 0.5, 0.99, 0.1, 0.34] |      Alex | 6d0b7b10-272b-11ee-859f-b9098002fcac

使用 OR 匹配多个索引

此示例使用以下表和索引

CREATE TABLE IF NOT EXISTS cycling.comments_vs (
  record_id timeuuid,
  id uuid,
  commenter text,
  comment text,
  comment_vector VECTOR <FLOAT, 5>,
  created_at timestamp,
  PRIMARY KEY (id, created_at)
)
WITH CLUSTERING ORDER BY (created_at DESC);
CREATE INDEX commenter_idx
    ON cycling.comments_vs (commenter)
    USING 'sai';
CREATE INDEX created_at_idx
    ON cycling.comments_vs (created_at)
    USING 'sai';
CREATE INDEX ann_index
    ON cycling.comments_vs (comment_vector)
    USING 'sai';

为表创建了多个索引,以演示如何查询多个列上的匹配项。

查询任一列上的匹配项

  • 查询

  • 结果

SELECT * FROM cycling.comments_vs
    WHERE
    created_at='2017-03-21 21:11:09.999000+0000'
    OR created_at='2017-03-22 01:16:59.001000+0000';
 id                                   | created_at                      | comment                           | comment_vector               | commenter | record_id
--------------------------------------+---------------------------------+-----------------------------------+------------------------------+-----------+--------------------------------------
 e7ae5cf3-d358-4d99-b900-85902fda9bb0 | 2017-03-21 21:11:09.999000+0000 | Second rest stop was out of water | [0.99, 0.5, 0.99, 0.1, 0.34] |      Alex | 6d0b7b10-272b-11ee-859f-b9098002fcac
 c7fceba0-c141-4207-9494-a29f9809de6f | 2017-03-22 01:16:59.001000+0000 |     Great snacks at all reststops |  [0.1, 0.4, 0.1, 0.52, 0.09] |       Amy | 6d0fc0d0-272b-11ee-859f-b9098002fcac

使用 IN 匹配多个索引

此示例使用以下表和索引

CREATE TABLE IF NOT EXISTS cycling.comments_vs (
  record_id timeuuid,
  id uuid,
  commenter text,
  comment text,
  comment_vector VECTOR <FLOAT, 5>,
  created_at timestamp,
  PRIMARY KEY (id, created_at)
)
WITH CLUSTERING ORDER BY (created_at DESC);
CREATE INDEX commenter_idx
    ON cycling.comments_vs (commenter)
    USING 'sai';
CREATE INDEX created_at_idx
    ON cycling.comments_vs (created_at)
    USING 'sai';
CREATE INDEX ann_index
    ON cycling.comments_vs (comment_vector)
    USING 'sai';

为表创建了多个索引,以演示如何查询多个列上的匹配项。

查询值列表中列值的匹配项

  • 查询

  • 结果

SELECT * FROM cycling.comments_vs
    WHERE created_at IN
    ('2017-03-21 21:11:09.999000+0000'
    ,'2017-03-22 01:16:59.001000+0000');
 id                                   | created_at                      | comment                           | comment_vector               | commenter | record_id
--------------------------------------+---------------------------------+-----------------------------------+------------------------------+-----------+--------------------------------------
 e7ae5cf3-d358-4d99-b900-85902fda9bb0 | 2017-03-21 21:11:09.999000+0000 | Second rest stop was out of water | [0.99, 0.5, 0.99, 0.1, 0.34] |      Alex | 6d0b7b10-272b-11ee-859f-b9098002fcac
 c7fceba0-c141-4207-9494-a29f9809de6f | 2017-03-22 01:16:59.001000+0000 |     Great snacks at all reststops |  [0.1, 0.4, 0.1, 0.52, 0.09] |       Amy | 6d0fc0d0-272b-11ee-859f-b9098002fcac

用户定义类型

SAI 可以索引用户定义类型 (UDT) 或 UDT 列表。此示例演示了如何索引 UDT 列表。

此示例使用以下用户定义类型 (UDT)、表和索引

CREATE TYPE IF NOT EXISTS cycling.race (
  race_title text,
  race_date timestamp,
  race_time text
);
CREATE TABLE IF NOT EXISTS cycling.cyclist_races (
  id UUID PRIMARY KEY,
  lastname text,
  firstname text,
  races list<FROZEN <race>>
);
CREATE INDEX races_idx
    ON cycling.cyclist_races (races)
    USING 'sai';

cycling.cyclist_races 表的 UDT 列表列 races 上创建索引。

使用 races 列列表中的 CONTAINS 进行查询

  • CQL

  • 结果

SELECT * FROM cycling.cyclist_races
    WHERE races CONTAINS {
       race_title:'Rabobank 7-Dorpenomloop Aalburg',
       race_date:'2015-05-09',
       race_time:'02:58:33'};
 id                                   | firstname | lastname | races
--------------------------------------+-----------+----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 5b6962dd-3f90-4c93-8f61-eabfa4a803e2 |  Marianne |      VOS | [{race_title: 'Rabobank 7-Dorpenomloop Aalburg', race_date: '2015-05-09 00:00:00.000000+0000', race_time: '02:58:33'}, {race_title: 'Ronde van Gelderland', race_date: '2015-04-19 00:00:00.000000+0000', race_time: '03:22:23'}]

(1 rows)

使用集合进行 SAI 索引

SAI 支持 maplistset 类型的集合。集合允许您将数据分组并存储在同一列中。

在关系数据库中,例如用户的多封电子邮件地址等分组是通过 (例如) user 表和 email 表之间的多对一连接关系实现的。Apache Cassandra 通过将用户的电子邮件地址存储在 user 表中的集合列中来避免两个表之间的连接。每个集合都指定了所保存数据的类型。

如果要存储的集合数据有限,则集合是合适的。如果数据具有无限的增长潜力,例如每秒发送的消息或注册的传感器事件,请不要使用集合。相反,请使用具有复合主键的表,其中数据存储在聚簇列中。

在具有 SAI 索引的数据库表的 CQL 查询中,CONTAINS 子句受支持,并且特定于

  • SAI 集合映射,带有 keysvaluesentries

  • SAI 集合,带有 listset 类型

使用 set 类型

此示例使用以下表和索引

CREATE TABLE IF NOT EXISTS cycling.cyclist_career_teams (
  id UUID PRIMARY KEY,
  lastname text,
  teams set<text>
);
CREATE INDEX teams_idx
    ON cycling.cyclist_career_teams (teams)
    USING 'sai';

cyclist_career_teams 表的 set 列 teams 上创建索引。

使用 teams 列中的 CONTAINS 进行查询

  • CQL

  • 结果

SELECT * FROM cycling.cyclist_career_teams
   WHERE teams CONTAINS 'Rabobank-Liv Giant';
 id                                   | lastname | teams
--------------------------------------+----------+------------------------------------------------------------------------------------------------------
 5b6962dd-3f90-4c93-8f61-eabfa4a803e2 |      VOS | {'Nederland bloeit', 'Rabobank Women Team', 'Rabobank-Liv Giant', 'Rabobank-Liv Woman Cycling Team'}

使用 list 类型

此示例使用以下表和索引

CREATE TABLE IF NOT EXISTS cycling.upcoming_calendar (
  year int,
  month int,
  events list<text>,
  PRIMARY KEY (year, month)
);
CREATE INDEX events_idx
    ON cycling.upcoming_calendar (events)
    USING 'sai';

upcoming_calendar 表的 list 列 events 上创建索引。

使用 events 列中的 CONTAINS 进行查询

  • CQL

  • 结果

SELECT * FROM cycling.upcoming_calendar
   WHERE events CONTAINS 'Criterium du Dauphine';
 year | month | events
------+-------+-----------------------------------------------
 2015 |     6 | ['Criterium du Dauphine', 'Tour de Sui\nsse']

一个稍微复杂的查询选择包含特定事件或具有特定月份日期的行

  • CQL

  • 结果

SELECT * FROM cycling.upcoming_calendar
    WHERE events CONTAINS 'Criterium du Dauphine'
          OR month = 7;
 year | month | events
------+-------+-----------------------------------------------
 2015 |     6 | ['Criterium du Dauphine', 'Tour de Sui\nsse']
 2015 |     7 |                            ['Tour de France']

使用 map 类型

此示例使用以下表和索引

CREATE TABLE IF NOT EXISTS cycling.cyclist_teams (
  id uuid PRIMARY KEY,
  firstname text,
  lastname text,
  teams map<int, text>
);
CREATE INDEX IF NOT EXISTS team_year_keys_idx
ON cycling.cyclist_teams ( KEYS (teams) );
CREATE INDEX IF NOT EXISTS team_year_entries_idx
ON cycling.cyclist_teams ( ENTRIES (teams) );
CREATE INDEX IF NOT EXISTS team_year_values_idx
ON cycling.cyclist_teams ( VALUES (teams) );

cyclist_career_teams 表的 map 列 teams 上创建的索引针对列数据的键、值和完整条目。

使用 teams 列中的 KEYS 进行查询

  • CQL

  • 结果

SELECT * FROM cyclist_teams WHERE teams CONTAINS KEY 2014;
 id                                   | firstname | lastname   | teams
--------------------------------------+-----------+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 cb07baad-eac8-4f65-b28a-bddc06a0de23 | Elizabeth | ARMITSTEAD | {2011: 'Team Garmin - Cervelo', 2012: 'AA Drink - Leontien.nl', 2013: 'Boels:Dolmans Cycling Team', 2014: 'Boels:Dolmans Cycling Team', 2015: 'Boels:Dolmans Cycling Team'}
 5b6962dd-3f90-4c93-8f61-eabfa4a803e2 |  Marianne |        VOS |                                                                                          {2014: 'Rabobank-Liv Woman Cycling Team', 2015: 'Rabobank-Liv Woman Cycling Team'}

从映射 teams 列中查询一个值,注意只包含关键字 CONTAINS

  • CQL

  • 结果

SELECT * FROM cyclist_teams WHERE teams CONTAINS 'Team Garmin - Cervelo';
 id                                   | firstname | lastname   | teams
--------------------------------------+-----------+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 cb07baad-eac8-4f65-b28a-bddc06a0de23 | Elizabeth | ARMITSTEAD | {2011: 'Team Garmin - Cervelo', 2012: 'AA Drink - Leontien.nl', 2013: 'Boels:Dolmans Cycling Team', 2014: 'Boels:Dolmans Cycling Team', 2015: 'Boels:Dolmans Cycling Team'}

从映射 teams 列中查询条目,注意 WHERE 子句中的差异

  • CQL

  • 结果

SELECT * FROM cyclist_teams
WHERE
    teams[2014] = 'Boels:Dolmans Cycling Team'
    AND teams[2015] = 'Boels:Dolmans Cycling Team';
 id                                   | firstname | lastname   | teams
--------------------------------------+-----------+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 cb07baad-eac8-4f65-b28a-bddc06a0de23 | Elizabeth | ARMITSTEAD | {2011: 'Team Garmin - Cervelo', 2012: 'AA Drink - Leontien.nl', 2013: 'Boels:Dolmans Cycling Team', 2014: 'Boels:Dolmans Cycling Team', 2015: 'Boels:Dolmans Cycling Team'}

此示例查找映射 teams 列中存在两个条目的行。

有关更多信息,请参见