使用存储附加索引 (SAI)
创建 SAI 索引
要创建 SAI 索引,您必须定义索引名称、表名称和要索引的列的列名称。
要创建简单的 SAI 索引
CREATE INDEX lastname_sai_idx ON cycling.cyclist_semi_pro (lastname)
USING 'sai'
WITH OPTIONS = {'case_sensitive': 'false', 'normalize': 'true', 'ascii': 'true'};
CREATE INDEX age_sai_idx ON cycling.cyclist_semi_pro (age)
USING 'sai';
CREATE INDEX country_sai_idx ON cycling.cyclist_semi_pro (country)
USING 'sai'
WITH OPTIONS = {'case_sensitive': 'false', 'normalize': 'true', 'ascii': 'true'};
CREATE INDEX registration_sai_idx ON cycling.cyclist_semi_pro (registration)
USING 'sai';
对于大多数 SAI 索引,列名称在 CREATE INDEX
语句中定义,该语句还使用 USING 'sai'
。SAI 索引选项在 WITH OPTIONS
子句中定义。case_sensitive
选项设置为 false
以允许不区分大小写的搜索。normalize
选项设置为 true
以允许对 Unicode 字符进行规范化搜索。ascii_only
选项设置为 true
以允许搜索仅限于 ASCII 字符。
map
集合数据类型是一个例外,如 下面的示例 所示。
SAI 索引中的 map
集合
映射集合与其他 SAI 索引具有不同的格式
// Create an index on a map key to find all cyclist/team combos for a year
// tag::keysidx[]
CREATE INDEX IF NOT EXISTS team_year_keys_idx
ON cycling.cyclist_teams ( KEYS (teams) );
// end::keysidx[]
// Create an index on a map key to find all cyclist/team combos for a year
// tag::valuesidx[]
CREATE INDEX IF NOT EXISTS team_year_values_idx
ON cycling.cyclist_teams ( VALUES (teams) );
// end::valuesidx[]
// Create an index on a map key to find all cyclist/team combos for a year
// tag::entriesidx[]
CREATE INDEX IF NOT EXISTS team_year_entries_idx
ON cycling.cyclist_teams ( ENTRIES (teams) );
// end::entriesidx[]
向量搜索的相似度函数
此示例使用以下表
CREATE TABLE IF NOT EXISTS cycling.comments_vs (
record_id timeuuid,
id uuid,
commenter text,
comment text,
comment_vector VECTOR <FLOAT, 5>,
created_at timestamp,
PRIMARY KEY (id, created_at)
)
WITH CLUSTERING ORDER BY (created_at DESC);
要检查 comment_vector
是否设置了特定相似度函数,请使用 similarity-function
选项设置为支持的相似度函数之一:DOT_PRODUCT、COSINE 或 EUCLIDEAN。默认相似度函数为 COSINE。
此索引在 comment_vector
列上创建索引,并将相似度函数设置为 DOT_PRODUCT
CREATE INDEX sim_comments_idx
ON cycling.comments_vs (comment_vector)
USING 'sai'
WITH OPTIONS = { 'similarity_function': 'DOT_PRODUCT'};
其他资源
有关创建 SAI 索引的更多信息,请参阅 CREATE CUSTOM INDEX。
更改 SAI 索引
无法更改 SAI 索引。如果您需要修改 SAI 索引,则需要删除当前索引,创建新索引并重建循环。
-
删除索引
DROP INDEX IF EXISTS cycling.lastname_sai_idx;
-
创建新索引
CREATE INDEX lastname_sai_idx ON cycling.cyclist_semi_pro (lastname) USING 'sai' WITH OPTIONS = {'case_sensitive': 'false', 'normalize': 'true', 'ascii': 'true'}; CREATE INDEX age_sai_idx ON cycling.cyclist_semi_pro (age) USING 'sai'; CREATE INDEX country_sai_idx ON cycling.cyclist_semi_pro (country) USING 'sai' WITH OPTIONS = {'case_sensitive': 'false', 'normalize': 'true', 'ascii': 'true'}; CREATE INDEX registration_sai_idx ON cycling.cyclist_semi_pro (registration) USING 'sai';
使用 SAI 查询
SAI 快速入门 仅侧重于定义基于非主键列的多个索引(一个非常有用的功能)。让我们探索其他选项,并使用一些示例来演示如何在具有不同定义的 SAI 索引的表上运行查询。
SAI 仅支持 |
向量搜索
此示例使用以下表和索引
CREATE TABLE IF NOT EXISTS cycling.comments_vs (
record_id timeuuid,
id uuid,
commenter text,
comment text,
comment_vector VECTOR <FLOAT, 5>,
created_at timestamp,
PRIMARY KEY (id, created_at)
)
WITH CLUSTERING ORDER BY (created_at DESC);
CREATE INDEX IF NOT EXISTS ann_index
ON cycling.comments_vs(comment_vector) USING 'sai';
使用 CQL 查询向量数据
要使用向量搜索查询数据,请使用 SELECT
查询
SELECT * FROM cycling.comments_vs
ORDER BY comment_vector ANN OF [0.15, 0.1, 0.1, 0.35, 0.55]
LIMIT 3;
id | created_at | comment | comment_vector | commenter | record_id
--------------------------------------+---------------------------------+----------------------------------------+------------------------------+-----------+--------------------------------------
e7ae5cf3-d358-4d99-b900-85902fda9bb0 | 2017-04-01 14:33:02.160000+0000 | LATE RIDERS SHOULD NOT DELAY THE START | [0.9, 0.54, 0.12, 0.1, 0.95] | Alex | 616e77e0-22a2-11ee-b99d-1f350647414a
c7fceba0-c141-4207-9494-a29f9809de6f | 2017-02-17 08:43:20.234000+0000 | Glad you ran the race in the rain | [0.3, 0.34, 0.2, 0.78, 0.25] | Amy | 6170c1d0-22a2-11ee-b99d-1f350647414a
c7fceba0-c141-4207-9494-a29f9809de6f | 2017-04-01 13:43:08.030000+0000 | Last climb was a killer | [0.3, 0.75, 0.2, 0.2, 0.5] | Amy | 62105d30-22a2-11ee-b99d-1f350647414a
限制必须为 1,000 或更少。 |
向右滚动结果将显示与用于查询的嵌入最匹配的表中的评论。
单个索引匹配列
此示例使用以下表和索引
CREATE TABLE IF NOT EXISTS cycling.comments_vs (
record_id timeuuid,
id uuid,
commenter text,
comment text,
comment_vector VECTOR <FLOAT, 5>,
created_at timestamp,
PRIMARY KEY (id, created_at)
)
WITH CLUSTERING ORDER BY (created_at DESC);
CREATE INDEX commenter_idx
ON cycling.comments_vs (commenter)
USING 'sai';
CREATE INDEX created_at_idx
ON cycling.comments_vs (created_at)
USING 'sai';
CREATE INDEX ann_index
ON cycling.comments_vs (comment_vector)
USING 'sai';
列 commenter
在此表中不是分区键,因此需要索引才能对其进行查询。
查询该列上的匹配项
SELECT * FROM cycling.comments_vs
WHERE commenter = 'Alex';
id | created_at | comment | comment_vector | commenter | record_id
--------------------------------------+---------------------------------+----------------------------------------+------------------------------+-----------+--------------------------------------
e7ae5cf3-d358-4d99-b900-85902fda9bb0 | 2017-04-01 14:33:02.160000+0000 | LATE RIDERS SHOULD NOT DELAY THE START | [0.9, 0.54, 0.12, 0.1, 0.95] | Alex | 6d0cdaa0-272b-11ee-859f-b9098002fcac
e7ae5cf3-d358-4d99-b900-85902fda9bb0 | 2017-03-21 21:11:09.999000+0000 | Second rest stop was out of water | [0.99, 0.5, 0.99, 0.1, 0.34] | Alex | 6d0b7b10-272b-11ee-859f-b9098002fcac
使用选项在列上进行单个索引匹配
此示例使用以下表和索引
CREATE TABLE IF NOT EXISTS cycling.comments_vs (
record_id timeuuid,
id uuid,
commenter text,
comment text,
comment_vector VECTOR <FLOAT, 5>,
created_at timestamp,
PRIMARY KEY (id, created_at)
)
WITH CLUSTERING ORDER BY (created_at DESC);
CREATE INDEX commenter_cs_idx ON cycling.comments_vs (commenter)
USING 'sai'
WITH OPTIONS = {'case_sensitive': 'true', 'normalize': 'true', 'ascii': 'true'};
区分大小写
列 commenter
在此表中不是分区键,因此需要索引才能对其进行查询。如果我们想将 commenter
作为区分大小写的值进行检查,我们可以使用 case_sensitive
选项设置为 true
。
请注意,如果您在查询中使用不区分大小写的值,则不会返回任何结果
SELECT * FROM comments_vs WHERE commenter ='alex';
id | created_at | comment | comment_vector | commenter | record_id
----+------------+---------+----------------+-----------+-----------
(0 rows)
当我们将骑车人姓名的案例更改为与索引中的案例匹配时,查询成功
在复合分区键列上进行索引匹配
此示例使用以下表和索引
CREATE TABLE IF NOT EXISTS cycling.rank_by_year_and_name (
race_year int,
race_name text,
cyclist_name text,
rank int,
PRIMARY KEY ((race_year, race_name), rank)
);
CREATE INDEX race_name_idx
ON cycling.rank_by_year_and_name (race_name)
USING 'sai';
CREATE INDEX race_year_idx
ON cycling.rank_by_year_and_name (race_year)
USING 'sai';
复合分区键具有由表中多个列定义的分区。通常,您需要在 WHERE
子句中指定分区键中的所有列才能查询表。但是,SAI 索引使您能够使用表中复合分区键中的单个列定义索引。如果您需要根据单个列进行查询,则可以在复合分区键中的每个列上创建 SAI 索引。
SAI 索引还允许您在不使用效率低下的 ALLOW FILTERING
指令的情况下查询表。ALLOW FILTERING
指令需要扫描表中的所有分区,导致性能低下。
race_year
和 race_name
列构成了 cycling.rank_by_year_and_name
表的复合分区键。
查询 race_name
列上的匹配项
SELECT * FROM cycling.rank_by_year_and_name
WHERE race_name = 'Tour of Japan - Stage 4 - Minami > Shinshu';
race_year | race_name | rank | cyclist_name
-----------+--------------------------------------------+------+----------------------
2014 | Tour of Japan - Stage 4 - Minami > Shinshu | 1 | Daniel MARTIN
2014 | Tour of Japan - Stage 4 - Minami > Shinshu | 2 | Johan Esteban CHAVES
2014 | Tour of Japan - Stage 4 - Minami > Shinshu | 3 | Benjamin PRADES
2015 | Tour of Japan - Stage 4 - Minami > Shinshu | 1 | Benjamin PRADES
2015 | Tour of Japan - Stage 4 - Minami > Shinshu | 2 | Adam PHELAN
2015 | Tour of Japan - Stage 4 - Minami > Shinshu | 3 | Thomas LEBAS
查询 race_year
列上的匹配项
SELECT * FROM cycling.rank_by_year_and_name
WHERE race_year = 2014;
race_year | race_name | rank | cyclist_name
-----------+--------------------------------------------+------+----------------------
2014 | 4th Tour of Beijing | 1 | Phillippe GILBERT
2014 | 4th Tour of Beijing | 2 | Daniel MARTIN
2014 | 4th Tour of Beijing | 3 | Johan Esteban CHAVES
2014 | Tour of Japan - Stage 4 - Minami > Shinshu | 1 | Daniel MARTIN
2014 | Tour of Japan - Stage 4 - Minami > Shinshu | 2 | Johan Esteban CHAVES
2014 | Tour of Japan - Stage 4 - Minami > Shinshu | 3 | Benjamin PRADES
使用 AND 匹配多个索引
此示例使用以下表和索引
CREATE TABLE IF NOT EXISTS cycling.comments_vs (
record_id timeuuid,
id uuid,
commenter text,
comment text,
comment_vector VECTOR <FLOAT, 5>,
created_at timestamp,
PRIMARY KEY (id, created_at)
)
WITH CLUSTERING ORDER BY (created_at DESC);
CREATE INDEX commenter_idx
ON cycling.comments_vs (commenter)
USING 'sai';
CREATE INDEX created_at_idx
ON cycling.comments_vs (created_at)
USING 'sai';
CREATE INDEX ann_index
ON cycling.comments_vs (comment_vector)
USING 'sai';
为表创建了多个索引,以演示如何查询多个列上的匹配项。所有列都必须匹配。
查询多个列上的匹配项,并且所有列都必须匹配
SELECT * FROM cycling.comments_vs
WHERE
created_at='2017-03-21 21:11:09.999000+0000'
AND commenter = 'Alex';
id | created_at | comment | comment_vector | commenter | record_id
--------------------------------------+---------------------------------+-----------------------------------+------------------------------+-----------+--------------------------------------
e7ae5cf3-d358-4d99-b900-85902fda9bb0 | 2017-03-21 21:11:09.999000+0000 | Second rest stop was out of water | [0.99, 0.5, 0.99, 0.1, 0.34] | Alex | 6d0b7b10-272b-11ee-859f-b9098002fcac
使用 OR 匹配多个索引
此示例使用以下表和索引
CREATE TABLE IF NOT EXISTS cycling.comments_vs (
record_id timeuuid,
id uuid,
commenter text,
comment text,
comment_vector VECTOR <FLOAT, 5>,
created_at timestamp,
PRIMARY KEY (id, created_at)
)
WITH CLUSTERING ORDER BY (created_at DESC);
CREATE INDEX commenter_idx
ON cycling.comments_vs (commenter)
USING 'sai';
CREATE INDEX created_at_idx
ON cycling.comments_vs (created_at)
USING 'sai';
CREATE INDEX ann_index
ON cycling.comments_vs (comment_vector)
USING 'sai';
为表创建了多个索引,以演示如何查询多个列上的匹配项。所有列都必须匹配。
查询任一列上的匹配项
SELECT * FROM cycling.comments_vs
WHERE
created_at='2017-03-21 21:11:09.999000+0000'
OR created_at='2017-03-22 01:16:59.001000+0000';
id | created_at | comment | comment_vector | commenter | record_id
--------------------------------------+---------------------------------+-----------------------------------+------------------------------+-----------+--------------------------------------
e7ae5cf3-d358-4d99-b900-85902fda9bb0 | 2017-03-21 21:11:09.999000+0000 | Second rest stop was out of water | [0.99, 0.5, 0.99, 0.1, 0.34] | Alex | 6d0b7b10-272b-11ee-859f-b9098002fcac
c7fceba0-c141-4207-9494-a29f9809de6f | 2017-03-22 01:16:59.001000+0000 | Great snacks at all reststops | [0.1, 0.4, 0.1, 0.52, 0.09] | Amy | 6d0fc0d0-272b-11ee-859f-b9098002fcac
使用 IN 匹配多个索引
此示例使用以下表和索引
CREATE TABLE IF NOT EXISTS cycling.comments_vs (
record_id timeuuid,
id uuid,
commenter text,
comment text,
comment_vector VECTOR <FLOAT, 5>,
created_at timestamp,
PRIMARY KEY (id, created_at)
)
WITH CLUSTERING ORDER BY (created_at DESC);
CREATE INDEX commenter_idx
ON cycling.comments_vs (commenter)
USING 'sai';
CREATE INDEX created_at_idx
ON cycling.comments_vs (created_at)
USING 'sai';
CREATE INDEX ann_index
ON cycling.comments_vs (comment_vector)
USING 'sai';
为表创建了多个索引,以演示如何查询多个列上的匹配项。所有列都必须匹配。
查询列值在值列表中的匹配项
SELECT * FROM cycling.comments_vs
WHERE created_at IN
('2017-03-21 21:11:09.999000+0000'
,'2017-03-22 01:16:59.001000+0000');
id | created_at | comment | comment_vector | commenter | record_id
--------------------------------------+---------------------------------+-----------------------------------+------------------------------+-----------+--------------------------------------
e7ae5cf3-d358-4d99-b900-85902fda9bb0 | 2017-03-21 21:11:09.999000+0000 | Second rest stop was out of water | [0.99, 0.5, 0.99, 0.1, 0.34] | Alex | 6d0b7b10-272b-11ee-859f-b9098002fcac
c7fceba0-c141-4207-9494-a29f9809de6f | 2017-03-22 01:16:59.001000+0000 | Great snacks at all reststops | [0.1, 0.4, 0.1, 0.52, 0.09] | Amy | 6d0fc0d0-272b-11ee-859f-b9098002fcac
用户定义类型
SAI 可以索引用户定义类型 (UDT) 或 UDT 列表。此示例演示如何索引 UDT 列表。
此示例使用以下用户定义类型 (UDT)、表和索引
CREATE TYPE IF NOT EXISTS cycling.race (
race_title text,
race_date timestamp,
race_time text
);
CREATE TABLE IF NOT EXISTS cycling.cyclist_races (
id UUID PRIMARY KEY,
lastname text,
firstname text,
races list<FROZEN <race>>
);
CREATE INDEX races_idx
ON cycling.cyclist_races (races)
USING 'sai';
在 cycling.cyclist_races
表中,在 UDT 列表列 races
上创建索引。
使用 CONTAINS
从列表 races
列中查询
SELECT * FROM cycling.cyclist_races
WHERE races CONTAINS {
race_title:'Rabobank 7-Dorpenomloop Aalburg',
race_date:'2015-05-09',
race_time:'02:58:33'};
id | firstname | lastname | races
--------------------------------------+-----------+----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
5b6962dd-3f90-4c93-8f61-eabfa4a803e2 | Marianne | VOS | [{race_title: 'Rabobank 7-Dorpenomloop Aalburg', race_date: '2015-05-09 00:00:00.000000+0000', race_time: '02:58:33'}, {race_title: 'Ronde van Gelderland', race_date: '2015-04-19 00:00:00.000000+0000', race_time: '03:22:23'}]
(1 rows)
使用集合进行 SAI 索引
SAI 支持类型为 map
、list
和 set
的集合。集合允许您将数据分组并存储在同一列中。
在关系数据库中,例如用户的多个电子邮件地址,是通过 (例如) user
表和 email
表之间的多对一连接关系实现的。Apache Cassandra 通过将用户的电子邮件地址存储在 user
表中的集合列中来避免两个表之间的连接。每个集合都指定了所保存数据的类型。
如果集合存储数据的数量有限,则集合是合适的。如果数据具有无限的增长潜力,例如每秒发送的消息或注册的传感器事件,请不要使用集合。相反,请使用具有复合主键的表,其中数据存储在聚类列中。
在具有 SAI 索引的数据库表的 CQL 查询中,
|
使用 set 类型
此示例使用以下表和索引
CREATE TABLE IF NOT EXISTS cycling.cyclist_career_teams (
id UUID PRIMARY KEY,
lastname text,
teams set<text>
);
CREATE INDEX teams_idx
ON cycling.cyclist_career_teams (teams)
USING 'sai';
在 cyclist_career_teams
表中,在 set 列 teams
上创建索引。
使用 CONTAINS
从 set teams
列中查询
SELECT * FROM cycling.cyclist_career_teams
WHERE teams CONTAINS 'Rabobank-Liv Giant';
id | lastname | teams
--------------------------------------+----------+------------------------------------------------------------------------------------------------------
5b6962dd-3f90-4c93-8f61-eabfa4a803e2 | VOS | {'Nederland bloeit', 'Rabobank Women Team', 'Rabobank-Liv Giant', 'Rabobank-Liv Woman Cycling Team'}
使用 list 类型
此示例使用以下表和索引
CREATE TABLE IF NOT EXISTS cycling.upcoming_calendar (
year int,
month int,
events list<text>,
PRIMARY KEY (year, month)
);
CREATE INDEX events_idx
ON cycling.upcoming_calendar (events)
USING 'sai';
在 upcoming_calendar
表中,在 list 列 events
上创建索引。
使用 CONTAINS
从 list events
列中查询
SELECT * FROM cycling.upcoming_calendar
WHERE events CONTAINS 'Criterium du Dauphine';
year | month | events
------+-------+-----------------------------------------------
2015 | 6 | ['Criterium du Dauphine', 'Tour de Sui\nsse']
一个稍微复杂一点的查询会选择包含特定事件或具有特定月份日期的行
使用 map 类型
此示例使用以下表和索引
CREATE TABLE IF NOT EXISTS cycling.cyclist_teams (
id uuid PRIMARY KEY,
firstname text,
lastname text,
teams map<int, text>
);
CREATE INDEX IF NOT EXISTS team_year_keys_idx
ON cycling.cyclist_teams ( KEYS (teams) );
CREATE INDEX IF NOT EXISTS team_year_entries_idx
ON cycling.cyclist_teams ( ENTRIES (teams) );
CREATE INDEX IF NOT EXISTS team_year_values_idx
ON cycling.cyclist_teams ( VALUES (teams) );
在 cyclist_career_teams
表中的 map 列 teams
上创建的索引会针对列数据的键、值和完整条目。
使用 map teams
列中的 KEYS
进行查询
SELECT * FROM cyclist_teams WHERE teams CONTAINS KEY 2014;
id | firstname | lastname | teams
--------------------------------------+-----------+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
cb07baad-eac8-4f65-b28a-bddc06a0de23 | Elizabeth | ARMITSTEAD | {2011: 'Team Garmin - Cervelo', 2012: 'AA Drink - Leontien.nl', 2013: 'Boels:Dolmans Cycling Team', 2014: 'Boels:Dolmans Cycling Team', 2015: 'Boels:Dolmans Cycling Team'}
5b6962dd-3f90-4c93-8f61-eabfa4a803e2 | Marianne | VOS | {2014: 'Rabobank-Liv Woman Cycling Team', 2015: 'Rabobank-Liv Woman Cycling Team'}
从 map teams
列中查询一个值,注意只包含 CONTAINS
关键字
SELECT * FROM cyclist_teams WHERE teams CONTAINS 'Team Garmin - Cervelo';
id | firstname | lastname | teams
--------------------------------------+-----------+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
cb07baad-eac8-4f65-b28a-bddc06a0de23 | Elizabeth | ARMITSTEAD | {2011: 'Team Garmin - Cervelo', 2012: 'AA Drink - Leontien.nl', 2013: 'Boels:Dolmans Cycling Team', 2014: 'Boels:Dolmans Cycling Team', 2015: 'Boels:Dolmans Cycling Team'}
从 map teams
列中查询条目,注意 WHERE
子句中的差异
SELECT * FROM cyclist_teams
WHERE
teams[2014] = 'Boels:Dolmans Cycling Team'
AND teams[2015] = 'Boels:Dolmans Cycling Team';
id | firstname | lastname | teams
--------------------------------------+-----------+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
cb07baad-eac8-4f65-b28a-bddc06a0de23 | Elizabeth | ARMITSTEAD | {2011: 'Team Garmin - Cervelo', 2012: 'AA Drink - Leontien.nl', 2013: 'Boels:Dolmans Cycling Team', 2014: 'Boels:Dolmans Cycling Team', 2015: 'Boels:Dolmans Cycling Team'}
此示例查找 map teams
列中存在两个条目的行。
有关更多信息,请参阅