$bucket
将输入文档按照指定的表达式和边界进行分组,每个分组为一个文档,称为“桶”,每个桶都有一个唯一的_id
,其值为文件桶的下线。每个桶中至少要包含一个输入文档,也就是没有空桶。
{
$bucket: {
groupBy: <表达式>,
boundaries: [ <下边界1>, <下边界2>, ... ],
default: <literal>,
output: {
<output1>: { <$accumulator 表达式> },
...
<outputN>: { <$accumulator 表达式> }
}
}
}
对文档进行分组的表达式。若指定字段路径,需要在字段名前加上美元符号$
并用引号引起来,如:$field_name
。
除非指定了default
,否则所有输入文档的groupBy的值都必须在boundaries
指定边界的范围内。
分组边界数组,数组中相邻的两个值分别作为桶的上下边界,输入文档根据groupBy
表达式的值,确定被分配到哪个桶。数组至少要有两个元素,并按照升序从左到右排列,除数值混合类型外(如:[10, NumberLong(20), NumberInt(30)]
),数组元素类型必须一致。
举例:
一个数组 [ 0, 5, 10 ] 创建了两个桶:
[0,5),下界为 0,上界为 5。
[5,10),下界为 5,上界为 10。
可选,指定缺省桶的_id
,不符合boundaries
范围的文档都会放在缺省桶内。如果不指定default
,所有输入文档的groupBy
表达式的值必须落在boundaries
区间,否则会抛出异常。
缺省值必须小于boundaries
数组中最小的值或大于boundaries
数组中的最大值。default
值的类型可以不同于boundaries
数组元素的类型。
可选,指定输出文档内容中除_id
字段外要包含的其他字段,指定的字段必须使用汇总(累加器)表达式。
<outputfield1>: { <accumulator>: <expression1> },
...
<outputfieldN>: { <accumulator>: <expressionN> }
如果未指定output
文档,默认返回桶内文档数量count
字段,如果指定了output
文档的字段,则只返回_id
和指定的字段,count
字段默认不会输出。
创建artists
集合并插入下面的记录
db.artists.insertMany([
{ "_id" : 1, "last_name" : "Bernard", "first_name" : "Emil", "year_born" : 1868, "year_died" : 1941, "nationality" : "France" },
{ "_id" : 2, "last_name" : "Rippl-Ronai", "first_name" : "Joszef", "year_born" : 1861, "year_died" : 1927, "nationality" : "Hungary" },
{ "_id" : 3, "last_name" : "Ostroumova", "first_name" : "Anna", "year_born" : 1871, "year_died" : 1955, "nationality" : "Russia" },
{ "_id" : 4, "last_name" : "Van Gogh", "first_name" : "Vincent", "year_born" : 1853, "year_died" : 1890, "nationality" : "Holland" },
{ "_id" : 5, "last_name" : "Maurer", "first_name" : "Alfred", "year_born" : 1868, "year_died" : 1932, "nationality" : "USA" },
{ "_id" : 6, "last_name" : "Munch", "first_name" : "Edvard", "year_born" : 1863, "year_died" : 1944, "nationality" : "Norway" },
{ "_id" : 7, "last_name" : "Redon", "first_name" : "Odilon", "year_born" : 1840, "year_died" : 1916, "nationality" : "France" },
{ "_id" : 8, "last_name" : "Diriks", "first_name" : "Edvard", "year_born" : 1855, "year_died" : 1930, "nationality" : "Norway" }
])
下面的操作对文档按照year_born
字段进行分组放入桶中,并根据桶内文档数量进行筛选:
db.artists.aggregate( [
// 阶段1
{
$bucket: {
groupBy: "$year_born", // 分组字段
boundaries: [ 1840, 1850, 1860, 1870, 1880 ], // 桶边界
default: "Other", // 边界外的桶的ID
output: { // 指定桶的输出文档
"count": { $sum: 1 },
"artists" :
{
$push: {
"name": { $concat: [ "$first_name", " ", "$last_name"] },
"year_born": "$year_born"
}
}
}
}
},
// 阶段2
{
$match: { count: {$gt: 3} } //过滤出文档数量大于3的桶
}
] )
$bucket
阶段对文档根据year_born
分组把文档放入桶,桶的边界为:
1840
(含),上限1850
(不含)。1840
(含),上限1850
(不含)。1840
(含),上限1850
(不含)。1840
(含),上限1850
(不含)。year_born
字段不存在或者值在边界外,文档将被放到_id
值为"other"
的缺省桶中。阶段1的output
指定了输出文档的字段:
字段 | 描述 |
---|---|
_id | 包含了桶的边界下限 |
count | 桶内文档数量 |
artists | 文档数组,包含了桶内所有文章,每个文档的artists 字段都包含了拼接后的first_name 和last_name ,以及`year_born’字段 |
通过该阶段后,下面的文档进入下个阶段:
{ "_id" : 1840, "count" : 1, "artists" : [ { "name" : "Odilon Redon", "year_born" : 1840 } ] }
{ "_id" : 1850, "count" : 2, "artists" : [ { "name" : "Vincent Van Gogh", "year_born" : 1853 },
{ "name" : "Edvard Diriks", "year_born" : 1855 } ] }
{ "_id" : 1860, "count" : 4, "artists" : [ { "name" : "Emil Bernard", "year_born" : 1868 },
{ "name" : "Joszef Rippl-Ronai", "year_born" : 1861 },
{ "name" : "Alfred Maurer", "year_born" : 1868 },
{ "name" : "Edvard Munch", "year_born" : 1863 } ] }
{ "_id" : 1870, "count" : 1, "artists" : [ { "name" : "Anna Ostroumova", "year_born" : 1871 } ] }
$match
阶段使用count>3
的条件,对$bucket
阶段out
的文档进行筛选,筛选后的结果如下:
{ "_id" : 1860, "count" : 4, "artists" :
[
{ "name" : "Emil Bernard", "year_born" : 1868 },
{ "name" : "Joszef Rippl-Ronai", "year_born" : 1861 },
{ "name" : "Alfred Maurer", "year_born" : 1868 },
{ "name" : "Edvard Munch", "year_born" : 1863 }
]
}
$bucket
和$facet
按多个字段分类使用$facet
可以在一个阶段执行多个$bucket
聚合。使用mongosh
创建artwork
集合并添加下面的文档:
db.artwork.insertMany([
{ "_id" : 1, "title" : "The Pillars of Society", "artist" : "Grosz", "year" : 1926,
"price" : NumberDecimal("199.99") },
{ "_id" : 2, "title" : "Melancholy III", "artist" : "Munch", "year" : 1902,
"price" : NumberDecimal("280.00") },
{ "_id" : 3, "title" : "Dancer", "artist" : "Miro", "year" : 1925,
"price" : NumberDecimal("76.04") },
{ "_id" : 4, "title" : "The Great Wave off Kanagawa", "artist" : "Hokusai",
"price" : NumberDecimal("167.30") },
{ "_id" : 5, "title" : "The Persistence of Memory", "artist" : "Dali", "year" : 1931,
"price" : NumberDecimal("483.00") },
{ "_id" : 6, "title" : "Composition VII", "artist" : "Kandinsky", "year" : 1913,
"price" : NumberDecimal("385.00") },
{ "_id" : 7, "title" : "The Scream", "artist" : "Munch", "year" : 1893
/* No price*/ },
{ "_id" : 8, "title" : "Blue Flower", "artist" : "O'Keefe", "year" : 1918,
"price" : NumberDecimal("118.42") }
])
下面的操作在一个$facet
阶段中使用两个$bucket
,一个使用price
字段,另一个使用year
字段分组:
db.artwork.aggregate( [
{
$facet: { // 顶层 $facet 阶段
"price": [ // 输出字段1
{
$bucket: {
groupBy: "$price", // 分组字段
boundaries: [ 0, 200, 400 ], // 桶边界数组
default: "Other", // 缺省桶Id
output: { // 桶输出内容
"count": { $sum: 1 },
"artwork" : { $push: { "title": "$title", "price": "$price" } },
"averagePrice": { $avg: "$price" }
}
}
}
],
"year": [ // 输出字段2
{
$bucket: {
groupBy: "$year", // 分组字段
boundaries: [ 1890, 1910, 1920, 1940 ], // 桶边界数组
default: "Unknown", // 缺省桶Id
output: { // 桶输出内容
"count": { $sum: 1 },
"artwork": { $push: { "title": "$title", "year": "$year" } }
}
}
}
]
}
}
] )
第一个方面按price
对输入文档进行分组,桶的边界有:
$bucket
阶段的输出out
文档包含下面的字段:
字段 | 描述 |
---|---|
_id | 桶边界下限值 |
count | 桶内文档数量 |
artwork | 包含所有艺术品信息的文档数组 |
averagePrice | 使用$avg 运算符显示水桶中所有艺术品的平均价格。 |
第二个方面按year
对输入文档进行分组,桶的边界有:
$bucket
阶段的输出out
文档包含下面的字段:
字段 | 描述 |
---|---|
count | 桶内文档数量 |
artwork | 桶内每件艺术品信息的文件数组。 |
操作返回下面的结果:
{
"price" : [ // Output of first facet
{
"_id" : 0,
"count" : 4,
"artwork" : [
{ "title" : "The Pillars of Society", "price" : NumberDecimal("199.99") },
{ "title" : "Dancer", "price" : NumberDecimal("76.04") },
{ "title" : "The Great Wave off Kanagawa", "price" : NumberDecimal("167.30") },
{ "title" : "Blue Flower", "price" : NumberDecimal("118.42") }
],
"averagePrice" : NumberDecimal("140.4375")
},
{
"_id" : 200,
"count" : 2,
"artwork" : [
{ "title" : "Melancholy III", "price" : NumberDecimal("280.00") },
{ "title" : "Composition VII", "price" : NumberDecimal("385.00") }
],
"averagePrice" : NumberDecimal("332.50")
},
{
// Includes documents without prices and prices greater than 400
"_id" : "Other",
"count" : 2,
"artwork" : [
{ "title" : "The Persistence of Memory", "price" : NumberDecimal("483.00") },
{ "title" : "The Scream" }
],
"averagePrice" : NumberDecimal("483.00")
}
],
"year" : [ // Output of second facet
{
"_id" : 1890,
"count" : 2,
"artwork" : [
{ "title" : "Melancholy III", "year" : 1902 },
{ "title" : "The Scream", "year" : 1893 }
]
},
{
"_id" : 1910,
"count" : 2,
"artwork" : [
{ "title" : "Composition VII", "year" : 1913 },
{ "title" : "Blue Flower", "year" : 1918 }
]
},
{
"_id" : 1920,
"count" : 3,
"artwork" : [
{ "title" : "The Pillars of Society", "year" : 1926 },
{ "title" : "Dancer", "year" : 1925 },
{ "title" : "The Persistence of Memory", "year" : 1931 }
]
},
{
// Includes documents without a year
"_id" : "Unknown",
"count" : 1,
"artwork" : [
{ "title" : "The Great Wave off Kanagawa" }
]
}
]
}
跟很多阶段类似,$bucket阶段也有100M内存的限制,缺省情况下如果超出100M将会抛出异常。可使用allowDiskUse
选项,让聚合管道阶段将数据写入临时文件。