-
Notifications
You must be signed in to change notification settings - Fork 6
/
Copy pathquickstart.malloynb
503 lines (428 loc) · 19.8 KB
/
quickstart.malloynb
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
>>>markdown
# Malloy Quickstart
This guide introduces the basics of querying data and building a semantic model with the Malloy language. By the end of this tutorial, you will understand how to use Malloy to run queries, build re-usable data models, and do analysis on your data that is nearly impossible in SQL.
Data for this notebook lives in this repository in parquet files in the `./data` directory. The queries are run using [DuckDB](https://duckdb.org/), an embedded analytical database. Each code cell in this notebook can be executed by clicking the ▶️ button that appears when you hover on the cell. You can also modify the code and re-execute the queries to play around and try things out.
## A simple `SELECT`
The following query is equivalent to <code>SELECT id, code, city FROM airports LIMIT 10</code> in SQL:
>>>malloy
run: duckdb.table('data/airports.parquet') -> {
select:
id
code
city
limit: 10
}
>>>markdown
Let's break down each part of this query.
- `run:` is the opening statement that indicates we're starting to write a query
- `duckdb.table('data/airports.parquet')` defines the source for the query. `duckdb` indicates the database connection we're using, and the `table()` function creates a source from a table or view in the database.
- A source is similar to a table or view in SQL, but Malloy sources can include additional information like joins and measures. We'll cover this in depth later on.
- The `->` operator begins the query. All queries take the form `source -> { ... }`, with the query logic specified inside of the curly braces.
- `select: ` is equivalent to `SELECT` in SQL. In this clause, we select the `id`, `code`, and `city` columns from the table. A `select` is equivalent to the [projection](https://en.wikipedia.org/wiki/Projection_(relational_algebra)) operation in Relational Algebra.
- `limit: 10` limits the resultset of the query to the first 10 items
>>>markdown
### Showing the Schema
Malloy supports `select *` just like SQL's `SELECT *` so you can interactively view any source's schema.
>>>malloy
run: duckdb.table('data/airports.parquet') -> {
select: *
limit: 10
}
>>>markdown
## Query Operators
In SQL, the <code>SELECT</code> command does two very different things. A <code>SELECT</code> with a <code>GROUP BY</code> aggregates data according to the <code>GROUP BY</code> clause and produces aggregate calculation against every calculation not in the <code>GROUP BY</code>. In Malloy, the query operator for this is `group_by`. Calculation about data in the group are made using `aggregate`.
The second type of <code>SELECT</code> in SQL does not perform any aggregation; All rows in the input table, unless filtered in some way, show up in the output table.
### Aggregate
In the query below, the data will be grouped by `state` and `county`, and will produce an aggregate calculation for `airport_count` and `average_elevation`.
>>>malloy
run: duckdb.table('data/airports.parquet') -> {
group_by:
state
county
aggregate:
airport_count is count()
average_elevation is avg(elevation)
limit: 10
}
>>>markdown
Operator statements can be placed in any order within a query. `where` can come before or after `select`, and `limit` can be placed anywhere as well. The above query could also be written:
>>>malloy
run: duckdb.table('data/airports.parquet') -> {
limit: 10
where: county = 'SANTA CRUZ'
select: code, full_name, city, county
}
>>>markdown
## Everything has a Name
In Malloy, all output fields have names. This means that any time a query
includes a field with a calculated value, like a scalar or aggregate function,
it must be named. _(unlike SQL, which allows un-named expressions)_
>>>malloy
run: duckdb.table('data/airports.parquet') -> {
aggregate: max_elevation is max(elevation)
}
>>>markdown
Notice that Malloy uses the form "**name** `is` _value_" instead of SQL's "_value_ `as` **name**".
Having the output column name written first makes it easier for someone reading
the code to visualize the resulting query structure.
Named objects (such as columns from a table and fields defined in a source) can be included in field lists without an `is`
>>>malloy
run: duckdb.table('data/airports.parquet') -> {
select:
full_name
elevation
limit: 10
}
>>>markdown
## Expressions
Many SQL expressions will work unchanged in Malloy, and many functions available in Standard SQL are usable in Malloy as well. This makes expressions fairly straightforward to understand, given a knowledge of SQL.
>>>malloy
run: duckdb.table('data/airports.parquet') -> {
group_by: county_and_state is concat(county, ', ', state)
aggregate:
airport_count is count()
max_elevation is max(elevation)
min_elevation is min(elevation)
avg_elevation is avg(elevation)
limit: 10
}
>>>markdown
The basic types of Malloy expressions are `string`, `number`, `boolean`, `date`, and `timestamp`.
>>>markdown
## Sources: the Basic Structure for Modeling and Reuse
One of the main benefits of Malloy is the ability to save common calculations into a data model. The data model is made of *sources*, which can be thought of as tables or views, but with additional information such as joins, dimensions and measures encoded into it.
In the example below, we create a *source* object named `airports`, add a `dimension` calculation for `county_and_state`, and a `measure` calculation for `airport_count`. Dimensions can be used in `group_by`, `select` and `where`. Measures can be used in `aggregate` and `having`.
>>>malloy
source: airports is duckdb.table('data/airports.parquet') extend {
dimension: county_and_state is concat(county, ', ', state)
measure: airport_count is count()
measure: average_elevation is avg(elevation)
}
run: airports -> {
group_by: county_and_state
aggregate: airport_count
limit: 10
}
>>>markdown
Sources that are defined in one file can be imported into another using `import "path/to/some/file.malloy"`. For example, if the `airports` source above were defined in a file called `flights.malloy`, you could create a new file that imports it and immediately start using the `airports` source:
```malloy
import "flights.malloy"
run: airports -> {
group_by: county_and_state
aggregate: average_elevation
}
```
>>>markdown
Sources can also contain named queries. These named queries are useful for building nested queries (covered later) or for saving a query so it can re-used again and again without having to rewrite it.
>>>malloy
source: airports_with_named_view is duckdb.table('data/airports.parquet') extend {
dimension: county_and_state is concat(county, ', ', state)
measure: airport_count is count()
measure: average_elevation is avg(elevation)
// This is a `view` defined on the source:
view: top_county_and_state is {
group_by: county_and_state
aggregate: airport_count
limit:10
}
}
// The view be referenced by name, and the unerlying query will be run without having to rewrite the query logic:
run: airports_with_named_view -> top_county_and_state
>>>markdown
## Joins
Joins are declared as part of a source. When joining a source to another, it brings with it all child joins.
>>>malloy
source: aircraft_models is duckdb.table('data/aircraft_models.parquet') extend {
primary_key: aircraft_model_code
}
source: aircraft is duckdb.table('data/aircraft.parquet') extend {
primary_key: tail_num
join_one: aircraft_models on aircraft_model_code = aircraft_models.aircraft_model_code
}
source: flights is duckdb.table('data/flights.parquet') extend {
join_one: aircraft on tail_num = aircraft.tail_num
}
run: flights -> {
where: dep_time = @2003-01
group_by: aircraft.aircraft_models.manufacturer
aggregate:
flight_count is count()
aircraft_count is aircraft.count()
average_seats_per_model is aircraft.aircraft_models.seats.avg()
}
>>>markdown
In this example, the `aircraft` source is joined to `flights`, and `aircraft_models` is joined via `aircraft`. These examples explicitly name both keys -- this same syntax can be used to write more complex joins.
Now, any query that uses the `flights` source has access to fields in both `aircraft` and `aircraft_models` without having to explicitly specify the join condition. The joins are specified once in the source, and usable by any query on `flights`.
An ad hoc join can also be specified in a query block using the `extend` keyword. In the query below, we join in the `airports` table using the `destination` column as a join key, then compute the top 5 destination airports by flight count.
>>>malloy
run: flights -> {
extend: {
join_one: airports on destination = airports.code
}
group_by: airports.full_name
aggregate: flight_count is count()
top: 5
}
>>>markdown
## Filtering
When working with data, filtering is something you do in almost every query. Malloy provides consistent syntax for filtering everywhere within a query. The most basic type of filter is applied using a `where:` clause, very similar to a <code>WHERE</code> clause in SQL.
The following query grabs the top 5 counties in California with the highest airport count:
>>>malloy
run: duckdb.table('data/airports.parquet') -> {
where: state = 'CA'
top: 5
group_by: county
aggregate: airport_count is count()
}
>>>markdown
Filters can also be applied to sources:
>>>malloy
source: airports_in_california is duckdb.table('data/airports.parquet') extend {
where: state = 'CA'
}
run: airports_in_california -> {
top: 5
group_by: county
aggregate: airport_count is count()
}
>>>markdown
Any query run on the `airports_in_california` source will run against the `airports` table, and always include the filter in `state = 'CA'`.
>>>markdown
### Filtering Measures
A filter on an aggregate calculation (a _measure_) narrows down the data used in that specific calculation. In the example below, the calculations for `airports` and `heliports` are filtered separately.
>>>malloy
run: duckdb.table('data/airports.parquet') -> {
group_by: state
aggregate:
airports is count() { where: fac_type = 'AIRPORT' }
heliports is count() { where: fac_type = 'HELIPORT' }
total is count()
limit: 5
}
>>>markdown
In SQL, this same calculation is often done using <code>CASE</code> statements inside of the aggregates, which is verbose and difficult to read. A query like the above would look like:
```sql
SELECT
state
, SUM(CASE WHEN fac_type = 'AIRPORT' THEN 1 ELSE 0 END) AS airports
, SUM(CASE WHEN fac_type = 'HELIPORT' THEN 1 ELSE 0 END) AS heliports
, COUNT(*) AS total
FROM `malloy-data.faa.airports`
GROUP BY state
```
>>>markdown
### Nested Queries
In Malloy, queries can be nested to produce subtables on each output row.
>>>malloy
run: airports -> {
group_by: state
aggregate: airport_count
nest: by_facility is {
group_by: fac_type
aggregate: airport_count
top: 3
}
limit: 5
}
>>>markdown
Here we can see that the `by_facility` column of the output table contains a nested subtable on each row. `by_facility` contains the counts for the top 3 facility types for each state, i.e., the number of airports, heliports, and stolports in Texas, the number of airports, heliports, and seaplane bases in California, etc.
When a query is nested inside another query, each output row of the outer query will have a nested table for the inner query which only includes data limited to that row.
Queries can be nested infinitely, allowing for rich, complex output structures. A query may always include another nested query, regardless of depth:
>>>malloy
run: airports -> {
group_by: state
aggregate: airport_count
nest: top_5_counties is {
top: 5
group_by: county
aggregate: airport_count
nest: by_facility is {
group_by: fac_type
aggregate: airport_count
}
}
top: 5
}
>>>markdown
### Filtering Nested Queries
Filters can be isolated to any level of nesting. In the following example, we limit the `major_facilities` query to only airports where `major` is `'Y'`. This particular filter applies _only_ to `major_facilities`, and not to other parts of the outer query.
>>>malloy
run: airports -> {
where: state = 'CA'
group_by: county
aggregate: airport_count
nest: major_facilities is {
where: major = 'Y'
group_by: name is concat(code, ' (', full_name, ')')
}
nest: by_facility is {
group_by: fac_type
aggregate: airport_count
}
top: 5
}
>>>markdown
## Dates and Timestamps
Working with time in data is often needlessly complex; Malloy has built in constructs to simplify many time-related operations. This section gives a brief introduction to some of these tools.
### Time Literals
Literals of type `date` and `timestamp` are notated with an `@`, e.g. `@2003-03-29` or `@1994-07-14 10:23:59`. Similarly, years (`@2021`), quarters (`@2020-Q1`), months (`@2019-03`), weeks (`@WK2021-08-01`), and minutes (`@2017-01-01 10:53`) can be expressed.
Time literals can be used as values, but are more often useful in filters. For example, the following query
shows the number of flights in 2003.
>>>malloy
run: duckdb.table('data/flights.parquet') {
where: dep_time ? @2003
} -> {
aggregate: flight_count is count()
}
>>>markdown
There is a special time literal `now`, referring to the current timestamp, which allows for relative time filters.
>>>malloy
run: duckdb.table('data/flights.parquet') {
where: dep_time > now - 6 hours
} -> {
aggregate: flights_last_6_hours is count()
}
>>>markdown
### Truncation
Time values can be truncated to a given timeframe, which can be `second`, `minute`, `hour`, `day`, `week`, `month`, `quarter`, or `year`.
>>>malloy
run: duckdb.table('data/flights.parquet') -> {
group_by:
flight_year is dep_time.year
flight_month is dep_time.month
aggregate: flight_count is count()
limit: 5
}
>>>markdown
### Extraction
Numeric values can be extracted from time values, e.g. `day_of_year(some_date)` or `minute(some_time)`. See the full list of extraction functions [here](https://malloydata.github.io/documentation/language/time-ranges).
>>>malloy
run: duckdb.table('data/flights.parquet') -> {
order_by: 1
group_by: day_of_week is day(dep_time)
aggregate: flight_count is count()
}
>>>markdown
<!-- TODO it may be worth having a doc describing what the JSON+Metadata
output of these look like, i.e. that the JSON just includes a regular date,
but the metadata specifies that it's in that given timeframe.
And likewise for any other data type that has interesting output metadata. -->
### Time Ranges
Two kinds of time ranges are given special syntax: the range between two times and the range starting at some time for some duration. These are represented like `@2003 to @2005` and `@2004-Q1 for 6 quarters` respectively. These ranges can be used in filters just like time literals.
>>>malloy
run: duckdb.table('data/flights.parquet') { where: dep_time ? @2003 to @2005 } -> {
aggregate: flight_count is count()
}
>>>markdown
Time literals and truncations can also behave like time ranges. Each kind of time literal has an implied duration that takes effect when it is used in a comparison, e.g. `@2003` represents the whole of the year 2003, and `@2004-Q1` lasts the whole 3 months of the quarter. Similarly, when a time value is truncated, it takes on the
timeframe from the truncation, e.g. `now.month` means the entirety of the current month.
When a time range is used in a comparison, `=` checks for "is in the range", `>` "is after", and `<` "is before." So `some_time > @2003` filters dates starting on January 1, 2004, while `some_time = @2003` filters to dates in the year 2003.
>>>malloy
run: duckdb.table('data/flights.parquet') { where: dep_time > @2003 } -> {
top: 3; order_by: departure_date asc
group_by: departure_date is dep_time.day
aggregate: flight_count is count()
}
>>>markdown
## Pipelines and Multi-stage Queries
As we saw above, the output from one stage of a query can be passed into another stage using `->`. For example, we'll start with this query which outputs, for California and New York, the total number of airports, as well as the number of airports in each county.
>>>malloy
run: airports -> {
where: state = 'CA' | 'NY'
group_by: state
aggregate: airport_count
nest: by_county is {
group_by: county
aggregate: airport_count
limit: 5
}
}
>>>markdown
Next, we'll use the output of that query as the input to another, where we determine which counties have the highest
percentage of airports compared to the whole state, taking advantage of the nested structure of the data to to so.
>>>malloy
run: airports -> {
where: state = 'CA' | 'NY'
group_by: state
aggregate: airport_count
nest: by_county is {
group_by: county
aggregate: airport_count
limit: 5
}
} -> {
top: 10; order_by: 4 desc
select:
by_county.county
airports_in_county is by_county.airport_count
airports_in_state is airport_count
percent_in_county is by_county.airport_count / airport_count
}
>>>markdown
## Aggregate Locality
When computing `sum`, `avg`, and `count` on fields in joined sources with one-to-many relationships, Malloy will automatically handle the duplication of rows that occurs in the join, and compute accurate aggregations on the fanned-out table. See the [Aggregate Locality](https://malloydata.github.io/documentation/language/aggregates.html#aggregate-locality) section for more information.
>>>malloy
run: aircraft -> {
aggregate:
// The average number of seats on models of registered aircraft
models_avg_seats is aircraft_models.seats.avg()
// The average number of seats on registered aircraft
aircraft_avg_seats is source.avg(aircraft_models.seats)
}
>>>markdown
## Comments
Malloy code can include both line and block comments. Line comments, which begin with `--` or `//`,
may appear anywhere within a line, and cause all subsequent characters on that line to be ignored.
Block comments, which are enclosed between <code>/\*</code> and <code>\*/</code>, cause all enclosed characters to be ignored
and may span multiple lines.
>>>markdown
```malloy
-- The total number of flight entries
run: flights -> {
aggregate: flight_count // Defined simply as `count()`
}
/*
* A comparison of the total number of flights
* for each of the tracked carriers.
*/
run: flights -> {
group_by: carrier
aggregate: flight_count /* , total_distance */
}
```
>>>markdown
## Ordering and Limiting
In Malloy, ordering and limiting work pretty much the same way they do in SQL, though Malloy introduces some [reasonable defaults](../language/order_by.md).
The `top:` and `limit:` statements are synonyms and limits the number of rows returned. Results below are sorted by the first measure descending--in this case, `airport_count`.
>>>malloy
run: airports -> {
top: 2
group_by: state
aggregate: airport_count is count()
}
>>>markdown
Default ordering can be overridden with `order_by:`, as in the following query, which shows the states in alphabetical order. `order_by:` can take a field index number or the name of a field.
>>>malloy
run: airports -> {
order_by: state
group_by: state
aggregate: airport_count is count()
limit: 10
}
>>>markdown
## Next Steps
This was a whirlwind tour of the syntax and features of Malloy. To continue on your Malloy journey:
- Explore sample analyses and data models built in Malloy in our [Patterns Github repo](https://github.com/malloydata/patterns).
- Learn how to [connect Malloy to your own database](https://malloydata.github.io/documentation/setup/connection_instructions.html).
- Take a look at our [guide for translating SQL to Malloy](https://malloydata.github.io/documentation/language/sql_to_malloy.html).
- Join the [Malloy community Slack channel](https://join.slack.com/t/malloy-community/shared_invite/zt-1t32mufpy-THwP1o1ADJVkd3o2L2zaZw)!
>>>markdown
<!-- ## Joins are between primary and foreign keys.
## Full graph of the data is available to query
## Sums and Counts and average are a little different.
## Calculations can correctly occur anywhere in the graph -->
<!--
## Removed things
- Commas are optional.
- Count can be written without the `*`.
-->