
Commit

run fix on dev and _posts
Mause committed Aug 18, 2023
1 parent 79e799a commit 920a98b
Showing 32 changed files with 280 additions and 55 deletions.
4 changes: 4 additions & 0 deletions _posts/2021-01-25-full-text-search.md
@@ -20,6 +20,7 @@ DuckDB's FTS implementation follows the paper "[Old Dogs Are Great at New Tricks
Alright, enough about the "why", let's get to the "how".

### Preparing the Data

The TREC 2004 Robust Retrieval Track has 250 "topics" (search queries) over TREC disks 4 and 5. The data consist of many text files stored in SGML format, along with a corresponding DTD (document type definition) file. This format is rarely used anymore, but it is similar to XML. We will use OpenSP's command line tool `osx` to convert it to XML. Because there are many files, I wrote a bash script:
```bash
#!/bin/bash
@@ -80,6 +81,7 @@ con.close()
This is the end of my preparation script, so I closed the database connection.

### Building the Search Engine

We can now build the inverted index and the retrieval model using a `PRAGMA` statement. The extension is [documented here](/docs/extensions/full_text_search). We create an index over the table `documents` (i.e. `main.documents`) that we created with our script. The column that identifies our documents is called `docno`, and we wish to create an inverted index on the supplied fields. I selected all fields by using the '\*' shortcut.
```python
con = duckdb.connect(database='db/trec04_05.db', read_only=False)
@@ -89,6 +91,7 @@ con.execute("PRAGMA create_fts_index('documents', 'docno', '*', stopwords='engli
Under the hood, a parameterized SQL script is called. The schema `fts_main_documents` is created, along with the tables `docs`, `terms`, `dict`, and `stats` that make up the inverted index. If you're curious what this looks like, take a look under the `extension` folder in DuckDB's source code!
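
Since these are ordinary DuckDB tables, you can also inspect them directly from Python. A minimal sketch (the exact columns of these tables are an implementation detail and may change between versions):

```python
import duckdb

con = duckdb.connect(database='db/trec04_05.db', read_only=True)

# overall statistics used by the retrieval model
print(con.execute("SELECT * FROM fts_main_documents.stats").fetchdf())

# a few entries of the term dictionary
print(con.execute("SELECT * FROM fts_main_documents.dict LIMIT 10").fetchdf())
```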

### Running the Benchmark

The data is now fully prepared. Now we want to run the queries in the benchmark, one by one. We load the topics file as follows:
```python
# the 'topics' file is not structured nicely, therefore we need to parse some of it using regex
@@ -138,6 +141,7 @@ with open('results', 'w+') as f:
```

### Results

Now that we have created our 'results' file, we can compare it to the relevance assessments `qrels` using `trec_eval`.
```bash
$ ./trec_eval -m P.30 -m map qrels results
10 changes: 10 additions & 0 deletions _posts/2021-05-14-sql-on-pandas.md
@@ -20,6 +20,7 @@ If you are reading from a file (e.g. a CSV or Parquet file) often your data will
[1] [Apache Arrow](https://arrow.apache.org) is gaining significant traction in this domain as well, and DuckDB also quacks Arrow.

## SQL on Pandas

After your data has been converted into a Pandas DataFrame often additional data wrangling and analysis still need to be performed. SQL is a very powerful tool for performing these types of data transformations. Using DuckDB, it is possible to run SQL efficiently right on top of Pandas DataFrames.

As a short teaser, here is a code snippet that allows you to do exactly that: run arbitrary SQL queries directly on Pandas DataFrames using DuckDB.
@@ -36,6 +37,7 @@ print(duckdb.query("SELECT SUM(a) FROM mydf").to_df())
In the rest of the article, we will go more in-depth into how this works and how fast it is.

## Data Integration & SQL on Pandas

One of the core goals of DuckDB is that accessing data in common formats should be easy. DuckDB is fully capable of running queries in parallel *directly* on top of a Pandas DataFrame (or on a Parquet/CSV file, or on an Arrow table, …). A separate (time-consuming) import step is not necessary.

DuckDB can also write query results directly to any of these formats. You can use DuckDB to process a Pandas DataFrame in parallel using SQL, and convert the result back to a Pandas DataFrame again, so you can then use the result in other Data Science libraries.
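
A rough sketch of that round trip (the DataFrame and its columns here are purely illustrative):

```python
import duckdb
import pandas as pd

# Pandas in: an ordinary DataFrame
mydf = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'x']})

# run SQL directly on the DataFrame
result = duckdb.query("SELECT b, SUM(a) AS total FROM mydf GROUP BY b").to_df()

# Pandas out: result is again a regular DataFrame, ready for other libraries
print(result)
```
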
@@ -56,6 +58,7 @@ The SQL table name `mydf` is interpreted as the local Python variable `mydf` tha
Not only is this process painless, it is highly efficient. For many queries, you can use DuckDB to process data faster than Pandas, and with a much lower total memory usage, *without ever leaving the Pandas DataFrame binary format* ("Pandas-in, Pandas-out"). Unlike when using an external database system such as Postgres, the data transfer time of the input or the output is negligible (see Appendix A for details).

## SQL on Pandas Performance

To demonstrate the performance of DuckDB when executing SQL on Pandas DataFrames, we now present a number of benchmarks. The source code for the benchmarks is available for interactive use [in Google Colab](https://colab.research.google.com/drive/1eg_TJpPQr2tyYKWjISJlX8IEAi8Qln3U?usp=sharing). In these benchmarks, we operate *purely* on Pandas DataFrames. Both the DuckDB code and the Pandas code operate fully on a `Pandas-in, Pandas-out` basis.

### Benchmark Setup and Data Set
@@ -116,6 +119,7 @@ lineitem.agg(
This benchmark involves a very simple query, and Pandas performs very well here. These simple queries are where Pandas excels (ha), as it can directly call into the highly efficient numpy routines that implement these aggregates. Nevertheless, we can see that DuckDB performs similarly to Pandas in the single-threaded scenario, and benefits from its multi-threading support when enabled.
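
To give an idea of how the multi-threaded variant is switched on, here is a minimal sketch (the DataFrame is a tiny stand-in for the benchmark data, and four threads is an arbitrary choice):

```python
import duckdb
import pandas as pd

# tiny stand-in for the benchmark's lineitem DataFrame
lineitem = pd.DataFrame({'l_extendedprice': [100.0, 250.5, 75.25]})

con = duckdb.connect()
# allow DuckDB to use up to 4 threads for query execution
con.execute("PRAGMA threads=4")

print(con.execute("""
    SELECT SUM(l_extendedprice), MIN(l_extendedprice), MAX(l_extendedprice)
    FROM lineitem
""").fetchdf())
```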

### Grouped Aggregate

For our second query, we will run the same set of aggregates, but this time include a grouping condition. In SQL, we can do this by adding a GROUP BY clause to the query.

```sql
@@ -228,6 +232,7 @@ Due to its holistic query optimizer and efficient query processor, DuckDB perfor


### Joins

For the final query, we will join (`merge` in Pandas) the lineitem table with the orders table, and apply a filter that only selects orders which have the status we are interested in. This leads us to the following query in SQL:

```sql
@@ -323,6 +328,7 @@ Both of these optimizations are automatically applied by DuckDB's query optimize
We see that the basic approach is extremely time consuming compared to the optimized version. This demonstrates the usefulness of the automatic query optimizer. Even after optimizing, the Pandas code is still significantly slower than DuckDB because it stores intermediate results in memory after the individual filters and joins.
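
To make the difference concrete, the two Pandas variants look roughly like this (a sketch with tiny stand-in DataFrames; the TPC-H-style column names are assumed and may not match the benchmark code exactly):

```python
import pandas as pd

# tiny stand-ins for the TPC-H lineitem and orders tables
lineitem = pd.DataFrame({'l_orderkey': [1, 2, 3], 'l_extendedprice': [10.0, 20.0, 30.0]})
orders = pd.DataFrame({'o_orderkey': [1, 2, 3], 'o_orderstatus': ['O', 'F', 'O']})

# basic approach: join the full tables first, filter afterwards
basic = lineitem.merge(orders, left_on='l_orderkey', right_on='o_orderkey')
basic = basic[basic['o_orderstatus'] == 'O']

# hand-optimized approach: filter and project before the join,
# which is what DuckDB's optimizer does automatically
orders_open = orders[orders['o_orderstatus'] == 'O'][['o_orderkey']]
optimized = lineitem[['l_orderkey', 'l_extendedprice']].merge(
    orders_open, left_on='l_orderkey', right_on='o_orderkey')
```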

### Takeaway

Using DuckDB, you can take advantage of the powerful and expressive SQL language without having to worry about moving your data in and out of Pandas. DuckDB is extremely simple to install, and offers many advantages such as a query optimizer, automatic multi-threading and larger-than-memory computation. DuckDB uses the Postgres SQL parser, and offers many of the same SQL features as Postgres, including advanced features such as window functions, correlated subqueries, (recursive) common table expressions, nested types and sampling. If you are missing a feature, please [open an issue](https://github.com/duckdb/duckdb/issues).

## Appendix A: There and back again: Transferring data from Pandas to a SQL engine and back
@@ -368,11 +374,13 @@ Nevertheless, for good measure we have run the first Ungrouped Aggregate query i
We can see that PandaSQL (powered by SQLite) is around 1000x slower than either Pandas or DuckDB on this straightforward benchmark. The performance difference was so large that we have opted not to run the other benchmarks for PandaSQL.

## Appendix C: Query on Parquet Directly

In the benchmarks above, we fully read the Parquet files into Pandas. However, DuckDB can also run queries directly on top of Parquet files (in parallel!). In this appendix, we compare the performance of querying the Parquet files directly with that of loading them into Pandas first.

For the benchmark, we will run two queries: the simplest query (the ungrouped aggregate) and the most complex query (the final join), and compare the cost of running them directly on the Parquet file with the cost of first loading it into Pandas using the `read_parquet` function.

### Setup

In DuckDB, we can create a view over the Parquet file using the following query. This allows us to run queries over the Parquet file as if it was a regular table. Note that we do not need to worry about projection pushdown at all: we can just do a `SELECT *` and DuckDB's optimizer will take care of only projecting the required columns at query time.

```sql
@@ -381,6 +389,7 @@ CREATE VIEW orders_parquet AS SELECT * FROM 'orders.parquet';
```

### Ungrouped Aggregate

After we have set up this view, we can run the same queries we ran before, but this time against the `lineitem_parquet` view.

```sql
@@ -413,6 +422,7 @@ result = lineitem_pandas_parquet.agg(Sum=('l_extendedprice', 'sum'), Min=('l_ext
We can see that the performance difference between doing the pushdown and not doing the pushdown is dramatic. When we perform the pushdown, Pandas has performance in the same ballpark as DuckDB. Without the pushdown, however, it loads the entire file from disk, including the other 15 columns that are not required to answer the query.
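
In code, the difference is just the `columns` argument to `read_parquet`. A sketch (assuming the lineitem data was written to `lineitem.parquet`, as in the setup above):

```python
import pandas as pd

# without pushdown: every column of the file is read into memory
lineitem_full = pd.read_parquet('lineitem.parquet')
print(lineitem_full['l_extendedprice'].sum())

# with manual projection pushdown: only the single column we aggregate is read
lineitem_projected = pd.read_parquet('lineitem.parquet', columns=['l_extendedprice'])
print(lineitem_projected['l_extendedprice'].sum())
```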

## Joins

Now for the final query that we saw in the join section previously. To recap:

```sql
9 changes: 9 additions & 0 deletions _posts/2021-06-25-querying-parquet.md
@@ -23,6 +23,7 @@ The Parquet format has a number of properties that make it suitable for analytic
3. The columnar compression significantly reduces the file size of the format, which in turn reduces the storage requirement of data sets. This can often turn Big Data into Medium Data.

#### DuckDB and Parquet

DuckDB's zero-dependency Parquet reader is able to directly execute SQL queries on Parquet files without any import or analysis step. Because of the natural columnar format of Parquet, this is very fast!

DuckDB will read the Parquet files in a streaming fashion, which means you can perform queries on large Parquet files that do not fit in your main memory.
Expand All @@ -47,6 +48,7 @@ WHERE pickup_at BETWEEN '2019-04-15' AND '2019-04-20'
```

#### Automatic Filter & Projection Pushdown

Let us dive into the previous query to better understand the power of the Parquet format when combined with DuckDB's query optimizer.

```sql
@@ -65,13 +67,15 @@ In addition, only rows that have a `pickup_at` between the 15th and the 20th of
We can use the statistics inside the Parquet file to great advantage here. Any row groups that have a max value of `pickup_at` lower than `2019-04-15`, or a min value higher than `2019-04-20`, can be skipped. In some cases, that allows us to skip reading entire files.
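
If you want to check that this is happening, you can look at the query plan. A minimal sketch (the exact plan output varies between DuckDB versions, and the `count(*)` query here is just an illustrative stand-in):

```py
import duckdb

con = duckdb.connect()
# the filter on pickup_at should show up inside the Parquet scan itself,
# which is what allows whole row groups (or files) to be skipped
for row in con.execute("""
    EXPLAIN
    SELECT count(*)
    FROM 'taxi/*.parquet'
    WHERE pickup_at BETWEEN '2019-04-15' AND '2019-04-20'
""").fetchall():
    print(row)
```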

#### DuckDB versus Pandas

To illustrate how effective these automatic optimizations are, we will run a number of queries on top of Parquet files using both Pandas and DuckDB.

In these queries, we use a part of the infamous New York Taxi dataset stored as Parquet files, specifically data from April, May and June 2019. Together, these files are ca. 360 MB in size and contain around 21 million rows with 18 columns each. The three files are placed into the `taxi/` folder.

The examples are available [here as an interactive notebook over at Google Colab](https://colab.research.google.com/drive/1e1beWqYOcFidKl2IxHtxT5s9i_6KYuNY). The timings reported here are from this environment for reproducibility.

#### Reading Multiple Parquet Files

First we look at some rows in the dataset. There are three Parquet files in the `taxi/` folder. [DuckDB supports the globbing syntax](https://duckdb.org/docs/data/parquet), which allows it to query all three files simultaneously.

```py
@@ -113,6 +117,7 @@ Below are the timings for both of these queries.
Pandas takes significantly longer to complete this query. That is because Pandas not only needs to read each of the three Parquet files in its entirety, it also has to concatenate the three resulting DataFrames together.
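
For reference, the two approaches look roughly like this (a sketch; the timings discussed here come from the Colab notebook, not from this snippet):

```py
import glob

import duckdb
import pandas as pd

# DuckDB: the glob is handled inside the query itself
duck_head = duckdb.query("SELECT * FROM 'taxi/*.parquet' LIMIT 5").to_df()

# Pandas: every file is read in full, and the pieces are concatenated afterwards
pandas_head = pd.concat(
    pd.read_parquet(f) for f in glob.glob('taxi/*.parquet')
).head(5)
```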

#### Concatenate Into a Single File

We can address the concatenation issue by creating a single big Parquet file from the three smaller parts. We can use the `pyarrow` library for this, which has support for reading multiple Parquet files and streaming them into a single large file. Note that the `pyarrow` Parquet reader is the very same Parquet reader that is used by Pandas internally.

```py
@@ -125,6 +130,7 @@ pq.write_table(pq.ParquetDataset('taxi/').read(), 'alltaxi.parquet', row_group_s
Note that [DuckDB also has support for writing Parquet files](https://duckdb.org/docs/data/parquet#writing-to-parquet-files) using the COPY statement.
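
A rough sketch of what that looks like (the output file name is hypothetical, and unlike the `pyarrow` call above this does not set an explicit row group size):

```py
import duckdb

con = duckdb.connect()
# combine the three files into a single Parquet file using COPY
con.execute("""
    COPY (SELECT * FROM 'taxi/*.parquet')
    TO 'alltaxi_duckdb.parquet' (FORMAT PARQUET)
""")
```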

#### Querying the Large File

Now let us repeat the previous experiment, but using the single file instead.

```py
@@ -149,6 +155,7 @@ We can see that Pandas performs better than before, as the concatenation is avoi
For DuckDB it does not really matter how many Parquet files need to be read in a query.

#### Counting Rows

Now suppose we want to figure out how many rows are in our data set. We can do that using the following code:

```py
@@ -184,6 +191,7 @@ len(pandas.read_parquet('alltaxi.parquet', columns=['vendor_id']))
While this is much faster, it still takes more than a second, as the entire `vendor_id` column has to be read into memory as a Pandas column only to count the number of rows.
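
For comparison, the DuckDB side of this looks roughly as follows (a sketch):

```py
import duckdb

# DuckDB can answer the row count largely from Parquet metadata,
# so no column needs to be read into memory in full
print(duckdb.query("SELECT count(*) FROM 'alltaxi.parquet'").to_df())
```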

#### Filtering Rows

It is common to use some sort of filtering predicate to only look at the interesting parts of a data set. For example, imagine we want to know how many taxi rides occurred after the 30th of June 2019. We can do that using the following query in DuckDB:

```py
@@ -250,6 +258,7 @@ print(len(df[['pickup_at']].query("pickup_at > '2019-06-30'")))
| Pandas (native) | 0.26 |
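
For reference, the two sides of this comparison look roughly like this (a sketch; the Pandas variant mirrors the projected `query` call from the timing table above):

```py
import duckdb
import pandas as pd

# DuckDB: both the filter and the projection are pushed into the Parquet scan
print(duckdb.query("""
    SELECT count(*)
    FROM 'alltaxi.parquet'
    WHERE pickup_at > '2019-06-30'
""").to_df())

# Pandas: read only the pickup_at column, then filter it in memory
df = pd.read_parquet('alltaxi.parquet', columns=['pickup_at'])
print(len(df[df['pickup_at'] > '2019-06-30']))
```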

#### Aggregates

Finally, let's look at a more complex aggregation. Say we want to compute the number of rides per passenger. With DuckDB and SQL, it looks like this:

```py