-
Notifications
You must be signed in to change notification settings - Fork 73
Cookbook
Please feel free to add qsv recipes to the Cookbook!
using qsv, ckanapi, jq and xargs.
- get a CSV of datasets/users/groups/orgs in a CKAN instance
$ ckanapi -r https://demo.ckan.org dump datasets --all | qsv jsonl > datasets.csv
$ ckanapi -r https://demo.ckan.org dump users --all | qsv jsonl > users.csv
$ ckanapi -r https://demo.ckan.org dump groups --all | qsv jsonl > groups.csv
$ ckanapi -r https://demo.ckan.org dump organizations --all | qsv jsonl > organizations.csv
- get a CSV of resources for a given dataset
$ ckanapi -r https://catalog.data.gov action package_show \
id=low-altitude-aerial-imagery-obtained-with-unmanned-aerial-systems-uas-flights-over-black-beach \
| jq -c '.resources[]' \
| qsv jsonl \
> resources.csv
- get the latest version of a CSV resource for
wellstar-oil-and-gas-wells1
on https://data.cnra.ca.gov, run stats on it
$ ckanapi -r https://data.cnra.ca.gov action package_show id="wellstar-oil-and-gas-wells1" \
> wellstar-oil-and-gas-wells.json
$ cat wellstar-oil-and-gas-wells.json \
| jq -c '.resources[] | select(.name=="CSV") | .url' \
| xargs -L 1 wget -O wellstar.csv
$ qsv stats --everything wellstar.csv > wellstar-stats.csv
- add a column with Quarter information calculated from "Created Date"
Ingredients:- nyc311samp.csv
- getquarter.lua
-
date.lua
module - can be downloaded fromsrc
at https://github.com/Tieske/date.
$ qsv luajit map Quarter -x -f getquarter.lua nyc311samp.csv > result-qtr.csv
Note: Change the column name on line 7 of
getquarter.lua
to adapt it for your use.
Also, you need to have thedate.lua
module in the same directory.
- Partition 311 files by Quarter. Create the files in the
nyc311byqtr
directory
$ qsv partition Quarter nyc311byqtr result-qtr.csv
- convert "Created Date" to ISO-8601 format
$ qsv apply datefmt "Created Date" nyc311samp.csv > result-iso8601.csv
- format "Created Date" using the format string ""%a %b %e %T %Y %z"
$ qsv apply datefmt "Created Date" --formatstr "%a %b %e %T %Y %z" nyc311samp.csv > result-datefmt.csv
- create a "Created Year" column from "Created Date"
$ qsv apply datefmt "Created Date" --formatstr "%Y" --new-column "Created Year" nyc311samp.csv \
> result-year.csv
- compute Turnaround Time. Store it in a new column named "TAT"
Ingredient:
$ qsv luajit map TAT -x -f turnaroundtime.lua nyc311samp.csv > result-tat.csv
- What is the average turnaround time in Brooklyn for 311 calls?
$ qsv search --select City --ignore-case brooklyn nyc311samp.csv \
| qsv luajit map TAT -x -f turnaroundtime.lua \
| tee brooklyn-311-details.csv \
| qsv stats --everything > result-brooklyn311-stats.csv
NOTE: The
tee
command reads from stdin and writes to both stdout and one or more files at the same time. We do this so we can create thebrooklyn-311-details.csv
file and pipe the same data to theqsv stats
command.
- using the "Location" column, geocode the nearest city in a new column named "City"
$ qsv apply geocode Location --new-column City nyc311samp.csv > result-geocoded.csv
Note: The bundled static geocoder uses the
reverse-geocoder
crate. It uses the geonames cities database, and geocodes to the closest city.
- geocode the county in a new column
$ qsv apply geocode Location --new-column County --formatstr county nyc311samp.csv > result-county.csv
This example was inspired by having to combine multiple tables exported from another system, which were themselves from multiple database joins. Suppose you have have several tables (table_*.csv
) which have the same first 10 columns, and then a varying number of additional columns. The column we want to join on is column 2, and for simplicity assume the rows all match perfectly (otherwise you would explore the left and right join options).
cp table_A.csv combined.csv
for NEXT in table_B.csv table_C.csv table_D.csv; do
qsv join 2 combined.csv 1 <(qsv select 2,11- $NEXT) > new.csv
mv new.csv combined.csv
done
We use a loop to perform multiple joins. Each time we use xsv select
to pull out the index (join column 2) and the columns unique to that file (11 onwards), which could also be done with cut -s "," -f 1,11- $NEXT
if preferred. The join column becomes column 1 of the intermediate file.
The proposed qsv join --merge
option would stop duplication of the join column.