The European Research Council supports top researchers around the world, funding a number of projects that contribute greatly to our scientific understanding of the world. Happily, it makes the information about what it funds public; you can go straight to its website at http://erc.europa.eu/projects-and-results and begin browsing.
Unfortunately, this sort of browsing, while delightfully open, could do a better job of giving a curious citizen a top-down view. One can get a good understanding of a particular grant, but how would one answer questions like these:
- How has funding changed across the years?
- Which countries are funded the most?
- How does funding compare across scientific domains?
- What projects contain summaries with the word 'quantum'?
This project represents one answer. You can see the result at http://www.thingotron.com/erc_test/erc_treemap.html
Enclosed is the dump_erc.js file, a Casper program used to walk the various selections of the ERC website and download data. This process should absolutely not be used too often and is presented here mostly as an example of "something that works"; I would love to see government websites have their data available via XML SOAP or direct download, but in the absence of this, scraping will have to do. The process takes some time and has several built-in delays to let the webserver have a bit of a rest. I acknowledge that taking this much data from a website represents a fair load on its server and can only justify it by saying that if we take a snapshot once, we can do a great deal of analysis without bothering the server again.
New: The data comes from the CORDIS platform (http://cordis.europa.eu/projects/home_en.html) and data files can be downloaded directly from the EU Open Data Portal (http://open-data.europa.eu). The relevant datasets can be found by searching for "EU Research Projects under FP7" and "EU Research Projects under Horizon 2020". Each has a CSV file for projects and one for organizations.
This presents some difficulties, as the two sets of data do not make the same distinctions; for example, the role of "hostInstitution" in FP7 seems comparable to "Coordinator" in H2020. Likewise, FP7 seemed to track call domain (found in the "topics" line; for example, ERC-AG-LS6 would indicate an advanced grant in "immunity and infection"), but H2020 does not include this information. Additionally, many items such as "participantCountries" and "subjects" are not guaranteed to be populated.
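As a minimal sketch of how those two differences might be smoothed over, the snippet below splits an FP7 topic code into a grant scheme and a call domain, and folds the two role names into one. The function names parseErcTopic and normalizeRole are hypothetical, the pattern is inferred only from the single ERC-AG-LS6 example above, and treating "hostInstitution" as equivalent to "Coordinator" is the same guess made in the text, not something the data files document.

```javascript
// Hypothetical helper: split an FP7 topic code such as "ERC-AG-LS6"
// into a grant scheme ("AG") and a call domain ("LS6").
// The pattern is an assumption based on that one example.
function parseErcTopic(topic) {
  const match = /^ERC-([A-Za-z]+)-([A-Za-z]+\d+)$/.exec(String(topic || "").trim());
  if (!match) {
    return null; // H2020 rows, empty fields, or codes that do not fit the pattern
  }
  return { scheme: match[1].toUpperCase(), domain: match[2].toUpperCase() };
}

// Hypothetical helper: treat FP7's "hostInstitution" like H2020's "Coordinator".
// This equivalence is a guess, as noted above.
function normalizeRole(role) {
  const lower = String(role || "").toLowerCase();
  return lower === "hostinstitution" ? "coordinator" : lower;
}

console.log(parseErcTopic("ERC-AG-LS6"));
console.log(normalizeRole("hostInstitution"));
```

Returning null for anything that does not match keeps the missing-domain case explicit, so H2020 rows can be grouped under an "unknown domain" bucket instead of silently disappearing.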
The data is dumped in a simple JSON format representing an array of items; each item is a dictionary with the various fields of the site (project name, project acronym, funding, etc). A CSV file would have done just as well.
The JSON format is
[ ...projects... ] where each project is
{
"project" : project name
"acronym" : short project acronym
"pi" : principal investigator
"hi" : host institution
"country" : Name of country according to files
"call_details" : call details
"call_year" : year of grant
"call_domain" : call domain code (for example SH2)
"summary" : textual summary of the grant
"hi_website" : website of the host institution
"erc_funding" : number (in euros)--not string
"duration" : duration (string, as "60 months")
"category" : Category of grant ("starting grant" etc)
}
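Since erc_funding is meant to be a number while duration stays a string, a small clean-up step when loading records can be handy. The following is only a sketch: normalizeProject and durationMonths are hypothetical helpers, and the assumption that a raw funding value may arrive as a string with commas, spaces or a euro sign is mine, not something the dump is known to produce.

```javascript
// Hypothetical helper: make sure erc_funding is a number, not a string.
// Assumes any thousands separators are commas or spaces (an assumption).
function normalizeProject(raw) {
  const cleaned = String(raw.erc_funding).replace(/[^0-9.\-]/g, "");
  const funding = Number(cleaned);
  return Object.assign({}, raw, {
    erc_funding: Number.isFinite(funding) ? funding : 0
  });
}

// Hypothetical helper: read the month count out of a string like "60 months".
function durationMonths(duration) {
  const months = parseInt(duration, 10);
  return Number.isNaN(months) ? null : months;
}

// Made-up record, for illustration only.
const cleanedUp = normalizeProject({ acronym: "DEMO", erc_funding: "1,500,000", duration: "60 months" });
console.log(cleanedUp.erc_funding, durationMonths(cleanedUp.duration));
```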
I use d3.js and heavily leverage Mike Bostock's "Zoomable Treemap" example, but a fixed zoomable treemap is only so useful. I also use the NYT's PourOver library for basic filtering.
First, PourOver is used to filter the existing data set using the year/country/domain/category dropdown boxes on the left; to make things more tidy, I used bootstrap-multiselect to hide all the choices until in use, and to allow the "Domain" dropdown to have descriptions which did not take up too much space when not in use.
The dropdown filters are created using PourOver's makeExactFilter, using underscore.js to pluck unique values out of the list of data:
years = _.sortBy(_.uniq(_.pluck( arr, "call_year" )));
yearFilter = PourOver.makeExactFilter( "call_year", years );
The substring filter is implemented by extending a basic filter to cache results based on case-insensitive text search. It's a bit of a hack, but for a relatively small data set it works very well; search for the line
var SubstringFilter = PourOver.Filter.extend({
...
});
to see how it was done.
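For readers who only want the idea, here is a standalone sketch of a cached, case-insensitive substring search. It deliberately does not use PourOver, so it is not the SubstringFilter from the source; makeSubstringSearch and the sample records are made up for illustration.

```javascript
// Standalone sketch: cache the indices of matching items per lowercased query.
function makeSubstringSearch(items, field) {
  const cache = new Map(); // lowercased query -> array of matching indices
  // Lowercase every searchable string once, up front.
  const haystacks = items.map((item) => String(item[field] || "").toLowerCase());

  return function search(query) {
    const key = String(query).toLowerCase();
    if (!cache.has(key)) {
      const hits = [];
      haystacks.forEach((text, index) => {
        if (text.includes(key)) hits.push(index);
      });
      cache.set(key, hits);
    }
    return cache.get(key);
  };
}

// Made-up records, for illustration only.
const sampleProjects = [
  { acronym: "A", summary: "Quantum dots in silicon" },
  { acronym: "B", summary: "Medieval trade routes" },
  { acronym: "C", summary: "Topological QUANTUM matter" }
];
const searchSummaries = makeSubstringSearch(sampleProjects, "summary");
console.log(searchSummaries("quantum")); // [ 0, 2 ]
```

A linear scan per new query is fine for a few thousand grants; the cache only saves work when the same text is searched again, for example while other dropdown filters are being toggled.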
Next, the data is sorted into a tree by an ordered list of keys in the tree_from_array function.
This allows the list of data to be grouped by keys into an arbitrary ordering of the categorical keys year-category-domain-country; on the left of the navigation is a list of those keys, which use jQuery UI's sortable interface to be draggable in any order, and a Stop Breakdown tile to halt the breakdown at that point (for example, if you don't want to "drill" past year, you could drag year to the top and then Stop Breakdown).
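The grouping step can be sketched roughly as follows. This is a hypothetical re-implementation, not the tree_from_array function from the source: the names treeFromArray and activeKeys, the "STOP" marker, and the choice to make individual grants the leaves once the keys run out are all assumptions.

```javascript
// Hypothetical sketch: nest a flat list of grants by an ordered list of keys.
function treeFromArray(items, keys, name) {
  const node = { name: name || "all", value: 0 };
  if (keys.length === 0) {
    // No keys left: individual grants become the leaves.
    node.children = items.map((item) => ({
      name: item.acronym,
      value: item.erc_funding,
      data: item
    }));
  } else {
    // Group by the first key, then recurse with the remaining keys.
    const groups = new Map();
    items.forEach((item) => {
      const groupName = String(item[keys[0]]);
      if (!groups.has(groupName)) groups.set(groupName, []);
      groups.get(groupName).push(item);
    });
    node.children = Array.from(groups, ([groupName, members]) =>
      treeFromArray(members, keys.slice(1), groupName)
    );
  }
  // Each node's value is the total funding beneath it.
  node.value = node.children.reduce((sum, child) => sum + child.value, 0);
  return node;
}

// Keep only the keys that sit above the (assumed) "STOP" marker.
function activeKeys(order) {
  const stop = order.indexOf("STOP");
  return stop === -1 ? order : order.slice(0, stop);
}

// Made-up grants, for illustration only.
const demoGrants = [
  { acronym: "A", call_year: "2012", country: "FR", erc_funding: 100 },
  { acronym: "B", call_year: "2012", country: "DE", erc_funding: 250 },
  { acronym: "C", call_year: "2013", country: "FR", erc_funding: 400 }
];
const demoTree = treeFromArray(demoGrants, activeKeys(["call_year", "country", "STOP", "category"]));
console.log(demoTree.value); // 750
```

Because the key order is just an array, dragging the tiles into a new order only requires rebuilding the tree with a differently ordered list.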
The bulk of the display is the treemap, which shows the existing level and the basic structure of the next level down. I supplemented that with a series of bars on the right which help to visualize comparative size of budget. Once you drill down to the bottom level and are examining individual grants, hovering over a block will show a summary and clicking on the block will display the relevant data in the "Last Selected Project Details" block on the bottom.
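One simple way to size such comparison bars is to scale every entry against the largest one at the current level. The sketch below assumes exactly that; the source does not say how its bars are scaled, and barWidths along with the name/value fields are hypothetical.

```javascript
// Hypothetical sketch: scale each child's budget against the largest sibling.
function barWidths(children, maxWidthPx) {
  const largest = Math.max(0, ...children.map((child) => child.value));
  return children
    .slice() // copy so the caller's array is not reordered
    .sort((a, b) => b.value - a.value)
    .map((child) => ({
      name: child.name,
      value: child.value,
      width: largest > 0 ? Math.round((child.value / largest) * maxWidthPx) : 0
    }));
}

// Made-up totals, for illustration only.
const demoBars = barWidths(
  [{ name: "FR", value: 50 }, { name: "DE", value: 200 }],
  100
);
console.log(demoBars.map((bar) => bar.name + ":" + bar.width)); // [ 'DE:100', 'FR:25' ]
```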
Hopefully this will be of use to those looking to render hierarchical data in a flexible and reorderable manner. Best of luck!