
Full DB Dump #29

Open
arnaudsm opened this issue Apr 28, 2024 · 5 comments

Comments


arnaudsm commented Apr 28, 2024

Is there a full data dump available somewhere?

I'm doing research and data visualization (and I suspect many here do too), which requires all the data at once.
Scraping the API is cumbersome and also consumes precious CPU time on this service.

A giant CSV, JSON, or SQL file updated once a month would be awesome.
Wikipedia and Stack Overflow provide a similar service, and their dumps are quite popular.


scipima commented Apr 28, 2024

Hi there, this is far from a full data dump, but I have started pulling data for the Plenary here: https://github.com/scipima/ep_vote_collect.git.
The README explains how to get either the data for the daily Plenary or for the full mandate.
Hope this helps,
Marco

@tfrancart

Datasets can be downloaded from the EP Open Data Portal: https://data.europarl.europa.eu/en/datasets

@arnaudsm
Author

Thank you for the suggestion, but the dataset portal only contains a fraction of the API data, and the 236 files have to be downloaded manually.

A full dump would be greatly appreciated.

In the meantime I'm working on a JS library to dump the API, similar to @scipima's work, and might open-source it at some point.

@tfrancart

> Thank you for the suggestion, but the dataset portal only contains a fraction of the API data,

Can you be more specific about this? What is in the API data that is not in the datasets? I can understand that the datasets are not as fresh as the API data, but other than that, I would expect the RDF content to be identical to what the API returns.

> and the 236 files have to be downloaded manually.

If one can scrape thousands of API calls, one can also scrape 236 file downloads :-) (in reality, 236 × 28 languages). This could be an alternative way to recreate a full DB dump (though, as I said, probably not as fresh) without stressing the API.
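Such a bulk download is easy to script. A minimal Python sketch, assuming a hypothetical URL template and placeholder dataset slugs (the portal's real download URLs and dataset identifiers would have to be looked up on the datasets page):

```python
import time
import urllib.request
from itertools import product

# All identifiers below are placeholders for illustration only; the real
# dataset slugs and download-URL scheme must come from the Open Data Portal.
LANGS = ["en", "fr", "de"]                # in reality, the 28 language codes
DATASETS = ["plenary-votes", "meetings"]  # placeholder dataset slugs

def dataset_urls(datasets, langs,
                 template="https://data.europarl.europa.eu/download/{ds}-{lang}.json"):
    """Build the dataset x language cross-product of download URLs."""
    return [template.format(ds=ds, lang=lang) for ds, lang in product(datasets, langs)]

def polite_download(urls, delay_s=1.0, fetch=None):
    """Fetch each URL sequentially, sleeping between requests to stay polite."""
    fetch = fetch or (lambda u: urllib.request.urlopen(u).read())
    results = {}
    for url in urls:
        results[url] = fetch(url)
        time.sleep(delay_s)
    return results
```

The `fetch` parameter is injectable so the loop can be tested without network access; swapping in a real HTTP client is a one-line change.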


arnaudsm commented May 3, 2024

@tfrancart I was thinking of /meetings/{event-id}/vote-results. Is there a way to retrieve it from the datasets page?

Thank you for your help, I am still new to this ecosystem. I have rate-limited my dump scripts for now.
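For reference, a rate-limited loop over that endpoint could look like the following minimal Python sketch. The base URL and the example event IDs are assumptions and should be checked against the official API documentation; only the /meetings/{event-id}/vote-results path comes from the discussion above.

```python
import time
import urllib.request

BASE = "https://data.europarl.europa.eu/api"  # assumed base URL; verify against the API docs

def fetch_vote_results(event_ids, min_interval_s=2.0, fetch=None):
    """Fetch /meetings/{event-id}/vote-results for each event, issuing at most
    one request per min_interval_s (a simple fixed-interval rate limit)."""
    fetch = fetch or (lambda url: urllib.request.urlopen(url).read())
    results = {}
    last = float("-inf")  # so the first request is never delayed
    for event_id in event_ids:
        wait = min_interval_s - (time.monotonic() - last)
        if wait > 0:
            time.sleep(wait)
        last = time.monotonic()
        results[event_id] = fetch(f"{BASE}/meetings/{event_id}/vote-results")
    return results
```

A fixed interval between requests is the simplest way to avoid stressing the service; a token-bucket limiter would allow short bursts if that ever matters.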
