Add company dataset abstraction #218

Open
wants to merge 12 commits into
base: master
1 change: 1 addition & 0 deletions .gitignore
Expand Up @@ -6,5 +6,6 @@
.tox/
__pycache__/
build/
data/
dist/
htmlcov/
6 changes: 0 additions & 6 deletions .landscape.yml

This file was deleted.

76 changes: 41 additions & 35 deletions README.rst
Expand Up @@ -2,32 +2,24 @@
:target: https://travis-ci.org/okfn-brasil/serenata-toolbox
:alt: Travis CI build status (Linux)

.. image:: https://readthedocs.org/projects/serenata-toolbox/badge/?version=latest
:target: http://serenata-toolbox.readthedocs.io/en/latest/?badge=latest
:alt: Documentation Status

.. image:: https://landscape.io/github/okfn-brasil/serenata-toolbox/master/landscape.svg?style=flat
:target: https://landscape.io/github/okfn-brasil/serenata-toolbox/master
:alt: Code Health

.. image:: https://coveralls.io/repos/github/okfn-brasil/serenata-toolbox/badge.svg?branch=master
:target: https://coveralls.io/github/okfn-brasil/serenata-toolbox?branch=master
:alt: Coveralls

.. image:: https://badge.fury.io/py/serenata-toolbox.svg
:alt: PyPI package version

.. image:: https://img.shields.io/pypi/pyversions/serenata_toolbox
:alt: PyPI - Python Version

.. image:: https://img.shields.io/badge/donate-apoia.se-EB4A3B.svg
:target: https://apoia.se/serenata
:alt: Donation Page

Serenata de Amor Toolbox
========================

`pip <https://pip.pypa.io/en/stable/>`_ installable package to support `Serenata de Amor <https://github.com/okfn-brasil/serenata-de-amor>`_
and `Rosie <https://github.com/okfn-brasil/serenata-de-amor/blob/master/rosie/README.md>`_ development.

Serenata_toolbox is compatible with Python 3.6+
Python package to support `Serenata de Amor <https://github.com/okfn-brasil/serenata-de-amor>`_ development. ``serenata_toolbox`` is compatible with Python 3.6+.

Installation
------------
Expand All @@ -36,39 +28,50 @@ Installation

$ pip install -U serenata-toolbox

If you are a regular user you are ready to get started after `pip install`.

If you are a core developer willing to upload datasets to the cloud you need to configure `AMAZON_ACCESS_KEY` and `AMAZON_SECRET_KEY` environment variables before running the toolbox.

Usage
-----

We have `plenty of them <https://github.com/okfn-brasil/serenata-de-amor/blob/51fad8c807cb353303c5f5a3f945693feeb82015/CONTRIBUTING.md#datasets-researchdata>`_ ready for you to download from our servers. And this toolbox helps you get them. Here some examples:
This toolbox helps you get datasets used in `Serenata de Amor services <https://github.com/okfn-brasil/serenata-de-amor>`_ and `notebooks <https://github.com/okfn-brasil/notebooks>`_.

Example 1: Using the command line wrapper
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Example 1: Using the CLI
^^^^^^^^^^^^^^^^^^^^^^^^

Without any arguments, it downloads our pre-processed datasets and stores them in the ``data`` folder:

.. code-block:: bash

# without any arguments will download our pre-processed datasets and store into data/ folder
$ serenata-toolbox

# will download these specific datasets and store into /tmp/serenata-data folder
You can also choose which datasets to download and where to save them. For example, to download the ``chamber_of_deputies`` and ``federal_senate`` datasets to ``/tmp/serenata-data``:

.. code-block:: bash

$ serenata-toolbox /tmp/serenata-data --module federal_senate chamber_of_deputies

# you can specify a dataset and a year
Available modules are ``chamber_of_deputies``, ``companies`` and ``federal_senate``.

You can also restrict the download to a specific year:

.. code-block:: bash

$ serenata-toolbox --module chamber_of_deputies --year 2009

# or specify all options simultaneously
$ serenata-toolbox /tmp/serenata-data --module federal_senate --year 2017
Or combine all the options:

.. code-block:: bash

$ serenata-toolbox /tmp/serenata-data --module federal_senate companies --year 2017

Finally, you might want to get help:

.. code-block:: bash

# getting help
$ serenata-toolbox --help
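
To make the argument layout above concrete, here is a minimal sketch of how a parser accepting the same invocations could look. It is not the toolbox's actual implementation: ``build_parser`` and ``MODULES`` are hypothetical names, and the only details taken from this README are the optional target directory, the ``--module`` and ``--year`` options, the three module names, and the ``data`` default.

```python
import argparse

# Module names as listed in this README.
MODULES = ("chamber_of_deputies", "companies", "federal_senate")


def build_parser():
    """Hypothetical parser mirroring the CLI examples (not the real one)."""
    parser = argparse.ArgumentParser(prog="serenata-toolbox")
    # Optional positional: where to store the downloaded datasets.
    parser.add_argument("directory", nargs="?", default="data")
    # One or more module names; omitted means "no filter".
    parser.add_argument("--module", nargs="+", choices=MODULES)
    # Optional single year.
    parser.add_argument("--year", type=int)
    return parser


# Same arguments as the "combine all the options" example.
args = build_parser().parse_args(
    ["/tmp/serenata-data", "--module", "federal_senate", "companies", "--year", "2017"]
)
print(args.directory, args.module, args.year)
```

Run ``serenata-toolbox --help`` to see the options the real command accepts.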

Example 2: How do I download the datasets?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Example 2: Using Python
^^^^^^^^^^^^^^^^^^^^^^^

Another option is creating your own Python script:
Another option is creating your own Python scripts:

.. code-block:: python

Expand Down Expand Up @@ -100,12 +103,13 @@ If the last example doesn't look that simple, there are some fancy shortcuts ava
Example 4: Generating datasets
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If you ever wonder how did we generated these datasets, this toolbox can help you too (at least with the more used ones — the other ones are generated `in our main repo <https://github.com/okfn-brasil/serenata-de-amor/blob/51fad8c807cb353303c5f5a3f945693feeb82015/CONTRIBUTING.md#the-toolbox-and-our-the-source-files-researchsrc>`_):
If you ever wonder how we generated these datasets, this toolbox can help with that too (at least for the most used ones; the others are generated `in our main repo <https://github.com/okfn-brasil/serenata-de-amor/blob/51fad8c807cb353303c5f5a3f945693feeb82015/CONTRIBUTING.md#the-toolbox-and-our-the-source-files-researchsrc>`_):

.. code-block:: python

from serenata_toolbox.federal_senate.dataset import Dataset as SenateDataset
from serenata_toolbox.chamber_of_deputies.reimbursements import Reimbursements as ChamberDataset
from serenata_toolbox.companies.dataset import Dataset as CompaniesDataset
from serenata_toolbox.federal_senate.dataset import Dataset as SenateDataset

chamber = ChamberDataset('2018', 'data/')
chamber()
Expand All @@ -115,6 +119,9 @@ If you ever wonder how did we generated these datasets, this toolbox can help yo
senate.translate()
senate.clean()

companies = CompaniesDataset('data/')
companies()

Documentation (WIP)
-------------------

Expand All @@ -128,30 +135,29 @@ The `full documentation <https://serenata-toolbox.readthedocs.io>`_ is still a w
Contributing
------------

Firstly, you should create a development environment with Python's `venv <https://docs.python.org/3/library/venv.html#creating-virtual-environments>`_ module to isolate your development.
Then clone the repository and build the package by running:
Firstly, you should create a development environment with Python's `venv <https://docs.python.org/3/library/venv.html#creating-virtual-environments>`_ module to isolate your development. Then clone the repository and build the package by running:

.. code-block:: bash

$ git clone https://github.com/okfn-brasil/serenata-toolbox.git
$ cd serenata-toolbox
$ python setup.py develop

Always add tests to your contribution — if you want to test it locally before opening the PR:
Always add tests to your contribution — if you want to test it locally before opening the PR:

.. code-block:: bash

$ pip install tox
$ tox

When the tests are passing, also check for coverage of the modules you edited or added — if you want to check it before opening the PR:
When the tests are passing, also check for coverage of the modules you edited or added — if you want to check it before opening the PR:

.. code-block:: bash

$ tox
$ open htmlcov/index.html

Follow `PEP8 <https://www.python.org/dev/peps/pep-0008/>`_ and best practices implemented by `Landscape <https://landscape.io>`_ in the `veryhigh` strictness level — if you want to check them locally before opening the PR:
Follow `PEP8 <https://www.python.org/dev/peps/pep-0008/>`_ and its best practices implemented by `Landscape <https://landscape.io>`_ in the `veryhigh` strictness level — if you want to check them locally before opening the PR:

.. code-block:: bash

Empty file.
67 changes: 67 additions & 0 deletions serenata_toolbox/companies/cnae.py
@@ -0,0 +1,67 @@
import re
from tempfile import NamedTemporaryFile

import requests
from openpyxl import load_workbook

from serenata_toolbox import log


class Cnae:
    """This database abstraction complements the CNPJ dataset with economic
    activities (CNAE) description that comes from a separate file from the
    Federal Revenue."""

    CHUNK = 2 ** 12
    CNAE_DESCRIPTION_FILE = (
        "https://cnae.ibge.gov.br"
        "/images/concla/documentacao/"
        "CNAE_Subclasses_2_3_Estrutura_Detalhada.xlsx"
    )

    def __init__(self):
        self._activities = dict()  # cache

    @staticmethod
    def parse_code(code):
        if not code:
            return

        # keep digits only, e.g. drop dots, dashes and slashes
        cleaned = re.sub(r"\D", "", code)
        try:
            return int(cleaned)
        except ValueError:
            return

    def load_activities(self):
        log.info("Fetching CNAE descriptions…")
        with NamedTemporaryFile(suffix=".xlsx") as tmp:
            response = requests.get(self.CNAE_DESCRIPTION_FILE)

            with open(tmp.name, "wb") as fobj:
                log.debug(f"Downloading {response.url} to {tmp.name}…")
                for chunk in response.iter_content(self.CHUNK):
                    if chunk:
                        fobj.write(chunk)

            wb = load_workbook(tmp.name)
            for row in wb.active.rows:
                code = self.parse_code(row[4].value)
                description = row[5].value
                if not all((code, description)):
                    continue

                self._activities[code] = description

    @property
    def activities(self):
        """Dictionary with the descriptions of the economic activity (CNAE)
        not included in the Receita Federal dataset."""
        if self._activities:
            return self._activities

        self.load_activities()
        return self._activities

    def __call__(self, code):
        return self.activities.get(code)
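
The two ideas in this class, normalizing a formatted code to an integer and loading the lookup table only on first access, can be tried without network access. The following is a minimal sketch rather than the PR's code: ``LazyLookup``, ``fake_loader`` and the ``loads`` counter are hypothetical names added for illustration, and the sample code and description are made-up placeholders, not real CNAE data.

```python
import re


def parse_code(code):
    """Strip every non-digit character and return an int, or None."""
    if not code:
        return None
    cleaned = re.sub(r"\D", "", code)
    try:
        return int(cleaned)
    except ValueError:  # nothing but non-digits, so int("") fails
        return None


class LazyLookup:
    """Hypothetical stand-in for the cache-on-first-access pattern."""

    def __init__(self, loader):
        self._loader = loader  # callable returning a {code: description} dict
        self._activities = {}
        self.loads = 0  # counts loader calls, for demonstration only

    @property
    def activities(self):
        if not self._activities:
            self._activities = self._loader()
            self.loads += 1
        return self._activities

    def __call__(self, code):
        return self.activities.get(code)


def fake_loader():
    # Placeholder data standing in for the downloaded spreadsheet.
    return {parse_code("0111-3/01"): "placeholder description"}


lookup = LazyLookup(fake_loader)
first = lookup(111301)   # triggers the loader
second = lookup(111301)  # served from the cache
print(first, lookup.loads)
```

Two details are worth noting. Converting to ``int`` drops leading zeros, so lookups must use the integer form of the code. And because the cache check relies on the dictionary being non-empty, a load that yields no rows would be retried on every access.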