A python package to scrape time series from any excel file. Like these ones:
And return them turned into pandas data frames.
If you want to have the complete repository with the tests:
If you are using Anaconda as your python distribution
conda create -n xlseries python=2
Create new environmentcd project_directory
source activate xlseries
Activate the environmentpip install -e .
Install the package in developer modepip install -r requirements.txt
Install dependenciesdeactivate
Deactivate when you are done
If you are using a standard python installation
cd project_directory
virtualenv venv
Create new environmentsource venv/bin/activate
Activate the environmentpip install -r requirements.txt
Install dependenciesdeactivate
Deactivate when you are done
If you just want to use it and don't care about the tests:
pip install xlseries
in your environment, instead of cloning and pip installing in developer mode.
from xlseries import XlSeries
xl = XlSeries("path_to_excel_file" or openpyxl.Workbook instance)
dfs = xl.get_data_frames("path_to_json_parameters" or parameters_dictionary)
With the test case number 1:
from xlseries import XlSeries
from xlseries.utils.path_finders import get_orig_cases_path, get_param_cases_path
# this will only work if you clone the repo with all the test files
path_to_excel_file = get_orig_cases_path(1)
path_to_json_parameters = get_param_cases_path(1)
xl = XlSeries(path_to_excel_file)
dfs = series.get_data_frames(path_to_json_parameters)
or passing only the critical parameters as a dictionary:
parameters_dictionary = {
"headers_coord": ["B1","C1"],
"data_starts": 2,
"frequency": "M",
"time_header_coord": "A1"
}
dfs = xl.get_data_frames(parameters_dictionary)
you can specify what worksheet you want to scrape (otherwise the first one will be used):
dfs = xl.get_data_frames(parameters_dictionary, ws_name="my_worksheet")
you can ask an XlSeries object for a template dictionary of the critical parameters you need to fill:
>>> params = xl.critical_params_template()
>>> params
{'data_starts': 2,
'frequency': 'M',
'headers_coord': ['B1', 'C1', 'E1-G1'],
'time_header_coord': 'A1'}
>>> params["headers_coord"] = ["B1","C1"]
>>> dfs = xl.get_data_frames(params, ws_name="my_worksheet")
if this doesn't work and you want to see exactly where the scraping is failing, you may want to fill out all the parameters and try again to see where the exception is raised:
>>> params = xl.complete_params_template()
>>> params
{'alignment': u'vertical',
'blank_rows': False,
'continuity': True,
'data_ends': None,
'data_starts': 2,
'frequency': 'M',
'headers_coord': ['B1', 'C1', 'E1-G1'],
'missing_value': [None, '-', '...', '.', ''],
'missings': False,
'series_names': None,
'time_alignment': 0,
'time_composed': False,
'time_header_coord': 'A1',
'time_multicolumn': False}
>>> params["headers_coord"] = ["B1","C1"]
>>> params["data_ends"] = 256
>>> params["missings"] = True
>>> dfs = xl.get_data_frames(params, ws_name="my_worksheet")
-
Excel file: Up to this development point the excel file should not be more complicated than the 7 test cases:
-
Parameters: Together with the excel file, some parameters about the series must be provided. These could be passed to get_data_frames() as a path to a JSON file or as a python dictionary.
xlseries
use about 14 parameters to characterize the time series of a spreadsheet, but only 4 of them are critical most of the time: the rest can be guessed by the package. The only difference between specifying more or less parameters than the 4 critical is the total time thatxlseries
will need to complete the task (more parameters, less time).- Go to the parameters section for a more detailed explanation about how to use them, and when you need to specify more than the basic 4 (
headers_coord
,data_starts
,frequency
andtime_header_coord
).
- Go to the parameters section for a more detailed explanation about how to use them, and when you need to specify more than the basic 4 (
Take a look to this [ipython notebook template](docs/notebooks/Example use case.ipynb) to get started!.
If you want to dig inside the test cases and get an idea of how far is going xlseries
at the moment, check out this [ipython notebook with the 7 test cases](docs/notebooks/Test cases.ipynb).
For more details go to the official repository on github: https://github.com/abenassi/xlseries