-
Notifications
You must be signed in to change notification settings - Fork 21
Pune 2017 18 Budget data processing, flat tables and hierarchy
Working on 2017-18 Budget Data
Author: Nikhil VJ
Date : during first half of August 2017 and first week and last week of September 2017
Screenshot of a table from Budget excel published by PMC:
To bring Pune Municipal Corporation's 2017-18 Budget book data to a stage where it can be read by a program and be open for visualizations, data analysis etc.
The Pune budget document has a multi-hierarchical nature. To get it to a stage where it can be read by a program and be open for visualizations, data analysis etc
, it needs two things : a flattened table to load into a db, and a separate parent-child codes sheet to express the complex hierarchies of the budget book that get lost when we flatten the tables.
- A_Revenue_Income
- A_Revenue_Expenditure
- A_Revenue_List
- A_Capital_Expenditure
- A_Land_Acquisition
- A_Special_Fund
- A_DPDC_Machinery_Tree
- A_Capital_List
- C_Revenue_Income
- C_Revenue_Expenditure
- C_Revenue_List
- C_Capital_Expenditure
- C_Capital_List
- JNNURM
- SPV
- SMRY1718
- CitizensNonPlan1718
- CitizensPlan1718
The A_Revenue_Expenditures sheet has many entries that are called transfers to other sections. It seems like the other expenditure sheets 'get their funding' from Revenue_Expenditures.
To do: trace the budget items showing transfers to other sections, verify figures.
Each sheet is to be cleaned up to make it program-readable.
- Unfreeze top rows/columns
- Put a cleaner header row at top, from the corresponding sheet in this file: 2017-18 budget simplified headers.ods.
- delete the original headers with the merged cells etc.
- Delete the lines with blank code cell, or which are just totalling etc.
Note from Nikhil : Not to be a spoilsport but this is easier to do in ms-excel than libre-calc, because when you set the filter to show just blanks and then select all the rows displayed and delete them, in excel it doesn't also delete all the other invisible (filtered) rows in between of the ones you have selected.
(eg: if row 10 and row 100 were blanks, if I used the filter to show only them and then shift- or drag- selected them both and deleted the rows, in libre-calc it also deletes rows 11 thru 99 whereas ms-excel leaves them untouched as desired.)
- There are lines with code like "CE11A" that stand for total of other lines.. we can delete these lines. If not sure, then keep them. You can find these by searching for "बेरीज" (but check before deleting!)
- There may be multiple lines with code like "#+" and they probably have expenditures listed only for previous years or are for incomplete works, and you do want to include them in the db as miscellaneous entries. Assign serial numbered codes to them in this format: [prefix][MSC][001].
-
- Example : Capital Expenditures : CEMSC001 (CE code, MSC subheader, 001 serial)
(you'll get the prefix code from the other entries in that table.)
- Example : Capital Expenditures : CEMSC001 (CE code, MSC subheader, 001 serial)
We would now have the budget data in a flat tabular structure. What's more, headers are standardised across tables, so for example 2017-18 budgeted expenditure has the same column name in all sheets.
Screenshot of same table, cleaned up:
File prepared :: 2017-18 SC flat tables-working.ods
Note: This is not final file; scroll to bottom for downloading final work.
The Pune budget document has a multi-hierarchical nature. See this screenshot to see an example of the hierarchy of one particular line item.With flattening of the tables in the budget doc, some vital information gets lost. See this screenshot for example:
Info in selected cells is lost when the table is flattened; and without this info the line items by themselves won't make sense.
Plus, in the expanded lists, budget codes no longer implicitly reflect the right hierarchy, possibly due to re-assigning of departments etc. Example: CE20C106 comes under CE20C551#
So there is a need to express this hierarchy and include these name/titles so they can be retrieved when we visualize the budget data via the flat tables.
We start with a parent-child codes table, also known as self-referencing table. Below is a sample table:
code | parentcode | name |
---|---|---|
grandparent | root | Granny |
parent | grandma | Mom |
child | parent | Kid |
pet | child | Kitty |
This can be interpreted by a program as:
grandparent
|-parent
|-child
|-pet
The sections are still kept separate, as at this point trying to really establish parent-child lineage between the sheets may get too complicated.
File prepared : 2017-18 budget parent-child relationship.ods
This file captures the budget book's hierarchy through self-referencing / parent-child tables. Here's a screenshot:
budget sheet | hierarchy sheet |
---|---|
A_Revenue_Income | RI |
A_Revenue_Expenditure | RE |
A_Revenue_List | RE |
A_Capital_Expenditure | CE |
A_Land_Acquisition | FE |
A_Special_Fund | FE |
A_DPDC_Machinery_Tree | FE |
A_Capital_List | CE |
C_Revenue_Income | XI |
C_Revenue_Expenditure | XE |
C_Revenue_List | XE |
C_Capital_Expenditure | ZE |
C_Capital_List | ZE |
JNNURM | FE |
SPV | FE |
SMRY1718 | ADM |
CitizensNonPlan1718 | PB |
CitizensPlan1718 | PB |
As seen above, data from multiple sections comes in common code sheets. Example: both A_Revenue_Expenditure
and A_Revenue_List
have their hierarchy expressed in RE
. That is because A_Revenue_List
has items whose parents are in A_Revenue_Expenditure
.
- RE = Revenue (running) Expenditure
- CE = Capital Expenditure
- FE = Special Expenditure
- XE = Water Budget (Part C) Revenue Expenditure
- ZE = Water Budget (Part C) Capital Expenditure
- ADM = Ward Office Expenditure
- PB = Participatory Budget Expenditure
- RI = Revenue (running) Income
- XI = Water Budget Revenue (running) Income
Hierarchical JSON is a de facto standard for building visualizations for hierarchical data. In this project we have expressed the hierarchy in a self-referencing tabular form. It's hard to find open source visualizations that use that, but there are many json users. See this page for examples.
That said, it is much tougher for a person to CREATE this kind of data in the nested JSON form, and much easier to do in tabular form. Hence we start with the tabular form. And once made, we can convert it to JSON via a program.
Convert tabular:
to JSON:
Nikhil has created a script, hosted on a webpage, to do exactly this:
Self-referencing table to Hierarchical JSON converter
So, from the budget hierarchy spreadsheet that we created, each sheet was output as CSV, and then fed into the converter to generate corresponding JSON. The CSVs and JSONs are stored in the repo in the budget/hierarchy2017-18 folder.
Involves both part 1 and 2 which were done in tandem.
Some line items with same code were found across Capital Expenditures and Capital List tables.
repetitions | code
2 | CE20C106
2 | CE20C466
2 | CE20C533
2 | CE20C536
2 | CE22A109
2 | CE22A133
2 | CE22A362
They all are zero in the main list, but are present in extended lists as sub-items under other items, with a value.
How to resolve : suffix a _0
for the main list entries to differentiate them.
Confirmed that there are no such repeaters in the other sheets.
Why this is needed : The budget code is the primary key for this data. So, it needs to be non-NULL and unique.
The works listed in CitizensPlan1718 and CitizensNonPlan1718 sheets are Participatory Budget works, but they are also a mixup of expanded lists of various parent items listed in other sheets. The parent items are as follows:
RE20A302/C2-
RE20C302/C1-
RE22C302/C4-
RE14B302/C5-
XE12B302/C6-
XE23A302/C7-
CE20A1196/C2-
CE20C406/C1-
CE22A547/C4-
FE14A121/C5-
ZE16A117/C6-
ZE16B130/C7-
Example: 'RE20A302/C2-' corresponds to line item 'RE20A302#' in Revenue Expenses sheet. See screenshot.
In the parent-child relationship, independently made a 'PB' sheet, defined a header 'PB', defined these parent items with common parent 'PB', and listed all the line items in both 'CitizensPlan1718' and 'CitizensNonPlan1718' sheets under them.
-
Extra: Shell script used for getting parentcodes from a list of codes : take substring up till '-'.
cat pblist1 | cut -d'-' -f1 > pblist2
-
Amounts were in Rs. lakhs. They had to be multiplied by 1e5 (100,000) as in the rest of the budget data the values are is Rs. only.
The sheet SMRY1718
lists works at Admin ward level. There are 15 admin wards in the city. Till the previous year there used to be works listed here in plan and nonplan sheets just like how the Participatory Budget is done. But this year we see only a table of totals per admin ward. Usually each of the categories (roads, water supply etc) are an expansion of a budget item in one of the main expenditure sheets. But this year, those items have 0
budget allocation. Listing them below
RE14B301/W5-#
RE20A301/W2-#
RE20C301/W1-#
RE22C301/W4-#
XE12B301/W6-#
XE23A301/W7-#
So, with just 15 line items in the admin ward sheet this time, and no specific budget code, codes ADM01
through ADM15
were created for the flat table. In the hierarchy doc, these were put in a separate sheet named ADM
.
- And amounts were changed to be in Rs. instead of in Rs. lakhs.
.. because there's a difference between no data and rs.0 budgeted. The original budget doc has a number formatting rule to keep invisible all cells having 0 value. Screenshot:
Find + Replace command works well, with entire cells
option selected. Select all sheets
to replace all the zero's at a go.
- XE12B302/C6-54 : changed
2"
to2ft
to avoid having double-quotes char in the data. - At a few places, undid line wrapping in the work description (ie, made it from multi line to single line)
After the various tables / sections / lists data has been flattened, we now have to combine them.
- Listed out the columns from all the tables.
- Optimized the column titles, translating to english, and keeping titles same for common fields (like
2017-18 SC budgeted
) - Reflected optimized titles back to the flat tables doc.
- Calculated which columns repeat in how many lists.
- Created a combined layout for the expenditure tables that includes all columns used by various lists.
- Added a "list" and a "list sr" column to enable traceback.
- Working doc for planning this: 2017-18 combined layout planning.ods
- Final combined expenditures layout (total 28 columns):
. | Column | Features in these many lists |
---|---|---|
A | list | -- |
B | list sr | -- |
C | code | 16 |
D | work | 16 |
E | 2017-18 SC budgeted | 16 |
F | ward | 10 |
G | BLP | 4 |
H | 2013-14 actual exp | 3 |
I | 2014-15 actual exp | 3 |
J | 2015-16 actual exp | 4 |
K | actual expenditure till 31 march 2015 | 1 |
L | actual expenditure till 31 march 2016 | 2 |
M | 2016-17 budgeted | 7 |
N | 2016-17 budgeted: from debt | 2 |
O | 2017-18: from debt | 3 |
P | 2017-18: nonplan and other | 3 |
Q | allotted till 31 march 2017: from debt | 3 |
R | allotted till 31 march 2017: from nonplan | 3 |
S | allotted till 31 march 2017: total | 3 |
T | project budget | 7 |
U | revised allocation | 1 |
V | no. of wards | 1 |
W | road | 3 |
X | electrical | 3 |
Y | building | 3 |
Z | SRA | 3 |
AA | water supply | 3 |
AB | drainage | 3 |
- Saved the flat tables as a new file: 2017-18 SC flat tables-reordering.ods
- Shifted the columns in the various sheets so they all reflect this new layout. Added the "list" and "list sr" columns to all and populated them (second one is a serial number for that sheet).
- Copied rows from all the expenditure sheets to a new sheet "combined exp".
- Exported the combined sheet into a CSV with delimiting char set as semicolon (
;
) (choosing this instead of comma (,
) because many work description cells have comma's but there is no semicolon in the data).
File prepared : 2017-18-SC-budget-exp-combined.csv
While there are 7 separate expenditure sections found in the hierarchy doc (RE, CE, FE, XE, ZE, ADM, PB), they are all not getting funds independently. Here let us trace the budget items that actually stand for re-allocation of outlay to another section.
Main focus: RE (A_Revenue_Expenditure)
code | work | 2017-18 SC budgeted |
---|---|---|
RE61A101 | 59. अंदाजपत्रक 'क' कडे वर्ग | -25,33,67,047 |
... yet to be completed.
- Data: 2017-18-SC-budget-exp-combined.csv
Hierarchy:
- RE.json
- CE.json
- FE.json
- XE.json
- ZE.json
- ADM.json
- PB.json
These can be used for budget data viz etc.
Next : Let's Visualize! Budget data visualizer