-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathsec_notes.txt
350 lines (257 loc) · 11.5 KB
/
sec_notes.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
21/55 companies follow the basic format
Across those 21 companies, there are 83 rows - some of these are maybe not helpful categories, but they are in there
Looked through 10 of them and it all checks out
23/55 companies have a table that can be detected
11/55 companies do not have a table that can be detecteed, manual search necessary
fixed table detection step; new numbers:
29/55 companies follow the basic format
24/55 companies have a table w a problematic row
2/55 companies do not have a table that can be detected
of the 24 files with a table and a problematic row, i've gone through 2 of them:
there's no SEC fiel for this one, doesn't match the text: 0001104659-04-007210.txt
and the table for this one should be the table after the '19 years' statement: https://www.sec.gov/Archives/edgar/data/1063197/000106319704000009/htm10ksb.htm
- '0001063197-04-000009.txt'
0001193125-04-051895.txt - fixed identification of table in this file; was missed because we chose the wrong table
----- check in with suzie 5/28/2019 -----
- tables that need the millions v thousands check:
- 0001104659-04-007271.txt
- 0001193125-04-030895.txt
- 0001193125-04-030895.txt
- 0001193125-04-030895.txt
- 0001104659-04-007464.txt
- 0000950124-04-000754.txt
- 0001193125-04-163259.txt
- 0001193125-04-029352.txt
- 0001193125-04-029352.txt
- 0000047288-04-000009.txt
- 0001193125-04-195535.txt
- 0001193125-04-038916.txt
- 0001104659-04-008879.txt
- 0001104659-04-008933.txt
- rows that needed to be checked (and finisehd)
- torvec : 0001063197
- 2, 3 done
- https://www.sec.gov/Archives/edgar/data/1070772/000089256903002251/a93289e10vk.htm -- is this the type of table you'd like me to scrape?
- 5 done
32/55 companies follow the basic format
20/55 companies have a table with a problematic row
2/55 companies do not have a table that can be detected
1/55 comapnies definitely do not have the table in the file
----
40/55 companies scraped
12/55 comapnies have a table with a problematic row
2/55 companies do not have a table that can be detected
1/55 comapnies definitely do not have the table in the file
of those 12:
- https://www.sec.gov/Archives/edgar/data/75362/000110465904007210/ba_10k.htm: can't find the table in this one
- PACCAR, ignore
- 0000754009-04-000086.txt
- transposed table, probably not table tags....
- https://www.sec.gov/Archives/edgar/data/1053178/000110465904032708/a04-12212_110k.htm
- 0001104659-04-032708.txt
- doesn't include 3-5 years, throws the parser off
- put after 4 in after 5 years, leave 3-5 empty
- 0001104659-04-030276.txt
- this one really did have a table but no row
- https://www.sec.gov/Archives/edgar/data/855581/000110465904030276/a04-11300_110k.htm
- good!
- and torvec is in there
main improvements:
- added header detection for each of these tables
- some really big tables get the same score as the real table - improved the most likely table to address that
- added transportation to the fields to sense
- added logic for papers that have 8 columns - for each year
- added logic for papers that have a different position for total column
- manually looked through around 20 filings
----
43/55 companies scraped
8/55 companies have a table with a problematic row
2/55 companies do not have a table that can be detected
2/55 comapnies have tables, but no rows
todos
[X] 10 files (8 table + problematic row) to make list of issues
[ ] detecting tables with no table tags
[ ] take care of transpose
[ ] add in logic for after 4 years
[ ] find out if it says 'millions' in the nearby text or 'thousands'
[ ] run on 2005 directory
--- check in w suzie ---
[X] 0001063197 - torvec, no file
[X] 0000075362 - paccar, looked at with suzie - not an issue
[X] 0000855581 - correctly classified - table exists, no row
[X] 0000754009 - transposed table, with dashes
- waiting for suzie to look for the corresponding pdf file
- email with suzie said there is no table in the text file
[X] 0001053178 - after four years error
[X] 0001085653 - wrong table
[X] 0000832428 - wrong table
[ ] 0001099668 -
- ask suzie what bins to put 1-5 years in? split it in two?
- also getting the wrong table..
[ ] 0000733269 - table with dashes
[ ] 0000706015 - table is just in text
emailed suzie for feedback on the rows
notes on fixes:
0001106935 - actually just got the wrong table - fixed now
0000740260 - no row
0001053178 - fixed after four years error
0001085653 - wrong table - added a keyword to the most likely table filter (long term in addition to long-term)
0000832428 - wrong table
- https://www.sec.gov/Archives/edgar/data/832428/000119312504038916/d10k.htm
- another one where web doesn't match text - no table inside of document
left to do:
- finish text table parsing function
- see if i can get that commercial/agricultural table in 0001099668
- email suzie for text file of the ebay filing
- run on all 2005 files
-- next iteration --
0001106935 - actually just got the wrong table - fixed now
- had a table with 7 rows instead of 5, 6, or 8
0001099668 - commercial agricultural table
- count it as a lost cause; we would end up getting a lot of other garbage if we tried to include this one
0000832428
- this one doesn't have the same text table that the pdf filing does
- https://www.sec.gov/Archives/edgar/data/832428/000119312504038916/d10k.htm
-----------------------------
-- next iteration --
0000855581
0000832428
0001096352
0001099668
0001065088
0000754009
tasks left to handle:
[ ] 0000834162 -- transposed
[ ] 0001065088 -- transpose
[ ] 0001096352 -- new file
[ ] different denominations
[ ] run on all 2005 files
- 0001193125-04-038916.txt, 0000832428
- txt file doesn't have contractual table
- 0001096352 -- fixed transpose detection, was finding the wrong tables
- 0001099668 -- commercial agriculture table (there is no way)
ok finished all the files in the 2004 directory
takes care of transposes
accountign for when the total column doesn't exist
* still need to multiply by different denominators
- default will be by 1
- otehrwise look for thousands/millions/billions in the surrounding text
- if that info is there, multiply all values by that number
-------------------
73/18 files in the 2005 folder;
[ ] 1017172: wrong table; table is much smaller than the other ones
- 1016169: got the table; there really is no row
- 895347: no table
- 1166041: got the table; there is no row
[ ] 871763: table exists, need to investigate
- 879555: got the table; there is no row
[ ] 1074433: can't find the table
[ ] 217346: table exists, need to investigate
- 753281: table exists, there is no row
[ ] 1030839: can't locate the file on secexplorer
[ ] 1287750: maybe the wrong table? but likely just no row
[ ] 357020: table with 4 columns, need to investigate
[ ] 1099291: extra columns
- 2491: table, no rows
12 have problems to address
- 2 text-based tables, one with extra columns
[X] 1017172: wrong table; table is much smaller than the other ones
- had to fix most-likely-tables function
- transposing tables messes with rows that odn't have the same number of elements
[X] 871763: table exists, need to investigate
- had to take out cast to int, changed it to float
[X] 1287750: table exists in the pdf but not in the text file
- table exists (and we find the right table) but no row
- unless 'credit facility payable' is a row we want to scrape
[X] 357020: table exists in the pdf but not in the text file
- fixed most likely table function
[X] 1099291: extra columns
- fixed parse table function to handle extra columns
[ ] 0001074433 -- no relevant tables, ask suzie about it
[ ] 1030839: can't locate the file on secexplorer
[ ] 217346:
- the whole thing is a table (lost cause)
[ ] 1012127: text table
[X] 96988: text table
[ ] 1109279: text based table
[ ] 764794: text table with only 3 columns
27 new files
3 are parsed okay off the bat
11 where table is detected but row is not -- of these, 8 fall into cases we've discussed before and 3 are text tables that need to be parsed correctly
14 where no table is detected
20021102 -- no clear table
0000054291 - no clear table
0000077543 -- no clear table
- https://www.sec.gov/Archives/edgar/data/77543/000007754303000006/form10k_2002.htm
[ ] 0000040570 - text table that needs to be parsed
[X] 0000015486 - text table that needs to be parsed
- https://www.sec.gov/Archives/edgar/data/15486/000001548603000002/tenk2002.txt
- occurs thrice
- 0000037914 -- text table, but i don't think we can get this one
- https://www.sec.gov/Archives/edgar/data/37914/000003791403000002/p10k2002.txt
- 0000009389 -- no table in the html file or text
file
- https://www.sec.gov/Archives/edgar/data/9389/000000938903000044/f10k_2002.htm
- 0000051303 -- maybe has a transposed text table, but no relevant rows
- https://www.sec.gov/Archives/edgar/data/51303/000005130303000014/form10kfinal03.htm
- 0000022551 -- only a table for long-term debt, which we're not interested in
- https://www.sec.gov/Archives/edgar/data/22551/000002255103000001/doc1.txt
------------------
[X] 96988: text table
[X] 0000015486 - text table that needs to be parsed
todo list:
[ ] keep track of progress on files
[X] run on the others directory
[X] add in file report function
------
meeting
[X] 1000s keyword for multiplier thing and 'million'
[X] progress bar
[X] in between total table and total text format -- table tag
[ ] add keyword changing section to row_utils, make sure there's an option for words to ignore (capital leases and operating leases)
0000003333-04-000014.txt -- can't get the millions, now get the rows that relate to the construction etc
0000004904-04-000055.txt -- gets teh right row, but not the full heading (that might just be hard), now get sthe millions
[X] took care of all check files
----
[X] add keyword changing section to row_utils, make sure there's an option for words to ignore (capital leases and operating leases)
[X] fixed 'table.text' error
---
0000730255-09-000017.txt - wrong column name
0000939057-09-000370.txt - wrong column name
0001047469-03-010922.txt - wrong values
0001047469-03-010922.txt - wrong vlaues
0001437749-09-000266.txt - wrong values
0001437749-09-001934.txt - maybe a row wasn't scraped that should be
0001047469-03-011319.txt - values
0001437749-09-001701.txt - not the right table
0001437749-09-002044.txt - not the right table
0001445260-09-000018.txt - not the right table
0001445866-09-000016.txt - not the right table
0001457143-09-000005.txt - row should be scraped
anything where the table scraped is not the right table hasn't been addressed
0000730255-09-000017.txt
0000849502-09-000004.txt
0001002105-03-000070.txt -- value
0001047469-03-003730.txt -- value
0001047469-03-010922.txt -- value
0001047469-03-010922.txt -- value
done:
0000730255-09-000017.txt -- fixed the column for this file
0000939057-09-000370.txt -- fixed column name & category name
0001437749-09-001934.txt -- remove spacing in the category names
0001437749-09-000266.txt -- <1 instead of >5 .. remove spaces and \n in the category variable
-- had to redo how the text fields are combined in each table
0001047469-03-010922.txt -- have to fix column names (off by one)
0001047469-03-011319.txt - columns off by 1
0001457143-09-000005.txt - wrong table
-- updated the keywords to catch this table
0001437749-09-002044.txt - gets the right table, no appropriate rows
0001437749-09-001701.txt -- wrong table
0000939057-09-000370.txt -- fixed
0001144204-09-037210.txt
todo:
can't find the contractual obligations tablef or this one:
0001445260-09-000018.txt - not the right table
0001445866-09-000016.txt - not the right table
new errors:
0000939057-09-000247.txt - wrong table now