-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathasagi_archive_image_exporter_notes.txt
159 lines (87 loc) · 3.55 KB
/
asagi_archive_image_exporter_notes.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
asagi_archive_image_exporter_notes.txt
Notes on archive image dump util scripts.
===== PLAN =====
This tool should be dumb.
Persist as little state as possible.
Dump rows for each image we export into a CSV file
Row 0 of the CSV file should be colum names to ensure clarity.
The two scripts should not depend on each other.
One should produce a simple .csv file, the other should read a simple .csv file.
+++Questions+++
How is data escaped?
TODO FIXME FIGURE THIS OUT SOON!
Maybe use this?:
LOAD DATA INFILE 'file_name'
INTO TABLE tbl_name
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY "\n"
IGNORE 1 LINES
(`doc_id`, `media_id`, `num`, `subnum`, `thread_num`, `op`, `timestamp`, `timestamp_expired`, `preview_orig`, `preview_w`, `preview_h`, `media_filename`, `media_w`, `media_h`, `media_size`, `media_hash`, `media_orig`, `spoiler`, `deleted`, `capcode`, `email`, `name`, `trip`, `title`, `comment`, `sticky`, `locked`, `poster_hash`, `poster_country`, `exif`)
;
Which translates to:
#WIP:
# What quoting is needed?
# What delimiter?
# What quotechar?
#csv.DictReader(csvfile, delimiter=',',quotechar='"', quoting = csv.QUOTE_MINIMAL)
#/WIP
How do we assign names to each dump?
Proposal 1:
db_name.board.lower_end.higher_end.zip
Proposal 2:
start_unixtime.zip
Proposal 3:
board.lower_end.higher_end.zip
Proposal 4:
# Geting the DB name takes more effort than getting the table name
step1 (DB to CSV): tablename.lower_end.higher_end.csv
step2 (zipper): Whatever the csv file was called excluding the .csv. ex. '123.csv' -> '123.zip'
Input filepaths:
base/image/1536/63/1536631035276.webm
Zip filename:
dumpname.zip
Output paths in the zip file:
dump_name.csv
<boardName>/<thumb or image>/<char 0-3>/<char 4-5>/<full image name>
a/image/012/34/01234567.png
===== Basic high-level pseudocode =====
#Given a start position and a database table:
#Grab list of things to export
start_position = 0
results = sql("SELECT * WHERE 'id' > start_position")
#Store results in text file
for result in results:
textfile.append(result)
tmp_zip_path = 'tmp/runstarttimestamp.boardname.zip'
# Add each file to the export zip
for line in textfile:
zip(in_path=line.imgpath, out_path=tmp_zip_path)
# Move zip from tempdir to finaldir to signify success
mvfile(tmp_zip_path, final_zip_path)
===== Step 1 =====
Get a CSV file of things to zip from the database
Produces a .CSV file with a header containing column names.
USAGE:
===== Setep 2 =====
Zip files listed in .CSV file all togehter into one zip file, preserving paths
Requires a .CSV file with a header on the first line containing column names.
Should be able to accept any .CSV dump with the required fields as long as it has a valid header?
USAGE:
Use . for current dir.
python step2_zip.py csv_path images_dir zip_path board_name
python step2_zip.py data/g_images.csv . data/g_images.zip g
===== Installation =====
pip install sqlalchemy
Figure out which SQL connector you need for your DB to work with SQLalchemy and install that.
ex.
pip install sqlite3
===== Misc notes =====
edward_ says:
SELECT media_orig FROM gif WHRE timestamp > threshold (for full images)
and
SELECT preview_orig FROM git WHERE timestamp > threshold (for previews)
preview_orig will look like 1397912081601s.jpg and media_orig looks like 1397912081601.gif
each is located in boards/<boardName>/<thumb or image>/<char 0-3>/<char 4-5>/<full image name>
base/image/1536/63/1536631035276.webm
SQL commands to generate CSV dump:
TODO