asagi_archive_image_exporter_notes.txt

Notes on archive image dump util scripts.


===== PLAN =====

This tool should be dumb.
Persist as little state as possible.
Dump one row for each image we export into a CSV file.
Row 0 of the CSV file should be column names to ensure clarity.


The two scripts should not depend on each other.
One should produce a simple .csv file, the other should read a simple .csv file.


+++Questions+++
How is data escaped?
    TODO FIXME FIGURE THIS OUT SOON!
Maybe use this?:
LOAD DATA INFILE 'file_name'
    INTO TABLE tbl_name
    FIELDS TERMINATED BY ',' ENCLOSED BY '"'
    LINES TERMINATED BY '\n'
    IGNORE 1 LINES
    (`doc_id`, `media_id`, `num`, `subnum`, `thread_num`, `op`, `timestamp`, `timestamp_expired`, `preview_orig`, `preview_w`, `preview_h`, `media_filename`, `media_w`, `media_h`, `media_size`, `media_hash`, `media_orig`, `spoiler`, `deleted`, `capcode`, `email`, `name`, `trip`, `title`, `comment`, `sticky`, `locked`, `poster_hash`, `poster_country`, `exif`)
    ;

Which translates to:

#WIP:
# What quoting is needed?
# What delimiter?
# What quotechar?
#csv.DictReader(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
#/WIP
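
One way to answer the escaping question is to round-trip an awkward value through Python's csv module and see what comes out. The sketch below is only that: it assumes comma delimiter, double-quote quotechar and QUOTE_MINIMAL as in the WIP line above, and the helper names (rows_to_csv_text, csv_text_to_rows) are made up for illustration. It does not settle whether MySQL's default backslash escaping in LOAD DATA needs extra handling; that still has to be checked against a real import.

```python
import csv
import io

# Assumed dialect, mirroring FIELDS TERMINATED BY ',' ENCLOSED BY '"'
# and LINES TERMINATED BY '\n' from the LOAD DATA statement above.
CSV_KWARGS = dict(delimiter=',', quotechar='"',
                  quoting=csv.QUOTE_MINIMAL, lineterminator='\n')


def rows_to_csv_text(fieldnames, rows):
    """Serialize dict rows to CSV text with the column names on row 0."""
    buf = io.StringIO(newline='')
    writer = csv.DictWriter(buf, fieldnames=fieldnames, **CSV_KWARGS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def csv_text_to_rows(text):
    """Parse CSV text back into a list of dicts keyed by the header row."""
    reader = csv.DictReader(io.StringIO(text, newline=''), **CSV_KWARGS)
    return list(reader)


# A comment field containing the delimiter, the quotechar and a newline.
sample = [{'num': '1', 'comment': 'has, comma and "quotes"\nand a newline'}]
text = rows_to_csv_text(['num', 'comment'], sample)
```

With QUOTE_MINIMAL only fields that contain the delimiter, the quotechar or a line break get quoted, and embedded quotes are doubled.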


How do we assign names to each dump?
    Proposal 1:
    db_name.board.lower_end.higher_end.zip

    Proposal 2:
    start_unixtime.zip

    Proposal 3:
    board.lower_end.higher_end.zip

    Proposal 4:
    # Getting the DB name takes more effort than getting the table name
    step1 (DB to CSV): tablename.lower_end.higher_end.csv
    step2 (zipper): Whatever the csv file was called excluding the .csv. ex. '123.csv' -> '123.zip'
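
Proposal 4 could look roughly like this. The function names (csv_name, zip_name_for) are hypothetical, and returning only the file name without its directory is an assumption.

```python
import os


def csv_name(table_name, lower_end, higher_end):
    """Step 1 output name: tablename.lower_end.higher_end.csv"""
    return '{0}.{1}.{2}.csv'.format(table_name, lower_end, higher_end)


def zip_name_for(csv_path):
    """Step 2 output name: the CSV file name with .csv swapped for .zip"""
    stem, _ext = os.path.splitext(os.path.basename(csv_path))
    return stem + '.zip'
```

Only the last extension is stripped, so a name like g.0.1000.csv keeps its range in the zip name.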


Input filepaths:
    base/image/1536/63/1536631035276.webm


Zip filename:
    dumpname.zip


Output paths in the zip file:
    dump_name.csv
    <boardName>/<thumb or image>/<char 0-3>/<char 4-5>/<full image name>
    a/image/0123/45/01234567.png
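
A minimal sketch of building the in-zip path, assuming <char 0-3> means the first four characters of the file name and <char 4-5> the next two, which matches the base/image/1536/63/1536631035276.webm input example. The function name archive_path is made up for illustration.

```python
import posixpath


def archive_path(board_name, kind, filename):
    """Build <boardName>/<thumb or image>/<char 0-3>/<char 4-5>/<name>.

    Zip entries always use forward slashes, hence posixpath.
    """
    if kind not in ('thumb', 'image'):
        raise ValueError("kind must be 'thumb' or 'image', got %r" % (kind,))
    return posixpath.join(board_name, kind,
                          filename[0:4], filename[4:6], filename)
```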


===== Basic high-level pseudocode =====

#Given a start position and a database table:

#Grab list of things to export
start_position = 0
results = sql("SELECT * FROM table WHERE id > start_position")

#Store results in text file
for result in results:
    textfile.append(result)

tmp_zip_path = 'tmp/runstarttimestamp.boardname.zip'
# Add each file to the export zip
for line in textfile:
    zip(in_path=line.imgpath, out_path=tmp_zip_path)

# Move zip from tempdir to finaldir to signify success
mvfile(tmp_zip_path, final_zip_path)
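
The zip-then-move part of the pseudocode above might be sketched like this. build_zip and the (path_on_disk, path_in_zip) pair format are assumptions for illustration, and both directories are assumed to exist already.

```python
import shutil
import zipfile


def build_zip(entries, tmp_zip_path, final_zip_path):
    """Write every entry into a temp zip, then move it into place.

    entries is an iterable of (path_on_disk, path_in_zip) pairs.
    The final path only appears once every file has been added, so its
    existence signals success. The default ZIP_STORED is used because
    images are already compressed.
    """
    with zipfile.ZipFile(tmp_zip_path, 'w') as zf:
        for in_path, arc_path in entries:
            zf.write(in_path, arcname=arc_path)
    shutil.move(tmp_zip_path, final_zip_path)
    return final_zip_path
```

If any file is missing, zf.write raises and the move never happens, so a half-built zip stays in the temp dir.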


===== Step 1 =====
Get a CSV file of things to zip from the database
Produces a .CSV file with a header containing column names.
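
A rough sketch of step 1. It uses the stdlib sqlite3 module as a stand-in so it runs without a database server; the real script is expected to go through SQLAlchemy and whatever connector the archive DB needs. dump_query_to_csv is a hypothetical name, and the query and its parameters are left to the caller.

```python
import csv
import sqlite3


def dump_query_to_csv(conn, query, params, csv_path):
    """Run a query and write the result to csv_path with a header row.

    conn is a sqlite3 connection here (stand-in for the real DB).
    Column names come from the cursor, so row 0 always matches the data.
    """
    cursor = conn.execute(query, params)
    header = [col[0] for col in cursor.description]
    with open(csv_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f, delimiter=',', quotechar='"',
                            quoting=csv.QUOTE_MINIMAL)
        writer.writerow(header)
        writer.writerows(cursor)
    return header
```

Passing the start position as a query parameter instead of formatting it into the SQL string keeps quoting problems out of the query itself.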

USAGE:


===== Step 2 =====
Zip files listed in .CSV file all together into one zip file, preserving paths.
Requires a .CSV file with a header on the first line containing column names.
Should be able to accept any .CSV dump with the required fields as long as it has a valid header?


USAGE:
Use . for current dir.
python step2_zip.py csv_path images_dir zip_path board_name
python step2_zip.py data/g_images.csv . data/g_images.zip g
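
The "any .CSV with the required fields" idea could be checked up front, before any zipping starts. This is a sketch only: which columns step 2 actually requires is an assumption (media_orig and preview_orig are guessed from the column list in the plan), and read_rows is a made-up name.

```python
import csv

# Assumption: the zipper only needs the two file name columns.
REQUIRED_FIELDS = ('media_orig', 'preview_orig')


def read_rows(csv_path, required=REQUIRED_FIELDS):
    """Read a CSV dump, failing early if the header lacks a required column."""
    with open(csv_path, newline='', encoding='utf-8') as f:
        reader = csv.DictReader(f)
        # fieldnames is None for an empty file, so fall back to an empty list.
        present = reader.fieldnames or []
        missing = [name for name in required if name not in present]
        if missing:
            raise ValueError('CSV is missing required columns: '
                             + ', '.join(missing))
        return list(reader)
```

Extra columns are ignored, so a full table dump and a two-column dump would both be accepted.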


===== Installation =====
pip install sqlalchemy

Figure out which SQL connector you need for your DB to work with SQLAlchemy and install that.
ex. (MySQL)
pip install pymysql
sqlite3 ships with Python and does not need to be installed.


===== Misc notes =====


edward_ says:
SELECT media_orig FROM gif WHERE timestamp > threshold (for full images)
and
SELECT preview_orig FROM gif WHERE timestamp > threshold (for previews)

preview_orig will look like 1397912081601s.jpg and media_orig looks like 1397912081601.gif
each is located in boards/<boardName>/<thumb or image>/<char 0-3>/<char 4-5>/<full image name>
base/image/1536/63/1536631035276.webm


SQL commands to generate CSV dump:
TODO