Unzip Transformer, DID Finder, Pandas Python Code Generators #574

Open
wants to merge 84 commits into
base: develop
84 commits
8a8f477
Make a new unzip codegen
Apr 3, 2023
9dcf75e
Add to requirements.txt
Apr 3, 2023
dab6efb
comment tests
Apr 3, 2023
bad22a1
Add transformer cap to python codegen
Apr 5, 2023
49e07b1
Fix unzip translator path
Apr 5, 2023
3029d82
fix flake8
Apr 5, 2023
f4cc992
debug log
Apr 5, 2023
fe0f0d9
test build
Apr 5, 2023
0cd02bd
fix flake8
Apr 5, 2023
95b7b4d
more logs
Apr 5, 2023
0f7d3b9
prints
Apr 5, 2023
f82c5d1
pull latest python codegen
Apr 5, 2023
6aa74e1
Merge branch 'develop' into pondd-unzip
shriram192 Apr 5, 2023
1cc579f
merge with develiop
Apr 5, 2023
9ddcb18
fix tests
Apr 5, 2023
de8e5ea
removed twice puts
shriram192 Apr 5, 2023
988273a
flake8 fix
shriram192 Apr 5, 2023
ccef1a2
transformer capabilities
Apr 5, 2023
2fb7bfe
changes to fix merge
Apr 5, 2023
e18ef67
Add requirements.txt
Apr 5, 2023
c12902b
comment lint
Apr 5, 2023
6193724
dockerfile xaod install pip
Apr 5, 2023
c4ceec6
Change pip to pip3
Apr 7, 2023
6de19cc
add gcc to dockerfile
Apr 7, 2023
6ad5c30
change py2 to py3
Apr 7, 2023
776f663
add httpx to uproot requirements
Apr 7, 2023
66db559
Make changes to transformer to uplaod multiple files
Apr 7, 2023
efa0c9a
pass codegentype in transform request
Apr 10, 2023
d9839eb
change codegen type src
Apr 10, 2023
8ea736e
add codegen type to db
Apr 10, 2023
a5b5694
Add codegen to db
Apr 10, 2023
ea39813
change columnname
Apr 10, 2023
bcf154e
remove prints
Apr 10, 2023
c791b1d
changes to accept multiple files
Apr 10, 2023
7c1110d
remove extra lines in watch
Apr 10, 2023
2400088
fix configmap
Apr 10, 2023
5abb694
pandas-tf
Apr 10, 2023
b95c8dd
add unzip image
Apr 10, 2023
cdb2e22
Add new image
Apr 10, 2023
b39476d
making new tag
Apr 10, 2023
3fc8dc2
yum removed from dockerfile
shriram192 Apr 10, 2023
af8b4b1
add runme
Apr 10, 2023
af215c8
add scripts
Apr 10, 2023
247c568
debug
Apr 10, 2023
16c7cc6
Add did finder unzip
Apr 12, 2023
f583072
Add unzip did finder functionality
Apr 12, 2023
a475b89
Merge branch 'develop' into pandas-tf
shriram192 Apr 12, 2023
ca889ef
remove codecov from unzip did finder
Apr 12, 2023
26efdfd
Add poetry lock
Apr 12, 2023
355e7d8
remove x509 up
Apr 12, 2023
6779048
Add minio to poetry
Apr 12, 2023
3a72716
add more libs
Apr 12, 2023
94b07bb
Poetry lock add
Apr 12, 2023
17837ca
add prints
Apr 12, 2023
f3507f2
debug
Apr 15, 2023
d0718ce
debug more
Apr 15, 2023
bbee91f
parsed did dump
Apr 15, 2023
8572dde
Change queue name to unzip
Apr 16, 2023
b07e631
Make bucket name req ID
Apr 16, 2023
78ec9bc
logger add
shriram192 Apr 17, 2023
0ad233b
print file
shriram192 Apr 17, 2023
b6600ea
removed folders
shriram192 Apr 17, 2023
d760b1b
list_objects change and info logs updated
shriram192 Apr 17, 2023
2825678
print formatting
shriram192 Apr 17, 2023
2bcd07c
dict yield
shriram192 Apr 17, 2023
3012c60
rucio print and did finder object pull change
shriram192 Apr 17, 2023
4cab54a
did finder changes
shriram192 Apr 17, 2023
a9890a7
object name
Apr 19, 2023
be00e69
remove logs
Apr 19, 2023
9eb7030
change to url
Apr 19, 2023
82e11bf
changes for mutiple templates in codegen
shriram192 Apr 19, 2023
64d9a69
Add codegentype as env to transformer pod
Apr 24, 2023
4bc3ba7
Strip * from output path
Apr 24, 2023
b9ea47e
strip output path
Apr 24, 2023
ff8d793
Add transformer stats
Apr 24, 2023
b79e526
change file path
Apr 24, 2023
8ed795b
folder output path
shriram192 Apr 24, 2023
3c9460e
pandas stat change
shriram192 Apr 24, 2023
fe0a23b
file name change and append change
shriram192 Apr 24, 2023
9e27847
enable testing
May 1, 2023
989ab35
code generator python test activated
shriram192 May 1, 2023
765ed4a
Merge branch 'pandas-tf' of https://github.com/ssl-hep/ServiceX into …
shriram192 May 1, 2023
c1c75f2
test fixes for PONDD
shriram192 May 1, 2023
de061a8
pytest and flake8 fix
shriram192 May 1, 2023
5 changes: 5 additions & 0 deletions .github/workflows/deploy-config.json
@@ -19,6 +19,11 @@
       "image_name": "servicex-did-finder",
       "test_required": false
     },
+    {
+      "dir_name": "did_finder_unzip",
+      "image_name": "servicex-did-finder-unzip",
+      "test_required": false
+    },
     {
       "dir_name": "code_generator_funcadl_uproot",
       "image_name": "servicex_code_gen_func_adl_uproot",
1 change: 1 addition & 0 deletions code_generator_python/Dockerfile
@@ -20,6 +20,7 @@ COPY boot.sh ./
 COPY transformer_capabilities.json ./
 COPY servicex/ ./servicex
 COPY scripts/from_text_to_zip.py .
+COPY transformer_capabilities.json ./
 RUN chmod +x boot.sh

 USER servicex
@@ -38,6 +38,7 @@ class PythonTranslator(CodeGenerator):
    def generate_code(self, query, cache_path: str):

        src = base64.b64decode(query).decode('ascii')
        print("SRC", src)
        hash = "no-hash"
        query_file_path = os.path.join(cache_path, hash)

@@ -59,6 +60,10 @@ def generate_code(self, query, cache_path: str):
        shutil.copyfile(capabilities_path, os.path.join(query_file_path,
                                                        "transformer_capabilities.json"))

        with open(os.path.join(query_file_path, 'generated_transformer.py'), 'w') as python_file:
            python_file.write(src)

        os.system("ls -lht " + query_file_path)
        os.system(f"cat {query_file_path}/generated_transformer.py")

        return GeneratedFileResult(hash, query_file_path)
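
Review note: the `os.system` calls in `generate_code` start a shell with a command string built from `cache_path`. The following is a minimal sketch of the same directory inspection done in-process through `logging`. The helper name `log_generated_files` is hypothetical and is not part of this PR.

```python
import logging
from pathlib import Path
from typing import List

logger = logging.getLogger(__name__)


def log_generated_files(query_file_path: str) -> List[str]:
    """Log the name and size of each generated file and return the sorted names.

    Hypothetical helper: it stands in for the `ls -lht` debug call without
    spawning a shell, so the path is never interpreted as a command string.
    """
    names = []
    for entry in sorted(Path(query_file_path).iterdir()):
        if entry.is_file():
            logger.debug("%s (%d bytes)", entry.name, entry.stat().st_size)
            names.append(entry.name)
    return names
```

Using `logger.debug` keeps the output out of production logs unless debug logging is enabled.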
@@ -0,0 +1,13 @@
+def run_query(input_filenames=None, tree_name=None):
+    from stream_unzip import stream_unzip
+    import httpx
+
+    def zipped_chunks(input_path):
+        # Iterable that yields the bytes of a zip file
+        with httpx.stream('GET', input_path) as r:
+            yield from r.iter_bytes(chunk_size=65536)
+
+    for file_name, file_size, unzipped_chunks in stream_unzip(zipped_chunks(input_filenames[0])):
+        # unzipped_chunks must be iterated to completion or UnfinishedIterationError will be raised
+        for chunk in unzipped_chunks:
+            yield chunk, file_name
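
Review note: `run_query` yields `(chunk, file_name)` pairs, with all chunks of one archive member emitted before the next member starts and `file_name` given as bytes. The sketch below reproduces that contract with the standard library only, which may help when testing the consumer without `httpx` or `stream_unzip`. It is a stand-in under stated assumptions: `iter_zip_chunks` is a hypothetical name, and it reads a complete in-memory archive rather than streaming over HTTP.

```python
import io
import zipfile
from typing import Iterator, Tuple


def iter_zip_chunks(zip_bytes: bytes,
                    chunk_size: int = 65536) -> Iterator[Tuple[bytes, bytes]]:
    """Yield (chunk, file_name) pairs, one archive member at a time.

    Hypothetical test double for the generated run_query: same output shape,
    but it needs the whole archive in memory, so it is not a streaming reader.
    """
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directory entries carry no data
            with archive.open(info) as member:
                while True:
                    chunk = member.read(chunk_size)
                    if not chunk:
                        break
                    # File names are yielded as bytes, matching the
                    # file_name.decode('utf-8') call in the transformer.
                    yield chunk, info.filename.encode("utf-8")
```

Note that an empty archive member produces no pairs in this sketch, so a consumer that creates files only when it sees a chunk would skip empty files.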
72 changes: 45 additions & 27 deletions code_generator_python/servicex/templates/transform_single_file.py
@@ -8,6 +8,7 @@
 import uproot
 instance = os.environ.get('INSTANCE_NAME', 'Unknown')
 default_tree_name = "servicex"
+codegen_type = os.environ.get('CODEGEN_TYPE', 'default')


 def transform_single_file(file_path: str, output_path: Path, output_format: str):
@@ -20,41 +21,58 @@ def transform_single_file(file_path: str, output_path: Path, output_format: str)
     try:
         stime = time.time()

-        awkward_array = generated_transformer.run_query(file_path)
-        total_events = ak.num(awkward_array, axis=0)
+        if codegen_type == 'default':
+            awkward_array = generated_transformer.run_query(file_path)
+            total_events = ak.num(awkward_array, axis=0)

-        ttime = time.time()
+            ttime = time.time()

-        if output_format == 'root-file':
-            etime = time.time()
-            with uproot.recreate(output_path) as writer:
-                writer[default_tree_name] = {field: awkward_array[field] for field in
-                                             awkward_array.fields} if awkward_array.fields \
-                    else awkward_array
-            wtime = time.time()
+            if output_format == 'root-file':
+                etime = time.time()
+                with uproot.recreate(output_path) as writer:
+                    writer[default_tree_name] = {field: awkward_array[field] for field in
+                                                 awkward_array.fields} if awkward_array.fields \
+                        else awkward_array
+                wtime = time.time()

-        else:
-            explode_records = bool(awkward_array.fields)
-            try:
-                arrow = ak.to_arrow_table(awkward_array, explode_records=explode_records)
-            except TypeError:
-                arrow = ak.to_arrow_table(ak.repartition(awkward_array, None),
-                                          explode_records=explode_records)
+            else:
+                explode_records = bool(awkward_array.fields)
+                try:
+                    arrow = ak.to_arrow_table(awkward_array, explode_records=explode_records)
+                except TypeError:
+                    arrow = ak.to_arrow_table(ak.repartition(awkward_array, None),
+                                              explode_records=explode_records)

-            etime = time.time()
+                etime = time.time()

-            writer = pq.ParquetWriter(output_path, arrow.schema)
-            writer.write_table(table=arrow)
-            writer.close()
+                writer = pq.ParquetWriter(output_path, arrow.schema)
+                writer.write_table(table=arrow)
+                writer.close()

-            wtime = time.time()
+                wtime = time.time()

-        output_size = os.stat(output_path).st_size
-        print(f'Detailed transformer times. query_time:{round(ttime - stime, 3)} '
-              f'serialization: {round(etime - ttime, 3)} '
-              f'writing: {round(wtime - etime, 3)}')
+            output_size = os.stat(output_path).st_size
+            print(f'Detailed transformer times. query_time:{round(ttime - stime, 3)} '
+                  f'serialization: {round(etime - ttime, 3)} '
+                  f'writing: {round(wtime - etime, 3)}')

-        print(f"Transform stats: Total Events: {total_events}, resulting file size {output_size}")
+            print(f"Transform stats: Total Events: {total_events}, \
+                resulting file size {output_size}")
+        elif codegen_type == 'unzip':
+            folder_output_path = os.path.dirname(output_path)
+            for bytes, file_name in generated_transformer.run_query(file_path):
+                file_output_path = os.path.join(folder_output_path, file_name.decode('utf-8'))
+                print('File: ', file_output_path)
+                with open(file_output_path, 'ab') as f:
+                    f.write(bytes)
+            total_events = 0
+            output_size = os.stat(folder_output_path).st_size
+        elif codegen_type == 'pandas':
+            folder_output_path = os.path.dirname(output_path)
+            pd_data = generated_transformer.run_query(file_path)
+            pd_data.to_parquet(output_path)
+            total_events = 0
+            output_size = os.stat(folder_output_path).st_size
     except Exception as error:
         mesg = f"Failed to transform input file {file_path}: {error}"
         print(mesg)
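
Review note on the `unzip` branch: opening every target with `'ab'` means a retried transform appends to whatever an earlier attempt left behind. `os.stat()` on the output folder reports the size of the directory entry, not of the files written into it. The loop variable `bytes` shadows the builtin, and member names from the archive are joined to the output folder unchecked. The following is a minimal sketch of one way to address these points. The function name `write_member_chunks` is hypothetical, and the path check is an assumption about the desired behavior, not something this PR states.

```python
import os
from typing import Iterable, Tuple


def write_member_chunks(chunks: Iterable[Tuple[bytes, bytes]], out_dir: str) -> int:
    """Write (chunk, file_name) pairs under out_dir and return total bytes written."""
    root = os.path.realpath(out_dir)
    started = set()
    total = 0
    for chunk, raw_name in chunks:
        target = os.path.realpath(os.path.join(root, raw_name.decode("utf-8")))
        # Refuse member names such as "../x" that would resolve outside out_dir.
        if os.path.commonpath([root, target]) != root:
            raise ValueError(f"unsafe member name: {raw_name!r}")
        os.makedirs(os.path.dirname(target), exist_ok=True)
        # Truncate on the first chunk of each file, append on later chunks,
        # so a retry does not extend output from a previous attempt.
        mode = "ab" if target in started else "wb"
        started.add(target)
        with open(target, mode) as f:
            f.write(chunk)
        total += len(chunk)
    return total
```

The returned total could serve as `output_size` for the unzip case, since it counts the bytes actually written.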
3 changes: 2 additions & 1 deletion code_generator_python/tests/test_python_translator.py
@@ -31,12 +31,13 @@
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions are met:
 #
+
 import base64
 import os
 import tempfile

 from servicex.python_code_generator.python_translator import \
-    PythonTranslator
+    PythonTranslator


 def test_generate_code():
2 changes: 1 addition & 1 deletion code_generator_python/transformer_capabilities.json
@@ -4,6 +4,6 @@
   "limitations": "Would be good to note what isn't implemented",
   "file-formats": ["parquet", "root"],
   "stats-parser": "UprootStats",
-  "language": "python",
+  "language": "python3",
   "command": "/generated/transform_single_file.py"
 }
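
Review note: the change from `python` to `python3` suggests that the `language` value is used as the interpreter name when the transformer is launched. The diff does not show the launcher, so this is an assumption. Under that assumption, here is a minimal sketch of how the two fields might combine into a command line. The function name `build_transformer_argv` is hypothetical.

```python
import json
from typing import List


def build_transformer_argv(capabilities_json: str, *args: str) -> List[str]:
    """Build an argv list from a transformer_capabilities.json document.

    Assumption (not confirmed by the diff): "language" names the interpreter
    executable and "command" is the script path passed to it.
    """
    caps = json.loads(capabilities_json)
    return [caps["language"], caps["command"], *args]
```

If that reading is right, an image that ships only a `python3` executable would fail to start a transformer whose capabilities file still said `python`.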
1 change: 1 addition & 0 deletions did_finder_rucio/scripts/did_finder.py
@@ -69,6 +69,7 @@ async def callback(did_name, info):
             request_id=info['request-id']
         )
         for file in lookup_request.lookup_files():
+            logger.info(f"File: {file}")
             yield file

     start_did_finder('rucio',
24 changes: 24 additions & 0 deletions did_finder_unzip/Dockerfile
@@ -0,0 +1,24 @@
+FROM python:3
+
+# Create app directory
+WORKDIR /usr/src/app
+
+# for CA certificates
+USER root
+RUN mkdir -p /etc/grid-security/certificates /etc/grid-security/vomsdir
+
+ENV POETRY_VERSION=1.2.2
+RUN python3 -m pip install --upgrade pip
+
+RUN pip install poetry==$POETRY_VERSION
+COPY pyproject.toml pyproject.toml
+COPY poetry.lock poetry.lock
+
+RUN poetry config virtualenvs.create false && \
+    poetry install --no-root --no-interaction --no-ansi
+
+COPY . .
+
+ENV X509_USER_PROXY /tmp/grid-security/x509up
+ENV X509_CERT_DIR /etc/grid-security/certificates
+