Improve performance for large files #13

guy-adams · 2020-02-10T09:50:56Z

We have been experiencing some challenges with large CSV files:

PipelineWise with Streaming
• File Size Row Count Columns Time
• 2.9 kB 320 2 ~1 minute
• 17.0 MB 79483 20 ~5 minutes
• 116.7 MB 5155187 5 ~34 minutes

Is this roughly consistent with other people's expereinces? We have done some extensive analysis and it seems like most of the time is distributed over many areas i.e. no simple fixes/improvements.

We identified that the biggest improvement we can make is around the iterator for S3 stream content - we found if we download the file first, as a single step, and then iterate over the local file the performance improves fairly signficantly:

PipelineWise with Download
• File Size Row Count Columns Time
• 2.9 kB 320 2 ~50 seconds
• 17.0 MB 79483 20 ~1 minute 20 seconds
• 116.7 MB 5155187 5 ~7 minutes 15 seconds

Especially for large files.

Can we create a PR for these code changes for review?

louis-pie · 2020-02-11T06:37:28Z

Hi Guy,

It makes sense that it would go faster if copied to local storage first, but some files we are dealing with are rather large.

Have you done any tests with files that are over 1TB ?

Kind regards,
Louis

guy-adams · 2020-02-11T07:41:01Z

Louis,

We haven't done any tests over 116MB because, before our change, this took 34 minutes and I would expect to increase roughtly linearly. Also, we have noticed that the memory utilisation for a file of this size is pretty large (from memory 8-9GB RAM), so increasing this by a factor of 10 would probably require a server with 96GB or 128GB or RAM to avoid swapping (which if course will increase the run time dramatically). Do you have any rough indications on your side of the time/memory to process large files and we can compare to? My thinking was to extend the s3fast sync to a) make the copy to local storage optional i.e. turn it on if you want/can, otherwise it defaults to the current behavious b) use this to allow ingestion of CSV files that are not in S3 e.g. if a user is getting files from some other remote source e.g. FTP, SCP or wants to do a bit if file manipulation themselves before they pass to the tap, they can do this locally and then call the tap addressing a local file. Thoughts?

guy-adams · 2020-02-29T17:43:08Z

@louis-pie ,

When you are talking about 1TB files - is this something you/anyone has done/tested? If so, have these been done as one single file, or split into multiple files.

We are currently trying to process a file that is about 600MB, 5m rows (but quite a few columns, maybe 40 or 50) and we are having issues getting it to complete. Our server (doing nothing but this) has 4 cores and 16GB RAM. We never see more than 1 core get used (are there situations where it would use more than 1 core) but we see the memory footprint climbing and climbing. In most situations we run out of memory, and then the oom-killer kills the main working thread. Interestingly this doesn't seem to kill the pipelinewise main process which then sits there forever until you kill it manually.

If you are talking about TB sized files then we must be doing something wrong since we can't get 600MB files to work - based on the work we are currently doing a 1TB file would require nearly 100TB RAM and take months!!! Is there anything you can share with us in terms of system specs, test files, or expected run times?

I should add that since we are using a Snowflake target, technically we are actually using the fastsync path I assume.

louis-pie · 2020-02-29T20:27:10Z

Hi @guy-adams,

No sorry, I miss spoke .. we don't have any 1TB files at present, but I suspect we will.

I was wondered if you had tested up to 1TB ... I guess you have tested to 600MB now... sorry it failed

Back to your original post, Yes we will look at your PR, but we are a small team and don't have much time for testing, so if you could please include in your PR any possible automatic testing to cover functionality that will make it a lot easier.

Kind regards,
Louis

guy-adams · 2020-03-01T23:16:54Z

@louis-pie,

We've been working on this pretty hard. We have considered with and without Downloading the file locally to the runner and also how much memory. We had 2 servers, one with enough to keep small data files in memory, but overflows into swap for larger files and the second with a huge amount of memory.

Below are our testing results:

Size Download Memory Time
17MB False Enough to avoid swapping (guess) 0h 5m
17MB True Enough to avoid swapping (guess) 0h 1m 20s
116MB False Enough to avoid swapping (guess) 0h 34m
116MB True Enough to avoid swapping (guess) 0h 7m 15s
600MB False Not enough to avoid swapping and swap 3h 35m
600MB False Enough to avoid swapping 2h 47m
600MB True Not enough to avoid swapping and swap 0h 31m 7s
600MB True Enough to avoid swapping 0h 8m 13s

From this and related results the following is clear:

If you have at least enough swap – setting download to True will dramatically improve results (by a factor of 7 ish) (3.5 hours to 0.5 hours)
If you also allocate enough main memory the results will improve (by another factor of ~4) 31 mins to 8 min
Overall having enough main memory and setting download to True improves by a factor of 26

Other than needing the local disk space, we have seen no downsides of setting "Download" to true - in all use cases it speeds things up.

Our other key observation is that the ratio between the size of data on disk and size of data in memory is of the order 70:1 i.e. a 100MB file uses ~ 7GB RAM. I think this factor is actually the main barrier in handling of large files. To your point, based on the current system, a 1TB file would require ~70TB of RAM - unless you have access to some much bigger servers than us, that's not likely to happen!!!

However what we are also thinking is about file chunking. Other than for determining column names, data types etc, we don't see any barriers to breaking a file down into many smaller chunks, processing batches of them in parallel (and therefore use all the CPU cores - only one is currently used), but critically, there is no need to need to have all the data in memory at once, we could process a chunk, write it into an S3 stage and move onto the next batch. This actually works well with Snowflake as well which is much faster at ingesting a number of smaller files than one big one.

Any thoughts we should bear in mind based on your experience?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve performance for large files #13

Improve performance for large files #13

guy-adams commented Feb 10, 2020

louis-pie commented Feb 11, 2020

guy-adams commented Feb 11, 2020

guy-adams commented Feb 29, 2020

louis-pie commented Feb 29, 2020

guy-adams commented Mar 1, 2020

Improve performance for large files #13

Improve performance for large files #13

Comments

guy-adams commented Feb 10, 2020

louis-pie commented Feb 11, 2020

guy-adams commented Feb 11, 2020

guy-adams commented Feb 29, 2020

louis-pie commented Feb 29, 2020

guy-adams commented Mar 1, 2020