I have about 30,000 very tiny JSON files that I am attempting to load into a Spark dataframe (from a mounted S3 bucket). It has been reported (here and here) that there may be performance issues; this is described as the
Hadoop Small Files Problem. Unlike what has been previously reported, I am not recursing into directories (all my JSON files are in one sub-folder). My code to load the JSON files looks like the following.
val df = spark
  .read
  .option("multiline", "true")
  .json("/mnt/mybucket/myfolder/*.json")
  .cache
So far, my job seems "stuck". I see 2 stages.
- Job 0, Stage 0: Listing leaf files and directories
- Job 1, Stage 1: val df = spark .read .option("multiline", "...
Job 0, Stage 0 is quite fast, less than 1 minute.
Job 1, Stage 1, however, takes forever to even show up (I lost track of time, but between the two we are talking 20+ minutes), and when it does show up in the jobs UI, it seems to be "stuck" (I am still waiting for any progress to be reported after 15+ minutes). Interestingly,
Job 0, Stage 0 has 200 tasks (I see 7 executors being used), and
- Job 1, Stage 1 has only 1 task (it seems like only 1 node/executor is being used! what a waste!).
Is there any way to make this seemingly simple step of loading 30,000 files faster or more performant?
Something I thought about was simply merging these files into larger ones; for example, merging 1,000 JSON files into 30 bigger ones (using NDJSON). However, I am skeptical of this approach, since merging the files (say, with Python) might itself take a long time (even the native Linux
ls command in this directory takes an awfully long time to return); this approach might also defeat the purpose of end-to-end cluster computing (not very elegant).
Merging the JSON files into much larger newline-delimited files (aim for one, or at most 10 files, not 30) is really the only option here.
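A minimal sketch of that merge step (names, paths, and batch size are illustrative; it assumes each small file holds either a single JSON object or an array of objects):

```python
import glob
import json
import os

def merge_to_ndjson(src_dir, dst_dir, files_per_output=1000):
    """Merge many small JSON files into larger newline-delimited JSON files.

    Each output file gets one compact JSON object per line, which Spark
    can read without multiline=true.
    """
    os.makedirs(dst_dir, exist_ok=True)
    paths = sorted(glob.glob(os.path.join(src_dir, "*.json")))
    for start in range(0, len(paths), files_per_output):
        batch = paths[start:start + files_per_output]
        out_path = os.path.join(
            dst_dir, f"part-{start // files_per_output:05d}.ndjson")
        with open(out_path, "w") as out:
            for path in batch:
                with open(path) as f:
                    value = json.load(f)
                # Normalize: a top-level array contributes one line per element.
                records = value if isinstance(value, list) else [value]
                for record in records:
                    out.write(json.dumps(record) + "\n")
```

This is single-threaded on purpose; the listing and reading is I/O-bound either way, and the point is only to produce a handful of splittable files for Spark to read afterwards.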
Python opening 30K files isn't going to be any slower than what you're already doing; it just won't be distributed.
multiline=true was added specifically for the case where you already have a really large JSON file whose contents are one top-level array or object. Before that option existed, JSON Lines was the only format Spark could read.
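To illustrate the distinction (with made-up data, not the asker's files): a "multiline" file stores one top-level value spread across many lines, while JSON Lines stores one complete record per line, so each line can be parsed independently:

```python
import json

# One top-level array, pretty-printed across many lines.
# Spark needs .option("multiline", "true") to parse this as a single value.
multiline_doc = """[
  {"id": 1, "name": "a"},
  {"id": 2, "name": "b"}
]"""

# JSON Lines (NDJSON): one complete, compact JSON object per line.
# Spark's default JSON reader expects exactly this, and such files can be
# split across tasks because every line is a standalone JSON document.
jsonlines_doc = '{"id": 1, "name": "a"}\n{"id": 2, "name": "b"}\n'

records_multiline = json.loads(multiline_doc)  # must parse the whole document
records_jsonlines = [json.loads(line) for line in jsonlines_doc.splitlines()]
```

Both parses yield the same records; the difference is that only the line-oriented form is splittable.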
The most consistent solution here would be to fix the ingestion pipeline that's writing all these files, so that you accumulate records ahead of time and then dump larger batches. Or use Kafka rather than reading data from S3 (or any similar filesystem).
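One hedged sketch of that "accumulate, then dump" idea on the producer side (class name, file naming, and batch size are all illustrative):

```python
import json
import os

class NdjsonBatcher:
    """Buffer incoming records and flush them as one NDJSON file per batch,
    instead of writing one tiny file per record."""

    def __init__(self, out_dir, batch_size=10_000):
        self.out_dir = out_dir
        self.batch_size = batch_size
        self.buffer = []
        self.flushed = 0

    def add(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        # Write the buffered records as one newline-delimited JSON file.
        if not self.buffer:
            return
        path = os.path.join(self.out_dir, f"batch-{self.flushed:05d}.ndjson")
        with open(path, "w") as f:
            for record in self.buffer:
                f.write(json.dumps(record) + "\n")
        self.flushed += 1
        self.buffer = []
```

With a batch size in the tens of thousands, the same 30K records land in a handful of files that Spark can read and split without the listing overhead.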