Cannot supply samplingRatio when creating a DataFrame #111
Comments
Hi @colmmcdonnell, you are right, there is a problem with this, but only when we go through the DataFrameReader API. There is a PR #112 solving it. You can use the fromMongodb method instead of DataFrameReader; see our first steps guide. Anyway, you are using an old version, 0.8.7; in the coming weeks we are going to release version 0.11.2. Thanks for your feedback!
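A minimal sketch of that fromMongodb route, assuming the 0.11.x package layout and the `fromMongoDB` spelling used in the project README; the host, database, and collection values are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import com.stratio.datasource.mongodb._
import com.stratio.datasource.mongodb.config._
import com.stratio.datasource.mongodb.config.MongodbConfig._

val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("mongodb-example"))
val sqlContext = new SQLContext(sc)

// Build a read configuration; the sampling ratio is passed directly,
// so it never goes through the DataFrameReader options map.
val readConfig = MongodbConfigBuilder(Map(
  Host          -> List("localhost:27017"), // placeholder
  Database      -> "mydb",                  // placeholder
  Collection    -> "mycollection",          // placeholder
  SamplingRatio -> 0.1)).build()

// fromMongoDB bypasses the DataFrameReader path where the option key is lost.
val df = sqlContext.fromMongoDB(readConfig)
df.printSchema()
```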
Hi @pmadrigal, thanks for your response. A few replies:

Rgds
@colmmcdonnell
To use this library from Java, some Scala types are needed and, as you can see, the code is less clean. We highly recommend Scala!
Hope this helps!
Hi @pmadrigal, thanks for your response. I can now invoke fromMongodb. However, I am seeing the same unexpectedly long duration whether I use a small samplingRatio or the default of 1.0. Have I misunderstood the meaning/purpose of samplingRatio? Is there any way I can load a DataFrame quickly without supplying a schema? I know that "quickly" is subjective, so to give a concrete example: in my test I am reading from a collection with 250 small documents (average document size is 71 bytes), and I would hope to be able to create a DataFrame for this collection (whether using the DataFrame API or fromMongodb) in far less time than the 2-3 seconds I am seeing.

Thanks again for your help, much appreciated.

Rgds

Test case showing that the samplingRatio makes no difference to the elapsed time:
Hi @colmmcdonnell, as you know, by supplying a schema we avoid inferring it and the time that inference takes. To create a DataFrame, a schema is needed. To infer it, we need to get data from Mongo, create an RDD, iterate over each record getting the partial schema of each one, and choose a final schema valid for all the records. All of this is the process that takes the time you see in your example. On the other side, SamplingRatio is a config property that allows us to scan only a part of the collection when we infer its schema. If the collection is small, as in your case, reducing the ratio won't make much difference, but with a big collection the time will be considerably reduced. Note that with SamplingRatio set to 1.0 you ensure that the schema is correct, because we have scanned the whole collection. Hope I have clarified the question. Thanks for your feedback!
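To make the schema-supplying option concrete, here is a sketch using the standard Spark 1.x DataFrameReader API. It assumes the connector accepts a user-provided schema (the walkthrough below indicates MongodbRelation takes an optional schema); the field names and connection details are hypothetical:

```scala
import org.apache.spark.sql.types._

// Hypothetical schema matching the documents in the collection.
val explicitSchema = StructType(Seq(
  StructField("_id", StringType, nullable = true),
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)))

// With .schema(...) supplied, no collection scan is needed for inference.
val df = sqlContext.read
  .schema(explicitSchema)
  .format("com.stratio.datasource.mongodb")
  .options(Map(
    "host" -> "localhost:27017", // placeholders
    "database" -> "mydb",
    "collection" -> "mycollection"))
  .load()
```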
Hi @pmadrigal, thanks for your help, I think I have enough detail to take it from here. Rgds
Thanks very much for the spark-mongodb connector, much appreciated.
I'm having an issue when creating a DataFrame from a MongoDB collection.
The elapsed time for creating the DataFrame is 2-3 seconds, in this scenario:

- relying on the `DefaultSource` to infer the schema

Walking the code I can see that:
- Creating the DataFrame goes through `org.apache.spark.sql.execution.datasources.ResolvedDataSource`, which, in the case of a `RelationProvider`, creates a `CaseInsensitiveMap` from the given options and then invokes `com.stratio.datasource.mongodb.DefaultSource`.
- `DefaultSource` creates a `new MongodbRelation` but provides no schema, which (see `MongodbRelation:58`) results in the use of a lazy schema like so: `config.get[Any](MongodbConfig.SamplingRatio).fold(MongodbConfig.DefaultSamplingRatio)`. This uses, or overrides, the caller-supplied samplingRatio, expecting the sampling ratio to be provided under the key "schema_samplingRatio". But because the `ResolvedDataSource` has already case-insensitised the caller-supplied properties, our sampling ratio is actually under the key "schema_samplingratio", so the provided MongodbSchema always uses the default sampling ratio: 1.0 (see the sketch below).

Am I correct in the above diagnosis? If so, what can be done about it? If not, how can I reliably provide my own sampling ratio?
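A self-contained illustration of the suspected mechanism, in plain Scala rather than the connector's actual code: the options map lowercases its keys, but the lookup uses a mixed-case key, so the fold always takes the default.

```scala
// What the caller supplies through the DataFrameReader options.
val callerOptions = Map("schema_samplingRatio" -> "0.1")

// What CaseInsensitiveMap effectively does to the keys.
val caseInsensitised = callerOptions.map { case (k, v) => (k.toLowerCase, v) }

val samplingRatioKey = "schema_samplingRatio" // mixed-case key the relation looks up
val defaultSamplingRatio = 1.0

// The lookup misses, so fold falls back to the default.
val effective = caseInsensitised.get(samplingRatioKey).fold(defaultSamplingRatio)(_.toDouble)

println(effective) // 1.0: the caller's 0.1 is silently dropped
```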
Any help gratefully accepted.
Version details etc:
Here's a test case showing the behaviour in action:
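A minimal sketch of such a timing test, with placeholder connection details and `sqlContext` assumed in scope (not the original gist):

```scala
// Time how long schema inference takes for two very different sampling ratios.
def timed[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println(s"$label took ${(System.nanoTime() - start) / 1e6} ms")
  result
}

def loadWithRatio(ratio: Double) = sqlContext.read
  .format("com.stratio.datasource.mongodb")
  .options(Map(
    "host" -> "localhost:27017", // placeholders
    "database" -> "mydb",
    "collection" -> "mycollection",
    "schema_samplingRatio" -> ratio.toString))
  .load()

// Accessing .schema forces the lazy inference inside MongodbRelation.
timed("samplingRatio = 1.0")(loadWithRatio(1.0).schema)
timed("samplingRatio = 0.01")(loadWithRatio(0.01).schema) // same elapsed time: the option is ignored
```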