Add documentation for PySpark #2
I couldn't get it running and had to make some minor changes; I guess it could be due to the different Spark version I am using. I am using Spark 3.2.2 and tried:
```python
spark = builder.getOrCreate()
ReflectionUtil = spark._sc._jvm.py4j.reflection.ReflectionUtil
spark._sc._jsc.hadoopConfiguration().setClass("fs.file.impl",
    ReflectionUtil.classForName("com.globalmentor.apache.hadoop.fs.BareLocalFileSystem"),
    ReflectionUtil.classForName("org.apache.hadoop.fs.FileSystem"))
```
but the code was still failing due to native code access. :( |
I'm not familiar at all with PySpark, but I'll see what I can do to help. This could be the version of Spark, or it could be a different way you're using Spark that causes a different code path to be invoked. Please provide a stack trace that shows which native code is being invoked. It may be easy to add the necessary changes to provide native Java access for those other methods. It was always expected that this initial version might not cover all the methods for all code paths, but I'll need to know which code paths are involved to know which parts need to be added. Thanks. |
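For anyone unsure how to capture that trace: below is a minimal, hypothetical sketch of one way to surface the Java-side stack trace from PySpark, assuming the failure arrives as a `py4j.protocol.Py4JJavaError` (the output path and DataFrame contents are made up for illustration):
```python
from py4j.protocol import Py4JJavaError
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame([(1, "a")], ["id", "value"])
try:
    # Hypothetical failing operation; substitute whatever call triggers the error.
    df.write.mode("overwrite").csv("C:/tmp/out")
except Py4JJavaError as e:
    # Print the Java exception and each of its frames so the native call site is visible.
    print(e.java_exception.toString())
    for frame in e.java_exception.getStackTrace():
        print(frame.toString())
```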
```python
import sys

import pyspark


def get_or_create_test_spark_session():
    """Get or create a Spark session for tests."""
    builder = pyspark.sql.SparkSession.builder \
        .appName("Tests") \
        .master("local[*]") \
        .config("spark.sql.execution.arrow.pyspark.enabled", "true") \
        .config("spark.ui.enabled", "false") \
        .config("spark.driver.host", "127.0.0.1")
    if sys.platform.startswith("win"):
        # On Windows, pull in the Bare Naked Local FileSystem and swap it in
        # for the default local file system to avoid Hadoop's native code.
        builder = builder \
            .config("spark.jars.packages", "com.globalmentor:hadoop-bare-naked-local-fs:0.1.0")
        spark = builder.getOrCreate()
        ReflectionUtil = spark._sc._jvm.py4j.reflection.ReflectionUtil
        spark._sc._jsc.hadoopConfiguration().setClass(
            "fs.file.impl",
            ReflectionUtil.classForName("com.globalmentor.apache.hadoop.fs.BareLocalFileSystem"),
            ReflectionUtil.classForName("org.apache.hadoop.fs.FileSystem"))
    else:
        spark = builder.getOrCreate()
    return spark
```
|
Thanks for the stack trace. I'll take a look, but it may be mid next week before I can get to it. Feel free to ping me in a few days if it slips my mind. |
I would like to help, but I cannot reproduce the issue yet. Just as a reference, this is what I did on Windows 10 with OpenJDK 11:
It also worked on my Intel Mac. |
I am trying to run it within unittest/pytest. I am using an AMD machine. |
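For that use case, here is a hypothetical sketch of wiring the `get_or_create_test_spark_session` helper from above into a session-scoped pytest fixture; the fixture and test names are assumptions, not part of the original setup:
```python
import pytest


@pytest.fixture(scope="session")
def spark_session():
    # Assumes get_or_create_test_spark_session() from the earlier comment.
    spark = get_or_create_test_spark_session()
    yield spark
    spark.stop()


def test_roundtrip(spark_session):
    df = spark_session.createDataFrame([(1, "a")], ["id", "value"])
    assert df.count() == 1
```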
Agreed. Some documentation on how to use this with PySpark would indeed be helpful for those getting started on Windows. Thank you @paulbauriegel for opening the issue. |
Hi, everyone. I'd like to make sure this issue is addressed. Help me get caught up on the status. Bear with me a bit, as I haven't touched Python in a while and I've certainly never used PySpark. This ticket seems to be primarily about updating the documentation for use with PySpark, but I also see some notes about someone not being able to get it to work on PySpark at all. The stack trace in this comment didn't show any references to the Bare Naked Local FileSystem at all, so I'm not sure PySpark is even using the correct file system implementation. Could someone verify whether they are or are not getting this to work with PySpark, and explain how they did it? Thanks. |
@garretwilson It's working fine with PySpark for me. I described how in the opening comment, and if something is unclear there, I expand on it a bit in the following comment. I primarily opened this issue so that others can find out how to use your library with PySpark without much research. You can add it to the README as a comment or just close the issue. Either way, I can confirm that it works on Mac and Windows with PySpark without any issue (I only tested local mode, not a cluster setup). |
I have managed to get this configuration to work for me on Windows with PySpark. Using Hadoop configuration files makes the syntax a bit easier; that is, PySpark setups will look like the normal examples on the internet. There are problems: you will be limited to CSV formats. For example, Parquet will not work, nor will anything else that uses Hadoop classes for I/O. No Hadoop is installed on Windows and HADOOP_HOME is not set, so all the documented warnings appear. TL;DR: core-site.xml contains "" (the XML flavor of empty). I am running a Spark install pre-built for Apache Hadoop 3.3; I have no local Hadoop install and HADOOP_HOME is not set. I'll look more into Parquet support. |
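For anyone trying the configuration-file route, here is a hypothetical sketch of a core-site.xml that applies the same `fs.file.impl` mapping used in the Python snippets above; whether this matches the exact file used in the comment above is an assumption:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical core-site.xml; place it in a directory Spark reads Hadoop
     configuration from, e.g. one pointed to by HADOOP_CONF_DIR. -->
<configuration>
  <property>
    <name>fs.file.impl</name>
    <value>com.globalmentor.apache.hadoop.fs.BareLocalFileSystem</value>
  </property>
</configuration>
```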
Hi all, I know I am a bit late to the party. Have people managed to create tables using CSV? I can read files fine. I can also write empty tables, but the moment I try to populate them with data I get the dreaded
error. Any ideas, anyone? As an example, I can do the following:
but not
Below is the beginning of the error:
The weird thing is that while I am writing the table with data, some part files get generated, but they are subsequently deleted. |
@kdionyso perhaps this might be better placed in a separate ticket? I think this ticket is more about adding documentation. (And I want to get to that eventually! 😅) And if there is some way to get a stack trace, I could better understand the code path that is arriving at this problem. |
@garretwilson Yes, sure. I was just wondering whether people with the setup described above have managed to write tables in pyspark/Windows. |
Thank you for writing this. Loading the library in PySpark actually took some time to figure out. Maybe it makes sense to add an example to the README? This is the way I'm using it now:
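The original snippet is not preserved in this excerpt; judging from the same author's other comments in the thread, it was presumably along these lines (a sketch assuming the Maven coordinates and reflection trick shown earlier, not the verbatim original):
```python
# Sketch reconstructed from other comments in this thread; assumes Spark
# local mode on Windows.
from pyspark.sql import SparkSession

builder = (SparkSession.builder
    .master("local[*]")
    .config("spark.jars.packages",
            "com.globalmentor:hadoop-bare-naked-local-fs:0.1.0"))
spark = builder.getOrCreate()

# Swap the default local file system for the bare one via the JVM gateway.
ReflectionUtil = spark._sc._jvm.py4j.reflection.ReflectionUtil
spark._sc._jsc.hadoopConfiguration().setClass(
    "fs.file.impl",
    ReflectionUtil.classForName("com.globalmentor.apache.hadoop.fs.BareLocalFileSystem"),
    ReflectionUtil.classForName("org.apache.hadoop.fs.FileSystem"))
```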