Add documentation for PySpark #2

Open

paulbauriegel opened this issue Feb 23, 2023 · 13 comments

@paulbauriegel

Thank you for writing this. Loading the library in PySpark actually took some time to figure out. Maybe it makes sense to add an example to the README? This is the way I'm using it now:

# Resolve the Java classes through py4j and register BareLocalFileSystem
# as the implementation for the file:// scheme.
ReflectionUtil = sc._gateway.jvm.py4j.reflection.ReflectionUtil
sc._jsc.hadoopConfiguration().setClass(
    "fs.file.impl",
    ReflectionUtil.classForName("com.globalmentor.apache.hadoop.fs.BareLocalFileSystem"),
    ReflectionUtil.classForName("org.apache.hadoop.fs.FileSystem"))
@wobu

wobu commented Feb 24, 2023

I couldn't get it running and had to make some minor changes. I guess it could be due to a different Spark version; I am using Spark 3.2.2.

I tried the following:

spark = builder.getOrCreate()

ReflectionUtil = spark._sc._jvm.py4j.reflection.ReflectionUtil
spark._sc._jsc.hadoopConfiguration().setClass("fs.file.impl",
                                               ReflectionUtil.classForName("com.globalmentor.apache.hadoop.fs.BareLocalFileSystem"),
                                               ReflectionUtil.classForName("org.apache.hadoop.fs.FileSystem"))

but the code was still failing due to native code access :(

@garretwilson
Member

but the code was still failing due to native code access :(

I'm not familiar at all with PySpark, but I'll see what I can do to help. This could be the version of Spark, or it could be a different way you're using Spark that is causing a different code path to be invoked.

Please provide a stack trace that shows which native code is being invoked. It may be easy to add the necessary changes to provide native Java access for those other methods. It was always expected that this initial version might not cover all the methods for all code paths, but I'll need to know which code paths are involved to know which parts need to be added. Thanks.

@wobu

wobu commented Feb 24, 2023

import pyspark
import sys


def get_or_create_test_spark_session():
    """ Get or create a spark session
    """
    builder = pyspark.sql.SparkSession.builder \
        .appName("Tests") \
        .master("local[*]") \
        .config("spark.sql.execution.arrow.pyspark.enabled", "true") \
        .config("spark.ui.enabled", "false") \
        .config("spark.driver.host", "127.0.0.1")

    if sys.platform.startswith('win'):
        # On Windows, add the bare local FS package and register it for the file:// scheme.
        builder = builder \
            .config("spark.jars.packages", "com.globalmentor:hadoop-bare-naked-local-fs:0.1.0")

        spark = builder.getOrCreate()

        ReflectionUtil = spark._sc._jvm.py4j.reflection.ReflectionUtil
        spark._sc._jsc.hadoopConfiguration().setClass(
            "fs.file.impl",
            ReflectionUtil.classForName("com.globalmentor.apache.hadoop.fs.BareLocalFileSystem"),
            ReflectionUtil.classForName("org.apache.hadoop.fs.FileSystem"))
    else:
        spark = builder.getOrCreate()
    return spark

pyspark_stacktrace.txt

@garretwilson
Member

Thanks for the stack trace. I'll take a look, but it may be mid next week before I can get to it. Feel free to ping me in a few days if it slips my mind.

@paulbauriegel
Author

I would like to help but I cannot reproduce the issue yet. So just as a reference, what I did on Windows 10 with OpenJDK 11 is:

  1. downloading Spark 3.2.3 pre-built for Hadoop 3.2 here
  2. then creating an empty winutils.txt in the bin folder that I renamed to winutils.exe, because otherwise Hadoop complains about the missing file
  3. setting HADOOP_HOME & SPARK_HOME to the Spark root folder I downloaded & PATH to spark-3.2.3-bin-hadoop3.2/bin
  4. running pyspark --packages com.globalmentor:hadoop-bare-naked-local-fs:0.1.0
  5. then just reading some sample files via the code I shared
  6. just using @wobu's code as a script also worked

It also worked on my Intel Mac.
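
A rough sketch of steps 3 and 4 done from Python instead of the shell, with placeholder paths (the exact install location is an assumption):

import os

# Placeholder path: wherever the downloaded spark-3.2.3-bin-hadoop3.2 folder
# (with the empty winutils.exe under bin\) was extracted.
spark_home = r"C:\spark-3.2.3-bin-hadoop3.2"
os.environ["SPARK_HOME"] = spark_home
os.environ["HADOOP_HOME"] = spark_home
os.environ["PATH"] += os.pathsep + os.path.join(spark_home, "bin")

# Then create the session with the package and register fs.file.impl exactly as
# in the opening comment or in @wobu's helper above.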

@wobu
Copy link

wobu commented Feb 27, 2023

I am trying to run within a unit test / pytest.
I don't have Spark or the Spark CLI manually installed; I am only using pyspark 3.2.3, installed within a venv.

I am using an AMD machine.
The Scala variant is working fine.
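
For that kind of setup, a minimal pytest fixture sketch wrapping the helper posted above might look like this; the module name spark_test_helpers is made up for illustration (in practice the helper could live in conftest.py):

import pytest

# Assumes get_or_create_test_spark_session() from the earlier comment is importable;
# the module name here is hypothetical.
from spark_test_helpers import get_or_create_test_spark_session


@pytest.fixture(scope="session")
def spark():
    session = get_or_create_test_spark_session()
    yield session
    session.stop()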

@Admolly

Admolly commented Apr 20, 2023

Agreed. Some documentation on how to use this with PySpark would indeed be helpful for those getting started on Windows. Thank you @paulbauriegel for opening the issue.

@garretwilson
Member

garretwilson commented Jul 13, 2023

Hi, everyone. I'd like to make sure this issue is addressed. Help me get caught up on the status. Bear with me a bit as I haven't touched Python in a while and I've certainly never used PySpark.

This ticket seems to be primarily about updating the documentation for use with PySpark, but I also see some notes about someone not being able to get it to work with PySpark at all. The stack trace in this comment didn't show any references to the Bare Naked Local FileSystem at all, so I'm not sure PySpark is even using the correct FileSystem implementation in that case.

Could someone verify whether they are or are not getting this to work with PySpark, and explain how they did it? Thanks.

@paulbauriegel
Author

@garretwilson It's working fine with PySpark for me. How to do it I described in the opening comment; if something there is unclear, I expand on it a bit in a follow-up comment. I primarily opened this issue so that others can find out how to use your library with PySpark without much research. You can add it to the README as a comment or just close the issue. Either way, I can confirm that it works on Mac and Windows with PySpark without any issue (I only tested local mode, not a cluster setup).

@snoe925

snoe925 commented Nov 10, 2023

I have managed to get this configuration to work for me on Windows for PySpark. By using Hadoop configuration files the syntax is a bit easier; that is, the PySpark setup looks like the normal examples on the internet.

There are problems: you will be limited to CSV formats. For example, Parquet will not work, nor will anything else that uses Hadoop classes for I/O.

No Hadoop is installed on Windows. HADOOP_HOME is not set, so there are all the documented warnings.

TL;DR:
In $SPARK_HOME/jars create two Hadoop configuration files: core-site.xml and hdfs-site.xml.

core-site.xml contains an empty configuration (the XML flavor of empty).
hdfs-site.xml contains a single property setting fs.default.name to com.globalmentor.apache.hadoop.fs.BareLocalFileSystem.

I am running a Spark install pre-built for Apache Hadoop 3.3. I have no local Hadoop install and HADOOP_HOME is not set.

I'll look more into Parquet support.
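
For comparison, the same kind of property can also be injected from PySpark itself via the spark.hadoop.* config prefix instead of editing XML files. This is only a sketch, not snoe925's exact setup; it uses the fs.file.impl key from the earlier comments rather than fs.default.name, and assumes passing the class name as a plain string is acceptable:

from pyspark.sql import SparkSession

# Any "spark.hadoop.<key>" entry is copied into the Hadoop Configuration,
# which mirrors putting the same property into core-site.xml / hdfs-site.xml.
spark = (SparkSession.builder
         .master("local[*]")
         .config("spark.jars.packages", "com.globalmentor:hadoop-bare-naked-local-fs:0.1.0")
         .config("spark.hadoop.fs.file.impl",
                 "com.globalmentor.apache.hadoop.fs.BareLocalFileSystem")
         .getOrCreate())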

@kdionyso

kdionyso commented Nov 27, 2023

Hi all, I know I am a bit late to the party. Have people managed to create tables using CSV? I can read files fine. I can also write empty tables, but the moment I try to populate them with data I get the dreaded

java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z

error. Any ideas anyone?

As an example I can do the following:

    spark.sql(
        """
            CREATE TABLE IF NOT EXISTS TEST200 (
                MODEL_NAME STRING,
                MODEL_STAGE STRING
            ) USING CSV
        """
    )

but not

    spark.sql(
        f"""
        CREATE TABLE TEST201 USING CSV AS  (SELECT 'test' MODEL_NAME,
                'Production' MODEL_STAGE) 
        """
    )

Below is the beginning of the error:

py4j.protocol.Py4JJavaError: An error occurred while calling o40.sql.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1) (127.0.0.1 executor driver): org.apache.spark.SparkException: [TASK_WRITE_FAILED] Task failed while writing rows to file:/<REDACTED>/spark-warehouse/test201.

The weird thing is that while I am writing the table with data, some part files get generated but they get subsequently deleted.

@garretwilson
Member

@kdionyso perhaps this might be better placed in a separate ticket? I think this ticket is more about adding documentation. (And I want to get to that eventually! 😅 )

And if there is some way to get a stack trace, I could better understand the code path that is leading to this problem.

@kdionyso

@garretwilson Yes, sure. I was just wondering whether people with the setup described above have managed to write tables in pyspark/Windows.
