Add documentation for PySpark #2

Open

paulbauriegel opened this issue Feb 23, 2023 · 13 comments

@paulbauriegel

Thank you for writing this. Loading the library in PySpark actually took some time to figure out. Maybe it makes sense to add an example to the README? This is the way I'm using it now:

# Resolve the Java classes through py4j and register BareLocalFileSystem
# as the implementation for the file:// scheme.
ReflectionUtil = sc._gateway.jvm.py4j.reflection.ReflectionUtil
sc._jsc.hadoopConfiguration().setClass(
    "fs.file.impl",
    ReflectionUtil.classForName("com.globalmentor.apache.hadoop.fs.BareLocalFileSystem"),
    ReflectionUtil.classForName("org.apache.hadoop.fs.FileSystem"))
@wobu

wobu commented Feb 24, 2023

I couldn't get it running and had to make some minor changes. I guess it could be due to a different Spark version; I am using Spark 3.2.2.

I tried the following:

spark = builder.getOrCreate()

ReflectionUtil = spark._sc._jvm.py4j.reflection.ReflectionUtil
spark._sc._jsc.hadoopConfiguration().setClass("fs.file.impl",
                                               ReflectionUtil.classForName("com.globalmentor.apache.hadoop.fs.BareLocalFileSystem"),
                                               ReflectionUtil.classForName("org.apache.hadoop.fs.FileSystem"))

but the code was still failing due to native code access :(

@garretwilson
Member

but the code was still failing due to native code access :(

I'm not familiar at all with PySpark, but I'll see what I can do to help. This could be the version of Spark, or it could be a different way you're using Spark that is causing a different code path to be invoked.

Please provide a stack trace that shows which native code is being invoked. It may be easy to add the necessary changes to provide native Java access for those other methods. It was always expected that this initial version might not cover all the methods for all code paths, but I'll need to know which code paths are involved to know which parts need to be added. Thanks.

@wobu

wobu commented Feb 24, 2023

import pyspark
import sys


def get_or_create_test_spark_session():
    """ Get or create a spark session
    """
    builder = pyspark.sql.SparkSession.builder \
        .appName("Tests") \
        .master("local[*]") \
        .config("spark.sql.execution.arrow.pyspark.enabled", "true") \
        .config("spark.ui.enabled", "false") \
        .config("spark.driver.host", "127.0.0.1")

    if sys.platform.startswith('win'):
        # On Windows, add the bare local FS package and register it for the file:// scheme.
        builder = builder \
            .config("spark.jars.packages", "com.globalmentor:hadoop-bare-naked-local-fs:0.1.0")

        spark = builder.getOrCreate()

        ReflectionUtil = spark._sc._jvm.py4j.reflection.ReflectionUtil
        spark._sc._jsc.hadoopConfiguration().setClass(
            "fs.file.impl",
            ReflectionUtil.classForName("com.globalmentor.apache.hadoop.fs.BareLocalFileSystem"),
            ReflectionUtil.classForName("org.apache.hadoop.fs.FileSystem"))
    else:
        spark = builder.getOrCreate()
    return spark

pyspark_stacktrace.txt

@garretwilson
Member

Thanks for the stack trace. I'll take a look, but it may be mid next week before I can get to it. Feel free to ping me in a few days if it slips my mind.

@paulbauriegel
Author

I would like to help but I cannot reproduce the issue yet. So just as a reference, what I did on Windows 10 with OpenJDK 11 is:

  1. downloading Spark 3.2.3 pre-built for Hadoop 3.2 here
  2. then creating an empty winutils.txt in the bin folder that I renamed to winutils.exe, because otherwise Hadoop complains about the missing file
  3. setting HADOOP_HOME & SPARK_HOME to the Spark root folder I downloaded & PATH to spark-3.2.3-bin-hadoop3.2/bin
  4. running pyspark --packages com.globalmentor:hadoop-bare-naked-local-fs:0.1.0
  5. then just reading some sample files via the code I shared
  6. just using @wobu's code as a script also worked

It also worked on my Intel Mac.
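
A rough sketch of steps 3 and 4 done from Python instead of the shell, with placeholder paths (the exact install location is an assumption):

import os

# Placeholder path: wherever the downloaded spark-3.2.3-bin-hadoop3.2 folder
# (with the empty winutils.exe under bin\) was extracted.
spark_home = r"C:\spark-3.2.3-bin-hadoop3.2"
os.environ["SPARK_HOME"] = spark_home
os.environ["HADOOP_HOME"] = spark_home
os.environ["PATH"] += os.pathsep + os.path.join(spark_home, "bin")

# Then create the session with the package and register fs.file.impl exactly as
# in the opening comment or in @wobu's helper above.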

@wobu
Copy link

wobu commented Feb 27, 2023

I am trying to run within a unit test / pytest.
I don't have Spark or the Spark CLI manually installed; I am only using pyspark 3.2.3, installed within a venv.

I am using an AMD machine.
The Scala variant is working fine.
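
For that kind of setup, a minimal pytest fixture sketch wrapping the helper posted above might look like this; the module name spark_test_helpers is made up for illustration (in practice the helper could live in conftest.py):

import pytest

# Assumes get_or_create_test_spark_session() from the earlier comment is importable;
# the module name here is hypothetical.
from spark_test_helpers import get_or_create_test_spark_session


@pytest.fixture(scope="session")
def spark():
    session = get_or_create_test_spark_session()
    yield session
    session.stop()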

@Admolly

Admolly commented Apr 20, 2023

Agreed. Some documentation on how to use this with PySpark would indeed be helpful for those getting started on Windows. Thank you @paulbauriegel for opening the issue.

@garretwilson
Member

garretwilson commented Jul 13, 2023

Hi, everyone. I'd like to make sure this issue is addressed. Help me get caught up on the status. Bear with me a bit as I haven't touched Python in a while and I've certainly never used PySpark.

This ticket seems to be primarily about updating the documentation for use with PySpark, but I also see some notes about someone not being able to get it to work with PySpark at all. The stack trace in this comment didn't show any references to the Bare Naked Local FileSystem at all, so I'm not sure PySpark is even using the correct FileSystem implementation in that case.

Could someone verify whether they are or are not getting this to work with PySpark, and explain how they did it? Thanks.

@paulbauriegel
Author

@garretwilson It's working fine with PySpark for me. How to do it I described in the opening comment; if something there is unclear, I expand on it a bit in a follow-up comment. I primarily opened this issue so that others can find out how to use your library with PySpark without much research. You can add it to the README as a comment or just close the issue. Either way, I can confirm that it works on Mac and Windows with PySpark without any issue (I only tested local mode, not a cluster setup).

@snoe925

snoe925 commented Nov 10, 2023

I have managed to get this configuration to work for me on Windows for PySpark. By using Hadoop configuration files the syntax is a bit easier; that is, the PySpark setup looks like the normal examples on the internet.

There are problems: you will be limited to CSV formats. For example, Parquet will not work, nor will anything else that uses Hadoop classes for I/O.

No Hadoop is installed on Windows. HADOOP_HOME is not set, so there are all the documented warnings.

TL;DR:
In $SPARK_HOME/jars create two Hadoop configuration files: core-site.xml and hdfs-site.xml.

core-site.xml contains an empty configuration (the XML flavor of empty).
hdfs-site.xml contains a single property setting fs.default.name to com.globalmentor.apache.hadoop.fs.BareLocalFileSystem.

I am running a Spark install pre-built for Apache Hadoop 3.3. I have no local Hadoop install and HADOOP_HOME is not set.

I'll look more into Parquet support.
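
For comparison, the same kind of property can also be injected from PySpark itself via the spark.hadoop.* config prefix instead of editing XML files. This is only a sketch, not snoe925's exact setup; it uses the fs.file.impl key from the earlier comments rather than fs.default.name, and assumes passing the class name as a plain string is acceptable:

from pyspark.sql import SparkSession

# Any "spark.hadoop.<key>" entry is copied into the Hadoop Configuration,
# which mirrors putting the same property into core-site.xml / hdfs-site.xml.
spark = (SparkSession.builder
         .master("local[*]")
         .config("spark.jars.packages", "com.globalmentor:hadoop-bare-naked-local-fs:0.1.0")
         .config("spark.hadoop.fs.file.impl",
                 "com.globalmentor.apache.hadoop.fs.BareLocalFileSystem")
         .getOrCreate())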

@kdionyso

kdionyso commented Nov 27, 2023

Hi all, I know I am a bit late to the party. Have people managed to create tables using CSV? I can read files fine. I can also write empty tables, but the moment I try to populate them with data I get the dreaded

java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z

error. Any ideas anyone?

As an example I can do the following:

    spark.sql(
        """
            CREATE TABLE IF NOT EXISTS TEST200 (
                MODEL_NAME STRING,
                MODEL_STAGE STRING
            ) USING CSV
        """
    )

but not

    spark.sql(
        f"""
        CREATE TABLE TEST201 USING CSV AS  (SELECT 'test' MODEL_NAME,
                'Production' MODEL_STAGE) 
        """
    )

Below is the beginning of the error:

py4j.protocol.Py4JJavaError: An error occurred while calling o40.sql.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1) (127.0.0.1 executor driver): org.apache.spark.SparkException: [TASK_WRITE_FAILED] Task failed while writing rows to file:/<REDACTED>/spark-warehouse/test201.

The weird thing is that while I am writing the table with data, some part files get generated but they get subsequently deleted.

@garretwilson
Member

@kdionyso perhaps this might be better placed in a separate ticket? I think this ticket is more about adding documentation. (And I want to get to that eventually! 😅 )

And if there is some way to get a stack trace, I could better understand the code path that is leading to this problem.

@kdionyso

@garretwilson Yes, sure. I was just wondering whether people with the setup described above have managed to write tables in pyspark/Windows.
