Can you check the ruby-spark project?

Greetings Ruby community,

Does anybody know why this project is no longer being updated?

It's for Spark 1.x, but we now use Spark 3.x.
It doesn't even have DataFrame API support.
Today we use the DataFrame API much more widely than the low-level RDD API.
So I think that if the project were actively developed again, it would help
a lot of people like me who work in both Ruby development and data science.

I hope you have a happy holiday.
Thanks - Piper

> Does anybody know why this project is no longer being updated?
> GitHub - ondra-m/ruby-spark: Ruby wrapper for Apache Spark
> It's for Spark 1.x, but we now use Spark 3.x.
> It doesn't even have DataFrame API support.

Some comments from ondra-m in the issues:

Feb 26, 2017 "If more people will want compatibility with version 2,
I'll look at it."

Sep 3, 2018 "Sorry but currently I don't have time to maintain this
library."

Nov 6, 2019 "its very hard to use dynamic language in a distributed
environment. Specially ruby cannot serialize an anonymous function"
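
To illustrate the serialization point in that last comment, here is a minimal sketch using only Ruby's standard `Marshal` module. It is not taken from ruby-spark, and the helper name `marshalable?` is made up for this example. It shows that a lambda cannot be dumped to bytes, which is what a driver would need to do to ship a function to remote workers, while plain data and method names can.

```ruby
# A lambda carries a reference to its surrounding scope (a closure),
# so Ruby's built-in Marshal refuses to serialize it.
double = ->(x) { x * 2 }

# Hypothetical helper: true if Marshal can serialize the object.
def marshalable?(obj)
  Marshal.dump(obj)
  true
rescue TypeError
  false
end

puts marshalable?(double)     # => false (a Proc has no dump format)
puts marshalable?([1, 2, 3])  # => true  (plain data serializes fine)
puts marshalable?(:upcase)    # => true  (a method name can be sent instead)
```

Sending a method name or a string of source code instead of a block is one possible workaround, but it gives up the closure's captured variables.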

> So I think that if the project were actively developed again, it would help
> a lot of people like me who work in both Ruby development and data science.

Something more native in Ruby would be great, but I did a quick test
based on pycall.rb (GitHub - mrkn/pycall.rb: Calling Python functions from the Ruby language):

$ cat pyspark.rb
require 'pycall'

# Import the Python module through PyCall and alias the classes we need.
SparkSql = PyCall.import_module('pyspark.sql')
SparkSession = SparkSql.SparkSession
SparkRow = SparkSql.Row

spark = SparkSession.builder.getOrCreate

df = spark.createDataFrame [
  SparkRow.new(a: 2, b: 3.0, c: 'string1'),
  SparkRow.new(a: 4, b: 9.0, c: 'string2'),
  SparkRow.new(a: 8, b: 17.0, c: 'string3'),
]

df.show
# =>
# +---+----+-------+
# |  a|   b|      c|
# +---+----+-------+
# |  2| 3.0|string1|
# |  4| 9.0|string2|
# |  8|17.0|string3|
# +---+----+-------+

df.printSchema
# =>
# root
# |-- a: long (nullable = true)
# |-- b: double (nullable = true)
# |-- c: string (nullable = true)

df.select('a', 'c').describe.show
# =>
# +-------+-----------------+-------+
# |summary|                a|      c|
# +-------+-----------------+-------+
# |  count|                3|      3|
# |   mean|4.666666666666667|   null|
# | stddev|3.055050463303893|   null|
# |    min|                2|string1|
# |    max|                8|string3|
# +-------+-----------------+-------+

df.filter(df.a < 5).show
# =>
# +---+---+-------+
# |  a|  b|      c|
# +---+---+-------+
# |  2|3.0|string1|
# |  4|9.0|string2|
# +---+---+-------+

df.filter(df.b > 5).show
# =>
# +---+----+-------+
# |  a|   b|      c|
# +---+----+-------+
# |  4| 9.0|string2|
# |  8|17.0|string3|
# +---+----+-------+

···

On 12/24/21, Piper H <potthua@gmail.com> wrote:

Thanks Frank for clarifying the question. That makes sense.

Regards.

Unsubscribe: <mailto:ruby-talk-request@ruby-lang.org?subject=unsubscribe>
<http://lists.ruby-lang.org/cgi-bin/mailman/options/ruby-talk>

Hello

Is there an alternative to Azure, i.e. running our own servers instead of
the MS cloud (and freely handing all our data to MS)?

Opti

···

On 24.12.21 at 16:31, Frank J. Cameron wrote:

> On 12/24/21, Piper H <potthua@gmail.com> wrote:
>> Does anybody know why this project is no longer being updated?
>> GitHub - ondra-m/ruby-spark: Ruby wrapper for Apache Spark
>> It's for Spark 1.x, but we now use Spark 3.x.
>> It doesn't even have DataFrame API support.

This may not be a Ruby question, but if you are looking for a hosting
solution, you can get a dedicated server from OVH, Liteserver, Hetzner, etc.
Their pricing is cheap.

HTH.

Hi,
Thanks for the answer.
I meant the software: how can I run Spark on my own server? It seems to
use the Azure APIs?
Opti

···

On 25.12.21 at 11:06, Wes Peng wrote:

> This may not be a Ruby question, but if you are looking for a hosting
> solution,

Hello Opti,

It's quite easy to run Spark on your own host by following the guide:
Overview - Spark 3.5.0 Documentation

Also, Databricks Inc. provides a great book that is free to read online:

If you are new to Spark, this book is a good place to start.

Thanks.

Thank you!

···

On 26.12.21 at 12:00, Piper H wrote:

> Hello Opti,
>
> It's quite easy to run Spark on your own host by following the guide:
> Overview - Spark 3.5.0 Documentation