
Coursera Scala Course's Capstone uses your library, but it may not work in that context #48

Open
codeaperature opened this issue Oct 7, 2023 · 6 comments

Comments

@codeaperature

codeaperature commented Oct 7, 2023

Hi Vincenzo,

To me, it's unclear how to use your library, and it's possible that the Coursera Scala Course's Capstone (in its build file) points to information that is no longer valid in the README. I posted this to Stack Overflow. This course is hard without being able to do the simple things; it would be nice if you updated your README to help work out this TypeTags issue. Note that I tried to make the code on Stack Overflow match Spark's advice; I also tried to follow the README, but didn't post that attempt. In the Coursera project, I don't think we can change the build file.

Stefan

@vincenzobaz
Owner

Hi @codeaperature thank you for opening the issue!

To use our encoders, all you need is import scala3encoders.given; they are then available in the implicit scope, and you can obtain a reference with summon.

I can adapt your Stack Overflow snippets as follows:

import scala3encoders.given
import org.apache.spark.sql.Encoder

case class StationX(stnId: Int, wbanId: Int, lat: Double, lon: Double)

object Station extends App:
  val ss = summon[Encoder[StationX]]
  println(ss.schema)

and

package observatory
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.Encoder
import scala.reflect.ClassTag
import scala.deriving.Mirror
import scala3udf.{Udf => udf}
import scala3encoders.given

case class CC(i: Int)
object SparkInstance extends App {
  val spark = SparkSession
    .builder()
    .appName("Spark SQL UDF scalar example")
    .getOrCreate()

  def getSchema[T: Mirror.ProductOf: ClassTag] = summon[Encoder[T]].schema
  val random = udf(() => Math.random())
  val plusOne = udf((x: Int) => x + 1)
  val ss = getSchema[CC]
}

You should not need to write a function such as getSchema
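For example (a sketch reusing the CC case class above; assumes spark-sql and the scala3encoders artifact are on the classpath), the schema can be obtained directly at the call site:

```scala
import scala3encoders.given
import org.apache.spark.sql.Encoder

case class CC(i: Int)

object SchemaDemo extends App:
  // summon resolves the Encoder given derived by scala3encoders,
  // so no generic getSchema helper is needed.
  println(summon[Encoder[CC]].schema)
```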

@michael72
Contributor

I'm a little flustered and worried that an actual course uses Spark together with Scala 3. I would consider this combination experimental and not suited for beginners (although Scala 3 is, IMHO, much better than Scala 2).

@vincenzobaz
Owner

@michael72 IIRC the course is offered in both Scala 2 and Scala 3.
The assignments were tested in Scala 3 and many students have completed it successfully.

But it has been out for a while; maybe the course manager should investigate whether the Scala 3 version has caused more problems...

@codeaperature
Author

codeaperature commented Oct 14, 2023

I finally got back to this (I have a regular Data Eng job too) ... I do not believe the project's constraints allow me to add extra libraries, and it seems that this part does not compile in the project:

.../observatory/src/main/scala/observatory/SparkInstance.scala:8:8
Not found: scala3udf
import scala3udf.{Udf => udf}

Maybe I made some other changes. BTW - Did you download the project or just check this in another way?

There is no requirement to use Spark, and the assignment actually uses a jarred resource. Per the course suggestion, the data needs to be stream-loaded into memory and then pushed into a Spark dataframe/dataset to be processed. I think Spark is just unnecessary overhead here in terms of memory, code, and socket open/close time; I can simply use parallel collections to do a simple join.
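As a sketch of that parallel-collections alternative (field names are hypothetical; assumes the separate scala-parallel-collections module, which is required on Scala 2.13+/3):

```scala
import scala.collection.parallel.CollectionConverters._

case class Station(stnId: Int, wbanId: Int, lat: Double, lon: Double)
case class Reading(stnId: Int, celsius: Double)

// Simple in-memory hash join by station id, parallelised over the larger side.
def joinByStation(stations: Seq[Station], readings: Seq[Reading]): Seq[(Station, Reading)] =
  val byId = stations.map(s => s.stnId -> s).toMap
  readings.par
    .flatMap(r => byId.get(r.stnId).map(s => (s, r)).toList)
    .seq
```

Indexing the smaller side once and parallelising only the probe keeps the join cheap without any Spark session overhead.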

I'm going to drop this issue as I am taking a different path, but I am still curious if Coursera provided a bunk suggestion to use your library without supplying the proper tooling in the build.sbt.

Thanks for your past attention to look into this item.

@vincenzobaz
Owner

I think I understand the issue better now. The assignment does not involve UDFs; @michael72 implemented the udf support long after the release of the course.
I could reach out to the new person in charge of the course and ask them to include the udf dependency.

I will also ask if other people reported this issue. I am sorry for the frustration this has caused you.
I collaborated with the course authors so I know it is not easy to maintain a large codebase and still make it extensible.

@codeaperature
Author

Yeah - I tried to do some things differently ... for example, a UDF to convert °C -> °F, though this could be done in another way. Also, I wanted to use datasets with StructTypes automatically derived from case classes.
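For what it's worth, since the udf artifact is missing from the capstone's build, the same °C -> °F conversion can be done without a UDF via a typed map, with both encoders derived by scala3encoders (a sketch; the case class names are illustrative):

```scala
import org.apache.spark.sql.SparkSession
import scala3encoders.given

case class Celsius(stnId: Int, temp: Double)
case class Fahrenheit(stnId: Int, temp: Double)

object C2F extends App:
  val spark = SparkSession.builder().master("local[*]").appName("c2f").getOrCreate()

  val ds = spark.createDataset(Seq(Celsius(1, 100.0), Celsius(2, 0.0)))
  // Typed transformation instead of a UDF; the Encoder for each case class
  // comes from the scala3encoders given instances imported above.
  ds.map(c => Fahrenheit(c.stnId, c.temp * 9.0 / 5.0 + 32.0)).show()
```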

Thanks for looking into this item for me.
