I built something similar for my own use: it lets any LLM float on top of all windows so you can chat with it. I don't collect any such info; your data goes only to the LLM provider you are interacting with:
It does feel that way: the position eventually loses its meaning as more and more data gets crunched by the training process, until it feels like just a context of the past 4 tokens.
The annoying (?) part of Scala Spark is the lack of a notebook ecosystem. Also, spark-submit requires a compiled JAR for Scala but only the main Python script for Python. I would have loved Scala Spark if the ecosystem were in place.
One significant disadvantage of PySpark is its reliance on py4j to serialize and deserialize objects between Java and Python when using Python UDFs; this per-exchange overhead grows burdensome as data volume increases. However, I am glad to see efforts to create a data pipeline framework using Python and Ray.
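To make the overhead concrete, here is a plain-Python sketch (not PySpark itself, just the standard pickle module) illustrating the difference between serializing values one row at a time, as happens per-record for a Python UDF crossing the JVM/Python boundary, versus one batched exchange:

```python
import pickle
import time

rows = list(range(100_000))

# Per-row round trip: each value is serialized and deserialized
# individually, loosely analogous to the per-record UDF exchange.
start = time.perf_counter()
roundtripped = [pickle.loads(pickle.dumps(r)) for r in rows]
per_row_s = time.perf_counter() - start

# Batched round trip: one serialization call for the whole chunk.
start = time.perf_counter()
batch = pickle.loads(pickle.dumps(rows))
batched_s = time.perf_counter() - start

assert roundtripped == rows and batch == rows
print(f"per-row: {per_row_s:.3f}s, batched: {batched_s:.3f}s")
```

On typical hardware the per-row path is slower by an order of magnitude or more, which is the same shape of cost py4j-mediated UDF traffic pays as data volume grows.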
~One suggestion, a Scala/Java Spark run of those benchmarks should be a valid baseline to compare against as well instead of PySpark.~
Ah, it's Spark SQL, so the execution probably wouldn't involve much py4j, except for the collect.
There are also pandas UDFs, which use Arrow as the exchange format. I assume the data still has to be copied (?), but it makes (de)serialization fast and allows for vectorized operations.
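A minimal sketch of what the body of such a pandas UDF looks like: Spark hands it a whole Arrow-backed batch as a pandas Series, so the work is vectorized rather than per-row. The wrapping shown in the comment assumes a PySpark session is available; the function itself runs on plain pandas:

```python
import pandas as pd

# The function a pandas UDF executes: it receives an entire batch
# as a pandas Series (transferred via Arrow) and returns a Series,
# so the arithmetic is vectorized instead of called once per row.
def plus_one(s: pd.Series) -> pd.Series:
    return s + 1.0

# Inside Spark this would be registered roughly as:
#   from pyspark.sql.functions import pandas_udf
#   plus_one_udf = pandas_udf(plus_one, "double")
#   df.select(plus_one_udf("x"))
batch = pd.Series([1.0, 2.0, 3.0])
print(plus_one(batch).tolist())  # [2.0, 3.0, 4.0]
```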