cmdlineluser · 2024-12-16T16:26:34 1734366394

There is a Spark API[1] being built using their Relational API[2].

Progress is being tracked on GitHub Discussions[3].

[1]: https://duckdb.org/docs/api/python/spark_api.html

[2]: https://duckdb.org/docs/api/python/relational_api.html

[3]: https://github.com/duckdb/duckdb/discussions/14525

mwc360 · 2024-12-16T18:19:27 1734373167

Very cool! This seems like fantastic functionality and would make it super easy to migrate small Spark workloads to DuckDB :)

cmdlineluser · 2024-11-20T21:46:10 1732139170

> cross product and filter

`.join_where()`[1] was also added recently.

[1]: https://docs.pola.rs/api/python/stable/reference/dataframe/a...

cmdlineluser · 2024-11-20T19:35:25 1732131325

With Polars you use `df.select()` or `df.with_columns()` which return "new" DataFrames - so you don't have mutable objects everywhere.

There is an SO answer[1] by the Polars author which may have some relevance.

[1]: https://stackoverflow.com/questions/73934129/

cmdlineluser · 2024-10-07T19:44:10 1728330250

No vi editing mode :-(

> The new REPL will not be implementing inputrc support, and consequently there won't be a vi editing mode.

https://github.com/python/cpython/issues/118840#issuecomment...

EasyMark · 2024-10-08T14:24:54 1728397494

IPython is a much better REPL anyway; use that, it has a vi mode.

cmdlineluser · on Sept 9, 2024

Their "keep ruby weird" quine is my favourite: https://www.youtube.com/watch?v=IgF75PjxHHA

  %q!.!;eval$s=%q{eval(%w{$s=("%q!.!;eval$s=%q{#$s}"+'.gsub(/#{27.chr<<92<<91}[0-9]+m/,"")').lines;C=->x,y{Complex(x,y)};P=->r,a{Complex.polar(r,a*Math::PI)};S=->((a,b),(c
      ,d)){[a-c,b-d]};D=->((a,b),(c,d)){(a*c.conj).real+b*d};R=->((a,b),(c,d)){e,f=a.rect;g,h=c.rect;[C[f*d-b*h,b*g-e*d],e*h-f*g]};a=[];b=[];6.times{|i|a<<[P[1,i/3.0],1];b<<[P
      [2,(i+0.5)/3.0],0]};F=[a];6.times{|j|F<<[a[i=j-1],b[i],a[j]]<<[a[j],b[i],b[j]]<<[b[i],[0,-2],b[j]]};J=->k,v{k=[k/500.0,0].min+2.5;(v/(k+0.5)+C[k,k/2])*48};r=0;T=->p{x,y=
      p.rect;z    =O[y/2];x>=0&&y>=0&&z&&z[x]&&z[x]|=(y%2>1?2:1)|r};L=->p,q{s=(p+q)/2;(p-q).abs<1?(T[s];q):L[L[p,s],q]};N=->((a,b)){s=(a.abs2+b*b)**0.5;[a/s,b/s]};E=->p,r,a{10
      0.times  {|  i|m=-P[r,1.5+i*a*0.02];c,d=(p+m).rect;(c.abs-d>2||d<-1)&&(T[J[0,p+m]];T[J[0,C[-c,d           ]]])}};A=27.chr;$><<"%q#{33.           chr+A}[H#{A}[2J";g="NZDD
      CLYJXMX  ;Y  K(OQ'PP  YZA5YTZ7M(VOBBSYVXQQ[SUZV(U:G[NVZ[ZS&V[(YUU(ZTT[[X'X&Y%Y'ZZWW['Z&$$[%(''  '(&[$(%($(  CRGZHZI)DIOZ;IVPZ(SP)[X*  DRZCGJJT<<  +XI,%%:S=[E==RE&LEXX-'.
      RMY:(>>U(HU  /U[/[R   KOO0Y$1F?LZ%&M@(2NGU341RU?+6S2(NVYVAFVR(8FFRYRN4W'NHI@>(EUM6H@ZISSMS-XL  .LLVL?RR8O[O K9B,$%Y[3Y0X";41.upto(91 ){|c|g=g[2..  -1].gsub(c.chr){g[0,2]
      }};G=g.spli  t(?();  Z=[C[7,10],C[13,76]];U=(0..48).map{[0]*169};O=[];srand(0);q=20;m=40;x=0; Threa   d.new  {open("/dev/dsp","w"){  |f|50   0.dow nto(0){|k|e=[];300.tim
      es{|i|e<<((  x=(x+  k**1.9/9e4+0.001)%8)>4?138:118)};f<<e.pack("C*")}}};I=(0..48).map{[0]*169  };9.t  imes{ |y|76.times{|x|"1ea8yyjb v4x7d  zlzqj  sxd8dz4uqjfpb66bq7tu6l
      wql6vdbds6f  6h60  xz2iglxie44ax1nygtie5t8xpgk2oq00uzj0ucoq2gqc70y9fplfzez0d682syamnhicpwflot4  o9s".to_i(  36)[x+y*76]>0&&I[24+y][4  +x]=1}};-5  00.upto(518){|k|k==1&&s
      leep(1);k=   =q&  &(n=rand(11);q+=m;m-=2      ;spawn("espeak","-s",(60+m*2).to_s,"-ven+#{n<7?"m    #{n+    1}":"f#{n-6}"}",["keep",k<    320|    |k>360?"ruby":"Austin","
      weird"]*32  .c  hr));a=[-k,0].max**1.9/  1e4;v  ,z=d=N[[P[1,a/3],Math.sin(Math::PI*a)]];u=N[z==0 ?      [0,-1]:z>0?[v,z-1/z]:[-v,1/z-z]]      ; n=R[d,u];O.replace(U.map{
      |l|l+       [  ]});F.m    ap{    |f|a,  b,c=f;D  [R[S[b,a],S[c,a]],d]>0&&f.size.times{|i|a,b=f[  i], f[i-1];L[J[k,C[D[n,a],D[u,a]]],J[k,C[D [n,  b],D[u,b]]]]}};k>0&&(b=M
      ath.  sin(    [k,210]  .m  i  n*  Math  :   :PI  /15)/36;a=P[1,b*2];y=(30-[k,30].min)/10.0-2.8; E[C[ -1.2,y]*a.conj,(2-k/10%2)*0.06,-1];E[C [-1. 2,y]*a.conj,0.45,-1];E[C
      [-1.  2,y    +1.1]*a,  2.  7  5,  b-0.   3];    E[C[-0.6,y+0.5]*a,1.85,b-0.25]);k>=90&&k<210&&2 .tim es{|i|G[(k-90)*2+i].scan(/./){|c|Z[i]+ =[C[ 1,0],C[0,2],C[-1,0],C[0,
      -2]][c     .  ord%4]}  ;i  =  k/  10%   7*8+92  ;U[Z[0].  imag/2][Z[0].real,2]=U[Z[1].imag/2][Z [1].  real,2]=[i,i];r=8;L[J[k,C[-1.2,-2.8]  *a.c onj],Z[0]];L[J[k,C[1.2,-
      2.8]*a],   Z[  1]];r=0}                ;      j=        0;k==400&&15.times{|y|65.times{|x|"7gtz whx13 bfmrr9tsr8y0d007qlmygnh47axi9g9v609t cxjuv la0k6y1r96drdisqmfpao411
      n6e661l3  zykt   bqk   p4i33eecq7i2u  tfm  2n0bhrviijbr51nwcuhm5ufx3t79a9whf01e3a8kzzepid45ro83 n9r07k xxeht1pycrqo72".to_i(36)[x+y*65]>0 &&U[21 +y][14+x]=88}};s=O.map{|
      l|i=0;j+  =1;l.m     ap{|n|(i+=1)>2&  &i<8  3&&k>260&&k<420&&((k-j/6)%80>60||k>320&&k<400&&I[j- 1][i]>0 )&&n=88;a=("%c^_@****"%32)[n%8]; n>7?"%c [%dm%s%c[0m"%[27,30+n/8,
      a,27]:a}  *""};$><<A+"[H"+(0..47).ma  p{|i  |k+i>517?$s[i].chomp.gsub(32.chr){27.chr+"[44m"+$&+ 27.chr+"  [0m"}:s[i]}*10.chr;sleep(0.0  2)};puts }*'');%q{%q.;eval$s=%qev
      al(%w$s=  ("%q.;eval$s=%q$s"+'.gsub(/  27.  chr<<92<<91[0-9]+m/,"")').lines;C=->x,yComplex(x,y)  ;P=->r,a  Complex.polar(r,a*Math::PI  );S=->((  a,b),(c,d))[a-c,b-d];D=-
      >((a,b),  (c,d)          )(a*c.conj).  re  al+b*d;R=->((a,b),(c   ,d))e,f=a.   rect;g,h=c.rect;[ C[f*d-b*h   ,b*g-e*d],e*h-f*g];a=[   ];b=[];6. times|i|a<<[P[1,i/3.0],1]
      ;b<<[P[2,(i+0.5   )/3.0],   0];F=   [a    ];6   .t       imes|j|   F<<[a[i=   j-1],b[i],a[j]]<<[  a[j],b                                [i],b[  j]]<<[b[i],[0,-2],b[j]];J
      =->k,vk=[k/500.   0,0].min+   2.5   ;(v/(k+0.   5)   +C[k,   k/2]   )*48;r   =0;T=->px,y=p.rect;z =O[   y /2];x>=0&&  y>=0  &&z&&z[x]& &   z[x ]|=(y%2>1?2:1)|r;L=->p,qs=
      (p+q)/2;(p-q).a   bs<1?(T[s]   ;q   ):L[L[p,s   ],   q];N=-   >((a   ,b))   s=(a.abs2+b*b)**0.5;[a    /s, b/s];E=-  >p,r,a10  0.times| i|m    =-P[r,1.5+i*a*0.02];c,d=(p+
      m).rect;(c.abs-   d>2||d<-1)   &&   (T[J[0,p+   m]   ];T[J[   0,C[-   c,   d]]]);A=27.chr;$><<"%q   33.ch r+A[HA  [2J";g="NZDD  CLYJXM X;YK(   OQ'PPYZA5YTZ7M(VOBBSYVXQQ[
      SUZV(U:G[NVZ[ZS   &V[(YUU(ZT   T[   [X'X&Y%Y'   ZZ   WW[    'Z&$$[%(      '''(&[$(%($(CRGZHZI)D  IOZ;IVP  Z(SP  )[X*DRZCGJJT<<+X  I,%%  :S=[E==  RE&LEXX-'.RMY:(>>U(HU/U[
      /[RKOO0Y$1F?LZ%   &M@(2NGU3   41R   U?+6S2(NV   YV          AFVR(8FFR    YRN4W'NHI@>(EUM6H@Z   ISSMS-XL. LLV  L?RR8O[OK9B,$%Y[3Y0X  ";4 1.upto(91   )|c|g=g[2..-1].gsub(c
      .chr)g[0,2];G=g            .split   (?();Z=[C   [7   ,10],C   [13,76]]   ;U=(0..48).map[0]   *169;O=[];s r  and(0);q=20;m=40;x=0;Thr  e ad.newopen(   "/dev/dsp","w")|f|5
      00.downto(0)|k|   e=[];3   00.time   s|i|e<<(   (x   =(x+k**   1.9/9e   4+0.001)%8)>4?13  8:118);f<<e.pa  ck("C*");I=(0..48).map[0]*16  9;9.times|y|76  .times|x|"1ea8yyj
      bv4x7dzlzqjsxd8   dz4uqjf   pb66bq   7tu6lwql   6v   dbds6f6   h60xz   2iglxie44ax1nygti                                                                e5t8xpgk2oq00uzj0
      ucoq2gqc70y9fpl   fzez0d68   2syam   nhicpwfl   ot   4o9s".t   o_i(   36)[x+y*76]>0&&I[24+  y][4+x]=1;-50 0.upto(518)|k|k==1&&sleep(1) ;k==q&&(n=ran  d(11);q+=m;m-=2;spa
      wn("espeak","-s   ",(60+m*2   ).to   _s,"-ve   n+n   <7?"mn   +1":   "fn-6"",["keep",k<320||  k>360?"ruby" :"Austin","weird"]*32.chr) );a=[-k,0].m  ax**1.9/1e4;v,z=d=N[[
      P[1,a/3],Math.   sin(Math::P   I*a)   ]];u=   N[z=   =0?     [0,-   1]:z>0?[v,z-1/z]:[-v,1/z-z  ]];n=R[d,u] ;O.replace(U.map|l|l+[]) ;F.map|f|a,  b,c=f;D[R[S[b,a],S[c,a]
      ],d]>0&&f.size   .times|i|a,   b=f[i]       ,f[i-1]       ;L[J[k,   C[D[n,a],D[u,a]]],J[k,C[D[n,  b],D[u,b]] ]];k>0&&(b=Math.sin([k ,210].min*  Math::PI/15)/36;a=P[1,b*2
      ];y=(30-[k,30]   .min)/10.0-2.8;E[C[-1.2,y]*a.conj,(2-k/10%2)*0.06,-1];E[C[-1.2,y]*a.conj,0.45,-1]  ;E[C[-1.2 ,y+1.1]*a,2.75,b-0.3 ];E[C[-0.  6,y+0.5]*a,1.85,b-0.25]);k>
      =90&&k<210&&2.times|i|G[(k-90)*2+i].scan(/.   /)|c|Z[i]+=[C[1,  0],C[0,2],C[-1,0],C[0,-2]][c.ord%4];  i=k/10%7 *8+92;U[Z[0].imag/ 2][Z[0].  real,2]=U[Z[1].imag/2][Z[1].r
      eal,2]=[i,i];r=8;L[J[k,C[-1.2,-2.8]*a.conj],Z[0]];L[J[k,C[1.2,-  2.8]*a],Z[1]];r=0;j=0;k==400&&15.time  s|y|65. times|x|"7gtzwhx 13bfmrr  9tsr8y0d007qlmygnh47axi9g9v609t
      cxjuvla0k6y1r  96drdisqmf  pao4        11n6  e661  l      3zyktb  qkp4i33eecq7i2utfm2n0bhrviijbr51nwcuhm  5ufx3t 79a9whf01e3a8k zzepid  45ro83n9r07kxxeht1pycrqo72".to_i(
      36)[x+y*65]>0  &&U[  21+y]  [1  4+x]=8  8;s=  O.ma  p|l|i=  0;j+=  1;l.map|n|(i+=1)>2&&i<83&&k>260&&k<420&  &((k- j/6)%80>60|| k>320  &&k<400&&I[j-1][i]>0)&&n=88;a=("%c^
      _@****"%32)[n  %8];  n>7?"  %c         [%dm%  s%c[  0m"%[27,        30+n/8,a,27]:a*"";$><<A+"[H"+(0..47).map  |i|k +i>517?$s[ i].c  homp.gsub(32.chr)27.chr+"[44m"+$&+27.
      chr+"[0m":s[i]  *10  .chr;  sl  eep(0.02  );  puts  *'');%  q%q.;e  val$   s=%qeval(%w$s=("%q.;eval$s=%q$s"+'.  gsu b(/27.ch r<<  92<<91[0-9]+m/,"")').lines;C=->x,yCompl
      ex(x,y);P=->r,aC   om      plex         .pola  r(r  ,a*Ma  th::PI);  S=-   >((a,b),(c,d))[a-c,b-d];D=->((a,b),(c  ,d ))(a*c .c  onj).real+b*d;R=->((a,b),(c,d))e,f=a.rect
      ;g,h=c.rect;[C[f*d-b*h,b*g-e*d],e*h-f*g];a=[]  ;b=  [];6.t  imes|i  |a<<[P[1,i/3.0],1];b<<[P[2,(i+0.5)/3.0],0];F=[  a ];6. t  imes|j|F<<[a[i=j-1],b[i],a[j]]<<[a[j],b[i],
      b[j]]<<[b[i],[0,-2],b[j]];J=->k,vk=[k/500.0,0].min+2.5;(v/(        k+0.5)+C[k,k/2])*48;r=0;T=->px,y=p.rect;z=O[y/2];   x>   =0&&y>=0&&z&&z[x]&&z[x]|=(y%2>1?2:1)|r;L=->p,
      qs=(p+q)/2;(p-q).abs<1?(T[s];q):L[L[p,s],q];N=->((a,b))s=(a.abs2+b*b)**0.5;[a/s,b/s];E=->p,r,a100.times|i|m=-P[r,1.5+i    *a*0.02];c,d=(p+m).rect;(c.abs-d>2||d<-1)&&(T[J
      [0,p+m]];T[J[0,C[-c,d]]]);A=27.chr;$><<"%q33.chr+A[HA[2J";g="NZDDCLYJXMX;YK(OQ'PPYZA5YTZ7M(VOBBSYVXQQ[SUZV(U:G[NVZ[ZS&V[(YUU(ZTT[[X'X&Y%Y'ZZWW['Z&$$[%('''(&[$(%($(CRGZHZ
      I)DIOZ;IVPZ(SP)[X*DRZCGJJT<<+XI,%%:S=[E==RE&LEXX-'.RMY:(>>U(HU/U[/[RKOO0Y$1F?LZ%&M@(2NGU341RU?+6S2(NVYVAFVR(8FFRYRN4W'NHI@>(EUM6H@Z}}.gsub(/#{27.chr<<92<<91}[0-9]+m/,"")

mjcohen · on Sept 9, 2024

Looks ok to me - can't see anything wrong.

dorianmariefr · on Sept 9, 2024

rubocop says no offenses :D

cmdlineluser · on Sept 9, 2024

The last I read, the Spark API was to become the focus point.

https://duckdb.org/docs/api/python/spark_api

Not sure what the current status is.

ref: https://github.com/duckdb/duckdb/issues/2000#issuecomment-18...

cmdlineluser · on March 11, 2024

Does it fail on nightly?

There were some recent fixes: https://github.com/duckdb/duckdb/issues/10737

cmdlineluser · on March 4, 2024

Are you talking about the 2nd table in the Benchmark section?

It seems they are not running against the full dataset:

> Moving on to the 100 million file to see if size makes a difference.

  ggplot2::autoplot(reorderMicrobenchmarkResults(bench1e8))

One would also have to run both approaches on the same hardware for a meaningful comparison?

cmdlineluser · on Feb 22, 2024

Nice article.

In Python, I have been finding Polars nicer to use:

  (purchases
     .filter(pl.col("amount") <= pl.col("amount").median().over("country") * 10)
     .group_by("country")
     .agg(total = (pl.col("amount") - pl.col("discount")).sum())
  )

Not as compact as the R example but gets a bit closer compared to the pandas approach.

- https://pypi.org/project/polars/

- https://github.com/pola-rs/polars/

d0mine · on Feb 24, 2024

Why not SQL for pure declarative queries? Here's an LLM-hallucinated SQL query for the Polars example:

    SELECT country, SUM(amount - discount) AS total
    FROM purchases
    WHERE amount <= (
        SELECT MEDIAN(amount) * 10
        FROM purchases
        WHERE country = purchases.country
    )
    GROUP BY country;

It might just be an issue of familiarity, but SQL seems the most straightforward and easiest to understand for me.

anakaine · on Feb 24, 2024

Probably because the article wasn't about comparing to SQL, or any other database, but rather looked at the R vs Python debate specifically?

d0mine · on Feb 24, 2024

What is wrong with suggesting an alternative approach that makes the solution more readable?

Using an appropriate DSL for the problem may be useful. In Python:

    df.to_sql('purchases', db, index=False)
    print(*db.execute(query))
    # -> ('Canada', 270) ('USA', 8455)

E.g., we can use regexes to query text. Python is a general-purpose language: you can query text without regexes, but it would be insanity to ignore regexes completely (I don't know how easy it is to invoke regexes from R). Another example: a bash pipeline can be embedded in Python ("generate --flag | filter arg | sink") without reimplementing it in pure Python (you can do that, but it would be ugly). No idea how easy it is to invoke shell commands from R. SQL is just another DSL in this case: use it in Python when it makes the solution more readable.
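
A minimal sketch of both ideas, using only the standard library. The sample text and the `printf | sort | uniq` pipeline are invented stand-ins for illustration (not the "generate | filter | sink" commands above), and the pipeline assumes a POSIX shell is available:

```python
import re
import subprocess

# Regex as an embedded DSL: pull the order numbers out of free text.
text = "order 17 shipped, order 42 pending"  # invented sample text
order_ids = re.findall(r"order (\d+)", text)

# Shell pipeline as an embedded DSL: let the shell do the plumbing
# instead of reimplementing sort/uniq in pure Python.
result = subprocess.run(
    "printf 'b\\na\\nb\\n' | sort | uniq",  # hypothetical pipeline
    shell=True,
    capture_output=True,
    text=True,
    check=True,
)
unique_lines = result.stdout.split()

print(order_ids, unique_lines)
```

Only pass `shell=True` a command string you control; building it from untrusted input is a shell-injection risk.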

d0mine · on Feb 24, 2024

It looks like the LLM hallucinated a query that doesn't group by country when computing the median. Here's the version generated after asking it to fix that:

    SELECT p.country, SUM(p.amount - p.discount) AS total
    FROM purchases p
    JOIN (
        SELECT country, MEDIAN(amount) *  10 AS median_amount
        FROM purchases
        GROUP BY country
    ) m ON p.country = m.country
    WHERE p.amount <= m.median_amount
    GROUP BY p.country;
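
For anyone who wants to try the fixed query from Python, here is a minimal sketch. It makes several assumptions: SQLite is used only because it ships with Python, the `Median` class is a hypothetical helper (a stock SQLite build may not provide a `MEDIAN` aggregate), the rows are invented, and the `ORDER BY` is added just to make the output order stable:

```python
import sqlite3
from statistics import median


class Median:
    """Hypothetical MEDIAN aggregate, registered because SQLite may lack one."""

    def __init__(self):
        self.values = []

    def step(self, value):
        # Called once per row in the group; NULLs are skipped.
        if value is not None:
            self.values.append(value)

    def finalize(self):
        return median(self.values) if self.values else None


db = sqlite3.connect(":memory:")
db.create_aggregate("median", 1, Median)
db.execute("CREATE TABLE purchases (country TEXT, amount INTEGER, discount INTEGER)")
db.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?)",
    [
        # Invented rows: the last USA amount is far above 10x the USA median.
        ("USA", 100, 10), ("USA", 120, 0), ("USA", 5000, 0),
        ("Canada", 50, 5), ("Canada", 60, 0), ("Canada", 70, 10),
    ],
)

fixed_query = """
    SELECT p.country, SUM(p.amount - p.discount) AS total
    FROM purchases p
    JOIN (
        SELECT country, MEDIAN(amount) * 10 AS median_amount
        FROM purchases
        GROUP BY country
    ) m ON p.country = m.country
    WHERE p.amount <= m.median_amount
    GROUP BY p.country
    ORDER BY p.country  -- added only to make the output order deterministic
"""
totals = db.execute(fixed_query).fetchall()
print(totals)
```

With these invented rows, the 5000 purchase exceeds ten times the USA median (120) and is filtered out, so the totals come to 165 for Canada and 210 for USA.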

wodenokoto · on Feb 24, 2024

You get into a lot of other problems that are straightforward in pandas/R but very difficult in SQL.

d0mine · on Feb 24, 2024

It is not either/or. Use Python where it is strong, and execute SQL queries from Python where appropriate.

Being a glue language is one of Python's strengths.

cmdlineluser · on Feb 20, 2024

I'm sorry but BeautifulSoup is not just a wrapper over lxml.

lxml even has a module for using beautifulsoup's parser.

> lxml can make use of BeautifulSoup as a parser backend

https://lxml.de/elementsoup.html

> A very nice feature of BeautifulSoup is its excellent support for encoding detection which can provide better results for real-world HTML pages that do not (correctly) declare their encoding.