statistics - Gini Coefficient in Julia: Efficient and Accurate Code


I'm trying to implement the following formula in Julia for calculating the Gini coefficient of a wage distribution:

$$G = 1 - \frac{\sum_{i=1}^{n} f(y_i)\,(S_{i-1} + S_i)}{S_n}$$

where $S_i = \sum_{j=1}^{i} f(y_j)\, y_j$ and $S_0 = 0$.

Here's a simplified version of the code I'm using for this:

    # Takes an array whose first column is the value of wages
    # (y_i in the formula), and whose second column is the probability
    # of that wage value (f(y_i) in the formula).
    function gini(wagedistarray)
        # First calculate the S values in the formula
        for i in 1:length(wagedistarray[:,1])
            for j in 1:i
                swages[i]+=wagedistarray[j,2]*wagedistarray[j,1]
            end
        end

        # Now calculate the value to subtract from 1 in the Gini formula
        gwages = swages[1]*wagedistarray[1,2]
        for i in 2:length(swages)
            gwages += wagedistarray[i,2]*(swages[i]+swages[i-1])
        end

        # Final step of the Gini calculation
        return giniwages=1-(gwages/swages[length(swages)])
    end

    wagedistarray=zeros(10000,2)
    swages=zeros(length(wagedistarray[:,1]))

    for i in 1:length(wagedistarray[:,1])
        wagedistarray[i,1]=1
        wagedistarray[i,2]=1/10000
    end

    @time result=gini(wagedistarray)

It gives a value near zero, which is what you'd expect for an equal wage distribution. However, it takes quite a long time: 6.796 seconds.

Any ideas for improvement?

Try this:

    function gini(wagedistarray)
        nrows = size(wagedistarray,1)
        swages = zeros(nrows)
        for i in 1:nrows
            for j in 1:i
                swages[i] += wagedistarray[j,2]*wagedistarray[j,1]
            end
        end

        gwages=swages[1]*wagedistarray[1,2]
        for i in 2:nrows
            gwages+=wagedistarray[i,2]*(swages[i]+swages[i-1])
        end

        return 1-(gwages/swages[length(swages)])
    end

    wagedistarray=zeros(10000,2)
    for i in 1:size(wagedistarray,1)
        wagedistarray[i,1]=1
        wagedistarray[i,2]=1/10000
    end

    @time result=gini(wagedistarray)

  • time before: 5.913907256 seconds (4000481676 bytes allocated, 25.37% gc time)
  • time after: 0.134799301 seconds (507260 bytes allocated)
  • time after (second run): 0.123665107 seconds (80112 bytes allocated)

The primary problem is that swages was a global variable (it wasn't living in the function), which is not good coding practice and, more importantly, is a performance killer. The other thing I noticed was length(wagedistarray[:,1]), which makes a copy of that column and then asks for its length, generating extra "garbage". The second run is faster because there is compilation time the first time a function is run.
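To make those two points concrete, here is a minimal sketch. All of the names in it (A, acc_global, fill_global!, fill_local) are made up for illustration and are not from the code above; it only shows the patterns, it is not a benchmark.

```julia
# Illustrative names only; none of these appear in the original post.
A = ones(1000, 2)

n_via_copy = length(A[:, 1])   # A[:, 1] allocates a copy of column 1 first
n_direct   = size(A, 1)        # reads the row count without copying
col        = view(A, :, 1)     # refers to the column in place if the data is needed

acc_global = zeros(3)          # non-constant global: its type could change at any time

# Reads and writes the global, so the compiler cannot assume its type.
function fill_global!()
    for i in 1:length(acc_global)
        acc_global[i] += i
    end
    return acc_global
end

# Allocates the accumulator locally, so its type is known inside the function.
function fill_local(n)
    acc = zeros(n)
    for i in 1:n
        acc[i] += i
    end
    return acc
end
```

Both functions return the same values; the difference is only in what the compiler can assume about the accumulator. Passing the array in as an argument would work just as well as creating it inside the function.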

You can crank performance even higher using @inbounds, i.e.

    function gini(wagedistarray)
        nrows = size(wagedistarray,1)
        swages = zeros(nrows)
        @inbounds for i in 1:nrows
            for j in 1:i
                swages[i] += wagedistarray[j,2]*wagedistarray[j,1]
            end
        end

        gwages=swages[1]*wagedistarray[1,2]
        @inbounds for i in 2:nrows
            gwages+=wagedistarray[i,2]*(swages[i]+swages[i-1])
        end

        return 1-(gwages/swages[length(swages)])
    end

which gives me an elapsed time of 0.042070662 seconds (80112 bytes allocated).

Finally, check out this version, which is faster and also accurate, I think:

    function gini2(wagedistarray)
        swages = cumsum(wagedistarray[:,1].*wagedistarray[:,2])
        gwages = swages[1]*wagedistarray[1,2] +
                    sum(wagedistarray[2:end,2] .*
                            (swages[2:end]+swages[1:end-1]))
        return 1 - gwages/swages[end]
    end

which has an elapsed time of 0.00041119 seconds (721664 bytes allocated). The main benefit comes from changing the O(n^2) double loop to an O(n) cumsum.
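If the temporary arrays in the vectorized version matter, the same O(n) idea can probably be written as a single loop that carries the cumulative sum along as it goes. This is only a sketch: gini_onepass is a hypothetical name, it assumes a Float64 matrix laid out as in the question (wages in column 1, probabilities in column 2), and I haven't timed it.

```julia
# Single-pass sketch; gini_onepass is a hypothetical name, not from the original post.
# Column 1 holds the wage values, column 2 their probabilities.
function gini_onepass(wagedistarray)
    s_prev = 0.0   # cumulative sum of probability * wage up to the previous row
    g = 0.0        # running total of probability * (previous sum + current sum)
    for i in 1:size(wagedistarray, 1)
        s = s_prev + wagedistarray[i, 2] * wagedistarray[i, 1]   # cumulative sum up to row i
        g += wagedistarray[i, 2] * (s_prev + s)
        s_prev = s
    end
    return 1 - g / s_prev   # s_prev now holds the total over all rows
end
```

As a quick sanity check, gini_onepass([1.0 0.5; 3.0 0.5]) (two equally likely wages, 1 and 3) gives 0.25, and the equal-wage array from the question gives a value near zero. For the first row the previous sum is zero, so the update reduces to the same first term used in the loop versions above.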

