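I'm trying to implement the following formula in Julia for calculating the Gini coefficient of a wage distribution:

$$G = 1 - \frac{\sum_{i=1}^{n} f(y_i)\,(S_{i-1} + S_i)}{S_n}$$

where

$$S_i = \sum_{j=1}^{i} f(y_j)\,y_j \quad \text{and} \quad S_0 = 0.$$

Here's a simplified version of the code I'm using for this: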

    # Takes an array whose first column is the value of wages
    # (y_i in the formula), and whose second column is the probability
    # of that wage value (f(y_i) in the formula).
    function gini(wagedistarray)
        # First calculate the S values in the formula
        for i in 1:length(wagedistarray[:,1])
            for j in 1:i
                swages[i] += wagedistarray[j,2]*wagedistarray[j,1]
            end
        end

        # Now calculate the value to subtract from 1 in the Gini formula
        gwages = swages[1]*wagedistarray[1,2]
        for i in 2:length(swages)
            gwages += wagedistarray[i,2]*(swages[i]+swages[i-1])
        end

        # Final step of the Gini calculation
        return giniwages = 1-(gwages/swages[length(swages)])
    end

    wagedistarray = zeros(10000,2)
    swages = zeros(length(wagedistarray[:,1]))

    for i in 1:length(wagedistarray[:,1])
        wagedistarray[i,1] = 1
        wagedistarray[i,2] = 1/10000
    end

    @time result = gini(wagedistarray)

It gives a value near zero, which is what you'd expect for a completely equal wage distribution. However, it takes quite a long time: 6.796 secs.
Any ideas for improvement?
Try this:

    function gini(wagedistarray)
        nrows = size(wagedistarray,1)
        swages = zeros(nrows)  # now local to the function

        for i in 1:nrows
            for j in 1:i
                swages[i] += wagedistarray[j,2]*wagedistarray[j,1]
            end
        end

        gwages = swages[1]*wagedistarray[1,2]
        for i in 2:nrows
            gwages += wagedistarray[i,2]*(swages[i]+swages[i-1])
        end

        return 1-(gwages/swages[length(swages)])
    end

    wagedistarray = zeros(10000,2)
    for i in 1:size(wagedistarray,1)
        wagedistarray[i,1] = 1
        wagedistarray[i,2] = 1/10000
    end

    @time result = gini(wagedistarray)

- Time before: 5.913907256 seconds (4000481676 bytes allocated, 25.37% gc time)
- Time after: 0.134799301 seconds (507260 bytes allocated)
- Time after (second run): elapsed time: 0.123665107 seconds (80112 bytes allocated)
The primary problem is that swages was a global variable (it wasn't living in the function), which is not good coding practice and, more importantly, is a performance killer. The other thing I noticed was length(wagedistarray[:,1]), which makes a copy of that column and then asks for its length, generating some extra "garbage". The second run is faster because there is some compilation time the first time the function is run.
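To illustrate the point about the column copy, here is a minimal sketch. The variable names `A`, `n_copy`, `n_size` and `col` are made up for this example and do not come from the code above.

```julia
A = zeros(10000, 2)

# A[:, 1] allocates a fresh 10000-element vector just to measure its length.
n_copy = length(A[:, 1])

# size(A, 1) returns the number of rows directly; no column is copied.
n_size = size(A, 1)

# If the column itself is needed, a view refers to the original data in place.
col = @view A[:, 1]
col[1] = 5.0   # writes through to A[1, 1]
```

Both approaches give the same row count, but only the first allocates a temporary that the garbage collector later has to clean up.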
You can crank performance even higher using @inbounds, i.e.

    function gini(wagedistarray)
        nrows = size(wagedistarray,1)
        swages = zeros(nrows)

        @inbounds for i in 1:nrows
            for j in 1:i
                swages[i] += wagedistarray[j,2]*wagedistarray[j,1]
            end
        end

        gwages = swages[1]*wagedistarray[1,2]
        @inbounds for i in 2:nrows
            gwages += wagedistarray[i,2]*(swages[i]+swages[i-1])
        end

        return 1-(gwages/swages[length(swages)])
    end

which gives me elapsed time: 0.042070662 seconds (80112 bytes allocated)
Finally, check out this version, which is faster and, I think, more accurate:

    function gini2(wagedistarray)
        # Running sums S_i of f(y_j) * y_j, computed in a single pass
        swages = cumsum(wagedistarray[:,1].*wagedistarray[:,2])
        gwages = swages[1]*wagedistarray[1,2] +
                    sum(wagedistarray[2:end,2] .*
                            (swages[2:end]+swages[1:end-1]))
        return 1 - gwages/swages[end]
    end

which has elapsed time: 0.00041119 seconds (721664 bytes allocated). The main benefit is changing from an O(n^2) double loop to an O(n) cumsum.
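The same O(n) idea can also be written as a single explicit loop that carries the running sum along instead of building temporary arrays. This is only a sketch under the same input layout as above (wages in column 1, probabilities in column 2). The name `gini_loop` is made up for this example, and it has not been timed against the versions above.

```julia
# Hypothetical single-pass variant; not part of the original answer.
function gini_loop(wagedistarray)
    nrows = size(wagedistarray, 1)
    sprev = 0.0    # S_{i-1}, starting from S_0 = 0
    gwages = 0.0   # running sum of f(y_i) * (S_{i-1} + S_i)
    for i in 1:nrows
        # S_i = S_{i-1} + f(y_i) * y_i
        scur = sprev + wagedistarray[i, 2] * wagedistarray[i, 1]
        gwages += wagedistarray[i, 2] * (sprev + scur)
        sprev = scur
    end
    # After the loop, sprev holds S_n
    return 1 - gwages / sprev
end

# Two wages, 1 and 3, each with probability 0.5
two_point = gini_loop([1.0 0.5; 3.0 0.5])
```

For the two-point example, S_1 = 0.5 and S_2 = 2, so the sum is 0.5*0.5 + 0.5*2.5 = 1.5 and the result is 1 - 1.5/2 = 0.25.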