Optimising check-interpreter

Discussion:

Jonathan Perkin

2014-09-02 15:03:37 UTC

This one is probably more controversial, but the gains are so large
that it's worth raising for discussion.

I'd like to change mk/check/check-interpreter.mk so that it only
checks executable files. The rationale being that I can't think of
any cases where we'd care what the shebang of a non-executable file
is, and if a file is erroneously non-executable when it should be
executable, then that should be fixed prior to the check being run.

The diff is simple:

http://us-east.manta.joyent.com/pkgsrc/public/patches/check-interp.diff

and the performance difference for lang/ruby193-base is before:

$ ptime bmake _check-interpreter
=> Checking for non-existent script interpreters in ruby193-base-1.9.3p547

real 1:36.154904091
user 17.554778405
sys 1:10.566866515

and after:

$ ptime bmake _check-interpreter
=> Checking for non-existent script interpreters in ruby193-base-1.9.3p547

real 2.658741177
user 1.339411743
sys 1.236949825

due to reducing the number of sed(1) calls required down from 13,000+
to just 108.

If there are legitimate reasons for wanting to check non-existent
files, how about making it a variable so it can be easily turned on or
off?

Thanks,

--
Jonathan Perkin - Joyent, Inc. - www.joyent.com

David Holland

2014-09-03 08:14:37 UTC

Permalink

Post by Jonathan Perkin
This one is probably more controversial, but the gains are so large
that it's worth raising for discussion.
I'd like to change mk/check/check-interpreter.mk so that it only
checks executable files. The rationale being that I can't think of
any cases where we'd care what the shebang of a non-executable file
is, and if a file is erroneously non-executable when it should be
executable, then that should be fixed prior to the check being run.

That seems reasonable to me... as long as we have some reason to
believe that non-executable and therefore unchecked files aren't going
to become executable after the check, perhaps due to the activities of
INSTALL or whatnot.

I wouldn't think that this would happen, but still...

Maybe instead the script could be converted to something higher-
performance? Right now it does a shell read loop over every file, and
those are notoriously slow in addition to requiring multiple execs per
file. It should be possible to do something along the lines of

${_CHECK_INTERP_FILELIST_CMD} | ${XARGS} ${GREP} -Hn '^#!' | \
${SED} -n 's/^$[^:]*$:1:#![[:space:]]*$[^[:space:]]*$.*/\2 \1/p' | \
${SORT} | ${AWK} '
{
if ($$1 != previnterp) { doprint(); previnterp = $$1; }
files[++nfiles] = $$2;
}
function doprint() {
if (previnterp == "") {
return;
}
printf "%s", previnterp;
for (i=1;i<=nfiles;i++) {
printf " %s", files[i];
}
printf "\n";
delete files;
nfiles = 0;
}
' | while read interp files; do
bad=''
case "$$interp" of
/bin/env | /usr/bin/env)
bad="is not allowed"
;;
*)
if [ ! -f "$$interp" -a ! -f ${DESTDIR}"$$interp" ]; then
bad="does not exist"
fi
;;
esac
if [ "x$bad" != x ]; then
echo "[check-interpreter.mk] The interpreter $interp $bad in:"
for f in $files; do
echo "[check-interpreter.mk] $f"
done
fi
done

which is still a shell read loop, but loops only once per distinct
interpreter; this will normally be a small number. It could probably
be made to not loop, but then safe quoting of the interpreter strings
becomes a bit of a problem.

Of course, this assumes we're allowed to use xargs here, and I'm not
clear on that...

--
David A. Holland
***@netbsd.org

Joerg Sonnenberger

2014-09-03 09:36:00 UTC

Permalink

Post by David Holland
Maybe instead the script could be converted to something higher-
performance? Right now it does a shell read loop over every file, and
those are notoriously slow in addition to requiring multiple execs per
file.

Move the regex into awk and build a single awk script that runs over all
files?

Joerg