The HTML will have been run through a purifier first (TinyMCE + WordPress), so it should match standard forms. Script and style tags are stripped, and data inside tags is HTML-encoded, so there are no extraneous symbols to worry about.
I know the general stance on parsing HTML with regular expressions is "don't", but in this specific example the problem seems less like parsing and more like simple string processing... Am I missing some unseen level of complexity?
As far as I can break it down, the pattern in question can be split into these logical components:
/<[a-zA-Z][^>]+ - matches the start of an HTML tag and any mix of tags and attributes within, but not the end bracket
(?i:class)=\" - the start of the class attribute, case-insensitive
(?: - starts a non-capturing sub-pattern
(?: *[a-zA-Z_][\w-]* +)* - any number of class names (or none); if they exist, there must be whitespace before the capture
( *".implode('|', $classes)." *) - the set of classes to capture, preg_quoted
(?: +[a-zA-Z_][\w-]* *)* - any number of class names (or none); if they exist, there must be whitespace after the capture
)+ - closes the non-capturing sub-pattern and loops in case there are multiple matching classes in one attribute
\"(?: [^>]*)>/ - the end of the class attribute, and the end of the HTML tag
Making the final regex:
$pattern = "/<[a-zA-Z][^>]+ (?i:class)=\"(?:(?: *[a-zA-Z_][\w-]* +)*( *".implode('|', $classes)." *)(?: +[a-zA-Z_][\w-]* *)*)+\"(?: [^>]*)>/";
I haven't tried running this yet, because I know that if it works I'll be heavily tempted to use it. Running it through preg_replace seems like it should do the job, except for one minor issue: I believe it will leave extraneous whitespace around the capture area. This isn't a significant issue, but it might be nice to avoid, if anyone knows how.
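On the whitespace point, one possible way around it is to match the whole class attribute, split its value into names, drop the unwanted ones, and rebuild the attribute, so no stray spaces are left behind. The following is only a sketch under the assumptions stated in the question (purified HTML, double-quoted attributes, no raw `>` inside attribute values). The function name `remove_classes_regex` is made up for illustration.

```php
<?php
// Hypothetical helper (not from the original post): strips the listed class
// names from every double-quoted class="..." attribute in $html.
function remove_classes_regex(string $html, array $classes): string
{
    // Group 1: tag name plus any attributes before class.
    // Group 2: the class attribute value.
    // Group 3: anything after the class attribute, up to the closing bracket.
    // Assumes purified HTML, i.e. no raw '>' or '"' inside attribute values.
    $pattern = '/<([a-zA-Z][^>]*?)\s+class="([^"]*)"([^>]*)>/i';

    $result = preg_replace_callback(
        $pattern,
        function (array $m) use ($classes): string {
            // Split the value on whitespace and drop the unwanted names.
            $names = preg_split('/\s+/', $m[2], -1, PREG_SPLIT_NO_EMPTY);
            $kept  = array_diff($names, $classes);

            // Rebuild the tag; omit the attribute entirely if nothing is left.
            $attr = $kept ? ' class="' . implode(' ', $kept) . '"' : '';
            return '<' . $m[1] . $attr . $m[3] . '>';
        },
        $html
    );

    // preg_replace_callback returns null on a PCRE error; fall back to the input.
    return $result ?? $html;
}
```

For instance, `remove_classes_regex('<p class="a  b c">x</p>', ['b'])` gives `<p class="a c">x</p>`. Because the names are compared as whole tokens with `array_diff`, the list needs no `preg_quote`, and there is no alternation whose precedence could go wrong.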
It should be noted that this is not a mission-critical process, and if the capture fails to remove the classes, no one dies.
So, in essence... can anyone explain what makes this a bad idea in this case?
OK, so you have a list of class names that you want to remove from the given HTML?
What I mean to say is: you are given a list of class names you want to remove. Can you give an example of the typical HTML as it is, and what you want to change it to? For example:
Before:
<div class="someclass"> <i class="dontchange dochange"></i> <a class="hello john"></a> </div>
Change to:
<div> <i class="dontchange"></i> <a></a> </div>
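If the transformation is as in this example, a DOM parser is one alternative to a regex over the whole tag. The following is a rough sketch, assuming PHP's bundled DOM extension is available and that the input is an HTML fragment. The function name `remove_classes_dom` is invented for illustration, and the exact whitespace of the serialized output may differ from the input.

```php
<?php
// Hypothetical helper (not from the original post): removes the listed class
// names using DOMDocument instead of a regular expression.
function remove_classes_dom(string $html, array $classes): string
{
    $doc = new DOMDocument();

    // Silence libxml warnings about HTML5 tags or minor markup issues.
    $previous = libxml_use_internal_errors(true);
    // Wrap the fragment in a full document so the parser has a <body> to fill.
    $doc->loadHTML(
        '<!DOCTYPE html><html><head><meta charset="UTF-8"></head><body>'
        . $html . '</body></html>'
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Visit every element that carries a class attribute.
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//*[@class]') as $element) {
        $names = preg_split('/\s+/', $element->getAttribute('class'), -1, PREG_SPLIT_NO_EMPTY);
        $kept  = array_diff($names, $classes);

        if ($kept) {
            $element->setAttribute('class', implode(' ', $kept));
        } else {
            // Nothing left: drop the attribute rather than leave class="".
            $element->removeAttribute('class');
        }
    }

    // Serialize only the children of <body>, i.e. the original fragment.
    $body = $doc->getElementsByTagName('body')->item(0);
    $out  = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```

Calling it with the "before" markup and the list `['someclass', 'dochange', 'hello', 'john']` should leave `<i class="dontchange"></i>` and a bare `<a></a>` inside a `<div>` with no class attribute. The trade-off is that the parser re-serializes the markup, so anything it normalizes will change too.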