I'm looking for Python code that removes C and C++ comments from a string. (Assume the string contains an entire C source file.)
I realize that I could .match() substrings with a Regex, but that doesn't solve nesting /*
, or having a //
inside a /* */
.
Ideally, I would prefer a non-naive implementation that properly handles awkward cases.
I don't know if you're familiar with sed
, the UNIX-based (but Windows-available) text parsing program, but I've found a sed script here which will remove C/C++ comments from a file. It's very smart; for example, it will ignore '//' and '/*' if found in a string declaration, etc. From within Python, it can be used using the following code:
import subprocess
from cStringIO import StringIO
input = StringIO(source_code) # source_code is a string with the source code.
output = StringIO()
process = subprocess.Popen(['sed', '/path/to/remccoms3.sed'],
input=input, output=output)
return_code = process.wait()
stripped_code = output.getvalue()
In this program, source_code
is the variable holding the C/C++ source code, and eventually stripped_code
will hold C/C++ code with the comments removed. Of course, if you have the file on disk, you could have the input
and output
variables be file handles pointing to those files (input
in read-mode, output
in write-mode). remccoms3.sed
is the file from the above link, and it should be saved in a readable location on disk. sed
is also available on Windows, and comes installed by default on most GNU/Linux distros and Mac OS X.
This will probably be better than a pure Python solution; no need to reinvent the wheel.
This handles C++-style comments, C-style comments, strings and simple nesting thereof.
def comment_remover(text):
def replacer(match):
s = match.group(0)
if s.startswith('/'):
return " " # note: a space and not an empty string
else:
return s
pattern = re.compile(
r'//.*?$|/\*.*?\*/|\'(?:\\.|[^\\\'])*\'|"(?:\\.|[^\\"])*"',
re.DOTALL | re.MULTILINE
)
return re.sub(pattern, replacer, text)
Strings needs to be included, because comment-markers inside them does not start a comment.
Edit: re.sub didn't take any flags, so had to compile the pattern first.
Edit2: Added character literals, since they could contain quotes that would otherwise be recognized as string delimiters.
Edit3: Fixed the case where a legal expression int/**/x=5;
would become intx=5;
which would not compile, by replacing the comment with a space rather then an empty string.