DupTective is a Python program that helps you find duplication in source code at the method level. It currently works on Python source, but can be easily extended to other languages.

DupTective is at version 0.1, so there is room for improvement, especially new features and, um, "portability enhancement": it is known to work with Python 2.2 under Linux, and will probably work on all other Unix systems (including OS X) and on Windows systems with Python and gzip.

Please post any successes and failures here. This is also a good place to request new features (e.g. being able to specify boundaries by regexp).

----

'''How to get it:'''

[I'll put this in as soon as I check the links. --GeorgePaci]

----

'''How to use it:'''

You can check a single source file in just a few seconds:

 python duptective.py myFile.py

This gives you the ten regions (usually methods or functions) most likely to be duplicated elsewhere or to contain duplication themselves.

To find more regions, use a numeric flag:

 python duptective.py -20 myFile.py

This gives you the top twenty.

To find all regions, use the -all flag:

 python duptective.py -all myFile.py

You may want to redirect this output, since it can be large:

 python duptective.py -all myFile.py > /tmp/myFileDups

To find all regions with a signal lower than 20%, use a percent flag:

 python duptective.py -20% myFile.py

Running times barely depend on how much you report.

Multiple files work the same way as single files:

 python duptective.py myFile.py myOtherFile.py

This gives you the ten most duplication-prone regions of the two input files together. Methods duplicated (or nearly duplicated) between the two files should show up close to each other in the report.

On Unix (and possibly Cygwin), you can use shell globbing to specify files:

 python duptective.py /usr/local/lib/python2.2/email/*py

This gives you the top ten for all Python source files in the given directory. On an older machine (Pentium II at 266 MHz), this takes about 20 seconds (for 13 files).
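Where the shell does not expand wildcards (the Windows command prompt, for instance), the glob can be expanded in Python before DupTective is called. The following is a minimal sketch and not part of DupTective: `build_command` is a hypothetical helper, and it assumes that `duptective.py` sits in the current directory and that `python` names an interpreter able to run it. The flags are the ones described above.

```python
import glob


def build_command(patterns, limit=None, script="duptective.py", python="python"):
    """Return an argv list for one DupTective run (hypothetical helper).

    patterns -- file names or glob patterns such as "email/*py"
    limit    -- None for the default top ten, or 20, "all", "20%" as above
    """
    files = []
    for pattern in patterns:
        matches = sorted(glob.glob(pattern))
        # Keep a pattern that matches nothing, so the problem stays visible
        # on the command line instead of vanishing silently.
        files.extend(matches or [pattern])

    command = [python, script]
    if limit is not None:
        command.append("-" + str(limit))  # flags come before the file names
    return command + files
```

The resulting list can be handed to `subprocess.call` to produce the report.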
----

'''Interpreting the Output:'''

The previous command yields the following output:

 Total info: 14759
   #   len  info  signal   share  location
  46:   346    18   5.20%   0.12%  /usr/local/lib/python2.2/email/Message.py(181:189)
  47:   343    20   5.83%   0.14%  /usr/local/lib/python2.2/email/Message.py(190:198)
   3:   283    19   6.71%   0.13%  /usr/local/lib/python2.2/email/Encoders.py(30:41)
  54:   343    26   7.58%   0.18%  /usr/local/lib/python2.2/email/Message.py(289:299)
  69:  1384   107   7.73%   0.72%  /usr/local/lib/python2.2/email/MIMEImage.py(19:46)
  53:   347    28   8.07%   0.19%  /usr/local/lib/python2.2/email/Message.py(278:288)
  48:   327    28   8.56%   0.19%  /usr/local/lib/python2.2/email/Message.py(199:207)
   4:   303    27   8.91%   0.18%  /usr/local/lib/python2.2/email/Encoders.py(42:53)
  65:  1464   132   9.02%   0.89%  /usr/local/lib/python2.2/email/MIMEAudio.py(43:71)
  70:   271    26   9.59%   0.18%  /usr/local/lib/python2.2/email/MIMEMessage.py(1:14)

The columns are as follows:

* '''#''' is the number of the chunk as encountered by DupTective; this can be useful for spotting regions that are near each other
* '''len''' is the length of the region in bytes
* '''info''' is the number of bytes of information the region contributes to the whole
* '''signal''' is the ratio of info/len, expressed as a percentage
* '''share''' is the ratio of info/total, expressed as a percentage
* '''location''' is the filename and range of lines (inclusive, starting at 1) where the region is located

The most important column is '''signal''', and the report is sorted from lowest signal to highest. Low signal corresponds to duplication, either within the region itself, or between the region and some other region somewhere in the source.

Very low signal levels (<5%) usually indicate blatantly obvious duplication. Look nearby in the report for prime suspects, or in the region itself if it's internally repetitive. Signal levels above 20% usually indicate non-duplicate code, at least as far as I can see.
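The column definitions can be checked against the first row of the sample report: 18/346 is about 5.20% (signal) and 18/14759 is about 0.12% (share). The sketch below parses one report row and recomputes both ratios. It is an illustration only, not DupTective's own code: the function names are made up, the field layout is assumed from the sample output, and the text returned for signals between 5% and 20% is a guess at a sensible middle category.

```python
def parse_row(line):
    """Split one report row into a dict (layout assumed from the sample)."""
    # Split on whitespace at most five times, so a location that
    # contains spaces stays in one piece.
    number, length, info, signal, share, location = line.split(None, 5)
    return {
        "number": int(number.rstrip(":")),
        "len": int(length),
        "info": int(info),
        "signal": float(signal.rstrip("%")),
        "share": float(share.rstrip("%")),
        "location": location.strip(),
    }


def signal_percent(info, length):
    """signal = info / len, as a percentage."""
    return 100.0 * info / length


def share_percent(info, total):
    """share = info / total, as a percentage."""
    return 100.0 * info / total


def rule_of_thumb(signal):
    """Rough reading of a signal value, using the 5% and 20% levels."""
    if signal < 5.0:
        return "blatant duplication likely"
    if signal > 20.0:
        return "probably not duplicated"
    return "worth a look"  # middle band: wording is this sketch's own


TOTAL_INFO = 14759  # from the "Total info" line of the sample report
row = parse_row(
    "46: 346 18 5.20% 0.12% /usr/local/lib/python2.2/email/Message.py(181:189)"
)
print("%.2f%% signal, %.2f%% share: %s" % (
    signal_percent(row["info"], row["len"]),
    share_percent(row["info"], TOTAL_INFO),
    rule_of_thumb(row["signal"]),
))
```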
These numbers may depend on the language, the coding standard, the amount of commenting (comments are usually English text, which is around 30% signal), and the average method size. More experience using DupTective will yield better rules of thumb.

----

'''Emacs Integration:'''

If you use Emacs, I recommend using a keyboard macro similar to the following to jump to the region indicated:

 (defalias 'goto-chunk
   (read-kbd-macro "C-e M-b C-b C-SPC M-b M-w C-b C-SPC C-r SPC C-f C-x C-x M-w C-x C-f C-y RET M-g C-y M-y RET"))
 (local-set-key [f4] 'goto-chunk)

----

'''Feature Requests:'''

* specify region beginnings by regular expression (to support other languages)
* ignore whitespace
* ignore comments (including docstrings)
* report on a single region

----

'''Platform Experiences:'''

Works fine with Python 2.2 and gzip 1.3 under Linux (RedHat 7.1).

Has anybody tried it on Windows, with or without CygWin?