DotPlot Project

The Eclipse plug-in DotPlot uses a method that originated in genetics to visualize what a set of character strings, words, word sequences, or sentences (collectively called "tokens") have in common. A source text is laid out along both axes of a matrix, and each match (a pair of identical tokens) is marked as a dot. Characteristic patterns in the dotplot let an experienced viewer quickly judge the kind and degree of similarity. Colors assign a weight to each match, so that rare matches are easier to tell apart from frequent ones.
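The matrix idea can be illustrated with a few lines of Java. This is only a minimal sketch under our own assumptions, not code from the DotPlot plug-in: the class name `DotplotSketch`, the whitespace tokenizer, and the weighting by 1/frequency are all hypothetical choices made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; names and weighting are assumptions, not DotPlot's code.
class DotplotSketch {

    // Split text into whitespace-separated tokens (a stand-in for real input filters).
    static String[] tokenize(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    // Count how often each token occurs in the sequence.
    static Map<String, Integer> frequencies(String[] tokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (String t : tokens) {
            counts.merge(t, 1, Integer::sum);
        }
        return counts;
    }

    // Cell (i, j) is 0 if tokens i and j differ; otherwise 1/frequency,
    // so matches between rare tokens get a higher weight.
    static double[][] plot(String[] tokens) {
        Map<String, Integer> counts = frequencies(tokens);
        int n = tokens.length;
        double[][] matrix = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                if (tokens[i].equals(tokens[j])) {
                    matrix[i][j] = 1.0 / counts.get(tokens[i]);
                }
            }
        }
        return matrix;
    }

    // Text rendering: '#' for weight 1 (unique token), '+' for lower weights, '.' for no match.
    static String render(double[][] matrix) {
        StringBuilder sb = new StringBuilder();
        for (double[] row : matrix) {
            for (double w : row) {
                sb.append(w == 0.0 ? '.' : (w >= 1.0 ? '#' : '+'));
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] tokens = tokenize("to be or not to be");
        System.out.print(render(plot(tokens)));
    }
}
```

In the rendered output, the repeated phrase "to be" shows up as a short diagonal away from the main diagonal, which is the kind of pattern a viewer looks for in a real dotplot.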

Project goal

The open-source project is being advanced at FH Gießen-Friedberg as part of Master's projects. The main goal is to develop a flexible tool

  • for generating dotplots from any type of text
  • for automatic plagiarism detection
  • for refactoring programs
  • for identifying authors' styles: "A genuine Shakespeare?"

Method

  • Graphical dotplot technique from genetic research
  • Interactive pattern recognition
  • Sequence alignment

Software engineering

  • Open-source project (GNU GPL)
  • Java implementation as an Eclipse plug-in/RCP
  • Collaboration platform: SourceForge
  • Process: Extreme Programming (XP)

Features

  • Computation in a grid
  • Information Mural algorithm (low-loss interpolation)
  • Dotplot perspective in Eclipse
  • Export to various file formats (JPEG, PNG, PDF)
  • PDF converter and input filters for Java, C++, PHP, ...

To-dos

  • Automatic stemming and sentence-boundary detection for natural languages
  • Information Mural as a navigation aid

Challenges

  • Performance improvements
  • Web interface for an online service
  • Visualization of multimedia data

DotPlot is an Eclipse plug-in for graphically comparing word sequences in any type of text. Matches are plotted as dots on a graph. Similarities in thousands of lines of text or code produce typical textures and diagonals in the plot; see http://imagebeat.com/dotplot/gallery.html for some impressive examples. Similar plots can be achieved with our Eclipse plug-in.

Special source code filters for different programming languages make it easy to concentrate on relevant tokens. Colors help to highlight important matches. To improve performance on a single computer, the Information Mural algorithm (see http://www.cc.gatech.edu/gvu/softviz/infoviz/information_mural.html) has been implemented. Furthermore, a computing grid can be set up easily: with distributed computing, high-resolution plots of very large text inputs become possible. Grid clients do not need the Eclipse environment; they can be run from the command line.
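The core idea behind an Information Mural, as we understand it, is to shrink a very large plot to screen size by letting each output pixel's intensity reflect how many matches fall onto it, instead of dropping rows and columns. The following Java sketch shows that density-accumulation idea in a simplified form. It is an assumption-laden illustration, not DotPlot's implementation: the class `MuralSketch`, the method `downsample`, and the linear scaling to 0..255 are hypothetical.

```java
// Illustrative sketch only; names and scaling are assumptions, not DotPlot's code.
class MuralSketch {

    // Shrink a large match matrix to size x size by counting the matches that
    // fall into each output cell, then scale counts to 0..255 relative to the
    // densest cell.
    static int[][] downsample(boolean[][] matches, int size) {
        int rows = matches.length;
        int[][] counts = new int[size][size];
        int max = 0;
        for (int i = 0; i < rows; i++) {
            int y = (int) ((long) i * size / rows);       // target row
            int cols = matches[i].length;
            for (int j = 0; j < cols; j++) {
                if (matches[i][j]) {
                    int x = (int) ((long) j * size / cols); // target column
                    counts[y][x]++;
                    max = Math.max(max, counts[y][x]);
                }
            }
        }
        int[][] intensity = new int[size][size];
        for (int y = 0; y < size; y++) {
            for (int x = 0; x < size; x++) {
                intensity[y][x] = (max == 0) ? 0 : counts[y][x] * 255 / max;
            }
        }
        return intensity;
    }

    public static void main(String[] args) {
        // A 4x4 plot containing only the main diagonal, reduced to 2x2.
        boolean[][] matches = new boolean[4][4];
        for (int i = 0; i < 4; i++) {
            matches[i][i] = true;
        }
        for (int[] row : downsample(matches, 2)) {
            System.out.println(java.util.Arrays.toString(row));
        }
    }
}
```

Because every match contributes to some output cell, a thin diagonal in a huge plot stays visible after reduction, which plain subsampling would not guarantee.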

The Eclipse plug-in is developed in Java using the following special libraries:

  • PDFBox: adds the ability to read PDF files
  • iText: adds the ability to export plots to PDF files
  • JAI: Java Advanced Imaging, Sun library for image manipulation

Current features of DotPlot

  • User-selectable region of interest (ROI) to show the underlying text for one or more matches
  • Input filters for popular programming languages such as Java, C/C++, and PHP
  • Basic input filters for natural language texts (lines, sentences)
  • Switching between different imaging components: JAI, SWT, and Information Mural
  • Optional use of a lookup table to colorize a dotplot
  • PDF export for printing
  • Export of the internal representation of a dotplot for image conversion on a faster computer
  • Creating dotplots over a grid based on a TCP/IP network
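How DotPlot divides work among grid clients is not described here. One plausible approach, sketched below purely as an assumption, is to cut the plot into square tiles and hand them out to clients in turn. The class `GridTilesSketch`, the `Tile` type, and the round-robin assignment are hypothetical and do not reflect the plug-in's actual network protocol.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; tiling and assignment strategy are assumptions.
class GridTilesSketch {

    // A rectangular part of the plot: rows [rowStart, rowEnd), columns [colStart, colEnd).
    static final class Tile {
        final int rowStart, rowEnd, colStart, colEnd;

        Tile(int rowStart, int rowEnd, int colStart, int colEnd) {
            this.rowStart = rowStart;
            this.rowEnd = rowEnd;
            this.colStart = colStart;
            this.colEnd = colEnd;
        }
    }

    // Cut a tokens x tokens plot into tiles of at most tileSize x tileSize.
    static List<Tile> split(int tokens, int tileSize) {
        List<Tile> tiles = new ArrayList<>();
        for (int r = 0; r < tokens; r += tileSize) {
            for (int c = 0; c < tokens; c += tileSize) {
                tiles.add(new Tile(r, Math.min(r + tileSize, tokens),
                                   c, Math.min(c + tileSize, tokens)));
            }
        }
        return tiles;
    }

    // Distribute tiles to clients in round-robin order.
    static List<List<Tile>> assign(List<Tile> tiles, int clients) {
        List<List<Tile>> plan = new ArrayList<>();
        for (int i = 0; i < clients; i++) {
            plan.add(new ArrayList<>());
        }
        for (int i = 0; i < tiles.size(); i++) {
            plan.get(i % clients).add(tiles.get(i));
        }
        return plan;
    }

    public static void main(String[] args) {
        List<List<Tile>> plan = assign(split(10, 4), 2);
        for (int i = 0; i < plan.size(); i++) {
            System.out.println("client " + i + ": " + plan.get(i).size() + " tiles");
        }
    }
}
```

Each client would then compute the matches only for its own tiles, and the partial plots would be merged afterwards. A scheduler that weighs tiles by the clients' available resources would replace the simple round-robin step.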

Roadmap for further development

  1. Conversion of the Eclipse plug-in into a "Rich Client Platform" (stand-alone Java application) as a framework for upcoming Master's projects
  2. Tooltip showing the corresponding word sequence of a match under the mouse cursor
  3. Highlighting of user-defined tokens
  4. Information Mural as preview and navigation component for very large plots
  5. Grid optimization to improve stability and distribution of partial plots according to available resources
  6. Progress indicator for plotting over the grid with notification by e-mail after completion
  7. Creating plots over a web interface
  8. Better detection of sentences in natural language texts
  9. Semantic reduction of tokens in natural languages to the stem of a word
  10. Detection of commands in source code to provide "command filtering"
  11. Implementation of sequence alignment algorithms as used in bioinformatics (quite a number of Java projects hosted at sf.net work on this subject)
  12. Automatic pattern detection in dotplots
  13. Push-button analysis of dotplots to indicate the degree of similarity or plagiarism
  14. Input filters for audio and video sequences to detect, for example, cases of music plagiarism

Project homepage with download of the current version 2.0: http://www.dotplot.org/