Introspector

Video 1

This is the first of a series of video recordings for explaining the introspector project.

I have decided that a simple personal introduction would be the best way to explain the background and purpose of the project.

The videos will be hosted on wikibooks.org/wiki/Introspector but also posted to YouTube and other portals.

This video should be understandable to my friends and family but also interesting to fellow computer programmers.

The content will be split into modules so that people can skip over material that is not interesting to them.

We will cover many aspects of the introspector project and build a foundation for learning about computing in general.

I will explain what a compiler is, what programming languages are and try to also include the basics of computers for an introduction to computing.

In short, a compiler translates a human written and readable instructions into machine instructions for controlling the computer.

The gcc compiler translates over 5 different languages including C, C++, Java, Fortran, Pascal and ADA into machine languages for windows based Intel machines, Linux based Intel machines, older Macintosh machines running on Motorola processors and basically every hardware on the planet.

Later we will build up from the gnu g++ compiler to gnu tools like the gnu bash shell and Linux OS kernel.

The bash shell will be introduced as a C program that is compiled, and then as a program that you use.

We will go through the basics of the bash programming, as it has its own language, and show how the bash functions are interpreted and executed by a C program.

We will show how bash used the C runtime library of functions to call operating system functions and control the hardware.

We will create basic programs and then show you all steps from writing source code, to compiling, linking, running and debugging the code. I want to be able to give people a feeling for all the steps involved in translation of a computer program into a running system.

One video will show you how to install the gcc compiler on cygwin windows and compile changes to it yourself.

Other videos will show you how to apply patches, how to debug the compiler and what types of patches are available.

Others will show various debugging options and dump files that are available in the compiler.

We will review some of the modules of the compiler, explain the functions provided, who wrote them and the usage of them.

We will build the course up to explaining the need for extracting data from the compiler, and the motivation for the need of the introspector.

Then we will go into RDF and the semantic web as an intelligent representation of the compiler graphs.

Later we will cover the visualization tools like graphviz and the manipulation of the graphs.

Thank you for listening and I am looking forward to your feedback.

Management overview of the introspector

Introspection is basically what is better know in the computer world as reflection. The term "intro" gives us the connotation of inside, looking inside as opposed to reflection that implies a mirror. For me, introspection means looking inside at the structure of program. This means looking at the private and internal data, and giving that data meaning, because we know about it (in that moment). Given all the people involved in the creation of all the parts of a program and the computer system, we cannot achieve full introspection without having access to all their knowledge. This means that there will be limits to the introspection, but we can, given a large enough set of data about the code, crack it.

So, introspection is a means, by which, when applied correctly, can help crack the code of a software and intercept the hidden and encapsulated meaning of the internals of a program.

Introspection is way, a method or process, to break the convention of data hiding and encapsulation by injecting of data collectors into the programs where they have the most information about your software and collecting this information into an intelligible form.

The injection of data collectors involves modification of existing programs, or the modification of the execution path via the debugger.

Given only a binary executable, it is possible to apply this technique, because we can apply introspection to the operating system, shell, and dynamic linker and loader. We can introspect the assembly processing tools and add in information. We can introspect the execution via profiling given knowledge of the processor.

Given no access to the operating system, no access to the tool chain, it will be almost impossible to apply introspection on the machine level. There the process of introspection could be applied to the human operator, to the console and screen shots, were all interaction, all inputs and outputs to and from the system are collected.

The whole idea is that this data collected can be cross referenced to other data collected outside by other systems. The cross referencing is the key to understanding and enriching the data and this is the information gaining and enrichment process.

The full introspection process needs information to work with.

We concentrate only on free/open source software, because it is the only software for which we have enough public information to work on.

Let us start by defining how the GNU Linux operating system is setup.

With a test case we can document the inputs and expected outputs of a function. with assembly and data flow analysis we can understand the types of operations on the data that take place. We can trace where a variable is used and how. With source code we can automate the decoding of the binaries. With change control logs, we can automate the documentation of the sources. With mailing lists, we can understand the discussions about those changes of the sources. Given a source-forge ticketing system, we can also have those artifacts structured in a consistent manner. Now, all those aspects add in information about the program. But when we understand all this information about how the compiler processes the source code, then we can also connect all that together, automatically.

Bash is a shell, it is a program. The shell interacts with the operating system, the core. The shell protects the user from the internals of the core. strace can help intercept calls to the operating system. the calls to the operating system can be seen as a line from the shell to the core and back. the usage of the operating system, its resources such as memory and file system is also vital to understand a program. The operating system usage is hidden from direct inspection. The shell implementation is also hidden.

Full versus Abstract Understanding : All this hidden data is needed for full understanding. Abstract understanding can be however gained by sticking to the interface. We can build an abstract model of the processing. Depending on the depth of the data collected, we can gain more detailed understanding. The internals of the processor are hard to gain access to. Each one is different. The companies try hard to protect the chips design and keep them to themselves. That gives them an advantage.

we can assume that Each tool that is available needs to have a purpose for it to survive. The viability of a tool is something that can be thought about and analysed. Tools that reduce the protection of other tools can be seen as lowering the cost of access. yet these tools also have a cost themselves. we can replace one high set of barriers with another lower set of barriers, yet the barriers are still there.

The problem has to be reduced, and this requires decision making. That involves in the first step the human pattern matching. This can also be codified, but it is still a cost. The people also need to consume resources to survive, and therefore are not objective. I think that one social aspect is that people will protect their interests so that they will avoid doing things that would threaten the stability of their environment. The costs of learning something is something they want to get a return on. If the cost of producing or learning a tool is high, then artificial barriers will be created to make sure that others cannot easily access that. This is one of the key issues in the protection of information. The more resources collected by producer, them more resources they have to protect themselves. There will be a breaking point where a system will collapse into itself and start protecting itself as a survival system.

The very fact that there are so many problems involved in technology means that only ones that are supported will survive. For them to be supported, they need to have a clear model of viability.

The program is really a set of assembly instructions that can also be analysed each instruction is encoded in a string of bytes. we can analyse each byte of information and gain information about it.

In order to understand this information we need however to connect it to the information gained from other systems. This requires communication between many different systems. The communication is difficult, because often there are many different barriers between them.

By analysis of the machine, we are able to see how the path of execution connects them. In order to do that, we need however to have a set of tests that are understood. Understanding of tests could just mean that we can map the test case onto a specification that we also understand. The specification would be split into parts that can be understood by a human. The language of the specification could also be parsed by language tools, and that would provide us with a graph structure that contains identifiers that could be mapped onto things found in the test.

So understanding can be defined by a complete mapping of the graph of data found about the execution of the test to the graph of data extracted from the specification.

we would first define a way to describe layers of graphs that are interconnected. each layer would be extracted from a different tool. Each layer however would be described in a common language. the graph of RFD.

The layer is a program itself, and the total amount of possible graph types is large. That means that a test case would represent a small subset of the total possible amount of graph data. The main point is that the connection and mapping between these graphs it the key.

Programs also do not always work efficiently, so that there might be a huge amount of processing that is not really needed to solve the test case itself. Lets assume that a profile would mark that processing path that is really executed.

given a large test case, we can then look for overlapping parts of smaller tests cases that match the larger one. This would allow us to decompose the large graph into smaller sub-graphs.

Then we can also assert that all this work is being done for a specific purpose. The purpose is understood and it limits the scope of the problem. We need to understand this scope to prevent from wasting resources.

we can iteratively define the system by understanding the work needed to be done by a human in the solving a problem .we need to provide tools for them to work efficiently.

binutils allows us to process a binary program there is a relationship between the source code and the binary, it is created via the compiler. Understanding Source code of a program allows understanding of the program the compiler processes the source code. The compiler is run by the shell. setting breakpoints in the debugger allows inspecting variables when debugging or inspecting a program you can easily be overloaded by information It is possible to use scripts attached to breakpoints to collect information into a file or stream tools can be used to process the collected information to make sense of it we often need to display a graph collected out the dependencies to understand it. the graph structure is natural to the mind to understand RFD is based on graphs.

Original motivation

The author, James Michael DuPont describes his original motivation for the program :

The original idea came from my research into Logic Programming and Program generators. I had attempted to learn Turbo Prolog as a teenager, but never quite got up to speed with it. However, I did like reading the programming manuals, and had done some example programs. The problem for me was that I did not grasp the concepts about its meaning logically, but was more like the child who was putting Nintendo games in the slot. The database assertions were the ones that threw me off, I did not understand that the statements were objects of their own.

Starting with DBASE V with all its graphical interfaces, I became quickly adept at creating databases. Then I did some DBASE III work as well. Back in the late 80s RAD was in. FoxPro, Paradox, and other environments like Alpha4 and the tool Zachary were the key for me.

RAD Interactive Program Generators and Editors use a structured model of data that allows a programmer to quickly solve programs. Meaningful Import and Export functionalities, coupled with remote network file access and locking, were all you needed to create workgroups of people interacting.

These were the original models that define the concepts that I had in mind when the program was designed. That is the model to keep in mind when looking at the introspector project. I used these tools to solve problems for business people around me in my family circle.

This original limited understanding of the implementation of the programs and focus on practical problem solving should be kept in mind.

Now, when we write a document, it is difficult to identify all the streams of data that are flowing into it from the mind of the programmer.

We should however consider it possible that we might become aware of our current situation and want to break out of a monotonic situation. This collapsing of a mental stream into a more compact form and execution of an action of writing is an atomic task. We can see a stream of data as a collapsing house of cards that fall down over time. The writing of symbol to the output to the keyboard is then the final committing of an action. We can then read that symbol back in as input in another stage. Building the house of cards together again requires more than just the cards, and this is where the need for the model is apparent. The model is basically all the data that you need to have to decode the message. For the computer, that is the moment it executes a byte of instruction. So the model for that instruction would be all the data that is needed to be able to trace it back to some external model at some point in the system.

Programs that are generated out of such a model will contain back references to their generating model in the file names, variable names, function names, even in the literal constants, the comments and various other user data sections that are available. Given such a model of the user space, we can assign a meaning to each program declaration. The meaning is defined in terms of the application model.

The purpose of the introspector is not to provide all data for all people at all times, but to provide on-demand structured data about your software program to help you quickly solve problems. The idea is that you have to get information out of a system to which you only have partial access; using the introspection process, you can extract data from the tools that process your programs by creating interceptor objects that hook into running processes and capture any data going to or from a point in the program.

The point of the program can be defined as the instruction pointer of the program. Given a test case that tests for a certain feature we will be able to map a given input file to a model location.

We can see this wiki as a specification. Each version of the wikipage can be seen as a resource. Each word of that version can be used as an index. With this indexing schema we can define all parts of a program with the wikibook.

Example

Let us assume that there is an implementation of a program that is specified by some document. This is executed in a process on some host.

We can define a point in the input data by its offset. This is the reference to the model. Only a register is needed to reference the model that is stored in an array.

Only by choosing and defining a formal viewpoint can you may any sense of the flood of data that is available to you from the compiler.

It's bootstrap idea works by using a version of GCC which uses a modified version of the built in tree dumper to transform print statements about the asts into live data in wikipedia:Resource Description Framework format via the wikipedia:Redland API.

Status

Here is some information about the gcc interface Introspector/GccCpp

Representation of Graphs

Various Experiments have been done and exist with a low level back end representation that is optimized for compiler graphs, experiments with the special wikipedia:data structure "ice cube" for higher speed transfer of the data. The ice cube is a compressed data cube, a set of vectors describing the compiler graphs efficiently.

The data was compiled directly into the program, so that the data was loaded when the program starts. this does not allow for changing of the data. That was in 2003.

Later, in Jan 2005 I have created a more compact graph representation in C and algorithms for using them. This was a good start. It stored the graphs as fixed length records that can be loaded in blocks from the file system.

Goal:

For local usage : The final goal would be able to generate a shared object whenever needed that would generate the most compact and fastest representation of a graph for loading on that machine. we could also use a core dump, given that the graph data is stored in a neutral format.

For transfer : If you want to export a graph, then storing it as RDF/XML would be optimal.

Database

Postgres Interfaces have been built, storing the graphs collected. the database structure of the first version was based on the node types directly. with one table per node type.

Additionally I have used redland to create the database, which creates the tables to represent the graph directly.

The issue of using the database table is that you need to have the following constraints:

LOCALITY: The records that occur together in the compiler should be stored in a way that they can be retrieved together.

Types of Records: The types of records, and the fields that they have need to be chosen carefully. There are very many types of records. For example there are very many types of function declarations.

It does not always make sense to use a union type for the database tables and store all the instances in the same table.

Perl Interfaces

Data Collected

The first introspection sample is collected from gcc wikipedia:abstract semantic graph

Integration

All of these tools are in a disjunct state and are not integrated yet.

Layers

What we want to do is understand the layers of the system.

Introspector/Compiler/Input Stream
Introspector/Compiler/Lexical Tokens
Introspector/Compiler/Parse Tree
Introspector/Compiler/AST
Introspector/Compiler/RTl
Introspector/Compiler/Dwarf
Introspector/Linker/Elf/Section
Introspector/Linker/Elf/Label
- Introspector/RDF
- API to convert printfs into rdf emits
  - Call Stack for each PrintF
  - Analysis of flow of data for each variable in the expression
- Introspector/Graph Layout
Introspector/Annotation
- Introspector/Annotation/DIA
- Introspector/Annotation/Doxygen

Blurb

The Introspector/Project has the Introspector/Project/Goals of building Introspector/MetaData/Manipulation/Tools for WG:Free_Software_and_Open_Source_software and Introspector/SourceCode.

The Introspector is a new Introspector/ToolChain to extract Introspector/MetaData about your SoftwarePrograms from the GnuCompilerCollection and other FreeSoftware LanguageTools such as the PerlLanguage, the CSharpLanguage, the BashLanguage and present it to you for making your job as a programmer easier. Code MetaProgramming will become a snap with ExtensibleStyleSheet like Transformations being possible. Agents that condense software information into graphs and guis are underway.

The software is FreeSoftware in the spirit of the GnuManifesto and is revolutionary in the freedoms that it intends on granting to its users.

This MetaData includes all data collected about your software by the compiler, the make & build system, the savannah/sourceforge project management and debian packaging system, the CVS changes and the mailman mailing list software.

Scope creep

The Introspector's scope was originally just the GCC C compiler, but is now expanded to include the extraction of metadata from different compilers and interpreters, such as Perl, bison, m4, bash, C#, Java, C++, Fortran, objective-c, Lisp and Scheme. Various patches will be officially submitted to allow for the Introspector to extract data in a standard form.

File format

This extracted metadata are stored in ntriple rdf files, a repository categorised by the program run, the input file, the function or class declared, and the time and date of the introspection. These files are standardized and available to tools like eulersharp, cwm, Java, Perl, C, C++, etc. via the Redland RDF API. These rdf files can be accessed concurrently via the redland storage interface.

Sources

The metadata is to be extracted from these free tools:

AbstractSyntaxTrees of the GCC compiler
AbstractSyntaxTrees of the DotGNU/PNET/CSCC and DotGNU/PNET/Ccompiler
PerlInternals interface
Bash Shell will be extended to dump its objects into Rdf as well (based on the BashDB)
BisonParseTree with BisonRules BisonTokens
M4MacroLanguage
Debian-SF sourceforge application repository GForge
PhpLanguage
PythonLanguage
SilverSchemeLanguage

Interfaces

We are building IntrospectorInterfaces to FreeSoftware to allow this metadata to be processed by various applications

GUIEditor Glade
GraphLayoutTool Vcg
DiagramEditor Dia
TextEditor Emacs

Definition Of Introspection

MetaDataManipulationTools

include

DataExtractionTools
DataRepresentation
DataRepository
DataVisualizationTools
CodeGenerationTools

IntrospectorGui

includes these interfaces:

EmacsInterface
BashInterface
DiaInterface

DataVisualizationTools

includes

VcgInterface
AutoDiaInterface
DiaInterface support.

DataRepresentation

supports

N3, RdfFormat, EulerSharp, CWM, RedlandInterface, NtriplesFormat, PerlRdfLaces
XmlFormat
HtmlFormat
TabularText FileFormats

IntrospectorOntology

The introspector ontology work is in a state ready for integration.

I have created newer versions of the ontology describing the gcc compiler nodes.

Here are the aspects of the ontology:

1. Nodes defined in tree.def 2. Data Structures defined in c, tree.h 3. Data structures dumped in tree-dumper.c 4. Data structures access methods in the macros 5. Usage of the tree in the code of the gcc itself.

Dynamic Issues

What aspects of the tree are used to build the compiler itself.

What aspects of the tree are used to define the tree.
What aspects of the tree are used to fill out the tree.

Coverage

What test code is needed to exercise the compiler and execute all paths of code. What paths are needed to fill out all aspects of the tree. What parts of the tree are coming from user data? What parts of the tree are coming from compiler data?

Built In Data

What data is built into the compiler itself? What are the differences between the compiler versions?

Standard Headers

What data is created by standard headers? What versions of the headers are used?

Best Fit Ontology

1. What types of records are created from running the compiler?

We want to understand the types of real data collected from the compiler. We want to see a ontology type of record for each occurring configuration of data in the output.

Ideally we would define the type of data as the full path of execution through the compiler, each instruction that affects the data would be recorded in the type history.

For now, we collect the types of data that occur and create a type of each one. This is what I call the instance type.

2. What are the common fields:

for each type of data, each set of fields there is a subset of the fields that also occur in other records. We calculate the greatest common set of fields and relate them to the instance type.

Warning : In OWL, a superset/subset relationship is defined where the instance type is the supertype of the more general type. Why

3. What fields are list fields that just give order to the data?

4. What fields represent the meaning of the program? 5. What other types of records are referenced by the fields? 6. What other records via which fields reference the object of what type?

See the Protoge Reasoner API

Reasoner Integration

My goal is the usage of fact++ reasoner that is built into the gcc in a single process.

For this to work we need to handling the following issues :

1. When to feed data to reasoner

We can feed the data to the reasoner :

We can for the first step handle the dumping functions of the gcc.
when it is first collected in the compiler.
after the compile has been finished.
When it is triggered by user defined code being executed.
When user defined patterns defined in RDF outside the compile occurs in the code that are matched

We want to be able to process the data during the const folding phase

2. What data to feed to reasoner

We can create a record based on :

the data structures of the compiler itself.
the data dumper functions, base on the record type.
the user defined code, using attributes to mark what data is needed
using a rdf to describe what data is to be collected

3. How to process the results of the reasoner

We need to be able to merge the results of many compile runs together
We want to be able to influence the compile itself. Creating new structures.
We want to be able to define user code that is executed during the compile (constant folding expressions) that are based on

constant data collected from the compiler.