Wednesday, October 14, 2009
Today on reddit, someone asked about freely available machine learning APIs. Before the list gets buried, I'm duplicating the contents of that thread here for future exploration:
Weka - Java based ML API
http://www.cs.waikato.ac.nz/ml/weka/
TADM - Toolkit for Advanced Discriminative Modeling
http://tadm.sf.net/
Mallet - Java based ML API
http://mallet.cs.umass.edu/
WekaUT - An extension of Weka that adds clustering
http://www.cs.utexas.edu/users/ml/risc/code/
LibSVM - SVM API
http://www.csie.ntu.edu.tw/~cjlin/libsvm/
SVMlight - An SVM implementation in C
http://svmlight.joachims.org/
arac - A C++ API for neural networks
http://github.com/bayerj/arac
Torch5 - A Matlab-like ML environment
http://torch5.sourceforge.net/
R - Open source statistical package that can be used for ML
http://cran.r-project.org/web/views/MachineLearning.html
PyML - Python API for ML
http://pyml.sourceforge.net/
RapidMiner - Open Source Data Mining Tool
http://rapid-i.com/wiki/index.php?title=Main_Page
Orange - An Open Source Data Mining Tool (Python and GUI based)
http://www.ailab.si/orange/
RL-Glue - Open source API for reinforcement learning (can be used with multiple languages simultaneously)
http://glue.rl-community.org/wiki/Main_Page
Vowpal Wabbit - Learning API from Yahoo! Research
http://hunch.net/~vw/
Tuesday, October 13, 2009
FC9 64-bit and VMware Workstation 6.5.3 issue with VMware Tools
I was working with a VMware appliance that was configured with Fedora Core 9 64-bit as the guest OS.
I was using this guest OS with the latest version of VMware Workstation, v6.5.3. I downloaded v6.5.3 today and upgraded the VMware Tools of the FC9 64-bit guest.
The install completed but left something broken: after installing the VMware Tools that came with VMware Workstation 6.5.3, yum and the 'software updater' were unable to upgrade, remove, or download RPMs.
Saturday, October 10, 2009
Unable to find an open source C or C++ NLG tool this week
So, I've been exploring the area of NLG, natural language generation. My personal goal was to develop an application that would read a corpus and respond with either a summary of the corpus, or a response to the categories found. In either case, I wanted the summary or response to be more than a template where the nouns, verbs, adjectives, and predicates are merely filled in. That's no better than using grep.
As of this week, I can only find APIs written in Java, Python, Lisp, and Prolog. Many of the listed NLG APIs or applications haven't been touched in years, or are no longer available. Much to my displeasure, there is nothing in C or C++. I want something that runs lean and mean, and can scale to datasets over a terabyte in size.
Tuesday, October 6, 2009
Natural Language Generation
While going over some naive Bayes classifiers (NBCs), I stumbled across the AI area of NLP. And while observing where NBCs and NLP meet, I found a new obsession: NLG, which stands for natural language generation. Natural language generation is text created by a computer program that appears human-like in readability.
I first heard about this topic in detail in my AI class in grad school at the University of San Francisco. My professor, Dr. Brooks, had mentioned that researchers had been trying for years to create programs that could generate narratives for computer games. I even recall seeing on some news aggregator that someone had successfully won a writing contest with a story written by an NLG system.
At the time I was taking my AI course, I was working for a horrible boss who made everyone he saw and interacted with every day write weekly reports. I remember wanting to write a Perl or Python script that would do this for me. I made some attempts, but it was hard to get any realistic variance; the result was essentially an overglorified Mad Lib, where the program only filled in the blanks. I was looking for something more natural and human-like.
In NLG, one takes data and applies generation rules to produce text that feels as if a human wrote it. Surprisingly, a search reveals that NLG is a relatively new area of research. Perhaps the best introduction to the topic is on Wikipedia, which will lead you to the Bateman and Zock list of Natural Language Generators (http://www.fb10.uni-bremen.de/anglistik/langpro/NLG-table/NLG-table-root.htm). The difference between filling in blanks and generating from rules is sketched below.
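As a toy illustration of that contrast, here is a minimal Perl sketch of the weekly-report problem mentioned above. Everything in it (the facts, the phrasings, the subroutine names) is invented for the example; real NLG systems use far richer grammars and planning.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Approach 1: pure template filling -- the "overglorified Mad Lib".
    # Every report comes out with identical structure.
    sub report_madlib {
        my (%facts) = @_;
        return "This week I worked on $facts{task} and spent "
             . "$facts{hours} hours $facts{activity}.";
    }

    # Approach 2: a small step toward NLG -- generation rules that choose
    # among alternative phrasings, so repeated runs vary in surface form.
    my @openers = ('This week', 'Over the past few days', 'Since my last report');
    my @linkers = ('In addition, I', 'Beyond that, I', 'I also');

    sub report_rules {
        my (%facts) = @_;
        my $opener = $openers[ int rand @openers ];
        my $linker = $linkers[ int rand @linkers ];
        return "$opener I focused on $facts{task}. "
             . "$linker logged $facts{hours} hours $facts{activity}.";
    }

    my %facts = (task => 'the parser', hours => 12, activity => 'debugging');
    print report_madlib(%facts), "\n";
    print report_rules(%facts), "\n";

Even this tiny rule-based version only varies the surface wording; the point of real NLG is to decide what to say as well as how to say it.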
At the moment, the state of the art appears to be written in Java and Lisp. Since I work in embedded systems, where speed and a small footprint are key, I'm interested in implementations that are in C and can scale. I've noticed that most of the NLP and NLG systems I found do not have a database backend. This surprises me, since a database would allow for scaling and more consistent performance as the dataset grows.
I think I'll be experimenting with NLG to see if I can make a program that will generate an email that asks a user for info based upon an email inquiry.
Friday, September 25, 2009
Grammars
The purpose of this entry is to describe the four types of grammars that can be used to classify a language, and the means used to classify a language as one of the four types.
From a linguistics and NLP standpoint, languages can be classified by four possible grammar types. From the most expressive description to the least expressive description, a language can be described by a grammar known as type 0, type 1, type 2, or type 3. Each of the four types has a common name, and the types form a hierarchy: a type 0 grammar can describe anything a type 1, type 2, or type 3 grammar can describe; a type 1 grammar covers type 2 and type 3; a type 2 grammar covers type 3; and a type 3 grammar cannot describe any other type.
Each of the four types of grammars is composed of rules known as productions, which have the general form w1 -> w2. A production rewrites its left-hand side (lhs), w1, into its right-hand side (rhs), w2, where each side is a sequence of terminal tokens and non-terminals. A non-terminal is a symbol that some production can still rewrite; a terminal cannot be rewritten.
A recursively enumerable grammar is also known as a type 0 grammar. A type 0 grammar has no restrictions on its production rules. A context-sensitive grammar is a type 1 grammar. A type 1 grammar is restricted to productions where the number of symbols on the rhs is equal to or greater than the number of symbols on the lhs. A context-free grammar is a type 2 grammar. A non-terminal in a type 2 grammar can always be replaced by the rhs of one of its productions, regardless of the surrounding symbols. In comparison, a non-terminal in a type 1 grammar can only be replaced when the surrounding symbols match the context given on the production's lhs. A regular grammar is a type 3 grammar. A regular grammar describes the same class of languages as the regular expressions used by Perl, Python, and grep when searching strings. A production of a regular grammar has a restricted form: the lhs is a single non-terminal, and the rhs is a terminal optionally followed by a non-terminal. A worked example follows the table below.
Grammars are also more formally known as phrase structure grammars.
G = phrase structure grammar as a set
G = (V,T,S,P)
V is the vocabulary, a set of terminal tokens and non-terminal symbols
T is a subset of V. T is the set of terminal tokens
S is a start symbol/token that is a member of V
P is a set of production rules
N is V - T, set of non-terminal symbols/tokens
Types of grammars and the restrictions on their productions, w1 -> w2:
Type 0: no restrictions
Type 1: length(w1) <= length(w2), or w2 = lambda
Type 2: w1 = A, where A is a non-terminal symbol
Type 3: w1 = A and w2 = aB or w2 = a, where A and B are elements of N and a is an element of T; or S -> lambda
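To make the type 2 versus type 3 distinction concrete, here is a small Perl sketch. The grammars and test strings are invented for this example: a*b+ comes from a regular (type 3) grammar and is matched directly by a regular expression, while a^n b^n needs a context-free (type 2) grammar, so the checker recurses on its productions instead.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Type 3 (regular) grammar: S -> aS, S -> bA, S -> b, A -> bA, A -> b.
    # It generates the language a*b+, which a regular expression matches
    # directly, as the restricted production form guarantees.
    sub matches_regular { return $_[0] =~ /^a*b+$/ }

    # Type 2 (context-free) grammar: S -> aSb, S -> ab generates a^n b^n,
    # which requires matched counts, so we recurse on the productions.
    sub matches_anbn {
        my ($s) = @_;
        return 1 if $s eq 'ab';                  # production S -> ab
        return matches_anbn(substr $s, 1, -1)    # production S -> aSb
            if length($s) > 2 && $s =~ /^a.*b$/;
        return 0;
    }

    for my $w (qw(ab aabb abb aab)) {
        printf "%-5s a*b+:%d  a^nb^n:%d\n",
            $w, matches_regular($w) ? 1 : 0, matches_anbn($w) ? 1 : 0;
    }

Here abb is in a*b+ but not in a^n b^n, while aabb is the reverse, which is exactly the extra power a type 2 grammar buys over a type 3 grammar.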
Sunday, September 20, 2009
Tracing the boot up sequence of TAEB part 2
This entry goes into detail on the top of the main loop of the taeb script, lines 81-98.
taeb, lines 81-98, form the top level of the taeb main loop. The main loop is composed of two main parts. The first item in the main loop is the actual operational statement. The latter item, which is actually multiple statements, is only executed if taeb has been told, via the command-line option --loop, to re-execute itself after it has completed playing a nethack session.
Let's review these two portions by first going over the latter item, since it is only executed when specified via the command line.
taeb, line 88: only reinitialize/restart taeb if the local variable $loop is not equal to 0. This is the case when --loop is specified on the command line.
taeb, line 90: reset all taeb variables and state.
taeb, lines 92-96: sleep for 5 seconds before starting a new nethack game.
The eval statement, which is the first statement encountered at the start of this loop, does all the work.
taeb, line 83: sets the INT handler to print a message to the screen and prevent starting a new game by setting $loop to 0.
taeb, line 84: execute a taeb session.
taeb, line 85: report the results of the taeb session. Putting these pieces together, the loop shape is sketched below.
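The following is a schematic of lines 81-98 reconstructed from the notes above, not the verbatim source; reset_state is a stand-in name for whatever taeb actually calls at line 90.

    my $loop = 0;       # set earlier during option parsing (--loop)
    while (1) {
        eval {
            local $SIG{INT} = sub { $loop = 0; die "interrupted\n" };  # line 83
            TAEB->play;       # line 84: play one nethack session
            # line 85: report the results of the session
        };
        last unless $loop;    # line 88: restart only if --loop was given
        TAEB->reset_state;    # line 90: reset all variables and state
        sleep 5;              # lines 92-96: wait before the next game
    }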
taeb, line 84, is TAEB->play; the method play is sent to the TAEB class, which is defined in TAEB.pm.
TAEB.pm, lines 742-749: inside this loop, each iteration is a step. At the end of each step, the results of the step are stored. The last step of the taeb session prints its results to the screen.
Inside each step, the following methods are called in order:
redraw
display_topline
human_input
full_input
handle_XXXX
redraw is invoked from TAEB::Display::Curses. redraw is used to repaint the entire nethack screen at the start of a step. display_topline is also invoked from TAEB::Display::Curses. display_topline displays, in the current step, any messages received from the last step. human_input is defined inside TAEB.pm. human_input is used to get keyboard input if the ai allows human control and a key is pressed. Under normal operational circumstances, human_input will not capture anything.
full_input is a wrapper method that captures nethack screen data and loads it into the taeb database. At the start of full_input's operation, the screen scraper is reset to take new info and the publisher is turned off. The publisher is resumed after scraping and processing of the scrape are finished. While the publisher is off, the next instruction performed is process_input, which reads any input from a previous action or user and sends it to the vt, the virtual terminal. Afterwards, the screen is scraped and loaded into the taeb database.
Next, the dungeon and senses variables are updated. After these updates, the publisher is re-enabled. This completes the phase of a step where the percepts are captured. A flat sketch of one step appears below.
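For orientation, here is one step written out flat. The method names come from the list above; the receivers and the final dispatch are assumptions, since the real code dispatches to whichever handle_XXXX method matches the current state.

    # One step of TAEB->play, in the order listed above (a sketch; the
    # receivers and the final dispatch are assumed, not from TAEB.pm).
    sub one_step {
        TAEB->display->redraw;           # repaint the whole nethack screen
        TAEB->display->display_topline;  # show messages from the previous step
        TAEB->human_input;               # keyboard input, if the ai allows it
        TAEB->full_input;                # scrape the screen, update dungeon/senses
        TAEB->handle_current_state;      # stand-in for the handle_XXXX dispatch
    }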
Saturday, September 19, 2009
Tracing the boot up sequence of TAEB part 1
The purpose of this entry is to describe in plain talk how taeb starts up and executes its 'main' loop. Understanding the 'main' loop of the taeb framework is what enables the creation of taeb agents.
taeb, line 1: enables the text file to be interpreted as a Perl script
taeb, line 2: no questionable Perl constructs allowed
taeb, line 3: adds the directory 'lib' to @INC, i.e. use lib 'lib'
taeb, line 4: use the Perl library Getopt::Long to process command-line options
taeb, lines 7-20: definition of the print_usage subroutine, which shows the command-line options that can be used to configure taeb's operation
taeb, line 22: local variable that controls whether taeb simply stops when it receives an interrupt, or resets and begins a new execution. See the while-loop at line 81.
taeb, line 23: local list that stores the command-line-specified options for taeb. This list is then transferred to taeb's configuration. See lines 24, 28, 43.
taeb, line 24: local hash that is initialized to store hard references to the variable $loop and the list @config_overrides.
taeb, lines 25-39: code used to alter the operation of taeb via command-line options, and to display the options that can be specified.
taeb, line 41: specify that TAEB.pm must be used and found.
taeb, line 43: change the taeb configuration from the specified command-line options.
taeb, lines 45-47: add to the taeb configuration that no ai should be used, if the corresponding command-line option is specified.
taeb, lines 49-77: handlers assigned to the TSTP, CONT, TERM, USR1, and USR2 signals.
taeb, line 79: flush after every write.
taeb, lines 81-98: top of the main loop of taeb (traced in part 2). The preamble described above is sketched below.
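Reading these notes back as code, the top of the script plausibly looks like the following. The line numbers refer to the walkthrough; the option spellings, the usage text, and the choice of pragma on line 2 are assumptions, not the verbatim taeb source.

    #!/usr/bin/perl                       # line 1: interpret this file as Perl
    use warnings;                         # line 2: flag questionable constructs
    use lib 'lib';                        # line 3: add 'lib' to @INC
    use Getopt::Long;                     # line 4: command-line option parsing

    sub print_usage {                     # lines 7-20: show supported options
        print "usage: taeb [--loop] [--config key=value] ...\n";
        exit;
    }

    my $loop = 0;                         # line 22: restart after each game?
    my @config_overrides;                 # line 23: command-line config overrides
    my %options = (                       # line 24: option spec (names assumed)
        'loop'     => \$loop,
        'config=s' => \@config_overrides,
    );
    GetOptions(%options) or print_usage();  # lines 25-39
    require TAEB;                         # line 41: TAEB.pm must be found
    # line 43: apply @config_overrides to TAEB's configuration
    # lines 45-47: disable the ai if the relevant option was given
    # lines 49-77: install TSTP, CONT, TERM, USR1, USR2 signal handlers
    $| = 1;                               # line 79: flush after every write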