STANDUP SQL Lexical Database

STANDUP SQL Lexical Database

(Last update to software/data : 28 November 2006; this page last edited: 7 June 2007)

 

This page contains information on the STANDUP SQL lexical database. It is a lexicon that has the following features:

  • Integrates semantic, orthographic, and phonetic information from various lexical resources such as WordNet and Unisyn.
  • Maps specific (WordNet-based) wordsenses to AAC/literacy symbols such as Widgit Rebus and Mayer-Johnson PCS.
  • Associates wordsenses with familiarity scores, a measure of how "familiar" a word is. Various lexical resources contribute towards this measure, among others, the MRC psycholinguistic database, the British National Corpus, and SemCor.

It is implemented as an SQL relational database using the PostgreSQL database server.

The STANDUP lexical database was created to support the STANDUP interactive riddle generator system, but could plausibly be used for other general-purpose applications. As such, we have created two instances of the database: the _joke variant and the _lex variant.

Database instances: _joke vs. _lex

We provide 2 instances of the STANDUP lexical database: one that is intended to support joke generation, and one for general-purpose lexical usage.

Joke-generation database

  • Contains a subset of the lexicon, i.e. only wordsenses with a familiarity score > 0. Total lexeme count: 45506.
  • Contains cached schema instantiations for joke generation.
  • Database size: 3.32GB

General-purpose lexical database

  • Contains the full lexicon, i.e. includes wordsenses with familiarity score = 0. Total lexeme count: 130263.
  • Does not contain any joke generation-specific information.
  • Database size: 7.52GB

Database construction kit

Aside from the two instances of databases described above, we also provide a "database construction kit" that enables the creation of a customized version of the STANDUP lexical database. It consists of a collection of SQL scripts and various supplementary data files used by the scripts. Instructions on how to use this kit are detailed below.

Download:

The files related to the lexical database are as follows:

Installing PostgreSQL

  1. Download and extract PostgreSQL to your hard drive somewhere temporarily. Double-click the extracted postgresql-8.1.msi file to begin the installation process.
  2. Leave the selected language as English and click "Start". Click "Next" twice.
  3. You should then see the "Installation options" screen. The default behaviour of PostgreSQL is to install itself under C:\Program Files\PostgreSQL\8.1\ -- if this presents a problem, you can change it here by clicking the 'Browse' button.
  4. Click "Next". At the next screen, you can just leave all the default settings as is. Just make sure that "Install as a service" is checked.
  5. You can enter any password you want here, but if you just leave it blank, one will be randomly generated for you. This is the password for the Windows account that will run the service, not the database superuser account (that comes later).
  6. Click Next. If it asks for confirmation whether to create the account, click Yes.
  7. You should then see the "Initialise database cluster" screen.
    Set locale to "English, United Kingdom".
    Set encoding to "UTF-8".
    Set superuser name to "postgres".
    Set password to "pgsuper!" (without the quotation marks).
    Reconfirm password: "pgsuper!".
  8. Click Next. You should then see the "Enable procedural languages" screen.
  9. Make sure "PL/pgsql" is checked and click Next. You should then see the "Enable contrib modules" screen. Leave things as is and click Next.
  10. Click Next again. This should begin the installation. It might take a few minutes.
  11. Click Finish.

Restoring an existing database

Now that PostgreSQL is installed, we need to load, or in Postgres parlance, restore the standup_v1.4 database.

First, download either standup_v1.4_061127_joke.backup or standup_v1.4_061127_lex.backup to your hard drive somewhere temporarily.

There are 2 ways to restore the database, i.e. by entering the command from a DOS command line interface, or by using pgAdmin III, the PostgreSQL administration GUI tool. They both accomplish the same thing, so it's down to your preference:

The command-line way

  1. Open a DOS command prompt. You can do this by going to the Start menu and choosing "Run...". In the resulting dialog box, type in "cmd" and click OK.
  2. If you haven't changed any settings above, enter this command to create the database:
    "C:\Program Files\PostgreSQL\8.1\bin\createdb.exe" -E UTF8 -U postgres "standup_v1.4"
  3. If successful, it should return with a CREATE DATABASE message. Now, to restore the database, enter this command:
    "C:\Program Files\PostgreSQL\8.1\bin\pg_restore.exe" -i -U postgres -d "standup_v1.4" -v "C:\My Documents\X.backup"
    (Where X is either standup_v1.4_061127_joke or standup_v1.4_061127_lex, and is assumed to be saved to the My Documents folder. If you saved it anywhere else, change the command above accordingly.)
  4. This can take anywhere between thirty minutes and a few hours depending on the configuration of the computer being used (in particular, hard disk speed and amount of RAM). If you spot an error saying 'could not execute query: ERROR: language "plpgsql" already exists', just ignore it -- it's perfectly normal. Once the restore process is complete, it should say something like: WARNING: errors ignored on restore: 1 -- this is simply reporting the aforementioned error.
  5. Close the DOS window by entering the command exit or pressing the 'X' icon in the top right corner.

The GUI way

  1. Launch the pgAdmin III tool: go to the Start menu, choose Programs > PostgreSQL > pgAdmin III
  2. On the left side of the window should be a list of Servers containing 1 entry: "PostgreSQL Database Server 8.1 (localhost:5432). Double-click this entry.
  3. A "Connect to server" dialog box should pop up. Enter the password you entered earlier: "pgsuper!" (without the quotation marks) and click OK.
  4. Some new entries should appear: Databases, Tablespaces, Group Roles, and Login Roles. Right-click on Databases and choose "New Database".
  5. A "New Database" dialog box should pop up. Enter name: "standup_v1.4". Leave everything as is (everything else should be empty except Encoding, which should be "UTF8". Click OK. This will create the "standup_v1.4" database.
  6. Now double-click the "Databases" entry to expand it. You should see the 'standup_v1.4' database there.
  7. Right-click on "standup_v1.4" and choose "Restore". The "Restore Database standup_v1.4" dialog box should pop up. Click the "..." button next to the Filename field, and locate the standup_v1.4_061127_joke.backup or standup_v1.4_061127_lex.backup file you downloaded. Click OK and the database restore process will begin. This can take anywhere between thirty minutes and a few hours depending on the configuration of the computer being used (in particular, hard disk speed and amount of RAM). If you spot an error saying 'could not execute query: ERROR: language "plpgsql" already exists', just ignore it -- it's perfectly normal.
  8. Once the restore process is complete, it should say something like:
    "WARNING: errors ignored on restore: 1
    Process returned exit code 1."
    	
    -- this is simply reporting the aforementioned error.
  9. At this point, do NOT click the "OK" button! This will cause PostgreSQL to try and restore the database again, and this will only serve to confuse it! Click the "Cancel" button instead.
  10. Exit the pgAdmin III application by choosing File > Exit.

Using the database construction kit

  • Unzip standup_dbkit_v1.4.zip somewhere to your hard drive.
  • Create the database, e.g. by running the following command:

    "C:\Program Files\PostgreSQL\8.1\bin\createdb.exe" -E UTF8 -U postgres "standup_v1.4"

    If successful, it should return with a CREATE DATABASE message.
  • Obtain a psqlterminal to the newly created database, e.g.:

    "C:\Program Files\PostgreSQL\8.1\bin\psql.exe" -h localhost -p 5432 standup_v1.4 "postgres"

    (Make sure the database construction kit directory, e.g. /dbbuildscript, is the current directory.)
  • Execute the first stage of the database construction process by entering the following:

    \i batchscript1.sql

    Upon completion, compute the familiarity scores(FAM-scores) for the lexemes in the database:
    1. Obtain disambiguated custom lexicons needed for computing FAM-scores using the wordset disambiguation tool:

      java -cp standup_dbtools_v1.4.jar;postgresql-8.1-407.jdbc3.jar standup.authoring.wordsets.WordSetTool

      Alternatively, use the ones found in dbbuildscript/data/wordsets_20061112.zip. These are the lexicons used for FAM-score values found in the STANDUP databases above.
    2. Run the FAM-score Calculator and configure the various score sources:

      java -cp standup_dbtools_v1.4.jar;postgresql-8.1-407.jdbc3.jar standup.authoring.familiarityscoring.FScoreCalculator

      As a reference guide, the STANDUP lexicon uses the following values for the score source priorities and ranges:

      1. MRC psycholinguistic DB: age of acquisition [0,1.0]
      2. MRC psycholinguistic DB: CFI [0,1.0]
      3. Spelling list derived sets (1 to 6) [0.4, 1.0]
      4. The set of lexemes which have pictorial symbols [0.6]
      5. Frequency scores for compound nouns from the BNC [0.4, 0.9]
      6. SemCor frequency scores [0, 0.5]

      When you have configured your familiarity score sources, click the 'Process' button to fill the fscore column in the lexicon table. This will take roughly half an hour.

  • Execute the second stage of the database construction process by entering the following:

    \i batchscript2.sql

    Upon completion, compute the following resources:
    • Wordform roots: the morphological roots of the wordforms found in the STANDUP lexicon can be computed from the enhortho field of the Unisynlexicon. To create this resource, run the following:

      java -cp standup_dbtools_v1.4.jar;postgresql-8.1-407.jdbc3.jar standup.authoring.dbbuild.RootFinder

    • Pseudolexeme orthographic remainders: the STANDUP lexicon contains information of pairs of lexemes that have phonetically similar prefixes and suffixes, e.g. "spook" and "spectacles" may be paired to create a joke using the neologism "spook-tacles". To support this, it needs to compute the orthographic remainders of the pairing, e.g. "-tacles". To create this resource, run the following:

      java -cp standup_dbtools_v1.4.jar;postgresql-8.1-407.jdbc3.jar standup.authoring.dbbuild.OrthoSplitter

      and

      java -cp standup_dbtools_v1.4.jar;postgresql-8.1-407.jdbc3.jar standup.authoring.dbbuild.OrthoSplitterRear

    • Serialized hashtables and indices: if you plan on using the STANDUP lexical database APIwith the new database, various Java serialized files must be created -- these files greatly speed up the performance of the API. Run the following:
      java -Xms384M -Xmx384M -cp standup_dbtools_v1.4.jar;postgresql-8.1-407.jdbc3.jar standup.authoring.dbbuild.SerializerAll
      java -Xms384M -Xmx384M -cp standup_dbtools_v1.4.jar;postgresql-8.1-407.jdbc3.jar standup.authoring.dbbuild.SerializerIndices
      Upon completion, import the following files into the lexical database API .jar under standup/resources/serialized:
      • c:/compiledlexemes.dat (rename jokeonly_compiledlexemes.dat if building 'joke' DB)
      • c:/compiledwordforms.dat (rename jokeonly_compiledwordforms.dat if building 'joke' DB)
      • c:/compiledconcepts.dat (rename jokeonly_compiledconcepts.dat if building 'joke' DB)
      • c:/widgitcodestofiles.dat
      • c:/spellingtowfid.dat
      • c:/wfidtolxid.dat
  • Execute the third stage of the database construction process. If you are building a general-purpose lexical database, enter the following:

    \i batchscript3_lex.sql

    Upon completion, run the following:

    java -Xms384m -Xmx384m -cp standup_dbtools_v1.4.jar;postgresql-8.1-407.jdbc3.jar standup.authoring.dbbuild.CustomLexiconsAndTopicBuilder

    Import the following files into the lexical database API .jar under standup/resources/xml:
    • c:/topicdb_fc1.topic
    • c:/topicdb_fc2.topic
    • c:/topicdb_fc3.topic
    • c:/topicdb_fc4.topic
    • c:/topicdb_fc5.topic
    • c:/customlex_fc1.lexicon
    • c:/customlex_fc2.lexicon
    • c:/customlex_fc3.lexicon
    • c:/customlex_fc4.lexicon
    • c:/customlex_fc5.lexicon

    ...and that's it!

    If, however, you are building a database to support joke-generation, enter the following:

    \i batchscript3_joke.sql.

    Upon completion, compute the clause instantiation-filtered schema instantiations as follows:

    java -Xms384M -Xmx384M -cp standup_dbtools_v1.4.jar;postgresql-8.1-407.jdbc3.jar standup.authoring.dbbuild.SchemaFilterer

    This creates 11 C:/step17_schemafilter_(SCHEMANAME).sql files. Upon completion place these files in the dbbuildscript folder and move on to the last stage of the database construction process.

  • If you are building a database to support joke-generation, execute the fourth and last stage of the database construction process:

    \i batchscript4_joke.sql

    Upon completion, run the following: Upon completion, run the following:

    java -Xms384m -Xmx384m -cp standup_dbtools_v1.4.jar;postgresql-8.1-407.jdbc3.jar standup.authoring.dbbuild.CustomLexiconsAndTopicBuilder

    Import the following files into the lexical database API .jar under standup/resources/xml:
    • c:/topicdb_fc1.topic (rename to jokeonly_topicdb_fc1.topic)
    • c:/topicdb_fc2.topic (rename to jokeonly_topicdb_fc2.topic)
    • c:/topicdb_fc3.topic (rename to jokeonly_topicdb_fc3.topic)
    • c:/topicdb_fc4.topic (rename to jokeonly_topicdb_fc4.topic)
    • c:/topicdb_fc5.topic (rename to jokeonly_topicdb_fc5.topic)
    • c:/customlex_fc1.lexicon (rename to jokeonly_customlex_fc1.topic)
    • c:/customlex_fc2.lexicon (rename to jokeonly_customlex_fc2.topic)
    • c:/customlex_fc3.lexicon (rename to jokeonly_customlex_fc3.topic)
    • c:/customlex_fc4.lexicon (rename to jokeonly_customlex_fc4.topic)
    • c:/customlex_fc5.lexicon (rename to jokeonly_customlex_fc5.topic)

    ...and that's it!