Software Development

Project 3: Hashed Bible File System

PROJECT OVERVIEW

The biggest bottleneck in the current implementation of the Bible system is the verse lookup. It involves simple sequential scans of the files. Your goal is to develop a hash system to allow the direct access of a verse without a long search. The first part of this will involve designing a hash system, including hash function, a hash index, and and an overflow strategy. Part one will involve experimentation and simulation to decide the best Hash design. Part 2 will be the implementation of the hashing system for all Bible lookups. Finally, you will design and develop a user interface optimized for realtime verse lookup as the user types a search string.

LEARNING GOALS

This project has the following learning goals:

1. Hash algorithm design, testing, and construction

2. Collision Resolution techniques.

3. Creation and use of an Index File

3. User Interface Design

System Requirements

The final system will consist of a hashed file system preloaded with the complete list of every verse in the Bible. A bit of research reveals that there are 31,103 verses in the Bible! Therefore, you will need a hash file with more then this many available positions.

The project will be divided into 4 steps, summarized below:

1. Design the structure of the hash system, including bucket size, number of buckets, packing density.

2. Create an adequate hash transformation function. The key will be the references to the books in the bible. Thus the RAW form of the key is a triplet of shorts (Book, Chapter, Verse)

3. Build a hash index for the Bible text.

4. Integrate the new method into the Bible lookup, so ALL Bible searches use the new hash mechanism.

The first two steps are experimental in nature - you will run a series of tests, and produce experimental results to justify your design. NO WEB interface is needed for this. Below each step is described in more detail

1. Design the structure of the hash system, including bucket size, number of buckets, packing density, and overflow strategy. You are to design a complete hash system. This is to be completely written up in a design specification. Your design must clearly describe how the hash system will work ( not just the hash transformation), including bucket sizes, file size (and thus a proposed packing density). For the purposes of this project we will base the minimum disk access size on the disk sector size, e.g. 512 bytes. Thus you will need to create a bucket that is as close to, but not exceeding, 512 bytes.

Unless we plan on rewriting the Bible into a random order, we will be creating a Bible index file which maps from the a heshed reference to a location in the Existing Bible files. This allows us to keep the Bible text intact, so it can still be accessed in sequential order. Below is a model of the proprosed system for building the index:

Once this index is build, you can simply hash to the reference in the hash table, and use the offset to seek into the Bible file to the correct verse text.

Note that this example in simplified, it does has a bucket size of 1, and no overflow strategy. Of course your code will need to account for these factors BOTH then building the hash index, AND when looking up verses.

You will also need to design a overflow strategy. I am recommending non-linear probing as the easiest effective method.You are free, however, to try whatever you wish!

Your Design must thus include the following: i. An overview describing the system you propose. ii. A UML object design showing your objects (including methods) and relationships. iii. The data structure of your bucket (how will you know if a record is used or a bucket has overflowed?) iv. The data structure of your records in the bucket. v. The size of yours buckets and number of buckets, with justification. What is your packing density? vi. The complete description of your overflow strategy. If you are using random probing, what is your probing function?

2. Create an adequate hash transformation function This task will be done by developing at least three distinct hashing transformations. Write a simple program to extract ALL references in the bible (represented as a simple triple (book,chapter,verse) ).

First do some research, using the test, library, and internet to find some good candidate hash functions. You may experiment with well know randomizing functions such as MD5. You can use the speadsheet model to help the research. Then use the results to argue for which transformation is best and why. Hand everything on Moodle in a Word document. Treat this assignment as a scientific lab with a formal lab report.

3. Build the hash index table containing the complete set of bible references. This program will parse need to parse through the Bible, and build a Map structure correlating references with verse offsets. Then simply iterate through this map, hashing the reference into the Bible Index, and place the (Ref,Offset) pairs in the Hash Index.

4. Integrate the new method into the Bible lookup, so ALL Bible searches use the hash mechanism. Modify your Bible lookup system to always use this new search method for all searches.

Milestone Due dates

Milestone

Completion date

Points

1. Analysis and selection of hash transformation function. This includes:

a. Hash Analysis to test 3 or more hash functions

b. Typed interpretation of test results

c. Justified selection of one hash function

April 13

100

2. Creation of a complete Hash index to the Bible, simple Ref lookups with command line.

April 15

100

3. Basic Implemented solution with word search

April 15

100

4. Proposed User Interface Design for "live" word search

April 20

100

5. Implementing final solution

April 25

100

-- JimSkon - 2011-03-15

Topic attachments
I Attachment Action Size Date Who Comment
Swfswf Hashing.swf manage 105.8 K 2011-03-17 - 19:49 JimSkon Hashing Index Building Animation
Topic revision: r8 - 2011-04-11 - JimSkon
 
This site is powered by the TWiki collaboration platformCopyright &© by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback