Revision History
1. Introduction
1.1 Purpose
The system provides a means by which a local group of users may download and share content from the web. Users can view content they have downloaded without being connected to the network, and because users can share content, the same content does not need to be downloaded more than once. The goal is to cut down on network traffic while allowing users to work offline.
1.2 System Overview
The system will consist of a Mozilla Firefox plug-in on the client side and a series of components on the server side. The user will use the plug-in to authenticate with the server and then use the plug-in to select content to be downloaded. Content is downloaded on the server and placed within the user's file space. The user will access the server via a web interface. Through this interface the user will manage, share, and archive content.
2. Design Considerations
2.1 Assumptions and Dependencies
AS-1: Users will be able to access the server using Mozilla Firefox and install the necessary plug-in.
AS-2: Users will be able to open archived web pages.
DE-1: Server must be able to deploy a crawler to navigate and download requested web pages.
2.2 Constraints
No significant constraints yet identified.
2.3 Operating Environment
OE-1: My CAP Project shall operate only with Mozilla Firefox due to the need for the client plug-in. I do not yet know which versions of Firefox will be supported but I will do my best to work with past versions.
OE-2: My CAP Project will operate on a server running Ubuntu Server.
OE-3: My CAP Project shall be operating in the MVNU campus network. It is important that I keep the amount of traffic to and from my server to a minimum or I risk disrupting the campus network.
2.4 Design Methodology
The design process is to be iterative. I have created a general specification of the system, but many of the final features and components will be decided through design and development. Once an initial prototype has been pushed out, then its design will continue to be refined over subsequent versions until an acceptable, final system has been produced.
The design is also to be highly modular. Each component of the system must be designed in such a way that its internals can be replaced without a change in the interface with the rest of the system. This will enable easier changes in the future and the system will be more understandable in its design.
2.5 Risks and Volatile Areas
RI-1: The project cannot be completed in time. Not likely.
RI-2: Any single component of the system grows in complexity and cannot be finished. Distinct possibility.
VA-1: The web interface is a volatile area because I have little experience in this area and many features will be conveyed by this component. I have decided that the majority of the features are doable—with enough time.
VA-2: The content archiver is a volatile area because I will need to take the source code for Mozilla's Archive extension and use it for my own purposes. Hopefully this will not turn into a very complex task.
VA-3: Security is a volatile area because it is a high priority but it is also loosely defined.
3. Architecture
3.1 Overview
The server will act as a hub for all activity by the user. The user will view and manage content via the web interface and issue commands to download content via the client plug-in. The master program will coordinate the operation of the entire server. It will keep the database and file space synchronized and act as a gatekeeper between these and the other components in the system. The downloader is the only component with direct access to the Internet simply because it is the only component which has any reason to be operating outside the server. Obviously the web server is an exception to this but for simplicity's sake I only represented the relationship between it and the user.
4. Database Schema
The database will represent the one-to-many relationship between users and content. Each content item and folder has its own unique ID and is associated with only one user. If two users have downloaded the same content, the two copies are still considered different content items. In addition to the tables which represent these relationships, there will be tables for keeping various records of activity.
Database Schema
TODO
Tables and Fields
Table: user
Each user is given a unique, sequential ID number. The ID is assigned by using MySQL's AUTO_INCREMENT attribute. This table houses any information about the user which is not directly related to any content items or folders. In addition to personal information, it is possible that usage statistics could be stored here.
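The one-to-many relationship described in this section could be sketched roughly as follows. This is only an illustration under stated assumptions: Python's built-in sqlite3 module stands in for MySQL so that the example is self-contained, and every table and column name other than the user table itself is hypothetical.

```python
import sqlite3

# Hypothetical schema. SQLite stands in for MySQL here; in MySQL the
# id columns would use AUTO_INCREMENT instead of AUTOINCREMENT.
SCHEMA = """
CREATE TABLE user (
    id       INTEGER PRIMARY KEY AUTOINCREMENT,
    username TEXT NOT NULL UNIQUE
);
CREATE TABLE folder (
    id      INTEGER PRIMARY KEY AUTOINCREMENT,
    user_id INTEGER NOT NULL REFERENCES user(id),
    name    TEXT NOT NULL
);
CREATE TABLE content (
    id        INTEGER PRIMARY KEY AUTOINCREMENT,
    user_id   INTEGER NOT NULL REFERENCES user(id),
    folder_id INTEGER REFERENCES folder(id),
    url       TEXT NOT NULL
);
"""

def open_demo_db():
    """Create an in-memory database with the sketch schema."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    return db

def add_user(db, username):
    """Insert a user and return the sequential ID assigned to it."""
    cur = db.execute("INSERT INTO user (username) VALUES (?)", (username,))
    return cur.lastrowid

def add_content(db, user_id, url):
    """Insert a content item owned by exactly one user; return its ID."""
    cur = db.execute(
        "INSERT INTO content (user_id, url) VALUES (?, ?)", (user_id, url)
    )
    return cur.lastrowid
```

Because each content row carries a single user_id, two users who download the same URL end up with two distinct content IDs, which matches the rule stated above.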
5. Detailed System Design
5.1 Common Design
Wrappers
The archiver and downloader will not be stand-alone components, but will be accessed through wrapper programs. The purpose of these wrappers is to reduce coupling so that if an individual component is modified in the future, the interface will not change and no major recoding needs to be done in other components.
Communication via Pipes
When the system is started, pipes will be created between each component and the master program. There will be two pipes between each component: one for messages from the secondary component to the master program, and another for messages from the master program to the secondary component. The messages sent through the pipes vary depending on the components involved. See individual components for how they handle messages.
Log Files
capmaster.log
Anything and everything from the Master Program gets logged here. The format is "[MM/DD/YYYY HH:MM] -<entry priority>- <message>". The entry priority is a single character indicating the type of entry made:
FATAL ERROR,
ERROR,
WARNING,
INFORMATION,
UNKNOWN.
capweb.log
Anything and everything from Python CGI scripts gets logged here. Same format as above.
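As a rough sketch of the log entry format described above, the following helper builds one line. The one-character priority codes are assumptions made for illustration, since the actual characters are not listed in this document; the function name is also hypothetical.

```python
from datetime import datetime

# Hypothetical one-character codes; the real characters are not
# specified in this document.
PRIORITY_CODES = {
    "FATAL ERROR": "F",
    "ERROR": "E",
    "WARNING": "W",
    "INFORMATION": "I",
    "UNKNOWN": "U",
}

def format_log_entry(priority, message, when=None):
    """Return a line in the form [MM/DD/YYYY HH:MM] -<priority>- <message>."""
    if when is None:
        when = datetime.now()
    # Unrecognized priority names fall back to the UNKNOWN code.
    code = PRIORITY_CODES.get(priority, PRIORITY_CODES["UNKNOWN"])
    return "[{}] -{}- {}".format(when.strftime("%m/%d/%Y %H:%M"), code, message)
```

The same helper could serve both capmaster.log and capweb.log, since the two files share one format.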
Pipes
man2master.fifo, master2man.fifo
Used for communications between CAPManage and Master Program.
web2master.fifo, master2web.fifo
Used for communications between web server and Master Program.
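A minimal sketch of how the pair of named pipes for one component might be created at startup is shown below. The function name, the directory argument, and the 0600 permission mode are assumptions; in practice the mode would have to allow both the master program and the other component to open the pipes.

```python
import os
import stat

def create_fifo_pair(directory, component):
    """Create <component>2master.fifo and master2<component>.fifo if missing.

    Returns the two paths as (to_master, from_master).
    """
    paths = (
        os.path.join(directory, component + "2master.fifo"),
        os.path.join(directory, "master2" + component + ".fifo"),
    )
    for path in paths:
        if not os.path.exists(path):
            # Assumed permission mode; adjust for the accounts involved.
            os.mkfifo(path, 0o600)
        elif not stat.S_ISFIFO(os.stat(path).st_mode):
            raise RuntimeError(path + " exists but is not a FIFO")
    return paths
```

Calling create_fifo_pair(directory, "web") would yield the web2master.fifo and master2web.fifo names listed above, and "man" would yield the CAPManage pair.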
Component Locations
Location of system components in Linux server file system
5.2 Master Program
Description
Acts as the coordinator of the entire system; whenever a data-changing action needs to be carried out, it must go through the master program. The program will be one process which runs through a perpetual loop reading various pipes and responding to the data in each. The main loop will maintain a constant connection to the database so that changes can be made quickly and reflected in the web interface. Only one instance of the program will be running at a time.
The Master Program will have the location of the configuration file passed to it when it starts. The configuration file will specify the locations and names of other components the program must manage. If the Master Program is unable to read the configuration file or open a log file, then it will terminate: a program that does not know where its components are, or that cannot report problems, should not be operating.
Master_Program_Pseudocode.txt
Command line: capmaster pid capconf.xml
The caller must specify the location of the master program's process ID (PID) file which is used to ensure that only one instance of the process is running. The second argument is the location of the XML configuration file for the system.
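One common way to use a PID file to enforce a single running instance is to create the file exclusively and treat an existing file as "already running". The sketch below illustrates that idea in Python; the function names are hypothetical, the master program itself need not be written this way, and stale PID files left behind by a crash are not handled.

```python
import os

def acquire_pid_file(path):
    """Return True if this process created the PID file, False if it exists."""
    try:
        # O_EXCL makes creation fail if the file is already present.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as pid_file:
        pid_file.write(str(os.getpid()) + "\n")
    return True

def release_pid_file(path):
    """Remove the PID file on clean shutdown."""
    os.remove(path)
```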
Messages
MSG_MASTERHERE [out]
Sent to web server when Master Program starts.
MSG_QUIT [in|out]
When received, message is forwarded to web server, archiver and downloader. Program will then continue reading from pipes until they are empty at which point it will terminate.
MSG_UNKNOWN [in|out]
When received, an entry is made in the error log. Sent in response to an unknown message in the pipe.
CAPManage
Description
A secondary program which takes command line arguments and passes them to the master program. This is the means by which the system administrator will manage the master program.
Commands
Messages
MSG_QUIT [out]
Sent to Master Program in response to the stop command.
MSG_UNKNOWN [in|out]
When received, notify user of the message which Master Program did not recognize. Sent when Master Program sends an unrecognizable message.
5.3 Web server
Description
The web server will be run on the latest version of the Apache HTTP Server. This is the only component which will be permitted to circumvent the master program and access the database directly. The web server will be able to perform read-only operations on the database. Forcing these queries to pass through the master program would be slow and mostly unnecessary because records in the database can be blocked from reading without such regulation. However, any commands to be executed must be sent to the master program to be processed. The web server can in no way add or update records in the database on its own.
Python
The web server will send requests to Python CGI scripts for processing. Python will act as the middle-man between the web server, and the database and master program. There will also be a Python script which is responsible for processing HTTP messages to and from the client plug-in.
Messages
MSG_MASTERHERE [in]
Master Program has restarted; unset flag and allow requests to be handled.
MSG_QUIT [in]
Flag is set to indicate that Master Program is unavailable; any HTTP requests which necessitate a change to the system will be denied. As long as the database is still available, then users can still view information.
MSG_UNKNOWN [in|out]
When received, write an entry in the error log and display an error message to the user. Sent in response to an unknown message in the pipe.
5.4 Archiver
Description
The archiver is responsible for taking content items and combining them into archive files. The finished archive files will be in the Mozilla Archive Format (MAFF). These are essentially ZIP files with some nifty meta-data for browsing them.
Messages
STATUS, CANCEL, ARCHIVE
Operation
The master program sends a STATUS message to determine if the archiver is idle. If it is, then the master program copies content items into the archiver's working directory and sends an ARCHIVE message. Upon completion the archiver sends another STATUS message to the master program which then copies out the new archive file and clears the archiver's working directory.
5.5 Downloader
Description
The downloader is responsible for parsing download requests from users and downloading the requested content. It will consist of the GNU Wget program which will perform the actual downloading and a wrapper C program which will serve as the interface between Wget and the master program.
Messages
STATUS, CANCEL, DOWNLOAD
Operation
The master program sends a STATUS message to determine if the downloader is idle. If it is, then the master program sends a DOWNLOAD message with the requests to be downloaded. Upon completion the downloader sends another STATUS message to the master program which then copies the downloaded content out of, and clears, the downloader's working directory.
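As a sketch of what the wrapper might hand to Wget, the function below maps a download option to a Wget argument list. The option keywords, the function name, and the choice of a starting URL for the domain case are all assumptions; only the Wget flags themselves (--directory-prefix, --recursive, --no-parent, --span-hosts, --domains, --input-file) are standard Wget options. The actual wrapper is planned as a C program, so this is purely illustrative.

```python
def build_wget_args(option, value, dest_dir):
    """Map a download request to a Wget argument list (illustrative only)."""
    args = ["wget", "--directory-prefix=" + dest_dir]
    if option == "url":
        # Single page.
        return args + [value]
    if option == "host":
        # Recursive retrieval; Wget stays on the starting host by default.
        return args + ["--recursive", value]
    if option == "directory":
        # Recursive retrieval that never ascends above the given directory.
        return args + ["--recursive", "--no-parent", value]
    if option == "domain":
        # Accept "*.example.com" or "example.com"; the start URL is assumed.
        domain = value[2:] if value.startswith("*.") else value
        return args + [
            "--recursive",
            "--span-hosts",
            "--domains=" + domain,
            "http://" + domain + "/",
        ]
    if option == "list":
        # value is the path of a file with one URL per line.
        return args + ["--input-file=" + value]
    raise ValueError("unknown download option: " + option)
```

Returning a list rather than a single command string keeps user-supplied values out of shell parsing when the command is eventually executed.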
5.6 Database
The system will use a MySQL database to store user, content, and log data. There is no wrapper program for the database; it will be directly accessed by the master program. The interface between the database and master program will be the MySQL C API. The database schema is described under the same-named section of this document.
5.7 File space
Description
The file space is the folder on the server's hard drive that will contain all users' content. There will be one root folder containing a subfolder for each username in the system. User content will be stored under this path: $ContentRoot/$Username/
Each content item will be named after its ID number in the database. So if a content item is stored with an ID of 123456, then it will be stored in the file space as: $ContentRoot/$Username/123456.content
All content items will be named with this method, whether they are downloaded content or archives. Folder hierarchies will be stored in the database. Therefore, all content items for a user will be stored in the single path given above. With this method, content does not need to be rearranged on the hard drive every time the user moves it to another folder.
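The naming rule above can be sketched as a small helper. The function name and the username check are assumptions added for illustration; the check simply guards against a username that would escape the content root.

```python
import os

def content_path(content_root, username, content_id):
    """Return $ContentRoot/$Username/<ID>.content for a content item."""
    # Assumed safety check: reject names that could leave the user's folder.
    if "/" in username or os.sep in username or username in ("", ".", ".."):
        raise ValueError("unsafe username: " + repr(username))
    return os.path.join(content_root, username, "{}.content".format(int(content_id)))
```

Since the path depends only on the owner and the ID, moving an item between folders changes a database record and never the file on disk.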
Interfaces
There is no wrapper program for the file space. File operations are directly called by the master program. The only operations which will need to be performed on the file space are: copying content into and out of file space, and deleting (permanently) content.
5.8 Configuration file
The configuration file is a simple XML file which stores all of the configuration information for the system. The system only reads the file, so any changes must be made manually by the system administrator.
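A minimal sketch of reading such a file is given below. The layout of the file has not been decided, so the element names (logfile, contentroot) and the flat one-level structure are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical required entries; the real configuration keys are undecided.
REQUIRED = ("logfile", "contentroot")

def load_config(xml_text):
    """Parse a flat XML configuration into a dict of tag -> text."""
    root = ET.fromstring(xml_text)  # raises ET.ParseError on malformed XML
    config = {child.tag: (child.text or "").strip() for child in root}
    missing = [name for name in REQUIRED if not config.get(name)]
    if missing:
        raise ValueError("missing configuration entries: " + ", ".join(missing))
    return config
```

Raising on a malformed or incomplete file fits the rule that the master program terminates when it cannot read its configuration.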
5.9 Client plug-in
Description
The client plug-in will be a Java program which receives requests to download content from the user and then forwards those requests to the server for processing. It will be written in Java to allow operation on multiple platforms. The client plug-in will take content download requests from the user, wrap the request(s) in a messaging format, and then send these messages to the server over a connection it has already set up.
Components
The client plug-in will consist of a Mozilla Firefox extension which provides an interface to the user, a Java program which handles the user's requests, and a TCP socket to the server over which to send the content requests. The Java program will contain a public class which is exported to the user interface.
Interfaces
Mozilla Firefox extension, Java program
Communication between the Mozilla Firefox extension and the Java program will be done using LiveConnect. Content download requests from the user interface will be simple text strings. Below is a table of the various download options available to the user. The input is passed to the Java program exactly as it appears in the input dialog in the user interface.
Functions exported to user interface:
NOTE: All functions throw exceptions if errors occur.
Connect(String user, String passwd) -- Attempts to establish TCP connection with server and authenticate user. Returns true if successful.
Disconnect() -- De-authenticates user from server and closes TCP connection. Returns true if successful.
DownloadContent(String input) -- Takes given input string and attempts to pass it to the server. No return value.
GetPreferences() -- Returns an XMLObject containing all of the client plug-in's preferences.
SetPreferences(XMLObject pref) -- Sets the client plug-in's preferences.
GetProgress() -- Requests that server give an update on user's job progress. Server replies with this structure within a message:
JobsProgress {double perc; String status; String msg;};
Java program, server
The Java program will establish a TCP connection with the server which will be used for communications between the client and server. All communications will use the same messaging format. If at any time the connection is lost, an exception is thrown to the user interface. The server never initiates contact with the client; if the client plug-in wants anything, it must ask the server for it.
5.10 Client-server MESSAGE format
This is the message format that is used for communications between the client plug-in and the server. Messages are sent over a TCP connection.
MESSAGE HEADER
int size // size of message body
char[4] content // content/type of message
MESSAGE BODY
Content dependent on message type
Content types:
AUTH, used when client is being authenticated/de-authenticated. When sent from client, contains username and password. When sent from server, contains result code.
REQT, used for download content request. When sent from client, contains request. When sent from server, contains acknowledgement of request.
PROG, used to get user's job progress. When sent from client, contains nothing. When sent from server, contains JobsProgress class specified in Client plug-in definition.
TODO: When message is of type AUTH, it must be encrypted.
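The header layout above could be packed and unpacked along the following lines. This sketch assumes a 4-byte signed size in network (big-endian) byte order, which is what Java's DataOutputStream.writeInt produces, followed by four ASCII bytes for the content type; neither the byte order nor the integer width is fixed by this document. The function names are hypothetical, and encryption of AUTH messages is not shown.

```python
import struct

# Assumed wire layout: 4-byte big-endian int, then 4 ASCII type bytes.
HEADER = struct.Struct("!i4s")
TYPES = (b"AUTH", b"REQT", b"PROG")

def pack_message(msg_type, body=b""):
    """Build header + body for one client-server message."""
    if msg_type not in TYPES:
        raise ValueError("unknown content type: " + repr(msg_type))
    return HEADER.pack(len(body), msg_type) + body

def unpack_message(data):
    """Split one complete message into (content type, body)."""
    size, msg_type = HEADER.unpack_from(data)
    if size < 0:
        raise ValueError("negative body size")
    body = data[HEADER.size:HEADER.size + size]
    if len(body) != size:
        raise ValueError("truncated message body")
    return msg_type, body
```

A PROG request from the client carries an empty body, so it is just the 8-byte header with a size of zero.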
5.11 Piped message format
Messages will be parsed by lines; a line is terminated by a newline character with ASCII code 0xA. Rather than have a message end with a certain character, it is better to specify the length of the message in a pre-determined line so that the contents of the message are not limited.
Messages will follow this format:
A message header indicating the nature of the message; a single word in all caps and underscores (ex: MSG_QUIT)
An unsigned long integer indicating the length of the message body
The message body which may contain any sort of data, or no data
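The three-part format above might be encoded and read back as sketched here. The sketch assumes the length is written as decimal ASCII text on its own line and that the body begins immediately after that line's newline; the document does not yet fix either detail, and the function names are hypothetical.

```python
def encode_piped(header, body=b""):
    """Build one piped message: header line, length line, then raw body."""
    return (
        header.encode("ascii") + b"\n"
        + str(len(body)).encode("ascii") + b"\n"
        + body
    )

def read_piped(stream):
    """Read one message from a binary stream; return None at end of stream."""
    header = stream.readline()
    if not header:
        return None
    # Assumed: the second line holds the body length as decimal text.
    length = int(stream.readline().strip())
    body = stream.read(length)
    if len(body) != length:
        raise ValueError("truncated message body")
    return header.rstrip(b"\n").decode("ascii"), body
```

Because the reader consumes exactly the stated number of bytes, the body may itself contain newline characters without confusing the parser, which is the reason given above for using a length line rather than a terminator.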
6. User Interface Design
6.1 Common Appearance and Behavior
All of the screens will have a logo header, a tree view of the folder hierarchy on the left, and a status bar at the bottom of the page. They will all support cut, copy, paste, and delete operations with text and any other applicable items which are selected. All of the screens will also support keyboard shortcuts for the above-mentioned operations as well as responding to the Return key by executing some default action for that screen.
6.2 Individual Screens
Home
Gives the user a broad overview of the content they own as well as the current job queue. The user will be shown miscellaneous information about the content they own. I will not try to specify the information to be shown now because this is a highly flexible feature and will be implemented as said information is desired. There will be a pie graph showing the user how much of their allocated file space has been used as well as what comprises that used space. The user will also be able to view the job queue; it will display all jobs, but jobs that the user does not own will be listed anonymously.
Profile
Allows user to view and change personal information about himself in the system. The user must answer his security question before he can access his profile page. This might seem like a hassle, but the profile page displays sensitive information which needs to be protected. The password and security question fields will indicate the age of those fields. There will be a list view which dumps all data that references the user in the database. This is to facilitate an honest privacy policy by showing the user what the system knows about him.
Archives
Allows user to view, run (compile content into archive), and manage archives. The controls of this screen will consist of a dropdown box which lists the user's archives, a button to run the archiver, a dropdown box with options to manage archives, and a button to download the archive (if it has been compiled). The user can create new archives, duplicate the current archive, and delete the current archive. When an archive is selected it displays information about that archive as well as a list of all the content in the archive.
6.3 Content
Viewing and managing content is done with a tree view, a list view, and a content view. The tree view is a folder hierarchy with two parent nodes: the user's root folder and a shared folder accessible by all users. The tree view only allows one folder to be accessed at a time and only displays folders, not content therein. The list view displays a folder's subfolders, content, and archives. The list view will have a header which indicates the parent folder as well as additional columns displaying details about the list items. Multiple items in the list can be selected but this does not guarantee that an operation can be carried out on all of the selected items.
Drag-and-Drop Behavior
All items in the list view can be dragged-and-dropped into subfolders or into a folder in the tree view. Anytime an item is dropped on the Shared folder the only action taken is that the item is shared—nullifying normal behavior.
Popup Menu
There will be a popup menu displayed upon a right-click within either the tree or list view.
Actions for Selected Items
6.4 Client Plug-in
Authenticating
The user must authenticate with the server before he can begin downloading web pages. If he is already logged in via the web interface, then the plug-in will authenticate automatically. Otherwise, he must log in to the server via the plug-in.
Downloading Pages
There is a single button for downloading the URL currently being viewed. There are four options for downloading multiple pages.
Domain: specifying domain from which to download pages (e.g. *.reptilesmagazine.com)
Host: specifying host from which to download pages (e.g. http://reptilesmagazine.com)
Directory: specifying directory from which to download pages (e.g. http://reptilesmagazine.com/snakes/)
URL list: specifying a list of URLs to download
Each of these selections brings up an input dialog where the user can enter in the desired values. There is a small control on the client plug-in which displays the inputs given and can be clicked on to edit those values. When either of the download buttons is selected, a download request message is created and sent to the server.
Other
There is a progress icon between the single page and multiple page download areas which displays the server's progress in downloading the pages. The progress shown is the total completed of this user's jobs in the job queue. The logout button de-authenticates the user from the system. The preferences button brings up a standard preferences dialog.
6.5 Visual Mockups
Web interface.pdf, revision 1.0
Client plug-in.pdf, revision 1.0
--
GrantGipson - 2011-02-12