Abstract:

This paper describes a retrieval method for a significant amount of music tracks. It takes a melody sung or a text input as a search clue which is sent over the internet to retrieve the song from the database. This system uses a variety of procedures to support multiple search modes. The thorough indexing method allows for rapid recovery of the song. Integration of these components within a single architecture shall improve performance and functionality.

Overview of Objectives:

The aim of this paper is to utilise and develop existing techniques for Music Information Retrieval. The system shall be able to retrieve a song from either a text query or a melody query. This melody can be sung, a riff or humming. Browsing the database shall also be available to the user. The database shall reside on a server. For the purposes of this paper, it shall contain 10,000 songs. However, taking into consideration all the features of this system, I would expect it to be able to process a database 1000% larger.

Figure 1 – Name That Tune Architecture

Functional Description of the System:

The system shall have a graphical interface containing 2 search engines:

1.         Text Input – User may enter the title, line of song, artist

2.         Aural Input – User may sing, hum, play a riff

The user would choose their preferable search method.

The aural input is processed through a Speech Recognition System. This comprises of 2 components:

1.         An acoustic model – using statistical modelling using hidden Markov models. 2.    A language model – model of the expected word sequences. This vocabulary will contain 80,000 words.

The songs shall be contained in a database with a corresponding value recorded in an index. This would contain 4 sections:

1.         Traditional Text Matching

2.         Theme Matching

3.         Phonetic Matching

4.         Markov Representation

There are other techniques, but after researching I feel that these are the most productive.

Sections 1 and 2 deal with the Text Input and 3 and 4 with the Aural Input. The file structure shall be that of the inverted file. This stores term locations. It shall also include term proximity. The most effective method to compute term locations is to use a hashing algorithm which will be employed in this system.

As all lyrics shall be recorded with each song, the Text Matching search shall be the most accurate. However, many users may not input the correct words and incorrect songs would be displayed.

This method shall use general information retrieval techniques.

1.         Remove all stop words as these do not aid retrieval. Eg. a, the, that, so etc. Research has shown that memory requirements for document representations can often be reduced by more than 50%.

2.         This system shall use the Porter stemming algorithm to remove suffixes that inhibit matching. This would require the creation of a dictionary of suffixes. This uses iterative rules of the form

a.         remove plurals, -ED, ING .

b.         terminal Y – > CARRI, CARRY – > CARRI

c.         map double suffices to single, – ISATION

d.         deal with –IC, -FULL, -NESS

e.         remove –E if word > 2

The option of selecting a theme would be available. The MUSART application utilises this feature. [6] This will be in the form of a drop down menu. To improve this I would add the genre to the classification. Genres to divide the corpus: Rock, Jazz, Country and Folk, Classical, Dance and Rap. If neither achieved successful results or are not selected, the phonetic search would then be deployed.

The user would make an aural input which would be phonetically deconstructed yielding results from the Phonetic Index. This index contains a summary of each song’s phonetic decomposition. English has about 45 distinct phones. Our phonetic dictionary would contain more than 200,000 entries. [2] Recognition at the phone level means we do not have to explicitly build models of all words, new words can be added to the vocabulary easily if their phonetic structure is known. This shall make the system adaptable to out-of-vocabulary words.

Searching for a phonetic string has a greater success rate when the text search has yielded no correct results. This is discussed in Witbrock and Hauptmann’s paper although they are dealing with incorrectly transcribed documents, the same premise applies in this case. [3]

Hidden Markov Model, H.M.M., is a very powerful tool to statistically model a process that varies in time. It can be seen as a doubly embedded stochastic process with a process that is not observable (hidden process) and can only be observed through another stochastic process (observable process) that produces the time set of observations. A HMM can be fully specified by

1.         N, the number of states in the model

2.         M, the number of distinct observation symbols per state

3.         A = {aij}, the state transition probability distribution

4.         { ( ) } j B = b k , the observation symbol probability distribution

5.         { } i P = p , the initial state distribution. [5]

Employing this model is very efficient. There is considerable data reduction as each song is transformed to a bit vector.

Better retrieval performance can be achieved by the introduction of term weighting, using the collection frequency weight in our case. The key concept of this is that terms that only occur in a few documents are often more valuable than ones which occur in many documents.

It is defined as follows for a search term t(i).

Given

n(i) = the number of documents term t(i) occurs in,

N = total number of documents in the collection archive.

The cfw for term t(i) is

cfw(i) = log N
            n(i)

The length of a song should also be taken into account. Document relevance is independent of document length. Without compensating for document length, longer documents will tend to have a higher weighting. Research has shown that a term that occurs the same number of times in a short document and in a long document, is likely to be more valuable for the former.

Document length of document d(j) is

dl(j) = total number of term occurrences in document d(j)

One length compensation method requires a normalised average document length defined as:

ndl(j) =       ___________dl(j)________
              average dl for all documents

The complete weighting scheme described further in [2] is

cw(i,j) =      cfw(i) x tf(i, j) x (K + 1)
                  K x ndl(j) + tf(i, j)

Limitations:

For the aural input, recognition of spontaneous speech will be difficult due to disfluencies. E.g. false starts, self corrections, “um”, “err”, etc, poor linguistic structure.

Each region has a different dialect which will affect the accent. A small amount of adaptation data would have to be inputted into the system so that it will be able to understand each different user correctly.

With an extremely large corpus, some songs will be very similar especially when dealing with acoustic sets. The Markov model may be able to counteract this complex issue as all songs shall have their own unique bit vector value.

Implementation and Evaluation Plan:

The test collection should consist of 10,000 songs. If this works effectively, this number could be increased significantly.

Data should be derived informing us which documents are relevant to each request.

Once a user inputs their query, this is sent to the database. This query is compared to the index and titles of the corresponding songs are ranked and sent back to the client. If the user is not satisfied with the results, he/she can choose another search method with different parameters.

The matching score for each document for each request should be calculated. All outputs should be evaluated document by document. In essence, the goal of a IR system is:

1.         retrieve relevant documents

2.         not to retrieve non-relevant ones.

To measure this, the following formulas should be applied to our results:

1.         Recall  =   No. of relevant docs retrieved
                      Total relevant docs in collection

This shall tell us what proportion of the overall relevant documents has been retrieved.

2.         Precision  =  No. of relevant docs retrieved
                             Total docs retrieved

This shall tell us what proportion of the retrieved documents is relevant.

The system should be evaluated against the following Test Scenario. If this results in a high success rate, it is ready for mainstream use.

1.         Text Search for 2 songs from each genre: Rock, Jazz, Country and Folk, Classical, Dance, Rap

2.         Text Search for 1 song from each theme

3.         Aural Search for 2 songs from each genre: Rock, Jazz, Country and Folk, Classical, Dance, Rap

4.         Aural Search for 1 song from each theme

This shall give us a minimum of 30 test results. From the results of our formulae, we shall be able to see how accurate our system is.

References:

1. Downie, J. Stephen. Music information retrieval (Chapter 7). In Annual Review of Information Science and Technology 37, ed. Blaise Cronin, 295-340. Medford, NJ: Information Today, 2003.

http://music-ir.org/downie_mir_arist37.pdf

2. M. G. Brown, J. T.  Foote, G. J. F. Jones, K. Sparck Jones, S. J. Young.

Open-Vocabulary Speech Indexing for Voice and Video Mail Retrieval.

http://delivery.acm.org/10.1145/250000/244232/p307-brown.pdf?key1=244232&key2=2080136611&coll=&dl=ACM&CFID=15151515&CFTOKEN=6184618

3. Michael J. Witbrock, Alexander G. Hauptmann. Using Words and Phonetic Strings for Efficient Information Retrieval from Imperfectly Transcribed Spoken Documents. http://zero.inf.cs.cmu.edu/alex/dl97.pdf

4. Tomonari Sonada, Yoichi Muraoka. A WWW-based Melody-Retrieval System – An Indexing Method for a Large Melody Database.

http://www.sonoda.net/papers/sonoda-icmc1998.pdf

5. Wei Chai, Barry Vercoe, Folk Music Classification Using Hidden Markov Models.

http://alumni.media.mit.edu/~chaiwei/papers/chai_ICAI183.pdf

6. William P. Birmingham, Roger B. Dannenberg, Gregory H. Wakefield, Mark Bartsch, David Bykowski, Dominic Mazzoni, Colin Meek, Maureen Mellody, William Rand.

MUSART: Music Retrieval Via Aural Queries. http://ismir2001.ismir.net/pdf/birmingham.pdf

VN:F [1.8.8_1072]
Rating: 0.0/10 (0 votes cast)
VN:F [1.8.8_1072]
Rating: 0 (from 0 votes)

Related posts:

  1. Shale – Online Music Tutorial Suite | Entrepreneurial Student Awards 2006 |
  2. Alley Katz – Meeting Minutes
  3. MS SQL Server Vs MS Access

Leave a Reply

(required)

(required)