Welcome to the MacNN Forums.

If this is your first visit, be sure to check out the FAQ by clicking the link above. You may have to register before you can post: click the register link above to proceed. To start viewing messages, select the forum that you want to visit from the selection below.

You are here: MacNN Forums > Software - Troubleshooting and Discussion > Developer Center > Question about hashing

Question about hashing
Thread Tools
Mac Elite
Join Date: Sep 2000
Location: New York
Status: Offline
Reply With Quote
Apr 16, 2002, 12:52 AM
 
I'm doing a programming assignment in C where I have to read in a large number of strings and keep track of them in some way. What I need to do is be able to read in the strings and then somehow create an integer to identify the string as unique. I assume that I need some sort of hash function that would return a unique integer value for each string that I give to it. That way with that unique value I can then keep track of whether I have seen this string before or not. I then would want to keep an array of some kind that would use the hash value as the index and store the actual strings so I can retrieve them. So then it would be ideal if this hash function would return integers between a specified range, which would be the number of lines that I have read in.

What I want to know is how do I create a hash value to give me this unique number? I really don't know anything about hashing. Thanks.
     
Dedicated MacNNer
Join Date: Mar 2001
Location: Iowa City, IA
Status: Offline
Reply With Quote
Apr 16, 2002, 01:17 AM
 
I had the same question when I first learned about hashes, and the answer is that there's no "magic bullet" for coming up with a hash function. For this assignment, it apparently suffices to find some way to derive a unique number for a given set of strings (rather than a general-purpose hash function for any arbitrary set of strings), so: Maybe the first and fourth characters (if every string has that many characters) are unique. Maybe the ASCII value of the first character multiplied by 8, plus the second by 4, the third by 2, and the fourth? Or maybe only the first two characters? There's no way around looking at the strings and experimenting. [hint: I used powers of two as multipliers because you can do power-of-two multiplication with the bit-shifting operators << and >>. These are very fast, since the compiler can map them directly to instructions on the CPU (on all CPUs), and you want your hash function to be as fast as is practicable.]

You can forget coming up with a range that is equal to the number of strings you have to read in, unless that's explicitly stated as a requirement in your assignment. It's possible, but it's neither common nor expected. Hashes can be sparse, and they usually are.

[ 04-16-2002: Message edited by: Amorph ]
James

"I grew up. Then I got better." - Sea Wasp
     
Forum Regular
Join Date: Feb 2002
Location: Fort Lauderdale, Florida
Status: Offline
Reply With Quote
Apr 16, 2002, 01:30 AM
 
Just do a quick net search, i guarantee you will find a bunch of
examples of getting keys for strings. Just search for "hash functions"
     
Mac Elite
Join Date: Sep 2000
Location: New York
Status: Offline
Reply With Quote
Apr 16, 2002, 06:25 AM
 
Thanks. The strings that I will be reading in are actually actors' names. My assignment is to relate any actor to kevin bacon via the movie game, or the six degrees of kevin bacon game. This is why I need to give a unique number to each actor and to each movie so that when I go back and have to check if I have reaed in this actor before from a different movie I could tell very easily by checking my array using the has number. This is my idea that I've come up with. Does anyone have a better idea that might not involve hashing?
     
Junior Member
Join Date: Jul 2001
Location: Mos Eisley Cantina
Status: Offline
Reply With Quote
Apr 17, 2002, 02:12 AM
 
Since it sounds like the purpose of your assignment is not to write your own hashing function, but to perform the task at hand (6 degrees of Kevin Bacon), you could use the MD5 hash (check the man pages for md5). Chances are slim that you'll get two actor names that hash to the value.

If execution time is not an issue, and you don't need to write your own container classes (graphs, dynamic arrays, etc), then you probably don't need to use any hashing. What you COULD do is store the data in a undirected, weighted graph (where the "weight" of the connection is the movie name, or a pointer to a movie name, or whatever). The nodes in the graph are the actor names. The connections in the graph link the actors together if they appeared together in a movie. If actors appear together in more than one movie, it doesn't really matter, since a pair of actors only need to be "linked" by one movie.

To check if an actor or movie is already in a graph, you can use two sorted arrays. One stores pointers to graph nodes where the sort key is the actor name. Another stores pointers to connections where the sort key is the movie name. You can then use a binary search to quickly locate if a movie or actor already exists.

Finally, to find the "6 degrees of Kevin Bacon", there are many path finding algorithm for graphs that will work. Obviously you wouldn't use one that takes the weight between the nodes into consideration, because these weights are meaningless in terms of "distance".

Yes, this method is a little wonky, but it's all I can come up with on the fly at 3 am. :-)
     
Mac Enthusiast
Join Date: Jan 2001
Status: Offline
Reply With Quote
Apr 17, 2002, 02:18 PM
 
i may be misunderstanding what exactly you want to do, but if all you need is to check whether or not you have already "read in" an actor (or movie), you could implement a simple binary tree. It would be faster and more efficient than some sort of hash function - and you wouldn't need arrays at all. But this is only useful if you want to check whether or not you have already read in an actor.
     
Mac Elite
Join Date: Jun 2001
Location: Dundee, Scotland
Status: Offline
Reply With Quote
Apr 23, 2002, 04:54 AM
 
Here's a Hash Function:
<BLOCKQUOTE><font size="1"face="Geneva, Verdana, Arial">code:</font><HR><pre><font size=1 face=courier>

typedef unsigned long hash_key;

hash_key hash_elf(const unsigned char *name ) <font color = brown>// Elf Hash</font>
{
hash_key k = <font color = blue>0</font>, g;

while ( *name )
{
k = (k&lt;&lt;<font color = blue>4</font>) + *name++;
if (g = k & 0xF0000000 )
k ^= g&gt;&gt;<font color = blue>24</font>;
k &= ~g;
}


return k;
}
</font>[/code]

However, beware - it doesn't provide a unique number for each string, just a fairly random one.

To make sure that each number was unique you would have to then remember how many actors have been given each number and tack that on the end (or put them in a list, which is what most people do).

..of course if you want each actor to have a unique number, why not just number each on the first time you meet them (Kevin Bacon = 1, Holly Hunter = 2, ..). Of course this could result in a very inneficient algorithm if you have huge amounts of data (searching a whole table for a name every time).

It doesn't sound like you are going to have enough data to require the performance aid of a hash function.

I would use a linked list myself, or more likely a tree where you can count how far away each actor is from Kevin Bacon just as you would if you drew a diagram on a peice of paper.

But if you *have* to use a hash the one above is fairly standard. You can then use the number generated and multiply it by the size of your array. remember though that two actors can have the same hash value, so you will need to add a layer of linear search. If your hash is big enough this should be very small (3 or 4 checks..)

hope this helps.

s.
     
Mac Elite
Join Date: Jun 2001
Location: Dundee, Scotland
Status: Offline
Reply With Quote
Apr 23, 2002, 05:31 AM
 
I've just re-read the problem.

It seems to me that the hash key is a bit of a red herring. It is not part of the algorithm as such, just the bit at the beginning that finds the place in a tree to start the search from. (You could just as easily have an array or strings as you will only ever have to search it once.)

The real issue is the graph of actors. This is almost certainly a binary search algorithm.

You will need to build a graph.

Imagine an actors name with a circle around it and a (double headed) arrow pointing to another actors name with a circle around it. This means that the actors have been in a film together. Now, fill the page up with 100 actors (one of which is Kevin Bacon) and you will have a graph that (according to the theory) should always allow you to get to Kevin in six or fewer steps.

If each arrow is a C pointer (or two) and each circle a peice of data that corresponds to a name in our actors name array, we should be able to search from any point on the graph, looking for Kevin.

Kevin, of course, can be given an easy to spot ID eg 0. A very simple algorith could be limited to only six steps.

However, there are a number of sophisticated ways to do this.

You could (if you can imagine perhaps a set of paper disks attached by thread like a baby's mobile) pick the whole graph up by Kevin and give it a shake. Now Kevin is at the top and everyone else is below him. Now, as long as you only travel upwards from your actor, you should get to Kevin in the shortest route. Well there are pitfalls, but that's what the assignment is about, surely.

I'll leave it to you to work out how to do that programatically. My advice would be to do the following.

1) Get a big pile of paper and try it out before you write any code.

2) Write out an algorithm and try it using your fingers. One on the algorithm and another on the graph.

3) Do try the thread and cut-out paper "mobile" model. It will teach you loads these kinds of structures. Try holding it up buy Kevin and try to work out how to get to your actor. Try Holding up your actor and try getting to Kevin etc. Try holding up Kevin and try to get to him from your actor etc. It will soon become obvious what is and isn't sensible.

4) Don't attempt any code on a structure like this that doesn't involve "recursion".

hope that helps even more...

s.
     
   
Thread Tools
Forum Links
Forum Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts
BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On
Top
Privacy Policy
All times are GMT -5. The time now is 09:47 AM.
All contents of these forums © 1995-2011 MacNN. All rights reserved.
Branding + Design: www.gesamtbild.com
vBulletin v.3.8.7 © 2000-2011, Jelsoft Enterprises Ltd., Content Relevant URLs by vBSEO 3.3.2