Outgoing: MapReduce

accesine 2005-09-20


MapReduce

Jeffrey Dean and Sanjay Ghemawat of Google have written a paper about a method of processing large data sets they call MapReduce.

Many will be familiar with the functional programming constructs of map and reduce. Map applies a function against each element of a list to get a transformed version of the list. For example, in Python, map(chr, [97,98,99]) transforms a list of three numbers into a list containing the equivalent characters:

>>> list(map(chr, [97,98,99]))
['a', 'b', 'c']

It's as if you executed [chr(97),chr(98),chr(99)].

Reduce takes a function of two arguments and applies it cumulatively to the items in the list, from left to right, resulting in a single value:

>>> import operator
>>> from functools import reduce
>>> reduce(operator.add, ['a','b','c'])
'abc'

This is the string formed by the operations ('a'+'b')+'c'. This programming style lends itself naturally to nesting:

>>> reduce(operator.add, map(chr, [97,98,99]))
'abc'

The functional aspects of these operations are similar to Unix filters, where files get piped from one filter to another. Here's a pipeline that takes a file of MARC21 records, converts the end-of-record markers to line feeds, selects the records containing the word 'smollet', and counts them:

cat clinker.marcu | tr '\035' '\n' | grep -iw 'smollet' | wc -l

Comparing this to map/reduce, the cat, tr, and grep commands play the role of map, and the wc command plays the role of reduce.
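The same analogy can be sketched in Python. This is only a rough illustration: the function name count_matching and the sample data are made up, and the records are plain strings separated by the end-of-record character (octal 035) rather than real MARC21 records.

```python
import re
from functools import reduce

RECORD_TERMINATOR = "\x1d"  # octal 035, the end-of-record marker


def count_matching(data, word):
    """Count records that contain `word` as a whole word, ignoring case."""
    pattern = re.compile(r"\b%s\b" % re.escape(word), re.IGNORECASE)
    # Like tr: turn the stream into one item per record (skip empty pieces).
    records = (r for r in data.split(RECORD_TERMINATOR) if r)
    # Like grep -iw, expressed as a map: 1 for a matching record, 0 otherwise.
    flags = map(lambda rec: 1 if pattern.search(rec) else 0, records)
    # Like wc -l: reduce the flags to a single count.
    return reduce(lambda total, flag: total + flag, flags, 0)


# Hypothetical sample data, not real MARC21.
sample = "Smollet, Tobias\x1dHumphry Clinker by SMOLLET\x1dFielding, Henry\x1d"
print(count_matching(sample, "smollet"))  # prints 2
```

Unlike the shell pipeline, this sketch counts records rather than lines, so a record that contained an embedded line feed would still be counted once.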

In the Google model, given a set of key/value pairs, the map function produces a new set of key/value pairs using a function supplied by the programmer. The reduce function then collapses all the values for a given key into a single value. Google has found that a robust implementation of this model, running in a massively parallel environment (thousands of nodes), makes it possible to routinely process huge files in many different ways. The slides offer a good overview of their work.
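As a minimal single-machine sketch of that model, assuming everything fits in memory: the names map_reduce, word_mapper and sum_reducer are invented for this illustration and are not taken from the Google paper or any library.

```python
from collections import defaultdict


def map_reduce(pairs, mapper, reducer):
    """Apply mapper to each (key, value) pair, group the emitted pairs
    by their new key, then reduce each group to a single value."""
    grouped = defaultdict(list)
    for key, value in pairs:
        for new_key, new_value in mapper(key, value):
            grouped[new_key].append(new_value)
    return {k: reducer(k, values) for k, values in grouped.items()}


def word_mapper(doc_id, text):
    # Emit (word, 1) for every word in the document.
    for word in text.split():
        yield (word.lower(), 1)


def sum_reducer(word, counts):
    # Collapse all the 1s for a word into a total.
    return sum(counts)


# Hypothetical input: (document number, document text) pairs.
docs = [(1, "the cat sat"), (2, "the dog sat"), (3, "The end")]
counts = map_reduce(docs, word_mapper, sum_reducer)
print(sorted(counts.items()))
# [('cat', 1), ('dog', 1), ('end', 1), ('sat', 2), ('the', 3)]
```

The programmer supplies only the mapper and the reducer; the grouping in the middle is the part a framework would take care of.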

Here's a more involved example written in Python that closely follows the Google approach:

First we need a list to process as input:

((1,'boy'),(2,'dog'),(3,'cat'),(4,'aardvark'),(5,'cat'))

This is a list of 5 key-value pairs. You might think of the key as record number and the string as the record.

Here's our map function. It takes a list of key-value pairs, such as our input, and for every string that contains an 'a' it returns a new pair with the string as the key and the record number as the value:

def myMap(gen): return ((v, k) for k, v in gen if 'a' in v)

For our input list, this returns:

(('cat', 3), ('aardvark', 4), ('cat', 5))

Next, this list gets grouped so that all the record numbers for each word are collected together. You can find the code to do this at the end of the post. Here's the grouped list it outputs:

(('aardvark', [4]), ('cat', [3, 5]))

This shows that 'aardvark' occurred in record 4, and 'cat' in records 3 and 5.
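As an alternative sketch of the grouping step, the standard library's itertools.groupby can do the same job. The function name group_by_key is made up for this illustration and is not part of the code in this post.

```python
from itertools import groupby
from operator import itemgetter


def group_by_key(pairs):
    """Yield (key, [values]) for each distinct key, in sorted key order."""
    ordered = sorted(pairs)  # groupby only merges adjacent items with equal keys
    for key, run in groupby(ordered, key=itemgetter(0)):
        yield (key, [value for _, value in run])


mapped = (('cat', 3), ('aardvark', 4), ('cat', 5))
print(tuple(group_by_key(mapped)))
# (('aardvark', [4]), ('cat', [3, 5]))
```

The sort is what makes this work: without it, the two 'cat' pairs would not be adjacent and would come out as two separate groups.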

Here's a reduce function that outputs each word with a count:

def myReduce(gen): return ((k, len(v)) for k, v in gen)

From the grouped results this will generate:

(('aardvark', 1), ('cat', 2))

What Google has done is take the map/reduce paradigm and make it work in parallel in their environment of thousands of millions of records. Our work with our own (somewhat smaller scale) Beowulf cluster made us think we could usefully apply many of their concepts in our own processing of tens of millions of bibliographic records. (Actually, OCLC has more than a thousand million records, but we don't maintain those online yet.)
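To give a feel for what "in parallel" means here, the toy sketch below splits the input into shards, maps each shard independently, and then merges the results by key. It is not Google's implementation or ours: the names map_shard and parallel_count are hypothetical, and a thread pool on one machine merely stands in for separate nodes.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_shard(shard):
    """Map one shard: emit (word, record number) for words containing 'a'."""
    return [(v, k) for k, v in shard if 'a' in v]


def parallel_count(records, workers=2):
    # Split the input into one shard per worker (round-robin).
    shards = [records[i::workers] for i in range(workers)]
    grouped = defaultdict(list)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each shard is mapped independently of the others.
        for pairs in pool.map(map_shard, shards):
            # Merge the shard outputs, collecting values by key.
            for key, value in pairs:
                grouped[key].append(value)
    # Reduce: collapse each list of record numbers to a count.
    return {key: len(values) for key, values in grouped.items()}


records = [(1, 'boy'), (2, 'dog'), (3, 'cat'), (4, 'aardvark'), (5, 'cat')]
print(sorted(parallel_count(records).items()))
# [('aardvark', 1), ('cat', 2)]
```

Because the map step looks at one record at a time, the shards never need to communicate; only the grouping step has to bring together values that share a key.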

More on our own work with MapReduce (which we are doing in Python) in a subsequent post.

I imagine there are a number of implementations of MapReduce. The Nutch project has a Java implementation.

Here is all the code in one spot, including the group function that is run between map and reduce:

records = ((1, 'boy'), (2, 'dog'), (3, 'cat'), (4, 'aardvark'), (5, 'cat'))

def doMap(gen):
    # emit (string, record number) for every string containing an 'a'
    return ((v, k) for k, v in gen if 'a' in v)

def doReduce(gen):
    # replace each list of record numbers with its length
    return ((k, len(v)) for k, v in gen)

def group(gen):                          # accept a list of key,value pairs
    sl = sorted(gen)                     # sort
    if not sl: return                    # might be empty
    rkey, rlist = sl[0][0], [sl[0][1]]   # a key and list to return
    for k, v in sl[1:]:                  # process rest of sorted list
        if k == rkey:
            rlist.append(v)              # extend the list for this key
        else:
            yield (rkey, rlist)          # output key & list
            rkey, rlist = k, [v]         # start next key & list
    yield (rkey, rlist)                  # output last key & list

print(tuple(doReduce(group(doMap(records)))))

--Th & Jenny Toves
