A simple Python MapReduce framework based on Hadoop streaming.

Install

Download pymred.py to your working directory.
Set the following environment variables (you can also pass them as command options):
- export HADOOPEXEC=/home/hadoop/current/bin/hadoop
- export PYTHONEXEC=/usr/bin/python
- export STREAMINGJAR=/home/hadoop/current/contrib/streaming/hadoop-...-streaming.jar
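
The lookup described above (a command option, otherwise the environment variable) can be pictured with the sketch below. It is not taken from pymred.py: the helper name `resolve_setting` and its arguments are hypothetical, and it assumes that a command option wins when both are present.

```python
import os

def resolve_setting(option_value, env_name, environ=None):
    """Return the command-line value if given, else the environment variable.

    Hypothetical helper, not part of pymred: it only illustrates the
    assumed lookup order (option first, environment second).
    """
    environ = os.environ if environ is None else environ
    if option_value:
        return option_value
    if env_name in environ:
        return environ[env_name]
    raise KeyError("set %s or pass the matching command option" % env_name)
```

For example, `resolve_setting(None, "HADOOPEXEC")` would return the exported path, while passing a non-empty first argument would override it.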

Quick Start

Write wordcount.py

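import pymred

def map(key, value):
    # Emit (word, 1) for every word in the input line.
    # split() with no argument also skips repeated spaces.
    for word in value.split():
        yield word, 1

def reduce(key, values):
    # Sum the counts for one word and emit "word count" as the output key.
    yield key + " " + str(sum(values)), None

pymred.run(map, reduce)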
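
To make the contract of the two functions easier to see, here is a minimal in-memory sketch of the map, shuffle, and reduce steps. It does not use pymred and is not how pymred is implemented: the `simulate` helper is hypothetical, and passing the line index as the map key is an assumption of this sketch.

```python
from collections import defaultdict

def map(key, value):
    # Same idea as wordcount.py: one (word, 1) pair per word.
    for word in value.split():
        yield word, 1

def reduce(key, values):
    yield key + " " + str(sum(values)), None

def simulate(lines, mapper, reducer):
    """Run mapper and reducer in memory (illustration only, not pymred)."""
    groups = defaultdict(list)
    # Map step: the key passed here is the line index (an assumption).
    for index, line in enumerate(lines):
        for out_key, out_value in mapper(index, line):
            groups[out_key].append(out_value)
    # Shuffle and reduce step: visit the grouped keys in sorted order.
    results = []
    for out_key in sorted(groups):
        for result_key, _ in reducer(out_key, groups[out_key]):
            results.append(result_key)
    return results

print(simulate(["a b a", "b c"], map, reduce))
# prints ['a 2', 'b 2', 'c 1']
```

The point of the sketch is only that `map` sees one record at a time, while `reduce` sees one key together with all values emitted for it.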

Run wordcount.py

Run in Hadoop:

python wordcount.py -input /path/of/hdfs/testin -output /path/of/hdfs/.../testout -numReduceTasks 10

Run locally (commonly used for testing the code):

python wordcount.py -input ~/testin -output ~/testout -mode local

Complete usage of wordcount.py

Usage wordcount.py
Options:
  -input          <path>    DFS input file(s) for the Map step
  -output         <path>    DFS Output directory for the Reduce step
  -inputfmt       Text<Default>|Code|SequenceText|SequenceCode
  -outputfmt      Text<Default>|Code
  -mode           hadoop<default>|local   Following options will be ignored if local mode is set
  -hadoopexec     <hadoopexec>    Optional if environ variable "HADOOPEXEC" is set
  -pythonexec     <pythonexec>    Optional if environ variable "PYTHONEXEC" is set
  -streamingjar   <streamingjar>  Optional if environ variable "STREAMINGJAR" is set 
  -file           <file>    Optional  File/dir to be shipped in the Job jar file
  -partitioner    JavaClassName  Optional.
  -numReduceTasks <num>     Optional.
  -inputreader    <spec>    Optional.
  -cmdenv         <n>=<v>   Optional. Pass env.var to streaming commands
  -mapdebug       <path>    Optional. To run this script when a map task fails 
  -reducedebug    <path>    Optional. To run this script when a reduce task fails
  -verbose
Generic options supported are
  -conf <configuration file>     specify an application configuration file
  -D <property=value>            use value for given property
  -fs <local|namenode:port>      specify a namenode
  -jt <local|jobtracker:port>    specify a job tracker
  -files <comma separated list of files>    specify comma separated files to be copied to the map reduce cluster
  -libjars <comma separated list of jars>   specify comma separated jar files to include in the classpath
  -archives <comma separated list of archives>    specify comma separated archives to be unarchived on the compute machines
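
As one example of the options above, `-cmdenv` sets an environment variable for the streaming tasks, which a map function could then read through `os.environ`. The sketch below assumes a hypothetical variable named MIN_LENGTH with a default of 1; neither comes from pymred. Since the usage text says these options are ignored in local mode, the variable would presumably have to be exported in the shell for local runs.

```python
import os

def map(key, value):
    # MIN_LENGTH is a hypothetical variable, e.g. passed as -cmdenv MIN_LENGTH=4.
    min_length = int(os.environ.get("MIN_LENGTH", "1"))
    for word in value.split():
        # Keep only words that are at least min_length characters long.
        if len(word) >= min_length:
            yield word, 1
```

Reading the value inside the function keeps the script free of hard-coded thresholds, at the cost of one environment lookup per record.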
