Hakuna MapData! » Blog Archive » Why put is better than copyFromLocal when copying files to HDFS?

Why put is better than copyFromLocal when copying files to HDFS?

| Posted in Doubts Resolving |

When you type hadoop fs, you may notice that the following two options look quite similar:

akawa@hadoop:~$ hadoop fs
Usage: hadoop fs [generic options]
       [-copyFromLocal <localsrc> ... <dst>]
       [-put <localsrc> ... <dst>]

HDFS Shell Guide

Let’s have a quick look at the official HDFS Shell Guide to find out the differences between them (the descriptions below are slightly edited to make them more compact):

put
Usage: hdfs dfs -put <localsrc> ... <dst>
Copy single src, or multiple srcs from local file system to the destination file system. Also reads input from stdin and writes to destination file system, for example:

# reads the input from stdin.
hdfs dfs -put - hdfs://nn.example.com/hadoop/hadoopfile

copyFromLocal
Usage: hdfs dfs -copyFromLocal <localsrc> URI
Similar to put command, except that the source is restricted to a local file reference.

OK. When I first read the descriptions in the HDFS Shell Guide, I understood that:

  • put should be the more versatile of the two, since it lets me copy multiple file paths (files or directories) to HDFS at once, as well as read input from stdin and write it directly to HDFS
  • copyFromLocal seems to copy only a single file path (in contrast to what hadoop fs -usage copyFromLocal displays) and does not support reading from stdin.

Put and CopyFromLocal in action

Since this left me confused, I decided to see both commands in action:

Copying multiple file paths into HDFS

# works the way I expected
akawa@hadoop:~$ hadoop fs -put cites.txt pets.txt put-dir
akawa@hadoop:~$ hadoop fs -ls put-dir
Found 2 items
-rw-r--r--   3 akawa supergroup         41 2012-09-08 08:28 put-dir/cites.txt
-rw-r--r--   3 akawa supergroup         41 2012-09-08 08:28 put-dir/pets.txt
 
# works the way hadoop fs -usage copyFromLocal says
akawa@hadoop:~$ hadoop fs -copyFromLocal cites.txt pets.txt copyfromlocal-dir
akawa@hadoop:~$ hadoop fs -ls copyfromlocal-dir
Found 2 items
-rw-r--r--   3 akawa supergroup         41 2012-09-08 08:28 copyfromlocal-dir/cites.txt
-rw-r--r--   3 akawa supergroup         41 2012-09-08 08:28 copyfromlocal-dir/pets.txt

Moving data from stdin directly into HDFS

The following command can be very useful if you have an archive and want to move the content of a particular file from it into HDFS without extracting it to local disk first, which saves some disk space.

# works the way I expected
akawa@hadoop:~$ tar zxvf numbers.tar.gz numbers/1.txt -O | hadoop fs -put - put-dir/1.txt
numbers/1.txt
akawa@hadoop:~$ hadoop fs -cat put-dir/1.txt
adam    3
natalia 1
tofi    2
 
# it surprisingly works (in contrast to what one may deduce from the HDFS Shell Guide)
akawa@hadoop:~$ tar zxvf numbers.tar.gz numbers/1.txt -O | hadoop fs -copyFromLocal - copyfromlocal-dir/1.txt
numbers/1.txt
akawa@hadoop:~$ hadoop fs -cat copyfromlocal-dir/1.txt
adam    3
natalia 1
tofi    2

It looks like there is no practical difference between these commands. Perhaps there are some subtle implementation details?
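To make the observed behaviour explicit, here is a minimal sketch of the argument shapes that both commands accepted in the two experiments above: every argument except the last is a source, and a single "-" source means "read from stdin". The class CopyArgs and its methods are hypothetical names made up for this illustration. This is not Hadoop code, and it does not copy anything.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, NOT Hadoop code: it only models the argument
// shapes seen in the two experiments above.
class CopyArgs {
    final List<String> sources;   // local paths, or a single "-" for stdin
    final String destination;     // last argument on the command line

    CopyArgs(List<String> sources, String destination) {
        this.sources = sources;
        this.destination = destination;
    }

    // True when the only source is "-", i.e. the data comes from stdin.
    boolean readsFromStdin() {
        return sources.size() == 1 && sources.get(0).equals("-");
    }

    // Splits "<localsrc> ... <dst>": everything but the last argument is a source.
    static CopyArgs parse(String... args) {
        if (args.length < 2) {
            throw new IllegalArgumentException("expected <localsrc> ... <dst>");
        }
        List<String> all = Arrays.asList(args);
        return new CopyArgs(all.subList(0, all.size() - 1), all.get(all.size() - 1));
    }

    public static void main(String[] argv) {
        CopyArgs many = parse("cites.txt", "pets.txt", "put-dir");
        CopyArgs piped = parse("-", "put-dir/1.txt");
        System.out.println(many.sources + " -> " + many.destination);
        System.out.println("stdin? " + piped.readsFromStdin());
    }
}
```

Whichever of the two commands is typed, the same parsing rule describes what I saw, which is one more hint that they share an implementation.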

Looking at the source code

Why not look at the source code to verify it? The class CopyCommands.java from Apache Hadoop trunk contains the implementation of various commands for copying files. Here is a small snippet:

class CopyCommands {  
  public static void registerCommands(CommandFactory factory) {
    factory.addClass(Merge.class, "-getmerge");
    factory.addClass(Cp.class, "-cp");
    factory.addClass(CopyFromLocal.class, "-copyFromLocal");
    factory.addClass(CopyToLocal.class, "-copyToLocal");
    factory.addClass(Get.class, "-get");
    factory.addClass(Put.class, "-put");
  }
 
  public static class Put extends CommandWithDestination {
    public static final String NAME = "put";
    public static final String USAGE = "<localsrc> ... <dst>";
    public static final String DESCRIPTION =
      "Copy files from the local file system\n" +
      "into fs.";
    ...
  }
 
  public static class CopyFromLocal extends Put {
    public static final String NAME = "copyFromLocal";
    public static final String USAGE = Put.USAGE;
    public static final String DESCRIPTION = "Identical to the -put command.";
  }
}

CopyFromLocal extends Put, but it does not add or override any functionality; it only changes the name and the description. This simply means that the two commands are equivalent (the same applies to the Get and CopyToLocal commands).
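If the consequence of "extends but overrides nothing" is not obvious, the toy model below may help. It is only a sketch: ToyPut, ToyCopyFromLocal and CommandAliasDemo are invented names that mimic the shape of the snippet above, and the map-based registry is an assumption standing in for the real CommandFactory. None of this is the actual Hadoop logic.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Toy model only: these classes mimic the shape of the snippet above.
// They are NOT the Hadoop classes and share none of their real logic.
class ToyPut {
    // Stand-in for the real copy logic; it only describes what would happen.
    String run(List<String> args) {
        List<String> sources = args.subList(0, args.size() - 1);
        String destination = args.get(args.size() - 1);
        return "copy " + sources + " -> " + destination;
    }

    String name() { return "put"; }
}

// Overrides only the name, just as CopyFromLocal changes only metadata.
class ToyCopyFromLocal extends ToyPut {
    @Override
    String name() { return "copyFromLocal"; }
}

class CommandAliasDemo {
    // Hypothetical registry, loosely mirroring registerCommands(factory).
    static final Map<String, Supplier<ToyPut>> REGISTRY = new LinkedHashMap<>();
    static {
        REGISTRY.put("-put", ToyPut::new);
        REGISTRY.put("-copyFromLocal", ToyCopyFromLocal::new);
    }

    // Looks up the command by its flag and runs the inherited logic.
    static String dispatch(String flag, List<String> args) {
        return REGISTRY.get(flag).get().run(args);
    }

    public static void main(String[] argv) {
        List<String> args = Arrays.asList("cites.txt", "pets.txt", "some-dir");
        System.out.println(dispatch("-put", args));
        System.out.println(dispatch("-copyFromLocal", args));
    }
}
```

Both flags end up in the same run method, so the two lines printed by main are identical. Only name() differs between the two classes.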

So why do I say that Put is better than CopyFromLocal? Simply because Put is significantly shorter to type in the console ;)

Comments (1)

Great post! I always wanted to know that. Really liked your blog :)