
Sqoop 2: exporting data from MySQL to HDFS

sqoop-shell

Start sqoop2-shell

jjzhu:bin didi$ sqoop2-shell
Setting conf dir: /opt/sqoop-1.99.7/bin/../conf
Sqoop home directory: /opt/sqoop-1.99.7
Sqoop Shell: Type 'help' or '\h' for help.
sqoop:000> set server --host localhost --port 12000 --webapp sqoop
Server is set successfully
sqoop:000> show version --all
client version:
  Sqoop 1.99.7 source revision 435d5e61b922a32d7bce567fe5fb1a9c0d9b1bbb
  Compiled by abefine on Tue Jul 19 16:08:27 PDT 2016
0    [main] WARN  org.apache.hadoop.util.NativeCodeLoader  - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
server version:
  Sqoop 1.99.7 source revision 435d5e61b922a32d7bce567fe5fb1a9c0d9b1bbb
  Compiled by abefine on Tue Jul 19 16:08:27 PDT 2016
API versions:
  [v1]
sqoop:000>
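Once the setup stabilises, the same commands can also be collected in a script file and replayed in one go (a sketch; the script path is illustrative and this assumes the shell's script-file invocation):

jjzhu:bin didi$ sqoop2-shell /path/to/setup.sqoop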

Configure the Sqoop server

sqoop:000> set server --host localhost --port 12000 --webapp sqoop
Server is set successfully
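To confirm which server the shell is now pointing at, the current connection settings can be printed back (assuming the shell's show server command):

sqoop:000> show server --all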

Check that the server connection works

sqoop:000> show version --all
client version:
  Sqoop 1.99.7 source revision 435d5e61b922a32d7bce567fe5fb1a9c0d9b1bbb
  Compiled by abefine on Tue Jul 19 16:08:27 PDT 2016
0    [main] WARN  org.apache.hadoop.util.NativeCodeLoader  - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
server version:
  Sqoop 1.99.7 source revision 435d5e61b922a32d7bce567fe5fb1a9c0d9b1bbb
  Compiled by abefine on Tue Jul 19 16:08:27 PDT 2016
API versions:
  [v1]
sqoop:000>

Create links

List the connectors available on the Sqoop server

sqoop:000> show connector
+------------------------+---------+------------------------------------------------------------+----------------------+
|          Name          | Version |                           Class                            | Supported Directions |
+------------------------+---------+------------------------------------------------------------+----------------------+
| generic-jdbc-connector | 1.99.7  | org.apache.sqoop.connector.jdbc.GenericJdbcConnector       | FROM/TO              |
| kite-connector         | 1.99.7  | org.apache.sqoop.connector.kite.KiteConnector              | FROM/TO              |
| oracle-jdbc-connector  | 1.99.7  | org.apache.sqoop.connector.jdbc.oracle.OracleJdbcConnector | FROM/TO              |
| ftp-connector          | 1.99.7  | org.apache.sqoop.connector.ftp.FtpConnector                | TO                   |
| hdfs-connector         | 1.99.7  | org.apache.sqoop.connector.hdfs.HdfsConnector              | FROM/TO              |
| kafka-connector        | 1.99.7  | org.apache.sqoop.connector.kafka.KafkaConnector            | TO                   |
| sftp-connector         | 1.99.7  | org.apache.sqoop.connector.sftp.SftpConnector              | TO                   |
+------------------------+---------+------------------------------------------------------------+----------------------+
sqoop:000>
  • generic-jdbc-connector
    a connector built on Java JDBC; it can act as either the source or the target of a transfer
  • hdfs-connector
    a connector that uses HDFS as the source or the target

Create a generic-jdbc-connector link with the following command:

sqoop:002> create link -c generic-jdbc-connector
Creating link for connector with name generic-jdbc-connector
Please fill following values to create new link object
Name: mysql_weibouser_link

Database connection
Driver class: com.mysql.jdbc.Driver
Connection String: jdbc:mysql://127.0.0.1:3306/spider
Username: root
Password: ****
Fetch Size:
Connection Properties:
There are currently 0 values in the map:
entry# protocol=tcp
There are currently 1 values in the map:
protocol = tcp
entry#

SQL Dialect
Identifier enclose:
New link was successfully created with validation status OK and name mysql_weibouser_link

**Note: at the Identifier enclose prompt, do not just press Enter; type a single space. Otherwise Sqoop wraps table names in double quotes when querying MySQL, and the queries fail.**

Create the HDFS link

sqoop:002> create link -c hdfs-connector
Creating link for connector with name hdfs-connector
Please fill following values to create new link object
Name: hdfs_weibouser_link

HDFS cluster
URI: hdfs://localhost:9000
Conf directory: /opt/hadoop-2.7.3/etc/hadoop
Additional configs::
There are currently 0 values in the map:
entry#
New link was successfully created with validation status OK and name hdfs_weibouser_link
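The URI should match fs.defaultFS in core-site.xml. A quick sanity check that the NameNode answers at that address:

jjzhu:~ didi$ hdfs dfs -ls hdfs://localhost:9000/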

List the links

sqoop:002> show link
+----------------------+------------------------+---------+
|         Name         |     Connector Name     | Enabled |
+----------------------+------------------------+---------+
| mysql_weibouser      | generic-jdbc-connector | true    |
| mysql_weibouser_link | generic-jdbc-connector | true    |
| hdfs_link            | hdfs-connector         | true    |
| hdfs_link2           | hdfs-connector         | true    |
| hdfs_weibouser_link  | hdfs-connector         | true    |
+----------------------+------------------------+---------+

Create a job

sqoop:002> create job -f "mysql_weibouser_link" -t "hdfs_weibouser_link"
Creating job for links with from name mysql_weibouser_link and to name hdfs_weibouser_link
Please fill following values to create new job object
Name: job_weibouser

Database source
Schema name: spider
Table name: spiders_weibouser
SQL statement:
Column names:
There are currently 0 values in the list:
element#
Partition column:
Partition column nullable:
Boundary query:

Incremental read
Check column:
Last value:

Target configuration
Override null value:
Null value:
File format:
  0 : TEXT_FILE
  1 : SEQUENCE_FILE
  2 : PARQUET_FILE
Choose: 0
Compression codec:
  0 : NONE
  1 : DEFAULT
  2 : DEFLATE
  3 : GZIP
  4 : BZIP2
  5 : LZO
  6 : LZ4
  7 : SNAPPY
  8 : CUSTOM
Choose: 0
Custom codec:
Output directory: hdfs://localhost:9000/usr/jjzhu/spider/spiders_weibouser
Append mode:

Throttling resources
Extractors: 2
Loaders: 2

Classpath configuration
Extra mapper jars:
There are currently 0 values in the list:
element#
New job was successfully created with validation status OK and name job_weibouser

What each field means:

The fields, in order:

  • Name: an identifier for the job; pick any name.
  • Schema Name: the database/schema name. In MySQL a schema is essentially the same thing as a database.
  • Table Name: the table to export.
  • SQL Statement: a SQL query; the documentation says it needs a $CONDITIONS-style placeholder (apparently a condition clause), though I never managed to create a job that way. The table-based job above leaves it empty.
  • After these, an element# prompt asks for extra map/list values; just press Enter to skip.
  • Partition column, Partition column nullable, Boundary query, Check column, Last value: all left empty here.
  • After Last value come the target-side fields:
  • Override null value / Null value: roughly, what to write in place of NULL values.
  • File format: the format of the data files in HDFS; TEXT_FILE here, i.e. the simplest plain-text format.
  • Compression codec: the compression algorithm for the exported files; NONE here. CUSTOM lets you plug in your own codec by implementing the corresponding Java interface.
  • Custom codec: the custom codec class; since NONE was chosen, just press Enter.
  • Output directory: the HDFS path to write to. It is best to point it at an existing path, ideally an existing but empty directory; that seems to be what works.
  • Append mode: whether new data should be appended when export files already exist.
  • Extractors / Loaders: the number of extract and load tasks; 2 and 2 here.
  • Finally another element# prompt asks for extra mapper jars; leave it empty and press Enter.

If the shell reports the job was successfully created, you are done.
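For reference, if the job should export the result of a query rather than a whole table, a minimal sketch of what the SQL statement prompt could look like (assuming the ${CONDITIONS} placeholder that Sqoop 2 substitutes when splitting the query; the column names id and screen_name are illustrative, and a Partition column is then required):

SQL statement: SELECT id, screen_name FROM spiders_weibouser WHERE ${CONDITIONS}
Partition column: id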

View the created jobs

sqoop:002> show job
+----+---------------+-----------------------------------------------+--------------------------------------+---------+
| Id |     Name      |                From Connector                 |             To Connector             | Enabled |
+----+---------------+-----------------------------------------------+--------------------------------------+---------+
| 1  | spider_job    | mysql_weibouser (generic-jdbc-connector)      | hdfs_link (hdfs-connector)           | true    |
| 2  | job_weibouser | mysql_weibouser_link (generic-jdbc-connector) | hdfs_weibouser_link (hdfs-connector) | true    |
+----+---------------+-----------------------------------------------+--------------------------------------+---------+
sqoop:002>

Start the job

sqoop:002> start job -n job_weibouser
Submission details
Job Name: job_weibouser
Server URL: http://localhost:12000/sqoop/
Created by: didi
Creation date: 2017-04-11 14:37:46 CST
Lastly updated by: didi
External ID: job_1491888730134_0003
    http://jjzhu:8088/proxy/application_1491888730134_0003/
2017-04-11 14:37:46 CST: BOOTING  - Progress is not available

Check the job status

sqoop:002> status job -n job_weibouser
Submission details
Job Name: job_weibouser
Server URL: http://localhost:12000/sqoop/
Created by: didi
Creation date: 2017-04-11 14:37:46 CST
Lastly updated by: didi
External ID: job_1491888730134_0003
    http://jjzhu:8088/proxy/application_1491888730134_0003/
2017-04-11 14:38:41 CST: SUCCEEDED
Counters:
    org.apache.hadoop.mapreduce.FileSystemCounter
        FILE_LARGE_READ_OPS: 0
        FILE_WRITE_OPS: 0
        HDFS_READ_OPS: 2
        HDFS_BYTES_READ: 290
        HDFS_LARGE_READ_OPS: 0
        FILE_READ_OPS: 0
        FILE_BYTES_WRITTEN: 51361466
        FILE_BYTES_READ: 25115854
        HDFS_WRITE_OPS: 2
        HDFS_BYTES_WRITTEN: 24652721
    org.apache.hadoop.mapreduce.lib.output.FileOutputFormatCounter
        BYTES_WRITTEN: 0
    org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter
        BYTES_READ: 0
    org.apache.hadoop.mapreduce.JobCounter
        TOTAL_LAUNCHED_MAPS: 2
        VCORES_MILLIS_REDUCES: 20225
        MB_MILLIS_MAPS: 27120640
        TOTAL_LAUNCHED_REDUCES: 2
        SLOTS_MILLIS_REDUCES: 20225
        VCORES_MILLIS_MAPS: 26485
        MB_MILLIS_REDUCES: 20710400
        SLOTS_MILLIS_MAPS: 26485
        MILLIS_REDUCES: 20225
        OTHER_LOCAL_MAPS: 2
        MILLIS_MAPS: 26485
    org.apache.sqoop.submission.counter.SqoopCounters
        ROWS_READ: 109408
        ROWS_WRITTEN: 109408
    org.apache.hadoop.mapreduce.TaskCounter
        MAP_OUTPUT_MATERIALIZED_BYTES: 25115866
        REDUCE_INPUT_RECORDS: 109408
        SPILLED_RECORDS: 218816
        MERGED_MAP_OUTPUTS: 4
        VIRTUAL_MEMORY_BYTES: 0
        MAP_INPUT_RECORDS: 0
        SPLIT_RAW_BYTES: 290
        FAILED_SHUFFLE: 0
        MAP_OUTPUT_BYTES: 24762129
        REDUCE_SHUFFLE_BYTES: 25115866
        PHYSICAL_MEMORY_BYTES: 0
        GC_TIME_MILLIS: 1648
        REDUCE_INPUT_GROUPS: 109408
        COMBINE_OUTPUT_RECORDS: 0
        SHUFFLED_MAPS: 4
        REDUCE_OUTPUT_RECORDS: 109408
        MAP_OUTPUT_RECORDS: 109408
        COMBINE_INPUT_RECORDS: 0
        CPU_MILLISECONDS: 0
        COMMITTED_HEAP_BYTES: 1951399936
    Shuffle Errors
        CONNECTION: 0
        WRONG_LENGTH: 0
        BAD_ID: 0
        WRONG_MAP: 0
        WRONG_REDUCE: 0
        IO_ERROR: 0
Job executed successfully

Check the HDFS path for the output files

jjzhu:~ didi$ hdfs dfs -ls /usr/jjzhu/spider
Found 4 items
drwxr-xr-x   - didi supergroup          0 2017-04-11 14:38 /usr/jjzhu/spider/spiders_weibouser
drwxr-xr-x   - 777  supergroup          0 2017-04-11 10:58 /usr/jjzhu/spider/weibouser
drwxr-xr-x   - 777  supergroup          0 2017-04-11 13:33 /usr/jjzhu/spider/weobouser
drwxr-xr-x   - didi supergroup          0 2017-04-11 13:39 /usr/jjzhu/spider/weobouser2
jjzhu:~ didi$ hdfs dfs -ls /usr/jjzhu/spider/spiders_weibouser
Found 2 items
-rw-r--r--   1 didi supergroup   12262783 2017-04-11 14:38 /usr/jjzhu/spider/spiders_weibouser/33b56441-b638-48cc-8d0d-37a808f25653.txt
-rw-r--r--   1 didi supergroup   12389938 2017-04-11 14:38 /usr/jjzhu/spider/spiders_weibouser/73b20d50-de72-4aea-8c8c-d97cdc48e667.txt
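To spot-check the exported rows, print the first few lines of one of the output files (the file name comes from the listing above):

jjzhu:~ didi$ hdfs dfs -cat /usr/jjzhu/spider/spiders_weibouser/33b56441-b638-48cc-8d0d-37a808f25653.txt | head -n 5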

You can also browse the files through the HDFS web UI.

Troubleshooting notes

When tracking down problems, always check the logs: Sqoop writes sqoop.log to the path configured in sqoop.properties.

  1. When configuring the HDFS output path, make sure the user running the job has write permission there.
    Ownership of the path can be changed with, for example:
hdfs dfs -chown -R 777 HDFS_PATH
  2. RemoteException: User: xxx is not allowed to impersonate xxx

This is a proxy-user problem. Make sure core-site.xml allows the user xxx to proxy other users:

<property>
  <name>hadoop.proxyuser.XXX.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.XXX.groups</name>
  <value>*</value>
</property>
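For the new proxy-user settings to take effect, restart HDFS or refresh the configuration in place (assuming the command is run as an HDFS administrator):

hdfs dfsadmin -refreshSuperUserGroupsConfiguration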
  3. Turn on verbose exception output in the Sqoop shell:
sqoop:000> set option --name verbose --value true
  4. status job -n jobname fails with: java.net.ConnectException: Call From xxx.xxx.xxx.xxx to 0.0.0.0:10020 failed on connection exception: java.net.ConnectException: Connection refused

This usually appears on Hadoop 2.x: the job status lookup needs the MapReduce JobHistory server, which defaults to 0.0.0.0:10020 when it has not been configured. Set it explicitly in mapred-site.xml:

<property>
  <name>mapreduce.jobhistory.address</name>
  <value>localhost:10020</value>
</property>
<property>
  <name>mapreduce.jobhistory.webapp.address</name>
  <value>localhost:19888</value>
</property>

Then start the JobHistory server:

$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver
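A quick check that the daemon actually came up (jps ships with the JDK):

jjzhu:~ didi$ jps | grep JobHistoryServer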
  5. Error: Java heap space / Container killed by the ApplicationMaster. Container killed on request. Exit code is 143

Make sure the MapReduce tasks get enough Java heap; it can be configured in mapred-site.xml:

<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx1024m</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx1024m</value>
</property>
  6. Exception has occurred during processing command
    Exception: org.apache.sqoop.common.SqoopException Message: GENERIC_JDBC_CONNECTOR_0016:Can't fetch schema -

Fix: when creating the MySQL link, Identifier enclose specifies the character used to quote SQL identifiers. Some databases quote identifiers with double quotes, as in select * from "table_name", but that syntax fails against MySQL. The property defaults to a double quote, so override it by entering a single space.
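If the link was already created with the wrong value, it can be edited instead of recreated (a sketch, assuming the name-based update command of the 1.99.7 shell; step through the prompts again and type a single space at Identifier enclose):

sqoop:000> update link -n mysql_weibouser_link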

