4. Synchronize data and configuration, and start the remaining nodes
Installing Tomcat and the Solr server on the other two nodes only requires copying the corresponding directories:
- [hadoop@master ~]$ scp -r servers/ hadoop@slave1:~/
- [hadoop@master ~]$ scp -r servers/ hadoop@slave4:~/
-
- [hadoop@master ~]$ scp -r applications/solr/cloud hadoop@slave1:~/applications/solr/
- [hadoop@master ~]$ scp -r applications/solr/cloud hadoop@slave4:~/applications/solr/
-
- [hadoop@slave1 ~]$ mkdir -p applications/storage/cloud/data/
- [hadoop@slave4 ~]$ mkdir -p applications/storage/cloud/data/
Start the Solr servers on the other nodes:
- [hadoop@slave1 ~]$ cd servers/apache-tomcat-7.0.42
- [hadoop@slave1 apache-tomcat-7.0.42]$ bin/catalina.sh start
-
- [hadoop@slave4 ~]$ cd servers/apache-tomcat-7.0.42
- [hadoop@slave4 apache-tomcat-7.0.42]$ bin/catalina.sh start
Check the data registered in the ZooKeeper ensemble:
- [zk: master:2188(CONNECTED) 3] ls /live_nodes
- [10.95.3.65:8888_solr-cloud, 10.95.3.61:8888_solr-cloud, 10.95.3.62:8888_solr-cloud]
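The same check can be done programmatically. Below is a minimal sketch (the class name is mine; it assumes the ZooKeeper ensemble is reachable at master:2188, as configured earlier) that lists the registered nodes with the plain ZooKeeper client:
- import java.util.List;
- 
- import org.apache.zookeeper.ZooKeeper;
- 
- public class LiveNodesCheck {
- 
- public static void main(String[] args) throws Exception {
- // Connect to the ZooKeeper ensemble that coordinates the SolrCloud cluster.
- ZooKeeper zk = new ZooKeeper("master:2188", 15000, null);
- // Each running Solr node registers an ephemeral child under /live_nodes.
- List<String> liveNodes = zk.getChildren("/live_nodes", false);
- for (String node : liveNodes) {
- System.out.println(node); // e.g. 10.95.3.61:8888_solr-cloud
- }
- zk.close();
- }
- }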
5. Create the Collection, Shards, and Replicas
Create a Collection directly through the REST interface, as shown below:
- [hadoop@master ~]$ curl 'http://master:8888/solr-cloud/admin/collections?action=CREATE&name=mycollection&numShards=3&replicationFactor=1'
- <?xml version="1.0" encoding="UTF-8"?>
- <response>
- <lst name="responseHeader"><int name="status">0</int><int name="QTime">4103</int></lst>
- <lst name="success">
- <lst><lst name="responseHeader"><int name="status">0</int><int name="QTime">3367</int></lst><str name="core">mycollection_shard2_replica1</str><str name="saved">/home/hadoop/applications/solr/cloud/multicore/solr.xml</str></lst>
- <lst><lst name="responseHeader"><int name="status">0</int><int name="QTime">3280</int></lst><str name="core">mycollection_shard1_replica1</str><str name="saved">/home/hadoop/applications/solr/cloud/multicore/solr.xml</str></lst>
- <lst><lst name="responseHeader"><int name="status">0</int><int name="QTime">3690</int></lst><str name="core">mycollection_shard3_replica1</str><str name="saved">/home/hadoop/applications/solr/cloud/multicore/solr.xml</str></lst>
- </lst>
- </response>
The parameters in the request above have the following meanings:
- name: the name of the Collection to create
- numShards: the number of shards to split the index into
- replicationFactor: the number of copies of each shard (a factor of 1 means each shard exists only as its leader replica)
If the operation completes without errors, a Collection named mycollection has been created, with one shard on each node. At this point you can also check its state in ZooKeeper:
- [zk: master:2188(CONNECTED) 5] ls /collections
- [mycollection, collection1]
- [zk: master:2188(CONNECTED) 6] ls /collections/mycollection
- [leader_elect, leaders]
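The shard assignment itself is recorded by Solr 4.x under /clusterstate.json in ZooKeeper. A similar sketch (again hypothetical; same connection parameters as above) dumps it:
- import org.apache.zookeeper.ZooKeeper;
- 
- public class ClusterStateDump {
- 
- public static void main(String[] args) throws Exception {
- ZooKeeper zk = new ZooKeeper("master:2188", 15000, null);
- // /clusterstate.json records which node hosts which shard of each collection.
- byte[] data = zk.getData("/clusterstate.json", false, null);
- System.out.println(new String(data, "UTF-8"));
- zk.close();
- }
- }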
From the cluster state, the shards map onto the nodes as follows:
- shard3 10.95.3.61 master
- shard1 10.95.3.62 slave1
- shard2 10.95.3.65 slave4
In fact, on the master node you can see that the content of the Solr configuration file has already changed, as shown below:
- [hadoop@master ~]$ cat applications/solr/cloud/multicore/solr.xml
- <?xml version="1.0" encoding="UTF-8" ?>
- <solr persistent="true">
- <cores defaultCoreName="collection1" host="${host:}" adminPath="/admin/cores" zkClientTimeout="${zkClientTimeout:15000}" hostPort="8888" hostContext="${hostContext:solr-cloud}">
- <core loadOnStartup="true" shard="shard3" instanceDir="mycollection_shard3_replica1/" transient="false" name="mycollection_shard3_replica1" collection="mycollection"/>
- </cores>
- </solr>
Next, we create additional replicas of the initial shards.
shard1 already lives on slave1; we now replicate it onto master and slave4. A CoreAdmin CREATE request builds the new core on whichever node receives it, so the first two commands below are sent to master and the last one to slave4:
- [hadoop@master ~]$ curl 'http://master:8888/solr-cloud/admin/cores?action=CREATE&collection=mycollection&name=mycollection_shard1_replica_2&shard=shard1'
- <?xml version="1.0" encoding="UTF-8"?>
- <response>
- <lst name="responseHeader"><int name="status">0</int><int name="QTime">1485</int></lst><str name="core">mycollection_shard1_replica_2</str><str name="saved">/home/hadoop/applications/solr/cloud/multicore/solr.xml</str>
- </response>
-
- [hadoop@master ~]$ curl 'http://master:8888/solr-cloud/admin/cores?action=CREATE&collection=mycollection&name=mycollection_shard1_replica_3&shard=shard1'
- <?xml version="1.0" encoding="UTF-8"?>
- <response>
- <lst name="responseHeader"><int name="status">0</int><int name="QTime">2543</int></lst><str name="core">mycollection_shard1_replica_3</str><str name="saved">/home/hadoop/applications/solr/cloud/multicore/solr.xml</str>
- </response>
-
- [hadoop@slave4 ~]$ curl 'http://slave4:8888/solr-cloud/admin/cores?action=CREATE&collection=mycollection&name=mycollection_shard1_replica_4&shard=shard1'
- <?xml version="1.0" encoding="UTF-8"?>
- <response>
- <lst name="responseHeader"><int name="status">0</int><int name="QTime">2405</int></lst><str name="core">mycollection_shard1_replica_4</str><str name="saved">/home/hadoop/applications/solr/cloud/multicore/solr.xml</str>
- </response>
The end result: shard1, whose first replica is on slave1, now has two replicas on the master node, named mycollection_shard1_replica_2 and mycollection_shard1_replica_3, plus one replica on slave4, named mycollection_shard1_replica_4.
You can also see the directory changes on master and slave4:
- [hadoop@master ~]$ ll applications/solr/cloud/multicore/
- total 24
- drwxrwxr-x. 4 hadoop hadoop 4096 Aug 1 09:58 collection1
- drwxrwxr-x. 3 hadoop hadoop 4096 Aug 1 15:41 mycollection_shard1_replica_2
- drwxrwxr-x. 3 hadoop hadoop 4096 Aug 1 15:42 mycollection_shard1_replica_3
- drwxrwxr-x. 3 hadoop hadoop 4096 Aug 1 15:23 mycollection_shard3_replica1
- -rw-rw-r--. 1 hadoop hadoop 784 Aug 1 15:42 solr.xml
- -rw-rw-r--. 1 hadoop hadoop 1004 Aug 1 10:02 zoo.cfg
-
- [hadoop@slave4 ~]$ ll applications/solr/cloud/multicore/
- total 20
- drwxrwxr-x. 4 hadoop hadoop 4096 Aug 1 14:53 collection1
- drwxrwxr-x. 3 hadoop hadoop 4096 Aug 1 15:44 mycollection_shard1_replica_4
- drwxrwxr-x. 3 hadoop hadoop 4096 Aug 1 15:23 mycollection_shard2_replica1
- -rw-rw-r--. 1 hadoop hadoop 610 Aug 1 15:44 solr.xml
- -rw-rw-r--. 1 hadoop hadoop 1004 Aug 1 15:08 zoo.cfg
Here, mycollection_shard3_replica1 and mycollection_shard2_replica1 are the cores generated automatically when the Collection was created, that is, the first replica of each of those shards.
The web admin UI gives a more intuitive view of shard1's replica layout (screenshot omitted).
Looking at the master node once more, the Solr configuration file has changed again, as shown below:
- [hadoop@master ~]$ cat applications/solr/cloud/multicore/solr.xml
- <?xml version="1.0" encoding="UTF-8" ?>
- <solr persistent="true">
- <cores defaultCoreName="collection1" host="${host:}" adminPath="/admin/cores" zkClientTimeout="${zkClientTimeout:15000}" hostPort="8888" hostContext="${hostContext:solr-cloud}">
- <core loadOnStartup="true" shard="shard3" instanceDir="mycollection_shard3_replica1/" transient="false" name="mycollection_shard3_replica1" collection="mycollection"/>
- <core loadOnStartup="true" shard="shard1" instanceDir="mycollection_shard1_replica_2/" transient="false" name="mycollection_shard1_replica_2" collection="mycollection"/>
- <core loadOnStartup="true" shard="shard1" instanceDir="mycollection_shard1_replica_3/" transient="false" name="mycollection_shard1_replica_3" collection="mycollection"/>
- </cores>
- </solr>
At this point, we have a complete SolrCloud cluster configured across the three physical nodes.
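Before indexing anything, it is worth a quick smoke test that every node answers collection-level requests. The sketch below is a hypothetical helper (the class name and URL layout are mine, derived from the hostContext and port configured earlier); SolrCloud routes a request for mycollection sent to any node to the right shards:
- import java.net.HttpURLConnection;
- import java.net.URL;
- 
- public class PingNodes {
- 
- public static void main(String[] args) throws Exception {
- String[] nodes = {"master", "slave1", "slave4"};
- for (String node : nodes) {
- // An empty match-all query; HTTP 200 means the node can serve the collection.
- URL url = new URL("http://" + node + ":8888/solr-cloud/mycollection/select?q=*:*&rows=0");
- HttpURLConnection conn = (HttpURLConnection) url.openConnection();
- System.out.println(node + " -> HTTP " + conn.getResponseCode());
- conn.disconnect();
- }
- }
- }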
Indexing Data
Based on the schema.xml defined earlier, we built a generator for a synthetic dataset; the code is shown below:
- package org.shirdrn.solr.data;
-
- import java.io.BufferedWriter;
- import java.io.FileOutputStream;
- import java.io.IOException;
- import java.io.OutputStreamWriter;
- import java.text.DateFormat;
- import java.text.SimpleDateFormat;
- import java.util.Date;
- import java.util.Random;
-
- public class BuildingSampleGenerator {
-
- private final DateFormat df = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'");
- private Random random = new Random();
-
- static String[] areas = {
- "北京", "上海", "深圳", "广州", "天津", "重庆","成都",
- "银川", "沈阳", "大连", "吉林", "郑州", "徐州", "兰州",
- "东京", "纽约", "贵州", "长春", "大连", "武汉","南京",
- "海口", "太原", "济南", "日照", "菏泽", "包头", "松原"
- };
-
- long pre = 0L;
- long current = 0L;
- public synchronized long genId() {
- current = System.nanoTime();
- // Spin until the nanosecond clock moves past the last issued value,
- // so every ID handed out by this process is unique.
- while (current == pre) {
- try {
- Thread.sleep(0, 1);
- } catch (InterruptedException e) {
- e.printStackTrace();
- }
- current = System.nanoTime();
- }
- pre = current;
- return current;
- }
-
- public String genArea() {
- return areas[random.nextInt(areas.length)];
- }
-
- private int maxLatitude = 90;
- private int maxLongitude = 180;
-
- public Coordinate genCoordinate() {
- int beforeDot = random.nextInt(maxLatitude);
- double afterDot = random.nextDouble();
- double lat = beforeDot + afterDot;
-
- beforeDot = random.nextInt(maxLongitude);
- afterDot = random.nextDouble();
- double lon = beforeDot + afterDot;
-
- return new Coordinate(lat, lon);
- }
-
- private Random random1 = new Random(System.currentTimeMillis());
- private Random random2 = new Random(2 * System.currentTimeMillis());
- public int genFloors() {
- return 1 + random1.nextInt(50) + random2.nextInt(50);
- }
-
- public class Coordinate {
-
- double latitude;
- double longitude;
-
- public Coordinate() {
- super();
- }
-
- public Coordinate(double latitude, double longitude) {
- super();
- this.latitude = latitude;
- this.longitude = longitude;
- }
-
- public double getLatitude() {
- return latitude;
- }
-
- public double getLongitude() {
- return longitude;
- }
- }
-
-
- static int[] signs = {-1, 1};
- public int genTemperature() {
- return signs[random.nextInt(2)] * random.nextInt(81);
- }
-
- static String[] codes = {"A", "B", "C", "D", "E", "F", "G", "H", "I",
- "J", "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U", "V",
- "W", "X", "Y", "Z"};
- public String genCode() {
- return codes[random.nextInt(codes.length)];
- }
-
- static int[] types = {0, 1, 2, 3};
- public int genBuildingType() {
- return types[random.nextInt(types.length)];
- }
-
- static String[] categories = {
- "办公建筑", "教育建筑", "商业建筑", "文教建筑", "医卫建筑",
- "住宅", "宿舍", "公寓", "工业建筑"};
- public String genBuildingCategory() {
- return categories[random.nextInt(categories.length)];
- }
-
- public void generate(String file, int count) throws IOException {
- BufferedWriter w = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(file), "UTF-8"));
- w.write("id,area,building_type,category,temperature,code,latitude,longitude,when");
- w.newLine();
-
-
- for(int i=0; i<count; i++) {
- String when = df.format(new Date());
-
- StringBuffer sb = new StringBuffer();
- sb.append(genId()).append(",")
- .append("\"").append(genArea()).append("\"").append(",")
- .append(genBuildingType()).append(",")
- .append("\"").append(genBuildingCategory()).append("\"").append(",")
- .append(genTemperature()).append(",")
- .append(genCode()).append(",");
- Coordinate coord = genCoordinate();
- sb.append(coord.latitude).append(",")
- .append(coord.longitude).append(",")
- .append("\"").append(when).append("\"");
- w.write(sb.toString());
- w.newLine();
- }
- w.close();
- System.out.println("Finished: file=" + file);
- }
-
- public static void main(String[] args) throws Exception {
- BuildingSampleGenerator gen = new BuildingSampleGenerator();
- String file = "E:\\Develop\\eclipse-jee-kepler\\workspace\\solr-data\\building_files";
- for(int i=0; i<=9; i++) {
- String f = file + "_100w_0" + i + ".csv";
- gen.generate(f, 1000000); // 100w = 1,000,000 rows per file, matching the file listing below
- }
- }
-
- }
The generated files are as follows:
- [hadoop@master solr-data]$ ll building_files_100w*
- -rw-rw-r--. 1 hadoop hadoop 109025853 Jul 26 14:05 building_files_100w_00.csv
- -rw-rw-r--. 1 hadoop hadoop 108015504 Jul 26 10:53 building_files_100w_01.csv
- -rw-rw-r--. 1 hadoop hadoop 108022184 Jul 26 11:00 building_files_100w_02.csv
- -rw-rw-r--. 1 hadoop hadoop 108016854 Jul 26 11:00 building_files_100w_03.csv
- -rw-rw-r--. 1 hadoop hadoop 108021750 Jul 26 11:00 building_files_100w_04.csv
- -rw-rw-r--. 1 hadoop hadoop 108017496 Jul 26 11:00 building_files_100w_05.csv
- -rw-rw-r--. 1 hadoop hadoop 108016193 Jul 26 11:00 building_files_100w_06.csv
- -rw-rw-r--. 1 hadoop hadoop 108023537 Jul 26 11:00 building_files_100w_07.csv
- -rw-rw-r--. 1 hadoop hadoop 108014684 Jul 26 11:00 building_files_100w_08.csv
- -rw-rw-r--. 1 hadoop hadoop 108022044 Jul 26 11:00 building_files_100w_09.csv
The data file format looks like this:
- [hadoop@master solr-data]$ head building_files_100w_00.csv
- id,area,building_type,category,temperature,code,latitude,longitude,when
- 18332617097417,"广州",2,"医卫建筑",61,N,5.160762478343409,62.92919119315037,"2013-07-26T14:05:55.832Z"
- 18332617752331,"成都",1,"教育建筑",10,Q,77.34792453477195,72.59812030045762,"2013-07-26T14:05:55.833Z"
- 18332617815833,"大连",0,"教育建筑",18,T,81.47569061530493,0.2177194388096203,"2013-07-26T14:05:55.833Z"
- 18332617903711,"广州",0,"办公建筑",31,D,51.85825084513671,13.60710950097155,"2013-07-26T14:05:55.833Z"
- 18332617958555,"深圳",3,"商业建筑",5,H,22.181374031472675,119.76001810254823,"2013-07-26T14:05:55.833Z"
- 18332618020454,"济南",3,"公寓",-65,L,84.49607030736806,29.93095171443135,"2013-07-26T14:05:55.834Z"
- 18332618075939,"北京",2,"住宅",-29,J,86.61660177436184,39.20847527640485,"2013-07-26T14:05:55.834Z"
- 18332618130141,"菏泽",0,"医卫建筑",24,J,70.57574551258345,121.21977908377244,"2013-07-26T14:05:55.834Z"
- 18332618184343,"徐州",2,"办公建筑",31,W,0.10129771041097524,153.40533210345387,"2013-07-26T14:05:55.834Z"
Now we index data into the SolrCloud cluster we just built, using a simple client implemented as follows:
- package org.shirdrn.solr.indexing;
-
- import java.io.IOException;
- import java.net.MalformedURLException;
- import java.text.DateFormat;
- import java.text.SimpleDateFormat;
- import java.util.Date;
-
- import org.apache.solr.client.solrj.SolrServerException;
- import org.apache.solr.client.solrj.impl.CloudSolrServer;
- import org.apache.solr.common.SolrInputDocument;
- import org.shirdrn.solr.data.BuildingSampleGenerator;
- import org.shirdrn.solr.data.BuildingSampleGenerator.Coordinate;
-
- public class CloudSolrClient {
-
- private CloudSolrServer cloudSolrServer;
-
- public synchronized void open(final String zkHost, final String defaultCollection,
- int zkClientTimeout, final int zkConnectTimeout) {
- if (cloudSolrServer == null) {
- try {
- cloudSolrServer = new CloudSolrServer(zkHost);
- cloudSolrServer.setDefaultCollection(defaultCollection);
- cloudSolrServer.setZkClientTimeout(zkClientTimeout);
- cloudSolrServer.setZkConnectTimeout(zkConnectTimeout);
- } catch (MalformedURLException e) {
- System.out.println("The zkHost address is malformed! It must be of the form host:port");
- e.printStackTrace();
- } catch (Exception e) {
- e.printStackTrace();
- }
- }
- }
-
- public void addDoc(long id, String area, int buildingType, String category,
- int temperature, String code, double latitude, double longitude, String when) {
- try {
- SolrInputDocument doc = new SolrInputDocument();
- doc.addField("id", id);
- doc.addField("area", area);
- doc.addField("building_type", buildingType);
- doc.addField("category", category);
- doc.addField("temperature", temperature);
- doc.addField("code", code);
- doc.addField("latitude", latitude);
- doc.addField("longitude", longitude);
- doc.addField("when", when);
- cloudSolrServer.add(doc);
- cloudSolrServer.commit(); // one commit per document: simple but slow; see the batched sketch below
- } catch (SolrServerException e) {
- System.err.println("Add docs Exception !!!");
- e.printStackTrace();
- } catch (IOException e) {
- e.printStackTrace();
- } catch (Exception e) {
- System.err.println("Unknowned Exception!!!!!");
- e.printStackTrace();
- }
-
- }
-
- public static void main(String[] args) {
- final String zkHost = "master:2188";
- final String defaultCollection = "mycollection";
- final int zkClientTimeout = 20000;
- final int zkConnectTimeout = 1000;
-
- CloudSolrClient client = new CloudSolrClient();
- client.open(zkHost, defaultCollection, zkClientTimeout, zkConnectTimeout);
-
- BuildingSampleGenerator gen = new BuildingSampleGenerator();
- final DateFormat df = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'");
-
- for(int i = 0; i < 10000; i++) {
- long id = gen.genId();
- String area = gen.genArea();
- int buildingType = gen.genBuildingType();
- String category = gen.genBuildingCategory();
- int temperature = gen.genTemperature();
- String code = gen.genCode();
- Coordinate coord = gen.genCoordinate();
- double latitude = coord.getLatitude();
- double longitude = coord.getLongitude();
- String when = df.format(new Date());
- client.addDoc(id, area, buildingType, category, temperature, code, latitude, longitude, when);
- }
-
- }
-
- }
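Note that addDoc() above issues a commit for every single document, which is easy to follow but slow at scale. A batched variant, sketched below against the same SolrJ 4.x API (the class name and batch size are illustrative, not part of the original client), buffers documents and commits once per batch, so each add() call ships a whole batch in a single request:
- import java.util.ArrayList;
- import java.util.List;
- 
- import org.apache.solr.client.solrj.impl.CloudSolrServer;
- import org.apache.solr.common.SolrInputDocument;
- 
- public class BatchedIndexer {
- 
- public static void main(String[] args) throws Exception {
- CloudSolrServer server = new CloudSolrServer("master:2188");
- server.setDefaultCollection("mycollection");
- 
- List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
- for (int i = 0; i < 10000; i++) {
- SolrInputDocument doc = new SolrInputDocument();
- doc.addField("id", System.nanoTime()); // sample values only
- doc.addField("area", "北京");
- doc.addField("temperature", 20);
- batch.add(doc);
- if (batch.size() == 1000) { // flush every 1000 documents
- server.add(batch);
- server.commit();
- batch.clear();
- }
- }
- if (!batch.isEmpty()) { // flush the remainder
- server.add(batch);
- server.commit();
- }
- server.shutdown();
- }
- }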
Once this runs, the SolrCloud admin pages, or a direct look at the servers, show the indexed data spread fairly evenly across the shards.
Of course, replicas can also be managed from the web admin page. For example, if a shard has too many replicas, you can unload one through the UI; this removes the replica's metadata from the state maintained by ZooKeeper, but the replica's data on the node itself is not deleted. The replica simply goes offline and stops serving requests.
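Finally, to confirm that distributed queries fan out across all three shards, here is a small SolrJ query sketch (again hypothetical, reusing the connection parameters from the indexing client):
- import org.apache.solr.client.solrj.SolrQuery;
- import org.apache.solr.client.solrj.impl.CloudSolrServer;
- import org.apache.solr.client.solrj.response.QueryResponse;
- 
- public class QueryCheck {
- 
- public static void main(String[] args) throws Exception {
- CloudSolrServer server = new CloudSolrServer("master:2188");
- server.setDefaultCollection("mycollection");
- 
- SolrQuery query = new SolrQuery("area:北京"); // any field defined in schema.xml
- query.setRows(5);
- QueryResponse rsp = server.query(query);
- // numFound counts matches across every shard of the collection.
- System.out.println("numFound = " + rsp.getResults().getNumFound());
- server.shutdown();
- }
- }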