Wednesday, 21 May 2014

Using post.jar for posting JSON, CSV, XML data on Solr

In my last few posts I introduced the Apache Solr dashboard and showed, with several examples, how to post data through its dashboard screen. With that approach I could post only one record at a time, and I could not post whole files of records in formats such as JSON, XML or CSV.

Agenda for this post

  1. How to post XML data from an XML file using post.jar
  2. How to post CSV data from a CSV file using post.jar
  3. How to post JSON data from a JSON file using post.jar
The schema for this post is the same as in my last post:
http://versatileankur.blogspot.in/2014/05/how-to-query-to-apache-solr.html

How to post XML data in the form of an XML file using post.jar
Apache Solr ships with a built-in jar file, post.jar, for posting documents. This file is located at
<parent-directory>/solr-4.7.2/example/exampledocs
The exampledocs directory also contains many XML files for demo purposes.
To post XML documents with this jar file, first create an XML file with the following records.

<add>
<doc>
   <field name="id">Solr105</field>
   <field name="name">Solr 105</field>
   <field name="address">House No - 100, LR Apache, 40702</field>
   <field name="comments">Apache Solr comment 1</field>
   <field name="popularity">101</field>
   <field name="counts">1</field>    
</doc>
<doc>
   <field name="id">Solr106</field>
   <field name="name">Solr 106</field>
   <field name="address">House No - 100, LR Apache, 40702</field>
   <field name="comments">Apache Solr comment 2</field>
   <field name="popularity">100</field>
   <field name="counts">2</field>
   <field name="dynamicField_i">It is dynamically generated field.</field>
</doc>
<doc>
   <field name="id">Solr107</field>
   <field name="name">Solr 107</field>
   <field name="address">House No - 100, LR Apache, 40702</field>
   <field name="comments">Apache Solr It's Cool.</field>
   <field name="popularity">109</field>
   <field name="counts">3</field>
   <field name="dynamicField_i">It is dynamically generated field.</field>
</doc>
</add>

Save this file as dummy.xml in the <solr>/example/exampledocs directory.
Go to the exampledocs directory in a command prompt and execute:
java -jar post.jar dummy.xml

For multiple XML files use -
java -jar post.jar dummy.xml dummy1.xml

For all XML files present in the working directory, use:
java -jar post.jar *.xml

On success, the console shows a message like this:

SimplePostTool version 1.5
Posting files to base url http://localhost:8983/solr/update using content-type application/xml..
POSTing file dummy.xml
1 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/update..
Time spent: 0:00:00.547
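
For readers curious about what happens over HTTP, here is a minimal Python sketch of the same kind of request. It is not the code of post.jar. It assumes Solr is listening at the update URL shown in the console output above, the function name build_update_request is made up for illustration, and the commit=true parameter is used to commit in the same request, which may differ from how post.jar issues its commit.

```python
import urllib.request

SOLR_UPDATE_URL = "http://localhost:8983/solr/update"  # URL from the console output


def build_update_request(payload, url=SOLR_UPDATE_URL, content_type="application/xml"):
    """Prepare (but do not send) an HTTP POST that carries the documents to index."""
    return urllib.request.Request(
        url + "?commit=true",  # assumption: commit as part of the same request
        data=payload,
        headers={"Content-Type": content_type},
        method="POST",
    )


xml_payload = b'<add><doc><field name="id">Solr105</field></doc></add>'
request = build_update_request(xml_payload)
print(request.get_method(), request.full_url)

# To really send it to a running Solr instance, uncomment:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```

The request is only built here, so the sketch runs without a Solr server.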

The "1 files indexed" message from post.jar means that your XML documents have been indexed in Apache Solr. To verify, open your dashboard screen,
select collection1 -> Query and click the Execute Query button.
The newly posted records appear in the response.

Syntax of XML file

<add></add> is the root element, i.e. the parent of all the records/entities.
<doc></doc> denotes one record/entity to be added to Apache Solr.
<field></field> denotes one property of a record/entity.

"All required fields declared in schema.xml must be present in every <doc> element in the file."

Suppose the second <doc></doc> element does not fulfil this restriction. The first record is then indexed, but Solr stops reading the file at the exception and does nothing with the remaining records. So be careful with your required fields and with the documents you send to Apache Solr for updating.
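
To avoid a half-indexed file, the documents can be generated and checked before posting. The following Python sketch builds the <add>/<doc>/<field> structure described above and reports every <doc> that lacks a required field. The function names are hypothetical, and it assumes that "id" is the only required field, as in the schema used for this post.

```python
import xml.etree.ElementTree as ET

REQUIRED_FIELDS = {"id"}  # assumption: only "id" is required in this post's schema


def build_add_xml(records):
    """Build an <add> document with one <doc> per record and one <field> per key."""
    add = ET.Element("add")
    for record in records:
        doc = ET.SubElement(add, "doc")
        for name, value in record.items():
            ET.SubElement(doc, "field", name=name).text = str(value)
    return ET.tostring(add, encoding="unicode")


def find_incomplete_docs(xml_text, required=REQUIRED_FIELDS):
    """Return (position, missing field names) for each <doc> lacking a required field."""
    problems = []
    for position, doc in enumerate(ET.fromstring(xml_text).findall("doc"), start=1):
        present = {field.get("name") for field in doc.findall("field")}
        missing = sorted(set(required) - present)
        if missing:
            problems.append((position, missing))
    return problems


records = [
    {"id": "Solr105", "name": "Solr 105", "popularity": 101, "counts": 1},
    {"name": "Solr 106", "popularity": 100, "counts": 2},  # "id" left out on purpose
]
xml_text = build_add_xml(records)
print(xml_text)
print(find_incomplete_docs(xml_text))
```

An empty list from find_incomplete_docs means every record carries the required fields and the file is safe to save and post.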

How to post CSV data in the form of a CSV file using post.jar
First create a CSV file in the /example/exampledocs/ directory with these records:

id,name,address,comments,popularity,counts,dynamicField_i
"Solr110","Solr 110","House No - 100, LR Apache","Apache Solr comment 1",110,110,"dynamic solr 110"
"Solr111","Solr 111","House No - 100, LR Apache","Apache Solr comment 1",111,111,"dynamic solr 111"
"Solr112","Solr 112","House No - 100, LR Apache","Apache Solr comment 1",112,112,"dynamic solr 112"
"Solr113","Solr 113","House No - 100, LR Apache","Apache Solr comment 1",113,113,"dynamic solr 113"

Save this file as dummy.csv.
Go to the /example/exampledocs directory in a command prompt and execute:

java -Durl=http://localhost:8983/solr/update/csv -Dtype=text/csv -jar post.jar dummy.csv

For multiple CSV files use -
java -Durl=http://localhost:8983/solr/update/csv -Dtype=text/csv -jar post.jar dummy.csv dummy1.csv

For all CSV files present in working directory use-
java -Durl=http://localhost:8983/solr/update/csv -Dtype=text/csv -jar post.jar *.csv

On success, the console shows a message like this:
SimplePostTool version 1.5
Posting files to base url http://localhost:8983/solr/update/csv using content-type text/csv..
POSTing file dummy.csv
1 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/update/csv..
Time spent: 0:00:00.577



The "1 files indexed" message from post.jar means that your CSV documents have been indexed in Apache Solr. To verify, open your dashboard screen,
select collection1 -> Query and click the Execute Query button.
The newly posted records appear in the response.

Congrats, your CSV document has been posted successfully.
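
Note that the address value contains a comma, which is why the text values in the file are wrapped in double quotes. If you generate such a file from code, a CSV library handles the quoting for you. The following Python sketch reproduces the same records; the function name build_csv is made up for illustration.

```python
import csv
import io

FIELDS = ["id", "name", "address", "comments", "popularity", "counts", "dynamicField_i"]


def build_csv(rows):
    """Write a header line plus one line per record, quoting only the text values."""
    buffer = io.StringIO()
    buffer.write(",".join(FIELDS) + "\n")  # plain header line, as in dummy.csv
    writer = csv.writer(buffer, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")
    for row in rows:
        writer.writerow([row[name] for name in FIELDS])
    return buffer.getvalue()


rows = [
    {
        "id": f"Solr{n}",
        "name": f"Solr {n}",
        "address": "House No - 100, LR Apache",
        "comments": "Apache Solr comment 1",
        "popularity": n,
        "counts": n,
        "dynamicField_i": f"dynamic solr {n}",
    }
    for n in range(110, 114)
]
csv_text = build_csv(rows)
print(csv_text)
```

Writing csv_text to dummy.csv gives a file equivalent to the one used in this section.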

How to post JSON data in the form of a JSON file using post.jar
First create a JSON file in the /example/exampledocs/ directory with these records:
[{
"id":"Solr115",
"name":"Solr 115",
"address":"House No - 100, LR Apache, 40702",
"comments":"Apache Solr comment 1",
"popularity":115,
"counts":115
},
{
"id":"Solr116",
"name":"Solr 116",
"address":"House No - 100, LR Apache, 40702",
"comments":"Apache Solr comment 1",
"popularity":116,
"counts":116
},
{
"id":"Solr117",
"name":"Solr 117",
"address":"House No - 100, LR Apache, 40702",
"comments":"Apache Solr comment 1",
"popularity":117,
"counts":117
}]

Save this file as dummy.json.
Go to the /example/exampledocs directory in a command prompt and execute the given command:
java -Durl=http://localhost:8983/solr/update/json -Dtype=application/json -jar post.jar dummy.json

For multiple JSON files use -
java -Durl=http://localhost:8983/solr/update/json -Dtype=application/json -jar post.jar d1.json d2.json

For all JSON files present in working directory use-
java -Durl=http://localhost:8983/solr/update/json -Dtype=application/json -jar post.jar *.json

On success, the console shows a message like this:
SimplePostTool version 1.5
Posting files to base url http://localhost:8983/solr/update/json using content-type application/json..
POSTing file dummy.json
1 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/update/json..
Time spent: 0:00:00.535
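
Before posting a JSON file it can help to check that it really is an array of documents and that each one carries an id. Here is a small Python sketch of such a check; load_docs is a hypothetical helper, and the rule that every document needs an "id" is taken from the schema used in this post.

```python
import json


def load_docs(json_text):
    """Parse a JSON array of documents and check that each one has an "id"."""
    docs = json.loads(json_text)
    if not isinstance(docs, list):
        raise ValueError("expected a JSON array of documents")
    for position, doc in enumerate(docs, start=1):
        if "id" not in doc:
            raise ValueError(f"document {position} has no id")
    return docs


records = [
    {
        "id": f"Solr{n}",
        "name": f"Solr {n}",
        "address": "House No - 100, LR Apache, 40702",
        "comments": "Apache Solr comment 1",
        "popularity": n,
        "counts": n,
    }
    for n in (115, 116, 117)
]
json_text = json.dumps(records, indent=2)  # same shape as dummy.json
docs = load_docs(json_text)
print(len(docs), docs[0]["id"])
```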

The "1 files indexed" message from post.jar means that your JSON documents have been indexed in Apache Solr. To verify, open your dashboard screen,
select collection1 -> Query and click the Execute Query button.
The newly posted records appear in the response.

post.jar supports some more parameters, along with options on the <add> tag in the XML file. I will discuss them in my later posts.
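
The three commands used in this post differ only in the update URL and the content type. As a recap, the Python sketch below maps a file extension to those two values and prints the matching command line. The helper post_command is hypothetical, and for XML it spells out the URL and type that post.jar reported as its defaults in the console output.

```python
import os

BASE_URL = "http://localhost:8983/solr/update"

# extension -> (update URL, content type), taken from the commands in this post
HANDLERS = {
    ".xml": (BASE_URL, "application/xml"),
    ".csv": (BASE_URL + "/csv", "text/csv"),
    ".json": (BASE_URL + "/json", "application/json"),
}


def post_command(filename):
    """Return the post.jar command line (as an argument list) for one data file."""
    extension = os.path.splitext(filename)[1].lower()
    url, content_type = HANDLERS[extension]
    return ["java", f"-Durl={url}", f"-Dtype={content_type}", "-jar", "post.jar", filename]


for name in ("dummy.xml", "dummy.csv", "dummy.json"):
    print(" ".join(post_command(name)))
```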

Namah Shivay

Wednesday, 14 May 2014

How to Query Apache Solr

In this post I will show how to query Apache Solr using its dashboard screen. The same queries can also be sent with the Java HttpClient library or with a curl request. But since my last four posts introduced the Solr dashboard, I will fire the different kinds of queries from its dashboard screen. We will do all of this from Java code as well in my next post.

Let's update your schema.xml file with the given mappings and start your Apache Solr server:

<?xml version="1.0" encoding="UTF-8" ?>
<schema name="example core zero" version="1.1">
<fields>
  <field name="_version_" type="long" indexed="true" stored="true"/>
  <field name="_root_" type="string" indexed="false" stored="false"/>
  <field name="id" type="string" indexed="true" stored="true" required="true" />
  <field name="name" type="string" indexed="true" stored="true" />
  <field name="address" type="string" indexed="true" stored="true" />
  <field name="comments" type="string" indexed="true" stored="true" />
  <field name="text" type="string" indexed="true" stored="false" multiValued="true"/>
  <field name="popularity" type="long" indexed="true" stored="true" multiValued="false"/>
  <field name="counts" type="long" indexed="true" stored="true" />
  <dynamicField name="*_i" type="string" indexed="true" stored="true" />
</fields>
<uniqueKey>id</uniqueKey>
<copyField source="name" dest="text"/>
<copyField source="address" dest="text"/>
<copyField source="comments" dest="text"/>
<types>
  <fieldtype name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
  <fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
</types>
</schema>

It's time to add more records to Apache Solr. Go to
Solr Dashboard -> select collection1 -> Documents
and save all of these records one by one.
{
"id": "Solr101",
"name": "Solr version 4.7.2",
"address": "House No - 100, LR Apache, 40702",
"comments": "Apache Solr It's Cool.",
"popularity": 10,
"counts": 140,
"dynamicField_i": "It is dynamically generated field."
}
{
"id": "Solr102",
"name": "Solr SECOND RECORD",
"address": "SECOND RECORD ADDRESS",
"comments": "RECORDS FOR TESTING PURPOSE",
"popularity": 10,
"counts": 340,
"dynamicField_i": "It is dynamically generated field FOR SECOND RECORD."
}
{
"id": "Solr103",
"name": "Solr THIRD RECORD",
"address": "THIRD RECORD ADDRESS",
"comments": "RECORDS FOR TESTING PURPOSE",
"popularity": 1,
"counts": 40,
"dynamicField_i": "It is dynamically generated field FOR THIRD RECORD."
}
{
"id": "Solr104",
"name": "Solr FOURTH RECORD",
"address": "FOURTH RECORD ADDRESS",
"comments": "RECORDS FOR TESTING PURPOSE",
"popularity": 6,
"counts": 400,
"dynamicField_i": "It is dynamically generated field FOR FOURTH RECORD."
}
Go to the Query tab and click Execute Query to see the stored records.
This screen has a lot of text fields; I am going to introduce all of them.

q field (stands for query, default *:*)
The first * denotes the <field name>.
The second * denotes the text to search for in that field.
Ex. Type id:Solr102 in this text box and click the Execute Query button. Solr will search for the string "Solr102" in the id field and return all the results matching this criterion.
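
The same query can be expressed as a request URL. Here is a minimal Python sketch that builds it; the select URL assumes a default local install with a core named collection1, the helper name select_url is made up, and wt=json is added only to ask for a JSON response.

```python
from urllib.parse import urlencode

# Assumption: default local Solr install with a core named collection1
SELECT_URL = "http://localhost:8983/solr/collection1/select"


def select_url(**params):
    """Build a select URL, percent-encoding the parameter values."""
    return SELECT_URL + "?" + urlencode(params)


url = select_url(q="id:Solr102", wt="json")
print(url)
```

Note that the colon in id:Solr102 is percent-encoded as %3A in the URL.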

fq field (stands for filter query)
This is used as a query filter and imposes further restrictions on the main query you provide. The result of a filter is cached separately, so when the same filter is used again its result is returned from the cache.
The fq parameter can be specified multiple times by pressing the "+" sign to the right of the text box. The search response is the intersection of all these filters. Ex.
fq=popularity:10
fq=counts:140
This fetches the records where popularity is 10 and counts is 140. It can also be written as a single filter:
fq=+popularity:10 +counts:140
The top right corner of the query screen has a link showing the request URL. Click it and the result opens in a new browser tab. This means you do not need the dashboard screen to get the same result: you can write your query URL directly in the browser window and it will return the result of your query.
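
When you write such a URL by hand, each fq becomes its own request parameter. The Python sketch below builds both variants, two separate filters and one combined filter. The select URL again assumes a default local core named collection1.

```python
from urllib.parse import parse_qs, urlencode

# Assumption: default local Solr install with a core named collection1
SELECT_URL = "http://localhost:8983/solr/collection1/select"

# A list of pairs lets the same parameter name appear more than once.
separate = [("q", "*:*"), ("fq", "popularity:10"), ("fq", "counts:140")]
combined = [("q", "*:*"), ("fq", "+popularity:10 +counts:140")]

separate_url = SELECT_URL + "?" + urlencode(separate)
combined_url = SELECT_URL + "?" + urlencode(combined)
print(separate_url)
print(combined_url)

# Decoding the query string gives back the original filter values.
print(parse_qs(separate_url.split("?", 1)[1])["fq"])
```

The literal "+" signs of the combined filter are encoded as %2B, because an unencoded "+" in a query string stands for a space.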

Sort
Ex. id desc
Here I am sorting the documents on the basis of id.
Note: the syntax is <fieldName><space><sort order, i.e. asc or desc>
You can specify multiple sort orders, separated by commas. Suppose you have three sort orders: the second is evaluated only when the first produces a tie, and the third only when both the first and the second produce a tie.
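
The tie-breaking rule can be tried out locally. The following Python sketch only mimics the ordering on a list of dictionaries; it is not how Solr sorts internally, and sort_docs is a made-up helper. The sample values are the four records saved earlier in this post.

```python
docs = [
    {"id": "Solr101", "popularity": 10, "counts": 140},
    {"id": "Solr102", "popularity": 10, "counts": 340},
    {"id": "Solr103", "popularity": 1, "counts": 40},
    {"id": "Solr104", "popularity": 6, "counts": 400},
]


def sort_docs(docs, sort_spec):
    """Order dicts by a Solr-style spec such as "popularity desc,counts asc"."""
    ordered = list(docs)
    # Apply the clauses from last to first. Python's sort is stable, so the
    # first clause decides the order and later clauses only break its ties.
    for clause in reversed(sort_spec.split(",")):
        field, direction = clause.split()
        ordered.sort(key=lambda doc: doc[field], reverse=(direction == "desc"))
    return ordered


for spec in ("popularity desc,counts asc", "popularity desc,counts desc"):
    print(spec, [doc["id"] for doc in sort_docs(docs, spec)])
```

Solr101 and Solr102 share the same popularity, so only their relative order changes between the two specs.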

start, rows
start is the offset from which the records are fetched, counted from 0. rows is the number of records to be fetched. Ex.
if start=10, rows=20
then Solr skips the first 10 matches and returns the next 20, i.e. the records at zero-based positions 10 to 29.
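
In terms of a plain list, start and rows behave like a slice. This small Python sketch shows the arithmetic on 100 made-up results; the page function is only an illustration, with defaults matching Solr's start=0 and rows=10.

```python
def page(results, start=0, rows=10):
    """Return the window of results selected by start (offset) and rows (count)."""
    return results[start:start + rows]


results = [f"doc{i}" for i in range(100)]  # doc0 is the first match
window = page(results, start=10, rows=20)
print(window[0], window[-1], len(window))
```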

fl (stands for field list)
It restricts the fields returned by Apache Solr. The fields are given as a comma-separated list.
Ex. name,address
The response will contain only the name and address fields.
Field aliasing can also be done:
id,UserName:name
Syntax: <alias name>:<fieldName>
Here Solr returns the result with two fields: id, and UserName, which is used as an alias for the name field.
You can also use the * wildcard in the field list:
id,add*
Description: it returns id and all the fields whose names start with "add".
A function can also be used in the response:
id,reviews:sum(popularity,counts)
Description: it returns two fields: id, and reviews, which is the sum of popularity and counts.
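
To make the plain, alias and wildcard forms concrete, here is a Python sketch that mimics them on a single dictionary. It is only an imitation for illustration, not Solr's implementation; apply_fl is a made-up helper and it does not handle functions such as sum(...).

```python
from fnmatch import fnmatchcase


def apply_fl(doc, fl):
    """Mimic a simple fl list: plain names, alias:field pairs and * wildcards."""
    selected = {}
    for item in fl.split(","):
        item = item.strip()
        if ":" in item:
            # alias:field returns the field's value under the alias name
            alias, field = (part.strip() for part in item.split(":", 1))
            if field in doc:
                selected[alias] = doc[field]
        else:
            # a plain name or a wildcard pattern such as add*
            for name, value in doc.items():
                if fnmatchcase(name, item):
                    selected[name] = value
    return selected


doc = {
    "id": "Solr101",
    "name": "Solr version 4.7.2",
    "address": "House No - 100, LR Apache, 40702",
    "popularity": 10,
    "counts": 140,
}
print(apply_fl(doc, "id,UserName:name"))
print(apply_fl(doc, "id,add*"))
```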
df (default field)
This is the field Solr searches when the query does not name a field. That is, if you enter only the text in the search query section (q section) and set a field in the df text box, Solr searches for that text only in that field and not in any other field.
Ex. type Solr104 in the q section
and type id in the df field; Solr will search for Solr104 in the id field.
omitHeader (default value false)
This controls whether the header is omitted from the response returned by Apache Solr.
If omitHeader=true, you get a response without the header section and its additional details.
debug (default false)
You can debug your query by using this parameter.

Saturday, 10 May 2014

Intro to Solr schema.xml File

Until now I have introduced the dashboard of Apache Solr. Now I am going to discuss the core part of Apache Solr, i.e. the Solr schema.xml file.

"The schema.xml file defines the fields that are used as a reference for inserting data into Apache Solr as well as for querying it."

This XML file defines which fields your documents can have. If you provide extra fields that are not defined in schema.xml and do not match any dynamic field, Solr does not accept them: the update fails with an "unknown field" error.
When you provide your document in any of the supported formats, Apache Solr uses this file to decide how the different fields in the document are treated, i.e. this file is used for

  1. Deciding the unique field in the document.
  2. Fields used for indexing.
  3. Fields to be generated dynamically.
  4. How different fields will be used at the time of indexing as well as querying, etc.
I divide this XML file into five core tags for beginners. The file also has some other tags, but these are enough to begin with. I will discuss all of them in the Advanced Solr Learning tutorial series.
  1. Fields
  2. Unique Key 
  3. Copy Field
  4. Types
  5. Default Search Fields
This schema.xml file is located in the <parent-directory>/solr-4.7.2/example/solr/collection1/conf directory. The root tag of this XML file is <schema>, and all other tags are defined within it.

<fields> Tag
This is the section where you define the fields that your documents can have and that are used for indexing as well as for querying. Within this tag we define multiple <field> tags, ex.

<field name="id" type="string" indexed="true" stored="true" multiValued="false" required="true"/>

name : Every field has this attribute, which is used as a reference when adding documents and in search queries.
type : Defines the type of the field, i.e. string, integer, etc.
stored : If false, the field's value will not be stored in Apache Solr. Its default value is true.
indexed : If true, the field is indexed, which makes it searchable and facetable in Apache Solr.
multiValued : If true, Solr will manage multiple values for this field. Its default value is false.
required : Tells Solr that every document must have this field.
The <fields> tag also supports dynamic fields.

For example, the following dynamic field declaration tells Solr that whenever it sees a field name ending in "_i" which is not an explicitly defined field, it should dynamically create a string field with that name.

<dynamicField name="*_i" type="string" indexed="true" stored="true" />

With a dynamic field declaration, Solr generates fields for us, i.e. we do not need to define every field in advance.
You can use * at the start or at the end of the name attribute. If a field name matches several dynamic field rules, the rule that is defined first is used.
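
The lookup order described here can be sketched in a few lines of Python: explicit fields first, then the dynamic field patterns in the order they are declared. This only mimics the rule as this post describes it; resolve_field_type and the two tables are made up for illustration.

```python
from fnmatch import fnmatchcase

# Explicitly declared fields: name -> type
EXPLICIT_FIELDS = {"id": "string", "name": "string", "popularity": "long"}

# Dynamic field rules as (pattern, type), in the order they appear in the schema
DYNAMIC_FIELDS = [("*_i", "string")]


def resolve_field_type(name):
    """Return the type for a field name, or None when nothing matches."""
    if name in EXPLICIT_FIELDS:
        return EXPLICIT_FIELDS[name]
    for pattern, field_type in DYNAMIC_FIELDS:
        if fnmatchcase(name, pattern):
            return field_type  # first matching rule, per the description above
    return None


for field_name in ("popularity", "dynamicField_i", "unknownField"):
    print(field_name, resolve_field_type(field_name))
```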

Mandatory fields - 

_root_
Points to the root document of a block of nested documents. Required for nested document support, may be removed otherwise.
_version_
The _version_ field is an internal field that is used by the partial update procedure, the update log process, etc. It is only used internally for those processes, and simply providing the _version_ field in your schema.xml should be sufficient.

Copy Fields Tag
<copyField source="title" dest="text"/>
<copyField source="text_data" dest="text"/>
<copyField source="description" dest="text"/>

The first declaration copies the text of the title field to the text field. You can copy the text of multiple fields into one field. For this, the text field must exist in your schema; you can then use it for searching and indexing purposes.
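
The effect of these three declarations on one document can be pictured with a short Python sketch. It only imitates the copying on a dictionary and is not Solr code; apply_copy_fields is a hypothetical helper.

```python
# (source, dest) pairs, mirroring the three copyField declarations above
COPY_FIELDS = [("title", "text"), ("text_data", "text"), ("description", "text")]


def apply_copy_fields(doc, copy_fields=COPY_FIELDS):
    """Return a copy of doc where each source value is also added to its dest field."""
    result = dict(doc)
    for source, dest in copy_fields:
        if source not in doc:
            continue
        values = doc[source] if isinstance(doc[source], list) else [doc[source]]
        existing = result.get(dest, [])
        if not isinstance(existing, list):
            existing = [existing]
        result[dest] = existing + values  # dest collects several values
    return result


doc = {"id": "1", "title": "Apache Solr", "description": "A search server"}
print(apply_copy_fields(doc))
```

Because several sources feed the same destination, the text field has to be declared multiValued, as it is in the schema used in these posts.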

<uniqueKey>id</uniqueKey>
As its name shows, this is the primary key of the document records. It is used to uniquely identify the records during modification and deletion as well as for searching.

<defaultSearchField>text</defaultSearchField>
This field is used if you don't provide any field name explicitly, i.e. if you search for plain text without a field label, Solr searches in the text field and shows you the results.

<types>
This section allows you to add the different <fieldType> definitions which will be available to your schema, ex.

<fieldType name="string" class="solr.StrField" />

name : A unique name that identifies the field type; fields refer to it through their type attribute.
class : Tells which Solr class is used for this field type. A field type can also carry default options which you want for your fields. The default schema.xml file comes with a large number of <fieldType> definitions; just go through them and you will see the options provided for each field type. I am not going to discuss every optional attribute here; I will explain them as our requirements call for them.
For common numeric types (integer, float, etc.) there are multiple implementations provided, depending on your needs.
You can also create your own custom field type. Please see SolrPlugins for information on how to ensure that your own custom field types can be loaded into Solr.

Let's define a new schema as shown below. Just copy and paste this schema into your schema.xml file and then start your Solr server.


<?xml version="1.0" encoding="UTF-8" ?>
<schema name="example core zero" version="1.1">
<fields>
  <field name="_version_" type="long" indexed="true" stored="true"/>
  <field name="_root_" type="string" indexed="false" stored="false"/>
  <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
  <field name="name" type="string" indexed="true" stored="true" multiValued="true" />
  <field name="address" type="string" indexed="true" stored="true" multiValued="true" />
  <field name="comments" type="string" indexed="true" stored="true" multiValued="true" />
  <field name="text" type="string" indexed="true" stored="false" multiValued="true"/>
  <dynamicField name="*_i" type="string" indexed="true" stored="true" />
</fields>

<uniqueKey>id</uniqueKey>

<copyField source="name" dest="text"/>
<copyField source="address" dest="text"/>
<copyField source="comments" dest="text"/>

<types>
  <fieldtype name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
  <fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
</types>
</schema>
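
A quick way to double-check a schema like this before starting Solr is to parse it and list what it declares. The Python sketch below does that for a trimmed copy of the schema; summarize_schema is a hypothetical helper, and the embedded XML leaves out the <types> section and some fields for brevity.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the schema above (assumption: <types> omitted for brevity)
SCHEMA_XML = """
<schema name="example core zero" version="1.1">
  <fields>
    <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
    <field name="name" type="string" indexed="true" stored="true" multiValued="true"/>
    <field name="address" type="string" indexed="true" stored="true" multiValued="true"/>
    <field name="comments" type="string" indexed="true" stored="true" multiValued="true"/>
    <field name="text" type="string" indexed="true" stored="false" multiValued="true"/>
    <dynamicField name="*_i" type="string" indexed="true" stored="true"/>
  </fields>
  <uniqueKey>id</uniqueKey>
</schema>
"""


def summarize_schema(xml_text):
    """List the unique key, required fields, multiValued fields and dynamic patterns."""
    root = ET.fromstring(xml_text)
    fields = root.findall("./fields/field")
    return {
        "unique_key": root.findtext("uniqueKey"),
        "required": [f.get("name") for f in fields if f.get("required") == "true"],
        "multi_valued": [f.get("name") for f in fields if f.get("multiValued") == "true"],
        "dynamic": [d.get("name") for d in root.findall("./fields/dynamicField")],
    }


summary = summarize_schema(SCHEMA_XML)
for key, value in summary.items():
    print(key, value)
```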

To check whether your schema has been updated:
Go to the Solr Dashboard screen -> select the collection1 core -> Files -> schema.xml

Now go to the Documents section and copy and paste the given JSON into the document box:
{
"id":"Solr101",
"name":"Solr version 4.7.2",
"address":["House No - 100, LR Apache, 40702","Address 2 ","address 3"],
"comments":["Working with Apache Solr and it's cool till now I am a beginner and started with Ankur.",
        "Comment 2","Comment 3"],
"dynamicField_i":"It is dynamically generated field."
}
Set the document type to JSON and submit the document.
Then go to the Query section and click the Execute Query button. The response shows the stored document.

Here you can see that the address and comments fields are returned as JSON arrays, because I declared both of these fields as multiValued. One more field named dynamicField_i has also been created: its name matches the *_i dynamic field declaration, so Solr creates it for us dynamically.