This post shows how to ingest JSON into MarkLogic 7 using mlcp. Unlike many posts, this one applies specifically to MarkLogic 7.
Since the release of MarkLogic 6, MarkLogic Content Pump (mlcp) has been the supported tool for importing, exporting, and copying content. One feature it lacks is the ability to load JSON files as anything other than text documents. To put that in context, MarkLogic 7 sits in the middle of a transition in how MarkLogic handles JSON.
MarkLogic version | Handles JSON as |
---|---|
5 | text |
6, 7 | quietly converted to XML |
8 | native type |
In MarkLogic 5, JSON documents are stored as text. As with any text document, that lets you do word searches, but you’re not able to use the structure.
In MarkLogic 6 and 7, you can load JSON using the REST API and MarkLogic quietly converts it to an XML format. When you request the document back, MarkLogic quietly converts it back to JSON. The reason for this is that handling JSON was a goal for MarkLogic 6, but it’s done at the REST API level — internally, actual JSON would just be text, preventing us from building indexes and otherwise working with the structure. By converting it to XML internally, we can do much more with it.
In MarkLogic 8, JSON is planned to be a native type. I tested loading JSON with mlcp today on an ML8 development build, and mlcp loads JSON as native content, meaning that the structure is accessible without having to do anything special.
Ingest via REST API
We can ingest JSON documents and have them transparently converted by POSTing them to the REST API. Here’s how to load a directory of JSON documents:
for f in ~/data/json-data/*.json; do
  curl --anyauth --user admin:admin -X POST -d@"$f" -i \
    -H "Content-type: application/json" \
    'http://localhost:8040/v1/documents?extension=json&directory=/content/'
done
That works great, but for larger amounts of data, you lose out on mlcp’s ability to parallelize the workload.
Ingest via MLCP
MarkLogic’s documentation describes how to use a transform with mlcp. Here’s a simple transform that applies MarkLogic’s json:transform-from-json() function:
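A minimal sketch of such a transform follows, written as a shell heredoc that saves the module to a file. It assumes mlcp's documented transform interface (a function that receives `$content` and `$context` maps and returns the content map) and that mlcp hands the file over as text. The namespace URI `http://example.com/mlcp/from-json`, the prefix `fj`, and the file name `from-json.xqy` are hypothetical, chosen only for illustration.

```shell
#!/bin/sh
# Hypothetical file name; override with MODULE_FILE if needed.
module_file="${MODULE_FILE:-./from-json.xqy}"

# The quoted heredoc delimiter keeps the shell from expanding the
# XQuery variables ($content, $context, ...).
cat > "$module_file" <<'XQY'
xquery version "1.0-ml";

(: Hypothetical namespace; it only has to match what mlcp is told. :)
module namespace fj = "http://example.com/mlcp/from-json";

import module namespace json = "http://marklogic.com/xdmp/json"
  at "/MarkLogic/json/json.xqy";

(: mlcp calls this once per input document. The "value" key of
   $content holds the document as read from disk. :)
declare function fj:transform(
  $content as map:map,
  $context as map:map
) as map:map*
{
  (: Assumes the input was loaded as text, so the value is a JSON string. :)
  let $json-text := fn:string(map:get($content, "value"))
  let $xml := json:transform-from-json($json-text)
  (: Replace the text value with the converted XML document. :)
  let $_ := map:put($content, "value", document { $xml })
  return $content
};
XQY

echo "wrote $module_file"
```

The module still has to be installed where the target app server can resolve it (typically its modules database) before mlcp can call it.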
And here’s the call to have mlcp use it:
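As a rough sketch of what such a call could look like, the script below assembles an `mlcp.sh import` command that points at a transform module. The host, the port (8041, assumed here to be an XDBC app server), the credentials, the module path `/from-json.xqy`, and the namespace `http://example.com/mlcp/from-json` are all assumptions and need to match your own setup. The script only prints the command unless `RUN_MLCP=1` is set.

```shell
#!/bin/sh
# Path to the mlcp launcher; assumed to be on PATH unless MLCP is set.
MLCP="${MLCP:-mlcp.sh}"

# Build the argument list. All connection values are placeholders.
set -- import \
  -host localhost \
  -port 8041 \
  -username admin \
  -password admin \
  -input_file_path "$HOME/data/json-data" \
  -document_type text \
  -transform_module /from-json.xqy \
  -transform_namespace "http://example.com/mlcp/from-json" \
  -transform_function transform

# Show the command that would run.
echo "$MLCP $*"

# Only contact a server when explicitly asked to.
if [ "${RUN_MLCP:-0}" = "1" ]; then
  "$MLCP" "$@"
fi
```

Loading the files as text and letting the transform do the conversion keeps mlcp's parallelism while still producing the XML representation that MarkLogic 7 can index.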
Tags: json, marklogic, ml7, mlcp, rest api
June 30th, 2014 at 10:54 pm
An excellent technique. I think however that 6 stored JSON as XML just as 7 does.
January 29th, 2015 at 8:49 am
Hi David,
Can you please help me with this use case for MarkLogic 7?
1. I have some XML data, about 80 MB. I need to store that 80 MB across 2 partitions, partition 1 and partition 2: say I load 50 MB into partition 1 and 30 MB into partition 2. If I choose partition 1, the application has to search partition 1 only; if I choose partition 2, it has to search partition 2 only; and if I don’t select any partition, it has to search both. Is this possible in MarkLogic?
2. When I run a particular text search in MarkLogic, say I get about 20 matching documents out of 1000. Can I cache the search results somewhere? I need them to process the documents and find the top 10 words. Is this possible in MarkLogic, and if so, please let me know how.
Thanks in advance.
Regards
Shashi
January 29th, 2015 at 9:01 am
Shashi, for your first question, you could segment your data either with collections or in different directories. See the Search Developer’s Guide for the difference (http://docs.marklogic.com/guide/search-dev/collections#id_66550). Once they are segmented that way, you can include a cts:directory-query or cts:collection-query in your search.
On the second question, it sounds like you want to get the top 10 words from the 1000 documents, not just the 20. I think you can do this with the cts:words function, passing in the same query you used to identify the 1000 documents. You’d set up a word lexicon before doing so. http://docs.marklogic.com/cts:words
January 29th, 2015 at 9:46 am
Hi David,
Thanks for really helping me out on the first question. Your answer is what I was expecting. Really really appreciate your help.
For the second question, I want to process the search results again, which means working with only the 20 matching documents, not the entire 1000. The use case is like this: I search for some text, say ‘doctor’, and I get 20 docs containing the text ‘doctor’. From those 20 docs, not from the entire 1000, I want the top 10 most frequently occurring words.
Please help me.
Thanks and Regards
Shashi
January 29th, 2015 at 10:20 am
Shashi, check out the “sample=N” parameter to cts:words(). Pass in 20 there, along with the query, and I think you’ll get what you need.
January 29th, 2015 at 10:44 am
Thanks David, I will check it out.