Expecting List not String

Not sure if I should put this here or on GitHub, but oh well.

content_N = 64
def generate_content_string(N):
    return [f'content_{i}' for i in range(N+1)]
content = generate_content_string(content_N)

instead of:

def generate_content_string(N):
    return ', '.join(f'"content_{i}"' for i in range(N+1))

because

output = mq.index("index002").add_documents(documents, client_batch_size=64, tensor_fields=content)

tensor_fields is expecting a List, not a string

I was wondering why it was adding the documents but not vectorizing

{'numberOfDocuments': 39, 'numberOfVectors': 0, 'backend': {'memoryUsedPercentage': 0.9093964143000001, 'storageUsedPercentage': 65.06686410673001}}

{'numberOfDocuments': 39, 'numberOfVectors': 1826, 'backend': {'memoryUsedPercentage': 0.76388229132, 'storageUsedPercentage': 65.14138952361}}
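A quick check that would have caught this sooner, as a rough sketch. It uses the two stats dicts above, trimmed to the relevant keys. `has_vectors` is a made-up helper name, not part of the Marqo client.

```python
def has_vectors(stats):
    """Return True if the index stats report at least one vector."""
    return stats.get("numberOfVectors", 0) > 0


# The two results from get_stats() shown above, reduced to the counts.
stats_zero = {"numberOfDocuments": 39, "numberOfVectors": 0}
stats_ok = {"numberOfDocuments": 39, "numberOfVectors": 1826}

for stats in (stats_zero, stats_ok):
    if stats["numberOfDocuments"] > 0 and not has_vectors(stats):
        print("documents were added but nothing was vectorized:", stats)
    else:
        print("looks fine:", stats)
```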

oh and

output = mq.index("index002").add_documents(documents, client_batch_size=64, tensor_fields="content")

marqo.errors.MarqoWebError: MarqoWebError: MarqoWebError Error message: {"detail":[{"loc":["body","tensorFields"],"msg":"value is not a valid list","type":"type_error.list"}]}

So it does report an error in that case. But when you call a function inside that argument that generates a string like '"content_0", ..., "content_N"', it just doesn’t vectorize anything, for a reason I don’t understand.

Thanks for the question! Yes, tensor_fields must be given a list, or you will encounter an error. It looks like the first generate_content_string you sent returns a list, while the second returns a single string.
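For reference, here is a minimal sketch of the difference that runs without a Marqo server. The names `fields_as_list` and `fields_as_string` are made up for illustration; they mirror the two versions of generate_content_string above.

```python
def fields_as_list(n):
    """Return the field names as a list of strings: ['content_0', ..., 'content_n']."""
    return [f"content_{i}" for i in range(n + 1)]


def fields_as_string(n):
    """Return ONE string that only looks like a list: '"content_0", "content_1", ...'."""
    return ", ".join(f'"content_{i}"' for i in range(n + 1))


good = fields_as_list(2)
bad = fields_as_string(2)

# The first is a list with three entries, the second is a single str object.
print(type(good).__name__, good)
print(type(bad).__name__, bad)
```

Only the first form is a list of field names. The second is one long string, however much it resembles a list when printed.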

Could you provide the script you used that made Marqo “not vectorise anything” so we can reproduce it on our end?

This:

import marqo
import json
import pprint

mq = marqo.Client(url='http://localhost:8882')

index = "masterdocs"
# mq.delete_index(index)
# mq.create_index(index)
print(mq.index(index).get_stats())

content_N = 212
documents = []

def generate_content_string(N):
    return [f'content_{i}' for i in range(N+1)]
content = generate_content_string(content_N)


with open('master_documents.json', 'r', encoding='utf-8') as file:
    documents = json.load(file)

output = mq.index(index).add_documents(documents, client_batch_size=64, tensor_fields=content)
pprint.pprint(output)
print(type(output))

results2 = mq.index(index).get_stats()
print(results2)

Does work, but this:

import marqo
import json
import pprint

mq = marqo.Client(url='http://localhost:8882')

index = "masterdocs"
# mq.delete_index(index)
# mq.create_index(index)
print(mq.index(index).get_stats())

content_N = 212
documents = []

def generate_content_string(N):
    return ', '.join(f'"content_{i}"' for i in range(N+1))


with open('master_documents.json', 'r', encoding='utf-8') as file:
    documents = json.load(file)

output = mq.index(index).add_documents(documents, client_batch_size=64, tensor_fields=generate_content_string(content_N))
pprint.pprint(output)
print(type(output))

results2 = mq.index(index).get_stats()
print(results2)

doesn’t vectorize.

Do you have any index settings, or are you using the default? This is to know whether your index is structured or unstructured.

Can you also send some sample documents here? It’s possible the document field names are not expected.
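One way to check the field names yourself, as a rough sketch: collect the names that actually occur in the documents and compare them with the list you pass as tensor_fields. `content_fields_in` is a hypothetical helper, and the two documents below are placeholders rather than your real data.

```python
def content_fields_in(documents, prefix="content_"):
    """Collect every field name starting with `prefix` that occurs in any document."""
    names = set()
    for doc in documents:
        names.update(key for key in doc if key.startswith(prefix))
    # Sort shorter names first so content_2 comes before content_10.
    return sorted(names, key=lambda name: (len(name), name))


# Placeholder documents, shaped like the ones discussed in this thread.
documents = [
    {"_id": "01", "content_1": "first part", "content_2": "second part"},
    {"_id": "02", "content_1": "a", "content_2": "b", "content_3": "c"},
]

present = content_fields_in(documents)
requested = [f"content_{i}" for i in range(4)]  # content_0 .. content_3

# Requested names that never appear in any document.
unused = [name for name in requested if name not in present]

print("present:", present)
print("unused:", unused)
```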

Default settings. Here are some sample documents:

[
    {
        "_id": "01",
        "content_1": "A smartphone is a portable computer device that combines mobile telephone ",
        "content_2": "functions and computing functions into one unit.",
        "datum_created": "2024-03-26",
        "datum_modified": "2024-03-26"
    },
    {
        "_id": "02",
        "content_1": "A telephone is a telecommunications device that permits two or more users to",
        "content_2": "conduct a conversation when they are too far apart to be easily heard directly.",
        "datum_created": "2024-03-26",
        "datum_modified": "2024-03-26"
    },
    {
        "_id": "03",
        "content_1": "The thylacine, also commonly known as the Tasmanian tiger or Tasmanian wolf, ",
        "content_2": "is an extinct carnivorous marsupial.",
        "content_3": "The last known of its species died in 1936.",
        "datum_created": "2024-03-26",
        "datum_modified": "2024-03-26"
    },
    {
        "_id": "04",
        "content_1": "Artificial intelligence (AI) is the simulation of human intelligence processes ",
        "content_2": "by machines, especially computer systems.",
        "datum_created": "2024-03-26",
        "datum_modified": "2024-03-26"
    },
    {
        "_id": "05",
        "content_1": "The Great Wall of China is a series of fortifications made of stone, brick, ",
        "content_2": "tamped earth, wood, and other materials, generally built along an east-to-west line across the historical northern borders of China.",
        "datum_created": "2024-03-26",
        "datum_modified": "2024-03-26"
    },
    {
        "_id": "06",
        "content_1": "Photosynthesis is the process by which green plants and some other organisms ",
        "content_2": "use sunlight to synthesize foods with the help of chlorophyll.",
        "datum_created": "2024-03-26",
        "datum_modified": "2024-03-26"
    },
    {
        "_id": "07",
        "content_1": "The Eiffel Tower is a wrought iron lattice tower located on the Champ de Mars ",
        "content_2": "in Paris, France. It is named after the engineer Gustave Eiffel, whose company designed and built the tower.",
        "datum_created": "2024-03-26",
        "datum_modified": "2024-03-26"
    },
    {
        "_id": "08",
        "content_1": "The Roman Empire was a period of ancient Roman civilization",
        "content_2": "characterized by government headed by emperors and large territorial holdings around the Mediterranean Sea in Europe, Africa, and Asia.",
        "content_3": "The city of Rome served as its capital.",
        "content_4": "The Roman Empire emerged after the Roman Republic, which was established in 509 BCE.",
        "content_5": "It reached its greatest territorial extent during the 2nd century CE, encompassing much of Europe, North Africa, and the Middle East.",
        "content_6": "The Roman Empire eventually split into the Western Roman Empire and the Eastern Roman Empire, with the latter often referred to as the Byzantine Empire.",
        "content_7": "The Western Roman Empire fell in 476 CE, while the Eastern Roman Empire continued to exist for several centuries until Constantinople fell to the Ottoman Turks in 1453 CE.",
        "datum_created": "2024-03-26",
        "datum_modified": "2024-03-26"
    },
    {
        "_id": "09",
        "content_1": "The periodic table is a tabular display of the chemical elements",
        "content_2": "arranged by atomic number, electron configuration, and recurring chemical properties.",
        "content_3": "The structure of the table shows periodic trends such as elements with similar behavior in the same column.",
        "content_4": "The periodic table is widely used in chemistry for understanding the properties of elements and predicting the behavior of chemical compounds.",
        "content_5": "It was first proposed by Russian chemist Dmitri Mendeleev in 1869.",
        "content_6": "The table has been expanded and modified over time as new elements have been discovered and our understanding of atomic structure has advanced.",
        "content_7": "Today, the periodic table is a fundamental tool in chemistry education and research.",
        "datum_created": "2024-03-26",
        "datum_modified": "2024-03-26"
    }
]

I see no reason why this would be wrong, considering it works with the list.

Using the code snippets you have sent, I cannot reproduce the scenario you described.

When I use your 2nd code snippet (where tensor_fields is a string), it errors out like so, which is expected behavior:

raise MarqoWebError(message=response_msg, code=code, error_type=error_type,
marqo.errors.MarqoWebError: MarqoWebError: MarqoWebError Error message: {"detail":[{"loc":["body","tensorFields"],"msg":"value is not a valid list","type":"type_error.list"}]}
status_code: 422, type: unhandled_error_type, code: unhandled_error, link:

Well, then I’ve got no idea how it happened. Unfortunately, VS Code doesn’t let me Ctrl+Z that far back, so all I can remember is that I called the function inside add_documents and it returned a long string instead of a list…

I see, thanks for this note. As for future uses of tensor_fields, just make sure to use a list. Any other data type will raise an error.
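If you want to catch this before the request is sent, a small guard along these lines could help. This is only a sketch: `validate_tensor_fields` is a hypothetical helper, not something the Marqo client provides.

```python
def validate_tensor_fields(tensor_fields):
    """Return tensor_fields unchanged, or raise TypeError if it is not a list of strings."""
    if not isinstance(tensor_fields, list):
        raise TypeError(
            f"tensor_fields must be a list, got {type(tensor_fields).__name__}"
        )
    if not all(isinstance(name, str) for name in tensor_fields):
        raise TypeError("every entry in tensor_fields must be a string")
    return tensor_fields


fields = validate_tensor_fields(["content_1", "content_2"])  # passes

try:
    validate_tensor_fields('"content_1", "content_2"')  # a str, not a list
except TypeError as exc:
    message = str(exc)
    print(message)

# Intended use with the call from this thread (not executed here):
# mq.index(index).add_documents(documents, tensor_fields=validate_tensor_fields(content))
```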