Questions tagged [bigdata]
2412 questions
0
votes
0
answer
5
Views
How to handle big textual data to create WordCloud?
I have a huge amount of text data for which I need to create a word cloud. I am using a Python library named word_cloud, which is quite configurable, to create it. The problem is that my text data is so large that even a high-end computer cannot complete the task after many hours...
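One possible direction, sketched under assumptions: rather than handing the whole text to the library at once, the word frequencies can be accumulated line by line and only the resulting counts used for drawing. The function name `count_words` and the tokenizing regex are illustrative choices, not part of the question. The word_cloud library's `generate_from_frequencies` method should accept such a mapping, but treat that wiring as an assumption to verify against its documentation.

```python
import re
from collections import Counter

def count_words(lines):
    """Accumulate word frequencies one line at a time (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        # Lowercase and keep runs of letters/apostrophes; adjust for your language.
        counts.update(re.findall(r"[a-z']+", line.lower()))
    return counts

# Small in-memory demo; for a real corpus, pass an open file object instead,
# so only one line is held in memory at a time.
sample = ["Big data is big", "data about data"]
freqs = count_words(sample)
print(freqs.most_common(2))
```

Because only the counter is kept in memory, the cost grows with the vocabulary size rather than with the size of the raw text.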
1
votes
1
answer
57
Views
Why is a boolean field not working in Hive?
I have a column in my Hive table whose datatype is boolean. When I tried to import data from a CSV file, it was stored as NULL.
This is my sample table:
CREATE TABLE IF NOT EXISTS Engineanalysis(
EngineModel String,
EnginePartNo String ,
Location String,
Position String,
InspectionReq boolean)
ROW FORMAT DELI...
0
votes
0
answer
17
Views
Loop takes too long to replace values
I have 9 million lines in which I have to replace every ' . ' with the number from the line above. That is, if column 1 contains ' 7 ', the dot below it should be replaced by 7. But if column 3 contains ' 44 ', the subsequent values have to be replaced by 44, and so on.
Problem: at the moment...
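The replacement described here is a forward fill, which needs only one pass if the last seen value is remembered. A minimal sketch, assuming Python (the question does not state a language) and a single column of string values; `fill_dots` is a hypothetical name:

```python
def fill_dots(values, placeholder="."):
    """Replace each placeholder with the most recent non-placeholder value."""
    last = None
    for value in values:
        if value.strip() == placeholder:
            # Nothing to copy from yet: leave a leading placeholder untouched.
            yield value if last is None else last
        else:
            last = value
            yield value

column = ["7", ".", ".", "44", "."]
print(list(fill_dots(column)))
```

Since the function is a generator, it can be fed lines from a file without loading all 9 million of them at once.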
1
votes
2
answer
65
Views
append multiple columns to existing dataframe in spark
I need to append multiple columns to an existing Spark dataframe, where the column names are given in a List.
Assuming the values for the new columns are constant, for example, the given input columns and dataframe are:
val columnsNames = List("col1", "col2")
val data = Seq(("one", 1), ("two", 2), ("three", 3), ("four", 4...
1
votes
1
answer
73
Views
Create neighborhood list of large dataset / speed up
I want to create a weight matrix based on distance. My code currently looks as follows and works for a smaller sample of the data. However, with the large dataset (569424 individuals in 24077 locations) it doesn't go through. The problem arises at the nb2blocknb function. So my question would...
1
votes
1
answer
693
Views
Time series classification using LSTM - How to approach?
I am working on an experiment with LSTM for time series classification and I have been going through several HOWTOs, but still, I am struggling with some very basic questions:
Is the main idea for training the LSTM to take the same sample from every time series?
E.g. if I have time series A (with sam...
1
votes
2
answer
1.1k
Views
Hive View Not Opening
In the Ambari UI of the Hortonworks sandbox, I was trying to open Hive View through the maria_dev account. However, I got the following error:
Service Hive check failed:
Cannot open a hive connection with connect string
jdbc:hive2://sandbox-hdp.hortonworks.com:2181/;serviceDiscovery...
1
votes
1
answer
505
Views
Large data with pivot table using Pandas
I’m currently using a Postgres database to store survey answers.
The problem I’m facing is that I need to generate a pivot table from the Postgres database.
When the dataset is small, it’s easy to just read the whole dataset and use Pandas to produce the pivot table.
However, my current database now has a...
1
votes
0
answer
22
Views
Formal criterion of availability to implement algorithm using MapReduce paradigm
The computational model supported by MapReduce is expressive enough to compute almost any function on your data.
If I understood correctly, there are algorithms which cannot be implemented using the MapReduce paradigm. Is that correct?
Are there any criteria, formal or otherwise, to discern whether an algori...
1
votes
2
answer
144
Views
Nutch 1.14 deduplication failed
I have integrated Nutch 1.14 with Solr 6.6.0 on CentOS Linux release 7.3.1611. I put about 10 URLs in the seed list, which is at /usr/local/apache-nutch-1.13/urls/seed.txt, and followed the tutorial.
[[email protected] apache-nutch-1.14]# bin/nutch dedup http://ip:8983/solr/
DeduplicationJob: starting...
1
votes
2
answer
84
Views
How to remove rows after a particular observation is seen for the first time
I have a dataset in which every observation has an account number and a 'days past due' value. For every account number, as soon as the 'days past due' column hits a code like 'DLQ3', I want to remove the rest of the rows for that account (even if DLQ3 is the first observation for that account).
My dataset...
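The rule described above can be expressed as a single pass that remembers which accounts have already hit the code. This is only a sketch under assumptions: rows are taken to be `(account, status)` tuples already in chronological order, the row containing the code itself is kept and everything after it is dropped, and `truncate_after_code` is a hypothetical name.

```python
def truncate_after_code(rows, code="DLQ3"):
    """Yield rows per account up to and including the first row with `code`."""
    closed = set()  # accounts for which the code has already been seen
    for account, status in rows:
        if account in closed:
            continue  # drop everything after the first occurrence
        yield account, status
        if status == code:
            closed.add(account)

rows = [("A", "0"), ("A", "DLQ3"), ("A", "30"),
        ("B", "DLQ3"), ("B", "0"), ("C", "10")]
print(list(truncate_after_code(rows)))
```

If the row with the code should be dropped as well, the `yield` and the `if status == code` check can be swapped, with a `continue` after adding the account to `closed`.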
1
votes
0
answer
318
Views
Reading a large PostgreSQL table into Python
I have data as below:
ID  date    Net  Total     Class
11  201706  XN   0.607500  P
53  201709  M9   0.989722  V
68  201709  FM   3.736944  P
61  201701  ZK   1.121388  B
17  201705  F    1.969722  V
This is a huge table (0.5 billion records) in PostgreSQL and I need to pull a subset of it in...
1
votes
2
answer
107
Views
find couple of objects from a dataframe
How can I avoid the two for loops and optimize my code to be able to handle big data?
import pandas as pd
import numpy as np
array = np.array([[1,'aaa','bbb'],[2,'ccc','bbb'],[3,'zzzz','bbb'],[4,'eee','zzzz'],[5,'ccc','bbb'],[6,'zzzz','bbb'],[7,'aaa','bbb']])
df= pd.DataFrame(array)
l=[]
for i in r...
1
votes
1
answer
1.3k
Views
Spark : Job stuck on the last 2 tasks of 100
I am new to Spark and I must support an application that was written by our consultant. I have read and watched tons of material about Spark, but I'm still struggling with the small details of tuning the job correctly.
The scenario:
Java class that contains 5 cleansing rules that we apply on a R...
1
votes
1
answer
70
Views
How to aggregate events for denormalization?
A user clickstream is represented by events with type and event_timestamp properties. For example:
userid type event_timestamp (yyyy-MM-ddThh:mm:ss.SSS)
01 install 2018-01-01T00:00:00.000
01 level_up 2018-01-15T00:00:00.000
01 new_item 2018-02-03T00:00:00.000
All inp...
1
votes
0
answer
214
Views
Caret training options for big data. Is there something like partial_fit in sklearn?
This seems like an obvious question but I couldn't find anything so far. I want to train a random forest but my data set is very big. It has only a few features but about 3 million rows.
If I train with a smaller sample everything works nicely but if I use the whole data set my system runs out of...
1
votes
0
answer
134
Views
(Spark Scheduler) What's the difference between Fair and FIFO inside a spark job pool?
I know that in Spark we can set different pools as Fair or FIFO and that their behavior can differ. However, inside fairscheduler.xml we can also set an individual pool to be Fair or FIFO; I tested this several times and the behavior seems to be the same. Then I took a look at the spark source c...
1
votes
0
answer
54
Views
tJDBCOutput component delay for table insert in Talend 6.4
I am using a one-to-one mapping to load data into an MS SQL Server table.
This mapping is simple and contains only two components: tFileInputDelimited for reading and tJDBCOutput for inserting into the table.
We are using a tJDBCOutput component to load data into the table.
We have created the mapping in...
1
votes
0
answer
33
Views
Why does HBase need NodeManager when it uses coprocessors?
NodeManager is used to start, execute and monitor containers on YARN (containers are assigned to execute MapReduce jobs).
The coprocessor, on the other hand, is a framework that performs distributed computation directly within the HBase server processes.
I have tables in HBase which I query using Phoenix. My...
1
votes
0
answer
68
Views
No command builder registered for name extractHBaseCells
I am trying to index data from HBase to Solr, using HBase Indexer for this.
But I am getting 'No command builder registered for name: extractHBaseCells'. Any suggestions?
My morphline.conf is
morphlines : [{ id : morphline1 importCommands : ['org.kitesdk.morphline.**', 'com.ngdata.**']...
1
votes
0
answer
168
Views
How to set yarn.nodemanager.resource.cpu-vcores (number of virtual cores)
I am a little confused.
What should the yarn.nodemanager.resource.cpu-vcores value be? (based on the picture below ...)
The real total number of CPU cores on the worker machine?
OR
80% of the real total number of CPU cores on the worker machine?
(as some sites recommend)
For now I have set it to 8 (on each worker...
1
votes
0
answer
112
Views
Scalding Execution Monad - What is it & how to use it
I have been working on Big Data technologies using Java-based MR, but my company recently moved to the Scalding framework. I am not able to get my head around the Scalding Execution Monad: what it is and how it works. I cannot find much material about it on the internet either. Would somebody please throw some lig...
1
votes
0
answer
116
Views
Run independent, Parallel or Multithread to increase speed
I have 2 separate collections. Each holds information about hotels around the world, each provided by a different company, but both contain the same information.
Each collection has information like GPS, name, country, city, email, fax and tel.
The problem is that the name (and GPS, info and ...) has changed. I wrote a s...
1
votes
0
answer
273
Views
Does hive.groupby.skewindata depend on hive.optimize.skewjoin?
According to the Hive template:
hive.optimize.skewjoin : Whether to enable skew join optimization. The algorithm is as follows: At runtime, detect the keys with a large skew. Instead of processing those keys, store them temporarily in an HDFS directory. In a follow-up map-reduce job, process those skew...
1
votes
1
answer
883
Views
AWS Glue convert files from JSON to Parquet with same partitions as source table
We are using AWS Glue to convert JSON files stored in our S3 datalake.
Here are the steps that I followed:
Created a crawler to generate a table in Glue from our datalake bucket, which has JSON data.
The newly created tables have partitions as follows,
Name, Year, Month, day, hour
Created a glue...
1
votes
0
answer
112
Views
Elasticsearch Bulk Query Takes 30 minutes, is this normal?
I am storing about 20 million documents every day in an Elasticsearch (6.x) index, with varying parameters for my primary shard and replica counts, ranging from 2 to 5 on both (this is running in a 5-node cluster with fast hardware). Every time I run a batch to pull all the documents from the index i...
1
votes
0
answer
1.3k
Views
localhost: ERROR: Cannot set priority of resourcemanager process 9799
After using this command: hdfs namenode -format
localhost: ERROR: Cannot set priority of resourcemanager process 9799, and similarly for datanode, resourcemanager and nodemanager.
I am not able to remove these errors.
1
votes
0
answer
563
Views
geopandas dataframe to json
I have a geodataframe called SchooolDistrictDf that has more than 19,814,822 rows and looks like the following:
FIPS SrcName crate_code geohash ncessch sLevel schnam shape stAbbrev
0 13820.0 NaN birmingh djfjrrw 010000700091 1 TRACE CROSSINGS ELEM SCH {u'type': u'Point', u'coord...
1
votes
1
answer
195
Views
Keras fit_generator only useful for data augmentation, or also for reading from disk(/network)
Data might not fit in GPU-memory (including activations and gradients), for which one uses mini-batches, and it might not fit in RAM, for which one uses fit_generator. Or at least, the latter is my hypothesis I would like to validate here.
Is it true that Keras applies a producer-consumer strategy t...
1
votes
0
answer
45
Views
Constructing routing matrix based on given dataset with constraints
I am working on a data frame named dbus that describes the fixed-routes of a public bus over certain time periods. A small portion of the data frame is shown below. The meaning of this data frame is pretty simple: column Stop.Name labels all the bus stops a bus would visit, in the same order as show...
1
votes
2
answer
124
Views
Parallel processing or optimization of latent class analysis
I am using the poLCA package to run latent class analysis (LCA) on a dataset with 450,000 observations and 114 variables. As with most latent class analyses, I will need to run this multiple rounds for different numbers of classes. Each run takes about 12-20 hours depending on the number of classes selected....
1
votes
0
answer
45
Views
Start Oozie coordinator job via API with start_time default as current date
I am calling an Oozie coordinator job, and I want start_time to be set to today's date as soon as the coordinator job is called.
1
votes
1
answer
121
Views
Solr search across all nested children of the parent element, Nested Query, Block join.
I am struggling with this question: how can I search across all nested children of a parent element in Solr version 7.2?
I was able to search in a single field, but I found no solution for searching in all fields. I have read all the documentation, but an exact solution does not seem to exist. Can anybody help me with this?
my query is...
1
votes
1
answer
1.6k
Views
Copy files from local to hdfs
I'm trying to copy a file from my local machine into hdfs. I'm using this command
hadoop fs -put Desktop/unsed cubes.txt /user/file
And I'm getting this exception
-put: java.net.UnknownHostException: sandbox.hortonworks
Usage: hadoop fs [generic options] -put [-f] [-p] [-l] ...
I've tried using th...
1
votes
0
answer
23
Views
Resource usage by Spark Receivers
As per Spark Streaming Guide,
A DStream is associated with a single receiver. For attaining read parallelism multiple receivers i.e. multiple DStreams need to be created. A receiver is run within an executor. It occupies one core. Ensure that there are enough cores for processing after receiver slo...
1
votes
2
answer
562
Views
Average over 2000 values with PySpark Dataframe
I have a PySpark dataframe with about a billion rows. I want to average over every 2000 values, like the average of rows with indices 0-1999, the average of rows with indices 2000-3999, and so on. How do I do this? Alternatively, I could also average 10 values for every 2000, like average of rows with indec...
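To make the grouping rule concrete, here is a plain-Python sketch of fixed-size block averaging; it is not PySpark code and only illustrates the logic. `block_means` is a hypothetical name, and the demo uses a block size of 4 instead of 2000 to stay readable. On a Spark dataframe the same idea is usually expressed by grouping on the integer division of a row index by the block size, assuming such an index column exists.

```python
from itertools import islice

def block_means(values, block_size=2000):
    """Yield the mean of each consecutive block of `block_size` values."""
    it = iter(values)
    while True:
        block = list(islice(it, block_size))
        if not block:
            return
        # The final block may be shorter than block_size.
        yield sum(block) / len(block)

print(list(block_means(range(10), block_size=4)))
```

Row `i` belongs to block `i // block_size`, which is the key a distributed group-by would use.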
1
votes
1
answer
125
Views
Console Input for pyspark
Is there an input() function in PySpark through which I can take console input?
If yes, could you please elaborate on it?
How do I write the following code in PySpark:
directory_change = input('Do you want to change your working directory ? (Y/N)')
sc.input('Do you want to change your working directory...
1
votes
0
answer
260
Views
Pros and Cons of using relational Databases (like SQL) for real-time time series analysis?
I know that the relational databases are not scalable to Big-Data. I also know that memcaching is a bit complicated in them. I also know that you may need to unnest and flatten data to prepare it for analysis.
However, I still think that for structured data, for some scenarios, it is better to apply...
1
votes
1
answer
153
Views
Why is there no serving layer for the speed layer in the lambda architecture?
Nathan Marz uses the following picture to explain the lambda architecture:
Lambda Architecture Visualization by Marz
However, on the internet I often find the following architecture, in which the serving layer is not only the next step after the batch layer, but also the streaming layer, i.e.
DZo...
1
votes
1
answer
43
Views
Caching on big data
I have a website with many transaction records, about 2M rows in MySQL.
I often need to erase the data because fetching it keeps getting slower.
Database: MySQL
Lang: PHP 5.4
OS: Ubuntu 16.04
The first user places an order, which is then saved in the database; the user will then be redir...